Andy Henson, 11 March 2004 (modified 15 March 2004)
Normally most RAID arrays require disks to be of the same size or they waste the extra space. But in systems upgraded over time or with disks being replaced when they fail, it is likely that drives of different sizes will be used in many systems. This document gives a method for efficiently providing RAID 5 features with disks of different sizes.
RAID 5 stripes the data over a number of disks (N-1), with the final disk being a checksum disk to allow recovery if any one of the disks fail. RAID 5 differs from RAID 4 in that the checksum appears on a different disk each row of stripes, thus distributing the normal data over N disks, not N-1. This increases read speed, without affecting write speed.
For example, if 4 disks are used, A, B and C are segments of data, and D is the checksum.
|
A |
B |
C |
D |
The checksum is computed as:
D = A xor B xor C.
The next row of stripes has the data and checksum moved around:
|
D |
A |
B |
C |
Let us keep our example of 4 disks, but we will split the data into fragments. I've labeled the disks hda, hdb, hdc, hdd for clarity, so they are not confused with the data, which is now in 16 smaller units which I'll call blocks.
|
A1 |
A2 |
A3 |
A4 |
B1 |
B2 |
B3 |
B4 |
C1 |
C2 |
C3 |
C4 |
D1 |
D2 |
D3 |
D4 |
|
hda |
hdb |
hdc |
hdd |
||||||||||||
Checksum D is computed as before, block by block:
D1 = A1 xor B1 xor C1
D2 = A2 xor B2 xor C2
etc.
Since we now have smaller sizes, we can use different disk configuations. Consider 2 x 120GB disks and 2 x 200GB disks. We can allocate 3 blocks to the 120GB and 5 to the 200GB disks in proportion to their size:
|
A1 |
A2 |
A3 |
A4 |
B1 |
B2 |
B3 |
B4 |
C1 |
C2 |
C3 |
C4 |
D1 |
D2 |
D3 |
D4 |
|
hda |
hdb |
hdc |
hdd |
||||||||||||
It looks easy, but it doesn't work. If hdd fails, we loose not only the checksum D but also C4, which we cannot recreate from just A4 and B4. Similarly loosing hdc means we loose both B3 and C3, and we cannot recreate either with just A3 and D3. So we need a bigger checksum area, indeed we need one big enough for the largest disk in use, because if (in our example) 5 of our blocks are lost we need 5 redundant ones to be able to recreate them.
So we'll need 5 checksum blocks, D1 to D5, instead of 4. We can then have 5 A and 5 B blocks as well, but we only need one C1 since we have a total of 16 blocks:
|
A1 |
A2 |
A3 |
A4 |
A5 |
B1 |
B2 |
B3 |
B4 |
B5 |
C1 |
D1 |
D2 |
D3 |
D4 |
D5 |
|
hda |
hdb |
hdc |
hdd |
||||||||||||
D1 = A1 xor B1 xor C1 (as before)
D2 = A2 xor B2 (since there's no C2)
This will work, because a lost disk will loose only one in each XOR line.
Note that we have 11 data blocks in this table; 5 are checksum blocks.
The next row of blocks needs to be rotated relative to this to maximise read throughput. How much we rotate is fairly arbitrary, simple RAID 5 rotated by one whole disk, but since our disks are different sizes it's not clear what the equivalent would be. We could rotate by a whole large disk (5 blocks), or by just one block. For this example I'll rotate by 4 blocks, the same as the examples above before:
|
D2 |
D3 |
D4 |
D5 |
A1 |
A2 |
A3 |
A4 |
A5 |
B1 |
B2 |
B3 |
B4 |
B5 |
C1 |
D1 |
|
hda |
hdb |
hdc |
hdd |
||||||||||||
Loss of hda is fine, it only holds checksums. Loss of hdb or hdc would be ok, we can recreate the data.
But what if we loose hdd now? If so, we loose C1 and D1, and D1 = A1 xor B1 xor C1, so we cannot recreate C1 from just A1 and B1. This is bad. What went wrong?
The only really bad thing is loosing two blocks of the same number at the same time (ie, on the same disk). In the first example, the C1 / D1 boundary was on a disk boundary so it was ok. All the other groups (A, B, D) had at least 5 blocks so there is no danger of them causing this problem since the same-numbered blocks are so far apart that they would not fit on the same disk. It's only the C / D boundary which is a problem because C is the short set of data blocks.
Is seems we numbered it wrongly. If find it following B3, B4, B5 and before D1 we should label it C2, and then it forms part of the D2 checksum which is on a different disk. In the general case we may have more than one C block, but the rule is simply: number the C block (or blocks) to avoid conflict with other blocks on the same disk.
Up till now I've been holding a close analogy with RAID 5 in my diagrams, rather too close perhaps. In order to get good disk random read (and write) performance, blocks should be large (so seek time is small compared to read/write time) yet in order to get good RAID write performance we need to write a whole row of blocks to disk quickly with minimum read or write of unnecessary data.
So the table needs reorganising a little:
This is the version shown above, with 16 blocks in a row:
|
A1 |
A2 |
A3 |
A4 |
A5 |
B1 |
B2 |
B3 |
B4 |
B5 |
C1 |
D1 |
D2 |
D3 |
D4 |
D5 |
|
hda |
hdb |
hdc |
hdd |
||||||||||||
This is the newly reorganised table:
|
A1 |
B1 |
C1 |
D1 |
|
D2 |
|
A2 |
B2 |
|
|
B3 |
D3 |
A3 |
|
A4 |
|
B4 |
D4 |
|
|
D5 |
B5 |
A5 |
|
hda |
hdb |
hdc |
hdd |
This is the same as before, just reorganised to put each XOR row in each row of the table. Assuming we need to keep a reasonable block size this just reduces the size of each row to a number of adjacent locations which are easier to maintain and manange when computing the checksum or reversing the process to recover the data.
We still have 11 data blocks and 5 checksum blocks in the table. The gaps in the table are to allow for hda and hdb being smaller than the bigger disks, just as the other table format had less space for hda and hdb.
This mapping could be precalculated into a table whose input is one of the 11 real-data blocks,and this table looked up when a mapping is needed to a checksum block and a disk (actually our example table maps the wrong way, from disk to data or checksum). Alternatively if the 11-block units are undesirable, a bigger mapping table could be constructed from a uniform size irrespective of the number and size of disks (eg, 256 blocks) which would still be very small (about 512 bytes for the table) and could be written to the front of every disk.
One possibility is the case of adding or removing disks from a live system – not just replacing existing ones which fail, but changing the number and size of the disks.
When such a change occurs it is possible to update the mapping table yet retain a way of knowing if a row of the table uses the old or new table. A recovery pass (like that necessary if a disk has been removed faulty or replaced) can move the data from the old into the new layout, meanwhile reads and writes can continue unchanged, reading from the old format where necessary.
An issue arises if more or less data space is available: this cannot be resolved by the RAID layer alone. Interaction with a filing system is necessary in this case: this is beyond the scope of this paper. Tools to perform similar tasks (such as Logical Volume Management) are now available even on low end systems, but the ultimate goal is for the system to recognise and make the extra space available without user intervention.
Written by Andy Henson. Copyright © 2004 by Zexia Access Ltd
If you want to contact me regarding this, please email <raid8@zexia.co.uk>
Please feel free to link to this document.
None.