
Aligning an SSD on Linux
Written by Markus Ewald   
Thursday, December 24 2009 20:38

I've got a small home server with a software RAID-5 for storing my files. It also runs a few virtual machines and acts as a NAT router for internet access. Nothing expensive, just some Frankensteinian patchwork built from old hardware left over when I upgraded my workstation. Nevertheless, I granted it a brand new Intel X25-M SSD last week.

Photo of an Intel X25-M SSD drive, which is a metal box smaller than a CD case

Did I mention that this server is running Gentoo Linux? I thought this would be a good time to do a fresh install and get everything right that might have gone wrong the first time. Besides, installing Linux always is an interesting (and masochistic) experience, especially when your chosen distribution has no installer :)

Because getting my partitions and file systems aligned also proved to be a difficult task, I thought, why not make a small article out of this?

Erase Block Size

SSDs always operate on entire blocks of memory. Before a flash memory cell can be written, it has to be erased, which requires applying a large voltage to the memory cells. That can only happen to an entire block of memory cells at once (probably because this kind of power would affect other cells around the one being erased, at least that's my guess).

Anyway, this means that if you write 1 KB of data to an SSD with an erase block size of 128 KB, the SSD needs to read 127 KB from the target block, erase the block and write the old data plus the new data back into the block. That's something one just has to accept when using an SSD. Modern SSD firmware will do its best to pre-erase blocks when it's idle and try to write new data into these pre-erased blocks (by mapping data to other locations on the drive without the knowledge of the OS.)
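To put numbers on that read-modify-write cycle, here is a minimal shell sketch. The helper name rmw_cost is made up for this illustration, and it only models the worst case, where the drive has no pre-erased block available and has to rewrite the whole block. It also assumes the write size divides the erase block size evenly.

```shell
# Worst-case cost of a small write that lands inside a single erase block.
# Arguments: write size in KiB, erase block size in KiB.
# rmw_cost is a hypothetical helper for illustration, not a real tool.
rmw_cost() {
    write_kib=$1
    erase_block_kib=$2
    preserved=$(( erase_block_kib - write_kib ))      # old data that must be read and kept
    amplification=$(( erase_block_kib / write_kib ))  # flash rewritten per unit of new data
    echo "read ${preserved} KiB, erase and rewrite ${erase_block_kib} KiB, amplification ${amplification}x"
}

rmw_cost 1 128   # the 1 KB example from the text
rmw_cost 4 128   # a typical 4 KiB file system block
```

The first call reproduces the 127 KB read from the example above. The second shows why small random writes are the expensive case.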

Still, watch what happens if a file system just sees the SSD as a brick of memory and writes data at a random position:

A box of cells with a small section highlighted that goes across a cell border

The SSD now has to erase and write two blocks, even though one would have sufficed for the amount of data being written. To fix this, the drive's firmware would have to do data mapping on the byte level, which likely isn't going to happen (in the worst case, you would need more memory for the remapping table than the drive's capacity!)

If the file system's write was aligned to a multiple of the SSD's erase block size, the result would be this:

A box of cells with a small section highlighted that stays inside a single cell
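The two pictures boil down to a bit of integer arithmetic. The following sketch counts how many erase blocks a single write touches. The function name blocks_touched and the 128 KiB block size are assumptions chosen for the example.

```shell
# Count how many erase blocks a write touches.
# Arguments: byte offset of the write, length in bytes, erase block size in bytes.
# blocks_touched is a hypothetical helper for illustration.
blocks_touched() {
    offset=$1
    length=$2
    block=$3
    first=$(( offset / block ))                 # index of the block holding the first byte
    last=$(( (offset + length - 1) / block ))   # index of the block holding the last byte
    echo $(( last - first + 1 ))
}

EB=$(( 128 * 1024 ))                            # assumed erase block size: 128 KiB

blocks_touched $(( EB - 2048 )) 4096 "$EB"      # 4 KiB write straddling a block border
blocks_touched "$EB" 4096 "$EB"                 # the same write, aligned to the border
```

The straddling write touches two blocks and the aligned one touches a single block, which is exactly the difference between the two pictures.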

Thus, it's generally a good idea to make sure your file system's writes are aligned to multiples of your SSD's erase block size. As I found out, this isn't quite as easy as it sounds. The first road block is already encountered when you partition a hard drive:

Partition Alignment

If the partitions of a hard drive aren't aligned to begin at multiples of 128 KiB, 256 KiB or 512 KiB (depending on the SSD used), aligning the file system is useless because everything is skewed by the start offset of the partition. Thus, the first thing you have to take care of is aligning the partitions you create.

A spindle with three discs with a red ring superimposed on each of the discs
A cylinder.

A spindle with three discs with a red pie slice superimposed on each of the discs
A sector.

Traditionally, hard drives were addressed by indicating the cylinder, head and sector at which data was to be read or written. These represented the radial position, the drive head (= platter and side) and the angular position of the data respectively. With LBA (logical block addressing), this is no longer the case. Instead, the entire hard drive is addressed as one continuous stream of data.

Linux' fdisk, however, still uses a virtual C-H-S system where you can define any number of heads and sectors yourself (the cylinders are calculated automatically from the drive's capacity), with partitions always starting and ending at intervals of heads x sectors. Thus, you need to choose a number of heads and sectors whose product (the cylinder size) is a multiple of the SSD's erase block size.

I found two posts which detail this process: Aligning Filesystems to an SSD's Erase Block Size and Partition alignment for OCZ Vertex in Linux. The first one recommends 224 heads and 56 sectors, but I can't quite understand where those numbers come from, so I used the advice from the post on the OCZ forums with 32 heads and 32 sectors, which means fdisk uses a cylinder size of 1024 sectors. And because a sector is 512 bytes, fdisk's unit size (= 512 bytes x heads x sectors) is now 512 KiB, which happens to be the largest of the erase block sizes mentioned above. Nice!
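The geometry arithmetic is easy to check by hand. Here is a small sketch, assuming 512-byte sectors; the helper names cylinder_bytes and cylinder_aligned are invented for this example.

```shell
# Size of one fdisk "cylinder" in bytes for a given virtual geometry,
# assuming 512-byte sectors. Arguments: heads, sectors per track.
cylinder_bytes() {
    heads=$1
    sectors=$2
    echo $(( heads * sectors * 512 ))
}

# Succeeds when the cylinder size is a whole multiple of the erase block size.
# Arguments: heads, sectors per track, erase block size in KiB.
cylinder_aligned() {
    [ $(( $(cylinder_bytes "$1" "$2") % ($3 * 1024) )) -eq 0 ]
}

cylinder_bytes 32 32       # 524288 bytes = 512 KiB
cylinder_bytes 224 56      # 6422528 bytes = 49 x 128 KiB
cylinder_aligned 32 32 512  && echo "32/32 lines up with 512 KiB erase blocks"
cylinder_aligned 224 56 128 && echo "224/56 lines up with 128 KiB erase blocks"
cylinder_aligned 224 56 512 || echo "224/56 does not line up with 512 KiB erase blocks"
```

This may be where the 224/56 recommendation comes from: its cylinder is an exact multiple of 128 KiB, though not of 256 KiB or 512 KiB, while 32/32 works for all three sizes.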

To make fdisk use 32 heads and 32 sectors, remove all partitions from a hard drive and then launch fdisk with the following command line when you create the first partition:

fdisk -S 32 -H 32 /dev/sda

The OCZ post also recommends starting the first partition at the second cylinder (that is, 512 KiB into the drive) because the first partition is otherwise shifted by one track. Don't ask me why :)

Here's how I partitioned my SSD in the end:

Screenshot of a linux console where fdisk reports 32 heads and 32 sectors

For a normal hard drive, I'd probably use 128 heads and 32 sectors now to achieve 4 KiB boundaries for my partitions.
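Once the partitions exist, their alignment can be verified from the start sectors that fdisk -u -l prints. A minimal sketch, assuming 512-byte sectors: the helper name start_aligned is made up, the first three start sectors are those of my layout (see the fdisk listing in the comments), and sector 63 is the traditional DOS-compatible default, added for comparison.

```shell
# Succeeds when a partition's start sector falls on an erase block boundary.
# Arguments: start sector (512-byte units, as printed by "fdisk -u -l"),
#            erase block size in KiB.
# start_aligned is a hypothetical helper for illustration.
start_aligned() {
    start_sector=$1
    erase_block_kib=$2
    sectors_per_block=$(( erase_block_kib * 2 ))   # 1 KiB = two 512-byte sectors
    [ $(( start_sector % sectors_per_block )) -eq 0 ]
}

# Three start sectors from a 32 heads / 32 sectors layout, then sector 63.
for s in 1024 132096 16909312 63; do
    if start_aligned "$s" 512; then
        echo "sector $s: aligned"
    else
        echo "sector $s: NOT aligned"
    fi
done
```

The first three sectors come out aligned to 512 KiB, while the old default of sector 63 is not aligned to any of the erase block sizes discussed here.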

RAID Chunk Size

If you plan on running a software RAID array, I've seen chunk sizes of 64 KiB and 128 KiB being recommended. This can be specified using the --chunk parameter for mdadm, e.g.

mdadm --create /dev/md3 --level=1 --chunk=128 --raid-devices=2 /dev/sda3 /dev/sdb3

Probably the larger chunk size is more useful if you are storing large files on the RAID partition, but I haven't found any advice which included benchmarks or at least a solid explanation yet.

File System Alignment

Now that the partitions have been taken care of, the file systems need to use proper alignment as well. Generally all file systems use some kind of allocation blocks, usually with a size of 4 KiB. But increasing this size to 128 KiB (or even 512 KiB) would waste a lot of space since any file would use up disk space in multiples of that size.

Luckily, Linux file systems can be tweaked a lot. I'm using ext4, where the -E stride,stripe-width parameters control the alignment. The HowTos/Disk Optimization page in the CentOS wiki gives this advice:

The drive calculation works like this: You divide the chunk size by the block size for one spindle/drive only. This gives you your stride size. Then you take the stride size, and multiply it by the number of data-bearing disks in the RAID array. This gives you the stripe width to use when formatting the volume. This can be a little complex, so some examples are listed below.

For example if you have 4 drives in RAID5 and it is using 64K chunks and given a 4K file system block size. The stride size is calculated for the one disk by (chunk size / block size), (64K/4K) which gives 16K. While the stripe width for RAID5 is 1 disk less, so we have 3 data-bearing disks out of the 4 in this RAID5 group, which gives us (number of data-bearing drives * stride size), (3*16K) gives you a stripe width of 48K.

The Linux Kernel RAID wiki offers further insight:

Calculation

  • chunk size = 128kB (set by mdadm cmd, see chunk size advise above)
  • block size = 4kB (recommended for large files, and most of time)
  • stride = chunk / block = 128kB / 4k = 32kB
  • stripe-width = stride * ( (n disks in raid5) - 1 ) = 32kB * ( (3) - 1 ) = 32kB * 2 = 64kB

If the chunk-size is 128 kB, it means, that 128 kB of consecutive data will reside on one disk. If we want to build an ext2 filesystem with 4 kB block-size, we realize that there will be 32 filesystem blocks in one array chunk.

stripe-width=64 is calculated by multiplying the stride=32 value with the number of data disks in the array.

A raid5 with n disks has n-1 data disks, one being reserved for parity. (Note: the mke2fs man page incorrectly states n+1; this is a known bug in the man-page docs that is now fixed.) A raid10 (1+0) with n disks is actually a raid 0 of n/2 raid1 subarrays with 2 disks each.
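Both quoted calculations can be reproduced with a few lines of shell. This is only a sketch: the function names are invented here, and sizes are passed in KiB. Note that the results are counts of file system blocks (which is the unit mkfs expects for these options), even though the quoted examples label them "K" and "kB".

```shell
# stride = RAID chunk size / file system block size, in file system blocks.
# Arguments: chunk size in KiB, block size in KiB.
stride() {
    echo $(( $1 / $2 ))
}

# stripe-width = stride x number of data-bearing disks, in file system blocks.
# Arguments: chunk size in KiB, block size in KiB, number of data disks.
stripe_width() {
    echo $(( $(stride "$1" "$2") * $3 ))
}

# A RAID-5 with n disks has n - 1 data-bearing disks.
raid5_data_disks() {
    echo $(( $1 - 1 ))
}

# Kernel wiki example: 128 KiB chunks, 4 KiB blocks, 3-disk RAID-5
stride 128 4                                  # 32
stripe_width 128 4 "$(raid5_data_disks 3)"    # 64

# CentOS wiki example: 64 KiB chunks, 4 KiB blocks, 4-disk RAID-5
stride 64 4                                   # 16
stripe_width 64 4 "$(raid5_data_disks 4)"     # 48
```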

So these are the stride and stripe-width parameters I'd use:

  • Intel SSDs with an erase block size of 128 KiB (or 512 KiB -- Intel isn't quite straightforward with this, see the comments section for a discussion on the subject - if anyone from Intel is reading this, help us out! ;-)) that are not part of a software RAID:
    -E stride=32,stripe-width=32

  • OCZ Vertex SSDs with an erase block size of 512 KiB that are not part of a software RAID:
    -E stride=128,stripe-width=128

  • Normal hard drives that are not part of a software RAID
    trust the defaults

  • Any software RAID:
    -E stride=raid chunk size / file system block size,stripe-width=stride x number of data-bearing disks

Thus, I set up the file systems on the Intel SSD like this:

mkfs.ext4 -b 1024 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1
mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sda3

mkfs.ext4 defaulted to 1024 byte allocation units on my boot partition, so I raised the stride to 128 blocks (128 KiB / 1 KiB) according to the advice from the CentOS wiki. The alignment of my boot partition is probably not of any relevance because the system will read maybe 10 files from it and not modify anything, but I wanted to stay consistent :)
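For a single SSD outside of any RAID, the values above are just the erase block size divided by the file system block size. Here is a small sketch that prints the matching -E argument. The helper name ext4_ssd_opts is made up, and the erase block sizes are the ones assumed in this article, not values read from the drive.

```shell
# Print the -E argument for mkfs.ext4 on a single, non-RAID SSD.
# Arguments: erase block size in KiB, file system block size in bytes.
# ext4_ssd_opts is a hypothetical helper for illustration.
ext4_ssd_opts() {
    erase_block_kib=$1
    fs_block_bytes=$2
    blocks=$(( erase_block_kib * 1024 / fs_block_bytes ))   # file system blocks per erase block
    echo "stride=${blocks},stripe-width=${blocks}"
}

ext4_ssd_opts 128 1024   # boot partition: 128 KiB erase block, 1 KiB blocks
ext4_ssd_opts 128 4096   # root partition: 128 KiB erase block, 4 KiB blocks
ext4_ssd_opts 512 4096   # a drive with 512 KiB erase blocks, 4 KiB blocks
```

The first two lines of output match the two mkfs.ext4 commands above. After formatting, running tune2fs -l on the partition should list the stored values as "RAID stride" and "RAID stripe width", which is a quick way to confirm the options were picked up.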

 

Comments  

 
+2 #1 adbge 2009-12-27 18:04
Beautifully written article -- very helpful.
 
 
0 #2 trx 2010-03-24 16:04
great text!

Quote:
The OCZ post also recommends starting at the second 512-cylinder unit because the first partition is otherwise shifted by one track. Don't ask me why :)


Because the first sector (512bytes) is Master Boot Record and cannot be part of the first partition.
 
 
0 #3 trx 2010-03-25 02:47
btw, can you please copy-paste output of print command in your fdisk after starting it with:
fdisk /dev/sda -u -c
 
 
+1 #4 Cygon 2010-04-01 12:03
Ah, so that's the reason. Thanks!

I don't know what -c does (and neither does my fdisk ;-)), but here's fdisk -u -l /dev/sda for you:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        1024      132095       65536   83  Linux
/dev/sda2          132096    16909311     8388608   82  Linux swap / Solaris
/dev/sda3        16909312   151127039    67108864   83  Linux
 
 
0 #5 trx 2010-04-01 17:10
well, @fdisk
Usage:
fdisk [options] change partition table
fdisk [options] -l list partition table(s)
fdisk -s give partition size(s) in blocks

Options:
-b sector size (512, 1024, 2048 or 4096)
-c switch off DOS-compatible mode
-h print help
-u give sizes in sectors instead of cylinders
-v print version
-C specify the number of cylinders
-H specify the number of heads
-S specify the number of sectors per track


and, AFAIK, Intel based SSDs use 128 blocks of 4KB as erase block size, so it should be 512kB, just as OCZ:
http://www.anandtech.com/show/2614/3
http://www.xbitlabs.com/articles/storage/display/intel-x25m-ssd_2.html

please, correct me if I'm wrong.
 
 
0 #6 Cygon 2010-04-20 14:39
My fdisk doesn't support '-c' - a quick google for some man pages doesn't reveal any such parameter either (eg. linux.die.net/man/8/fdisk). No idea why - running busybox 1.15.3 from January 27, 2010 - I guess it must have been removed.

I'm quite sure that the erase block size on Intel drives is 128 KB, not 512 KB. It's one of the features used by Intel to market their SSDs as superior to others (check this: techreport.com/articles.x/15433) - they say the smaller erase block size reduces write overhead, thereby increasing the drive's longevity.

I believe the first article you linked got it a bit wrong. The graph clearly says 4 KB write requires a 128 KB erase (= write amplification of 32), but then the text from the article incorrectly states that 16 KB write would require a 512 KB erase, assuming the write amplification to be a static property of the drive. The write amplification in this case would be 8. And for a 128 KB write, it would be 1.
 
 
0 #7 trx 2010-04-30 00:58
I've asked on a few forums, even on Intel's community forum, contacted Intel support, but got no clear answer.

This is closest so far:
http://forums.anandtech.com/showthread.php?t=2069082
 
 
0 #8 Samat Jain 2010-05-16 09:18
Quote:
Any software RAID:
--stride=raid chunk size
--stripe-width=raid chunk size x number of data bearing disks


This is wrong, or at least misleading from what you said earlier.

Chunk size is typically reported in bytes (or kilobytes). For example, a 64 KiB chunk size.

stride is the number of blocks in a chunk size. For a block size of 4 KiB, this is 64/4 = 16.

Likewise for stripe-width, it is the number of blocks in a stripe width. This is 16*number of data disks.

I've been trying to add correct information to the Linux RAID wiki: https://raid.wiki.kernel.org/index.php/RAID_setup which should be considered the authoritative source on the subject.
 
 
0 #9 clesch 2010-05-16 10:51
Great article, but I think I noticed an error:

mkfs.ext4 -b 4096 -E stride=32 -E stripe-width=32 /dev/sda3

will give you 0 blocks stride with 32 stripe width as can be observed by the output on screen after entering the command.

The correct command I needed to enter in order to get 32 stride with 32 stripe width is:

mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sda3
 
 
0 #10 Cygon 2010-05-17 16:03
@trx Thanks! I've dropped Intel an email, too. Hopefully one of us will get a straight answer eventually :)

@Samat Jain: Wow, I didn't even know the kernel team had a wiki for the raid modules. Thanks for the link. I fixed the formula and also quoted the relevant section from the kernel raid wiki.

@clesch: Whoops. That's unexpected. I changed the commands to read the way you're using (and the kernel raid wiki does, too). Thanks for the clarification!
 
