Update 28-03-2013: I wrote a follow-up for this post describing the same thing for Linux (Red Hat / CentOS / OEL) version 6. For that, you might want to jump straight to the new post as this one is getting a bit outdated 😉
As you might know, if disk partitions containing Oracle datafiles are not aligned with the underlying storage system, then some I/Os suffer extra overhead because each of them is effectively translated into two I/Os.
If you want more info, google for “EMC disk alignment” and you’ll find plenty of information explaining the issue.
One example is http://www.vmware.com/pdf/esx3_partition_align.pdf for VMware ESX version 3.x.
In short: if you create partitions on Intel-based operating systems, then by default the first partition starts at an offset of 63 x 512-byte blocks (32256 bytes), which does not match typical SAN storage systems that use 4K or 8K disk chunks. A write to a block that crosses a chunk boundary causes two writes (plus some partial reads) in the disk back end (and on the remote copy if you use remote storage mirroring) and sometimes causes an extra cache slot to be allocated. The performance improvement from correcting the alignment can be between 5 and 15%, depending on workload and other configuration settings.
Recent Linux distributions sometimes align partitions correctly by default. If that is the case, verify that the partition really is aligned (see the end of this article) and you probably don’t have to change anything.
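To make the boundary-crossing effect concrete, here is a minimal sketch that counts how many back-end chunks a single I/O touches. The helper name chunks_touched and the 8192-byte chunk size are assumptions for illustration only; real arrays differ in chunk and cache slot sizes.

```shell
#!/bin/sh
# chunks_touched START_SECTOR IO_OFFSET IO_SIZE CHUNK_SIZE
# (hypothetical helper, not part of any tool)
#   START_SECTOR: partition start in 512-byte sectors
#   IO_OFFSET:    byte offset of the I/O inside the partition
#   IO_SIZE:      size of the I/O in bytes
#   CHUNK_SIZE:   back-end chunk size in bytes (assumed value)
chunks_touched() {
    first_byte=$(( $1 * 512 + $2 ))
    last_byte=$(( first_byte + $3 - 1 ))
    # Number of chunks spanned = index of last chunk - index of first chunk + 1
    echo $(( last_byte / $4 - first_byte / $4 + 1 ))
}

chunks_touched 63 0 8192 8192     # partition starting at sector 63: prints 2
chunks_touched 128 0 8192 8192    # partition starting at sector 128: prints 1
```

With a start at sector 63, an 8K database block covers bytes 32256 to 40447 on the disk and therefore straddles two 8K chunks. With a start at sector 128 the same block fits exactly in one chunk.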
Now, the way most documentation explains how to resolve this in Linux is, in my opinion, too complex: you need to start “fdisk” manually, go into expert mode, change the starting block, and so on. Not nice if you have to configure a few hundred Oracle ASM disks at once.
There is an easier way.
Here goes…
(assuming you have a completely empty disk and you only want to create exactly one aligned partition, e.g. for Oracle ASM)
- Check if your Linux system has the command “sfdisk”. I bet most Linux systems will have it installed by default.
- Make sure you know the Linux device name of the disk (such as /dev/sdk)
- Enter the command:
echo "128,," | sfdisk -uS /dev/sdk
Note the command will fail if there is already a partition (so it’s reasonably safe). This is what the output looks like on my system:
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdk: 1044 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot     Start        End   #sectors  Id  System
/dev/sdk1              0          -          0   0  Empty
/dev/sdk2              0          -          0   0  Empty
/dev/sdk3              0          -          0   0  Empty
/dev/sdk4              0          -          0   0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot     Start        End   #sectors  Id  System
/dev/sdk1            128   16771859   16771732  83  Linux
/dev/sdk2              0          -          0   0  Empty
/dev/sdk3              0          -          0   0  Empty
/dev/sdk4              0          -          0   0  Empty
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
Explanation:
sfdisk will read from “stdin” any commands it has to perform. To work around having to enter everything manually by ourselves we use “echo” to feed the commands directly into sfdisk. From the man page of sfdisk we can find out how sfdisk accepts commands:
sfdisk reads lines of the form <start> <size> <id> <bootable> <c,h,s> <c,h,s>
Using the -uS option, we tell sfdisk to express start and size in sectors (of 512 bytes each) instead of cylinders or anything else.
As we want to use the full size of the disk, we leave the size field empty and let sfdisk figure it out. The id will be the default (Linux partition). If you want something else, read the man page and you’ll find it. We also ignore the bootable and cylinders/heads/sectors parameters (they are optional).
The partition will start exactly at a 64KB offset (128 sectors x 512 bytes = 65536 bytes, or 8 chunks of 8K, which fits nicely with either EMC CLARiiON or EMC Symmetrix).
Sometimes you might want another alignment value. A common choice is one megabyte (2048 sectors). The command would then be:
echo "2048,," | sfdisk -uS /dev/sdk
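If you are unsure whether a given start sector lands on the boundary you want, the arithmetic is easy to script. The following is a small sketch, not part of sfdisk: the function name is_aligned is made up for this example, and the default boundary of 65536 bytes (64K) is simply the value used in this article.

```shell
#!/bin/sh
# is_aligned START_SECTOR [BOUNDARY_BYTES]
# Converts a start sector (512-byte units) to a byte offset and checks
# whether that offset is a multiple of the boundary (default: 64K).
is_aligned() {
    boundary=${2:-65536}
    offset=$(( $1 * 512 ))
    if [ $(( offset % boundary )) -eq 0 ]; then
        echo "sector $1 = $offset bytes: aligned to $boundary"
    else
        echo "sector $1 = $offset bytes: NOT aligned to $boundary"
        return 1
    fi
}

is_aligned 128              # 64K offset against a 64K boundary
is_aligned 2048 1048576     # 1 MB offset against a 1 MB boundary
is_aligned 63 || true       # the old DOS default, reported as not aligned
```

The function returns a non-zero exit status for a misaligned start, so it can also be used as a guard in a larger script.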
To verify disk alignment:
sfdisk -uS -l <disk>
Example:
Here is the partition overview of my small Oracle RAC cluster.
[root@oradb1 ~]# listasm
#dev       scsi  lun  ASMVol  SizeMB
/dev/sda   0     0    -       101
/dev/sdb   1     0    -       9
/dev/sdc   1     1    ASM1    8189
/dev/sdd   1     2    ASM2    8189
/dev/sde   1     3    -       1019
/dev/sdf   1     4    -       1019
sda is the boot disk, sdb contains Oracle binaries, sdc/sdd are ASM volumes and sde/sdf are cluster resources / voting disks.
Let’s look at the boot volume.
[root@oradb1 ~]# sfdisk -uS -l /dev/sda

Disk /dev/sda: 2088 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot     Start        End   #sectors  Id  System
/dev/sda1   *         63     208844     208782  83  Linux
/dev/sda2         208845   33543719   33334875  8e  Linux LVM
/dev/sda3              0          -          0   0  Empty
/dev/sda4              0          -          0   0  Empty
You can see that sda1 is misaligned at 63 sectors. I don’t really care, as the boot (OS) disk in Linux will not cause much I/O anyway. The LVM volume is also misaligned, at 208845 sectors. I only keep OS stuff in there, so I don’t care.
Now let’s check the ASM disks.
[root@oradb1 ~]# sfdisk -uS -l /dev/sdc

Disk /dev/sdc: 1044 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot     Start        End   #sectors  Id  System
/dev/sdc1            128   16771859   16771732  83  Linux
/dev/sdc2              0          -          0   0  Empty
/dev/sdc3              0          -          0   0  Empty
/dev/sdc4              0          -          0   0  Empty
Nicely aligned at 64K (128 sectors) !
Let’s take a look at another Linux server that I installed with Ubuntu Server 10.10 recently.
root@silverstone:~# sfdisk -uS -l /dev/sda

Disk /dev/sda: 48641 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Units = sectors of 512 bytes, counting from 0

   Device Boot     Start        End   #sectors  Id  System
/dev/sda1   *       2048     499711     497664  83  Linux
/dev/sda2         501758  781422591  780920834   5  Extended
/dev/sda3              0          -          0   0  Empty
/dev/sda4              0          -          0   0  Empty
/dev/sda5         501760  781422591  780920832  8e  Linux LVM
You can see that on this system, even the boot volume is aligned at 1 Megabyte (2048 sectors). So some modern Linux distros will remove the burden of doing this yourself.
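Eyeballing the start column gets tedious with many disks. Below is a rough sketch of a filter that reads “sfdisk -uS -l” output on stdin and flags each partition. It assumes the column layout shown in the listings above (device, optional “*” boot flag, start, end, number of sectors); the name check_alignment and the default boundary of 128 sectors are my own choices for this example.

```shell
#!/bin/sh
# check_alignment [BOUNDARY_SECTORS]
# Reads "sfdisk -uS -l" output on stdin and prints one line per used
# partition: device, start sector and "aligned" or "MISALIGNED".
# Assumes the column layout shown in this article.
check_alignment() {
    awk -v boundary="${1:-128}" '
        /^\/dev\// {
            i = 2
            if ($i == "*") i++          # skip the boot flag column
            start = $i
            sectors = $(i + 2)
            if (sectors == 0) next      # unused ("Empty") slot
            status = (start % boundary == 0) ? "aligned" : "MISALIGNED"
            print $1, start, status
        }'
}

# Demonstration on two lines taken from the /dev/sdc listing above;
# this prints: /dev/sdc1 128 aligned
check_alignment <<'EOF'
/dev/sdc1 128 16771859 16771732 83 Linux
/dev/sdc2 0 - 0 0 Empty
EOF
```

On a real system you would pipe the command into it, for example “sfdisk -uS -l /dev/sdc | check_alignment”. Note that an extended partition container (like sda2 on the Ubuntu box) may be reported as misaligned even though the logical partition inside it is fine.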
Let’s see what happens if I accidentally try to overwrite an existing partition.
[root@oradb1 ~]# echo "128,," | sfdisk -uS /dev/sdc
Checking that no-one is using this disk right now ...
BLKRRPART: Device or resource busy

This disk is currently in use - repartitioning is probably a bad idea.
Umount all file systems, and swapoff all swap partitions on this disk.
Use the --no-reread flag to suppress this check.
Use the --force flag to overrule all checks.
For those geeks who still think this is not enough, here is the real proof.
dd if=/dev/sdc bs=512 count=130 | xxd -c 32
000ffc0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ................................
000ffe0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ................................
0010000: 0182 0101 0000 0000 0000 0080 bc9c b1ab 0000 0000 0000 0000 0000 0000 0000 0000 ................................
0010020: 4f52 434c 4449 534b 4153 4d31 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ORCLDISKASM1....................
0010040: 0000 100a 0000 0103 4441 5441 5f30 3030 3000 0000 0000 0000 0000 0000 0000 0000 ........DATA_0000...............
0010060: 0000 0000 0000 0000 4441 5441 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ........DATA....................
0010080: 0000 0000 0000 0000 4441 5441 5f30 3030 3000 0000 0000 0000 0000 0000 0000 0000 ........DATA_0000...............
You can see that the ASM volume starts at offset 0x10000 which equals 65536.
Hope this makes your life a bit easier! Needless to say that you can put the given commands in a simple script to make it even easier 🙂
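As a starting point for such a script, here is a cautious sketch. It only wraps the one-liner from this post in a loop. The function name align_disks, the APPLY switch and the device names are made up for this example, and by default it runs dry and only prints what it would do.

```shell
#!/bin/sh
# align_disks START_SECTOR DEVICE...
# Dry run by default: prints the sfdisk command for each device.
# Set APPLY=1 in the environment to really write the partition tables.
align_disks() {
    start=$1
    shift
    for disk in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            # Same one-liner as above; stop at the first failure
            echo "${start},," | sfdisk -uS "$disk" || return 1
        else
            echo "echo \"${start},,\" | sfdisk -uS $disk"
        fi
    done
}

# Hypothetical device names; replace them with your own EMPTY disks.
align_disks 128 /dev/sdk /dev/sdl /dev/sdm
```

Review the dry-run output first, then run it again with APPLY=1. Because the loop does not pass any force option, sfdisk should still refuse to touch a disk that is in use, as shown earlier in this post.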
Update 1
My colleague Erik Zandboer has an excellent explanation of the alignment problem on his blog. You can find it here and here. Or search for keyword “alignment” on his site: http://www.vmdamentals.com/?tag=alignment
Also, I found that the “sfdisk” command shows weird behavior in CentOS 6.0 (probably also in Red Hat version 6). You might have to use the “--force” option to make it work in those Linux distributions. The drawback is that this option does not prevent overwriting existing partitions. Be careful! (Or write a script to prevent mistakes.)
Nice post, thanks! Of all the different methods, this is the easiest I’ve seen.
You’re welcome!
Wrapping it up in a script to align a dozen volumes at once should not be too hard, either. Try that with the standard “fdisk” method 😉
Thanks, that was helpful
Do you know where those seemingly constant numbers come from?
I mean 1 Megabyte (2048 sectors), 1 block (1024 bytes), …
“sfdisk -g” only returns the number of cylinders which is not of any help
Sure!
When magnetic disks were invented in the 1950’s, they standardized on 512-byte block sizes. Only much later did some disk vendors start developing disks with larger block sizes (e.g. 1024- or 4096-byte blocks).
As disk blocks are (more or less) standardized on 512 bytes, the direct result is that 1 megabyte equals 2048 sectors, etc.
The reason for EMC to recommend 64K as starting offset is that all our storage systems work nicely with that value, including the state-of-the-art V-max (with cache slots of 64K).
Oracle sometimes recommends 1 MB which is also fine (as it is a multiple of 64K).
I guess the weird 63-block default offset for the first partition on an Intel architecture originated in the MS-DOS days (maybe even before) when many (IDE) disks had 63 sectors per track. In those days, starting the first partition at the 63-block offset was beautifully aligned. Soon after, when larger disks evolved, the real geometry of drives could no longer be represented with the old CHS (Cylinders-Heads-Sectors) method and they had to translate (emulate) weird formats (think of it as some sort of virtualization 😉 ).
Misalignment on plain (simple) disks (we call this JBOD – just a bunch of disks) was never a big problem as I/O was 512-byte anyway. But intelligent disk arrays (like EMC’s) are optimized for much larger I/O and use 4096 or even 8192 bytes per block internally, to make better use of cache and other internal resources. The drawback is that misaligned partitions (Intel platforms) cause a noticeable performance drop. The issue was never on high-end UNIX (AIX, SPARC Solaris, HP-UX) because they use different partitioning methods.
Dude, the reason is that 64k aligns perfect with the track size. Nothing to do with the cache slot. As you prefetch, the subsystem with prefetch at a multiple of 4K. As 64K is the track size and it is also the element size in all RAID systems, then you optimize the efficiency of the IO.
Dude…
I wonder who told you this (I certainly hope it’s not one of my EMC colleagues)… I am talking about EMC (Symmetrix or CLARiiON) storage systems. Maybe you are correct for simple RAID controllers or JBOD (just a bunch of disks).
For EMC:
– Cache slot sizes DO matter. If you have misaligned I/O then some of them will cause two I/O’s in the backend and if the two cross a cache slot boundary they will require two cache slots.
– Subsystems will prefetch with 8K increments in more recent Symmetrix systems. Actually it’s dynamic so you will also see larger prefetch sizes (always multiples of the disk block size). With modern architectures that can easily handle larger blocks, using legacy 4K size is causing more overhead in memory management. Going to 16K would be even more efficient if it wasn’t that many databases and filesystems still use 8K block sizes.
I wonder what you mean by “element size”. Seems weird that (according to you) with many different RAID architectures they all – without exception – have the same element size (whatever that means). Maybe you mean “stripe size” (sometimes called “stripe element size”) but that is different for many storage systems. For example, EMC VMAX uses 256K.
But feel free to ignore all this and do things differently 🙂
Hi Bart,
Can you kindly respond for couple of questions:
1) Is it necessary to use partition alignment also for DM multipath Linux native devices?;
2) does offset correspond to ASM AU size?
Thank you
Hi Alexei,
>> Is it necessary to use partition alignment also for DM multipath Linux native devices?
Yes. Multipath does not change the layout of partitions. Unless you don’t use (linux) partitions and directly provide linux volumes to ASM – so the ASM volume is something like /dev/sdk (full disk) instead of /dev/sdk1 (first partition).
The rule is: If you use old-style Intel (PC) partitioning (the one using 4 primary partitions where 1 of them can be an extended partition holding more logical partitions) then you need to be aware of alignment. If you use another partition method (such as non-Intel or the newer GPT partitioning) then there are no issues.
Beware of VMware VMFS by the way… You could have alignment issues on the VMFS partition but none on the VMDK virtual LUN presented to the virtual machine. You need to make sure *both* are aligned.
>> does offset correspond to ASM AU size?
No. The default Oracle AU size is 1MB but I/O’s to an ASM disk group will be much smaller. As long as the Oracle blocks (typically 8K by default) are aligned then you’re fine.
For current EMC storage, any offset that is a multiple of 8K is fine. So the very first byte of the partition may be at 64K, 96K, 128K, 1024K (1MB) or whatever. But (techie) people tend to use powers of 2, so that’s why it will typically be something like 64K, 128K or 1024K.
Hope this helps, let me know if you have more questions!
Thank you very much
Alexei
Can you please clarify your response previous comment ?
If a whole, unpartitioned volume e.g. /dev/sdk is presented to the Linux kernel’s DM Multipath or LVM is alignment still an issue ?
i.e. do DM multipath or LVM somehow offset the beginning of the data by an odd number of sectors/segments that is not readily obvious or are all sectors/segments used from the beginning of the volume and alignment is not a problem ?
Unless you want multiple partitions per disk, is there any value in partitioning the disk at all ? Using a whole volume would seem to save the trouble of deciding how much to offset by.
Hi Hans,
Linux kernel multipath and LVM do not change anything. They neither introduce a new alignment offset nor get rid of an existing one.
The multipath feature gives you multiple paths to the same (partitioned or not) volume.
LVM is one level higher in the stack and works with whatever you give it as an LVM physical volume (e.g. a whole disk à la /dev/sdk, or a partition à la /dev/sdk1). If the physical volume is misaligned, then every LV in the volume group (using the same PV) will be misaligned...
If you present an empty unpartitioned disk (/dev/sdk) to the kernel, then there is not (yet) an alignment issue. The issue is introduced by ONE factor only, and that is the (legacy) PC (MSDOS) partitioning method. If you have no partitioning at all (e.g. you create an LVM physical volume directly on the whole disk), then you don’t have to worry. The same goes if you use alternative partitioning (e.g. GPT) or an alignment-aware partitioner (e.g. the most recent Ubuntu Linux distro).
To be more clear: Every time you encounter a “DOS” partitioning style (the one having max. 4 primary partitions of which one can be an extended partitions holding more logical partitions) you should be cautious.
Hope this helps clearing things up!
Great explanations ! Thanks 🙂
Hi, quick question, I have the following scenario
vmware 4.1 / netapp nfs / rhel 5.5 vm
Inside the vm I have 10 ms-dos partitions, no lvm.
I have created the first one /boot, aligned, from sector #64, do I need to align the 9 remaining ones ? to have every partition starting and ending with a /64 number of sectors ? thanks
Normally if the first partition is aligned, then additional partitions on the same disk will be aligned as well. But you can always check with the sfdisk -uS -l command.
thanks
Quick question: We are using a Symmetrix VMAX 20K with virtual provisioning on RAID5 TDATs and using 4 and 8 member Striped METAs allocated to Linux RHEL v6. We are experiencing high write response times (above 64ms in SPA) from the array during heavy random write loads. We see high queuing at the FA ports, but are still getting 100% write hit to cache. No device pending event. Is an offset of 2048 aligned?
Hi Aaron,
Yep, 2048 blocks of 512 bytes equals exactly 1 megabyte (1048576 bytes). This is divisible by 8192 (128 x 8K), so you’re good. By the way, I noticed that RHEL aligns correctly by default since version 6. The high response time must be caused by something else. Or maybe you’re just pushing too much I/O for the number of FA ports to handle (in that case, go wider across more FA ports). Write response time should be closer to 1-5 ms if you get 100% write hit. 64 ms is way too high.
Let me know if you want me to take a look at it…
excellent stuff.
Unfortunately, this solution is incomplete because for some reason it only solves the problem in some cases. This was confirmed by hours and hours of testing with NetApp’s “nfsstat -d” output which shows misaligned IO per file.
To reliably solve the problem and prevent the underlying storage system from doing partial reads and writes, I had to use “advanced mode” and adjust the “offset” of each partition up to the nearest multiple of 8 before creating the file system. (Note: This is a different value than the starting sector)
In fdisk: Use the command “x” to enter “expert mode”, then “p” to show the current offset and “b” to adjust it.
Always interesting if people claim something doesn’t work for them and then come up with insufficient, vague information on why it doesn’t – and proving it with a claim based on a proprietary tool bundled with a competitive product but without giving full insight in what’s happening… Way to go…
Granted, in some cases you might have more than one problem. If you create misaligned vmfs file systems and then again misaligned file systems on top in the guest OS, then you may find yourself indeed spending many hours finding a root cause of performance problems. Or you might have an application (not being Oracle DB but another obscure DBMS) with bad manners that causes misaligned IO within the file (which has nothing to do with misaligned partitions). Or you’re using a state-of-the-art geeky cool new file system or innovative volume manager which hasn’t been used much in mission critical computing yet – just because you can…
FWIW you can get exactly what you need with the method I described (sfdisk one-liner) without going through manual fdisk menus as you say. Just use a different value than 128 if that works for you. Also in my follow-up post I describe how sfdisk is causing some trouble in RHEL 6 and “parted” might serve you better.
Hope this helps.
Fyi, it can be a problem installing Linux Mint 17.3 on “newer”(post-2009) SATA HDD that r 250GB or above in capacity bc of newer HDD technology = partitions not properly aligned with the disk. Something to do with 512 bytes data blocks in older IDE/PATA HDD(= came in 120GB or less sizes) n 4096 bytes data blocks in newer HDD. Eg the error message during installation = “… offset of 3584 bytes from minimum alignment …”
If u get this error message during install of Linux Mint 17.3 on an external/USB HDD, u need to first use GParted to realign the modern hard disk, ie unmount the HDD, delete the whole old partition n create the 1st new partition meant for the “/” or root partition(eg 20GB in size), with 1 MiB(= the default value) at the “beginning of this space” n 0 MiB at the “end of this space”, ensure “MiB” alignment box is checked(= do not select “cylinder”), set as Primary partition n format to fat32.
……. When creating the new 2nd partition meant for the home partition(eg 30GB in size), set 0 MiB for both the beginning n end of this space, Primary partition n format to fat32.
……. If u hv less than 4GB of RAM, u can also create a new 3rd partition for swap area(eg set as 2GB in size) for virtual memory/RAM on the hard disk, … do as for the home partition above.
U can leave the remaining space as unallocated or free space. After the install, u can create a 4th new partition to store yr back-ups, movies, music, photos, files or install another Linux distro(= need to “sudo update-grub” to dual-boot).
Only then, can u proceed with the Install via the Live DVD or USB-stick. After clicking “Something else”, click the 1st partition that u had created earlier with GParted, click “Change”, click “ext 4 file system”, click “/” for Mount point, n so on for the /home n swap partitions.
……. Ensure that the device for the bootloader is set to the external HDD, eg /dev/sdb or /dev/sdc. The cptr’s internal HDD/SSD is usually identified as /dev/sda.
Seems, the Linux installer does not realign modern HDD when u create a “New Partition Table” for the “/”(root), /home n swap partitions during the install process.
I hv just successfully installed n booted LM 17.3 on a “modern” 250GB external/USB HDD which initially had the above error message about the 1st root partition not properly aligned with the HDD during install = could not proceed with the install. This was after many hours of reinstalling, trial n error.
Thanks for the info! I’m actually using Mint 17.3 myself as main desktop and in my case it started the first partition at 2048 sectors (nicely aligned at 1 MiB) – but that’s in a virtual machine with a small virtual disk. My main desktop (with 256GB SSD) is multi-boot and has wrong alignment but that may be because I installed Windows (ugh) first…
Not sure what happens if you install on a large blank disk but I had the impression modern Linux distros did not have this issue anymore except if within the OS you use the old tools (fdisk) instead of the new (parted/gparted).
BTW I don’t think the block size is the cause, because even (most?) modern SSDs emulate 512-byte sectors for legacy reasons. Alignment issues on PC hardware are purely the result of the old MSDOS partitioning scheme (max 4 primary partitions, of which one can be extended).