Yet another customer was asking me for advice on implementing the ZFS file system on EMC storage systems. Recently I did some hands-on testing with ZFS as an Oracle database file store so that I could form an opinion on the matter.
One of the discussions that frequently comes up is the fragmentation issue. ZFS uses a copy-on-write allocation mechanism, which basically means that every time you write to a block on disk (whether this is a newly allocated block or, very important, an overwrite of a previously allocated one) ZFS will buffer the data and write it out to a completely new location on disk. In other words, it will never overwrite data in place. A lot of discussions can be found in the blogosphere and on forums debating whether this is really the case, how serious it is, what the impact is on performance, and what ZFS has done to either prevent or mitigate the issue (e.g. by using caching, smart disk allocation algorithms, etc).
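If you want to observe this behavior yourself, here is a minimal sketch using zdb (hypothetical pool, dataset and file names; on ZFS the object number of a file equals its inode number):

# ls -i /pool/fs/testfile          <-- the inode number doubles as the ZFS object number
# zdb -ddddd pool/fs <object#>     <-- dump the file's block pointers; the DVA fields are the physical locations
# now overwrite part of the file in place, sync, and dump again:
# the DVA of the rewritten block points to a completely new location on disk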
In this post I attempt to show how quickly database files on ZFS get fragmented on disk. I will not make any comments on how this affects performance (I’ll save that for a future post). I also deliberately ignore ZFS caching and other optimizing features – the only thing I want to show right now is how much fragmentation is caused on the physical disk by using ZFS for Oracle data files. Note that this is a deep technical and lengthy article, so you might want to skip all the details and jump right to the conclusion at the bottom 🙂
My test environment is legacy hardware, so I am not even attempting to break performance records or come close. The performance I get is not impressive at all. I’m using single- and dual-core servers with limited memory and single SATA disks with a limited data size that in no way represent modern datacenter equipment. But for showing the effects of fragmentation you don’t need fast hardware.
Now a bit of background on what I expected from my testing. First of all, database data files behave very differently from, say, office files (e.g. PowerPoint, PDF, Word docs) or other “file serving” style environments (e.g. content management, software development and many other things).
An office file typically gets written as one whole. On ZFS, I expect this to cause more or less sequential allocation with low fragmentation, even with larger files (5 to 10 MB). If you load the file in your editor and then save it, most office software (Microsoft Office, LibreOffice, …) will overwrite the whole file at once. Much the same goes for other regular file serving purposes. Due to the copy-on-write mechanism, ZFS writes the new file version to a completely new disk location, but the new file is still largely unfragmented.
A database file behaves differently. Such files typically get created only once (when creating or extending tablespaces) and only grow if the database needs more space. Otherwise the file itself does not change (although the contents within the file change heavily – as we will see).
Especially if the workload is OLTP-like with many small random updates, every update will modify a single database block – or a few of them at a time. What’s the best tool to create random I/O on an Oracle database? Kevin Closson’s SLOB!
But I made a few modifications to SLOB: first to make it run on my very modest hardware (128 sessions on a single CPU is not a good idea), then to tweak the data size (82MB for a single user is likely to stick in cache and I don’t want that), and finally to add a method for tracking database blocks on the physical disk.
So let’s go over the test details.
Equipment
ZFS Server:
- FreeBSD 9.1
- Dual processor (32-bit)
- Memory: 6 GB (of which 3.5 GB usable without PAE)
- Disks: 3x 73GB 10,000 rpm SCSI via Adaptec SCSI controller (not used in this test)
- 1x SATA 400 GB via SATA PCI controller
- Gigabit Ethernet (with Jumbo frames but set to MTU 7200 to match the DB server)
DB server:
- CentOS 5.9 64-bit single CPU
- Memory: 2GB
- Disks: 1x SATA 400GB
- Gigabit Ethernet (MTU size max 7200 due to low cost consumer hardware 😉)
- Database software: Oracle 11.2.0.3.0 with Direct NFS enabled
On Linux, I created an LVM logical volume of 1GB in size and zeroed it out:
# lvcreate -Ay -L1GB -nslob data
# dd if=/dev/zero of=/dev/data/slob bs=65536
Then I created a standard EXT3 file system on it:
# mkfs.ext3 /dev/data/slob
# mount /u02/slob/
# df | grep slob
/dev/mapper/data-slob  1032088  34092  945568  4% /u02/slob
# chown oracle:dba /u02/slob
Then I created a tablespace for SLOB:
CREATE BIGFILE TABLESPACE "EXT" DATAFILE '/u02/slob/slob.dbf' SIZE 2M AUTOEXTEND ON NEXT 2M MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO ; # ls -alh /u02/slob/slob.dbf ; df -h /u02/slob -rw-r----- 1 oracle dba 2.1M Feb 25 11:55 /u02/slob/slob.dbf Filesystem Size Used Avail Use% Mounted on /dev/mapper/data-slob 1008M 36M 973M 4% /u02/slob
Set up the SLOB table with only one user in the EXT tablespace:
$ ./setup.sh EXT 1
However, I modified SLOB a bit so that I can find row offsets in the tablespace later (via the MAGIC string). I also increased the default number of rows (12,000) to 100,000 so that it creates an 830MB tablespace (the default would only create about 93MB per user) – forcing 83% space allocation on the FS.
SQL> desc cf1
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 MAGIC                                              CHAR(32)
 CUSTID                                             NUMBER(8)
 C2                                                 VARCHAR2(128)
 C3                                                 VARCHAR2(128)
 .
 .
 C19                                                VARCHAR2(128)
 C20                                                VARCHAR2(128)
The code to enter MAGIC values in CF1.MAGIC looks like this and is run once after loading the table:
DECLARE
  i number := 1;
BEGIN
  FOR p IN (SELECT rowid FROM cf1) LOOP
    UPDATE cf1 SET magic = chr(10) || 'SLOBROW ' || to_char(i,'000000') || chr(10)
      WHERE p.rowid = cf1.rowid;
    i := i + 1;
  END LOOP;
END;
/
This runs sequentially through all rows of the table (starting from the first row) and sets a searchable text string with the row number in ASCII. The chr(10) are newlines so that the output looks better later on (using the “strings” command).
The table content looks like this:
> select magic,custid from cf1 where rownum < 10

MAGIC                                CUSTID
-------------------------------- ----------

                                       6051
SLOBROW 000001

                                       6305
SLOBROW 000002
…
SLOBROW 000008

                                       2735
SLOBROW 000009

9 rows selected.
Note that the newlines are caused by the chr(10), so this is on purpose.
We check the fragmentation within the datafile:
# strings -t x /u02/slob/slob.dbf |grep "SLOBROW" >outfile1.txt
The outfile1.txt looks like this (hex byte offset within the file, followed by the MAGIC string):
109daf SLOBROW 000001
10bdaf SLOBROW 000002
10ddaf SLOBROW 000003
10fdb1 SLOBROW 000004
111daf SLOBROW 000005
113dad SLOBROW 000006
115daf SLOBROW 000007
117daf SLOBROW 000008
119daf SLOBROW 000009
11bdad SLOBROW 000010
11ddad SLOBROW 000011
. . .
31213dad SLOBROW 099997
31215dad SLOBROW 099998
31217dad SLOBROW 099999
31219dad SLOBROW 100000
So within the datafile there is no fragmentation.
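If you want to verify that mechanically rather than by eyeball, here is a quick sketch (assuming GNU awk for strtonum); it prints any row whose offset is not higher than the previous row’s:

# awk '{ off = strtonum("0x" $1); if (off <= prev) print "out of order: " $0; prev = off }' outfile1.txt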
Now we look at the 1GB block device that holds the /u02/slob filesystem:
# strings -t x /dev/data/slob |grep "SLOBROW" >outfile2.txt
The outfile2.txt looks like:
790adad SLOBROW 000001
790cdad SLOBROW 000002
790edad SLOBROW 000003
7910dad SLOBROW 000004
7912dad SLOBROW 000005
7914daf SLOBROW 000006
7916dad SLOBROW 000007
7918dad SLOBROW 000008
791adad SLOBROW 000009
791cdad SLOBROW 000010
791edad SLOBROW 000011
. . .
39a0ddaf SLOBROW 099995
39a0fdaf SLOBROW 099996
39a11dad SLOBROW 099997
39a13dad SLOBROW 099998
39a15dad SLOBROW 099999
39a17dad SLOBROW 100000
Again, no fragmentation within the block device.
Now I ran SLOB with one user (writer) against the table.
One SLOB writer (lite) will update 1000 x 256 = 256,000 rows, each in its own 8K block – about 2 GB worth of blocks, which is more than the size of the table contents (830MB). After a few runs we can assume most blocks have been updated a few times.
Note that I updated the line in SLOB’s writer.sql.lite so that it touches all table rows:
v_r := dbms_random.value(257, 10000) ;
to
v_r := dbms_random.value(257, 100000) ;
to generate writes against all table rows (830MB).
Note that my SGA is 500MB, buffer cache only about 184 MB, which forces most of the block writes to physical I/O.
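For reference, one way to confirm the live buffer cache size from within the database (a sketch):

SQL> select round(bytes/1024/1024) MB from v$sgastat where name = 'buffer_cache';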
$ ./runit.sh 1 0
Tm 1538
(1538 seconds to read and write 2GB worth of random 8K blocks – about 300 IOPS – I only use a single slow SATA disk so this is not even too bad 😉)
From awr.txt FYI:
physical read bytes                 1,692,778,496     1,099,279.6     1,241,950.5
physical read total IO requests           207,392           134.7           152.2
physical read total bytes           1,722,757,120     1,118,747.6     1,263,945.1
physical write bytes                1,860,108,288     1,207,942.5     1,364,716.3
physical write total IO requests          210,036           136.4           154.1
physical write total bytes          2,107,344,896     1,368,496.4     1,546,107.8
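As a quick sanity check on the ~300 IOPS figure, using the request counts above and the 1538 second runtime:

# (207,392 read requests + 210,036 write requests) / 1538 seconds:
$ echo $(( (207392 + 210036) / 1538 ))
271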
Let’s check fragmentation on the block device again now that we have updated many blocks:
# strings -t x /dev/data/slob |grep "SLOBROW" >outfile3.txt
The file looks like this:
790adad SLOBROW 000001
790cc71 SLOBROW 000002
790cdad SLOBROW 000002
790ec71 SLOBROW 000003
790edad SLOBROW 000003
7910c71 SLOBROW 000004
7910dad SLOBROW 000004
7912dad SLOBROW 000005
7914c74 SLOBROW 000006
7914daf SLOBROW 000006
7916c71 SLOBROW 000007
7916dad SLOBROW 000007
7918c71 SLOBROW 000008
7918dad SLOBROW 000008
791ac71 SLOBROW 000009
791adad SLOBROW 000009
791cc71 SLOBROW 000010
791cdad SLOBROW 000010
. . .
39a0dc74 SLOBROW 099995
39a0ddaf SLOBROW 099995
39a0fc74 SLOBROW 099996
39a0fdaf SLOBROW 099996
39a11c71 SLOBROW 099997
39a11dad SLOBROW 099997
39a13dad SLOBROW 099998
39a15c71 SLOBROW 099999
39a15dad SLOBROW 099999
39a17c71 SLOBROW 100000
39a17dad SLOBROW 100000
So why the double entries? I guess Oracle modifies the block contents after updating a row, in such a way that offsets within the blocks change slightly. The double entries are just leftover copies of the original data; within Oracle this is now unallocated content.
Notice that:
a) No fragmentation occurs – not in the datafile, not on the block device
b) The offsets of every row stay mostly the same.
Let’s take a look at row 10, row 5000 and row 10000 as an example:
# cat outfile2.txt | grep 000010
791cdad SLOBROW 000010
# cat outfile3.txt | grep 000010
791cc71 SLOBROW 000010
791cdad SLOBROW 000010
# cat outfile2.txt | grep 005000
a307dad SLOBROW 005000
# cat outfile3.txt | grep 005000
a307c71 SLOBROW 005000
a307dad SLOBROW 005000
# cat outfile2.txt | grep 010000
ca8fdad SLOBROW 010000
# cat outfile3.txt | grep 010000
ca8fc71 SLOBROW 010000
ca8fdad SLOBROW 010000
Note that the offset of the MAGIC string only changed a little bit: 316 bytes, well within a single 8K Oracle block. Again this shows that Oracle modified the contents of the blocks but not the block offsets within the datafile.
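The arithmetic behind the 316 bytes, for reference:

$ printf "%d\n" $(( 0x791cdad - 0x791cc71 ))
316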
Let’s do another run…
$ ./runit.sh 1 0
Tm 1180
(a bit faster, probably because the ext3 file system cache was warmed up)
Check if anything changed on the layout since the last run:
# strings -t x /dev/data/slob | grep "SLOBROW" > outfile4.txt
# diff outfile3.txt outfile4.txt | wc -l
1512
# diff outfile3.txt outfile4.txt | grep SLOB | wc -l
756
# diff outfile3.txt outfile4.txt | grep SLOB | grep -v ">"
<no output>
So we have a few changes (additions in the new file – reflecting 756 changes within Oracle blocks!)
# diff outfile3.txt outfile4.txt | head -6
376a377
> 7a94c71 SLOBROW 000196
513a515
> 7b28c71 SLOBROW 000268
601a604
> 7b86c71 SLOBROW 000315
Again Oracle moved some data within the blocks. But no single block moved positions within the datafile, and no blocks within the datafile moved offsets on disk.
We can conclude that on EXT3 there is no fragmentation other than that caused by initial file creation. We can update blocks as many times as we like, but the block offsets never change again. This also makes me confident that filling an ext3 filesystem up to 99% with large static datafiles will have minimal impact on fragmentation.
Let’s now do the same thing on ZFS.
On my BSD ZFS server:
root@zfs:/root # gpart show
.
.
=>        63  781422705  ada0  MBR  (372G)
          63        126        - free -  (63k)
         189   16776963     1  freebsd  [active]  (8G)
    16777152   16777215     2  freebsd  (8G)
    33554367    2097144     3  freebsd  (1G)
    35651511  745771257        - free -  (355G)
The disk slice I will put the zpool on is /dev/ada0s3, a 1GB partition (like the EXT3 logical volume). The other slices and disks will be used for performance testing later.
ada0 is also a 400GB SATA disk like the one in the Linux DB server (not that this matters for this test). Let’s zero out the data device that we want to use for the zpool (so that we can use “strings” later on the device):
root@zfs:/root # dd if=/dev/zero of=/dev/ada0s3 bs=65536
dd: /dev/ada0s3: short write on character device
dd: /dev/ada0s3: end of device
16384+0 records in
16383+1 records out
1073737728 bytes transferred in 15.815460 secs (67891652 bytes/sec)
Now we create the zpool on the disk slice and a ZFS filesystem on it:
root@zfs:/root # zpool create data ada0s3
root@zfs:/root # zfs create -o recordsize=8K data/data
root@zfs:/root # zfs set sharenfs=on data/data
root@zfs:/root # zfs set logbias=throughput data/data
root@zfs:/root # zfs set sync=disabled data/data
root@zfs:/root # zfs set atime=off data/data
root@zfs:/root # chown oracle:dba /data/data/
Note that I used 8K recordsize because this is the Oracle recommendation for datafiles. It also prevents excess reads when updating a single 8K block.
“Logbias=throughput” is a tricky one as it might cause lost transactions or even corruption after a system crash. But I am not interested in that right now, and I need this setting to get somewhat decent performance. Same for “sync=disabled”. This is normally not recommended, but on my not-so-powerful system it boosts performance big time and for this test it does not matter.
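To double-check that the settings took effect (output shown as I would expect it given the commands above):

root@zfs:/root # zfs get recordsize,logbias,sync,atime data/data
NAME       PROPERTY    VALUE       SOURCE
data/data  recordsize  8K          local
data/data  logbias     throughput  local
data/data  sync        disabled    local
data/data  atime       off         local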
root@zfs:/root # zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data  1016M   140K  1016M     0%  1.00x  ONLINE  -
root@zfs:/root # zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          ada0s3    ONLINE       0     0     0

errors: No known data errors
root@zfs:/root # zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
data        140K   984M    32K  /data
data/data    31K   984M    31K  /data/data
Going back to the Linux db host:
# mount -a
# df | grep zfs
zfs:/ufs        8106752      0 7458176  0% /zfs/ufs
zfs:/data/data  1007552     64 1007488  1% /zfs/data
(the UFS entry is there because I plan to test UFS against ZFS performance as well)
I have prepared Direct NFS and this is what the oranfstab looks like:
# cat /etc/oranfstab
server: zfs
path: 192.168.0.9
export: /data/data mount: /zfs/data
export: /ufs mount: /zfs/ufs
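For completeness, the matching /etc/fstab entries on the Linux host would look roughly like this (a sketch using common Oracle-style NFS mount options; the actual options may differ):

zfs:/data/data  /zfs/data  nfs  rw,bg,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,timeo=600  0 0
zfs:/ufs        /zfs/ufs   nfs  rw,bg,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,timeo=600  0 0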
Now we have an empty ZFS pool/FS. Let’s create a tablespace on it (after bouncing the database to re-enable Direct NFS):
CREATE BIGFILE TABLESPACE "ZFS" DATAFILE '/zfs/data/zfs.dbf' SIZE 2M REUSE AUTOEXTEND ON NEXT 2M MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO ;
Now let’s drop the existing SLOB tablespace (on EXT3) and users first:
$ sqlplus / as sysdba @drop_users
And create the SLOB table:
$ ./setup.sh ZFS 1
Setting up user 1
Waiting for background processes - Mon Feb 25 13:44:55 CET 2013
Table created.
PL/SQL procedure successfully completed.
So now we have a datafile identical to the one in the previous test, but on ZFS (via dNFS) instead of a local EXT3 file system.
root@zfs:/root # df -h /data/data/ ; ls -alh /data/data/zfs.dbf
Filesystem    Size    Used   Avail Capacity  Mounted on
data/data     983M    841M    142M    86%    /data/data
-rw-r-----  1 oracle  dba   834M Feb 25 13:48 /data/data/zfs.dbf
Before even running the first SLOB test, let’s look at the MAGIC offsets again (first in the datafile, then on the raw disk):
root@zfs:/root # strings -t x /data/data/zfs.dbf | grep "SLOBROW" > zfsout1.txt ; cat zfsout1.txt
109daf SLOBROW 000001
10bdad SLOBROW 000002
10ddaf SLOBROW 000003
10fdad SLOBROW 000004
111dad SLOBROW 000005
. . .
3120fdad SLOBROW 099995
31211dad SLOBROW 099996
31213dad SLOBROW 099997
31215dad SLOBROW 099998
31217daf SLOBROW 099999
31219dad SLOBROW 100000
That looks very familiar. So the zfs.dbf file on ZFS itself is not fragmented (like in the initial test on ext3).
Now let’s inspect the raw block device:
root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout2.txt ; cat zfsout2.txt
4037ad SLOBROW 001087
4081ad SLOBROW 001088
40a1ad SLOBROW 001089
40cfaf SLOBROW 001090
410dad SLOBROW 001091
412dad SLOBROW 001092
. . .
3cbf7fad SLOBROW 084135
3cbf9faf SLOBROW 084136
3cbfbfad SLOBROW 084137
3cbfdfaf SLOBROW 084138
#
So you see the first block is not row 1. The last rows are not rows 99990 thru 100000.
Let’s inspect a random area in the file:
. . .
28c01dad SLOBROW 095075
28c03dad SLOBROW 095081
28c05dad SLOBROW 095080
28c07dad SLOBROW 095077
28c09dad SLOBROW 095082
28c0bdad SLOBROW 095083
28c0ddad SLOBROW 095084
28c0fdad SLOBROW 095085
28c11dad SLOBROW 095086
28c13dad SLOBROW 095076
28c15dad SLOBROW 095088
28c17dad SLOBROW 095087
28c19dad SLOBROW 095090
28c1bdad SLOBROW 095092
28c1ddad SLOBROW 095089
28c1fdad SLOBROW 095093
28c21dad SLOBROW 095094
28c23dad SLOBROW 095078
28c25dad SLOBROW 095096
28c27dad SLOBROW 095097
28c29dad SLOBROW 095098
28c2bdad SLOBROW 095095
28c2ddaf SLOBROW 095099
28c2fdad SLOBROW 095100
28c31dad SLOBROW 095102
28c33dad SLOBROW 095091
28c35dad SLOBROW 095104
28c37dad SLOBROW 095101
28c39daf SLOBROW 095106
28c3bdad SLOBROW 095103
. . .
So you see the rows are not linearly allocated and we have some weird forward and backward row jumps all over the place. That’s directly after creating the table and filling it with data – and before throwing block updates at it!
Let’s figure out how many sequential chunks we have. With ZFS copying blocks around this is less trivial than it seems. Consider this entry:
18e3afad SLOBROW 000879
27444bad SLOBROW 000879
How do we know which of these two reflects the real Oracle block and which one is the leftover from a previous write? Ignoring such problems for a moment (there are only a few duplicates so far, so I will work this out later), let’s try to find out how much fragmentation we have. How do you define that?
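For the record, a quick way to count the MAGIC strings that appear at more than one disk offset:

root@zfs:/root # awk '{ print $2, $3 }' zfsout2.txt | sort | uniq -d | wc -l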
Let’s define the fragmentation as being the total disk seek distance (in bytes) if we had to read the file sequentially. So every time the disk heads have to reposition we are going to measure this as the disk offset jump on the disk.
Let’s say a seek is a forward or backward jump larger than 4 blocks (32768 bytes), ignoring data blocks that are not adjacent but still very close, i.e. on the same disk track.
The script I used to calculate seek distance:
#!/usr/local/bin/bash
typeset -i c=0
typeset -i r=0
oldoff=0
oldrow=0
distance=0
# read the scan file (hex offset, magic, row number), sorted by row number
cat $1 | sort -n -k 3 | while read offset magic row
do
  off=$(printf "%d" 0x$offset)              # convert hex offset to decimal
  diff=$(expr $off - $oldoff)               # jump (in bytes) since the previous row
  # count it as a seek if the jump is more than 4 blocks (32768 bytes) forward or backward
  if [[ $oldoff -ne 0 ]] && [[ $diff -gt 32768 || $diff -lt -32768 ]]; then
    distance=$(expr $distance + $(expr ${diff#-} / 1024 ))  # add absolute jump in KB
    rows=$(expr $row - $oldrow)             # rows processed since the previous seek
    oldrow=$row
    # columns: row, rows since last seek, offset, jump, cumulative distance (MB)
    printf "%8s %8s %11s %11s %8s\n" $row $rows $off $diff $(expr $distance / 1024)
  fi
  oldoff=$off
done
Now the script is not perfect – you could argue whether it reflects the real fragmentation or not, and you could even propose better methods. But getting closer to perfection makes the thing very complex, and I like to keep it simple.
Running the script against the output file:
[root@zfs ~]# ./countfragments zfsout2.txt | tee frags.txt
results in a total seek distance of 94GB. Note that in the real world the physical disk seek distance is probably much less, because a) we have duplicates that are not real, and b) if you see something like this:
28c2bdad SLOBROW 095095
28c2ddaf SLOBROW 095099
28c2fdad SLOBROW 095100
28c31dad SLOBROW 095102
28c33dad SLOBROW 095091
then the caching mechanism will most likely use prefetching and caching to work this out with only one or two seeks (instead of jumping up and down for each block). But I ignore that for now – to make the point, we assume a dumb disk without any cache or I/O order optimizations (otherwise the whole exercise would become much too complicated).
Now let’s kick off a slob run and see what happens.
$ ./runit.sh 1 0
Tm 3205
(note that on ext3 this took 1180 seconds, but the difference could also be caused by the NFS network layer, so I will not draw conclusions from it).
Some awr.txt output again FYI:
physical write bytes                1,910,071,296       595,624.9     1,057,040.0
physical write total IO requests          172,951            53.9            95.7
physical write total bytes          2,254,415,872       703,003.2     1,247,601.5
physical write total multi block            1,866             0.6             1.0
physical writes                           233,163            72.7           129.0
After the run I created a new scan of the SLOB blocks:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout3.txt

Comparing it against zfsout2.txt, the scan taken before the SLOB run:

root@zfs:/root # wc -l zfsout2.txt zfsout3.txt
  100056 zfsout2.txt
  212729 zfsout3.txt
  312785 total
There are too many duplicates now to make realistic estimations about fragmentation. So let’s update the MAGIC strings so that we can look for actual database block and exclude most of the leftovers from previous writes.
updatemagic.sql:
DECLARE
  i number := 1;
BEGIN
  FOR p IN (SELECT rowid FROM cf1) LOOP
    UPDATE cf1 SET magic = chr(10) || '&magic' || to_char(i,'000000') || chr(10)
      WHERE p.rowid = cf1.rowid;
    i := i + 1;
  END LOOP;
END;
/
Running it:
$ sqlplus user1/user1 @updatemagic
Enter value for magic: SLOB001
old   6: UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
new   6: UPDATE CF1 SET MAGIC=chr(10) || 'SLOB001' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;

PL/SQL procedure successfully completed.
(FYI the runtime was 22 minutes – on EXT3 this runs in less than 1 minute – so you get a feeling for the delays here. I can tell you the network was not highly utilized at all; in fact it was hard to find any bottleneck, but it seemed to be caused by a high number of CPU context switches per second on the BSD machine)
Here is the output on the BSD machine during the MAGIC update, from iostat against the ZFS device:
root@zfs:/root # iostat -x ada0 10
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
. . .
ada0     133.5   0.2   580.3     0.9    1   1.9   8  <-- it looked like this for a while, only reads but hardly any updates.
ada0     133.4   0.2   581.8     0.9    0   2.0   9  --  guess this is due to delayed dbwriter activity
ada0     116.8  61.1  1618.8   358.9    3  22.8  94  <-- here it finally starts writing
ada0     113.0  84.6  1346.8   397.8   10  18.7  88
ada0     108.0  75.8  1344.9   443.2    7  20.3  90
ada0     109.8  73.4  1505.4   494.5    6  19.3  86
ada0      96.7 186.3  1409.4  1170.1    6  16.9  92
ada0      41.8  14.2   390.2   137.2    8  26.6  37
ada0       0.0   0.0     0.0     0.0    8   0.0   0  <-- here ZFS does not do any IO for about 20 seconds. Bottleneck assumed
ada0       0.0   0.0     0.0     0.0    8   0.0   0  --  to be completely CPU/memory based – probably in part because of
ada0      17.4 186.7   615.3  1334.9   10 126.4 329  --  my old hardware ;-)
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0      25.5  26.9   231.8   118.6    1 589.4 322
ada0     111.4  85.5  1145.5   630.2    1  17.4  94
ada0      98.4 156.5  1269.0  1074.7    1  16.2  89
. . .
Check for the results:
root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout4.txt
root@zfs:/root # wc -l zfsout4.txt
82841 zfsout4.txt
(Weird. There should be at least 100,000 rows.) The fix: flush Oracle’s “dirty cache” 😉
SQL> alter system checkpoint;
.
.
root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout5.txt
root@zfs:/root # wc -l zfsout5.txt
101223 zfsout5.txt
(much better, and only 1223 duplicates)
Let’s check for fragmentation in the new environment:
root@zfs:/root # ./countfragments zfsout5.txt | tee frags.txt
root@zfs:/root # tail -5 frags.txt
  099996        1    95640177  -633761280  2433521
  099997        1   729409649   633769472  2434125
  099998        1    95918705  -633490944  2434730
  099999        1   730164340   634245635  2435334
  100000        1    95943281  -634221059  2435939
So the total seek distance, after a SLOB write run followed by an update of the magic numbers, is now 2435939 MB or about 2500 GB (up from 94 GB after initial creation). You can see fragmentation taking its heavy toll. Let’s also find out how large the largest contiguous sequence of blocks is:
root@zfs:/root # cat frags.txt | sort -n -k 2 | tail
  040557       13   653829233      967168   184067
  040570       13   180370545  -473556992   184519
  067558       13    79769713  -408209920   510990
  084989       13   205561969  -730385214   613430
  031627       14    12778609  -165509632   151051
  033906       14   855796337     3719168   159611
  034090       15   264289393  -594196992   160196
  085566       15     4227185  -605910016   614971
  033971       31   858530417      166912   159618
  085389       36   609938548       35331   614351
(note that the sorted second column indicates the number of adjacent blocks before jumping to another offset, so a value of 15 means the script found 15 adjacent 8K blocks)
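A quick sketch to see the whole distribution of run lengths rather than just the tail:

root@zfs:/root # awk '{ print $2 }' frags.txt | sort -n | uniq -c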
Within the new layout of the datafile on ZFS, there are no sequential chunks larger than 36 Oracle blocks! And actually, the vast majority of contiguous block areas are less than 4 blocks in size.
That means very heavy fragmentation!
For the record, let’s look at the datafile itself again (not the zfs block device):
root@zfs:/root # strings -t x /data/data/zfs.dbf | grep "SLOB001" > zfsout6.txt
root@zfs:/root # cat zfsout6.txt | tail -5
31211c71 SLOB001 099996
31213c71 SLOB001 099997
31215c71 SLOB001 099998
31217c74 SLOB001 099999
31219c71 SLOB001 100000
That’s still 100% sequential. So within the datafile not a single block has changed offsets!
root@zfs:/root # ./countfragments zfsout6.txt
. . .
  096804     1020   798006385       40957        3
  097824     1020   806394993       40960        3
  098844     1020   814783601       40960        3
  099864     1020   823172212       40963        3
Total seek distance within the datafile: Only 3 MB! (as opposed to 2500 GB on the ZFS block device).
Conclusion
First a disclaimer: I used ZFS on FreeBSD 9.1. ZFS might behave differently on Solaris (although I doubt it). ZFS is an OS feature that seems to be a very popular choice for some administrators, up to a point where I would call it an almost “religious” choice. So I would not be surprised to get some flaming comments on this post. As long as it is honest and valuable feedback I will approve it, even if it proves me wrong.
Random(ish) workload against database tables on ZFS causes heavy fragmentation.
Whether this impacts performance is another discussion and I will attempt to blog about that later. ZFS (AFAIK) does not have defragmentation capabilities today, so to fix this, some of Oracle’s documentation vaguely recommends copying complete datafiles around:
For a data warehouse database, keep 20% free space in the storage pool as a general rule. Periodically copying data files reorganizes the file location on disk and gives better full scan response time.
And that’s exactly my point. How should I do that in a mission critical environment with 99.999% required availability? That would require taking the tablespace offline, copying the datafile(s) to a separate file system, and then copying them back???
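For illustration only, here is a sketch of what that shuffle would look like (the /staging filesystem is hypothetical, and the tablespace is unavailable for the whole duration):

SQL> ALTER TABLESPACE zfs OFFLINE NORMAL;
# cp /zfs/data/zfs.dbf /staging/zfs.dbf    <-- copy off the ZFS pool
# cp /staging/zfs.dbf /zfs/data/zfs.dbf    <-- copy back: ZFS writes the blocks out freshly and largely sequentially
SQL> ALTER TABLESPACE zfs ONLINE;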
Update: Interesting article on degrading ZFS write performance http://blog.delphix.com/uday/2013/02/19/78/