Disk Fragmentation
Disk Fragmentation – O&O technologies.
Hope they don’t mind the free advertising

Yet another customer was asking me for advice on implementing the ZFS file system on EMC storage systems. Recently I did some hands-on testing with ZFS as Oracle database file store so that I could get an opinion on the matter.

One of the frequent discussions comes up is on the fragmentation issue. ZFS uses a copy-on-write allocation mechanism which basically means, every time you write to a block on disk (whether this is a newly allocated block, or, very important, overwriting a previously allocated one) ZFS will buffer the data and write it out on a completely new location on disk. In other words, it will never overwrite data in place. Now a lot of discussions can be found in the blogosphere and on forums debating whether this is really the case, how serious this is, what the impact is on performance and what ZFS has done to either prevent, or, alternatively, to mitigate the issue (i.e. by using caching, smart disk allocation algorithms, etc).

In this post I attempt to prove how database files on ZFS file systems get fragmented on disk quickly. I will not make any comments on how this affects performance (I’ll save that for a future post). I also deliberately ignore ZFS caching and other optimizing features – the only thing I want to show right now is how much fragmentation is caused on physical disk by using ZFS for Oracle data files. Note that this is a deep technical and lengthy article so you might want to skip all the details and jump right to the conclusion at the bottom 🙂

My test environment is legacy hardware and therefore I am not even attempting to break performance records or come close. The performance I get is not impressive at all. I’m using single- and dual-core servers with limited memory and single SATA disks with a limited data size that are in no way representing modern datacenter equipment. For showing the effects of fragmentation you don’t need fast hardware.

Now a bit of background on what I expected from my testing. First of all, database data files behave much different than, say, office files (i.e. PowerPoint, PDF, word docs) or other “file serving” style environments (i.e. content management, software development and many other things).

An office file typically gets written as one whole. On ZFS, I expect this to cause more or less sequential, low fragmented allocation even with larger files (5 to 10 MB). If you load the file in your text editor, then save it, most office software (Microsoft Office, Libreoffice, …) will overwrite the whole file at once. Much the same for other regular file serving purposes. In ZFS due to the copy-on-write mechanism this causes the new file version to be completely written to a new disk location but the new file is still largely unfragmented.

A database file behaves differently. Such files typically get created only once (when creating or extending tablespaces) and only grow if the database needs more space. Otherwise the file itself does not change (although the contents within the file change heavily – as we will see).

Especially if the workload is OLTP-like with many small random updates, every update will modify a single database block – or a few of them at a time. What’s the best tool to create random I/O on an Oracle database? Kevin Closson’s SLOB!

But I made a few modifications to SLOB, first of all to make it run on my very modest hardware (128 sessions on a single CPU is not a good idea), tweak the data size (82MB for a single user is likely to stick in cache and I don’t want that), and finally, I needed a method to track database blocks on a physical disk.

So let’s go over the test details.


ZFS Server:

  • FreeBSD 9.1
  • Dual processor (32 bit) FreeBSD 9.1
  • Memory: 6 GB (of which 3.5 GB usable without PAE)
  • Disks: 3x 73GB 10.000 rpm SCSI via Adaptec SCSI controller (not used in this test)
  • 1x SATA 400 GB via SATA PCI controller
  • Gigabit Ethernet (with Jumbo frames but set to MTU 7200 to match the DB server)

DB server:

  • CentOS 5.9 64-bit single CPU
  • Memory: 2GB
  • Disks: 1x SATA 400GB
  • Gigabit Ethernet (MTU size max 7200 due to low cost consumer hardware 😉
  • Database software: Oracle with Direct NFS enabled

On Linux, I created an LVM logical volume of 1GB in size and zeroed it out:

# lvcreate -Ay -L1GB -nslob data
# dd if=/dev/zero of=/dev/data/slob bs=65536

Then created a standard EXT3 file system on it:

# mkfs.ext3 /dev/data/slob
# mount /u02/slob/
# df | grep slob
                       1032088     34092    945568   4% /u02/slob
# chown oracle:dba /u02/slob

Then I created a tablespace for SLOB:

# ls -alh /u02/slob/slob.dbf ; df -h /u02/slob
-rw-r----- 1 oracle dba 2.1M Feb 25 11:55 /u02/slob/slob.dbf
Filesystem            Size  Used Avail Use% Mounted on
1008M   36M  973M   4% /u02/slob

Setup SLOB table with only one user in EXT tablespace:

$ ./setup.sh EXT 1

However, I modified SLOB a bit to be able to find row offsets in a tablespace later (via the MAGIC string). I also increased the default number of rows (12000) to 100000 so that it will create an 830MB tablespace (default will only create about 93MB per user) – forcing 83% space allocation on the FS.

SQL> desc cf1
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
MAGIC                                              CHAR(32)
CUSTID                                             NUMBER(8)
C2                                                 VARCHAR2(128)
C3                                                 VARCHAR2(128)
C19                                                VARCHAR2(128)
C20                                                VARCHAR2(128)

The code to enter MAGIC values in CF1.MAGIC looks like this and is run once after loading the table:

i           number := 1;
FOR p IN ( SELECT rowid FROM cf1)
UPDATE CF1 SET MAGIC=chr(10) || 'SLOBROW ' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
i := i+1;
end loop;

This causes a sequential run through all rows of the table (starting from the first row) and setting a searchable text string with the row number as ASCII. The chr(10) are newlines so that the output looks better later on (using the “strings” command).

The table content looks like this:

> select magic,custid from cf1 where rownum < 10
MAGIC                                CUSTID
-------------------------------- ----------
SLOBROW  000001
SLOBROW  000002
SLOBROW  000008
SLOBROW  000009

9 rows selected.

Note that the newlines are caused by the char(10) so this is on purpose.

We check the fragmentation within the datafile:

# strings -t x /u02/slob/slob.dbf |grep "SLOBROW" >outfile1.txt

The outfile1.txt looks like this (hex block offset, followed by the MAGIC string):

109daf SLOBROW  000001
10bdaf SLOBROW  000002
10ddaf SLOBROW  000003
10fdb1 SLOBROW  000004
111daf SLOBROW  000005
113dad SLOBROW  000006
115daf SLOBROW  000007
117daf SLOBROW  000008
119daf SLOBROW  000009
11bdad SLOBROW  000010
11ddad SLOBROW  000011
31213dad SLOBROW  099997
31215dad SLOBROW  099998
31217dad SLOBROW  099999
31219dad SLOBROW  100000

So within the datafile there is no fragmentation.

Now we look at the 1GB block device that holds the /u02/slob filesystem:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile2.txt

The outfile2.txt looks like:

790adad SLOBROW  000001
790cdad SLOBROW  000002
790edad SLOBROW  000003
7910dad SLOBROW  000004
7912dad SLOBROW  000005
7914daf SLOBROW  000006
7916dad SLOBROW  000007
7918dad SLOBROW  000008
791adad SLOBROW  000009
791cdad SLOBROW  000010
791edad SLOBROW  000011
39a0ddaf SLOBROW  099995
39a0fdaf SLOBROW  099996
39a11dad SLOBROW  099997
39a13dad SLOBROW  099998
39a15dad SLOBROW  099999
39a17dad SLOBROW  100000

Again, no fragmentation within the block device.

Now I ran SLOB with one user (writer) against the table.

One SLOB writer (lite) will update 1000 x 256 = 256000 rows x 8K blocks = about 2 GB, which is more than the size of the table contents (830MB). After a few runs we can assume most blocks have been updated a few times.

Note that I updated the line in SLOB’s writer.sql.lite so that it touches all table rows:

v_r := dbms_random.value(257, 10000) ;


v_r := dbms_random.value(257, 100000) ;

to generate writes against all table rows (830MB).

Note that my SGA is 500MB, buffer cache only about 184 MB, which forces most of the block writes to physical I/O.

$ ./runit.sh 1 0

Tm 1538

(1538 seconds to read and write 2GB worth of random 8K blocks – about 300 iops – I only use a single slow SATA disk so this is not even too bad 😉

From awr.txt FYI:

physical read bytes                   1,692,778,496    1,099,279.6   1,241,950.5
physical read total IO requests             207,392          134.7         152.2
physical read total bytes             1,722,757,120    1,118,747.6   1,263,945.1
physical write bytes                  1,860,108,288    1,207,942.5   1,364,716.3
physical write total IO requests            210,036          136.4         154.1
physical write total bytes            2,107,344,896    1,368,496.4   1,546,107.8

Let’s check fragmentation on the block device again now that we have updated many blocks:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile3.txt

The file looks like this:

790adad SLOBROW  000001
790cc71 SLOBROW  000002
790cdad SLOBROW  000002
790ec71 SLOBROW  000003
790edad SLOBROW  000003
7910c71 SLOBROW  000004
7910dad SLOBROW  000004
7912dad SLOBROW  000005
7914c74 SLOBROW  000006
7914daf SLOBROW  000006
7916c71 SLOBROW  000007
7916dad SLOBROW  000007
7918c71 SLOBROW  000008
7918dad SLOBROW  000008
791ac71 SLOBROW  000009
791adad SLOBROW  000009
791cc71 SLOBROW  000010
791cdad SLOBROW  000010
39a0dc74 SLOBROW  099995
39a0ddaf SLOBROW  099995
39a0fc74 SLOBROW  099996
39a0fdaf SLOBROW  099996
39a11c71 SLOBROW  099997
39a11dad SLOBROW  099997
39a13dad SLOBROW  099998
39a15c71 SLOBROW  099999
39a15dad SLOBROW  099999
39a17c71 SLOBROW  100000
39a17dad SLOBROW  100000

So why the double entries? I guess Oracle modifies the block contents after updating a row, in such a way that offsets within the blocks change slightly. The double entries are just a leftover feature of original data but within Oracle this is now unallocated content.

Notice that:
a) No fragmentation occurs, not on the datafile, not on the block device
b) the offsets of every row are mostly the same.

Lets take a look at row 10, row 5000 and row 10000 as an example:

# cat outfile2.txt |grep 000010
791cdad SLOBROW  000010
# cat outfile3.txt |grep 000010
791cc71 SLOBROW  000010
791cdad SLOBROW  000010

# cat outfile2.txt |grep 005000
a307dad SLOBROW  005000
# cat outfile3.txt |grep 005000
a307c71 SLOBROW  005000
a307dad SLOBROW  005000

# cat outfile2.txt |grep 010000
ca8fdad SLOBROW  010000
# cat outfile3.txt |grep 010000
ca8fc71 SLOBROW  010000
ca8fdad SLOBROW  010000

Note that the offset of the magic only changed a little bit: 316 bytes. Well within a single 8K oracle block. Again this shows that Oracle modified the contents of the blocks but not the block offsets within the datafile.

Let’s do another run…

$ ./runit.sh 1 0

Tm 1180

(bit faster, probably due to ext3 file system cache warmed up)

Check if anything changed on the layout since the last run:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile4.txt
# diff outfile3.txt outfile4.txt |wc -l
# diff outfile3.txt outfile4.txt |grep SLOB|wc -l
# diff outfile3.txt outfile4.txt |grep SLOB|grep -v ">"
<no output>

So we have a few changes (additions to the new file – reflecting 756 changes within Oracle blocks!)

# diff outfile3.txt outfile4.txt | head -6
> 7a94c71 SLOBROW  000196
> 7b28c71 SLOBROW  000268
> 7b86c71 SLOBROW  000315

Again Oracle moved some data within the blocks. But no single block moved positions within the datafile, and no blocks within the datafile moved offsets on disk.

We can conclude that on EXT3 there is no fragmentation other than that caused by initial file creation. We can update blocks as many times as we like but the block offset never changes again. This also makes me confident that filling up an ext3 filesystem up to 99% with large static datafiles will have minimal impact on fragmentation.

Let’s now do the same thing on ZFS.

On my BSD ZFS server:

root@zfs:/root # gpart show
=>       63  781422705  ada0  MBR  (372G)
63        126        - free -  (63k)
189   16776963     1  freebsd  [active]  (8G)
16777152   16777215     2  freebsd  (8G)
   33554367    2097144     3  freebsd  (1G) 
35651511  745771257        - free -  (355G)

The disk slice I will put the Zpool on is /dev/ada0s3, an 1GB partition (like the EXT3 logical volume). The other slices and disks will be used for performance testing later.

Ada0 is also a 400GB SATA disk like the one on the Linux DB server (not that this matters for this test). Let’s zero out the data device that we want to use for the ZPOOL (so that we can use “strings” later on the device):

root@zfs:/root # dd if=/dev/zero of=/dev/ada0s3 bs=65536
dd: /dev/ada0s3: short write on character device
dd: /dev/ada0s3: end of device
16384+0 records in
16383+1 records out
1073737728 bytes transferred in 15.815460 secs (67891652 bytes/sec)

Now we create the zpool on the RAID device and a zfs filesystem on it:

root@zfs:/root # zpool create data ada0s3
root@zfs:/root # zfs create -o recordsize=8K data/data
root@zfs:/root # zfs set sharenfs=on data/data
root@zfs:/root # zfs set logbias=throughput data/data
root@zfs:/root # zfs set sync=disabled data/data
root@zfs:/root # zfs set atime=off data/data
root@zfs:/root # chown oracle:dba /data/data/

Note that I used 8K recordsize because this is the Oracle recommendation for datafiles. It also prevents excess reads when updating a single 8K block.

“Logbias=throughput” is a tricky one as it might cause lost transactions or even corruption when suffering a system crash. But I am not interested in that right now and I need this setting to get a bit decent performance. Same for “sync=disabled”. This is normally not recommended but on my not-so-powerful system it boosts performance bigtime and for this test it will not matter.

root@zfs:/root # zpool list
data  1016M   140K  1016M     0%  1.00x  ONLINE  -
root@zfs:/root # zpool status
pool: data
state: ONLINE
scan: none requested


data        ONLINE       0     0     0
ada0s3    ONLINE       0     0     0

errors: No known data errors
root@zfs:/root # zfs list
data        140K   984M    32K  /data
data/data    31K   984M    31K  /data/data

Going back to the Linux db host:

# mount -a
# df | grep zfs
zfs:/ufs               8106752         0   7458176   0% /zfs/ufs
zfs:/data/data         1007552        64   1007488   1% /zfs/data

(the UFS entry is there because I plan to test UFS against ZFS performance as well)
I have prepared Direct NFS and this is what the oranfstab looks like:

# cat /etc/oranfstab
server: zfs
export: /data/data mount: /zfs/data
export: /ufs mount: /zfs/ufs

Now we have an empty ZFS pool/FS. Let’s create a tablespace on it. (after bouncing the database to re-enable direct NFS):


Now let’s drop the existing SLOB tablespace (on EXT3) and users first:

$ sqlplus / as sysdba @drop_users

And create the SLOB table:

$ ./setup.sh ZFS 1
Setting up user 1
Waiting for background processes - Mon Feb 25 13:44:55 CET 2013
Table created.
PL/SQL procedure successfully completed.

So now we have an identical datafile as in the previous test but now on ZFS (via dNFS) instead of local EXT3 file system.

root@zfs:/root # df -h /data/data/ ; ls -alh /data/data/zfs.dbf
Filesystem    Size    Used   Avail Capacity  Mounted on
data/data     983M    841M    142M    86%    /data/data
-rw-r-----  1 oracle  dba   834M Feb 25 13:48 /data/data/zfs.dbf

Before even running the first SLOB test, let’s look at the MAGIC offsets again (first in the datafile, then on the raw disk):

root@zfs:/root # strings -t x /data/data/zfs.dbf | grep "SLOBROW" > zfsout1.txt ; cat zfsout1.txt
109daf SLOBROW  000001
10bdad SLOBROW  000002
10ddaf SLOBROW  000003
10fdad SLOBROW  000004
111dad SLOBROW  000005
3120fdad SLOBROW  099995
31211dad SLOBROW  099996
31213dad SLOBROW  099997
31215dad SLOBROW  099998
31217daf SLOBROW  099999
31219dad SLOBROW  100000

That looks very familiar. So the zfs.dbf file on ZFS itself is not fragmented (like in the initial test on ext3).

Now let’s inspect the raw block device:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout2.txt ; cat zfsout2.txt
4037ad SLOBROW  001087
4081ad SLOBROW  001088
40a1ad SLOBROW  001089
40cfaf SLOBROW  001090
410dad SLOBROW  001091
412dad SLOBROW  001092
3cbf7fad SLOBROW  084135
3cbf9faf SLOBROW  084136
3cbfbfad SLOBROW  084137
3cbfdfaf SLOBROW  084138

So you see the first block is not row 1. The last rows are not rows 99990 thru 100000.

Let’s inspect a random area in the file:

28c01dad SLOBROW  095075
28c03dad SLOBROW  095081
28c05dad SLOBROW  095080
28c07dad SLOBROW  095077
28c09dad SLOBROW  095082
28c0bdad SLOBROW  095083
28c0ddad SLOBROW  095084
28c0fdad SLOBROW  095085
28c11dad SLOBROW  095086
28c13dad SLOBROW  095076
28c15dad SLOBROW  095088
28c17dad SLOBROW  095087
28c19dad SLOBROW  095090
28c1bdad SLOBROW  095092
28c1ddad SLOBROW  095089
28c1fdad SLOBROW  095093
28c21dad SLOBROW  095094
28c23dad SLOBROW  095078
28c25dad SLOBROW  095096
28c27dad SLOBROW  095097
28c29dad SLOBROW  095098
28c2bdad SLOBROW  095095
28c2ddaf SLOBROW  095099
28c2fdad SLOBROW  095100
28c31dad SLOBROW  095102
28c33dad SLOBROW  095091
28c35dad SLOBROW  095104
28c37dad SLOBROW  095101
28c39daf SLOBROW  095106
28c3bdad SLOBROW  095103

So you see the rows are not linear allocated and we have some weird forward and backward row jumps all over the place. That’s directly after table creation and filling with data – and before throwing block updates at the table!

Let’s figure out how many sequential chunks we have. With ZFS copying blocks around this is less trivial than it seems. Consider this entry:

18e3afad SLOBROW  000879
27444bad SLOBROW  000879

How do we know which of these two reflects the real Oracle block and which one is the leftover from a previous write? Ignoring such problems for a moment (there’s only a few duplicates so far so I will work this out later) let’s try to find out how much fragmentation we have. How do you define that?

Let’s define the fragmentation as being the total disk seek distance (in bytes) if we had to read the file sequentially. So every time the disk heads have to reposition we are going to measure this as the disk offset jump on the disk.

Let’s say a seek is a forward jump larger than 4 blocks (32768) or a jump back larger than 4 blocks (ignoring data blocks not not together but very close i.e. on the same disk track).

The script I used to calculate seek distance:

typeset -i c=0
typeset -i r=0
cat $1 | sort -n -k +3 | while read offset magic row
  off=$(printf "%d" 0x$offset)
  diff=$(expr $off - $oldoff)
  if [[ $oldoff -ne 0 ]] && [[ $diff -gt 32768 || $diff -lt -32768 ]]; then
    distance=$(expr $distance + $(expr ${diff#-} / 1024 ))
    rows=$(expr $row - $oldrow)
    printf "%8s %8s %11s %11s %8s\n" $row $rows $off $diff $(expr $distance / 1024)

Now the script is not perfect – you could argue whether it reflects the real fragmentation or not and even propose better methods. But getting a bit closer to perfection the thing becomes very complex and I like to keep it simple.

Running the script against the output file:

[root@zfs ~]# ./countfragments zfsout2.txt | tee frags.txt

results in total seek distance of 94GB. Note that in the real world the physical disk seek distance is probably much less because a) we have duplicates that are not real, and b) if you see something like this:

28c2bdad SLOBROW  095095
28c2ddaf SLOBROW  095099
28c2fdad SLOBROW  095100
28c31dad SLOBROW  095102
28c33dad SLOBROW  095091

then the caching mechanism will most likely use prefetch and caching to work this out with only one or two seeks (instead of jumping up and down for each block). But I ignore that for now – we assume we have dumb disk without any cache or I/O order optimizations to make the point (otherwise the whole exercise would be much too complicated)

Now let’s kick off a slob run and see what happens.

$ ./runit.sh 1 0

Tm 3205

(note on ext3 it took 1180 seconds, but this delay could also be caused by the NFS network layer so I will not draw conclusions from this).

Some Awr.txt output again FYI:

physical write bytes                  1,910,071,296      595,624.9   1,057,040.0
physical write total IO requests            172,951           53.9          95.7
physical write total bytes            2,254,415,872      703,003.2   1,247,601.5
physical write total multi block              1,866            0.6           1.0
physical writes                             233,163           72.7         129.0

Created a new scan for SLOB blocks:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout2.txt

Checking for differences before running SLOB:

root@zfs:/root # wc -l zfsout2.txt zfsout3.txt
100056 zfsout2.txt
212729 zfsout3.txt
312785 total

There are too many duplicates now to make realistic estimations about fragmentation. So let’s update the MAGIC strings so that we can look for actual database block and exclude most of the leftovers from previous writes.


i           number := 1;
  FOR p IN ( SELECT rowid FROM cf1)
    UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
    i := i+1;

Running it:

$ sqlplus user1/user1 @updatemagic

Enter value for magic: SLOB001
old   6:         UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
new   6:         UPDATE CF1 SET MAGIC=chr(10) || 'SLOB001' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;

PL/SQL procedure successfully completed.

(FYI runtime was 22 minutes – on EXT this runs in less than 1 minute – so you get some feeling about the delays here. I can tell you the network was not highly utilized at all, in fact it was hard to find any bottleneck but it seemed to be caused by high CPU context switches per second on the BSD machine)

Here the output on the BSD machine during the MAGIC update from iostat against the ZFS device:

root@zfs:/root # iostat -x ada0 10
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0     133.5   0.2   580.3     0.9    1   1.9   8 <-- it looked like this for a while, only reads but hardly any updates. 
ada0     133.4   0.2   581.8     0.9    0   2.0   9  -- guess this is due to delayed dbwriter activity
ada0     116.8  61.1  1618.8   358.9    3  22.8  94 <-- here it finally starts writing
ada0     113.0  84.6  1346.8   397.8   10  18.7  88
ada0     108.0  75.8  1344.9   443.2    7  20.3  90
ada0     109.8  73.4  1505.4   494.5    6  19.3  86
ada0      96.7 186.3  1409.4  1170.1    6  16.9  92
ada0      41.8  14.2   390.2   137.2    8  26.6  37
ada0       0.0   0.0     0.0     0.0    8   0.0   0 <-- here ZFS does not do any IO for about 20 seconds. Bottleneck assumed
ada0       0.0   0.0     0.0     0.0    8   0.0   0  -- to be completely CPU/memory based – probably in part because of
ada0      17.4 186.7   615.3  1334.9   10 126.4 329  -- my old hardware ;-)
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0      25.5  26.9   231.8   118.6    1 589.4 322
ada0     111.4  85.5  1145.5   630.2    1  17.4  94
ada0      98.4 156.5  1269.0  1074.7    1  16.2  89

Check for the results:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout4.txt
root@zfs:/root # wc -l zfsout4.txt
82841 zfsout4.txt

(Weird. There should be at least 100,000 rows) The fix: flush Oracle’s “dirty cache” 😉

SQL> alter system checkpoint;
root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout5.txt
root@zfs:/root # wc -l zfsout5.txt
101223 zfsout5.txt

(much better, and only 1223 duplicates)

Let’s check for fragmentation in the new environment:

root@zfs:/root # ./countfragments zfsout5.txt | tee frags.txt
root@zfs:/root # tail -5 frags.txt
099996        1    95640177  -633761280  2433521
099997        1   729409649   633769472  2434125
099998        1    95918705  -633490944  2434730
099999        1   730164340   634245635  2435334
100000        1    95943281  -634221059  2435939

So the total seek distance after we performed a SLOB write run followed by an update of the magic numbers, is now 2435939 MB or about 2500 GB (up from 64 GB after initial creation). You can see the fragmentation taking its heavy toll. Let’s also find out how large the largest continuous sequence of blocks are:

root@zfs:/root # cat frags.txt | sort -n -k +2 | tail
040557       13   653829233      967168   184067
040570       13   180370545  -473556992   184519
067558       13    79769713  -408209920   510990
084989       13   205561969  -730385214   613430
031627       14    12778609  -165509632   151051
033906       14   855796337     3719168   159611
034090       15   264289393  -594196992   160196
085566       15     4227185  -605910016   614971
033971       31   858530417      166912   159618
085389       36   609938548       35331   614351

(note that the sorted second column indicates the number of adjacent blocks before jumping to another offset so a value of 15 means the script fount 15 adjacent 8K blocks)

Within our new layout of the datafile on ZFS, there are no sequential chunks on zfs larger than 36 Oracle blocks! And actually, the vast majority of continuous block areas is less than 4 blocks in size.

That means very heavy fragmentation!

For the record, let’s look at the datafile itself again (not the zfs block device):

root@zfs:/root # strings -t x /data/data/zfs.dbf  | grep "SLOB001" > zfsout6.txt
root@zfs:/root # cat zfsout6.txt | tail -5
31211c71 SLOB001 099996
31213c71 SLOB001 099997
31215c71 SLOB001 099998
31217c74 SLOB001 099999
31219c71 SLOB001 100000

That’s still 100% sequential. So within the datafile not a single block has changed offsets!

root@zfs:/root # ./countfragments zfsout6.txt
096804     1020   798006385       40957        3
097824     1020   806394993       40960        3
098844     1020   814783601       40960        3
099864     1020   823172212       40963        3

Total seek distance within the datafile: Only 3 MB! (as opposed to 2500 GB on the ZFS block device).


First a disclaimer: I used ZFS on FreeBSD 9.1. ZFS might behave different on Solaris (however I don’t think it so). ZFS is an OS feature that seems to be a very popular choice for some administrators, up to a point where I would say, an almost “religious” choice. So I would not be surprised to get some flaming comments on this post. As long as it is honest and valuable feedback I will approve them even if they prove me wrong.

Random(ish) workload against database tables on ZFS causes heavy fragmentation.

Whether this impacts performance is another discussion and I will attempt to blog on that later. ZFS (AFAIK) does not have defragmentation capabilities today, so to fix this, some of Oracle’s documentation vaguely recommends to copy complete datafiles around:

(from http://www.oracle.com/technetwork/server-storage/solaris10/documentation/wp-oraclezfsconfig-0510-ds-ac2-163550.pdf):

For a data warehouse database, keep 20% free space in the storage pool as a general rule. Periodically copying data files reorganizes the file location on disk and gives better full scan response time.

And that’s exactly my point. Now how should I do that in a mission critical environment with 99,999% required availability? That would require to take the tablespace offline, then copy the datafile(s) to a separate file system, then copying them back???

Update: Interesting article on degrading ZFS write performance http://blog.delphix.com/uday/2013/02/19/78/


ZFS and Database fragmentation
Tagged on:                                             

53 thoughts on “ZFS and Database fragmentation

  • So, whatever your DB_FILE_MULTIBLOCK_READ_COUNT parameter may fix, your ZFS filesystem will make worse in reality 🙂

    1. Hi Maciej!

      Yep it seems to be the case. Still I found some very short sequences left, like 2,3 or 4 adjacent blocks. The multiblock read count will still optimize. As long as Oracle does not read more than it really needs, multiblock I/O is good (even with ZFS).

  • nice article – the duplicates that you see are the remnants leftover by the updated rows/blocks.
    That data is there physically, but if you attempted to do a block dump you would not see the duplicates only the active block the data dictionary is aware of.

    This ghosting affect is what is mentioned in the implementation guide for Oracle’s Transparent Encryption option. You can enable an encrypt data in place – however the ghost data will remain until it is overwritten.
    Oracle recommends you create a new encrypted tablespace/table(s) and move your data into the encrypted location.
    Then delete the old tablespace including contents and files to remove any sensitive data in those ghost blocks.

    1. Hi Matthew,

      Yep that confirms what I already thought. I was wondering if there’s another way to clear out unused data within the blocks but I guess there isn’t.
      The “ghosting” effect is both an artifact of Oracle as well as ZFS (you can see that from the offset – if it’s small i.e. less than 8K then it’s Oracle ghosting. If it’s large (many MB) then it’s ZFS.
      I also get the impression that the Oracle ghosting only occurs if you write data to rows that had NULL values – not if you overwrite data (at least not for the fixed size strings).
      But I assumed as long as the ghosting occurs not too often – like in my case 1223 out of 100,000 – then it’s not too bad and my method of counting fragmentation is relatively valid.
      The problem with ZFS is that if you want to update even a single Oracle block you will find the new block at a completely new location but the old one is still “ghosted” somewhere. That’s why I had to update the MAGIC string.

  • Great article.

    Recently had a client who had issues with ZFS and looked a lot like a fragmentation issue. Adding Luns to the pool solved the issue. Oracle should provide some way to easily monitor the fragmentation and/or make it possible to defragment without having an impact on your database system.

    I still prefer ASM over ZFS.

    1. Hi Bjorn, thanks, same experiences here… adding LUNs helps if that means more physical spindles. But spinning disk is often already the bottleneck even without fragmentation. Why solve it by throwing more hardware at it when the root cause is the FS architecture…
      Totally agreed on ASM as ASM has no fragmentation (especially with larger AU sizes).

      I don’t expect Oracle to come up with a solution anytime soon because that would mean they admit the issue in the first place 🙂

      1. You forget that with more spindles this could be called distribution instead of fragmentation. Given the dbf is spread across 70 spindles and mirrors and ZFS being a Storage-FS you’ll understand why distribution on a single disk has symptoms of a fragmented flat-fs. As a Storage-FS ZFS _wants_ do distribute blocks in the first place (at least that’d be my explanation). Things get fragmentally worse if you bring ditto-blocks into the game. Yes, this distribution comes at a price, specially at a pool-space-usage >80%.

        For the performance-part:
        I’d like to see the results of a raid-10/ASM and zfs striped mirrors (SSD ZIL+L2ARC). From a RealWorld-DBA-PointOfView.

        1. Hi Robert,
          Whether it’s called distribution or fragmentation – that’s just a matter of naming. The question is what the impact is on performance.
          Distribution is fundamentally a good thing – as long as it doesn’t mess up other optimizations. That’s why Oracle ASM (with 1MB AU size default but adjustable to larger sizes) and EMC FAST-VP (depends on implementation but can have about 7,5 MB chunk sizes) are, IMHO, much better at “distribution” than ZFS – because a) the element size is much bigger, allowing for some pre-fetching in cache, and b) they don’t move data offsets with every individual write. Data once in place stays there (guess FAST-VP is a bit of an exception, but it moves data based on I/O stats in order to get *better* performance, not randomly messing stuff around causing trouble).

          That ZFS wants to distribute blocks is not a big problem if you write big files just once (office files, compiler builds, etc) but it becomes a problem if you have a database writing 8KB blocks at a time and then expects to get decent sequential performance when reading it back 🙂

          The comparison you mention – I’d like to see that as well. When I get my hands on an EMC lab with fast storage, and I can find some time I might do the comparison and blog about it indeed…


          1. Hallo,

            ZFS aggregates and asynchronizes IO, so, if a db-file is spread across multiple disks the 8K-blocks can be fetched (also prefetched) in parallel – minimizing latency. If it would have to read the data from a flat, serialized file (either single-disk or normal raid (usually stripe-size of 64k) – see also emc, hp or netapp volumes) then it couldn’t make much use of multiple disks – the only optimization would be: defragmentation. Block-Distribution _addresses_ the need of low latency IO – a requirement for this is: multiple disks.

            See also: http://docs.oracle.com/cd/E26502_01/html/E29022/chapterzfs-db1.html

            Pay attention for “Consider tuning storage array I/O queues (for systems with HDS or EMC storage arrays)”

  • Hi, great article.

    You’ll never guess why i found this in the first place 🙂 Indeed, Oracle on ZFS and very, very heavy fragmentation. The database is OLTP and doing a lot of random writes to wide range of data blocks.

    I tested, unlike you, with a small C program that
    1 – writes 1000 random 8KiB block with random content to random locations (on 8KiB boundaries) in a 10GiB file – OLTP load.
    2 – read the full file in 1MiB IO’s (at least that is what i request with the pread call), say backups, full table scans, and so on.

    the read performance drops dramatically after only a few random write IO’s (after roughly 100.000 write IO’s the 1 MiB read latency drops from 11ms to 35ms on my system), which is peanuts for a oltp db that runs for two years. the rate at which it drops depends on logbias and recordsize, but i did not find a ZFS configuration that does not suffer from the problem.

    the test also show that ZFS gives very poor random write performance when comparsed to other filesystems. EXT3 and JFS for example, are giving a steady 9ms no matter how many writes i make into the file.

    so i wholeheartedly concur that ZFS is not suitable for databases because it fragments the datafiles. did you notice that Oracle states in an Oracle-on-ZFS whitepaper that moving datafiles around may help performance? Yep, we all know our clients love downtime, and we all love to work late.


    1. Hi Jan-Marten,

      Great to hear my findings confirmed with a different testing method, appreciate your comment!

      Oracle will likely respond that things are not this worse, databases also write sequential I/O which more or less causes (accidental) defragmentation, and lots of mem/flash/etc will mitigate the problem – but I would not rely on that. I still plan sometime to analyze the performance impact on an OLTP and DWH workload on ZFS but haven’t found the time and resources to do that yet. ps. I have an earlier post on ZFS you might be interested in – although the statements are slightly outdated as Oracle has done some improvements on their appliance.

      Moving datafiles around – yes I have seen that one. Basically copying them means delete and re-write the entire file, sequentially, so that fragmentation disappears (for the time being). Downtime – in the era of big data, cloud and “the internet of things”. Sure… 🙂

      Thanks for your insight!

  • Today, many databases also COW, so do not expect them to perform better than ZFS on rotating media for the exact same reasons. If you want performance, you won’t bother with HDDs for this exact same reason.

    If you do use ZFS for a database, then you will also be happier using a separate log (slog) for the sync writes. You do this for the exact same reason that Oracle recommends putting the redo log on a separate device. If you do not do this, then your expectation of contiguous writing gets interleaved with the ZIL, leading towards more fragmentation, sooner.

    1. Hi Dan,

      That’s still on my to-do list… no ETA at this point. I am still figuring out if it is possible to do a reasonably unbiased test to compare. Will probably compare classic ASM vs ZFS but the question then is, how much cache do you provide to ZFS, what kind of DB workload, what’s the impact of Flash tiers, etc…

      You can think of workloads that are reasonably ZFS friendly and workloads that are extremely ZFS hostile. That said, I think that ASM will outperform ZFS in all cases. But only testing will prove it.

      Let’s see if I can find time and lab resources to do the follow-up sometime.

      Thanks for your interest!

      1. I hope you can, because at present, it’s, forgive the term, FUD. Yes, systems create fragmentation. Yes, systems deal with it. PostgreSQL on ZFS is not unheard of. It’s been done for year. Successfully.

        1. Dan,

          Yes it can be done, for sure. I have customers running Oracle on ZFS with high workloads in production. Don’t ask what they needed to do to get decent performance (plus, it highly depends on their workload profile as well: read/write ratio, small/large blocks, random/sequential…) Their issues triggered me to start investigating ZFS in the first place, otherwise I don’t care what file system our customers use as long as it’s supported and reliable.

          But given that your standpoint seems to be etched in stone, then let me ask: what kind of test (or results) would be needed to convince you otherwise?

        2. Solutions depend on the problem being solved. i.e. determine your workload, then design the ZFS solution accordingly.

          How did you conclude etched in stone?

          By ‘convince me otherwise’, I think you are referring to AFS vs ZFS. I’m always happy to see comparative results. The problem with comparing X and Y is both X and Y have to be used to their best abilities, which means a great learning curve for both products.

          For ZFS, I see ZIL, L2ARC, multiple vdev, etc. Without that, your critics will cry foul.

          I wish you well and I look forward to the results.

          1. Etched in stone because I basically prove, with a full disclosed method that anyone can repeat, how fragmentation occurs on ZFS. Proving something is not FUD. I think I also made it clear that I don’t know currently exactly how much this fragmentation affects performance – and agreed it would depend on a lot of things. Very specific workloads (i.e. near 100% write) could even benefit from copy-on-write file systems.

            My job is to help customers optimize their Oracle workloads on EMC systems. Because I have a number of customers who ran into trouble with ZFS I decided to put it to the test and show where the pain is – hopefully it warns my readers to be careful with ZFS when deploying Oracle on it. I have no deep knowledge of PostgreSQL or most other database systems for that matter. I know Greenplum (now part of Pivotal, but a product of the EMC family) is based on Postgres and they have used ZFS in the past – and Greenplum is not known for bad performance (on the contrary). Why didn’t they run into fragmentation issues?
            Most likely because Greenplum is used for analytics: Write LOTS of data once (sequentially) then access it read-only. No problem for ZFS. The thing goes belly-up if you do lots of small random updates (i.e. Oracle with OLTP workloads).

            If with AFS you mean Andrew File System – that’s irrelevant to what I work with as AFS is not supported for Oracle anyway. Alternatives for ZFS are: ASM (Oracle’s volume manager which would be my preferred alternative), directNFS (NFSv3 initiating from Oracle’s kernel, using a high-end NAS server – preferrably from EMC 😉 – as target), or ext3, or, on Solaris, UFS (granted, a bit limited in flexibility but performance is good) or VxFS.

            On the learning curve: interesting point. My opinion is that file systems should be simple. And I actually like the interface of ZFS, few commands, lots of control. But a file system should not depend on lots of RAM, flash, write intent log, lots of IO queues (LUNs/devs) etc to get good performance. A good file system is non-intrusive, i.e. it translates I/O requests from an application directly into disk I/O with as low overhead as possible. ZFS requires lots of RAM (I’d rather give that RAM to the database cache), a ZIL to make it perform (which, with RDBMS systems should already be handled by the redo logs, so it’s double work), a large cache to mitigate the fragmentation issues (they should not be there in the first place) etc.
            I would love to see a file system that combines volume manager capabilities but doesn’t fragment data (maybe a future version of ZFS could have a data placement scheme without copy-on-write but fixed-in-place. Could be very interesting).

            That said, if your database is 1TB and your Flash based L2ARC is 2TB then yes of course you have good performance – all read I/O is coming from fast Flash in any case. But then you might just as well buy an AFA (All Flash Array) in the first place, why bother with spinning rust 🙂

            It’s still on my list to do a performance comparison sometimes. The big problem is with ZFS you can not compare apples to apples. Whatever I would use as a setup, there will always be readers who say the ZIL was too small or the config was wrong or I was running the wrong type of workload. Can’t avoid that. People who prefer to go down the ZFS route, because their vendor told them it’s the future, will not change their mind because a stubborn Dutch guy did a small performance test and posted the results on an unofficial blog. Like with operating systems, some file system discussions tend to be more driven by “religion” than facts…

          2. ASM, I meant ASM no AFS. Sorry for my typo.

            Ahh, yes, sorry, no, I am not casting doubt upon fragmentation. That’s not the FUD.

            The FUD is whether or not this fragmentation causes a performance hit.

          3. Uh.
            You’re claiming tiered storage is no step forward?
            It does matter whether I am able to provide plenty of high demand data for a server farm from DDR3 or Flash. Thats how IOPS of 1.000.000 at random 8k blocks are achieved.

            ZFS may not be suitable locally on the DB. But it is suitable as SAN-FS.

            Running ASM on a ZFS SAN yields no performance problems. Please, don’t use ZFS the wrong way. Use 24 Disks, use good SLC ZIL, use good MLC L2ARC. Make it a 12-vdev mirror-pool. Compare to EMC with nvram. Compare prices. Compare dataintegrity (hash-tree). It’s about tradeoffs in either way.

          4. “Running ASM on a ZFS SAN yields no performance problems.”

            but the discussion is about using Oracle datafiles on a locally mounted ZFS filesystem.

            if a SAN uses ZFS internally below the block devices it presents, the issue of increased roundtrips against the SAN due to fragmentation of a locally mounted filesystem is absent. so your comment has no bearing on the discussion ‘Oracle datafiles on a locally mounted filesystem’.

          5. Dan, all,

            i have done extensive testing on ZFS, observing how streaming reads (aka oracle scattered reads aka full scans) are being affected by sustained random writes to a datafile. The results were appalling (the read bandwidth dropped to that of fully random reads) enough to advise customer to either

            a) get rid of ZFS
            b) provide ZFS with a sufficient amount of SSD’s as cache to counter the fragmentation effect.

            obviously, when the bulk of database IO is random (typical OLTP), the negative effect may not be noticeable although i will probably show up in RMAN (fuul/0) backup time – as that will do full scans on the datafiles too. In fact, the duration of the RMAN full backup was customers main problem and caused by ZFS – the read bandwidth was terrible due to years of sustained random writes to the datafiles. Copying the datafiles to a new location solved that problem , although it will slowwly build up again.

            datawarehouses typically do more full scans, so i would be very reluctant to use ZFS for that, unless provided with a big ass SSD cache – or runnin on SSD only.

            that is the downside of ZFS, or any copy on write filesystem, it fragments on random writes. There is no need to prove that, because it is a logical consequence of the technique – it is a fact.

            That being said, the days of rotational disks are numbered. On SSD’s fragmentation hardly matters at all (although it does a little in my tests, but that may improve/depend on the SSD type). Moreover, i find ZFS very easy to use – there is no other filesystem that can utilize a SSD as smart cache as easily as ZFS can – it’s just one line to create a cached pool.

            i forgot to point out one thing that is often overlooked, or mistaken for a valid argument:

            ‘you are talking about disks, but these are not real disks. they are luns, and there is no way of knowing if your data is adjacent/unfragmented or not, even if you filesystem believes it does. in fact, your lun is striped over many devices in the SAN. countering fragmentation does not solve any problem.’

            the last sentence is wrong. if the filesystem thinks blocks are adjacent, but they are really not, it will still submit a single larger IO. This still saves roundtrips to the block device (which is often remote eg a SAN), and still affects performance (significantly).

            kind regards, Jan-Marten

  • Dan,

    OK, I see… Glad we agree on the fragmentation 🙂

    Whether it causes a performance hit, that’s the big question indeed. So here’s my take on the matter:

    ZFS will never be *faster* than ASM because of fragmentation. Best case the performance impact is near equal to ASM or other (good) file systems, worst case it brings your database to its knees (not FUD, one of my retail customers lost millions in revenue due to sudden performance issues caused by fragmentation – and for the record, it wasn’t on EMC). That said, with spinning rust going to be replaced by all Flash in the future, the fragmentation thing becomes less of an issue (some problems will fade away without being solved) – it’s just a problem with mechanical rotating disk.

    Go figure – If you have a PC with an SSD the recommendation is to NOT defragment the drive anymore (it will only make the drive wear out sooner and not improve performance).

    But don’t take my words for it – check the TPC.org website and find benchmarks filed by Oracle (now owner of ZFS development). The most recent benchmarks (http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=113032601 – TPC-C) on Solaris are done using raw volumes striped with the classic Solaris volume manager (not even ASM to my surprise). Raw volumes are not even supported anymore with version 11.2. Nowhere you will find a ZFS based benchmark there.

    (The guys at Oracle know best how to squeeze the last bit of performance out of a database. Why not on ZFS? It’s the preferred FS on Solaris, right?)

    1. Bart,

      “(The guys at Oracle know best how to squeeze the last bit of performance out of a database. Why not on ZFS? It’s the preferred FS on Solaris, right?)”

      as i already stated above, Oracle states in an Oracle-on-ZFS whitepaper that “moving datafiles around may help performance”

      Oracle is clearly fully aware of the downsides of Oracle on ZFS, but it is commercially reluctant to say it out loud. Also bear in mind my remark above that although SSD’s have a constant seek time regardless of the seek location, even on SSD’s filesystem fragmentation increases roundtrips (IO’s). So there will still be a performance hit, and how big this hit is depends on the latency towards the block device.

      1. Right! Just trying to provide my customers the same insight. If for nothing else, it avoids fingers pointing to EMC systems when dealing with IO performance issues…

    2. “(The guys at Oracle know best how to squeeze the last bit of performance out of a database. Why not on ZFS? It’s the preferred FS on Solaris, right?)”

      I think the history of ZFS since Oracle took over answers that question.

  • BTW, ZFS continue outside Oracle. Look at Open-ZFS for starters. Oracle no longer owns ZFS.

  • In reply to Jan-Marten,
    (note WordPress limits the nested levels of replies…)

    Agreed, fully consistent with my experiences.

    One reaction:

    > If the filesystem thinks blocks are adjacent, but they are really not, it will still submit a single larger > IO. This still saves roundtrips to the block device (which is often remote eg a SAN), and still affects > performance (significantly).

    If the filesystem is ZFS and it’s local then if it thinks the blocks are adjacent then they are (in most cases). There is no layer below ZFS that is causing fragmentation (ignoring the occasional RAID stripe boundaries and the like).

    If the filesystem is, say, EXT3 (or ASM) but the backend storage system is using ZFS then I think your comment applies. You would indeed reduce the number of IO requests going to the storage system. But that storage box will have to chop a large IO in smaller pieces and send them to different offsets of different disks. Still causing random IO (called IOPS inflation).

    Interesting insights BTW, thanks!

    1. “If the filesystem is ZFS and it’s local then if it thinks the blocks are adjacent then they are (in most cases). There is no layer below ZFS that is causing fragmentation (ignoring the occasional RAID stripe boundaries and the like).”

      that’s not my point, or rather my point the wrong way around 🙂

      suppose a filesystem wants to read 64KiB, but it knows the 8KiB blocks are not adjacent on the local filesystem, then ZFS will issue 8 IO’s. Regardless whether these blocks are ‘really adjacent’ on the ‘real storage’ or not. If the filesystem thinks the 64KiB is adjacent, it will issue one 64KiB IO.

      It does not matter if the data on the real (remote) block device is really adjacent or not, it matters what the local filesystem (or rather, the IO scheduler) THINKS, because that is what determines the number of IO’s/roundtrips issued. Note that the IO scheduler will not merge IO’s if it thinks the blocks are not adjacent.

      “If the filesystem is, say, EXT3 (or ASM) but the backend storage system is using ZFS then I think your comment applies.”

      no, it applies always, and it is not limited to ZFS perse, although is a much bigger issue on filesystems that fragment more.

      “You would indeed reduce the number of IO requests going to the storage system. But that storage box will have to chop a large IO in smaller pieces and send them to different offsets of different disks. Still causing random IO (called IOPS inflation).”

      true, but that chopping up is very low latency as that is all in the same box, whereas the host IO has a fabric to travel. So if the problem should be parked anywhere, this is the best place 🙂

  • Nice article. Figuring out where the issue lie is the first step towards remediating them. Clearly the COW design leads to fragmentation — so the next step is to find out how to best combat this without being silly about hardware, but on the other hand not to unnecessarily lowball things either — for example, a ZIL on a small volume on an EMC array will likely be entirely cached, right?

    It would be interesting to see tests repeated with different settings for ZIL, SGA, ARC, etc. to see differences in performance. I would also like to see performance results that make it clear as to whether the memory is better spent on DB buffer cache or ARC — cache is the obvious assumption and may well be correct, but at least in some cases might not be the best choice if additional ARC offsets fragmentation. Whether or not you have flash L2ARC likely will shift that balance, as will extending SGA onto flash. What happens if you set ZFS for latency instead of throughput, as seems logical for OLTP work.

    The point at the end of the day is to maximize performance with the hardware that you (are likely to) have, right? There are some other settings (on Solaris at least) for hw prefetch and so forth that can be tuned as well. Then a decision may have to be made whether the ease of use features and checksumming of zfs are worth a performance, and possibly cost, tradeoff. The more hard data available for calculating any tradeoffs the better, so keep up the good work!

    1. Hi Kirk,

      Yep I’d like to do more testing and hopefully I will get the time to do that one of these days.
      Reading your comments – it looks like if for you, ZFS is a hard requirement and alternative file systems or storage methods are out of the question 🙂

      My take is that the ZFS architecture is no good for databases because of the typical small block updates. ZFS might work very well for classic (office) file serving, SW development and the like (haven’t tested so I couldn’t tell but I guess fragmentation is not as much an issue in those circumstances).

      So if we agree that ZFS fragmentation leads to IOPS inflation, why would you let that happen and then try to solve it with workarounds and complex configurations like separate ZILs, messing around with SSD caches and the like? It’s not going to fix the fundamental issues in the design (best case it masks them up to a point)…

      Anyway, my view on your suggestions then 🙂
      – The ZIL does not change fragmentation. The ZIL is only there to avoid data loss when the system crashes. So whatever you do with the ZIL could only potentially speed up sustained write throughput. Doesn’t fix the IOPS inflation problem. But you’re correct in that if you keep it small and put it on EMC (separate LUN) then you have a good chance that the whole ZIL stays in storage cache without much backend destaging. Remark that COW and automated storage tiering don’t work well together (or, actually, not at all). So forget EMC FAST-VP with ZFS.
      – Memory for SGA or ARC – I bet SGA does a much better job in improving DB performance than the ARC (because SGA is managed by Oracle DB processes that know what stuff is in there – ARC is just a simple read-ahead/LRU cache). Also larger memory for ZFS again does not change the fragmentation issue (it might mask it a bit)
      – Extending SGA onto flash: Oracle already tries to do that with their DB smart flash cache. I think EMC can do better, FYI watch this video: https://community.emc.com/videos/6740 (from 17’30 onwards if you don’t have time to watch the entire vid – although I highly recommend you do).
      – bias latency vs throughput: do you have a database that only does small block IO or only large block IO? Even Oracle claims there’s “no such thing as pure OLTP” ?? besides it only seems to work if you have redo logs on a separate ZPool which creates consistency issues…

      Bottom line, all these attempts to optimize are lipstick on a pig. Change the pig, not the lipstick 🙂

      Thanks for your comments!

      1. Again: fragmentation is no issue but a pro-spreading-argument if 48 disks kann serve 8k-blocks at once.
        Nobody reads 8k-blocks sequential THAT much.

        We serve 8k-blocks at 48*190 IOPS without ARC/L2ARC. I want “fragmentation”. I do not want to always read from the same 190 IOPS disk.

        Ora-DB knows how to manage data and its cache. ZFS knows how to manage many disks and blocks.

        1. Fragmentation is an issue by definition. Full scans are common in datawarehouse and taking backups. Fragmentation is not an involuntary alternative for designed striping.

        2. I agree with Jan-Marten, I’ll explain why.

          Say you have a bunch of disks in the Zpool (let’s make it easy numbers: 10 disks).
          Database files are created using (best practice) 8K blocks. Let’s say there’s moderate fragmentation (i.e. half of the 8K blocks is adjacent to another one in sequence, the rest is scattered across the pool).

          We compare this to a plain file system (ext3 such as in my post) on a plain set of similar disks (10)

          Now I want to read a sequential stream of data, 1 Megabyte, from disk. OS is using 64K I/O size (each I/O is 8 blocks. multi-block IO) as an example.

          ext3: position disk head to offset of first block of 64K to read. Read entire track (few megabytes of data) in cache. serve the rest of the 15 x 64K chunks from cache. Done. First IO: latency probably 7-8ms. Rest of the I/O: latency below 1ms.
          Total seeks: 1 (maybe 2 or 3 if crossing some ext3 extent boundaries)

          zfs: position disk head to offset of first block of 64K to read. Read 8K. Position disk head of another disk to read the next 8K of the 64K chunk. Read 8K. Find that we have 3 adjacent 8K blocks so the next 24K comes from cache. Position 3rd disk head to read the next 8K for the 1st 64K chunk. etc etc. until you have grasped together 128 pieces of 8K of which maybe half is adjacent, so we get around 64 seeks across the disk pool (avg 3.2 per disk).
          Now of course if you throw enough spindles at it then you will not hit the max iops per disk. But remind that all of these 64 seeks take 7-8ms to respond which equals to some 500ms total wait time. Where the ext3 example with (let’s assume worst case) needs 3 seeks of 7ms (21) and 15x 0.5 ms (another 8ms) totals roughly 30ms.

          Again (this example):
          ZFS: 32 seeks across 10 disks, total wait time 500ms
          EXT: 3 seeks across 10 disks, total wait time 30ms

          EXT: can handle sequential IO requests until you max out on the *bandwidth* of the drives (let’s say you striped so 10 drives doing 50MB/s each = 500MB/s)
          ZFS: can handle seq. IO requests until it maxes out on *IOPS*. 150 IOPS per drive = 1500 (random) IOPS. Let’s say you need to do a seek for each one out of 2 8K blocks. That means you max out on 3000 x 8K = 24 MB/s.

          Of course all kind of ‘lipstick on a pig’ optimizations such as ARC tuning, huge caches etc. ignored.

          1. If you want speed, you don’t use HDDs and seeks are no cost. Game over. SSDs won. Get over it.

          2. Err:

            Zfs fetches the blocklist for $object and reads the desired chunks/length in parallel into mem and gives it sequential from mem to the caller.

            Sequential scenarios: rman backup, datawarehouse.
            Full table scans are not common in oltp and shouldn’t occur.

            Zfs uses prefetching for sequential data, i.e. it reads the next 2,3,4,n blocks of the same disk while being there. _This_ prefetch-data is kept in a simple lru cache.

            Arc is NOT a simple lru cache, btw. Please see

            I wonder why you focus on sequential behavior. Reading 1tb files was done at 1,2 GB (no other load, no compression) – limited by the PCI-bus. Currently under business load with lots of syncwrites (ora-dbs, vmware farm, nfs homes) I have 300mb/s for read-tests per process (note: it doesn’t help to launch a single test, you need some in parallel on different zvols to see the full speed).

            And Jan-Marten: for traditional raids and FS: fragmentation is an issue – they have to seek sequential because they cannot access an object like zfs in parallel.
            Instead of fighting fragmentation, ZFS found a way to make use of it to a certain degree.
            Fragmentation becomes an issue also for ZFS when it goes over 80% used space. And then really heavy. (Write problems).

  • @Richard (can’t reply to you directly, nesting too deep?)

    “If you want speed, you don’t use HDDs and seeks are no cost. Game over. SSDs won. Get over it.”

    1) it’s true seeks are at no cost on ssd’s.
    2) it’s not true fragmentation is at no cost on ssd’s.

    suppose you can go to your bakery and get a loaf of bread really fast. i mean really really fast.

    do you think there will be a difference in the required time between these two scenario’s:

    a) you speed to the baker 6 times to get 1 loaf of bread each roundtrip.
    b) you speed to the baker 1 time to get 6 loafs of bread in one roundtrip.

    your shortest route to the bakery might be something like this:
    process, kernel driver. hba, fabric, SVC, san adapter, san ssd, san adapter, SVC, fabric, kernel driver, process.

    for all these components it’s more efficient to do N things in one go than to do one thing N times. system max IO size is typically 1 MiB, database block size for OLTP workloads is typically 8KiB. That’s N=128, two orders of magnitude. N=128 evens causes a notable difference when accessing memory pages directly from your host’s memory, which is in turn much faster than a SSD.

    fragmentation is an inefficiency, and that’s reason enough to avoid it if it can be, and it can be avoided. serving fragmented filesystems, your hardware, however fast it is, will require a capacity upgrade earlier under equally rising loads – and be slower too.

    as the topic is the feasibility of ZFS for such workloads, the answer to that, is that it is not.

    one can put square wheels under one’s car and counter criticism on the design by showing the stupendous jet engine under the hood. that leaves the fact that the car would be faster with round wheels, and the question why the wheels are square unanswered.

    perhaps some should get over that – that the debate is already settled.

    1. You miss the important pieces:

      There are 6 bakeries and you send 6 workers to get the bread.
      This is faster than waiting for one worker who waits for 6 breads at the bakery.

      Your understanding of disks and filesystems is from the 80s. If the object is fragmented on one disk – then yes, you are right.

      If it is fragmented over multiple disks, this is called ‘striping’. Just not linear as with classic raid.

      1. “You miss the important pieces:
        There are 6 bakeries and you send 6 workers to get the bread.
        This is faster than waiting for one worker who waits for 6 breads at the bakery.”

        if you lack the intellectual capacity to extrapolate the proof, i’ll show you you still lose.

        it is faster to send 6 workers simultaneously to 6 backeries to get 6 loafs of bread each go than to send 6 workers to 6 backeries 6 times to get 1 loaf of bread each roundtrip.

        please reread the part on the IO size and the factor 128.

        “If the object is fragmented on one disk – then yes, you are right.”

        you fail to see that the problem on N disks is the same as on one.

        “If it is fragmented over multiple disks, this is called ‘striping’.”

        perhaps you call that striping, but it isn’t.

        “Just not linear as with classic raid.”

        your so called ‘stripe size’ is the database blocksize (at best), which is typically far from ideal.. and it may not be striped at all, because in your so called ‘striping’ scenario the data would only become fragmented if copied on write at least once.

        “oh my god, my IO performance is poor.”
        “you forgot to fragment your filesystem”.
        “how do i do that?”
        ” well, ask your end users to cause random writes”

        “Your understanding of disks and filesystems is from the 80s.”

        i’ll be polite about your understanding of the subject.

        1. Sigh.

          “it is faster to send 6 workers simultaneously to 6 backeries to get 6 loafs of bread each go.”
          Ok, you increased the IO-size. For ASM it is 1MB. If you don’t follow the db-on-zfs-wp you set a recordsize of 128k for the data-volume.
          In order to read 1MB with the latency of one IO you need at least 8 disks, since you only read 128k from each disk, you’re done faster than if you’d have read from a classic raid. ASM on pure disks is equal. ASM on hardware-raid is slower.

          Buuuut! If you need 10k+IOPS – you go either with pure ssd for all asm-disks – which is expensive, or with a multi-tiered (ram, ssd, hdd) FS like ZFS where you need just 2 or 4 ssd.

          Or with emc which “lipsticks” IO with battery-backed “nvram” and read-caches (which are empty after storage boot and not cluster-aware).

          ” than to send 6 workers to 6 backeries 6 times to get 1 loaf of bread each roundtrip.”

          This is not a ZFS-scenario.

          1. “Ok, you increased the IO-size. ”

            No, ZFS decreased the IO size by fragmentation, which is the topic here.

            “This is not a ZFS-scenario.”

            yes it is, and the proof is presented by Bart above, i have seen proof in the real world with devastating effect,, was able to reproduce the performance collapse in my lab (witn suprisingly little random writes), and it is a consequence of ZFS design.

            scan read ZFS database data file with 4 years of random write history: around 14KiB IO size on average.
            move file to defragment: around 200KiB IO size on average. still not very good, but no longer deadly.

            are you in denial or in sales?

        2. “Ok, you increased the IO-size. ”

          No, ZFS decreased the IO size by fragmentation, which is the topic here.

          “This is not a ZFS-scenario.”

          “yes it is, and the proof is presented by Bart above, ”

          Bart didn’t use a “real world” or “proper” ZFS layout. Flawed tests to prove nothing but the wrong. The values are useless for evalution. They just state the obvious ZFS distributes blocks and does not need to care for fragmentation when used with the correct hardware-setup.
          It’s like you want to use one cylinder of a porsche motor because you heard porsches are good and want to prove the opposite (while proving they need fuel and stating this is bad).

          scan read ZFS database data file with 4 years of random write history: around 14KiB IO size on average.
          move file to defragment: around 200KiB IO size on average. still not very good, but no longer deadly.

          We backup large Databases coming from ZFS LUNSs daily at speed of 300mb/s and more. The whitepaper suggests 8KB recordsize for the data-volume. This isn’t really good. Also: maybe you didn’t align the devices correctly.

          “are you in denial or in sales?”

          No, I install infrastructures and learned to value the flexibility that ZFS luns provide. IO is nowhere an issue, instead we taggle IO issues with ZFS.


          “Copying datafiles around”:
          A) this is for those, who do not use many disks for ZFS (like bart)
          B) you have to do a database reorg from time to time with downtime as well.

          Again: if you use less than 12 or so spindles: use ASM.
          If you have a datacenter with 4+ Databases you want a SAN – with ZFS.
          And you want deep knowledge of ZFS as you want deep knowledge of Oracle DBs.

          What is ZFS for: A multitude of datacenter disks presented to all types of workloads.

          1. “Bart didn’t use a “real world” or “proper” ZFS layout. Flawed tests to prove nothing but the wrong. The values are useless for evalution. They just state the obvious ZFS distributes blocks and does not need to care for fragmentation when used with the correct hardware-setup.”

            the test is not flawed, you simply don’t like the result. it agrees with extensive testing i did. it agrees with real-world scenarios i have seen, it agrees with ZFS design. it does not agree with your opinion however, which means you and i are not debating ZFS, but your refusal to admit your beliefs are not as great as you would wish.

            you compensate with additional hardware to counter the inefficiencies of ZFS for (OLTP) database file workloads, which is admitting the inefficiency whilst using it to ‘prove’ it doesn’t.

            i do not care you refuse to admit you are wrong. the end.

          2. Agreed I didn’t use real-world layout. Whether this (one disk) is proper or not can be argued but I wanted to keep my test as simple as possible, so that my readers could understand what’s going on. I’d like to hear if fragmentation (or the effects of) would magically disappear when using more than 12 disks, as I fail to understand 🙂

            Yes 12 disks can do 12x the random IOPS of one disk – so if a given database needs 3 times the IOPS of a single disk then, yes, your problem would mostly disappear when using 12. But scale up workload and the problem will reappear again. Same for non-ZFS scenarios BTW. My point is that given a mix of random and sequential read and write, non-COW filesystems scale much higher (unless you mask the issues by throwing lots of SSD and RAM cache at it – but then one should really consider an AFA – All Flash Array).

            On proving the obvious: Maybe for experienced ZFS users it’s obvious. To many of my customers it was not. Some had serious performance issues and asked me for advice. When bringing up the COW allocation and how that impacts database block distribution, some were completely unaware of it. And you get the obvious FUD from certain vendors/people who claim:

            a) All filesystems have fragmentation
            b) It’s by design and it is supposed to happen
            c) It’s not that bad and really not an issue (like the Oracle guy responding to Jan-Marten’s question in your link – of course an Oracle chap will not say ZFS has issues. And indeed if you throw enough HW at the issue, it might disappear).
            d) Other layers in the stack have fragmentation issues too (so the fact that we have it is not relevant)

            Typical vendor response: “aww, your workload is REALLY random! Never seen something like that! You need more spindles/SSD cache/storage CPUs/memory/whatever!” – I can alread see the “Dirty Cache [sic]” signs blinking in their eyes. When the whole issue could have be avoided by choosing the right technology – and yes, EMC sales aren’t saints either 😉

            Actually, indeed all FS have fragmentation – but on many it only appears when creating files and it’s not really a big issue because of the very large fragment size. Just wanted to show how bad it gets on ZFS. That’s also why I compared with ext3.

            On the Porsche motor: apart from vibrations that would come from not being balanced and no even cylinder firing sequence, a single cilinder should have the same fuel efficiency for a given power output (per liter of cylinder volume). Agreed, nitpicking 🙂 but there is a truth here – I think if a file system has flaws using one disk only then it certainly has with many more disks.

            Re the 8KB recommended value from the white paper… If customers cannot trust best practices given from their vendors then what should they do? All become Ph.D experts on evil ZFS tuning? The original promise of ZFS was fire-and-forget.

            http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_is_Evil :
            “Tuning is often evil and should rarely be done.”. ‘Nuff said.

            Alignment: fully agreed and this is an issue not just on ZFS but on many more configurations. Wrote 2 posts on that (focused on Linux in general).

            >“Copying datafiles around”:
            >A) this is for those, who do not use many disks for ZFS (like bart)
            >B) you have to do a database reorg from time to time with downtime as well.

            Excuse me but A) and B) seem to contradict each other.
            A) suggests it is not an issue if you use enough disks.
            B) suggests it *IS* an issue but similar issues exist elsewhere? (don’t point somewhere else! Our whole economy is in trouble because of ppl ignoring their responsibilities)

            Note that many of my customers do database reorgs online. If you could do defrag online (such as with Netapp WAFL – the architecture that ZFS was based upon) then you could at least resolve some of the issues without downtime (up to a point).

            Few more notes here:

            – EMC usually is file-system agnostic. I myself didn’t care which FS or LVM customers were using until I got confronted by some customers with real and present problems. That said (full disclosure): Oracle started to sell ZFS appliances and they compete with EMC directly – So I have an extra incentive pointing out the differences. So I am not unbiased (anymore) and my readers should be aware, and hopefully they use all available information (not just mine) to make their own well-educated decisions.

            – Fragmentation, striping and pseudo-striping are all different things with different effects. Also for some ppl the difference between sector size, block size and IO size seems to be problematic. I think I shall write another post on this as it seems many are confusing those.

            – It is true that random seeks are no longer an issue with solid state storage (in whatever form). When solid state becomes equally priced as, say, SATA disk (and I think it will not take longer than a few more years) then many existing storage architectures (including some of our own EMC ones) will disappear or be (partially) re-architected from the ground up, based on the behaviour of this technology, rather than trying to make old architectures work that were designed with spinning rust in mind (a good example of what’s to come is the architecture of EMC XtremIO – which has fragmentation by design for that matter – if you want to call it as such – but behaves in a completely different way 🙂

            With the extinction of spinning disk we will also see ZFS (and other file systems) become a thing of the past. But for now we still need spinning disks because of economics and scale. So for the time being we should select our I/O stack wisely.

    2. As for why ZFS is popular with databases, it allows you to create snapshots and clones. This is simply not feasible on a system that tries to optimize for zero seeks.

      For ext, which has been a fine file system, you can see that it doesn’t scale. As you get into larger files, you get multiple levels of indirection. This limits its scalability and application. If we presume btrfs as the heir-apparent for Linux systems, then get used to having nice features like snapshots and clones and forget about worrying about HDD seeks.

      Finally, by you logic, all databases should use raw because then you will have the best possible data layout for the application. But that isn’t what wins in the market. Contemplating the best fit algorithm for HDDs is a like deciding on whether to use cowhide or buffalo hide for your buggy whips. Get over it. Move on.

      1. “As for why ZFS is popular with databases, it allows you to create snapshots and clones.”

        ZFS is not popular with databases, but i agree that is an advantage. But you do not need ZFS to do that, most storage can do that too.

        “This is simply not feasible on a system that tries to optimize for zero seeks.”

        ext3 (for example) does not try to optimize for zero seeks reads for staticially size datafiles. it just leaves data in place on write, so it does not have a problem to address.

        i agree that ext has problems. but that\s not the issue here.

        “Finally, by you logic, all databases should use raw because then you will have the best possible data layout for the application.”

        i would agree to saying that that is ideal, yes.

        “But that isn’t what wins in the market.”

        i don’t care what wins in the market. i like efficiency better.

        “Contemplating the best fit algorithm for HDDs is a like deciding on whether to use cowhide or buffalo hide for your buggy whips.”

        i fact, i earn a living correcting the consequences of this state of mind.

        “Get over it. Move on.”

        ZFS fragemnts and gives poor scan read performance over time.

        this fact follows from design and is seen in the real world with devastating effect.

        your messiah is great but not perfect, get over it.

        this will be my last comment btw.

  • Simplified, if Oracle issues an 8KiB IO and ZFS has an 8KiB record size, ZFS will issue an 8KiB IO. If Oracle issues a 16KiB IO and ZFS has an 8KiB block size, ZFS will issue two 8Kib IO. And so on.

    So if a system is suffering from this kind of IO blowout then it’s obviously not issuing 8KiB IO. Oracle’s recommendation should not have been followed and the system’s just not tuned. Presuming it’s being hit hard with a 1MiB MBRC, leaving it at the default 128KiB record size would result in a 16 fold decrease in IOPS.

    If you’re looking to just use ZFS as a drop in replacement for a simple filesystem only, that’s just a bad idea IMO. If you’re trying to eek out the last performance, you maybe don’t want to be on a filesystem in the first place.

    But if you have use for other features: compression, snapshots, clones, ZVOLs, and in conjunction with other technologies dNFS, COMSTAR/iSCSI/FC (ie, offloaded to storage hosts), the fun starts.

    1. Richard,

      Yes but the problem goes beyond a 16K IO becoming 2 8KB IOs. On classic file systems / storage subsystems etc. the chance is very high that the 2 8K IOs are adjacent and the disk only needs to do one seek, where with ZFS the chance is high you need 2 (and so on). Of course when the world slowly moves to AFA (All-Flash Arrays) this will no longer be an issue.
      Note that in my test I used 8K oracle block size and 8K ZFS block size (according to Oracle’s best practice).

      Granted ZFS has interesting additional features which are helpful if you can’t (or don’t want to) afford an enterprise storage array that has these built-in (typically with lower overhead and other advantages). My only point is that ZFS is just not the best choice for database files (and Oracle knows it – hence you will not find Oracle TPC benchmarks using ZFS, even with the most recent SPARC/Oracle benchmarks). ZFS might be a great choice however for other purposes depending on the requirements.

Leave a Reply