ZFS and Database fragmentation

[Image: disk fragmentation – O&O Technologies. Hope they don’t mind the free advertising.]

Yet another customer was asking me for advice on implementing the ZFS file system on EMC storage systems. Recently I did some hands-on testing with ZFS as an Oracle database file store so that I could form an opinion on the matter.

One of the discussions that frequently comes up is the fragmentation issue. ZFS uses a copy-on-write allocation mechanism, which basically means that every time you write to a block on disk (whether this is a newly allocated block or, very importantly, an overwrite of a previously allocated one), ZFS will buffer the data and write it out to a completely new location on disk. In other words, it will never overwrite data in place. A lot of discussions can be found in the blogosphere and on forums debating whether this is really the case, how serious it is, what the impact is on performance, and what ZFS does to either prevent or mitigate the issue (i.e. by using caching, smart disk allocation algorithms, etc.).

In this post I attempt to show how quickly database files on ZFS file systems get fragmented on disk. I will not make any comments on how this affects performance (I’ll save that for a future post). I also deliberately ignore ZFS caching and other optimizing features – the only thing I want to show right now is how much fragmentation is caused on the physical disk by using ZFS for Oracle data files. Note that this is a deep technical and lengthy article, so you might want to skip all the details and jump right to the conclusion at the bottom 🙂

My test environment is legacy hardware, so I am not even attempting to break performance records or come close. The performance I get is not impressive at all. I’m using single- and dual-core servers with limited memory and single SATA disks with a limited data size, which in no way represent modern datacenter equipment. But for showing the effects of fragmentation you don’t need fast hardware.

Now a bit of background on what I expected from my testing. First of all, database data files behave very differently from, say, office files (i.e. PowerPoint, PDF, Word docs) or other “file serving” style environments (i.e. content management, software development and many other things).

An office file typically gets written as one whole. On ZFS, I expect this to cause more or less sequential, lightly fragmented allocation even for larger files (5 to 10 MB). If you load the file in your editor and then save it, most office software (Microsoft Office, LibreOffice, …) will overwrite the whole file at once. Much the same goes for other regular file serving purposes. On ZFS, due to the copy-on-write mechanism, the new file version gets written completely to a new disk location, but the new file is still largely unfragmented.

A database file behaves differently. Such files typically get created only once (when creating or extending tablespaces) and only grow if the database needs more space. Otherwise the file itself does not change (although the contents within the file change heavily – as we will see).

Especially if the workload is OLTP-like with many small random updates, every update will modify a single database block – or a few of them at a time. What’s the best tool to create random I/O on an Oracle database? Kevin Closson’s SLOB!

But I made a few modifications to SLOB: first of all to make it run on my very modest hardware (128 sessions on a single CPU is not a good idea), then to tweak the data size (82MB for a single user is likely to stick in cache and I don’t want that), and finally because I needed a method to track database blocks on the physical disk.

So let’s go over the test details.

Equipment

ZFS Server:

  • FreeBSD 9.1
  • Dual processor (32-bit)
  • Memory: 6 GB (of which 3.5 GB usable without PAE)
  • Disks: 3x 73GB 10,000 rpm SCSI via Adaptec SCSI controller (not used in this test)
  • 1x SATA 400 GB via SATA PCI controller
  • Gigabit Ethernet (with jumbo frames, but set to MTU 7200 to match the DB server)

DB server:

  • CentOS 5.9 64-bit single CPU
  • Memory: 2GB
  • Disks: 1x SATA 400GB
  • Gigabit Ethernet (MTU size max 7200 due to low-cost consumer hardware 😉 )
  • Database software: Oracle 11.2.0.3.0 with Direct NFS enabled

On Linux, I created an LVM logical volume of 1GB in size and zeroed it out:

# lvcreate -Ay -L1GB -nslob data
# dd if=/dev/zero of=/dev/data/slob bs=65536

Then created a standard EXT3 file system on it:

 
# mkfs.ext3 /dev/data/slob
# mount /u02/slob/
# df | grep slob
/dev/mapper/data-slob
                       1032088     34092    945568   4% /u02/slob
# chown oracle:dba /u02/slob

Then I created a tablespace for SLOB:

CREATE BIGFILE TABLESPACE "EXT" DATAFILE '/u02/slob/slob.dbf' SIZE 2M AUTOEXTEND ON NEXT 2M MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO ;
# ls -alh /u02/slob/slob.dbf ; df -h /u02/slob
-rw-r----- 1 oracle dba 2.1M Feb 25 11:55 /u02/slob/slob.dbf
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/data-slob
1008M   36M  973M   4% /u02/slob

Setup SLOB table with only one user in EXT tablespace:

$ ./setup.sh EXT 1

However, I modified SLOB a bit to be able to find row offsets in the tablespace later (via the MAGIC string column). I also increased the default number of rows (12,000) to 100,000 so that it creates an 830MB tablespace (the default would only create about 93MB per user) – forcing 83% space allocation on the FS.

SQL> desc cf1
Name                                      Null?    Type
----------------------------------------- -------- ----------------------------
MAGIC                                              CHAR(32)
CUSTID                                             NUMBER(8)
C2                                                 VARCHAR2(128)
C3                                                 VARCHAR2(128)
.
.
C19                                                VARCHAR2(128)
C20                                                VARCHAR2(128)

The code to enter MAGIC values in CF1.MAGIC looks like this and is run once after loading the table:

DECLARE
  i number := 1;
BEGIN
  FOR p IN ( SELECT rowid FROM cf1 )
  LOOP
    UPDATE CF1 SET MAGIC = chr(10) || 'SLOBROW ' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
    i := i + 1;
  END LOOP;
END;
/

This runs sequentially through all rows of the table (starting from the first row), setting a searchable text string containing the row number in ASCII. The chr(10) calls are newlines so that the output looks better later on (when using the “strings” command).

The table content looks like this:

> select magic,custid from cf1 where rownum < 10
MAGIC                                CUSTID
-------------------------------- ----------
6051
SLOBROW  000001
6305
SLOBROW  000002
…
SLOBROW  000008
2735
SLOBROW  000009

9 rows selected.

Note that the newlines are caused by the chr(10), so this is on purpose.

We check the fragmentation within the datafile:

# strings -t x /u02/slob/slob.dbf |grep "SLOBROW" >outfile1.txt

The outfile1.txt looks like this (hex block offset, followed by the MAGIC string):

 
109daf SLOBROW  000001
10bdaf SLOBROW  000002
10ddaf SLOBROW  000003
10fdb1 SLOBROW  000004
111daf SLOBROW  000005
113dad SLOBROW  000006
115daf SLOBROW  000007
117daf SLOBROW  000008
119daf SLOBROW  000009
11bdad SLOBROW  000010
11ddad SLOBROW  000011
.
.
.
31213dad SLOBROW  099997
31215dad SLOBROW  099998
31217dad SLOBROW  099999
31219dad SLOBROW  100000

So within the datafile there is no fragmentation.

Now we look at the 1GB block device that holds the /u02/slob filesystem:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile2.txt

The outfile2.txt looks like:

790adad SLOBROW  000001
790cdad SLOBROW  000002
790edad SLOBROW  000003
7910dad SLOBROW  000004
7912dad SLOBROW  000005
7914daf SLOBROW  000006
7916dad SLOBROW  000007
7918dad SLOBROW  000008
791adad SLOBROW  000009
791cdad SLOBROW  000010
791edad SLOBROW  000011
.
.
.
39a0ddaf SLOBROW  099995
39a0fdaf SLOBROW  099996
39a11dad SLOBROW  099997
39a13dad SLOBROW  099998
39a15dad SLOBROW  099999
39a17dad SLOBROW  100000

Again, no fragmentation within the block device.

Now I ran SLOB with one user (writer) against the table.

One SLOB writer (lite) will update 1000 x 256 = 256,000 rows, each touching an 8K block, so roughly 2 GB worth of block writes – more than twice the size of the table contents (830MB). After a few runs we can assume most blocks have been updated a few times.
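A rough sanity check on that assumption (assuming the 1000 update batches of 256 rows each land independently and uniformly across the table): the chance that a given row is never touched during one run is about (1 - 256/100000)^1000 ≈ 8%, so a single run already rewrites roughly 92% of the blocks, and a few runs cover practically everything.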

Note that I updated the line in SLOB’s writer.sql.lite so that it touches all table rows:

v_r := dbms_random.value(257, 10000) ;

to

v_r := dbms_random.value(257, 100000) ;

to generate writes against all table rows (830MB).

Note that my SGA is 500MB, buffer cache only about 184 MB, which forces most of the block writes to physical I/O.
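For reference, one way to double-check those cache sizes from SQL*Plus would be something like the following sketch (v$sga_dynamic_components is a standard view; output not reproduced here):

-- SGA component sizes in MB; "DEFAULT buffer cache" is the buffer cache
SELECT component, current_size/1024/1024 AS size_mb
FROM   v$sga_dynamic_components
WHERE  current_size > 0;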

$ ./runit.sh 1 0

Tm 1538

(1538 seconds to read and write 2GB worth of random 8K blocks – about 300 iops – I only use a single slow SATA disk so this is not even too bad 😉 )

From awr.txt FYI (total, per second, per transaction):

physical read bytes                   1,692,778,496    1,099,279.6   1,241,950.5
physical read total IO requests             207,392          134.7         152.2
physical read total bytes             1,722,757,120    1,118,747.6   1,263,945.1
physical write bytes                  1,860,108,288    1,207,942.5   1,364,716.3
physical write total IO requests            210,036          136.4         154.1
physical write total bytes            2,107,344,896    1,368,496.4   1,546,107.8
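As a quick sanity check on the “about 300 iops” figure: dividing the read plus write request totals above by the elapsed time gives roughly 270 IOPS, e.g.:

$ echo $(( (207392 + 210036) / 1538 ))
271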

Let’s check fragmentation on the block device again now that we have updated many blocks:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile3.txt

The file looks like this:

790adad SLOBROW  000001
790cc71 SLOBROW  000002
790cdad SLOBROW  000002
790ec71 SLOBROW  000003
790edad SLOBROW  000003
7910c71 SLOBROW  000004
7910dad SLOBROW  000004
7912dad SLOBROW  000005
7914c74 SLOBROW  000006
7914daf SLOBROW  000006
7916c71 SLOBROW  000007
7916dad SLOBROW  000007
7918c71 SLOBROW  000008
7918dad SLOBROW  000008
791ac71 SLOBROW  000009
791adad SLOBROW  000009
791cc71 SLOBROW  000010
791cdad SLOBROW  000010
.
.
.
39a0dc74 SLOBROW  099995
39a0ddaf SLOBROW  099995
39a0fc74 SLOBROW  099996
39a0fdaf SLOBROW  099996
39a11c71 SLOBROW  099997
39a11dad SLOBROW  099997
39a13dad SLOBROW  099998
39a15c71 SLOBROW  099999
39a15dad SLOBROW  099999
39a17c71 SLOBROW  100000
39a17dad SLOBROW  100000

So why the double entries? I guess Oracle modifies the block contents after updating a row, in such a way that offsets within the blocks shift slightly. The double entries are just leftovers of the original data, but within Oracle this is now unallocated content.
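A quick way to count those leftovers is to look for row numbers that appear at more than one offset (a one-liner sketch against the strings output):

# awk '{print $3}' outfile3.txt | sort | uniq -d | wc -l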

Notice that:
a) no fragmentation occurs, neither in the datafile nor on the block device;
b) the offsets of every row are mostly the same.

Let’s take a look at rows 10, 5000 and 10000 as an example:

# cat outfile2.txt |grep 000010
791cdad SLOBROW  000010
# cat outfile3.txt |grep 000010
791cc71 SLOBROW  000010
791cdad SLOBROW  000010

# cat outfile2.txt |grep 005000
a307dad SLOBROW  005000
# cat outfile3.txt |grep 005000
a307c71 SLOBROW  005000
a307dad SLOBROW  005000

# cat outfile2.txt |grep 010000
ca8fdad SLOBROW  010000
# cat outfile3.txt |grep 010000
ca8fc71 SLOBROW  010000
ca8fdad SLOBROW  010000

Note that the offset of the MAGIC string only changed a little bit: 316 bytes, well within a single 8K Oracle block. Again this shows that Oracle modified the contents of the blocks, but not the block offsets within the datafile.
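You can verify that in-block shift with a quick hex subtraction, for example for row 10:

# printf '%d\n' $((0x791cdad - 0x791cc71))
316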

Let’s do another run…

$ ./runit.sh 1 0

Tm 1180

(a bit faster, probably due to the ext3 file system cache being warmed up)

Check if anything changed on the layout since the last run:

# strings -t x /dev/data/slob |grep "SLOBROW" >outfile4.txt
# diff outfile3.txt outfile4.txt |wc -l
1512
# diff outfile3.txt outfile4.txt |grep SLOB|wc -l
756
# diff outfile3.txt outfile4.txt |grep SLOB|grep -v ">"
<no output>

So we have a few changes (additions to the new file – reflecting 756 changes within Oracle blocks!)

# diff outfile3.txt outfile4.txt | head -6
376a377
> 7a94c71 SLOBROW  000196
513a515
> 7b28c71 SLOBROW  000268
601a604
> 7b86c71 SLOBROW  000315

Again Oracle moved some data within the blocks. But no single block moved positions within the datafile, and no blocks within the datafile moved offsets on disk.

We can conclude that on EXT3 there is no fragmentation other than that caused by the initial file creation. We can update blocks as many times as we like, but the block offsets never change again. This also makes me confident that filling an ext3 filesystem up to 99% with large static datafiles will have minimal impact on fragmentation.

Let’s now do the same thing on ZFS.

On my BSD ZFS server:

root@zfs:/root # gpart show
.
.
=>        63  781422705  ada0  MBR  (372G)
          63        126        - free -  (63k)
         189   16776963     1  freebsd  [active]  (8G)
    16777152   16777215     2  freebsd  (8G)
    33554367    2097144     3  freebsd  (1G)
    35651511  745771257        - free -  (355G)

The disk slice I will put the zpool on is /dev/ada0s3, a 1GB partition (like the EXT3 logical volume). The other slices and disks will be used for performance testing later.

ada0 is also a 400GB SATA disk, like the one in the Linux DB server (not that this matters for this test). Let’s zero out the device that we want to use for the zpool (so that we can use “strings” on it later):

root@zfs:/root # dd if=/dev/zero of=/dev/ada0s3 bs=65536
dd: /dev/ada0s3: short write on character device
dd: /dev/ada0s3: end of device
16384+0 records in
16383+1 records out
1073737728 bytes transferred in 15.815460 secs (67891652 bytes/sec)

Now we create the zpool on that slice and a ZFS filesystem on it:

root@zfs:/root # zpool create data ada0s3
root@zfs:/root # zfs create -o recordsize=8K data/data
root@zfs:/root # zfs set sharenfs=on data/data
root@zfs:/root # zfs set logbias=throughput data/data
root@zfs:/root # zfs set sync=disabled data/data
root@zfs:/root # zfs set atime=off data/data
root@zfs:/root # chown oracle:dba /data/data/

Note that I used 8K recordsize because this is the Oracle recommendation for datafiles. It also prevents excess reads when updating a single 8K block.

“Logbias=throughput” is a tricky one, as it might cause lost transactions or even corruption when suffering a system crash. But I am not interested in that right now, and I need this setting to get halfway decent performance. Same for “sync=disabled”. This is normally not recommended, but on my not-so-powerful system it boosts performance big time and for this test it does not matter.
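If you want to double-check that all these properties took effect, something like this should do (I did not capture the output during my run):

root@zfs:/root # zfs get recordsize,logbias,sync,atime data/data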

root@zfs:/root # zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data  1016M   140K  1016M     0%  1.00x  ONLINE  -
root@zfs:/root # zpool status
pool: data
state: ONLINE
scan: none requested

config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          ada0s3    ONLINE       0     0     0

errors: No known data errors
root@zfs:/root # zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
data        140K   984M    32K  /data
data/data    31K   984M    31K  /data/data

Going back to the Linux db host:

# mount -a
# df | grep zfs
zfs:/ufs               8106752         0   7458176   0% /zfs/ufs
zfs:/data/data         1007552        64   1007488   1% /zfs/data

(the UFS entry is there because I plan to test UFS against ZFS performance as well)
I have prepared Direct NFS and this is what the oranfstab looks like:

# cat /etc/oranfstab
server: zfs
path: 192.168.0.9
export: /data/data mount: /zfs/data
export: /ufs mount: /zfs/ufs

Now we have an empty ZFS pool/FS. Let’s create a tablespace on it (after bouncing the database to re-enable Direct NFS):

CREATE BIGFILE TABLESPACE "ZFS" DATAFILE '/zfs/data/zfs.dbf' SIZE 2M REUSE AUTOEXTEND ON NEXT 2M MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO ;
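With the instance bounced and the datafile on the dNFS mount, one way to confirm that Direct NFS is actually being used is to query the dNFS views – a sketch, since these views only return rows when dNFS is serving the files:

SQL> SELECT svrname, dirname FROM v$dnfs_servers;
SQL> SELECT filename FROM v$dnfs_files;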

Now let’s drop the existing SLOB tablespace (on EXT3) and users first:

$ sqlplus / as sysdba @drop_users

And create the SLOB table:

$ ./setup.sh ZFS 1
Setting up user 1
Waiting for background processes - Mon Feb 25 13:44:55 CET 2013
Table created.
PL/SQL procedure successfully completed.

So now we have a datafile identical to the one in the previous test, but on ZFS (via dNFS) instead of the local EXT3 file system.

root@zfs:/root # df -h /data/data/ ; ls -alh /data/data/zfs.dbf
Filesystem    Size    Used   Avail Capacity  Mounted on
data/data     983M    841M    142M    86%    /data/data
-rw-r-----  1 oracle  dba   834M Feb 25 13:48 /data/data/zfs.dbf

Before even running the first SLOB test, let’s look at the MAGIC offsets again (first in the datafile, then on the raw disk):

root@zfs:/root # strings -t x /data/data/zfs.dbf | grep "SLOBROW" > zfsout1.txt ; cat zfsout1.txt
109daf SLOBROW  000001
10bdad SLOBROW  000002
10ddaf SLOBROW  000003
10fdad SLOBROW  000004
111dad SLOBROW  000005
.
.
.
3120fdad SLOBROW  099995
31211dad SLOBROW  099996
31213dad SLOBROW  099997
31215dad SLOBROW  099998
31217daf SLOBROW  099999
31219dad SLOBROW  100000

That looks very familiar. So the zfs.dbf file on ZFS itself is not fragmented (like in the initial test on ext3).

Now let’s inspect the raw block device:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout2.txt ; cat zfsout2.txt
4037ad SLOBROW  001087
4081ad SLOBROW  001088
40a1ad SLOBROW  001089
40cfaf SLOBROW  001090
410dad SLOBROW  001091
412dad SLOBROW  001092
.
.
.
3cbf7fad SLOBROW  084135
3cbf9faf SLOBROW  084136
3cbfbfad SLOBROW  084137
3cbfdfaf SLOBROW  084138
#

So you can see that the first entry on disk is not row 1, and the last entries are not rows 99990 through 100000.

Let’s inspect a random area in the file:

.
.
.
28c01dad SLOBROW  095075
28c03dad SLOBROW  095081
28c05dad SLOBROW  095080
28c07dad SLOBROW  095077
28c09dad SLOBROW  095082
28c0bdad SLOBROW  095083
28c0ddad SLOBROW  095084
28c0fdad SLOBROW  095085
28c11dad SLOBROW  095086
28c13dad SLOBROW  095076
28c15dad SLOBROW  095088
28c17dad SLOBROW  095087
28c19dad SLOBROW  095090
28c1bdad SLOBROW  095092
28c1ddad SLOBROW  095089
28c1fdad SLOBROW  095093
28c21dad SLOBROW  095094
28c23dad SLOBROW  095078
28c25dad SLOBROW  095096
28c27dad SLOBROW  095097
28c29dad SLOBROW  095098
28c2bdad SLOBROW  095095
28c2ddaf SLOBROW  095099
28c2fdad SLOBROW  095100
28c31dad SLOBROW  095102
28c33dad SLOBROW  095091
28c35dad SLOBROW  095104
28c37dad SLOBROW  095101
28c39daf SLOBROW  095106
28c3bdad SLOBROW  095103
.
.
.

So you can see that the rows are not allocated linearly, and we have weird forward and backward jumps all over the place. And that’s directly after creating the table and loading the data – before throwing any block updates at it!

Let’s figure out how many sequential chunks we have. With ZFS copying blocks around this is less trivial than it seems. Consider this entry:

18e3afad SLOBROW  000879
27444bad SLOBROW  000879

How do we know which of these two reflects the real Oracle block and which one is the leftover from a previous write? Ignoring such problems for a moment (there are only a few duplicates so far, so I will work this out later), let’s try to find out how much fragmentation we have. How do you define that?

Let’s define the fragmentation as being the total disk seek distance (in bytes) if we had to read the file sequentially. So every time the disk heads have to reposition we are going to measure this as the disk offset jump on the disk.

Let’s say a seek is a forward or backward jump larger than 4 blocks (32768 bytes), ignoring data blocks that are not adjacent but still very close (i.e. on the same disk track).

The script I used to calculate seek distance:

#!/usr/local/bin/bash
# Walk the "strings" output sorted by row number and add up every seek
# (an offset jump of more than 4 blocks, forward or backward).
typeset -i c=0
typeset -i r=0
oldoff=0
oldrow=0
distance=0
cat $1 | sort -n -k +3 | while read offset magic row
do
  off=$(printf "%d" 0x$offset)    # hex offset to decimal
  diff=$(expr $off - $oldoff)     # distance from the previous row's offset
  if [[ $oldoff -ne 0 ]] && [[ $diff -gt 32768 || $diff -lt -32768 ]]; then
    distance=$(expr $distance + $(expr ${diff#-} / 1024 ))   # cumulative seek distance (KB)
    rows=$(expr $row - $oldrow)   # adjacent rows since the previous seek
    oldrow=$row
    # columns: row, adjacent rows, offset, jump (bytes), cumulative distance (MB)
    printf "%8s %8s %11s %11s %8s\n" $row $rows $off $diff $(expr $distance / 1024)
  fi
  oldoff=$off
done

Now the script is not perfect – you could argue whether it reflects real fragmentation or not, and you could even propose better methods. But when getting closer to perfection the thing becomes very complex, and I like to keep it simple.

Running the script against the output file:

[root@zfs ~]# ./countfragments zfsout2.txt | tee frags.txt

results in a total seek distance of 94 GB. Note that in the real world the physical disk seek distance is probably much less, because a) we have duplicates that are not real blocks, and b) if you see something like this:

28c2bdad SLOBROW  095095
28c2ddaf SLOBROW  095099
28c2fdad SLOBROW  095100
28c31dad SLOBROW  095102
28c33dad SLOBROW  095091

then the caching mechanism will most likely use prefetching and caching to handle this with only one or two seeks (instead of jumping up and down for each block). But I ignore that for now – to make the point we assume a dumb disk without any cache or I/O reordering optimizations (otherwise the whole exercise would become much too complicated).

Now let’s kick off a SLOB run and see what happens.

$ ./runit.sh 1 0

Tm 3205

(note that on ext3 it took 1180 seconds, but this difference could also be caused by the NFS network layer, so I will not draw conclusions from it).

Some Awr.txt output again FYI:

physical write bytes                  1,910,071,296      595,624.9   1,057,040.0
physical write total IO requests            172,951           53.9          95.7
physical write total bytes            2,254,415,872      703,003.2   1,247,601.5
physical write total multi block              1,866            0.6           1.0
physical writes                             233,163           72.7         129.0

Let’s create a new scan for SLOB blocks:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOBROW" > zfsout3.txt

Checking for differences against the scan taken before the SLOB run:

root@zfs:/root # wc -l zfsout2.txt zfsout3.txt
100056 zfsout2.txt
212729 zfsout3.txt
312785 total

There are too many duplicates now to make realistic estimates about fragmentation. So let’s update the MAGIC strings so that we can look for the actual database blocks and exclude most of the leftovers from previous writes.

updatemagic.sql:

DECLARE
i           number := 1;
BEGIN
  FOR p IN ( SELECT rowid FROM cf1)
  LOOP
    UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
    i := i+1;
  END LOOP;
END;
/

Running it:

$ sqlplus user1/user1 @updatemagic

Enter value for magic: SLOB001
old   6:         UPDATE CF1 SET MAGIC=chr(10) || '&magic' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;
new   6:         UPDATE CF1 SET MAGIC=chr(10) || 'SLOB001' || to_char(i,'000000') || chr(10) WHERE p.rowid = CF1.rowid;

PL/SQL procedure successfully completed.

(FYI the runtime was 22 minutes – on EXT3 this runs in less than 1 minute – so you get some feeling for the delays here. I can tell you the network was not highly utilized at all; in fact it was hard to find any bottleneck, but it seemed to be caused by a high rate of CPU context switches per second on the BSD machine.)
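I did not capture it at the time, but watching the context switch column (cs) in vmstat on the FreeBSD box would be the way to verify that suspicion:

root@zfs:/root # vmstat 5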

Here is the output on the BSD machine during the MAGIC update, from iostat against the ZFS device:

root@zfs:/root # iostat -x ada0 10
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
.
.
.
ada0     133.5   0.2   580.3     0.9    1   1.9   8 <-- it looked like this for a while, only reads but hardly any updates. 
ada0     133.4   0.2   581.8     0.9    0   2.0   9  -- guess this is due to delayed dbwriter activity
ada0     116.8  61.1  1618.8   358.9    3  22.8  94 <-- here it finally starts writing
ada0     113.0  84.6  1346.8   397.8   10  18.7  88
ada0     108.0  75.8  1344.9   443.2    7  20.3  90
ada0     109.8  73.4  1505.4   494.5    6  19.3  86
ada0      96.7 186.3  1409.4  1170.1    6  16.9  92
ada0      41.8  14.2   390.2   137.2    8  26.6  37
ada0       0.0   0.0     0.0     0.0    8   0.0   0 <-- here ZFS does not do any IO for about 20 seconds. Bottleneck assumed
ada0       0.0   0.0     0.0     0.0    8   0.0   0  -- to be completely CPU/memory based – probably in part because of
ada0      17.4 186.7   615.3  1334.9   10 126.4 329  -- my old hardware ;-)
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0       0.0   0.0     0.0     0.0   10   0.0   0
ada0      25.5  26.9   231.8   118.6    1 589.4 322
ada0     111.4  85.5  1145.5   630.2    1  17.4  94
ada0      98.4 156.5  1269.0  1074.7    1  16.2  89
.
.

Check for the results:

root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout4.txt
root@zfs:/root # wc -l zfsout4.txt
82841 zfsout4.txt

(Weird – there should be at least 100,000 rows.) The fix: flush Oracle’s “dirty cache” 😉

SQL> alter system checkpoint;
.
.
root@zfs:/root # strings -t x /dev/ada0s3 | grep "SLOB001" > zfsout5.txt
root@zfs:/root # wc -l zfsout5.txt
101223 zfsout5.txt

(much better, and only 1223 duplicates)

Let’s check for fragmentation in the new environment:

root@zfs:/root # ./countfragments zfsout5.txt | tee frags.txt
root@zfs:/root # tail -5 frags.txt
099996        1    95640177  -633761280  2433521
099997        1   729409649   633769472  2434125
099998        1    95918705  -633490944  2434730
099999        1   730164340   634245635  2435334
100000        1    95943281  -634221059  2435939

So the total seek distance, after we performed a SLOB write run followed by an update of the magic numbers, is now 2435939 MB or about 2500 GB (up from 94 GB after the initial creation). You can see fragmentation taking its heavy toll. Let’s also find out how large the largest continuous sequence of blocks is:

root@zfs:/root # cat frags.txt | sort -n -k +2 | tail
040557       13   653829233      967168   184067
040570       13   180370545  -473556992   184519
067558       13    79769713  -408209920   510990
084989       13   205561969  -730385214   613430
031627       14    12778609  -165509632   151051
033906       14   855796337     3719168   159611
034090       15   264289393  -594196992   160196
085566       15     4227185  -605910016   614971
033971       31   858530417      166912   159618
085389       36   609938548       35331   614351

(note that the sorted second column indicates the number of adjacent blocks before jumping to another offset, so a value of 15 means the script found 15 adjacent 8K blocks)

Within the new layout of the datafile on ZFS, there are no sequential chunks larger than 36 Oracle blocks! And actually, the vast majority of contiguous block areas are less than 4 blocks in size.
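To back that up, a one-liner sketch against frags.txt gives the distribution of chunk sizes (the second column):

root@zfs:/root # awk '{print $2}' frags.txt | sort -n | uniq -c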

That means very heavy fragmentation!

For the record, let’s look at the datafile itself again (not the zfs block device):

root@zfs:/root # strings -t x /data/data/zfs.dbf  | grep "SLOB001" > zfsout6.txt
root@zfs:/root # cat zfsout6.txt | tail -5
31211c71 SLOB001 099996
31213c71 SLOB001 099997
31215c71 SLOB001 099998
31217c74 SLOB001 099999
31219c71 SLOB001 100000

That’s still 100% sequential. So within the datafile not a single block has changed offsets!

root@zfs:/root # ./countfragments zfsout6.txt
.
.
.
096804     1020   798006385       40957        3
097824     1020   806394993       40960        3
098844     1020   814783601       40960        3
099864     1020   823172212       40963        3

Total seek distance within the datafile: Only 3 MB! (as opposed to 2500 GB on the ZFS block device).

Conclusion

First a disclaimer: I used ZFS on FreeBSD 9.1. ZFS might behave differently on Solaris (although I don’t think so). ZFS is an OS feature that seems to be a very popular choice for some administrators, up to the point where I would call it an almost “religious” choice. So I would not be surprised to get some flaming comments on this post. As long as it is honest and valuable feedback I will approve it, even if it proves me wrong.

Random(ish) workload against database tables on ZFS causes heavy fragmentation.

Whether this impacts performance is another discussion, and I will attempt to blog on that later. ZFS (AFAIK) does not have defragmentation capabilities today, so to fix this, some of Oracle’s documentation vaguely recommends copying complete datafiles around:

(from http://www.oracle.com/technetwork/server-storage/solaris10/documentation/wp-oraclezfsconfig-0510-ds-ac2-163550.pdf):

For a data warehouse database, keep 20% free space in the storage pool as a general rule. Periodically copying data files reorganizes the file location on disk and gives better full scan response time.

And that’s exactly my point. How should I do that in a mission-critical environment with 99.999% required availability? That would require taking the tablespace offline, copying the datafile(s) to a separate file system, and then copying them back???
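Just to illustrate what that would look like in practice, a rough sketch (the scratch location is hypothetical, and the tablespace is unavailable for the duration of both copies):

SQL> ALTER TABLESPACE zfs OFFLINE NORMAL;
root@zfs:/root # cp /data/data/zfs.dbf /data/scratch/zfs.dbf
root@zfs:/root # cp /data/scratch/zfs.dbf /data/data/zfs.dbf
SQL> ALTER TABLESPACE zfs ONLINE;

For a bigfile tablespace of any serious size that is a long outage, and the random updates will start fragmenting the freshly copied file all over again.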

Update: Interesting article on degrading ZFS write performance http://blog.delphix.com/uday/2013/02/19/78/
