Background
In my previous post on ZFS I showed how ZFS causes fragmentation for Oracle database files. At the end I promised (sort of) to also come back on the topic of how this affects database performance. In the meantime I have been busy with many other things, but ZFS issues still sneak up on me frequently. Eventually, I was forced to take another look at this because two separate customers asked for ZFS comparisons against ASM at the same time.
The account team for one of the two customers asked if I could perform some testing on their lab environment to show the performance difference between Oracle on ASM and on ZFS. As things go in this business, the train was already rolling before I could influence the prerequisites and the suggested test method. Promises had already been made to the customer and I was asked to produce results yesterday.
All this without knowledge of the lab environment, the customer requirements, or even the details of the test environment they had set up. Typical day at the office.
In addition to that, ZFS requires a supported host OS – so Linux is out of the question (the status of kernel ZFS for Linux is still a bit unclear, and it would certainly not be supported with Oracle). I had been using FreeBSD in my post on fragmentation – because that was my platform of choice at that point (my Solaris skills are, at best, rusty). Of course Oracle on FreeBSD is a no-go, so back then I used NFS to run the database on Linux against ZFS on BSD. Which implicitly solves some of the potential issues whilst creating some new ones, but alas.
Solaris x86
This time the idea was to run Oracle on Solaris (x86) with both ZFS and ASM configured. How to perform a reasonable comparison that also shows the different behavior was unclear, and when I put that question to the account team, the conference call line stayed surprisingly silent. All they indicated up front was that the test tool on Oracle should be SLOB.
My first reaction was to ask whether they were aware that SLOB is designed to drive random I/O and is therefore, by nature, not well positioned to show the performance effects of fragmentation – which would require sequential I/O. More silence. Sigh. To make matters worse, the storage platform on which the Solaris VM was configured (using VMware of course) was XtremIO. XtremIO is very different from every other EMC storage platform (as well as every other competitive platform, for that matter) in that it uses hashing of data blocks to determine data placement within the flash cells. So – a bit like ZFS itself – it has “fragmentation by design”, which makes the platform completely insensitive to random vs. sequential I/O. In the XtremIO backend everything is random anyway, whatever you do – which allows the platform to scale up and out and avoid any kind of data hotspots. So given a Proof of Concept where the test tool generates random I/O, on a storage platform that converts all I/O to random again, how do you show the impact of fragmentation?
But the customer promise was made so I started a voyage to get Solaris moving with Oracle, ASM, ZFS and SLOB, and think of a reasonable way to test the two scenarios.
Configuring the environment
After getting access to the system, the first thing I needed to do was install Oracle and Clusterware / ASM. Which was a challenge, because the virtual machine was installed with pretty much default settings, a 100% full root file system, a lack of paging space, and some required software packages missing. But I will not go into details on what was needed to get the system going and stable (with a few exceptions that I will touch upon later).
The system had a bunch of XtremIO volumes configured, of which the most important are the ones I used for I/O testing. There were 10 volumes of 120GB each. Five were already configured in a single ZFS pool. I checked disk alignment and some other settings and it turned out to be workable. I configured the other five volumes into an ASM (DATA) disk group, plus a few smaller ones for REDO.
On the ZFS pool I created a ZFS file system according to Oracle best practices (recordsize=8K, logbias, etc.).
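To give an idea, creating such a file system would look roughly like the following – a sketch only: the pool/file system name (data_pool/data) is inferred from the datafile path used later, atime=off is a common extra recommendation rather than something stated here, and the full property list actually used isn't disclosed:

zfs create -o recordsize=8k -o logbias=throughput -o atime=off data_pool/data
zfs get recordsize,logbias,atime data_pool/data    # verify the settings took effect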
Testing method
Over the weekend I got some time to think on how to run such a test and I came up with the following test scenario:
- Create SLOB tablespace (“IOPS”) on ASM
- Create the SLOB tables, carefully sized so that the data would fill up ZFS to exactly 80% (the limit, according to best practices, before you get serious performance issues)
- Run SLOB tests with different read/write ratios (on ASM)
- Do some sort of sequential IO on SLOB tables – so this must be some kind of (full) table scan scenario
- Bring IOPS tablespace offline and copy it to ZFS (using RMAN) – this way you get an initially unfragmented tablespace (hopefully)
- Change datafile location in Oracle to use ZFS datafile
- Run read-only SLOB tests
- Run table scans on SLOB data
- Run SLOB updates sized such that every SLOB block is updated at least once
- Re-run ZFS tests and note any difference
Full table scans
For running full table scans I wrote a PL/SQL script that basically does the following (a minimal sketch follows the list):
- AWR snap
- Select random SLOB users (either all of them or limit the number of users)
- For each user, select count(*) from <user>.cf1 (full table scan)
- Record overall start/end time and per-user start/end time
- Calculate total data size and use time & data size to calculate scan bandwidth
- Report per-user and total statistics after completion (most notably, scan rate)
- AWR snap
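The actual script is not published here, but a minimal PL/SQL sketch of the idea could look like the following. The CF1 table name comes from SLOB; the FULL hint, the user limit, the use of dba_segments for sizing, and the exact output format are my assumptions:

set serveroutput on
DECLARE
  l_limit    CONSTANT PLS_INTEGER := 10;  -- number of SLOB schemas to scan
  l_total_mb NUMBER := 0;
  l_mb       NUMBER;
  l_cnt      NUMBER;
  l_t0       NUMBER;
  l_start    NUMBER;
BEGIN
  DBMS_WORKLOAD_REPOSITORY.create_snapshot;           -- AWR snap before
  l_start := DBMS_UTILITY.get_time;
  FOR r IN (SELECT owner
              FROM (SELECT owner FROM dba_tables
                     WHERE table_name = 'CF1'
                     ORDER BY DBMS_RANDOM.value)      -- pick SLOB users in random order
             WHERE ROWNUM <= l_limit)
  LOOP
    SELECT ROUND(bytes / 1024 / 1024) INTO l_mb
      FROM dba_segments
     WHERE owner = r.owner AND segment_name = 'CF1';
    l_t0 := DBMS_UTILITY.get_time;
    EXECUTE IMMEDIATE
      'SELECT /*+ FULL(t) */ COUNT(*) FROM ' || r.owner || '.cf1 t' INTO l_cnt;
    DBMS_OUTPUT.put_line(r.owner || ' ' || l_mb || ' MB, Time: ' ||
                         ROUND((DBMS_UTILITY.get_time - l_t0) / 100, 2));
    l_total_mb := l_total_mb + l_mb;
  END LOOP;
  DBMS_OUTPUT.put_line('---Summary--- Users: ' || l_limit ||
                       ' Scanned: ' || l_total_mb || ' MB Scan rate: ' ||
                       ROUND(l_total_mb / ((DBMS_UTILITY.get_time - l_start) / 100), 2) || ' MB/s');
  DBMS_WORKLOAD_REPOSITORY.create_snapshot;           -- AWR snap after
END;
/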
Note that this procedure is single-threaded. It is not intended to drive maximum bandwidth; it is intended to provide a predictable comparison of scan rates between tests. You could run a bunch of these in parallel to drive more bandwidth, and I have been toying with the idea of building that into the SQL code. Maybe another time 😛
Expectations
A few notes on what I expected as a result. Obviously, XtremIO does not care about fragmentation – so you would initially expect similar results on ASM and on ZFS, for 100% read IOPS as well as for full table scans. But another ZFS issue is IOPS inflation (which is a side effect of fragmentation). Consider a full table scan requesting 128K read I/Os (because Oracle's db_file_multiblock_read_count is set to 16 with an 8K block size). Because ZFS has to get blocks from all over the place (after fragmentation), a 128K I/O might be chopped into a bunch of 8K, 16K and maybe a few larger pieces. So it will result in up to 16 I/Os, maybe a bit fewer (if some blocks are still adjacent on disk – as seen by the OS at least). I would expect 128K I/Os to translate into roughly 10-12 smaller I/Os, but we will see.
So the full table scan bandwidth, according to my expectations, would drop a bit, not because of fragmentation (and thus excess disk seeks) directly (we’re on 100% flash) but because of the extra host IO overhead due to many smaller IOs instead of a few large ones.
I also expected the random IOPS to be more or less equal.
System configuration
For the techies who are interested in the details, a disclosure of the configuration can be found here. But a few highlights:
- XtremIO volumes: 5x 120GB (ZFS), 5x 120GB (ASM)
- SLOB tablespace: 450GB (80% of ZFS free space)
- SLOB users: 64, SCALE=900000
- Oracle: SGA 3G, multiblock read count 16, 8KB block size
- Redo logs on a separate ASM disk group (also during ZFS testing, although usually you would move redo to ZFS as well)
- ZFS cache: 100MB (to avoid the Solaris system hang that happened to me initially, and to prevent ZFS from serving read IOPS from OS memory)
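For reference, the ARC cap referred to above is typically set in /etc/system; something along these lines (a sketch – I'm assuming the standard zfs_arc_max tunable, and the value below is simply 100 MB expressed in bytes):

* cap the ZFS ARC at 100 MB (value in bytes); a reboot is needed for this to take effect
set zfs:zfs_arc_max = 104857600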
Disclaimer:
- It seems near impossible to do an apples-to-apples performance comparison. So for the ZFS worshippers out there: don't complain to me that the test is wrong; instead, do the test yourself the way you think it should be done and publish the results (with disclosure of the configuration, of course). There, got that one off my chest 😉
- The test environment I was using was not tuned for the best possible performance. The results are therefore relative and do not reflect the maximum of what you can expect from an EMC XtremIO array.
Actually I’m running other tests in a better tuned lab on which I might blog as well very soon 🙂
Performing the test
I will skip the details on SLOB runtime parameters here, but every SLOB run was done with 64 users (the entire dataset) and 5 minutes runtime, with the exception of the update runs that drive fragmentation. Initially I ran the full table scans against all users, but later found that randomly picking 5 or 10 gives equally consistent results. All database results come from AWR reports and from the scan rate as calculated by my script; in addition I comment on the system-level output (iostat).
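For orientation – and purely as an illustration, since the actual slob.conf isn't published here – the relevant SLOB 2.x settings for these runs would look something like this:

UPDATE_PCT=0    # read-only runs (50 for the mixed run, 100 for the fragmentation-driving updates)
RUN_TIME=300    # 5-minute runs
SCALE=900000    # per-schema working set as listed above (roughly 7GB per user)

followed by something like ./runit.sh 64 to drive all 64 schemas.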
Random read IO on ASM
This was right after creating SLOB on ASM; nothing else happened to the data in between. But you get consistent results here even after messing with the data (I tried moving ZFS-based tablespaces back to ASM and the results after that move are within 1%).
Physical reads per second: 44,391
Physical writes per second: 2
Bandwidth: 346 MB/s (roughly)
Note that a single XtremIO X-brick can handle much more than this if you configure more host volumes, correct I/O load balancing, etc. I will blog on these details later.
Typical iostat looks like this:
device      r/s    w/s     kr/s    kw/s  wait  actv  svc_t  %w  %b
sd12        0.0    0.0      0.0     0.0   0.0   0.0    0.0   0   0
sd13     9958.9    0.3  79670.8     1.3   0.0  11.7    1.2   4 100
sd14     9978.9    0.0  79836.2     0.0   0.0  11.7    1.2   4 100
sd15     9996.9    0.0  79974.8     0.0   0.0  11.8    1.2   4 100
sd16     9931.9    0.0  79457.5     0.0   0.0  11.7    1.2   4 100
sd17     9830.9    0.7  78649.5    10.7   0.0  11.6    1.2   4 100
sd18        0.0    0.3      0.0     1.3   0.0   0.0    0.6   0   0
Note that this was taken when the IOPS were a bit above average. By dividing read bandwidth by reads per second you can estimate the I/O size: 79670/9958 ≈ 8K. Database I/O gets translated 1:1 into disk I/O. Service time is a bit over 1ms (again, XtremIO can get service times much lower at a much higher IOPS rate – due to better I/O load balancing and overall better adherence to best practices and optimizations – you should expect well below 0.5 ms with this workload).
Full table scan on ASM
SYS:xtremdb > @slob-fulltablescan
Limit number of full table scans (default all):10
USER28 7031 MB, Time: 29.64
USER5 7031 MB, Time: 30.01
USER30 7031 MB, Time: 29.96
USER40 7031 MB, Time: 30.04
USER1 7031 MB, Time: 30.05
USER29 7031 MB, Time: 30.11
USER34 7031 MB, Time: 30.04
USER20 7031 MB, Time: 30.06
USER7 7031 MB, Time: 30.12
USER31 7031 MB, Time: 30.2
---Summary---
Users: 10 Scanned: 70313 MB Scan rate: 234.16 MB/s Runtime: 300.27
PL/SQL procedure successfully completed.
You can see here that the scan rate is roughly 234 MB/s. The AWR report agrees:
Physical read bytes/s = 245MB/s. Physical reads/s = 1,872. Divide and you get 245,000 KB / 1,872 ≈ 131 KB per read (close to 128K).
iostat output:
    r/s    w/s     kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0    0.0      0.0     0.0   0.0   0.0     0.0     0.0   0   0  c2t11d0
  575.9    9.0  73657.5    94.0   0.0   1.6     0.0     2.7   0  22  c3t0d0
  609.4   10.5  77745.1   196.0   0.0   1.7     0.0     2.7   0  23  c3t1d0
  640.9    6.5  81832.8    92.0   0.0   1.8     0.0     2.8   0  25  c3t2d0
  549.0   12.5  70145.8   212.0   0.0   1.5     0.0     2.7   0  21  c3t3d0
  590.4    5.0  75401.3   152.0   0.0   1.7     0.0     2.8   0  23  c3t4d0
See if it agrees on IO size: 73657/576=127.87. Close enough.
SLOB on ASM with 50% updates
For the record, another run with 50% update percentage. Not that it matters much.
Physical reads per second: 32,600
Physical writes per second: 15,687
Bandwidth: 267 MB/s read, write 134 MB/s.
Nearly all of it is 8K I/O (with a few multiblock writes). Note that for Oracle to write a block, it has to be read first, so for 50/50 SLOB reads/writes you get double the reads (the usual ones for the selects plus the reads needed for the updates). Hence the 2:1 ratio at the OS level.
Moving to ZFS
(after taking the tablespace offline – note that the copy had been done once before, so the write to ZFS was actually an overwrite of the existing iops.dbf file)
RMAN> copy datafile '+DATA/xtremdb/datafile/iops.267.854808281' to '/data_pool/data/iops.dbf';
.
.
input datafile file number=00005 name=+DATA/xtremdb/datafile/iops.267.854808281
output file name=/data_pool/data/iops.dbf tag=TAG20140806T100939 RECID=3 STAMP=854886495
channel ORA_DISK_1: datafile copy complete, elapsed time: 02:18:46
Typical IOstat:
extended device statistics
    r/s    w/s     kr/s     kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  516.5  751.1   2810.2  17611.3   0.0   2.3     0.0     1.8   1  35  c2t0d0
  528.5  851.1   3122.2  17617.3   0.0   2.4     0.0     1.7   1  36  c2t1d0
  422.0  743.1   2026.2  17503.3   0.0   2.2     0.0     1.9   0  34  c2t2d0
  442.5  854.6   2238.2  17485.3   0.0   2.3     0.0     1.8   1  34  c2t3d0
  505.0  444.0   2846.2  17737.3   0.0   2.0     0.0     2.1   0  33  c2t4d0
   48.0    0.5  12288.9      2.0   0.0   0.1     0.0     2.4   0  11  c3t0d0
   48.0    0.0  12288.9      0.0   0.0   0.1     0.0     2.6   0  13  c3t1d0
   48.0    0.0  12288.9      0.0   0.0   0.1     0.0     2.5   0  12  c3t2d0
   48.0    0.0  12288.9      0.0   0.0   0.1     0.0     2.3   0  11  c3t3d0
   32.0    0.0   8192.6      0.0   0.0   0.1     0.0     2.5   0   8  c3t4d0
Note: I left out all irrelevant lines from iostat.
Interesting: during the copy from ASM to ZFS, the upper 5 lines are the ZFS disks, the lower 5 are ASM. You can see that RMAN is reading roughly 256K I/Os from ASM. More interesting is that the ZFS writes also involve reads (now why would you have to read lots of data from a file system just to overwrite a file? Think about it). Another finding here is that ZFS writes much more data than ASM reads from disk. And I'm using recordsize=8K with aligned disks, so that should not be an issue. Probably an artifact of ZFS trying to bundle I/Os together a little too aggressively? Intent logging (shouldn't be, because logbias=throughput)? You tell me.
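For completeness: pointing the database at the ZFS copy is then a matter of something like the following (a sketch only – datafile number 5 comes from the RMAN output above, and whether the RECOVER step has anything to apply depends on how the tablespace was taken offline):

RMAN> switch datafile 5 to copy;
RMAN> recover datafile 5;
RMAN> sql 'alter tablespace IOPS online';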
Random read IO on ZFS
Physical reads per second: 22,231
Physical writes per second: 2
Bandwidth: 182 MB/s (roughly).
iostat:
extended device statistics
device      r/s   w/s     kr/s   kw/s  wait  actv  svc_t  %w  %b
sd2      9170.4  11.6  73404.2  241.0   0.0   8.8    1.0   4 100
sd3      9174.4  12.6  73422.8  227.7   0.0   8.8    1.0   4 100
sd4      9123.5   3.7  73000.0  141.1   0.0   8.8    1.0   4 100
sd5      9181.4   3.3  73547.2   49.3   0.0   8.7    1.0   4 100
sd6      9229.0  10.3  73919.9  137.1   0.0   8.8    1.0   4 100

Notes:
This iostat snapshot is from a higher-than-average moment. Average read size (on disk) = 8K. Service times are actually a bit lower than in the same test on ASM. I expected similar read IOPS but only get about half the ASM rate here. ZFS kernel overhead? Not sure.
Full table scan on ZFS
SYS:xtremdb > @slob-fulltablescan
Limit number of full table scans (default all):10
USER58 7031 MB, Time: 52.58
USER64 7031 MB, Time: 55.17
USER61 7031 MB, Time: 54.14
USER0 7031 MB, Time: 52.36
USER29 7031 MB, Time: 51.62
USER10 7031 MB, Time: 52.28
USER32 7031 MB, Time: 51.63
USER57 7031 MB, Time: 53.69
USER14 7031 MB, Time: 52.7
USER6 7031 MB, Time: 55.89
---Summary---
Users: 10 Scanned: 70313 MB Scan rate: 132.15 MB/s Runtime: 532.08
PL/SQL procedure successfully completed.
You can see here that the scan rate is roughly 132 MB/s. The AWR report agrees again:
Physical read bytes/s = 138MB/s. Physical reads/s = 1,075. Divide and you get 138,000 KB / 1,075 ≈ 128 KB per read.
Next I ran SLOB for a total of 3 hours with an update percentage of 100. Based on the write rate, I calculated that after 3 hours most database blocks would statistically have been overwritten at least once (with N random single-block updates spread over B blocks, the expected untouched fraction is roughly (1 - 1/B)^N ≈ e^(-N/B), so a few times B updates is enough).
Random read IO on ZFS after updates
Physical reads per second: 22,111
Physical writes per second: 4
Bandwidth: 181 MB/s (roughly).
The iostat output also looks very similar compared to before the updates. This is what I expected as this is all random 8K I/O which should not be influenced by fragmentation or I/O inflation.
Full table scan on ZFS after updates
SYS:xtremdb > @slob-fulltablescan
Limit number of full table scans (default all):10
USER47 7031 MB, Time: 49.68
USER15 7031 MB, Time: 50.9
USER37 7031 MB, Time: 50.36
USER6 7031 MB, Time: 49.75
USER7 7031 MB, Time: 53.51
USER33 7031 MB, Time: 77.78
USER16 7031 MB, Time: 76.57
USER35 7031 MB, Time: 77.37
USER28 7031 MB, Time: 77.29
USER19 7031 MB, Time: 77.13
---Summary---
Users: 10 Scanned: 70313 MB Scan rate: 109.80 MB/s Runtime: 640.35
PL/SQL procedure successfully completed.
The scan rate after fragmentation dropped from 138 to 110 MB/s. Frankly, I expected a steeper dive, but it seems that the XtremIO box is holding up pretty well, and the IOPS inflation issue is apparently not that bad on such an array.
Summary
In this test comparing Oracle on XtremIO using both ASM and ZFS, the difference is significant: ASM performs at least twice as well as ZFS.
One should take into consideration that this is on an extremely high-performance flash array that is insensitive to workload type (hot spots, fragmentation, large vs. small I/O).
Conclusions:
- ASM seems to have at least double the I/O performance compared to ZFS on the same system
- XtremIO completely solves (or avoids) ZFS fragmentation issues, but IOPS inflation still occurs, although with less dramatic effects than expected
- There are many parameters that can be tweaked so an apples-to-apples comparison is practically impossible. Mileage may vary.
I haven’t had the chance yet to also test on spinning disk. Will keep that for a future post.
Hi,
thx for the nice article and for sharing the configuration details, I understand these I/O performance tests require a lot of effort.
I have no experience with the magic XtremIO array, but I am anyway already convinced of using ASM for high performance Oracle databases, because of several reasons.
But this article cries out for a response from somebody who values the strengths of ZFS more; I am sure some ZFS experts will find this blog.
Some comments:
– You used the ZFS of the 4-year-old Solaris 10 update 10. Could you please try the test with a current version, like Solaris 11.2 with current updates?
– Could you please share some more details on why you limited the ARC cache to 100 MB? The ARC cache is part of “ZFS”; it is unfair to cripple ZFS with this – maybe not even the metadata can be cached. If you would like to avoid caching of the data, maybe you can set the ZFS property primarycache=metadata.
– You have set zfs_vdev_max_pending=32; you can maybe avoid additional queuing if you set the same queue length for the HBA. I don't know the best practice for VMware RDM and XtremIO, but the results with the following parameters in /etc/system would be interesting:
set zfs:zfs_vdev_max_pending = 32
set sd:sd_max_throttle=32
set ssd:ssd_max_throttle=32
I don't know how you found “32” as the best value for your benchmark. For VNX and VMAX, EMC suggested that we use the value “20” as a default, but this could be a historical, outdated value, because it was already suggested many years ago. Of course these queue sizes depend on your array (and its configuration). I am very interested in your opinion and results on this.
Hi Manuel,
Valid points indeed so let me explain:
On the old Solaris version – sometimes you have to work with what's available. The lab guys prepared the host with this version, so that's what I used. But it's very well possible that there are improvements in recent ZFS code. Not that I expect this to close the gap, but who knows, the more recent versions may do a bit better. I am certainly not finished testing (actually I will be testing in another POC very soon – and will certainly blog about it – but I don't know yet what that environment looks like exactly. We will see).
ARC cache: good one. Initially I started without any limitations in /etc/system. The result was that when I copied the tablespace from ASM to ZFS, the whole system became unresponsive (i.e. it responded to keystrokes but could not start another process). A typical memory starvation problem. Only a hard power cycle helped. This happened a few times. Then I limited the ZFS cache in /etc/system and the problem disappeared.
Here a note from the ZFS Evil Tuning Guide (http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache):
“ZFS is not designed to steal memory from applications. A few bumps appeared along the way, but the established mechanism works reasonably well for many situations and does not commonly warrant tuning.”
Well this is exactly what happened in spite of that note. Maybe due to the old code?
So why limit the ARC to 100MB and not more? I wanted to give ZFS enough memory for metadata, but to make a clean comparison with ASM I wanted to avoid serving lots of I/O from cache. This is because ASM is out-of-band, in that the ASM instance does not do any caching or translation of I/O either. I know you cannot compare apples to apples, but giving ZFS a large cache without doing the same for ASM would not be fair. But I appreciate your comment; in the next test I will play with a larger ZFS cache (on Solaris 11 😉) and see if it makes a difference.
I haven’t done any tuning on zfs_vdev_max_pending. This was the default setting. I quote again from the Evil Tuning Guide:
“Tuning is often evil and should rarely be done. ”
Messing around with blocksize and ARC limits is bad enough as it is. Ideally I'd like to run with 100% default settings, but that would not be reasonable. Apart from this, again, a valid point – however, with EMC XtremIO the max pending does not make any difference, as the XtremIO box is so lightning fast that there will never be a significant buildup of I/Os in the queue anyway. In the listed output you can see it sometimes gets as high as 8, but that's still well below 32. Note that the VMware environment I was using wasn't tuned to the max; the I/O queue should normally not be more than 1 or 2 (worst case). This is confirmed by another lab with XtremIO where I have Linux and ASM. There I run 180,000 Oracle IOPS (much more than in this test) against the smallest available XtremIO box (a single X-brick) and service times are less than 1ms with hardly any queuing at all.
If you would run on classic storage arrays (EMC VMAX, VNX, other disk-based), then yes you should certainly keep an eye on max pending.
Thanks for sharing your thoughts!
I performed Bart's tests on the newest Solaris release (11.3), having plenty of RAM for the ARC and caching both data and metadata, and obtained identical results. Besides that, there are a number of issues with ARC reaping which occasionally cause I/O latency outliers of several seconds. On the other hand, ZFS provides marvelous features like file snapshotting and cloning, which are great facilitators for boosting agility in DevOps environments and decreasing the total cost of ownership. In summary, if you opt for ZFS you trade some performance for the features and flexibility.
Hi Nenad,
Peer review, I like that! And good to hear that the results are consistent.
When I started working with ZFS, I have to admit, I liked the CLI interface too, and some features are definitely interesting. But there are always alternatives. Standard LVM snapshots are available if you run XFS or EXT3/4. Maybe I'll give that a shot and see if I can use it with Oracle. If you run a decent enterprise storage system (hopefully DellEMC 😉), then usually that's the better option, as it doesn't require any host processing at all (remember, CPU cycles on Oracle-licensed processors are very expensive).
Thanks
Hi Bart,
I had been using the Hitachi Shadow Image feature for snapshotting and cloning before Solaris 10. Afterwards, I switched to ZFS for the following reasons:
– copy-on-write (huge space savings for the cloned database, almost immediate refreshing of development databases)
– compression (space savings and performance; by the way, the performance benefits of using compression completely compensated for the fragmentation side effects when running your test case!)
– free of charge
– handling: all the commands can be executed within the local Solaris zone with delegated administration granted (no need to jump to the global zone, as was the case with the storage feature)
Best regards,
Nenad
I see your points, although I don't understand how ZFS compression improves performance. My tests showed the exact opposite, but maybe it depends on the setup – if you're oversubscribed on I/O, then compression reduces the backend I/O and may mask performance issues?
Innovation on all-flash arrays has caught up, and other (hybrid) arrays are following quickly. With a modern array, snapshots are instant (< 1 sec), free (no additional storage required at all, except of course for the new writes), and you can create as many as you like (sizing limits are way beyond what a normal system needs). Compression for all data without overhead, and global dedupe across all data within the array (not just a single zpool).
Handling – I see. However, this is going to be a problem as soon as you want to maintain consistency across multiple databases/middleware/apps.
Regards,
Bart
Hi Bart, thanks for sharing and taking the effort 🙂
For this setup (SAN, I mean) there is not much point in such a test, though the basic setup is a perfect ASM scenario.
I'll try to replicate the “same” test case using local SSD/flash disks. Just have to find 5 SSDs first…
How much memory did you assign the Solaris VM?
Regards André
The big issue with limiting the ARC to 100M is that you very likely got into a situation where even ZFS metadata cannot be cached, and that will severely cripple performance. As the first poster pointed out, setting primarycache=metadata is a far better comparison. Also, I'm sure your ASM SGA size was far bigger than 100M, and that's where the extent map is stored, among other things, which is the closest analogy to ZFS metadata. Bottom line: you should be comfortable allocating at least the amount of memory you used for the ASM SGA to the ZFS ARC, while changing the primarycache setting to only cache metadata.
The compression improved read I/O performance by having to do fewer IOPS for the same amount of data. The benefits of fewer IOPS far outweigh the CPU overhead caused by compression. Of course, this doesn't hold for a CPU-oversubscribed system, but in case of CPU starvation this wouldn't be the only problem.
Best regards,
Nenad
OK, that makes sense. We see the same effect in all-flash arrays – compression improves performance (counterintuitive until you realize how fast modern CPUs can compress/decompress data).
Regards
Bart