A trivial question:
What is the basic function of a storage system?
I would say, the trivial function of a storage system is to store digital data and getting it back when you need it.
To be specific:
get the data back exactly the way you stored it.
You would probably say “Duh, of course!”
A storage system (as simple as a hard disk or as sophisticated as an EMC VMAX) is supposed to store data and give it back unmodified. But recent research shows that simple disk drives are not as reliable as you might think. Enough material is available that explains why and how often disk drives fail to return the correct information, often without any error as if the corrupted data is perfectly valid. See below for more references to this issue.
Oracle block corruptions
So every time I talk to Oracle Database Administrators (DBAs) about how to optimize databases, I ask them if they ever experienced database corruptions. In particular, single block corruptions, that then have to be repaired using Oracle’s RMAN tool or manual database recovery. I’d say, about half the DBAs I meet had such an experience in the last five years or so. And they comment that repairing the corruption is tricky, requires a lot of careful work and sometimes does not work at all and the administrator (and therefore the business user) has to accept that some data has lost.
Depending on what part of the database is corrupt beyond recovery, this can be a minor pain (such as having to rebuild an index) or very severe. I spoke to one customer who had corruption in the data dictionary tables and was unable to recover. I suggested them to see if they could import data into a new database using transportable tablespaces but that did not work, like anything else they already tried. They had to do a full database restore and lost many hours of production work. And spent many hours (of downtime) before the restore was complete.
If they have experienced corruption, my next question to them is typically: Did you ever find the root cause of the problem? What caused the corruption in the first place?
Until today I haven’t met a single administrator who could pinpoint the root cause without any doubt. They just fix the corruption (wherever possible) and move on. Hoping it won’t happen anytime soon again.
Silent Data Corruption
But I think that in many cases, unreliable disks play a role in causing the issues. Because a mis-read of a disk block (returning corrupted data without any errors) is such a rare occasion, it is hard to diagnose. You could run an I/O generator for weeks at maximum read/write workload and never detect any problems.
In the scientific research that was performed, a few thousand disk drives (or even more) were evaluated and for a much longer time. And data corruption finally shows its ugly face every now and then.
Rumours go that this cause of data corruption results in many customer escalations to Oracle senior management, probably even more than due to other issues. But I guess Oracle will neither confirm nor deny this.
Here a list of studies performed by different organizations on this issue:
- Carnegie Mellon University – Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (2007)
- CERN – Data integrity study (2007)
- University of Wisconsin-Madison & Network Appliance – An Analysis of Data Corruption in the Storage Stack (2008)
- Google – Failure Trends in a Large Disk Drive Population (2007)
In the research it becomes clear there are several causes of (silent) data corruption. Not everything is caused by disk drive technology. In some cases, flaky cables, host bus adapters (HBA), buggy drivers or microcode can cause silent corruption as well. For more information, check the resources at the end of this page.
Database block checksums
To protect against such problems, Oracle provides a few mechanisms such as block checksum (a.k.a. CRC) checking. This is switched on by default and will cause the database to calculate checksums in each block to be written. If later the the block is read back from disk, it will test if the checksum matches the block, and if not, it knows the block was corrupted. But it does not automatically repair the corrupted block. So detecting corruption on the database level while reading the disk, is often too late.
Oracle started about 10 years ago with the HARD initiative (Hardware Assisted Resilient Data) which extends the block checking all the way from the database to storage subsystems. So a storage system could check if the data received through all I/O layers (scsi driver, HBA, cables, etc) is valid or not – and thereby preventing any data corruptions caused by faulty writes. In 2001, EMC was the first to implement this feature with a product named “EMC Double Checksum for Oracle”. An old reference to the original news release is here:
Many other vendors followed. But it’s not the full story. Oracle HARD protects against failed writes, but not against perfectly written data blocks that get corrupted due to “bit rot” or because of read errors.
Still, Oracle is (most often) capable of detecting corrupted blocks, but what do you do when you find out you have a corruption on a datafile? Sometimes, re-reading the disk helps. More often, it does not.
No problem, some people argue. We have RAID (sometimes even with double parity protection such as with RAID-6), so we can rebuild the corrupted data block…
Absolutely true… If the storage system knows the data is corrupt, and before the chance of somehow repairing it, is long gone.
If you have an El Cheapo storage system that does not check corruptions, it will probably read the corrupted block as if it was perfectly valid. A re-read by the database just causes the same (corrupted) block to be transmitted from storage cache to the database.
How can a RAID storage array know a block is corrupt? A storage system does not know anything about database table structures. So it cannot read Oracle block checksums to figure it out. Compare against the mirror disks in the RAID group? Technically possible but there are two problems. First of all, to scan the RAID members for every I/O in order to detect corruption would cause a massive performance overhead. Second, what do you do if there is an error? Assuming RAID-1 (mirroring) to keep things simple – What to do if you find out during reads that two mirrored blocks contain different data?
Some vendors have chosen (no kidding!) to randomly pick one of the two and ignore the contents of the other. Still 50% chance that it’s the right one… Others completely ignore the issue as if physical disks would never make any mistakes.
Protecting against data corruption
Some vendors (including EMC) go a bit further and store checksum information on every written track on disk. So now when you read a track from disk (and you don’t have to read the other RAID members) you can find out immediately if the track has the right contents. If not, you can repair it using RAID (assuming the other member disks did not corrupt their corresponding data). It’s highly comparable to Oracle database, generating checksums for every database block.
If you’re concerned about data integrity, ask your infrastructure vendor (not only disk storage vendors, but also vendors of complete database/data warehouse appliances) if they actually store parity data on disk – a step further than what Oracle H.A.R.D. is doing by catching only in-transit write errors.
EMC is paranoid about taking risks with customer’s data integrity. So we go boldly beyond just saving checksums on disk.
EMC systems perform disk scrubbing since ages (I know it was already on Symmetrix 3000/5000 around 1996 and maybe even before that). Disk scrubbing means you continuously scan every disk for errors, not waiting for data to be read sooner or later by the application. A very low error rate for a disk is acceptable (as the research shows that even normally functioning disks will occasionally spit out a wrong bit of data). If the error rate for a disk becomes too high, EMC assumes the disk is going to fail sooner or later (because of worn out ball bearings, for example) and will – in many cases – replace it before actually failing or corrupting more data. Replacing disks does not have to be physical – invoking a hot spare disk automatically avoids waiting for an engineer to show up hours later with the physical replacement drive. Resilvering a disk that has not failed yet (using a hot spare) is easier, faster, more reliable and has less performance impact than replacing an already failed drive.
Drawbacks to this disk checksum approach? A few… first of all, calculating checksums requires extra CPU cycles. Oracle states in their documentation that the normal in-database checksum calculation adds about 1-2% CPU overhead. I have no idea what the overhead is for EMC storage as (unlike in Oracle) you cannot switch it off 😉 but I guess something similar. Second drawback? It requires a tiny bit of extra disk capacity – for example, EMC CLARiiON and VNX store 8 extra bytes for every 512 byte disk block.
By the way, in my post about Performance Proof of Concepts I mentioned all kinds of stuff that you would switch off when doing performance benchmarks. Block checksums and disk scrubbing are things you really couldn’t care less about when doing benchmarks (as it is pseudo random data anyway, so you can risk a block corruption). Switching it off gives you another 1-2% higher benchmark number. Makes sense? Not for someone who compares these numbers against what he can do in real mission critical deployments…
By calculating the checksums immediately after getting a write from a host, at EMC we protect data as soon as it enters the storage system. I tell my DBA audiences that by doing so, we hope to avoid many subtle database block corruptions that would have happened otherwise. But we go beyond that. We do not rely on batteries or super capacitors to keep write cache memory powered – we use batteries only to save memory contents to disk (a bit like hibernation) before completely shutting down. We do not use cheap memory modules without ECC error checking. We do scrubbing on memory like we do on disk. We protect memory contents by mirroring (VNX, recent Symmetrix) or implementing parity (in older generations Symm). We can survive failure of complete memory chips without downtime or performance penalties (we will just mark the memory regions as unusable and continue processing with a bit less cache). We use dual-ported disks so that we can continue running even if a disk adapter fails (and thereby allowing to recover data written by a flaky disk controller). We test every Symmetrix VMAX disk in the assembly plant using hot and cold test rooms to see if the disk will fail in harsh conditions (before shipping it to customers). When testing new systems, we inject faulty data (such as fibre channel frames) in the SAN to see if the error gets detected and isolated. And so on…
T10 Data Integrity Field
ANSI has proposed a standard for integrity checking (ANSI T10-DIF) which looks very similar in functionality to the early nineties Symmetrix CRC checksum method. The new V-MAX models are compliant with T10-DIF, although I understand from Symmetrix engineers that, for performance reasons, we store CRC information for every 4 or 8 KB, not every 512-byte block. Which makes sense considering that 4K resp. 8K is the internal block size of the system.
This is how Storage Magazine describes it:
The American National Standards Institute’s (ANSI) T10 DIF (Data Integrity Field) specification calls for data to be written in blocks of 520 bytes instead of the current industry standard 512 bytes. The eight additional bytes or “DIF” provide a superchecksum that’s stored on disk with the data. The DIF is checked on every read and/or write of every sector. This makes it possible to detect and identify data corruption or errors, including misdirected, lost or torn writes. ANSI T10 DIF provides three types of data protection:
- Logical block guard for comparing the actual data written to disk
- Logical block application tag to ensure writing to the correct logical unit (virtual LUN)
- Logical block reference tag to ensure writing to the correct virtual block
When errors are detected, they can then be fixed by the storage system’s standard
(Source: Storage Magazine)
Many storage vendors today have implemented T10-DIF but not everyone. Make sure your vendor supports it if you’re concerned about data integrity.
Oracle themselves also have provided some excellent info on this phenomenon and standard here: Oracle data integrity landing page
(look for osd2008-data-integrity.pdf)
Still, even T10-DIF leaves a gap where corruptions can still occur. T10-DIF only describes how a storage system should detect corruptions within the system. It cannot detect “in-transit” corruptions. As said, Oracle H.A.R.D. can detect (and prevent) corrupted writes and Oracle checksum verification can detect (not prevent) corruptions when reading.
An addition to the T10-DIF standard is in the making, called T10-DIX (Data Integrity eXtensions). T10-DIX extends checksum verification to the I/O interface layer between host and storage.
Note that, at least as far as I have found, T10-DIF/DIX only talks about how checksum information is stored on disks and in transit. But, some disk types cannot handle 520-byte blocks and prevent T10-DIF to be used. Needless to say that EMC has implemented a similar protection for such drives (i.e. SATA disks). Neither does the standard specify how to perform proactive scanning (scrubbing), and it does not deal with other corruptions such as lost cache contents caused by power failures, etc. It also does not specify how to recover from any detected corruptions. I guess that it’s up to the vendor of the storage system to be dealt with – including dealing with some other corruption causes such as “torn pages”.
EMC has implemented a feature called “Generic Safewrite” to deal with this (thanks to Chris Gerath of EMC for the content):
Generic SafeWrite solves a different but related problem. Let us say that the HBA tells the array, ‘I am about to send 100 SCSI blocks to you.’ In SCSI-speak, the array says, ‘ok, I am ready.’ The HBA starts transmission and 50 blocks go out. Then something happens (let us say someone pulls the cable between host and array). The transmission times out and the HBA sends word back to the database that the write failed. It failed, but something happened …. Danger, Will Robinson. In a disk or a lesser array, the 50 SCSI blocks likely made it to the disk surface. Now you have the database thinking it has one thing (the old data, remember, it ‘knows’ the write failed), yet the disk contains the new data. If the data contains index information or who knows what, when it is read back, the database could fall over. SafeWrite holds ALL the data off of disk until it all arrives, acknowledges the write, then moves it to disk. This prevents what is sometimes known as the ‘torn page’ issue.
Is it possible to build a storage box cheaper than the ones EMC is offering? I certainly believe so. How hard is it to grab a decent Intel-based box, jam a few SAS/SATA adapters in the PCIe slots, add a big amount of cache, hook up a bunch of cheap off-the-shelf disks and build your own storage box? Many storage start-ups these days work this way.
But to get to the same reliability and integrity level as EMC systems is a long shot for many of our competitors. Unfortunately, even at EMC some folks seem to have forgotten about this, and start talking about all sorts of great innovations when talking to applications or database administrators at our customers. Over the last few years I have learned to start discussing the basics first. The fundamental requirements of any infrastructure components used for mission critical computing.
If I were a customer (and I have been an EMC customer in the past), I would care less about virtual provisioning, auto-tiering, even about power & cooling requirements, and maybe even less about performance than about data reliability and integrity.
What good is a box that can do a million IOPS if you can’t trust your data to it?
Carnegie Mellon University – Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (2007)
CERN – Data integrity study (2007)
University of Wisconsin-Madison & Network Appliance – An Analysis of Data Corruption in the Storage Stack (2008)
http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html – Web page
http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf – PDF document
http://research.cs.wisc.edu/adsl/Publications/DataCorruption-FAST08.ppt – Presentation
Google – Failure Trends in a Large Disk Drive Population (2007)
Oracle info – T10-DIF, H.A.R.D. & Oracle block checksums
Vendor info – White papers and product documentation
Old page describing Dell|EMC CLARiiON unique features: http://www.dell.com/content/topics/global.aspx/power/en/ps2q02_darden?c=us&l=en&cs=555
Update: An Oracle Openworld presentation (hosted by EMC and Oracle together) on the topic can be found here: http://oss.oracle.com/~mkp/docs/OOW2011-DI.pdf