Database compression
Another technique Oracle has improved as of version 11g is compression. In versions up to 10g you could only compress an entire table, and afterwards random-access performance on the compressed table was poor. It worked well for data warehouses, where compression reduces the I/O bandwidth needed (compressed data can be read from disk faster than uncompressed), but only in specific cases.
In 11g Oracle introduced “Advanced Compression”. I will not go into details, but it allows compression on a much more granular basis, so that OLTP applications can benefit: compression is maintained even for conventional, row-by-row DML instead of only for bulk loads. Oracle claims this reduces the total database size (no-brainer 🙂 ) and therefore also the backup size (thereby ignoring the tape compression that most customers already use, so your mileage may vary). Data can only be compressed once, so with tape compression enabled, a normal database and a pre-compressed one will probably end up at about the same size on tape.
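For illustration, here is roughly what the two variants look like in DDL (table and column names are made up; note that 11gR1 spelled the option COMPRESS FOR ALL OPERATIONS, renamed to COMPRESS FOR OLTP in 11gR2):

```sql
-- 10g-style basic compression: rows are compressed only during
-- direct-path operations (CTAS, direct-path insert, bulk loads);
-- conventional inserts and updates store rows uncompressed.
CREATE TABLE sales_history (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
) COMPRESS;

-- 11g Advanced Compression: compression is also maintained for
-- conventional DML, so OLTP applications can benefit.
CREATE TABLE sales_oltp (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
) COMPRESS FOR OLTP;
```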
De-duplication
What about de-duplication? Note that de-duplication is not the same as compression. Compression is a way to use smarter encoding of information so it takes up less space. De-duplication is the fine art of finding multiple copies of the same information and reducing them to a single copy. The larger the search area (a whole database, or even de-duplication across multiple databases, applications and systems), and the smarter and more granular the algorithm (variable block length versus fixed block, awareness of metadata in the dedup algorithm), the better de-duplication works.
EMC’s Data Domain acquisition last year is a great addition to our portfolio and, together with our Avamar product and its very smart de-duplication process, puts us in the best position for future innovations in this area – while already offering fast backup de-duplication for large databases with equally fast restore times today.
Oracle claims to have de-duplication, but it is worth mentioning that it is currently only available for Oracle SecureFiles LOBs (Large Objects – BLOBs, aka Binary Large Objects, or CLOBs, aka Character Large Objects) – a method of storing file content in a database (think of pictures, PDF documents, large text streams and so on).
If you store the same LOB twice then Oracle will allocate space only once and keep a record of the number of references to it. The object is deleted only after the last reference disappears.
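A minimal sketch of what enabling this looks like (table and column names here are hypothetical; SecureFiles requires 11g and an ASSM-managed tablespace):

```sql
-- Store document content as a SecureFiles LOB with de-duplication:
-- identical LOB content is stored only once and reference-counted.
CREATE TABLE documents (
  doc_id   NUMBER PRIMARY KEY,
  doc_body BLOB
)
LOB (doc_body) STORE AS SECUREFILE (
  DEDUPLICATE   -- detect duplicate LOB content and keep one copy
);
```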
However, it does not work for normal records in tables. Customers often ask us at EMC if this could be done at the storage level, and I have had some interesting discussions with EMC engineers about it. The answer is that this is very hard (if not impossible) to do, because storage systems have no notion of database binary structures, so the best we can do is search for duplicate disk blocks. But the chance of finding, say, two identical Oracle blocks is near zero: each Oracle block typically holds more than one data record, and each block has a unique header (containing, among other things, the block’s own address and SCN), so even two blocks holding the same rows differ byte for byte.
For de-duplication to work at the database table level, the only way to achieve it is to implement it in the database engine. Claims by some competing storage vendors that they can de-duplicate primary Oracle tablespaces have to be taken with a grain of salt.
Thin provisioning
In my last article I wrote about the poor storage efficiency in most organizations. If you add up (actually, multiply) the utilization of all storage levels (storage system utilization, SAN, volume manager, file system, data file, table space and so on) you end up with much lower utilization than you’d expect: multiply, say, 80% array utilization by 75% file system utilization by 60% table space fill, and only 36% of the raw capacity holds real data. I am not surprised to find utilization ratios of 35% or less – and I ignore RAID overhead, replication and snapshots because those added bytes have a purpose (namely, data protection)!
We Dutch sometimes say that every disadvantage has an advantage (or the other way around, if it suits you). The unintended, hidden advantage of low storage utilization is that you need more disks and thereby, unwittingly, avoid some of the random performance problems, because the data is spread across more spindles than strictly required. Still, I doubt that throwing in more disks is the right way to achieve better service levels 🙂
Using thin provisioning (at EMC we call this “virtual provisioning”) allows us to offer more capacity to applications (including databases) than is actually available. The laws of statistics help us avoid running out of space as long as we share the virtual pool across enough applications. Carefully monitoring the free space in the pool allows us to bring utilization up to 70% or more (actual feedback from some customers). It works well for Oracle if you adhere to the best practices (if you don’t, you could still easily run out of capacity, or use more than actually needed).
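As one sketch of such a best practice (file names and sizes here are made up): since Oracle formats every block of a datafile when the file is created, pre-allocating a huge datafile immediately consumes all that space in the thin pool. Starting small and letting the file autoextend keeps pool consumption close to the actual data volume:

```sql
-- Thin-friendly datafile layout: start small and grow on demand,
-- so the thin pool only backs blocks that have actually been written.
CREATE TABLESPACE app_data
  DATAFILE '/u01/oradata/ORCL/app_data01.dbf' SIZE 1G
  AUTOEXTEND ON NEXT 256M MAXSIZE 32G;
```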
Innovations are on the horizon that will allow us to reclaim unused capacity on thin-provisioned datasets and to support “thick to thin” migrations (moving data quickly from normal to thin-provisioned capacity).