A comparison of a few Disaster Recovery methods for Oracle databases.

Disclaimer: I tried to be as unbiased as possible (working for an infrastructure vendor). If you think there are incorrect statements, let me know and I will correct them where possible.

| Feature | Oracle Data Guard (log shipping) | EMC SRDF (data mirroring) | EMC RecoverPoint (Continuous Data Protection) |
|---|---|---|---|
| Network link | IP via database host | SAN based | SAN based |
| Operation | Sync / Async | Sync / Async | Sync / Async |
| Bandwidth requirements | Depends on REDO writes | Depends on all writes, has write-folding | Depends on all writes, has write-folding |
| Requires "force logging" (performance impact) | Yes | No | No |
| Requires archivelog mode | Yes | No | No |
| Requires remote DB server + license | Yes | No (*) | No (*) |
| Uses database host CPU & network resources | Yes | No | No |
| Replicates non-database data | No | Yes | Yes |
| Non-Oracle databases | No | Yes | Yes |
| Requires one instance per database | Yes | No | No |
| Multi DB/platform/application consistency groups | No | Yes | Yes |
| Failover time | Seconds | Minutes | Minutes |
| Failover automation | Native (Data Guard broker) | Requires external tooling | Requires external tooling |
| Standby database access | Read-only (direct; requires extra license) | Read/write (via snapshots) | Read/write (via snapshots) |
| Allows instant restore to older checkpoint | No | Yes (via snapshots) | Yes ("time shift") |
| Requires regular refreshes of remote snapshot | No | Yes | Yes |
| Incremental re-sync after link failure | Only if data was not modified at both locations | Always | Only when not modified |
| Initial build / rebuild method | Restore from backup or other media | Background sync | Background sync |
| Transfers local data I/O corruption | No (does not transfer datafiles, only logs) | Yes (works as designed) | Yes (as designed) |
| Prevents remote data corruption | No (remote server can silently corrupt data) | Yes (no remote server involved) | Yes |
| Repairs local corruptions using DR | Yes (some cases) | Yes (some cases) | No |
| Continues if local RAID group fails (i.e. serve I/O from remote system) | No (failover is triggered) | Yes (it's a remote "mirror", not a "copy") | No |

(*) Beware that Oracle typically requires "remote storage" to be fully licensed

More detailed explanation

Operation

All of the mentioned products can replicate data synchronously ("zero data loss") or asynchronously (with a delay). Zero data loss in an Oracle context means you don't lose any committed transactions. Asynchronous replication delays remote updates in order to avoid the performance impact of synchronous operation. All of the methods provide consistency for the database; in other words, the remote database is always restartable (for Oracle specialists: SAN replication always offers a "crash consistent" image, similar to a "shutdown abort" of a database).
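
For illustration (my addition, not from the original comparison), here is roughly how the sync/async choice looks on the Data Guard side; the service name "stby" and the destination number are placeholders, and for SRDF/RecoverPoint the equivalent setting is made on the storage arrays instead:

    -- Synchronous redo transport (zero data loss for committed transactions):
    ALTER SYSTEM SET log_archive_dest_2 =
      'SERVICE=stby SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby';

    -- Asynchronous transport (delayed, lower impact on the primary):
    ALTER SYSTEM SET log_archive_dest_2 =
      'SERVICE=stby ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby';

    -- The protection mode ties zero data loss behaviour to SYNC transport:
    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;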

Bandwidth requirements

Data Guard only transfers redo log data, whereas SAN replication replicates everything (i.e. redo logs and data files). Arguably Data Guard therefore has somewhat lower bandwidth requirements than SAN replication. EMC SAN replication tools have "write folding" (when running async, multiple overwrites of the same block are only sent once) and other optimizations (compression, de-dupe, block merging) to mitigate the extra required bandwidth, so the difference is not 2X but usually less. In some cases (when doing lots of very small updates) the overhead in the redo logs (i.e. redo wastage) causes Data Guard to consume more bandwidth than SAN replication would.

Also worth mentioning: replication links often need to be sized for re-sync after link failures as well – which may negate the lower bandwidth benefit completely.
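
As a rough sizing sketch (my addition, assuming a reasonably recent Oracle version), the redo volume that Data Guard would have to ship can be estimated from the instance statistics and the archived log history, whereas SAN replication would need to be sized for all writes:

    -- Redo generated since instance startup (proxy for Data Guard bandwidth):
    SELECT value / 1024 / 1024 AS redo_generated_mb
    FROM   v$sysstat
    WHERE  name = 'redo size';

    -- Redo volume per hour, from the archived log history:
    SELECT TRUNC(completion_time, 'HH') AS hour,
           ROUND(SUM(blocks * block_size) / 1024 / 1024) AS redo_mb
    FROM   v$archived_log
    GROUP  BY TRUNC(completion_time, 'HH')
    ORDER  BY hour;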

Requires “force logging”

Data Guard is a framework around (redo) log shipping. This means that in order to get data replicated, it has to be logged first. Some database operations can be performed with logging disabled (NOLOGGING) – especially useful for data loading and certain maintenance operations. "Force logging" mode makes sure all operations get logged to maintain recoverability – but enforcing logging can have a performance impact. SAN replication does not depend on database logging – it just replicates storage volumes, so there is no performance impact for such NOLOGGING operations (other than the impact of transferring the data itself).
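
A minimal sketch of what this looks like in practice (table names are placeholders); with force logging enabled, the NOLOGGING shortcut below is silently ignored and full redo is generated anyway:

    -- Check and enable force logging (required for Data Guard, not for SAN replication):
    SELECT force_logging, log_mode FROM v$database;
    ALTER DATABASE FORCE LOGGING;

    -- Example of an operation that would otherwise bypass most redo logging:
    CREATE TABLE staging_copy NOLOGGING AS SELECT * FROM source_table;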

Requires archivelog mode

Many Oracle production databases (and I bet most of the ones that require D/R) run in "archivelog" mode (they preserve logging information by copying all redo logs), so that in case of data loss the database (or parts of it) can be recovered using the logs. With SAN replication this is not required (at least not for failover), but many database administrators would still have it enabled for logical repairs.
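
For reference, enabling archivelog mode requires a short outage because the database has to be restarted into MOUNT state (a standard procedure, sketched here for completeness):

    SHUTDOWN IMMEDIATE
    STARTUP MOUNT
    ALTER DATABASE ARCHIVELOG;
    ALTER DATABASE OPEN;

    -- Verify:
    ARCHIVE LOG LIST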

Requires remote database server

Data Guard is a database feature. In order to replicate to another site, the remote site needs to have a standby database up and running to ingest redo log data. This means you must have a server and database (the “standby” database) on the remote site (including full licenses). With SAN replication this is not absolutely required as no hosts are involved in the replication process. That said, not having a standby server typically means the RTO (recovery time objective) will be very long (you first have to get a server, then configure it for running the database, then do the recovery).

However, in certain situations (e.g. lights-out third "bunker" sites) it may be fine, and even preferred, not to have a standby server.

Uses database host CPU and network resources

Even though the host replication overhead with modern CPUs is very low (typically 1% or less), the fact that database licenses are very expensive can still justify offloading that overhead to something that does not consume expensive CPU cycles. SAN replication still involves CPU cycles, but these are performed by storage infrastructure that does not carry database licensing – and may therefore be less expensive – and it certainly does not impact database host performance.

Replicates non-database data, non-Oracle databases

A modern application typically stores business data in more than just one database. If you strategically choose application/database replication methods over SAN replication, then you need to select a different tool for each database platform, each file system/host type, each middleware product, each hypervisor, etc. This can quickly become very complex and blur administrative responsibility. Each IT team (DBAs, file server administrators, VMware operators, etc.) reports that it has D/R covered by juggling its own replication instances – but from a CIO perspective you now have many points of failure. SAN replication is host and application agnostic and can replicate all business data from one single point of view and control.

Requires one instance per database

With Data Guard you have to set up DR for each database that needs it. Most of our customers have many databases. If you run 100 databases, of which 40 need D/R implemented, how many Data Guard instances do you have to maintain and monitor? What if, on average, each one of them is down for a day per year because of maintenance, upgrades, re-syncs, etc.? Can the business survive if one database is not recoverable?

SAN replication avoids this problem because it's not host dependent (replication keeps going even if you're doing database or application upgrades). Of course the SAN environment needs maintenance too – which is why best practices demand making a full, consistent application landscape snapshot (at the remote site) before doing any maintenance – so even in worst-case scenarios you can fall back to a business-consistent checkpoint for all applications. Also, SAN maintenance that impacts DR is usually much less frequent than application, database and host maintenance.

Multi database, platform, application consistency groups

This is a topic that is often not fully understood, so here's a quick explanation. Say you run a banking environment with two databases: one that runs the ATMs and one that keeps track of customers' bank accounts. If someone withdraws money from an ATM, it does not take a lot of imagination to understand that BOTH databases need to be updated with the transaction in a consistent manner (i.e. two-phase commit or some other way of enforcing this). If you recovered both databases, but with a slight time offset between them, the happy person who just used the ATM might not see the transaction reflected on his bank account – even though both databases are technically consistent. Or the other way around – the ATM refuses the withdrawal due to a failed transaction, but the amount is still deducted from the account. I leave the business impact of such a scenario to my readers' imagination.

This problem cannot be solved with methods that protect single databases or applications (unless there were a way to synchronize recovery checkpoints at the microsecond level). That said, even EMC's consistency group features on replication products only offer a tool, not the end solution. The devil is in the details. I'd be happy to provide more guidance if needed.

Failover time

A Data Guard standby database is up and running all the time in recovery mode. Promoting the standby to accept end-user logins and become read/write is a matter of issuing the command to do so. Hence Data Guard can offer very quick failover times.

SAN replication is at a disadvantage here because storage replicas are not mounted and activated until a failover is triggered. At that point the database has to be started up and recovered – which may take a number of minutes, depending on the amount of rollback and other admin tasks the database has to perform before being able to accept user connections.
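
To illustrate the difference: after a SAN failover the replicated (crash-consistent) volumes are presented to the DR host (array-side steps not shown here), and the database is simply started, at which point Oracle performs ordinary crash recovery, just as it would after a power failure:

    STARTUP MOUNT
    ALTER DATABASE OPEN;   -- instance (crash) recovery runs automatically here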

Failover automation

Data Guard has a failover method built in (Data Guard broker). SAN replication depends on either manual failover (not recommended) or a 3rd-party tool (such as VMware SRM) to automate failovers.

Having an embedded tool is nice as long as you don't have to fail over other components together with the database (such as middleware, other databases, app servers, network configuration, etc.). Note that Data Guard broker is optional and you can use other tooling as well if you like.
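
For completeness, a role change with the Data Guard broker boils down to a couple of DGMGRL commands (sketch only; "stby" is a placeholder db_unique_name):

    DGMGRL> CONNECT sys@stby
    DGMGRL> SWITCHOVER TO stby;    -- planned role reversal
    DGMGRL> FAILOVER TO stby;      -- unplanned, when the primary is lost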

Standby database access

Because a Data Guard standby database is up and running all the time, it can be used for queries on the data, as long as these queries are read-only. Note that this feature requires an "Active Data Guard" license on both databases. Some people told me that read/write access is also possible – further investigation taught me that this requires temporarily suspending the link and activating a feature called Oracle Flashback that rolls back any applied updates before re-enabling the link and catching up with the delayed transactions. This makes it mostly infeasible for real DR scenarios – but it may be useful in certain other cases (such as testing and data distribution purposes).
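
A sketch of both access methods on a physical standby (steps simplified; the snapshot standby conversion requires the database to be mounted with redo apply stopped):

    -- Read-only access while redo apply continues (Active Data Guard license):
    ALTER DATABASE OPEN READ ONLY;

    -- Temporary read/write access via the snapshot standby feature (11g and later),
    -- which uses Flashback under the covers and suspends redo apply:
    ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
    -- ... read/write testing ...
    ALTER DATABASE CONVERT TO PHYSICAL STANDBY;  -- discards changes, then catches up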

SAN replication does not offer direct access to D/R copies (nothing is allowed to mess with precious copies of production data, so access is deliberately disabled until a failover is triggered), but storage snapshots can be made from the DR copies and then mounted and started as an independent, read/write database (including changing the database name or other things, if you like, to avoid confusion about which database is connected). Note that any updates will be lost when the snapshot is terminated or refreshed. Having an independent snapshot of the D/R copy allows things like full D/R testing, remote backups, test & dev refreshes, firefighting/troubleshooting, auditing, etc.

Allows instant restore to older checkpoint

What I mean here is that sometimes you wish to go back in time because the current D/R copy has also become corrupted or lost transactions. At this point there is a large difference between the three methods:

  • Data Guard – Cannot roll back in time (except perhaps by using Oracle Flashback – which many customers don't like to use for production). The workaround is to delay redo log apply by a number of minutes or hours (I usually hear 4 hours as a common choice) – a configuration sketch follows this list. So if you catch a corruption or data loss within 4 hours after it was caused, you may be able to recover the standby to a point just before the corruption. A few problems arise that one may not be aware of: a) at night your administrators are asleep, and by the time a detected problem has made it all the way through the escalation process, the standby may very well have applied it as well; b) the apply delay renders Data Guard's quick failover unusable – because with a normal failover you now first have to roll forward the remaining 4 hours of transactions, which can take a long time; c) if the roll-forward is not stopped just before the point where the data loss occurred, there is no way to undo the transaction anymore (so you would need a full restore and another lengthy roll-forward).
  • EMC SRDF – Cannot roll back in time either – but can restore from snapshots quickly. So you could have a snapshot every 4 hours or so and use that as a recovery starting point. If you recover just a little too much, you can re-apply the snapshot and try again.
  • EMC RecoverPoint – has a journal-based time shifting feature which allows you to pick any checkpoint and use it for quick crash recovery. If you shift back 10 minutes and find that the corruption was already there, roll back 5 more minutes and try again. You can roll back as far as the size of the journal allows (typically at least 24 hours or so) and the checkpoint can be accurate to the second (or better).
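
The delayed apply mentioned for Data Guard above is configured on the standby roughly as follows (a sketch; DELAY is specified in minutes, so 240 equals 4 hours):

    -- Apply redo with a 4 hour delay:
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DELAY 240 DISCONNECT FROM SESSION;

    -- Before a failover, the remaining redo still has to be rolled forward:
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY DISCONNECT FROM SESSION;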

Requires regular refreshes of remote snapshots

A Data Guard standby is always updated continuously, so if you use it for read-only queries, you always have the latest state of production data (if you don’t delay log apply). SAN snapshots are one-off, point-in-time checkpoints and start aging as soon as you create them. So the advantage of snapshots being read/write is offset by having to refresh snapshots every now and then to avoid using a very old data set.

Incremental re-sync

All mentioned products can pick up from a suspended link and incrementally re-establish replication. With the exception of SRDF, incremental re-sync only works when data was not (!) changed at both locations. Why is this important? Say you have a rolling disaster (e.g. an overheating datacenter floor, fire, etc.) where the remote link goes down, and only 10 minutes later the production server gives up. Now you have to fail over to a 10 minute old copy (even when running in synchronous mode). If you manage to get the primary site back online, you now have transactions applied to the (now idle) production database as well as to the (now activated) standby. Re-enabling synchronization then requires either a block-by-block comparison of both datasets, or a full re-sync (overwrite) of one of the two. Currently – as far as I am aware – the only D/R method offering delta updates without a full re-sync in this scenario is SRDF.

Initial build / rebuild method

SAN replication allows you to simply select a set of storage volumes and start replicating. Once the initial sync is done, the DR copy is established and you're good to go. Data Guard requires an initial creation of the standby database, for example by using RMAN to do a full restore, before replication can start. If the standby somehow becomes invalid, you need to do this restore/rebuild again, which may take a long time and some administrative effort.
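
A minimal sketch of such an initial standby build using RMAN active duplication (11g and later; "prod" and "stby" are placeholder connect strings, and the auxiliary instance must already be started NOMOUNT):

    $ rman TARGET sys@prod AUXILIARY sys@stby
    RMAN> DUPLICATE TARGET DATABASE FOR STANDBY
          FROM ACTIVE DATABASE
          DORECOVER NOFILENAMECHECK;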

Transfer of local data corruptions

This is part of a FUD campaign by Oracle – who claim that SAN replication is not "database aware" and replicates corruptions caused by the local host stack. Actually this is (partly) true: SAN replication does not protect against local corruption and happily transfers any modifications (good or bad) to the data. Data Guard will not transfer corrupted data files as it only transfers redo log updates – and if these redo updates somehow become corrupted, they will not be applied on the standby database. SAN replication is designed as GIGO (garbage in, garbage out) and best practice is to a) avoid corruptions in the first place, and b) have other methods/strategies in place to detect and repair data corruptions.

For example, on EMC VMAX/VMAX 3, EMC supports T10-DIF (more on that and data integrity in general here) which offers end-to-end protection (both local and remote) against corruption (either using ASMLib or the new ASM Filter Driver in 12c).
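
On the database side, the usual knobs for detecting and limiting corruption look like this; defaults and CPU overhead vary per version, so treat these values as an illustration rather than a recommendation:

    ALTER SYSTEM SET db_block_checksum     = FULL;     -- checksum blocks on read/write
    ALTER SYSTEM SET db_block_checking     = MEDIUM;   -- logical block checks
    ALTER SYSTEM SET db_lost_write_protect = TYPICAL;  -- detect lost writes via redo/standby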

Preventing remote corruptions

This is the missing part of the FUD campaign – Data Guard protects against (some) corruptions caused at the primary site – however, nothing stops the standby host from writing bad data to its data files, because that is not in the information flow of Data Guard (or even the database). The problem is that corruptions on the remote standby go undetected until:

  • You force a scan for corruptions (DBVERIFY / RMAN VALIDATE) – and even these methods will not catch all problems (a minimal example follows this list);
  • Even worse, you need to fail over due to primary site problems and only then find the standby to be corrupted as well (at that time you have other issues to deal with!).
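
Such a forced scan might look like this (the file path below is a placeholder); note that it only finds corruption in the blocks it actually reads, which is exactly the point made above:

    -- RMAN scan; findings end up in v$database_block_corruption:
    RMAN> VALIDATE DATABASE;
    SQL>  SELECT * FROM v$database_block_corruption;

    -- Command-line alternative for a single datafile:
    $ dbv FILE=/u01/oradata/users01.dbf BLOCKSIZE=8192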

Because SAN replication does not depend on host I/O, it avoids such problems completely, much like Data Guard avoids transferring local corruptions.

My opinion on the matter is that you should avoid corruption in the first place and not be dependent on a D/R tool for solving them, but instead corruption detection and prevention should be part of the primary architecture, outside of D/R tooling.

Repair local corruptions from remote

Data Guard has an interesting feature – if you have locally corrupted blocks, it can (sometimes) automatically repair them by requesting good copies of those blocks from the standby database. Of course SAN replication cannot repair Oracle datafiles, as that would require deep integration – but in the case of SRDF it is possible to repair corrupted disk blocks from the remote mirror (also see the next item), as long as they were caused by bad disks rather than host I/O issues. FYI – this would only happen if normal corruption protection (checksums etc.) failed to prevent corrupted blocks on multiple disks in a RAID group – it is merely an illustration of how SRDF works compared to other tools.
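
For reference, the manual equivalent of this repair is RMAN block media recovery (file and block numbers below are placeholders); with Active Data Guard the same kind of repair can be triggered automatically when a query hits the corrupted block:

    RMAN> RECOVER DATAFILE 7 BLOCK 42;
    -- or repair everything currently listed in v$database_block_corruption:
    RMAN> RECOVER CORRUPTION LIST;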

Side note: it's a really nice feature to have – but note that it does not work if you have just failed over and the primary site is missing (i.e. Data Guard is down). So it depends on not being in a disaster or failover situation, which is probably its biggest flaw – considering Murphy's Law that problems never come alone. I experimented with this feature and found a few other limitations:

  • If your SYSTEM tablespace or control files have problems, then Data Guard is not going to help you out of the mess;
  • If, instead of a few corrupted blocks, you mess up an entire data file (I truncated one to half its original size just to see what happens), such serious problems cannot be fixed automatically either. Know your limitations.

Continue if local RAID group fails

If you experience a double disk failure in a RAID group (very rare, but it does happen occasionally), then only SRDF can continue without a failover – because the remote site is actually a RAID "mirror", not just a "copy".

This post first appeared on Dirty Cache by Bart Sjerps. Copyright © 2011 – 2015. All rights reserved. Not to be reproduced for commercial purposes without written permission.


8 thoughts on “Comparing DR features”

  • Hello Bart,

    Your comparison lacks some information regarding the use of Data Guard. Data Guard in combination with Flashback (in the EE license) allows you total flexibility. For instance, you can set a marker on a standby database, open it read/write, do all your testing and revert (flashback) to the original situation, and log shipping will pick up again. It's called a snapshot standby database.
    Regarding data/block corruption, there's an automatic procedure called block repair that retrieves a copy of the block from the counterpart or, as a last resort, retrieves the block from the RMAN backup.

    Summary:
    – Standby database access: read/write (snapshot standby)
    – Incremental re-sync after link failure: flashback standby/primary to PIT before corruption and reinstate Data Guard
    – Allows instant restore to older checkpoint: yes, see my remarks regarding flashback
    – Prevents remote data corruption: yes, through automatic block repair

    Another point that is not mentioned in your sheet: is it possible to utilize the mirror database for reporting purposes?

    Nonetheless, I agree with you that Data Guard is overpriced and that alternative solutions are well worth considering.

    1. Hi Gerd-Jan,

      Good point. Let’s see how I can add that to the list. However, say you have 7×24 operations and the RTO (fail-over time) is 15 minutes (reasonable for asynchronous replication). Then you test the standby using the method you indicate. If I understand the documentation correctly, this means:

      a) set a marker and suspend archive/redo log apply on the standby
      b) do the read/write testing. Say this takes a few (2) hours.
      c) flashback (roll back) to the marker (I guess this takes a while depending on how much data was changed, say 15 minutes)
      d) the prod database is now 2 hours and 15 minutes ahead of the standby, so the standby has to suck up the difference in archive logs and apply them to the database. From what I know, archive log apply is pretty slow, so this might take another hour
      e) back in normal, protected operations after 3 hours 15 minutes

      During this 3h15m window, you can never meet the RTO of 15 minutes?

      What happens if production crashes 2 hours after you suspend the standby for DR testing? Don’t say this is unlikely – Murphy always strikes at the worst possible moment (maybe caused by the testing operations itself, for example a tester logs in to the wrong (prod) database and messes up)…

      Not to mention that Flashback logging requires a large flashback storage pool – probably on both primary and standby (which is why we like this feature at EMC 😉) and extra I/O and CPU to manage the logs (maybe just 1%, but remember, Oracle-licensed CPUs are $$$$)

      Like to hear your thoughts on this 🙂

      Thanks for commenting
      Bart

      1. You’re not bound to one standby database. OK i now it’s far fetched, but if you plan to do this regularly you could set up an additonal standby for that purpose.

        1. Hmm, so that would mean you have prod and 2 standbys: DR 1 and DR 2. You test failover procedures on DR 2 to avoid RTO issues on DR 1, just so you know failover to DR 2 works – but if shit hits the fan you're failing over to DR 1, the one you never test…?

          Might work if you flip-flop testing between both. But I still don't think of this as an architecturally clean solution 🙂
          Not to mention the additional server that needs to be licensed, plus additional IP network traffic, additional storage for the extra archivelog target, …

    2. On the remote data corruption issue; I wrote a blogpost on that which may be of interest. Few thoughts:

      – A standby database is usually not reading datafiles, just writing (ignoring Active DG for a sec). How would the database know that there is a corruption that needs repair? Because if it doesn't, the corruption can sit there for months going unnoticed (until you need to fail over to DR, of course)

      – I would say that even if it works, it's not prevention (the corruption still happens) but it gets fixed automatically (hopefully).

      – If you use ASM mirroring (certain Oracle engineered systems use this for example instead of hardware RAID) then what happens if you read from the “good” mirror (i.e. using RMAN validate) but the other mirror is corrupted?

      – What happens if the corruption is in the SYS tables? (I had a customer who experienced this; even Data Guard did not resolve it, RMAN could not fix it, and they had to do a full restore)

      1. Detection of corrupted blocks is triggered at query time, either on the primary or the standby. The automatic repair first tries to restore the block by retrieving it from the other site, and as a last resort from the RMAN backup. I cannot comment on the SYS tables, as I haven't encountered such scenarios.

        1. Exactly, so if you never do queries – or only queries that don't hit the corrupted blocks – the corruption can sit dormant for ages…
