There are a couple of options when it comes to HA (High Availability) for Thor.
It is important to note that HA for Thor means a constantly processing ECL queue: essentially you have two Thors behind the same ECL queue. If a single component (Thor slave, Thor master, node) goes down, the other Thor continues to process, and assuming you have replication enabled it can read data from the backup location of the broken Thor. Other components (ECL Server, ESP) can run as multiples, and the remaining components (Dali, DFU Server, etc.) work in a traditional shared-storage HA failover model.
The downside: it costs twice as much, because you essentially need two of everything.
The upside: 97% of the time you can utilize the additional processing capacity: run more jobs, have more space, etc.
When it comes to DR (Disaster Recovery) capability, things get much trickier. It really isn't a matter of Thor capability so much as the bandwidth needed to replicate your data. If you are only dealing with deltas of tens of gigabytes per day, you will be fine with rsync-style replication or perhaps some hybrid model. If your deltas run from a few hundred gigabytes to petabytes, you are limited only by how fat your wallet is.

Usually customers find the point where the data is smallest (at ingestion, after normalization, at the Roxie), replicate from that point, and rerun the processing in both locations. The key to getting this right is to know your data flow. For instance, if you are ingesting say 20 TB of raw data daily and then rolling that raw data up, scoring it, indexing it, etc., you are likely better off replicating an intermediate dataset (what we call yogurt or base files) rather than replicating the large ingest. If the opposite is occurring (a small daily ingest that is then blown up in size), it is better to replicate the input and rerun the processing. Thor also has a "thor copy" capability, which copies data from one cluster to another; this can be driven through ECL.

Additionally, you may decide you don't necessarily need or want a "hot" DR Thor (this is what we do). In our case, the most common disasters are minor ones (a major switch outage, a total power-down, multiple fiber cuts) and cause only a relatively brief outage of less than a day. Since our Thors are responsible for creating data updates, we can take a day or a few to recover; the data just isn't as fresh as we like, but as long as the Roxies are replicated, money is still flowing. We have decided that in the case of a major disaster (an airplane hitting the building), the likelihood of it occurring does not justify the cost of preventing it, and we could recover in 7-14 days by building out a new Thor.
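The "thor copy" mentioned above can be driven from ECL via the Standard Library's file-copy function. Here is a minimal sketch; the logical filename, destination cluster name, and source Dali address are all hypothetical placeholders you would replace with your own:

```
IMPORT STD;

// All names and addresses below are hypothetical placeholders.
srcFile  := '~base::yogurt::20240101';  // intermediate ("yogurt"/base) file
dstGroup := 'thor_dr';                  // Thor cluster in the DR location
srcDali  := '10.0.0.100';               // Dali of the source environment

// Copy the logical file to the DR cluster, keeping the same logical name
// and overwriting any previous copy. Omitted parameters take their defaults.
STD.File.Copy(srcFile, dstGroup, srcFile,
              srcDali,
              /* timeOut        */ ,
              /* espServerIpPort */ ,
              /* maxConnections  */ ,
              /* allowOverwrite  */ TRUE);
```

Running this from the DR side with the primary's Dali as the source pulls the data across, which fits the "replicate the smallest intermediate dataset" approach described above.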
Disaster recovery is always a calculation: (the cost of failure × the likelihood of the event occurring per year) compared against the cost of preventing it.
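As a worked example of that calculation, with purely hypothetical figures:

```
// Hypothetical figures; plug in your own estimates
REAL costOfFailure     := 2000000;  // cost of losing the Thor entirely
REAL likelihoodPerYear := 0.02;     // estimated chance of that event per year
REAL costToPrevent     := 500000;   // annual cost of a hot DR Thor

REAL expectedAnnualLoss := costOfFailure * likelihoodPerYear;  // 40000

OUTPUT(IF(expectedAnnualLoss > costToPrevent,
          'Prevention pays for itself',
          'Accept the risk and plan to rebuild'));
```

With these numbers the expected annual loss (40,000) is well under the prevention cost, which is the same reasoning we used to accept a 7-14 day rebuild for the airplane-hits-the-building scenario.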
For Roxie, have multiple Roxie clusters and use a proxy to redirect. As for keeping the data in sync, we generally use a pull approach: the Roxie automatically pulls the data it needs from the "source" listed in the package file. This can be another Roxie or a Thor. In most cases we pull to our DR Roxies from our primary Roxie behind the load balancer, but it can pull from a Thor in the primary location as well.
Replication of some components (ECL Agent, ESP (ECL Watch), DFU Server, etc.) is pretty straightforward; they really don't hold anything to replicate. Dali is the big one. In Dali's case, you have Sasha as the local backup. The Dali files can be replicated off-site in a fairly mundane way using rsync, but a better approach is to use a synchronizing device (cluster WAN sync, SAN block replication, etc.), put the Dali stores on that, and let it replicate.
Unfortunately, there isn't a one-size-fits-all approach. This isn't an RDBMS, obviously, and special care, design, and planning are required to build an effective DR strategy that doesn't "over-synchronize" across slow WAN links but still provides an acceptable level of redundancy for the business needs.