OCI High Availability and Disaster Recovery
High Availability for Network Resources
- Computing environments configured to provide nearly full-time availability are known as high availability systems.
- Well-designed high availability systems avoid single points of failure by using redundant resources.
- When a failure occurs, the failover process moves the processing performed by the failed component to the backup component.
OCI Services and High Availability
- Availability domains (ADs) are isolated from each other, fault-tolerant, and very unlikely to fail simultaneously. They do not share physical infrastructure such as power or cooling.
- Fault domains (FDs) let you distribute your instances so that they do not run on the same physical hardware within a single AD. Each AD has three FDs.
- Load Balancer is a regional service used to distribute load across ADs.
- Storage services: Block Volume is an AD-level service. Volumes are replicated within their AD, and volume replication can be used to keep a copy outside that AD. Object Storage is a regional service and is highly available. File Storage is also highly available, and a file system can be shared by instances across ADs.
- Compute is an AD-level service. An Auto Scaling group (ASG) can be used to make a compute-based service highly available.
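The fault domain point above can be illustrated with a minimal sketch that spreads instances evenly over the three FDs of one AD. The instance names and the `assign_fault_domains` helper are hypothetical and exist only for illustration. The code calls no OCI API and only computes a placement plan.

```python
from collections import Counter
from itertools import cycle

# The three fault domains that exist in each availability domain.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def assign_fault_domains(instance_names, fault_domains=FAULT_DOMAINS):
    """Map each instance name to a fault domain in round-robin order.

    Hypothetical helper: it returns a plan and launches nothing.
    """
    return dict(zip(instance_names, cycle(fault_domains)))


# Five illustrative instances spread over three fault domains.
placement = assign_fault_domains([f"web-{i}" for i in range(1, 6)])
per_domain = Counter(placement.values())

for name, fd in placement.items():
    print(name, "->", fd)
```

With this kind of placement, losing the hardware behind one fault domain removes at most about a third of the instances, so the remaining ones can keep serving traffic.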
Best Practices
Networking
To provide high availability across availability domains, you can configure multiple private load balancers on Oracle Cloud Infrastructure and use on-premises or private DNS servers to set up a round-robin DNS configuration with the IP addresses of the private load balancers. The following is an overview of this process:
- Deploy two private load balancers, one in each availability domain.
- Configure two custom DNS VMs in the VCN.
- Modify the VCN Default DHCP options to use a Custom DNS Resolver and set the DNS servers to the IP addresses of the DNS VMs.
- Add a new round-robin DNS zone entry for the private load balancer FQDN with a low TTL.
- Add two A records with the IP addresses of the two private load balancers.
- Use the FQDN of the private load balancer when accessing the private load balancer.
- For connectivity between on-premises networks and OCI, the most robust option is to use multiple FastConnect connections with circuits from different network service providers.
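The round-robin DNS steps above can be sketched as a small in-memory model. It is an illustration, not a real DNS server. The class `RoundRobinZone`, the function `connect_with_failover`, the TTL value, and the two IP addresses are all assumptions introduced here. The model shows how two A records behind one FQDN alternate between queries, and how a client can fall back to the second private load balancer when the first is unreachable.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of a DNS zone entry that rotates its A records per query."""

    def __init__(self, records, ttl_seconds=30):
        self._records = deque(records)
        # A low TTL means clients re-resolve often and notice changes quickly.
        self.ttl_seconds = ttl_seconds

    def resolve(self):
        """Return the current record order, then rotate for the next query."""
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


def connect_with_failover(addresses, is_healthy):
    """Return the first address that passes the health check."""
    for ip in addresses:
        if is_healthy(ip):
            return ip
    raise ConnectionError("no healthy load balancer available")


# Hypothetical private IPs of the two load balancers, one per AD.
zone = RoundRobinZone(["10.0.1.10", "10.0.2.10"])

# Simulate the first load balancer being down.
target = connect_with_failover(zone.resolve(), lambda ip: ip != "10.0.1.10")
print("connected to", target)
```

Round-robin DNS alone only spreads requests. The client-side fallback, together with the low TTL, is what lets traffic move away from a failed load balancer.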
High Availability: https://docs.oracle.com/en/solutions/design-ha
Resilience & Availability
Unacceptable variance in performance (latency or throughput) can arise for any reason, including the following:
- Multitenant “noisy neighbors” (failure of QoS mechanisms)
- Inability to efficiently reject overload (accidental or malicious) while continuing to do useful work
- Distributed thrash, message storms, retry storms, and other expensive “emergent” interactions
- Cold-shock (empty caches) after power-cycle, particularly simultaneous power-cycle of multiple systems
- Overhead when scaling the system
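Two of the items above, rejecting overload and retry storms, have common mitigations that can be sketched briefly. The example below is a minimal illustration under stated assumptions, not a technique taken from the source. `AdmissionController` is a hypothetical class that rejects requests beyond a fixed concurrency limit so that admitted work can still finish. `backoff_delay` is a hypothetical helper that computes exponential backoff with full jitter, so that clients do not all retry at the same moment.

```python
import random


class AdmissionController:
    """Reject new work once a fixed number of requests is in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        """Admit the request if there is capacity; otherwise reject it cheaply."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Mark one admitted request as finished."""
        self.in_flight -= 1


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Exponential backoff with full jitter, in seconds.

    The delay is drawn uniformly from 0 up to min(cap, base * 2**attempt).
    """
    return rng.uniform(0, min(cap, base * 2 ** attempt))


controller = AdmissionController(max_in_flight=2)
decisions = [controller.try_acquire() for _ in range(3)]
print("admission decisions:", decisions)
print("delay before retry 3:", backoff_delay(3))
```

Rejecting early keeps the cost of an overloaded request low, and the randomized delay spreads retries over time instead of letting them arrive in synchronized waves.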