Things go wrong! It is a fact of life that things fail all the time, whether due to hardware failure, human error or natural disaster. A key realisation in recent years has been that systems should be architected to be resilient to failure, and we have firmly embraced this philosophy.
We make exclusive use of Amazon Web Services (AWS) as the hosting environment for the PPO service. AWS is by far the largest cloud infrastructure provider, with data centres located all over the world. These data centres are sited outside flood- and earthquake-prone areas and are designed to avoid single points of failure in their power supply and network connections. AWS also does most of the heavy lifting in providing the basic infrastructure on which to build highly resilient solutions.
Highly available services
One of the simplest ways we ensure high availability of our service is to build on highly available building blocks. These building blocks have been designed to be resilient in the face of hardware failures, communication failures and even the loss of an entire data centre. An example is the AWS S3 object storage service, which is designed to provide 99.999999999% (eleven nines) durability by storing data redundantly across multiple devices and multiple data centres. We make extensive use of S3 to store database backups, documents and log files. Other highly available AWS services we use include Route53 (domain name service), ELB (load balancing), DynamoDB (session storage) and WAF (web application firewall).
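As a back-of-the-envelope illustration of what eleven nines of durability means (a sketch of the arithmetic only, not an AWS formula or guarantee):

```python
# Back-of-the-envelope: expected annual object loss at eleven nines
# of durability. Illustrative arithmetic only -- not an AWS formula.

def expected_annual_loss(objects_stored: int,
                         durability: float = 0.99999999999) -> float:
    """Expected number of objects lost per year at the given durability."""
    return objects_stored * (1 - durability)

# Storing ten million objects, you would on average expect to lose a
# single object roughly once every ten thousand years.
print(expected_annual_loss(10_000_000))  # ~0.0001 objects per year
```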
It is said that in the cloud era you should treat your servers like cattle, not pets: if a server starts misbehaving, you replace it with another one without a second thought. PPO makes use of EC2 (virtualised servers) which are provisioned from pre-configured machine images. On our application layer we scale these servers up and down based on load, with the result that no individual server exists for longer than a few days. If our monitoring systems pick up a problem with a particular server, we can replace it with another one within minutes, without any downtime for our users.
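The "cattle, not pets" approach boils down to a simple reconciliation rule: anything unhealthy gets replaced, not repaired. A minimal sketch of that rule (the data model and instance IDs here are hypothetical; our actual tooling is built on AWS):

```python
# Sketch of a "cattle, not pets" reconciliation step: any unhealthy
# server is scheduled for replacement rather than repair.
# Hypothetical data model -- for illustration only.

def plan_replacements(fleet: dict[str, str]) -> list[str]:
    """Given {instance_id: health_status}, return the instances to
    terminate and re-provision from the pre-configured machine image."""
    return [iid for iid, health in fleet.items() if health != "healthy"]

fleet = {"i-0a1": "healthy", "i-0b2": "unhealthy", "i-0c3": "healthy"}
print(plan_replacements(fleet))  # ['i-0b2']
```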
Our database servers are also regularly re-provisioned, with databases moved to new infrastructure through a semi-automated live migration process. This forms part of our database server patching and balancing/capacity management processes.
Automation and software defined infrastructure
Automation is at the heart of being able to respond rapidly to failures. We therefore continuously strive to automate as many of our operational processes as possible, including the provisioning of new servers, code builds and deployments, and backups.
PPO relies heavily on a "software defined infrastructure" (using AWS CloudFormation) to be able to rapidly re-instantiate our entire infrastructure from source-controlled configuration scripts. This also allows us to recreate the full environment in an alternate data centre should the need arise.
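The core idea of software defined infrastructure is that the environment is described as plain, source-controlled text rather than configured by hand. A minimal sketch of that idea, expressed here as a CloudFormation-style template built in Python (the resource names, AMI ID and instance type are illustrative, not our actual stack):

```python
import json

# Minimal sketch of software defined infrastructure: the environment
# is described as data, kept in source control, and can be replayed
# to recreate the stack. Resource names and properties are
# illustrative only, not our actual configuration.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "AppServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {"ImageId": "ami-12345678",
                           "InstanceType": "t3.small"},
        },
        "BackupBucket": {"Type": "AWS::S3::Bucket"},
    },
}

# Because the template is plain text, it can be diffed, reviewed and
# replayed into another region to rebuild the infrastructure.
print(json.dumps(template, indent=2))
```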
Monitoring and alerting
The first step in responding to a problem is being able to detect it. The PPO service is independently monitored 24x7 to ensure that it is available, and the technical team is automatically alerted if an outage is detected, which kicks off an incident response process.
We also make extensive use of AWS CloudWatch to log, monitor and alert on all aspects of our infrastructure, so that we can catch potential problems before they become real ones.
Zero maintenance downtime
PPO has a policy of never taking the service down for scheduled maintenance; all maintenance therefore has to be done while the service is running. Although this adds complexity, it has the side benefit of forcing us to build redundancy into the architecture. If, for example, we need to apply operating system patches, the patches are applied to the machine images mentioned previously, and servers built from those images are then phased into the load balancer pool.
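Phasing patched servers into the pool works because each new server is added before an old one is retired, so serving capacity never drops. A toy model of that one-at-a-time swap (hypothetical server names; our actual process uses AWS load balancing):

```python
# Toy model of phasing freshly patched servers into a load balancer
# pool one at a time, so serving capacity never drops.
# Hypothetical server names -- for illustration only.

def rolling_replace(pool: list[str], patched: list[str]) -> list[list[str]]:
    """Return the pool state after each step of a one-at-a-time swap."""
    states = []
    pool = list(pool)
    for new in patched:
        pool.append(new)   # add the patched server first...
        pool.pop(0)        # ...then retire an old one
        states.append(list(pool))
    return states

steps = rolling_replace(["old-1", "old-2"], ["new-1", "new-2"])
# The pool holds two servers at every step: no downtime for users.
print(steps)  # [['old-2', 'new-1'], ['new-1', 'new-2']]
```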
Backups
Database backups are fully automated and performed on an hourly basis. The backups are stored redundantly in AWS S3 across multiple physical locations, with lifecycle management in place to ensure that backups are retained for the defined retention period. Independent automated and manual processes are in place to verify the backups.
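Lifecycle management of this kind amounts to a simple rule: any backup older than the retention period is expired. A sketch of that rule (the 30-day period and dates here are illustrative; the actual retention period is defined by policy):

```python
from datetime import datetime, timedelta

# Sketch of lifecycle-style retention: backups older than the
# retention period are expired. The 30-day period and dates are
# illustrative -- the real retention period is defined by policy.

def expired_backups(backups: list[datetime], now: datetime,
                    retention_days: int = 30) -> list[datetime]:
    """Return the backup timestamps that have passed their retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [b for b in backups if b < cutoff]

now = datetime(2020, 3, 1)
backups = [datetime(2020, 1, 1), datetime(2020, 2, 15)]
print(expired_backups(backups, now))  # only the 1 January backup expires
```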
Incident response
When an alert is triggered, whether because the PPO service is unavailable or because another alarm threshold has been breached, an incident response process with set escalation procedures kicks off.
If users are impacted, we use Twitter (https://twitter.com/ppodevops) to keep them informed of the nature of the incident and the progress being made in resolving it. You can also see our uptime for the past 12 months, as well as the details of any incidents, at our incident page (https://www.go2ppo.com/incidents/) – this is also where any post-mortem reports are published.