Many web applications communicate with third party services through network calls to their APIs. Rover integrates with various such services to enable users to upload images, send SMS messages, make payments, and much more. We try to relegate network calls to asynchronous tasks whenever possible. Our outbound messaging system, for example, is entirely asynchronous. This allows us to deal with third party outages and other failures outside of our customer-facing web app and avoid direct user-facing impact. Unfortunately, some API calls must happen within the scope of a synchronous request/response cycle in our web app. One such case at Rover is charging an owner’s credit card when they book with a sitter.
This poses an availability risk from a site reliability standpoint. Third party services experience failures, even those as battle-hardened as AWS. If our infrastructure is not prepared to handle an outage of one of our external dependencies, we can experience cascading failures as server resources are exhausted waiting for long-running network calls to finish. Pending requests pile up while the server waits on API calls that will never complete successfully.
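To make the risk concrete, here is a minimal sketch of the kind of view we mean: a user-facing request that cannot complete until a third party responds. The payment client, URL, and view below are hypothetical illustrations, not Rover’s actual code.

```python
# Hypothetical Django view illustrating a synchronous third party call inside
# the request/response cycle: the response to the user cannot be sent until
# the payment API answers, so a slow or unavailable provider holds this
# request (and the worker serving it) open.
import requests
from django.http import JsonResponse

PAYMENT_API_URL = "https://payments.example.com/charge"  # hypothetical provider


def book_stay(request):
    charge = requests.post(
        PAYMENT_API_URL,
        json={"amount_cents": 5000},
        timeout=10,  # without a timeout, an outage could hold the worker indefinitely
    )
    charge.raise_for_status()
    return JsonResponse({"status": "booked"})
```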
War Stories
As with most site reliability concerns, this availability risk is far from hypothetical. On my second day as an engineer at Rover, AWS’s S3 service experienced a near-complete outage in the Northern Virginia region. At the time, our customer-facing web app had a number of synchronous dependencies on S3. As soon as it went down, we saw this scary graph:
The response time of our web app started skyrocketing! Digging deeper, we saw requests piling up at our routing layer, waiting for a response from our application servers:
Most telling, we noticed that the number of web workers available to serve requests had dropped to zero, and the number of busy workers (typically very low) had jumped to… well, all of them:
The issue was clear. Our servers ran out of resources to serve incoming requests, resulting in massively increased response times and a precipitous drop in volume to our web app as the failure cascaded across our servers:
Learning from our Mistakes
While uncommon, Rover does experience outages, like any major web service. We follow the well-known and effective “five whys” outage post-mortem process to understand the root cause of failures and correct any deficiencies in our infrastructure or process to prevent them from recurring. One of the takeaways of our post-mortem for this incident was that our web app deployment configuration was not well-equipped to handle the failure of a synchronous third party dependency.
We run a Django web app on AWS behind uwsgi and nginx on our application servers in production. The uwsgi process runs locally, and a “sidecar” nginx process runs on each application server, proxying external requests to the local uwsgi server. All application servers sit behind a load balancer so we can easily scale our infrastructure horizontally. This setup is quite common for Python web app deployments. You can see a simple diagram of our deployment setup below:
We were running our uwsgi web workers as processes with fixed-size thread pools, where each thread can serve a single request. Blocking I/O, such as a network call to a third party service, releases Python’s Global Interpreter Lock and lets the other threads in the pool continue executing, allowing for limited concurrency within a uwsgi web worker process. However, because the size of the thread pool is fixed, there is a hard limit on the number of requests that can be served concurrently. Thus, if the I/O blocks for a long time, such as a network call to a third party service experiencing a complete outage, the server rapidly runs out of slots for requests and ceases to do any work.
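The failure mode is easy to reproduce in miniature. The sketch below is only an analogy using Python’s standard library, not uwsgi internals: once long-blocking tasks occupy every slot in a fixed-size pool, a fast request queued behind them has to wait for the pile-up to clear.

```python
# Analogy only (not uwsgi internals): a fixed-size worker pool saturated by
# long-blocking "requests" leaves no slots for the fast ones behind them.
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(is_problem):
    # A "problem" request stands in for a call to a third party that is not
    # responding; a regular request finishes quickly.
    time.sleep(30 if is_problem else 0.05)
    return "ok"


pool = ThreadPoolExecutor(max_workers=4)  # fixed-size pool, like our thread pool workers

# Four long-blocking requests occupy every slot...
for _ in range(4):
    pool.submit(handle_request, True)

# ...so this fast request waits roughly 30 seconds for a free slot before it
# is served at all.
fast = pool.submit(handle_request, False)
print(fast.result())
```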
What could we do? We knew our deployment configuration posed an availability risk in the face of a third party dependency failure, so we needed to investigate alternative approaches at the infrastructure level to mitigate this risk and harden our web app to external outages.
Asynchronous I/O
The root cause of the issue was that web worker threads cannot serve additional requests while they are waiting for long, blocking I/O to complete, which happens when a dependency external to the application servers is experiencing an outage or duress. One alternative to fixed-size thread pools is to serve requests using asynchronous I/O. In this model, an “event loop” listens for incoming requests and spawns lightweight execution contexts called “microthreads” or green threads, which let the server keep doing work (namely, serving requests) while waiting on blocking I/O. This prevents a third party service outage from exhausting the resources on your servers and can isolate the impact to only the endpoints that depend on that third party service. A number of WSGI HTTP servers support asynchronous I/O, typically through gevent, a Python networking library built on green threads. Luckily, uwsgi is one of them!
This diagram of async I/O is courtesy of https://eng.paxos.com/python-3s-killer-feature-asyncio
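At the application level, most of the work is making blocking socket calls cooperative so they yield to gevent’s event loop. Below is a minimal sketch of a WSGI entry point with gevent monkey patching applied before anything else is imported; the module and settings names are hypothetical, and uwsgi can also apply the patching itself when running with its gevent loop.

```python
# wsgi.py: a minimal sketch (module and settings names are hypothetical).
# When running under uwsgi's gevent loop (e.g. `uwsgi --gevent 100 --module wsgi`),
# blocking calls in application code must yield to the event loop, which is
# what gevent's monkey patching provides. It has to run before any module
# that touches sockets is imported.
from gevent import monkey

monkey.patch_all()

import os

from django.core.wsgi import get_wsgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "testapp.settings")
application = get_wsgi_application()
```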
A change to our deployment configuration to support asynchronous I/O would be a major undertaking. Except for minor tweaks, we had not fundamentally altered the way we deployed our web app in quite some time. This refactor had the potential to affect our entire infrastructure and would have to be handled carefully. Before embarking on such a fundamental shift in our operational environment, we wanted to be absolutely confident that moving to asynchronous I/O was the right decision. Furthermore, we wanted to ensure there would be no downtime if we decided to go this route.
Measure Twice, Cut Once
To achieve high confidence that switching our web workers to use gevent was the right decision, we designed an experiment to test the ability of asynchronous I/O to withstand a third party dependency failure. We hypothesized that asynchronous, gevent-based web workers would outperform synchronous, thread pool-based web workers when the infrastructure was under duress due to an external service outage. To test this hypothesis, we wanted to simulate the conditions our web app experienced during the S3 outage and measure the performance of uwsgi with gevent-based web workers against our existing deployment setup, which would serve as the control.

The idea of the experiment was to send traffic to a simple test app; simulate the conditions experienced during the S3 outage by forcing a small percentage of traffic to time out; and measure the key site reliability metrics we care about: throughput (requests completed per second), error rate, and response time. Our hypothesis was that throughput would be higher, and error rate and response time lower, for the web workers running gevent.
The Test Application
Our simple test app was designed to be a simulation of our production application as it would behave during the S3 outage. It exposed two endpoints:
- A “regular” endpoint which would complete after a short period of time
- A “problem” endpoint which was configured to make any request to it time out
Requests to the “problem” endpoint would represent requests to endpoints in our production web app with a synchronous dependency on S3. By having this endpoint sleep for a long period of time, we would simulate a web worker waiting on long, blocking network I/O, as it would in production during the outage.
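Here is a minimal sketch of what those two endpoints could look like as Django views; the names and sleep durations are illustrative, not the exact values we used.

```python
# views.py: an illustrative sketch of the test app's two endpoints.
import time

from django.http import HttpResponse


def regular(request):
    # Completes after a short period of time, standing in for a healthy endpoint.
    time.sleep(0.05)
    return HttpResponse("ok")


def problem(request):
    # Sleeps far longer than any sensible client timeout, standing in for an
    # endpoint blocked on a synchronous call to a failing dependency like S3.
    time.sleep(60)
    return HttpResponse("too late")
```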
This test application was deployed to a single server with a uwsgi configuration closely matching our production configuration. We instrumented the experiment environment to capture our key reliability metrics on both the client and server sides, giving us a granular understanding of the results.
Outage Simulation
Each experiment run directed traffic at the test application by making web requests every second. A small percentage of traffic was configured to hit the “problem” endpoint to mimic the behavior of our production web app during the outage: while most endpoints behaved fine, those with a synchronous dependency on S3 timed out and ate up server resources. The experiment environment slowly ramped up traffic over time, with request rates and problem-traffic percentages configured based on data from our production web app during the S3 outage. This gave us additional confidence that the outcome would be realistic and not the result of confounding variables.
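For reference, a stripped-down sketch of such a traffic generator is below. The host, request rate, and problem-traffic fraction are placeholders; in the real experiment these values were derived from production data during the S3 outage and ramped up over time.

```python
# load_test.py: an illustrative traffic generator (host and rates are placeholders).
import random
import time

import requests

BASE_URL = "http://test-app.example.internal"  # hypothetical test server
PROBLEM_FRACTION = 0.05  # small share of traffic hits the "problem" endpoint


def send_one():
    path = "/problem/" if random.random() < PROBLEM_FRACTION else "/regular/"
    start = time.monotonic()
    try:
        ok = requests.get(BASE_URL + path, timeout=30).ok
    except requests.RequestException:
        ok = False
    # In the real experiment, client-side timings and outcomes like these fed
    # our metrics pipeline; here we just print them.
    print(path, ok, round(time.monotonic() - start, 3))


while True:
    send_one()
    time.sleep(1)  # one request per second; the real harness ramped this rate up
```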
Results
We tested a few different configurations to evaluate what would be the most reliable setup for our infrastructure:
- uwsgi with thread pool web workers (our control)
- uwsgi with async web workers using gevent
- gunicorn with async web workers using gevent
We included gunicorn, an alternative WSGI server, to see if its performance would be noticeably different from uwsgi’s. We ramped up request volume according to the following load pattern:
Let’s look at throughput first. Ideally, the server handles the requests just fine, even at high volume, and continues to serve the requests that are not being sent to the “problem” endpoint. The uwsgi web workers using a thread pool predictably fell apart once enough long-running requests filled up all the available slots:
As we ramped up the request volume, a few requests continued to make it through, but the fixed-size thread pool prevented ideal throughput and the client saw significantly degraded performance across the board. uwsgi with asynchronous gevent workers, on the other hand, performed marvelously:
Error rates skyrocketed with the thread pool uwsgi configuration as soon as its slots filled with long, blocking I/O requests:
While some level of failures is expected (from the requests to the “problem” endpoint, which will eventually time out and fail), the failure rate should not grow as high as it did here. The asynchronous I/O configuration maintained a relatively constant error rate, even as the traffic volume ramped up to its maximum:
Finally, the average response time over the duration of the experiment was an order of magnitude higher with the thread pool workers, as most requests timed out waiting for an available slot. The gunicorn configuration performed comparably well in all areas, although we opted to stick with uwsgi to simplify the rollout of asynchronous I/O to our infrastructure.
Conclusion
The results of the experiment gave us high confidence that rolling out gevent to our infrastructure would harden us against third party service outages. Because this change was nontrivial and had the potential to be highly impactful, we wanted to be as sure as possible that it was the right decision to make. This experiment was time consuming, but the upfront effort paid dividends by eliminating uncertainty and risk. My team is responsible for, among other areas, site reliability and performance; our goal is to constantly improve the resilience and stability of our systems. We must always balance our agility as a team with the risk inherent in major production changes. We rolled out gevent to our infrastructure slowly, one server at a time, carefully monitoring our key site reliability metrics and error reporting. The rollout was smooth and free of impact, with no downtime. More than a year later we’re still going strong; our servers no longer experience resource exhaustion when our third party dependencies have blips.
Here are some things we learned from this process:
- Upfront investment in design and planning is well worth the effort, especially when site reliability and stability are concerned. We made sure we were going down the right path and avoided production downtime by spending a lot of time thoroughly vetting our approach.
- Trust but verify. Don’t rely solely on intuition or make assumptions – gather data whenever possible. Not only will it help justify the actions you take, but when you look back at an architecture decision a year later it will help you understand what led you to that decision and what problems you were trying to solve.
- Asynchronous I/O is not a panacea! It can improve concurrency, but applying it blindly will not magically give you better resource utilization or improve your performance. It’s useful when you have the potential for your server resources to get tied up waiting for long-blocking I/O, but there are plenty of pitfalls to be aware of.
- Building a company culture that encourages engineers to spend time making sure they’re doing the right things is valuable. I was able to propose this experiment, get feedback, and collect the results with full support from my manager. Rover has a great collaborative culture with an emphasis on engineering excellence that makes me excited to work here (and we’re hiring!).