Solving intermittent latency on Rover Search

A developer’s worst fear is a bug that is low volume and difficult to reproduce. Rover engineers recently solved an issue with these characteristics: occasional searches on the Rover website were experiencing an unexplained 400 milliseconds of additional latency. These showed up on our performance dashboards as erratic spikes in response time, a sign that real users were affected.

A famous study by Amazon found that every additional 100 milliseconds of page load decreases revenue by 1%.

Luckily, the slow searches our team observed were extremely rare, occurring fewer than 30 times per day. Because of the low volume, our initial investigations were time-boxed and unsuccessful.

However, as months went by with no definitive cause, we revisited the issue with new observability tools, which turned out to be the key to finding the root cause.

In this blog post, I’ll walk you through the entire process: the symptoms, the hypotheses, the tools we used, and how we resolved the issue.

I’m Preston, a Software Engineer on Rover’s Recommendations team. I focus on Rover’s backend and search systems to maintain a fast and reliable Search Page experience.

Context

A central component of the Rover platform is the search page, which enables pet parents to easily find sitters across 17 countries. Rover processes millions of search requests each week and relies on a Django web backend combined with an Elasticsearch cluster to serve low-latency, real-time search results.

At Rover, we use a central endpoint for our search traffic called the Search View. This view takes a user’s selected filters and processes them to return search results. Our engineers keep a close eye on the response time of this view to catch performance regressions in this critical code path.

Detecting the Regression

Every bug and performance regression investigation begins the same way: someone has to notice it! During our metric review meeting, it was clear we had a new issue on the Search View. Our initial understanding of this issue was:

The issue causes extremely slow searches
The issue is infrequent
The issue appears to be random
The issue started July 7th

Image 1: Graph showing new emerging issue after culprit PR was merged

This initial triage gave us the information needed to prioritize the issue and assess impact. Our findings indicated this issue impacted a small percentage of traffic to the Search View.

The issue could reside in our Elasticsearch cluster or in the Python Django web app, but we did not yet know where it was or what caused it.

Locating the Commit

Having a start date let us move on to the next step: determining the commit that introduced the performance regression. Thanks to the sudden spike on the graph, we were able to narrow down the culprit commit.

Not surprising to anyone, the code in the commit updated our client libraries for Elasticsearch to a new major version.

At this point we had two options:

Revert the commit – We considered this, but we were committed to upgrading to Elasticsearch 8 and didn’t want to yield progress on our upgrade due to a minor performance regression.
Fix forward – Because the issue only affected a small percentage of searches, we decided to move ahead, investigate, and fix the root cause.

Making a Theory, Testing It

In the meeting room upon discovering this, ideas rang out!

“Elasticsearch 8 might have a bug”
“Some searches include an unoptimized filter combination”
“Our webapp has an issue serializing with the new libraries”

Hint: It wasn’t any of these

These were nothing more than theories, so how do we begin testing and validating these ideas? Below are the four tools our team used in our investigation that led to an answer:

OpenTelemetry Traces – to visualize where time was spent inside the request
Metrics & Logs – to capture and observe timing in the Django Web App and Elasticsearch
Reproducing – using a test environment to reproduce the latency
py-spy – a sampling profiler to capture what Python was actually doing

Each of these tools revealed a key part of the puzzle, and in the next sections you’ll see exactly how we used them.

Analyzing The Request: APM Trace

If you’ve never used an APM Trace, you are missing out. This tool lets you visualize your request by showing where your request spent its time moving through your web app. Below is an APM Trace showing our issue clearly.

Image 3: OpenTelemetry Trace showing unusually slow POST calls to Elasticsearch.

Woah, something looks off! Why is a single POST call taking 400ms? Normal round trips to the Elasticsearch cluster are approximately 50 milliseconds. This was a clue that the issue was caused by one of three things:

Elasticsearch was slow to process the request
Infrastructure or network issues
Python code sending or processing the POST had issues

The quickest one to check was whether the Elasticsearch cluster was the culprit. Perhaps a large chunk of that POST was spent chewing away at a search request with malformed filters?

Timing Elasticsearch: Metrics and Logs

The next logical step was to monitor how long a search takes to complete on our Elasticsearch cluster. Luckily for us, the Elasticsearch API response includes a `took` field, which measures the time in milliseconds that Elasticsearch spent on the request.

The `took` time was logged to Splunk and emitted as a metric to Datadog. Using this new data point, we confirmed that the searches experiencing the additional 400ms slowness all had normal, speedy Elasticsearch `took` times. This confirmed the issue was not inside the Elasticsearch cluster!
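The comparison described here can be sketched roughly as follows. This is a minimal illustration, not Rover's actual instrumentation: `run_search` and `fake_slow_search` are hypothetical stand-ins for the real Elasticsearch client call, and the only detail taken from Elasticsearch itself is the `took` field (in milliseconds) in the search response.

```python
import time


def split_search_latency(run_search):
    """Time one search call and split the wall-clock time into
    Elasticsearch's self-reported `took` and everything else
    (network, serialization, client-side Python)."""
    start = time.perf_counter()
    response = run_search()  # hypothetical: returns the search response as a dict
    round_trip_ms = (time.perf_counter() - start) * 1000.0

    took_ms = response["took"]  # time Elasticsearch spent on the request, in ms
    return {
        "round_trip_ms": round_trip_ms,
        "took_ms": took_ms,
        "outside_es_ms": round_trip_ms - took_ms,
    }


def fake_slow_search():
    # Stand-in for a real client call: the cluster reports a fast 12 ms,
    # but the caller waits about 50 ms before the response is available.
    time.sleep(0.05)
    return {"took": 12, "hits": {"hits": []}}


timings = split_search_latency(fake_slow_search)
print(timings)
```

When `outside_es_ms` is large while `took_ms` stays small, the time is being lost somewhere other than the cluster, which is the pattern described above.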

This Isn’t Random: Reproducing In Test Environment

Upon learning the source of the issue was either in the network or our Django Web App, we took another look at the original metric that visualized the issue. Something stood out:

The frequency of slow searches increased during working hours
Slow searches were rare/non-existent on weekends

Image 4: Graph showing slow searches line up with code deployments.

The above chart shows deployments of our webapp during the work day as red ticks at the top. These line up with our slow searches!

This was the breakthrough!

The new working theory was that something in the deployment process was briefly causing 400ms of latency on searches.
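A correlation like this can also be checked numerically rather than by eye. The sketch below is an assumption-laden illustration: the timestamps are made up, the 10-minute window is an arbitrary choice, and `fraction_near_deploys` is a hypothetical helper, not something from our tooling.

```python
from datetime import datetime, timedelta


def fraction_near_deploys(slow_search_times, deploy_times, window=timedelta(minutes=10)):
    """Return the fraction of slow searches that happened within `window`
    after any deployment."""
    if not slow_search_times:
        return 0.0
    near = 0
    for slow in slow_search_times:
        if any(timedelta(0) <= slow - deploy <= window for deploy in deploy_times):
            near += 1
    return near / len(slow_search_times)


# Made-up timestamps, purely for illustration.
deploys = [datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 14, 30)]
slow_searches = [
    datetime(2024, 1, 8, 10, 2),
    datetime(2024, 1, 8, 10, 4),
    datetime(2024, 1, 8, 14, 31),
    datetime(2024, 1, 8, 20, 0),  # not near any deployment
]

share = fraction_near_deploys(slow_searches, deploys)
print(f"{share:.0%} of slow searches followed a deployment")
```

A share close to 100% would support the deployment theory, while a share close to the fraction of the day covered by the windows would suggest coincidence.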

We were able to reproduce this on staging and local development environments, but we still did not know why fresh instances of the web app had this issue, so we kept looking.

After some testing in our staging environments, we confirmed the very first search on a new pod or container experienced 400 milliseconds of additional latency!

This also explains why the issue was more prominent during the working hours of our US and EU development teams: that is when most code deployments happen.
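One simple way to test a "first request is slow" theory is to time the first call on a fresh process separately from the calls that follow. The sketch below assumes nothing about our real setup: `FakeSearchClient` is a hypothetical stand-in that simulates a one-time cost, and in practice `call` would be a real search against a freshly started pod or container.

```python
import statistics
import time


def first_vs_rest(call, repeats=5):
    """Time `call` repeatedly and return (first_ms, median_of_rest_ms)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations[0], statistics.median(durations[1:])


class FakeSearchClient:
    """Stand-in for a freshly started app process: the first search pays a
    one-time cost, later searches do not."""

    def __init__(self):
        self._warmed_up = False

    def search(self):
        if not self._warmed_up:
            time.sleep(0.2)  # simulated one-time work on the first call
            self._warmed_up = True
        time.sleep(0.01)  # simulated normal search time


first_ms, rest_ms = first_vs_rest(FakeSearchClient().search)
print(f"first: {first_ms:.0f} ms, median of the rest: {rest_ms:.0f} ms")
```

A large gap between the first measurement and the median of the rest points at one-time work on the first request rather than at the search itself.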

Locating The Root Cause: Py-spy Sampler

Py-spy is a sampling profiler that peeks into the call stack at 10-millisecond intervals and captures the current method being executed. We configured py-spy to run on new deployments of our Django Web App pod for ten minutes to glean more insight into these slow search requests.
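For reference, a profiling run like this can be expressed as a `py-spy record` command. The sketch below only builds the command line; the PID and output file name are hypothetical, and the values simply mirror the numbers mentioned here (100 samples per second is one sample every 10 ms, and 600 seconds is ten minutes). It is not the exact configuration we ran.

```python
import shlex


def build_py_spy_command(pid, output_path, duration_s=600, rate_hz=100):
    """Build a `py-spy record` command line for an already running process.

    rate_hz=100 means 100 samples per second, i.e. one sample every 10 ms.
    """
    return [
        "py-spy", "record",
        "--pid", str(pid),
        "--rate", str(rate_hz),
        "--duration", str(duration_s),
        "--output", output_path,
    ]


# Hypothetical PID and output file, for illustration only.
command = build_py_spy_command(pid=1234, output_path="search-profile.svg")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where py-spy is installed. Inside containers, attaching to another process usually requires extra permissions such as the `SYS_PTRACE` capability.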

To our surprise, we discovered the 400 millisecond latency was occurring inside a `warn_stacklevel` method inside the elasticsearch-dsl library.

We now had all the information we needed to solve this long-standing issue!

Image 5: py-spy chart showing the culprit, slow code inside elasticsearch-dsl.

Root Cause and Fix

Taking a closer look at the image above, we can see the root cause: the `warn_stacklevel` method in the elasticsearch-dsl library makes several calls to `os.path.realpath` and writes to disk. Disk I/O like this is extremely slow in the middle of a user’s search request cycle, and it explains why the ~400ms latency was so consistent across occurrences of this bug in production.
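To get a feel for how path resolution can add up, here is a rough, standalone measurement. It does not reproduce the library's code: it simply times `os.path.realpath` over the files of every module currently imported in the process, under the assumption that a large web app has many modules and that a fresh container has a cold filesystem cache. The helper name is hypothetical.

```python
import os
import sys
import time


def time_realpath_over_loaded_modules():
    """Resolve the file path of every imported module with os.path.realpath
    and return (number_of_paths, elapsed_ms)."""
    paths = []
    for module in list(sys.modules.values()):
        path = getattr(module, "__file__", None)
        if isinstance(path, str):
            paths.append(path)

    start = time.perf_counter()
    for path in paths:
        os.path.realpath(path)  # may touch the filesystem for each path component
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return len(paths), elapsed_ms


count, elapsed_ms = time_realpath_over_loaded_modules()
print(f"resolved {count} module paths in {elapsed_ms:.2f} ms")
```

In a small script this finishes almost instantly. The number is only meaningful when measured inside a freshly started instance of the real application, where far more modules are loaded.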

Now knowing the exact method causing the performance degradation and the associated library, we went to the GitHub issues page for the library to find answers. Luckily, an issue existed and the fix was already in place!

You can read more in the original GitHub issue for the elasticsearch-dsl library here: https://github.com/elastic/elasticsearch-py/issues/3003

Our understanding was that the first search after initializing the connection to Elasticsearch emits a warning about the deprecated features you are using. This introduced a consistent 435 milliseconds of latency on the first search after every code deployment, which occurs 25+ times a day.

The fix from the Elasticsearch team was to add an environment variable in version 8.19 of the library to prevent this warning log from being emitted. After upgrading our library and using the new environment variable, we no longer experienced our mystery slowness for searches on Rover.

export DISABLE_WARN_STACKLEVEL=1

Takeaways

Look for patterns, even when behavior seems random.
In our case, correlating latency spikes with deployments revealed the trigger for the slowdown.
Validate each layer independently.
Comparing APM traces, application timings, and Elasticsearch `took` timing logs allowed us to rule out the search cluster early.
Reproduce issues in controlled environments.
Once we suspected a startup-related path, staging environments made the behavior reproducible on demand.
Use profiling tools to confirm actual execution paths.
py-spy helped us capture the exact function causing the extra latency, removing ambiguity.
Always be cautious of I/O.
Interacting with the disk should be avoided during a user’s request cycle.