Evolving Experimentation at Rover: Navigating the Build-vs-Buy Decision

At Rover, we lean heavily on feature flagging and experiments to govern new features and mitigate potential risk. Feature flags allow us to make incremental code changes gated behind a “switch” so that we can roll them out safely and roll them back when necessary.

Experiments allow us to do much of the same but with rich exposure logging and analysis to help us understand and assess the impact a given change has had.

What are “Feature Flags” & “Experiments”?

Feature flags can be simple, representing global “on” or “off” toggling regardless of context, or they may be complex, taking in relevant context to configure rules governing flag behavior (like rolling something out to US users only).

A simple feature flag example
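As a minimal sketch of the two kinds of flag described above (a global toggle and a context-aware rule), the following uses plain Python. The `User` and `FeatureFlag` classes, the flag names, and the `search_experience` helper are hypothetical, made up for illustration; they are not Rover's framework or any vendor's SDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class User:
    """Hypothetical context object passed to a flag check."""
    user_id: str
    country: str


@dataclass
class FeatureFlag:
    """Illustrative flag: a global switch plus an optional targeting rule."""
    name: str
    enabled: bool = False
    rule: Optional[Callable[[User], bool]] = None

    def is_on(self, user: User) -> bool:
        # The global switch always wins: a disabled flag is off for everyone.
        if not self.enabled:
            return False
        # No rule means a simple flag: on regardless of context.
        if self.rule is None:
            return True
        # Otherwise the rule decides based on the user's context.
        return self.rule(user)


# A simple flag: on for everyone.
simple_flag = FeatureFlag("new_checkout", enabled=True)

# A complex flag: on for US users only.
us_only_flag = FeatureFlag(
    "new_search_ranking", enabled=True, rule=lambda user: user.country == "US"
)


def search_experience(user: User) -> str:
    """Gate a code path behind the flag so it can be rolled back by config."""
    return "new" if us_only_flag.is_on(user) else "old"
```

Calling `search_experience(User("u1", "US"))` takes the new code path, while a user with any other country keeps the old one. Setting `enabled=False` turns the change off for everyone without a code change to the gated path.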

Progressive Rollouts, Immediate Rollbacks

Beyond simple targeted rollouts for new features or behavior, feature flags allow us to intentionally expose code changes to a small percentage of users when a change carries higher risk. This lets us smoke test changes in production without risking widespread impact by configuring a feature flag to only be “on” for, say, 10% of our users.

Another benefit of feature flags is that, should something go wrong, we can quickly toggle them off, mitigating any unintended issues.
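One common way to implement a percentage rollout is to hash the flag name and user ID into a stable bucket, so the same user always gets the same answer. The sketch below assumes that approach; the source does not say how Rover or any vendor buckets users, and `rollout_bucket`, `RolloutFlag`, and the flag name are hypothetical.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


class RolloutFlag:
    """Illustrative flag that is on for a configurable percentage of users."""

    def __init__(self, name: str, percent: int) -> None:
        self.name = name
        self.percent = percent  # 0 means off for everyone, 100 means on for everyone

    def is_on(self, user_id: str) -> bool:
        # Users whose bucket falls below the threshold see the change.
        return rollout_bucket(self.name, user_id) < self.percent

    def kill(self) -> None:
        # Immediate rollback: no bucket is below 0, so nobody sees the change.
        self.percent = 0


flag = RolloutFlag("new_booking_flow", percent=10)  # smoke test with ~10% of users
```

Because the bucket depends only on the flag name and user ID, raising `percent` from 10 to 50 keeps the original 10% in the rollout and adds new users on top. Calling `flag.kill()` turns the change off for all of them at once.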

Experiments

In this context, “experiments” are the technical solution that allows us to validate how product changes impact our users and business metrics. They operate much like feature flags, allowing us to target specific segments of our user base, but also include segmentation logic (A/B/n tests) and exposure logging (tracking when a user interacts with an experiment). With this data, we can track high-level trends in how users interact with a given treatment compared to the control.
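The two pieces that set an experiment apart from a flag, variant assignment and exposure logging, can be sketched as follows. This is an illustration under assumptions: the `Experiment` class, its hash-based assignment, and the in-memory `exposures` list are hypothetical stand-ins, not Rover's framework or a vendor's API. A real system would send exposures to an event pipeline or data warehouse.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """Illustrative A/B/n experiment with deterministic assignment."""
    name: str
    variants: List[str]  # e.g. ["control", "treatment_a", "treatment_b"]
    exposures: List[Dict[str, str]] = field(default_factory=list)

    def assign(self, unit_id: str) -> str:
        # Hash the experiment name with the unit ID so assignment is stable
        # per unit and independent between experiments.
        digest = hashlib.sha256(f"{self.name}:{unit_id}".encode("utf-8")).hexdigest()
        return self.variants[int(digest[:8], 16) % len(self.variants)]

    def get_variant(self, unit_id: str) -> str:
        # Record an exposure each time a unit actually encounters the experiment.
        variant = self.assign(unit_id)
        self.exposures.append(
            {"experiment": self.name, "unit_id": unit_id, "variant": variant}
        )
        return variant


experiment = Experiment("search_ranking_v2", ["control", "treatment_a", "treatment_b"])
variant = experiment.get_variant("user-42")
```

Separating `assign` from `get_variant` is a deliberate choice in this sketch: analysis compares only units that were actually exposed, so the log is written at the point of exposure rather than at assignment time.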

At Rover, we rely on data and analysis to drive decision-making. As such, experimentation is a critical part of our technical product lifecycle.

What Came Before

Why did we have an in-house system? Why did it work? Why didn’t it work?

Rover’s original feature flagging framework was based on a fork of the now-deprecated Gargoyle framework. This framework was built for Django’s ORM, which was our primary use case at the time.

Initially, Gargoyle was a great candidate to support Rover’s needs given the pace of our early startup growth. Over time, we built very Rover-specific wrappers and condition sets that allowed for powerful targeting on specific objects and attributes. We then expanded further with an experimentation framework built on top of Gargoyle flags.

Our in-house experimentation framework grew quickly. It came to support consistent experiment experiences across authenticated and unauthenticated users and across web and native platforms, geographic-based splits (rather than user-based ones), and more.

Gargoyle served us incredibly well for many years, but cracks started to form as our business scaled and our analytical needs and technology stack grew.

Why Migrate to a New Platform?

In 2023, our technology organization kicked off a working group, dubbed the “Experimentation Guild,” to address an escalating trend of experimentation-related issues. This group brought together subject-matter experts (SMEs) from Engineering, Analytics, Data Science, and Product to triage and resolve common issues encountered during experiments.

The Experimentation Guild

What We Found

By carefully reviewing experimentation issues as they arose, we quickly identified some common trends:

- 60% of experiment issues were due to misconfiguration. Experiments were defined in code (with multiple classes representing different types), required extra effort to extend to the client side, and required even more effort to implement correctly in SSR.
- None of this complexity was abstracted from the engineer. Creating and implementing a new experiment required deep institutional knowledge and exposed a broad surface area for mistakes.
- Observability was weak. We exposed some core metrics, but it was difficult to quickly understand whether an experiment was working as expected. Issues were often found well into a launch, typically flagged by our Analytics team.

By 2024, we realized we were beginning to outgrow our in-house experimentation framework. Frequent bucketing issues (uneven splits, over/under-bucketing, double-bucketing users) were hard to debug and were eroding stakeholder confidence. Cross-platform experiments were difficult to build and execute, and observability gaps slowed issue detection. Addressing these issues internally would have required significant investment.
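Two of the bucketing issues named above, uneven splits and double-bucketed users, can be detected from exposure counts and logs. The sketch below shows one common approach: a chi-square test for sample ratio mismatch and a scan for units seen in more than one variant. It is not a description of Rover's or Statsig's diagnostics. The function names are hypothetical, and the 0.001 threshold is an assumed convention.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def srm_p_value(control: int, treatment: int, expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-way split."""
    total = control + treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi_sq = (
        (control - expected_control) ** 2 / expected_control
        + (treatment - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_sq / 2))


def has_sample_ratio_mismatch(control: int, treatment: int, alpha: float = 0.001) -> bool:
    """Flag a split as suspicious when the observed counts are very unlikely."""
    return srm_p_value(control, treatment) < alpha


def double_bucketed_units(exposures: Iterable[Tuple[str, str]]) -> List[str]:
    """Return unit IDs that were exposed to more than one variant."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for unit_id, variant in exposures:
        seen[unit_id].add(variant)
    return sorted(unit for unit, variants in seen.items() if len(variants) > 1)
```

For an intended 50/50 split, counts of 5,000 and 5,100 are consistent with chance, while 5,000 and 5,400 fall well below the 0.001 threshold and would warrant investigation before trusting the analysis.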

But Rover is not in the business of building feature flagging and experimentation frameworks. Was it wise to keep investing heavily in maintaining our own?

We ultimately decided that no, this was not a valuable investment. Feature flagging and experimentation are generic subdomains with mature, industry-standard vendor solutions. Rather than building a less robust in-house version, we decided to focus our resources on core subdomains that directly create business value.

This decision also aligned closely with one of our guiding technical principles at Rover: “innovate where it counts.” This principle is meant to keep us intentional about preferring proven, industry-standard technologies over custom solutions when sensible.

All that remained was to determine which solution would work best for us.

This is More Than a Technical Discussion

Making technical architecture decisions is hard. It’s even harder when the solution must support critical use cases across Engineering, Data Science, Analytics, and Product. It can be “easy” to identify a solution that works well for engineers, but every other stakeholder group has its own priorities and requirements.

We needed a decision-making framework to help us narrow our vendor list to those that could meet all of our critical needs.

What Matters Most?

We began our exploration in early H2 2024 with a list of eight vendors. We set a loose internal deadline of the end of 2024 to align with budgeting and planning cycles while maintaining momentum, but the priority was making the right decision, even if it meant slipping the deadline.

This forced us to clearly define our must-have criteria, cutting down decision paralysis and aggressively pruning the candidate list.

Our knowledge base broke requirements into “must have”, “nice to have”, and “must not have” categories. While most vendors handled the basics (security, scalability, etc.), discussions quickly consolidated around needs unique to Rover and pain points with our existing platform.

Our critical areas of support became:

- Warehouse-native support. The platform had to run on top of our data warehouse instead of the vendor’s cloud.
  - Our Data Science and Analytics teams needed to define nuanced business metrics and perform ad-hoc analysis directly from our warehouse.
- Low-complexity developer interfaces. With 60% of past experiment issues tied to misconfiguration, the platform had to be simple to implement and debug.
  - Reducing cognitive load on developers was key to lowering bug rates and improving time-to-detection when issues arise.
- SDK support for core platforms. The vendor had to provide SDKs for Python, React, React Native, iOS, and Android.
  - Minimizing custom code would reduce both maintenance costs and technical complexity.

The Evaluation Process

Our final three vendors each brought something powerful to the table. We ran small demos to validate integration with our stack, which proved pivotal in uncovering important (though not blocking) issues.

For example, one tool offered excellent support for identifying users across authenticated and unauthenticated sessions, which initially felt like a decisive advantage. But our demo revealed this feature required a synchronous network call that wouldn’t scale, changing our assessment.

These demos helped us refine our priorities and understand the true trade-offs between vendors.

As the deadline approached, discussions risked going in circles. To break through, we reframed the conversation: “Why would we pick Vendor X over Vendor Y?” This shift forced us to evaluate what actually mattered most and to separate “nice-to-haves” from true differentiators.

After significant effort (and some hand-waving in this post), we ultimately selected Statsig.

The Migration Process

With a vendor selected, the next challenge was migration: how to onboard Engineering, Data Science, Analytics, and Product from our in-house workflows to Statsig.

Start Small

Many stakeholders in the evaluation process were adjacent to our Marketplace group (two teams focused on search and heavy experiment users), making them a natural fit for early adoption.

After low-lift integration work, we set an initial goal: implement all new feature flags in Statsig. Not experiments yet – just feature flags. This early exposure surfaced integration gaps and helped us build an internal cohort of Statsig SMEs.

Expand to Experiments

Once Marketplace successfully adopted Statsig for feature flags, we expanded to experiments. While engineering onboarded, Analytics SMEs prepared our metrics catalog and best practices for the first round of experiments. These early tests highlighted integration and process gaps we could address quickly.

However, Marketplace could only generate so many experiments in a short time. To accelerate learning and surface more edge cases, we expanded adoption: all new experiments across Rover would now be implemented in Statsig.

Go Big

Rolling out to all teams sped up discovery of unknown issues and accelerated onboarding, but also created a surge of work to support and upskill the organization.

This decision proved pivotal for solidifying integrations and establishing organizational fluency, though it temporarily slowed some complex use cases.

Conclusion

So, the big questions: Did it work? What would you do differently? What advice do you have for others?

Did it Work?

“Did it work?” is tricky to answer definitively right now. Measuring the full impact of this migration will take time, as experiments often run for weeks or months. But here’s what we know today:

- Did we simplify developer tooling for experiments? Yes. Statsig’s SDKs are simple and well-documented. Experiment definitions now live in the Statsig Console, reducing developer burden and improving feedback loops.
- Do we have better experiment observability? Absolutely. Statsig diagnostics surface issues instantly and provide richer insights than our in-house framework ever did. Anecdotally, this has dramatically decreased time-to-detection for issues, though we need more data to quantify this.
- Are experiments “healthier”? So far, yes. At the time of writing, no Statsig experiment has had a bucketing issue that wasn’t caused by a bug (we had one React integration misconfiguration). By comparison, 20% of in-house experiments in the prior half had bucketing-related issues that affected analysis.

Beyond experiments, a huge win for our organization has been improved lifecycle management of feature flags and a reduction in stale flag volume (flags that should be cleaned up). Reducing stale flags not only cuts down on maintenance overhead but also lowers the risk of unexpected behavior in production.

For reference, in H1 2025, teams not using Statsig saw about a 5% increase in stale flag volume, while our Statsig early adopters have seen a 33% decrease in stale flag volume so far in H2.

What We Learned

The biggest lesson from this project was to question all assumptions, no matter how fundamental they seem.

When we started the Experimentation Guild in 2023, there was a deeply held belief that we couldn’t replace our in-house analytical processes and tooling. I personally bought into this assumption without question, and as a result we didn’t seriously explore vendor solutions for nearly a year.

Looking back, that delay was a missed opportunity. Had we challenged that assumption sooner, we could have accelerated this initiative. That said, the time wasn’t wasted – we built a framework to analyze experimentation issues, which ultimately strengthened our case for a vendor solution.

The key takeaway to carry forward: don’t let untested assumptions slow you down. Question early so you can act faster.

Advice for Others

If you’re migrating from an in-house feature flagging or experimentation framework, keep these in mind:

- Document functional differences between platforms. Prior learning will be deeply embedded in developer workflows.
  - For example, our old framework exposed experiment groups (control, variant, etc.), while Statsig uses treatments. This shift confused developers at first, despite being a better model.
  - Document not just how to use the new tools, but also why they work the way they do.
- Plan for cross-functional coordination, but have clearly defined ownership. Experimentation spans Engineering, Product, Analytics, and Data Science, often across the full stack.
  - One person can’t manage this alone. Identify SMEs for each stakeholder group and coordinate onboarding together.
  - Be sure to identify clear ownership of the overall initiative to ensure continued progress.
- Be intentional about your risk tolerance. We chose to accelerate adoption across the org early, surfacing issues faster but creating more short-term friction.
  - For us, the trade-off was worth it. For your organization, it may or may not be, so decide consciously.

So much goes into large, cross-cutting decisions like how an organization handles feature flagging and experimentation, and we’ve only just scratched the surface of the work involved in this effort. For a company like Rover, with a strong culture of experimentation and data-driven decision-making, it’s incredibly exciting to have the opportunity to improve the workflows we use every day to enhance the experiences of sitters, pets, and their families.