As you might know from previous posts, we are hosted on Amazon Web Services (or AWS). We're
using the awesome Relational Database Service (or RDS) for our
RDBMS backends. Let's suppose you have a production database that has kept growing and growing until its hosting fees exceed 20% of your entire production environment's hosting costs. What do you do? Well, the Rover Engineering team found itself in exactly this situation, so in this post we'll walk through our approach to solving the problem.
Some background
When we started a handful of years ago, we needed a way to gather predominantly write-only analytics data about how the marketplace works. This allowed us to monitor its overall health and, at the same time, gain insight into ways to optimize it.
As with many startups, the team was very small and we didn't have a lot of data experience. Moreover, many big data technologies were still in their infancy. We decided to push the data to a separate parallel PostGIS database. Since it was supposed to only be accessed from the worker processes (i.e., not used in the HTTP request/response cycle), we could write all the data we needed. This would be our temporary solution until we figured out something better.
Fast forward to the present day, and here we are! The database kept growing and growing, and because of our many other priorities, we just kept adding more storage to the database instance to keep up. Now that the database has grown to a couple of terabytes, we've reached a point where further storage upgrades are no longer viable. It's time to take the bull by the horns and address our technical debt.
Prerequisites
Fortunately, the database is only accessed by our worker processes (Celery), which use their own separate queue. Therefore, we can easily stop this pool of workers from processing tasks while we perform our downsize operations.
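For example, when tasks are routed to a dedicated queue, Celery's remote control commands can pause and resume consumption without touching the rest of the workers. A minimal sketch (the app name myapp and the queue name analytics are placeholders for whatever your setup uses):
$ # Stop consuming from the analytics queue; new tasks will simply pile up
$ celery -A myapp control cancel_consumer analytics
$ # ...perform the maintenance...
$ # Resume consumption when done
$ celery -A myapp control add_consumer analytics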
Further, we do not connect directly to the database server by its endpoint (which you can find in the RDS console). Instead, we set up a DNS record in Route 53 that points to the database endpoint. This is handy because we can easily switch to a new database instance just by updating the DNS record.
If you have a different setup, you may want to update it to something similar in order to limit disruption to the production system.
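For reference, here's roughly what managing such a record looks like with the AWS CLI. This is a sketch rather than our exact configuration: the hosted zone ID and record name are placeholders, and the 60-second TTL is simply the value we assume later in this post.
$ aws route53 change-resource-record-sets \
--hosted-zone-id Z1EXAMPLE \
--change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "analytics-db.internal.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{"Value": "current-db-instance.rds.amazonaws.com"}]
    }
  }]
}'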
The process
Our approach has three major steps:
- Downsize individual tables (short-term)
- Downsize the database instance (short-term)
- Update the functionality to avoid relying on a "live" database (mid-term)
The first two steps will significantly cut down on hosting costs in the short term and buy us time to focus on the more complicated and time-consuming final step. This way, we can address it in parallel with our other priorities as we continue to evolve our system.
Downsizing individual tables
Every application is unique in its own ways, but we think these high-level steps are applicable to any setup.
Since data becomes less interesting as it ages, we realized that we only need quick access and the ability to run performant queries against the most recent "live" data (approximately a month's worth). We still want to use the full set of "archive" data for more complex analyses, but we are willing to trade off speed for a cheaper storage solution. Thus, our downsizing adventure results in a two-step replication process that pushes data into our Data Warehouse (Amazon Redshift). Aided by our awesome Data Team, we will:
- mirror the tables in the existing database into a set of "live" tables in Redshift
- upsert the full set of historical data into corresponding "archive" tables (a sketch of this upsert follows below)
On the back-end side, we will keep only the last N days of records by running a daily task to remove old data. This way we can keep the database size at manageable levels for the next year or
so without needing to increase the storage size.
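The daily cleanup itself can be as simple as a scheduled SQL delete. A minimal sketch, assuming a 30-day window, a hypothetical events table with a created_at column, and the placeholder DNS name from earlier:
$ # e.g. run daily from cron or a Celery beat schedule
$ psql -h analytics-db.internal.example.com -U myuser -d mydatabase -c "
DELETE FROM events WHERE created_at < now() - interval '30 days';"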
For the initial downsize, the high-level approach is:
- Stop the worker processes that access the analytics database. We will continue to push new tasks without processing the existing ones, so the queue will grow in size.
- Rename all the database objects, including constraints and indexes (a very fast operation in PostgreSQL).
- Recreate the database objects with the same names that the application backend expects, but without the constraints and indexes (to speed up inserts).
- Copy the most recent N days of data into each new table, where needed.
- Create the necessary constraints and indexes.
- Restart the worker processes. All the existing tasks in the queue will be processed and the data will start flowing into the new tables.
- If everything looks good, truncate the old tables and reclaim the space by issuing VACUUM FULL statements on them.
- Finally, delete the old tables. (A rough SQL sketch of these steps follows.)
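To make this concrete, here's a rough psql sketch of the rename, recreate, copy, and re-index steps for a single hypothetical events table. The table, index, column, and retention window are all placeholders, and the connection settings are exported once via the standard libpq environment variables:
$ export PGHOST=analytics-db.internal.example.com PGUSER=myuser PGDATABASE=mydatabase
$ psql -c "
ALTER TABLE events RENAME TO events_old;
ALTER INDEX events_created_at_idx RENAME TO events_old_created_at_idx;
-- recreate the table without constraints or indexes so the copy and new inserts stay fast
CREATE TABLE events (LIKE events_old INCLUDING DEFAULTS);
-- keep only the most recent month of data
INSERT INTO events SELECT * FROM events_old WHERE created_at > now() - interval '30 days';
CREATE INDEX events_created_at_idx ON events (created_at);"
$ # later, once everything looks good (run these separately; VACUUM can't run inside a transaction block)
$ psql -c "TRUNCATE events_old;"
$ psql -c "VACUUM FULL events_old;"
$ psql -c "DROP TABLE events_old;"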
Downsizing the database instance
After downsizing individual tables, we will reclaim ~95% of storage space! The next step is to downsize the RDS storage. To our surprise we found out that AWS won't allow you to downsize the storage for an
RDS instance. We can understand why that might be the case! Further, you cannot restore a snapshot into a smaller
RDS instance (in terms of storage size) or create a read replica with less storage space.
To work around these limitations, the solution is to dump the contents of the database and reload that dump into a new, downsized RDS instance. The approach here is therefore similar to the one above:
- Stop the worker processes.
- Dump the database contents using pg_dump.
- Restore the database contents into a previously created RDS instance using pg_restore.
- Update the DNS record in Route 53 to point to the new RDS instance.
- Let the changes propagate for 60 seconds (assuming that's the TTL on the DNS record).
- Restart the worker processes.
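Before restarting the workers, it's worth double-checking that the record has actually switched over. A quick sanity check, using the placeholder record name from earlier:
$ # should return new-db-instance.rds.amazonaws.com once the change has propagated
$ dig +short analytics-db.internal.example.com CNAME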
The problem is that the database size is still in the hundreds of gigabytes, so the full process may take a long time. Our initial benchmarks show that the total process takes around 3x longer than the time needed to downsize the largest table. Therefore, the worker queue may grow so big that it destabilizes the entire system. We'd like to avoid that as much as possible!
Since most of the dump operation consists of fetching data from the database and saving it to the EBS volume, the bottlenecks are the CPU and the EBS volume's IOPS. We decided to try the dump
operation on a big multi-core instance that has an EBS volume with a high number of Provisioned IOPS. By using a
multi-core machine we're also able to run the pg_dump
operation with parallelization to speed things up even more.
Here's our final setup for the database instance downsize.
- One c4.4xlarge EC2 instance (16 cores) in the same subnet as the database instances, with an EBS volume that has enough space for the dump and 10,000 Provisioned IOPS.
- One m4.xlarge RDS instance with enough storage for 2x the size of the downsized database tables. We could easily get away with a smaller RDS instance, but we wanted more CPU for the restore procedure. Further, the instance type can be downgraded after the restore.
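For what it's worth, provisioning the dump volume is a one-liner with the AWS CLI. The size and availability zone below are placeholders; only the 10,000 Provisioned IOPS figure comes from our setup:
$ aws ec2 create-volume \
--availability-zone us-west-2a \
--size 500 \
--volume-type io1 \
--iops 10000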
And here's how we will perform the dump and restore.
$ pg_dump \
-h current-db-instance.rds.amazonaws.com \
-U myuser \
-d mydatabase \
-F d \
-f /home/ubuntu/db-dump \
-j 16 \
-cCO
$ pg_restore \
-h new-db-instance.rds.amazonaws.com \
-U myuser \
-d mydatabase \
-F d \
-j 16 \
-cC \
/home/ubuntu/db-dump
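Once the restore completes and everything checks out, the oversized RDS instance class can be dialed back down, as noted above. Something along these lines, with a placeholder instance identifier and target class:
$ aws rds modify-db-instance \
--db-instance-identifier new-db-instance \
--db-instance-class db.m4.large \
--apply-immediately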
Next Steps
As we said earlier, the next step is to continue to support the same functionality and queries without relying on a live (and more costly) database. At the end of our journey, we will have moved off said "live" DB in favor of running various ETL processes that plumb data directly into our Data Warehouse. We'll need to carefully coordinate with our marketplace analysts and data scientists so that we don't disrupt their workflows during the transition. This step is expected to vary widely between use cases, so we will spare you the details!
In the interim, by performing the first two steps we will significantly cut down on DB hosting costs
and will shave almost 20% off our production environment hosting bill! This should buy us quite a lot of time to properly direct more resources towards our complete, stable long-term solution!