Improving spanner DB read latency by 70% at Google

Recently at work, I was faced with a rather involved problem that required me to take a deeper look at DB read optimisations. I think the investigation and steps taken to solve the problem are worth sharing, since they could help others running into similar problems.

The Problem: Improve the latency of an RPC, i.e, reduce the time it takes between the client invoking an RPC and the client receiving the response.

Anyone with some experience in problem solving would realize that the above does not define a complete problem. Any problem requires constraints under which it needs to be solved. For example, reduce latency by how much ? For all requests or a subset ?

What were the constraints ?

The constraints that we were dealing with here were:

  • 95p latency < 40ms, i.e, 95% of all requests should have latency less than 40 milliseconds.
  • This should be true for all requests coming from North America, Europe and Asia.

What does the RPC do ?

It is important to give context on what the RPC does. This was crucial in figuring out where we need to focus on for optimizations.

The RPC, let us call it getUserProperties, provides access to a sort of federated database that stores various properties of a user. Basically, when someone asks it for a user property, say, country, it does the following:

  • Check if its cache has a fresh enough value. If yes, return it.
  • Otherwise, reach out to the true source of that property, and fetch the value from there.

The cache is built on a database known as Spanner. Spanner is a globally distributed planet-scale database built by Google. Cloud spanner is generally available and you can read more about it here.

First, we measure!

Before we put on our optimization hats, it was important to do two things:

  • Have a way to measure the latency.
  • Measure the current latency to know how much of an improvement we’re talking about.

It is important to understand that we needed to measure two different latencies. One was when we had a cache hit and returned the value directly after a db read. The other was when there was a cache miss and we needed to reach out to the source of truth.

So, we put some timers around our code and generated the latency numbers.

Latencies for cache hits
Latencies for cache miss

Clearly, we’re way off where we need to be and Optimizations are required!

Tradeoffs are inevitable.

Most Software engineers know that it is very hard for us to have the cake and eat it too. Tradeoffs are required in most places and our case was no exception. For the cache misses, there wasn’t much we could do. We can’t go to all sources and tell them to drop everything and work on improving latency. So the tradeoff was accepting that we can meet the latency requirements only for the cache hits.

This meant, that the cache needed to be updated by a pub-sub model for all the properties where we needed the low latency to ensure cache hits for > 95% of the requests. Properties for which this would not be possible, would be out of scope.

Next up: Stale DB reads

Now that we knew we needed to focus on making spanner db reads faster, the first thing we did was pursue doing stale db reads.

By default, Spanner does strong reads. A simplified explanations of this is that while the nearest (geographically) replica of the db can process your request, that replica must send it’s own RPC to the leader to ensure it has up-to-date information. This additional RPC makes strong reads slow.

However, Spanner also allows you to do stale reads. This means, you tell spanner that “X” seconds of data staleness is fine for you. Then the nearest replica can simply return its value without consulting with the leader, provided it is sure that its data is not more than “X” seconds old.

Based on benchmarks, even 10 second stale reads provide drastic improvements in read latency.

So, we confirmed that 10 seconds of staleness was fine for our use case made the change to move to 10 seconds stale reads.

Here’s the result we saw after making the change:

Latencies after moving to 10s stale reads

Our 50p latency reduced by 84%. However, 95p and 99p showed no improvements. This was unexpected because stale reads should have an effect throughout, even for the higher percentiles. This had me scratching my head for quite a while.

Aha, Breakthrough!

After some research and playing around with the data, the picture finally became clearer when I finally got round to breaking down the latency numbers by the continent of origin for the requests.
Here’s what we saw:

95p latency per continent

The numbers show that we had a 50% improvement in 95p latency for requests coming from North America but no improvements for requests in Europe.

Well, why was that ?

Next on the list: Co-location

Our spanner db had replicas only in North America. This meant that requests originating in Europe had to come to a replica in North America to fetch the value.

The network latency, i.e, the time it took for the request to reach North America and back was too high for stale reads to make any impact. In-fact, due to the high network latency, these requests were already being served by an up-to-date replica, which is why stale reads show zero improvement.

So our next step was to ensure we had replicas of the db in Europe as well. (Asia too, because we would receive some requests from Asia soon). This would mean that requests from Europe would not need to cross the ocean and can be handled within the continent.

We went through the process of creating the additional replicas in Europe and Asia and the resultant latency numbers responded brilliantly.

95p Latencies after replicating in Europe

As we expected, the latency for requests from Europe improved drastically, around 70%. After a few more minor optimizations in our codebase, our overall 95p latency now was close to 45 ms. It was great, a 62% decrease from where we started, but we weren’t quite there yet. We needed to be below 40ms with some buffer to handle spikes or further changes we’d make.

Time to look at the hardware!

To dig further, I took a deeper look into how spanner handled the reads internally and compared it to another RPC which had a faster read time compared to ours.

That’s when I noticed that our spanner instances were using HDD for storage while the faster one used SSD. There’s a ton of research on HDD vs SSD on the internet, but to keep it short, SSD is much faster for reads than HDD.

So, we planned for it, did capacity planning and migrated all our db instances to use SSD instead of HDD. The results finally put us into the acceptable zone for the latency and currently, we’re comfortably below the 40ms barrier.

Latencies after moving to SSD storage

What’s next ?

As can be seen from the graph above, the latency in North America is significantly higher than the ones in Europe and Asia. This means we have better co-location in EMEA compared to North America. That is a potential area to further optimize should we need to bring our latency further down in the future.

Building subscriptions at @Google with Google One | Previously @GoldmanSachs | When I am not working, I spend my time with Sports , Books and Movies |