Investigating Sinatra + Active Record + Cloud SQL Connection Bottleneck
At Bukalapak [1] we had hundreds of microservices deployed on Google Cloud, running on Kubernetes [2]. Our databases mostly used Cloud SQL [3], and our code base was mostly written in Ruby and Golang.
Here I’m going to share my experience investigating a bottleneck in one of the microservices managed by my Squad. The microservice was powered by Sinatra [4] and used Cloud SQL for MySQL (I will refer to it simply as Cloud SQL for the rest of this article).
At Bukalapak we call a team a Squad, which usually consists of a Product Manager, Engineering Manager, Test Engineering Manager, Backend Engineer, Frontend Engineer, Mobile Apps Engineer, Test Engineer, Data Scientist, and Designer.
Cloud SQL has a default limit of 4000 concurrent connections [5] for the machine type we used. The limit can be raised, but that is not recommended, as exceeding the default might cause instability:
You can use the max_connections flag to configure connection limits. The Cloud SQL team recommends using the default connection limits to ensure instance stability. You can find the connection limits for your instance by connecting to your database and running this command: SHOW VARIABLES LIKE "max_connections";
About The Microservice
Our microservice mostly handled interactions with third parties, so there were a lot of external API calls (meaning they went over the internet). The tech stacks / services involved were:
- Ruby using Sinatra framework
- Puma [6] for the web server
- Active Record [7] for the ORM [8]
- Datadog [9] for the monitoring service
- OpsGenie [10] for the alerting service
- Kubernetes with HPA (Horizontal Pod Autoscaler) [11] configured
The Problem?
As mentioned previously, we had a limit of 4000 connections to Cloud SQL. As our Squad’s features grew, our microservice received more and more traffic. At that time it could reach ~2600 rps (requests per second).
As one of the biggest marketplaces in Indonesia, Bukalapak regularly ran big promotions / campaigns. Our microservice’s features were located in a lot of critical places (such as the homepage, checkout page, etc.), so an increase in overall Bukalapak traffic would, most of the time, contribute to an increase in our microservice’s traffic.
Luckily (or not? 😆) our microservice had good monitoring and alerting for both the application (throughput, latency, error rate, CPU, memory, etc.) and the database (CPU, memory, number of active connections, etc.).
We realized that our microservice held a high number of active database connections (~1300). That was still safe for business as usual, but it might hit the 4000-connection limit if traffic surged by 2x or 3x 🔥 (which could happen during a big campaign with a live TV show).
Abnormalities
One bright sunny day 😎, I spent some time looking for clues in the Datadog metric for database connections. I saw a weird pattern at dawn, when our microservice’s traffic was at its lowest: a literally flat line, without any dip, from ~0am to 8am. My intuition said the metric should have dipped at ~3am, correlating with our lowest daily throughput.
I had no idea what was going on. I knew the general concept of a connection pool, but I didn’t know the implementation details of how Active Record handled database connections.
Investigations
I then began my investigation into Sinatra + Puma + Active Record + MySQL, trying to understand how database connections were being used by Active Record.
Start Small
Our microservice had dozens of libraries installed and a lot of business logic, so it might not have been the right place to use as a playground: there might be custom behavior caused by third-party libraries. So I decided to start small, by creating a simple Sinatra application with the Puma web server, then observing how it worked.
Here is how my setup looked. I’m not copy-pasting the full code here, as there was nothing fancy in the playground application:
- I created a MySQL database with 1 table and 1 record (this step could have been skipped and replaced by a raw SQL query SELECT 1, but at that time I didn’t do that, so as to closely replicate the condition of our microservice)
- A Sinatra application that used the Puma web server, with Active Record for the ORM
- The playground application exposed 2 endpoints: /0 accessed the database, and /1 did not. I used puts on the Thread ID to debug the application
- Active Record was configured with a database connection pool of 2
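For reference, here is a minimal sketch of what that playground looked like. This is a reconstruction from the description above, not the exact code; the Good model/table name is an assumption, and it requires the sinatra, mysql2, and activerecord gems plus a local MySQL database:

```ruby
# app.rb — playground sketch (names are assumptions)
require 'sinatra'
require 'active_record'

ActiveRecord::Base.establish_connection(
  adapter: "mysql2",
  database: "goods",
  username: "root",
  password: "root",
  host: "127.0.0.1",
  port: 3306,
  pool: 2, # database connection pool of 2
)

class Good < ActiveRecord::Base; end # the table with 1 record

get '/0' do
  puts "Thread ID: #{Thread.current.object_id}"
  Good.first.inspect # accesses the database
end

get '/1' do
  puts "Thread ID: #{Thread.current.object_id}"
  "no database access"
end
```

Since this is the Classic style, Puma serves it through a config.ru that simply does require './app' and run Sinatra::Application.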
I ran the application using 2 Puma threads with the following command:
$ puma -t 2:2
I then hit it with 8 concurrent requests.
Everything looked fine; I didn’t notice any issue at all. Then I did another experiment: setting the database connection pool lower than the number of Puma threads. In this case, I set the database connection pool to 1.
ActiveRecord::Base.establish_connection(
adapter: "mysql2",
database: "goods",
username: "root",
password: "root",
host: "127.0.0.1",
port: 3306,
pool: 1, # <--- I changed it from 2 into 1
)
BOOM! 💥 We saw the infamous error: could not obtain a connection from the pool within 5.000 seconds (waited 5.005 seconds); all pooled connections were in use
Here is the screenshot version:
We can see that there were 8 requests; I’ll sum them up along with the Thread ID(s):
+---------+----------------+----------+
| Request | Thread ID | Endpoint |
+---------+----------------+----------+
| 1 | 70210596097360 | /1 |
| 2 | 70210596097060 | /1 |
| 3 | 70210596097060 | /0 |
| 4 | 70210596097060 | /1 |
| 5 | 70210596097060 | /0 |
| 6 | 70210596097060 | /1 |
| 7 | 70210596097060 | /1 |
| 8 | 70210596097360 | /0 | <--- Error
+---------+----------------+----------+
I ran the test several times and saw that requests to the endpoint /0 only succeeded from a specific Thread ID. In the example above, it was Thread ID 70210596097060.
Research
I searched the internet for this behavior and found this post: https://stackoverflow.com/a/36278672/1363015 [12]. Quoting the important part:
This method (as described in the source code docs) returns the connection associated with the current thread. Internally it uses a connections cache where the key is the Thread.current.object_id, i.e. a unique ID of the current thread.
It appeared that a Puma thread was mapped as a key to a database connection. As seen in my previous experiment, only Thread ID 70210596097060 could use the database connection; the other thread, 70210596097360, could not obtain one.
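To build intuition for that thread-keyed cache, here is a toy pool in plain Ruby. This is my simplified illustration, not Active Record’s actual implementation (the real pool also waits up to a checkout timeout before raising):

```ruby
# A toy connection pool that caches connections per thread, loosely
# mimicking Active Record's cache keyed on Thread.current.object_id.
class ToyPool
  def initialize(size)
    @size  = size
    @cache = {} # Thread.current.object_id => connection
    @mutex = Mutex.new
  end

  # Returns the connection already mapped to this thread, or claims a
  # free one; raises when all connections are held by other threads.
  def checkout
    @mutex.synchronize do
      key = Thread.current.object_id
      return @cache[key] if @cache.key?(key)
      raise "all pooled connections were in use" if @cache.size >= @size
      @cache[key] = "conn-#{@cache.size + 1}"
    end
  end

  # Removes this thread's mapping, freeing the slot for other threads.
  def checkin
    @mutex.synchronize { @cache.delete(Thread.current.object_id) }
  end
end

pool = ToyPool.new(1)
first  = pool.checkout # this thread claims the only connection
second = pool.checkout # same thread: the cached connection is reused

other_failed = false
Thread.new do
  begin
    pool.checkout # a different thread cannot obtain a connection...
  rescue RuntimeError
    other_failed = true
  end
end.join
# ...until the holding thread calls pool.checkin
```

This reproduces the playground behavior: the thread that first touched the database keeps its connection, while any other thread hits the “all pooled connections were in use” error.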
The link pointed to Active Record version 4.2. There had been some changes since then (my playground application used the latest Active Record, version 6.1.3.1), but the general logic was still similar.
I took it for granted and concluded that setting the database connection pool lower than the number of Puma threads would cause a disaster (like what happened in my playground application)! 😓
I discussed that finding and my experiment with my colleague Yap Sok Ann. He found the behavior weird; it should not be like that. He suggested that I debug the checkout and checkin methods in lib/active_record/connection_adapters/abstract/connection_pool.rb.
There were several ways to do this, such as:
- Directly insert puts into the Active Record source code
- Use a debugger if your IDE supports it (my favorite is RubyMine)
- Or just directly monkey-patch the code. I’ll use this one for this article.
I inserted this code into app.rb:
module ActiveRecord
module ConnectionAdapters
class ConnectionPool
alias original_checkout checkout
def checkout
puts "___checkout: #{Thread.current.object_id}"
original_checkout
end
alias original_checkin checkin
def checkin(conn)
puts "!!!checkin: #{Thread.current.object_id}"
original_checkin(conn)
end
end
end
end
Here was the result:
We can see that there were 2 checkouts (obtaining a connection from the pool) but no checkin (returning a connection to the pool).
If I directly called ActiveRecord::Base.clear_active_connections! (I found it in this documentation: https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html [13]), then I could see the checkin printed. That means my patch worked, and checkin was truly not being called 😩
Suddenly… voila!! 🎉 My colleague pinged me with a bug issue link: https://github.com/rails/rails/issues/37092 [14]. It appeared that Janko had already reported it and had a workaround for Sinatra: use gem 'sinatra-activerecord', which checks the connection back in after the request cycle completes: https://github.com/sinatra-activerecord/sinatra-activerecord/blob/0f079e6000d2b3b00c642ab9af971466e401ba90/lib/sinatra/activerecord.rb#L33 [15]
Let’s Test It!
I re-ran my application after making the following changes:
- In Gemfile, I changed gem 'activerecord' into gem 'sinatra-activerecord'
- In app.rb, I changed require 'active_record' into require 'sinatra/activerecord'
Hurraay! 💯 I could see the checkin, and the previous “could not obtain a connection from the pool within 5.000 seconds” error was gone. We can also see that two different Thread IDs (70096363781900 and 70096363781620) could still access the database.
We need to be aware, though, that the database connection is only returned to the pool at the end of the request. So if there is slowness after the checkout, the checkin gets delayed as well:
request is coming
call database <-- checkout (obtaining database connection)
slow API call <-- it will cause the connection to be hold
end of request <-- checkin (connection is returned to pool)
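That holding behavior can be sketched in plain Ruby (a toy illustration of the request cycle above, not actual Sinatra or Active Record code):

```ruby
# A pool with a single connection, modeled as a thread-safe queue.
pool = Queue.new
pool << :conn

held = nil
checkout = -> { held ||= pool.pop(true) }  # take a connection (non-blocking)
checkin  = -> { pool << held; held = nil } # return it to the pool

# request is coming
checkout.call      # call database: the connection leaves the pool
during = pool.size # 0 — held for the whole time the slow API call runs
sleep 0.01         # slow API call (simulated)
checkin.call       # end of request: the after-hook returns the connection
after = pool.size  # 1 — available to other requests again
```

The longer the work between checkout and the end of the request, the longer the connection stays unavailable to other threads.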
Yet Another Issue
Pheew.. I finally found the workaround. Let’s use gem sinatra-activerecord in our microservice. But………… it appeared that our microservice already used gem sinatra-activerecord (╯°□°)╯︵ ┻━┻
I was so confused. I checked sinatra-activerecord’s code, and the app.after { ActiveRecord::Base.clear_active_connections! } was there.
Then I looked through the GitHub issues with the keyword not working (yes, one by one): https://github.com/sinatra-activerecord/sinatra-activerecord/search?q=not+working&type=issues [16], in case it had been reported there as well.
Wohooo!! I found a similar issue: https://github.com/sinatra-activerecord/sinatra-activerecord/issues/2 [17]. Quoting the solution:
Actually I realized the app itself wasn’t working, and poking through the specs I figured out that I needed to add register Sinatra::ActiveRecordExtension to my app class to make things work. You might want to make a note of that in your read me?
Danngg! It appeared that I didn’t understand Sinatra well enough. I didn’t realize that Sinatra had Modular and Classic styles (http://sinatrarb.com/intro.html [18], find: Modular vs. Classic Style). The one I used in this article was the Classic style (that’s why it worked 🤷♂️), whereas the one in our microservice was the Modular style.
It appeared there was a misconfiguration in our microservice, and it had been present for more than a year, since the first time the microservice was deployed to production 😅🤭
Actually, this seems to be a common issue (or not?) in Sinatra applications. I saw another example: https://github.com/DataDog/dd-trace-rb/pull/144 [19]
Results
I made a one-line change to our microservice by adding:
register Sinatra::ActiveRecordExtension
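In a Modular-style app, that line goes inside the application class. A sketch, assuming the sinatra and sinatra-activerecord gems (the MyApp class name and route are hypothetical):

```ruby
require 'sinatra/base'
require 'sinatra/activerecord'

class MyApp < Sinatra::Base
  # Modular style does not apply extensions automatically, so without
  # this line the after-hook that checks connections back into the pool
  # is never installed.
  register Sinatra::ActiveRecordExtension

  get '/0' do
    # ...database access...
  end
end
```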
Then deployed the changes to production.
The results were amazing 😍. We reduced database connections by ~40%!
Actually, the metric above was not directly caused by the checkin (returning database connections to the pool). It was caused by the flushing of idle database connections [20]. So what actually happened was: there was a checkin, and then, if the connection stayed idle for more than 5 minutes (the default idle_timeout), it was flushed (disconnected). Before that commit, there might have been a checkin, but we couldn’t see any metric that correlated with it.
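The idle_timeout can be tuned where the connection is configured. A sketch based on the playground configuration earlier (300 seconds is the documented default):

```ruby
ActiveRecord::Base.establish_connection(
  adapter: "mysql2",
  database: "goods",
  username: "root",
  password: "root",
  host: "127.0.0.1",
  port: 3306,
  pool: 2,
  idle_timeout: 300, # seconds a checked-in connection may sit idle
                     # in the pool before being flushed (disconnected)
)
```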
We no longer worried about the 4000-connection limit. We also made more improvements by caching some database calls, which reduced database connections even further.
Of course, our microservice also survived peak traffic during a big campaign, with a throughput of ~6100 rps. The database connections were no longer a bottleneck, as we only used ~1650 of them 🥳
In the end, it was a misconfiguration on our team’s side. But I think the journey to figure out the issue, and the knowledge gained along the way, were pretty interesting.
References
- https://en.wikipedia.org/wiki/Bukalapak
- http://kubernetes.io/
- https://cloud.google.com/sql/docs
- http://sinatrarb.com/
- https://cloud.google.com/sql/docs/mysql/quotas#maximum_concurrent_connections
- https://github.com/puma/puma
- https://guides.rubyonrails.org/active_record_basics.html
- https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping
- https://www.datadoghq.com/
- https://www.atlassian.com/software/opsgenie
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- https://stackoverflow.com/a/36278672/1363015
- https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html
- https://github.com/rails/rails/issues/37092
- https://github.com/sinatra-activerecord/sinatra-activerecord/blob/0f079e6000d2b3b00c642ab9af971466e401ba90/lib/sinatra/activerecord.rb#L33
- https://github.com/sinatra-activerecord/sinatra-activerecord/search?q=not+working&type=issues
- https://github.com/sinatra-activerecord/sinatra-activerecord/issues/2
- http://sinatrarb.com/intro.html
- https://github.com/DataDog/dd-trace-rb/pull/144
- https://github.com/rails/rails/pull/31221