Investigating Sinatra + Active Record + Cloud SQL Connection Bottleneck
At Bukalapak [1] we had hundreds of microservices deployed on Google Cloud, running on Kubernetes [2]. Our databases mostly used Cloud SQL [3], and our code base was mostly written in Ruby and Golang.
Here I’m going to share my experience investigating a bottleneck in one of the microservices managed by my Squad. The microservice was powered by Sinatra [4] and used Cloud SQL for MySQL (I will refer to it simply as Cloud SQL for the rest of this article).
At Bukalapak we call a team a Squad, which usually consists of a Product Manager, Engineering Manager, Test Engineering Manager, Backend Engineer, Frontend Engineer, Mobile Apps Engineer, Test Engineer, Data Scientist, and Designer.
Cloud SQL has a default limit of 4000 concurrent connections [5] for the machine type we used. The limit can be raised, but that is not recommended, as exceeding the default might cause instability:
You can use the max_connections flag to configure connection limits. The Cloud SQL team recommends using the default connection limits to ensure instance stability. You can find the connection limits for your instance by connecting to your database and running this command: SHOW VARIABLES LIKE "max_connections";
About The Microservice
Our microservice mostly handled interactions with third parties, so there were a lot of external API calls (meaning they went over the internet). The tech stacks / services involved were:
- Ruby using Sinatra framework
- Puma [6] for the web server
- Active Record [7] for the ORM [8]
- Datadog [9] for the monitoring service
- OpsGenie [10] for the alerting service
- Kubernetes with HPA (Horizontal Pod Autoscaler) [11] configured
The Problem?
As mentioned previously, we had a limit of 4000 connections to Cloud SQL. As our Squad’s features grew, our microservice received more and more traffic. At that time it could reach ~2600 rps (requests per second).
As one of the biggest marketplaces in Indonesia, Bukalapak regularly ran big promotions / campaigns. Our microservice’s features were located in a lot of critical places (such as the homepage, checkout page, etc.), so an increase in overall Bukalapak traffic would, most of the time, contribute to an increase in our microservice’s traffic.
Luckily (or not? 😆) our microservice had good monitoring and alerting for both the application (throughput, latency, error rate, CPU, memory, etc.) and the database (CPU, memory, number of active connections, etc.).
We realized that our microservice held a high number of active database connections (~1300). That was still safe for business as usual, but it might hit the 4000-connection limit if traffic surged by 2x or 3x 🔥 (which could happen during a big campaign with a live TV show).
Abnormalities
One bright sunny day 😎, I spent some time looking for clues in the Datadog metric for database connections. I saw a weird pattern at dawn, when our microservice’s traffic was at its lowest: a literally flat line, without any dip, from ~0am to 8am. My intuition said the metric should have dipped at ~3am, correlating with our lowest daily throughput.
I had no idea what was going on. I knew the general concept of a connection pool, but I didn’t know the implementation details of how Active Record handled database connections.
Investigations
I then began my investigation into Sinatra + Puma + Active Record + MySQL, trying to understand how database connections were being used by Active Record.
Start Small
Our microservice had dozens of libraries installed and a lot of business logic, so it might not have been the right place to use as a playground: there might be custom behavior caused by third-party libraries. So I decided to start small, by creating a simple Sinatra application with the Puma web server, then observing how it worked.
Here is how my setup looked. I’m not copy-pasting the full code here, as there was nothing fancy in the playground application:
- I created a MySQL database with 1 table and 1 record (this step could have been skipped and replaced by a raw SQL query SELECT 1, but at that time I didn’t do that, so as to closely replicate the condition of our microservice)
- A Sinatra application that used the Puma web server, with Active Record for the ORM
- The playground application exposed 2 endpoints: /0 accessed the database, and /1 did not. I used puts on the Thread ID to debug the application
- Active Record was configured with a database connection pool of 2
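For reference, here is a minimal sketch of what that playground looked like. This is a reconstruction from the description above, not the exact code; the Good model/table name is an assumption, and it requires the sinatra, mysql2, and activerecord gems plus a local MySQL database:

```ruby
# app.rb — playground sketch (names are assumptions)
require 'sinatra'
require 'active_record'

ActiveRecord::Base.establish_connection(
  adapter: "mysql2",
  database: "goods",
  username: "root",
  password: "root",
  host: "127.0.0.1",
  port: 3306,
  pool: 2, # database connection pool of 2
)

class Good < ActiveRecord::Base; end # the table with 1 record

get '/0' do
  puts "Thread ID: #{Thread.current.object_id}"
  Good.first.inspect # accesses the database
end

get '/1' do
  puts "Thread ID: #{Thread.current.object_id}"
  "no database access"
end
```

Since this is the Classic style, Puma serves it through a config.ru that simply does require './app' and run Sinatra::Application.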
I ran the application using 2 Puma threads with the following command:
$ puma -t 2:2
I then hit it with 8 concurrent requests.
Everything looked fine; I didn’t notice any issue at all. Then I did another experiment: setting the database connection pool lower than the number of Puma threads. In this case, I set the database connection pool to 1.
ActiveRecord::Base.establish_connection(
adapter: "mysql2",
database: "goods",
username: "root",
password: "root",
host: "127.0.0.1",
port: 3306,
pool: 1, # <--- I changed it from 2 into 1
)
BOOM! 💥 We saw the infamous error: could not obtain a connection from the pool within 5.000 seconds (waited 5.005 seconds); all pooled connections were in use
Here is the screenshot version:
We can see that there were 8 requests; I’ll sum them up along with the Thread ID(s):
+---------+----------------+----------+
| Request | Thread ID | Endpoint |
+---------+----------------+----------+
| 1 | 70210596097360 | /1 |
| 2 | 70210596097060 | /1 |
| 3 | 70210596097060 | /0 |
| 4 | 70210596097060 | /1 |
| 5 | 70210596097060 | /0 |
| 6 | 70210596097060 | /1 |
| 7 | 70210596097060 | /1 |
| 8 | 70210596097360 | /0 | <--- Error
+---------+----------------+----------+
I ran the test several times and saw that requests to the endpoint /0 only succeeded from a specific Thread ID. In the example above, it was Thread ID 70210596097060.
Research
I searched the internet for this behavior and found this post: https://stackoverflow.com/a/36278672/1363015 [12]. Quoting the important part:
This method (as described in the source code docs) returns the connection associated with the current thread. Internally it uses a connections cache where the key is the Thread.current.object_id, i.e. a unique ID of the current thread.
It appeared that a Puma thread was mapped as a key to a database connection. As seen in my previous experiment, only Thread ID 70210596097060 could use the database connection; the other thread, 70210596097360, could not obtain one.
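To build intuition for that thread-keyed cache, here is a toy pool in plain Ruby. This is my simplified illustration, not Active Record’s actual implementation (the real pool also waits up to a checkout timeout before raising):

```ruby
# A toy connection pool that caches connections per thread, loosely
# mimicking Active Record's cache keyed on Thread.current.object_id.
class ToyPool
  def initialize(size)
    @size  = size
    @cache = {} # Thread.current.object_id => connection
    @mutex = Mutex.new
  end

  # Returns the connection already mapped to this thread, or claims a
  # free one; raises when all connections are held by other threads.
  def checkout
    @mutex.synchronize do
      key = Thread.current.object_id
      return @cache[key] if @cache.key?(key)
      raise "all pooled connections were in use" if @cache.size >= @size
      @cache[key] = "conn-#{@cache.size + 1}"
    end
  end

  # Removes this thread's mapping, freeing the slot for other threads.
  def checkin
    @mutex.synchronize { @cache.delete(Thread.current.object_id) }
  end
end

pool = ToyPool.new(1)
first  = pool.checkout # this thread claims the only connection
second = pool.checkout # same thread: the cached connection is reused

other_failed = false
Thread.new do
  begin
    pool.checkout # a different thread cannot obtain a connection...
  rescue RuntimeError
    other_failed = true
  end
end.join
# ...until the holding thread calls pool.checkin
```

This reproduces the playground behavior: the thread that first touched the database keeps its connection, while any other thread hits the “all pooled connections were in use” error.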
The link pointed to Active Record version 4.2. There had been some changes since then (my playground application used the latest Active Record, version 6.1.3.1), but the general logic was still similar.
I took it for granted and concluded that setting the database connection pool lower than the number of Puma threads would cause a disaster (like what happened in my playground application)! 😓
I discussed that finding and my experiment with my colleague Yap Sok Ann. He found the behavior weird; it should not be like that. He suggested that I debug the checkout and checkin methods in lib/active_record/connection_adapters/abstract/connection_pool.rb.
There were several ways to do this, such as:
- Directly insert puts into the Active Record source code
- Use a debugger if your IDE supports it (my favorite is RubyMine)
- Or just directly monkey-patch the code. I’ll use this one for this article.
I inserted this code into app.rb:
module ActiveRecord
module ConnectionAdapters
class ConnectionPool
alias original_checkout checkout
def checkout
puts "___checkout: #{Thread.current.object_id}"
original_checkout
end
alias original_checkin checkin
def checkin(conn)
puts "!!!checkin: #{Thread.current.object_id}"
original_checkin(conn)
end
end
end
end
Here was the result:
We can see that there were 2 checkouts (obtaining a connection from the pool) but no checkin (returning a connection to the pool).
If I directly called ActiveRecord::Base.clear_active_connections! (I found it in this documentation: https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html [13]), then I could see the checkin printed. That means my patch worked, and checkin was truly not being called 😩
Suddenly… voila!! 🎉 My colleague pinged me with a bug issue link: https://github.com/rails/rails/issues/37092 [14]. It appeared that Janko had already reported it and had a workaround for Sinatra: use gem 'sinatra-activerecord', which checks the connection back in after the request cycle completes: https://github.com/sinatra-activerecord/sinatra-activerecord/blob/0f079e6000d2b3b00c642ab9af971466e401ba90/lib/sinatra/activerecord.rb#L33 [15]
Let’s Test It!
I re-ran my application after making the following changes:
- In Gemfile, I changed gem 'activerecord' into gem 'sinatra-activerecord'
- In app.rb, I changed require 'active_record' into require 'sinatra/activerecord'
Hurraay! 💯 I could see the checkin, and the previous “could not obtain a connection from the pool within 5.000 seconds” error was gone. We can also see that two different Thread IDs (70096363781900 and 70096363781620) could still access the database.
We need to be aware, though, that the database connection is only returned to the pool at the end of the request. So if there is slowness after the checkout, the checkin gets delayed as well:
request is coming
call database <-- checkout (obtaining database connection)
slow API call <-- it will cause the connection to be hold
end of request <-- checkin (connection is returned to pool)
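That holding behavior can be sketched in plain Ruby (a toy illustration of the request cycle above, not actual Sinatra or Active Record code):

```ruby
# A pool with a single connection, modeled as a thread-safe queue.
pool = Queue.new
pool << :conn

held = nil
checkout = -> { held ||= pool.pop(true) }  # take a connection (non-blocking)
checkin  = -> { pool << held; held = nil } # return it to the pool

# request is coming
checkout.call      # call database: the connection leaves the pool
during = pool.size # 0 — held for the whole time the slow API call runs
sleep 0.01         # slow API call (simulated)
checkin.call       # end of request: the after-hook returns the connection
after = pool.size  # 1 — available to other requests again
```

The longer the work between checkout and the end of the request, the longer the connection stays unavailable to other threads.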
Yet Another Issue
Pheew.. I finally found the workaround. Let’s use gem sinatra-activerecord in our microservice. But………… it appeared that our microservice already used gem sinatra-activerecord (╯°□°)╯︵ ┻━┻
I was so confused. I checked sinatra-activerecord’s code, and the app.after { ActiveRecord::Base.clear_active_connections! } was there.
Then I looked through the GitHub issues with the keyword not working (yes, one by one): https://github.com/sinatra-activerecord/sinatra-activerecord/search?q=not+working&type=issues [16], in case it had been reported there as well.
Wohooo!! I found a similar issue: https://github.com/sinatra-activerecord/sinatra-activerecord/issues/2 [17]. Quoting the solution:
Actually I realized the app itself wasn’t working, and poking through the specs I figured out that I needed to add register Sinatra::ActiveRecordExtension to my app class to make things work. You might want to make a note of that in your read me?
Danngg! It appeared that I didn’t understand Sinatra well enough. I didn’t realize that Sinatra had Modular and Classic styles (http://sinatrarb.com/intro.html [18], find: Modular vs. Classic Style). The one I used in this article was the Classic style (that’s why it worked 🤷♂️), whereas the one in our microservice was the Modular style.
It appeared there was a misconfiguration in our microservice, and it had been present for more than a year, since the first time the microservice was deployed to production 😅🤭
Actually, this seems to be a common issue (or not?) in Sinatra applications. I saw another example: https://github.com/DataDog/dd-trace-rb/pull/144 [19]
Results
I made a one-line change to our microservice by adding:
register Sinatra::ActiveRecordExtension
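In a Modular-style app, that line goes inside the application class. A sketch, assuming the sinatra and sinatra-activerecord gems (the MyApp class name and route are hypothetical):

```ruby
require 'sinatra/base'
require 'sinatra/activerecord'

class MyApp < Sinatra::Base
  # Modular style does not apply extensions automatically, so without
  # this line the after-hook that checks connections back into the pool
  # is never installed.
  register Sinatra::ActiveRecordExtension

  get '/0' do
    # ...database access...
  end
end
```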
Then deployed the changes to production.
The results were amazing 😍. We reduced database connections by ~40%!
Actually, the metric above was not directly caused by the checkin (returning database connections to the pool). It was caused by the flushing of idle database connections [20]. So what actually happened was: there was a checkin, and then, if the connection stayed idle for more than 5 minutes (the default idle_timeout), it was flushed (disconnected). Before that commit, there might have been a checkin, but we couldn’t see any metric that correlated with it.
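The idle_timeout can be tuned where the connection is configured. A sketch based on the playground configuration earlier (300 seconds is the documented default):

```ruby
ActiveRecord::Base.establish_connection(
  adapter: "mysql2",
  database: "goods",
  username: "root",
  password: "root",
  host: "127.0.0.1",
  port: 3306,
  pool: 2,
  idle_timeout: 300, # seconds a checked-in connection may sit idle
                     # in the pool before being flushed (disconnected)
)
```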
We no longer worried about the 4000-connection limit. We also made more improvements by caching some database calls, which reduced database connections even further.
Of course, our microservice also survived peak traffic during a big campaign, with a throughput of ~6100 rps. The database connections were no longer a bottleneck, as we only used ~1650 of them 🥳
In the end, it was a misconfiguration on our team’s side. But I think the journey to figure out the issue, and the knowledge gained along the way, were pretty interesting.
References
- https://en.wikipedia.org/wiki/Bukalapak
- http://kubernetes.io/
- https://cloud.google.com/sql/docs
- http://sinatrarb.com/
- https://cloud.google.com/sql/docs/mysql/quotas#maximum_concurrent_connections
- https://github.com/puma/puma
- https://guides.rubyonrails.org/active_record_basics.html
- https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping
- https://www.datadoghq.com/
- https://www.atlassian.com/software/opsgenie
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- https://stackoverflow.com/a/36278672/1363015
- https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html
- https://github.com/rails/rails/issues/37092
- https://github.com/sinatra-activerecord/sinatra-activerecord/blob/0f079e6000d2b3b00c642ab9af971466e401ba90/lib/sinatra/activerecord.rb#L33
- https://github.com/sinatra-activerecord/sinatra-activerecord/search?q=not+working&type=issues
- https://github.com/sinatra-activerecord/sinatra-activerecord/issues/2
- http://sinatrarb.com/intro.html
- https://github.com/DataDog/dd-trace-rb/pull/144
- https://github.com/rails/rails/pull/31221