First off, apologies if you noticed any issues during the upgrade described below. I tried to minimize the impact as much as possible, but there were a few snags along the way. 😅
I finally got around to rebuilding everything using ARM instead of x86_64. This provides a nice performance boost as well as cost savings. Enough so that I was able to double our capacity for a nominal increase in price. Our DB is already over-provisioned so this should provide a nice buffer for the time being.
Additionally, while I was in there, I took the time to split the API so that the batch jobs can run separately. This allows me to run more than one API instance at a time for increased scalability.
Overall, it went well with little disruption and improves our stability for the future. See below for the technical details, otherwise enjoy what’s left of the day. 🎉
Technical details
The two issues that came up were related to the CDN caching an error response and the ongoing battle with pict-rs.
The CDN issue should be resolved, and it should fix itself automatically if it ever happens again. The CDN now overrides the cache TTL for error responses with a much shorter lifespan.
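For anyone curious what that override amounts to, here is a minimal sketch in Python. It is not the actual CDN configuration; the function name, the 10-second error TTL, and the choice to treat every 4xx/5xx status as an error are all assumptions for illustration.

```python
def cache_ttl_seconds(status: int, origin_ttl: int, error_ttl: int = 10) -> int:
    """Pick how long a cached response may live.

    Hypothetical rule: error responses (status >= 400) are capped at a short
    TTL so a cached failure expires quickly instead of lingering.
    """
    if status >= 400:
        # Never extend an error's lifetime, only shorten it.
        return min(origin_ttl, error_ttl)
    return origin_ttl
```

With a rule like this, a cached 502 would age out after a few seconds, while successful responses keep whatever TTL the origin asked for.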
And while everything is configured for zero downtime deploys, the embedded DB used by pict-rs locks the underlying file preventing another service from starting. This means it has to be taken completely offline until the lock clears (~10 minutes). I will continue looking into better ways to mitigate this in the future.
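To illustrate why a lock like this gets in the way of a rolling deploy, here is a rough sketch using an advisory file lock on a Unix system. This is not how pict-rs or its embedded DB is implemented; `try_lock` and `release` are hypothetical helpers, and the exclusive `flock` is only an assumption standing in for whatever locking the DB actually does.

```python
import fcntl
import os
from typing import Optional


def try_lock(path: str) -> Optional[int]:
    """Try to take an exclusive, non-blocking lock on the file at `path`.

    Returns the open file descriptor on success, or None if another
    holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Someone else holds the lock, so this "service" cannot start yet.
        os.close(fd)
        return None
    return fd


def release(fd: int) -> None:
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

In this sketch the second `try_lock` call on the same file returns `None` until the first holder calls `release`, which is the same shape of problem: the new instance cannot start while the old one still holds the file.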