
We were wasting 75% of our EC2 Memory resource!

Suraj Shah · Level Up Coding

How a small configuration default that we never tweaked led, for years, to our biggest ever waste of a resource. This is the story of an oversight that affects JVM applications running in Docker, and how it impacted us at Headout.

Disclaimer: The points mentioned here are mine. Everything I’ve written here is of my own volition and reflects my own views. It may or may not reflect the views of anyone working at Headout, or of the company itself.

On a fine Tuesday morning, I was walking over to my desk, wondering how my build from the previous evening was performing. I pulled my chair closer, completely oblivious to the slew of metrics I was about to witness.

Headout, my current employer, has some of the most interesting and amazing engineering problems at hand. To add a cherry on the cake, most of them have a direct or indirect relationship to sales performance and can be linked to a monetary impact. For an engineer like me, this is a double-edged sword: you are proud of the impact you create for the org, and at the same time you become extremely wary of the negative impact that can ensue.

With a mixed feeling of excitement and anxiety, a major change I had worked on for the last couple of months was finally live. It had been rigorously tested for weeks, deployed against a percentage of our traffic for a while, and, with a lot of paranoia, finally merged to main after a last round of metric assessment. And lo and behold, the build malfunctioned. I stared at my screen in disbelief as my worst nightmare manifested itself in red markers in front of me. Headout.com was down. Latency had spiked by 400% and showed no signs of returning to its resting point. Pings were crying for attention, and I was panicking, wondering what could have gone wrong. It quickly became clear that it wasn’t the build itself that had caused the problem, but an indirect effect of it.

In that build, I had moved our main (monolithic) API server’s database connection pooling to a better, faster and more efficient connection pool called HikariCP. A quick glance at its GitHub page shows why it’s one of the best DB connection pooling tools out there, and it’s no wonder that many JVM-based frameworks have started defaulting to HikariCP for their connection pooling needs. HikariCP is fast. Under the hood, when used as a dynamically sized pool the way we did, it makes a best-effort attempt to keep connections available whenever there’s a spike or burst of requests. But doing so can spawn a lot of threads, and any developer working on JVM-based applications will be aware of the pitfalls of spawning too many of them. We had a reasonable max size for our pools, and our instances were massive; 32 GB of RAM massive, to be precise. We also ran only one application instance per EC2 instance (a policy I’m not quite aligned with). Yet threads were getting killed, GC was being invoked aggressively, and the application ended up in long hung states with no metrics, no API calls and no NIO in progress. These became zombie instances.
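For context, a dynamically sized HikariCP pool is configured along these lines. This is a minimal sketch with hypothetical values, not our production settings:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://db.example.com:3306/app"); // hypothetical endpoint
        config.setUsername("app");
        config.setPassword("secret");
        // "Dynamic" sizing: minimumIdle is kept below maximumPoolSize, so on a
        // burst of requests Hikari opens extra connections up to the maximum,
        // along with the threads needed to establish and serve them.
        config.setMinimumIdle(10);
        config.setMaximumPoolSize(50);
        // Idle connections above minimumIdle are retired after this timeout;
        // a generous value keeps connections around rather than timing them out.
        config.setIdleTimeout(600_000); // 10 minutes
        return new HikariDataSource(config);
    }
}
```

The key detail is the gap between minimumIdle and maximumPoolSize: that gap is what lets the pool grow quickly under a burst, and what costs threads and memory when it does.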
We recovered in 7 minutes, with an estimated loss of some thousands of dollars. A lot of debugging later, I found the culprit.

We were only consuming 8 GB of our total 32 GB RAM allocation.

That’s because JVM applications, post Java 10, when run inside a container, run with a default max heap size of 25% of the total RAM available on the host system. The JVM does this to make sure there’s enough room for other containers running in parallel on the same cloud instance, and to leave room for non-heap memory.

For us this simply didn’t make sense. We had a deployment configuration of one application per instance, and on an instance with 32 GB of RAM, OOM (OutOfMemory) errors or aggressive GC pauses should ideally have been far beyond our reach. Yet here they were, staring us in the eye. It feels obvious in hindsight, and quite stupid to have missed something this significant. I guess you can only connect the dots looking backwards. We also never focused much on our JVM metrics to verify that the application actually had all the resources we had provisioned for it; it was something we simply took for granted. Nor had we ever bothered to fix some memory leaks and archaic code structures that demanded such high memory from the infrastructure in the first place. We’re doing that now, after HikariCP showed us that we weren’t ready to handle burst traffic if it ever needed to acquire a lot of connections and spawn more threads (we didn’t wish to time these connections out). The reason it had worked earlier is that the older pool, DBCP, isn’t as aggressive and fast as HikariCP. This wonderful doc compares multiple connection pools and documents their runtime behaviour with respect to spikes and bursts.

Spikes don’t always come from user-driven activity. For us it was a mix of running crons and cache computations, along with user-driven activity, that caused these sporadic spikes. While those changes will take a longer time frame to get done, the low-hanging fix, and the gist of the problem, remains simple:

If you’re running the JVM within Docker, make sure to check whether your application has the memory you intend it to have.
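One quick way to do that is to ask the running JVM how much heap it actually has. A minimal sketch, not our production tooling:

```java
public class HeapCheck {
    public static void main(String[] args) {
        double gib = 1024.0 * 1024 * 1024;
        // The maximum heap this JVM will ever grow to (driven by -Xmx or
        // -XX:MaxRAMPercentage, whichever is in effect).
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap visible to this JVM: %.1f GiB%n", maxHeap / gib);
        // With no explicit -Xmx, Java 10+ defaults to -XX:MaxRAMPercentage=25,
        // which comes out to roughly 8 GB on a 32 GB machine -- the number that
        // surprised us.
    }
}
```

You can also confirm the effective setting without touching code by running java -XX:+PrintFlagsFinal -version and looking for MaxRAMPercentage. The fix itself can be a single flag, such as -XX:MaxRAMPercentage=85, or an explicit -Xmx sized for your instance.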
We’ve since patched our build process and blessed the JVM with an 85% share of the total RAM on the system. It seems to have blessed us back by showing that we don’t really need such “heavy” instances. In the coming weeks, we’ll be shifting our main API servers to a lower tier on the EC2 instance chart (or maybe deploying more than one application instance per EC2 instance and fixing a major disagreement I have). This will help us use our resources better and should translate into cost savings, while people like me breathe a sigh of relief knowing that our applications get the resources we intend them to have.

I work as a Principal Engineer at Headout. My work involves working cross-functionally, hopping between teams and pods to see how I can better solve some core problems that lie beyond their purview and between them. Headout is an amazing place to be. If you think you have what it takes to be a part of its wonderful vision, ping me on LinkedIn and I would love to hear from you about how you could be a great fit here!

I like working on optimisation of resources and cost. If you feel that you’re over-paying for your AWS bills or have some unrecognised bottleneck, feel free to drop me a note on LinkedIn or Twitter and I would love to have a conversation with you. If not, let’s stay connected?

Thank you.


