
Hunting For Zombie Pods

At almost exactly 0300 UTC on the morning of 22/05, our monitoring systems started sending alerts for a handful of websites: they had begun to respond more slowly. This was very strange; the sites weren't down or throwing errors, just slow...

I first turned my attention to the time this started happening: everything began slowing down right at the top of the hour. Was there a cron job hammering the cluster's CPU? Nope. Was there a problem with our monitoring system? Nope, requesting the sites from other locations showed the same response times. Was it a rogue deployment? Nope.

After this I started looking at other possibilities:

  • Was there an AWS outage? Nope, all systems were green
  • Was it an automatic update of a CMS? Nope, they were using different CMSes
  • Was there some kind of DDoS attack? Nope, traffic appeared normal
  • Was there a problem with the database? Nope, all seemed to be fine

The Smoking Gun

My suspicions now turned towards the storage layer, since at this point it was basically the only moving part I hadn't yet looked at. We use AWS Elastic File System (EFS) to store any volatile data our apps consume, and simply mount it into our containers.
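For context, EFS is exposed to the instances as an NFSv4 share. Conceptually the mount looks something like the sketch below; the file-system ID, region, and local path are placeholders, and in practice the mount is handled by the cluster's volume plumbing rather than run by hand:

$ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
    fs-12345678.efs.eu-west-1.amazonaws.com:/ /path/mounted/to/EFS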

So, to test whether this was the source of the problem, I attached to one of the affected containers and timed a file touch on the container's local filesystem:

$ time touch /path/to/app/source/testFile

real    0m0.010s
user    0m0.001s
sys     0m0.000s

This was fine, but it was a different story for directories mounted via EFS:

$ time touch /path/mounted/to/EFS/testFile

real    0m0.334s
user    0m0.001s
sys     0m0.000s

334ms just to touch a file?! I would expect some increase in latency due to the overheads of the NFS protocol, networking, etc., but according to the EFS FAQ, "EFS Standard is designed to provide single-digit latencies on average...". This at least explained the issue: all the views, logs, assets, etc. that an app was reading from disk were taking an age to respond. So... why?

At this point I opened a ticket with AWS, but while waiting for them to respond I kept digging...

Graphs!

I started rummaging through the plethora of graphs that AWS makes available through CloudWatch when I saw this:

This was the "Well, there's your problem" moment, we had some how managed to exhausted our BurstCreditBalance (the blue line), I won't try to explain how it's calculated because it's complicated, but the short of it is, if you use lots of bandwidth all of a sudden they go down, and if everything is idle they go up. When you exhaust your credits AWS will throttle you, hard, making your apps slow and your customers sad.

This was also the point where I realised that the timing was just a coincidence; it just so happened that the credits were exhausted right at the top of the hour.

The Why (and more graphs!)

So we have the when and the what, but what about the why? Why did the credits get exhausted? It was immediately clear that something was performing a disproportionate amount of metadata IO: the MetadataIOBytes metric was several hundred times read and write combined!
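To see just how lopsided this was without squinting at graphs, you can sum the three IO metrics over the same window. A rough sketch, again with placeholder IDs and times:

$ for metric in MetadataIOBytes DataReadIOBytes DataWriteIOBytes; do
    echo -n "$metric: "
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EFS \
      --metric-name "$metric" \
      --dimensions Name=FileSystemId,Value=fs-12345678 \
      --start-time 2020-05-22T00:00:00Z \
      --end-time 2020-05-22T06:00:00Z \
      --period 21600 \
      --statistics Sum \
      --query 'Datapoints[].Sum' \
      --output text
  done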

To get an idea of what was causing this excessive metadata IO, I went back in time; it couldn't have been like this forever:

This was 2 days prior! (Don't be fooled by the dynamic axis labelling: the bottom of the left-hand axis isn't zero.) This was the day we had done a cluster-wide PHP upgrade (patching CVE-2019-11048), but it wasn't the upgrade that caused the problem, at least not directly...

Once we had applied the patches to our container images we had to deploy them. This is where the problem began: a couple of Pods in the cluster failed to be replaced correctly and were left in a "Terminating" state. Annoyingly, Kubernetes wasn't able to tell us why, and any manual delete commands sent to these pods would just hang...

$ kubectl get pods --all-namespaces | grep Terminating
************       ************-staging-797fbbbc6d-whjpp        0/2     Terminating     0     25h
************       ************-production-56dc6d7dbd-wccmq     0/2     Terminating     0     35d
************       ************-production-588cd5f7cd-g9fxt     0/2     Terminating     0     3d7h
************       ************-production-588cd5f7cd-xmxll     0/2     Terminating     0     3d7h
************       ************-staging-585f86f6d7-ptr7k        0/2     Terminating     0     3d7h
************       ************-staging-585f86f6d7-zpqcc        0/2     Terminating     0     3d7h
************       ************-staging-78b99cd8f8-6ktsh        0/2     Terminating     0     43d
************       ************-staging-78b99cd8f8-hjmtg        0/2     Terminating     0     25d
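For reference, the kind of manual delete that just hung (names redacted as above), along with the usual --force escalation. Worth noting that --force only removes the Pod object from the API; it doesn't stop containers that are still running on the node, so it wouldn't have dealt with the underlying problem anyway:

$ kubectl delete pod ************-staging-797fbbbc6d-whjpp -n ************
# just hangs...

$ kubectl delete pod ************-staging-797fbbbc6d-whjpp -n ************ --grace-period=0 --force
# only removes the Pod object from the API; anything still running on the node keeps running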

The only way to be rid of these Zombie Pods was to reboot the nodes they were scheduled onto. Yes, the solution was literally to turn it off and on again; not exactly satisfying, but a solution nonetheless. (Presumably the stuck pods were still holding on to their EFS mounts and generating IO alongside their replacements, which would explain the extra metadata traffic.) All of our pods are scheduled onto at least 2 nodes, so no downtime occurred as a result.
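For anyone facing the same thing, a rough sketch of what "turn it off and on again" looks like for an EC2-backed node (the node name and instance ID are placeholders; draining first is nicer when it works, but a pod stuck in Terminating can make the drain hang too):

$ kubectl cordon ip-10-0-1-23.eu-west-1.compute.internal        # stop new pods landing on the node
$ aws ec2 reboot-instances --instance-ids i-0123456789abcdef0   # reboot the underlying instance
$ kubectl uncordon ip-10-0-1-23.eu-west-1.compute.internal      # let it take work again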

Future Mitigations

I learned a lot while debugging this, and there are a few takeaways from the incident.

Move as much IO as possible off mounted volumes

One easy win here is to move caches off the mounted storage, since they are designed to be volatile anyway (one way to do this is sketched below). This should also have the added benefit of improving app performance slightly.
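A minimal sketch of one way to do this, assuming the app's cache lives at a (hypothetical) path on the EFS mount and can safely be swapped for a symlink to local storage at container start-up:

# in the container entrypoint, before the app starts
CACHE_DIR="/path/mounted/to/EFS/cache"   # hypothetical cache location on the EFS mount
LOCAL_CACHE="/tmp/app-cache"             # container-local, ephemeral storage

mkdir -p "$LOCAL_CACHE"
rm -rf "$CACHE_DIR"                      # drop the shared cache directory...
ln -s "$LOCAL_CACHE" "$CACHE_DIR"        # ...and point the same path at local disk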

EFS metric alerts

If we had had alerts on the EFS burst credit balance, we would have been pointed at the problem before it became a problem (something like the alarm sketched below).
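A minimal sketch of such an alarm via the AWS CLI; the file-system ID, threshold, and SNS topic are placeholders, and the threshold is in bytes, so pick something comfortably above the level at which throttling starts to hurt:

$ aws cloudwatch put-metric-alarm \
    --alarm-name efs-burst-credits-low \
    --namespace AWS/EFS \
    --metric-name BurstCreditBalance \
    --dimensions Name=FileSystemId,Value=fs-12345678 \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 3 \
    --comparison-operator LessThanThreshold \
    --threshold 500000000000 \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts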

Zombie monitoring

Adding monitoring for pods that have been stuck in the Terminating state for longer than a few minutes would again have given us a heads-up that something was wrong (a quick-and-dirty version is sketched below).
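A sketch of such a check using jq: a pod that is terminating has metadata.deletionTimestamp set, so anything where that timestamp is more than, say, five minutes old is a candidate zombie.

$ kubectl get pods --all-namespaces -o json | jq -r '
    .items[]
    | select(.metadata.deletionTimestamp != null)
    | select((now - (.metadata.deletionTimestamp | fromdateiso8601)) > 300)
    | "\(.metadata.namespace)/\(.metadata.name) terminating since \(.metadata.deletionTimestamp)"'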


