Customer Impact:
For 132 minutes on 15 March 2017, users of Visual Studio Team Services experienced delays and some timeouts for their builds and load tests. During the incident, builds for approximately 1,500 Hosted Build accounts did not start and eventually failed with timeout errors. Tests for 40 Cloud Load Test accounts also failed. The incident started at 21:48 UTC on 15 March 2017 and remained active until 00:00 UTC the following day.
What went wrong:
The workflows for both Hosted Build and Cloud Load Test rely on provisioning a new VM for each build and test run. On 15 March 2017 at 21:42 UTC, the Azure Storage Resource Provider started failing for all service management operations globally. Because both Hosted Build and Cloud Load Test were unable to refresh their VM capacity, builds queued up and test runs failed. The VSTS SRE team was alerted 36 minutes into the incident, when the Build VM pool was exhausted. While trying to engage our partners in Azure, we discovered that Azure was already aware of the issue and working on a fix. After Azure Storage engineers applied a hotfix, both the Hosted Build and Cloud Load Test workflows recovered without manual intervention.
Next Steps:
Within the VSTS service, we identified opportunities to improve our monitoring. Specifically, our existing telemetry already logs exceptions for failed Azure management calls, which will enable us to alert the team earlier in future incidents. Additionally, our partners in Azure Storage have identified several resiliency improvements, which are outlined in the RCA noted below.
Azure Storage Service RCA: https://azure.microsoft.com/en-us/status/history (view the entry titled "RCA – Storage provisioning impacting multiple services" on 3/16/2017).
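As a rough illustration of the monitoring improvement described above, alerting on logged exceptions from failed Azure management calls could be sketched as a sliding-window failure counter. This is only a minimal sketch under stated assumptions, not the actual VSTS implementation: the class name `FailureRateAlert`, the threshold, and the window length are all hypothetical.

```python
from collections import deque


class FailureRateAlert:
    """Hypothetical helper: signals when too many failures fall inside a sliding window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # assumed value: failures needed to alert
        self.window_seconds = window_seconds  # assumed value: length of the window
        self._failures = deque()              # timestamps of recent failures, oldest first

    def record_failure(self, timestamp):
        """Record one failed management call; return True if the alert should fire."""
        self._failures.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop failures that are no longer inside the window.
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

With a threshold of 3 failures in 300 seconds, for example, three logged exceptions at 0, 100, and 200 seconds would fire the alert on the third call, well before a VM pool has had time to drain.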
Sincerely,
Sri Harsha