According to GitHub, the platform experienced a notable incident in October 2024 that degraded performance across its services. The cause was a DNS infrastructure failure following a database migration at one of the company's sites.
Incident Overview
The incident began at 05:59 UTC on October 11 and lasted for over 19 hours. The initial problem arose when the site's DNS infrastructure failed to resolve lookups after a database migration. Efforts to recover the database led to a cascade of failures that further affected the DNS system. Customers began experiencing issues around 17:31 UTC: 4% of Copilot users saw poor IDE code completion performance, and 25% of Actions workflow users saw delays exceeding 5 minutes. Additionally, all code retrieval requests failed for approximately 4 hours.
Response and Resolution
Attempts to alleviate the problem by redirecting the affected DNS site to an alternate location were initially unsuccessful, because the redirection broke connections from healthy sites to degraded sites. At 20:52 UTC, the GitHub team implemented a remediation plan and deployed temporary DNS resolution to the affected sites. DNS resolution began to recover at 21:46 UTC and was fully operational by 22:16 UTC. Remaining issues related to code retrieval were resolved by 01:11 UTC on October 12.
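The durations implied by these timestamps can be checked with a little date arithmetic. The following is a minimal Python sketch, not anything published by GitHub: the timestamps are the ones reported above, while the dictionary keys and the helper names (utc, elapsed, fmt) are hypothetical labels chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc(month: int, day: int, hour: int, minute: int) -> datetime:
    """Build a timezone-aware UTC timestamp in 2024."""
    return datetime(2024, month, day, hour, minute, tzinfo=timezone.utc)


# Timestamps as reported in the article (all UTC). Key names are illustrative.
TIMELINE = {
    "incident_start": utc(10, 11, 5, 59),
    "customer_impact_start": utc(10, 11, 17, 31),
    "remediation_deployed": utc(10, 11, 20, 52),
    "dns_recovering": utc(10, 11, 21, 46),
    "dns_restored": utc(10, 11, 22, 16),
    "code_retrieval_resolved": utc(10, 12, 1, 11),
}


def elapsed(start_key: str, end_key: str) -> timedelta:
    """Time between two named events in the timeline."""
    return TIMELINE[end_key] - TIMELINE[start_key]


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes."""
    hours, minutes = divmod(int(delta.total_seconds() // 60), 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    print("Total incident:", fmt(elapsed("incident_start", "code_retrieval_resolved")))
    print("Start to customer impact:", fmt(elapsed("incident_start", "customer_impact_start")))
    print("Remediation to DNS restored:", fmt(elapsed("remediation_deployed", "dns_restored")))
```

The total comes to 19 hours 12 minutes, consistent with the "over 19 hours" figure, and roughly eleven and a half hours passed between the initial DNS failure and the first customer-visible impact.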
Future Precautions
Following the incident, GitHub worked to strengthen its resiliency and automation processes so that similar issues can be diagnosed and resolved faster in the future. The company aims to improve infrastructure reliability to prevent such incidents from recurring.
For real-time updates on the status of GitHub services, visit the GitHub Status page. Insight into ongoing projects and improvements can be found on the GitHub Engineering Blog.