Jesse Ellis
February 13, 2025 20:05
According to GitHub’s availability report, GitHub experienced three incidents in January 2025 that degraded its services, caused by a deployment, a configuration change and a hardware failure.
January service disruptions
In January 2025, GitHub experienced three significant incidents that degraded performance across its services, as described in its availability report. The disruptions stemmed from a range of technical problems, including a faulty deployment, a configuration change and a hardware failure.
Incident details
January 9, 2025 (31 minutes)
The first incident occurred on January 9 between 01:26 and 01:56 UTC. A deployment introduced a problematic query that saturated the primary database server, producing an average error rate of 6% that peaked at 6.85%. Users received 500 error responses across multiple services. GitHub identified the faulty query within 14 minutes using its internal tools and dashboards, then rolled back the deployment to mitigate the problem.
January 13, 2025 (49 minutes)
Between 23:35 UTC on January 13 and 00:24 UTC the following day, Git operations were unavailable because of a change to traffic routing. The change caused the internal load balancer to drop requests that Git operations depend on. The incident was resolved by reverting the configuration. GitHub is now improving its monitoring and deployment practices to shorten detection time and automate mitigation.
January 30, 2025 (26 minutes)
In the last incident, between 14:22 and 14:48 UTC on January 30, web requests to github.com failed, with the error rate peaking at 44% and the average successful request taking more than 3 seconds. The problem stemmed from a hardware failure in the caching layer that supports rate limiting. The impact was prolonged because the caching layer had no automatic failover. GitHub performed a manual failover to trusted hardware to prevent a recurrence, and it plans to move to a high-availability cache configuration to strengthen resilience against similar failures.
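To illustrate what automatic failover in a rate-limiting cache might look like, here is a minimal sketch. It is not GitHub's implementation: the `CacheNode` class, the `FailoverRateLimiter` class and the fail-open policy are all hypothetical, and time windows and counter replication are left out for brevity.

```python
class CacheUnavailable(Exception):
    """Raised when a cache node cannot be reached."""


class CacheNode:
    """Hypothetical in-memory stand-in for a single cache server."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.counters = {}

    def incr(self, key):
        # Simulate a hardware failure by refusing to answer.
        if not self.healthy:
            raise CacheUnavailable(self.name)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]


class FailoverRateLimiter:
    """Counts requests per client and fails over between cache nodes."""

    def __init__(self, nodes, limit):
        self.nodes = list(nodes)  # ordered by preference, primary first
        self.limit = limit

    def allow(self, client_id):
        for node in self.nodes:
            try:
                return node.incr(client_id) <= self.limit
            except CacheUnavailable:
                continue  # automatic failover: try the next node
        # Assumption: fail open so requests are still served
        # when every cache node is down.
        return True
```

In this sketch a request is only ever rejected for exceeding the limit, never because a cache node is down. Counters are not replicated, so a client's count restarts on the replica after a failover; a production design would have to weigh that against the cost of keeping nodes in sync.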
Future improvements
GitHub is investing in better tooling to detect problematic queries before deployment and in improving cache resilience to prevent future disruptions. These measures aim to shorten detection and mitigation times for potential problems.
For real-time service status updates and post-incident reports, users can visit GitHub’s status page. Further insight into GitHub’s engineering work can be found on the GitHub engineering blog.