Skip to main content

Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections

2 weeks 5 days ago
Google Cloud has attributed last week's widespread outage to a flawed code update in its Service Control system that triggered a global crash loop due to missing error handling and lack of feature flag protection. The Register reports: Google's explanation of the incident opens by informing readers that its APIs, and Google Cloud's, are served through our Google API management and control planes." Those two planes are distributed regionally and "are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints." The core binary that is part of this policy check system is known as "Service Control." On May 29, Google added a new feature to Service Control, to enable "additional quota policy checks." "This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code," Google's incident report explains. The search monopolist appears to have had concerns about this change as it "came with a red-button to turn off that particular policy serving path." But the change "did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash." Google uses feature flags to catch issues in its code. "If this had been flag protected, the issue would have been caught in staging." That unprotected code ran inside Google until June 12th, when the company changed a policy that contained "unintended blank fields." Here's what happened next: "Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment." Google's post states that its Site Reliability Engineering team saw and started triaging the incident within two minutes, identified the root cause within 10 minutes, and was able to commence recovery within 40 minutes. But in some larger Google Cloud regions, "as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on ... overloading the infrastructure." Service Control wasn't built to handle this, which is why it took almost three hours to resolve the issue in its larger regions. The teams running Google products that went down due to this mess then had to perform their own recovery chores. Going forward, Google has promised a couple of operational changes to prevent this mistake from happening again: "We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers. We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity."

Read more of this story at Slashdot.

BeauHD

Intel Will Lay Off 15% To 20% of Its Factory Workers, Memo Says

2 weeks 5 days ago
Intel will lay off 15% to 20% of its factory workforce starting in July, potentially cutting over 10,000 jobs as part of a broader effort to streamline operations amid declining sales and mounting competitive pressure. "These are difficult actions but essential to meet our affordability challenges and current financial position of the company. It drives pain to every individual," Intel manufacturing Vice President Naga Chandrasekaran wrote to employees Saturday. "Removing organizational complexity and empowering our engineers will enable us to better serve the needs of our customers and strengthen our execution. We are making these decisions based on careful consideration of what's needed to position our business for the future." The company reiterated that "we will treat people with care and respect as we complete this important work." Oregon Live reports: Intel announced the pending layoffs in April and notified factory workers last week that the cuts would begin in July. It hadn't previously said just how deep the layoffs will go. The company had 109,000 employees at the end of 2024, but it's not clear how many of those worked in its factory division -- called Intel Foundry. The Foundry business includes a broad array of jobs, from technicians on the factory floor to specialized researchers who work years in advance to develop future generations of microprocessors. Intel is planning major cuts in other parts of its business, too, but employees say the company hasn't specified how many jobs it will eliminate in each business unit. Workers say they believe the impacts will vary within departments. Overall, though, the layoffs will surely eliminate several thousand jobs -- and quite possibly more than 10,000.

Read more of this story at Slashdot.

BeauHD