Facebook’s engineering team wrote about the methodologies it uses to maintain a high degree of hardware availability in its data centers.
Facebook has data centers in 15 regions across the globe and has been expanding regularly. Much of this is possible because the provisioning of infrastructure, including the network, is automated. Hardware failure is expected at such scale and is mitigated to some extent by Facebook’s internal cluster management system, Twine (previously known as Tupperware). Four aspects help Facebook maintain a high degree of hardware availability: systems for detection and remediation; monitoring and mitigation that does not significantly affect application performance; ML techniques for predicting anomalies; and an automated root cause analysis (RCA) system. Each of these aspects is also discussed in detail in individual papers by Facebook engineers.
Facebook’s detection and remediation system consists of three tools: MachineChecker, which analyzes server logs; Facebook Auto-Remediation (FBAR); and Cyborg. MachineChecker’s checks include “checks for host ping, out-of-band (OOB) ping, power status check, memory error checks, sensor check, secure shell (SSH) access, network interface card (NIC) speed, dmesg check, S.M.A.R.T. check, and power supply check”. It creates alerts in a central system, and FBAR executes customizable scripts to attempt a fix. Mostly written in Python, FBAR is not a new system – it has existed for almost a decade, with subsequent improvements. Cyborg runs lower-level checks and logs a ticket for manual intervention if remediation was unsuccessful. As part of the workflow, both FBAR and Cyborg rerun MachineChecker to verify the failure. According to one talk, “94% alarms are cleared without human intervention”.
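The detection-to-escalation flow described above can be sketched roughly as follows. This is a hypothetical illustration of the workflow, not Facebook’s code: all function names, checks, and return values are invented, and the real systems run asynchronously at data-center scale.

```python
# Hypothetical sketch of the MachineChecker -> FBAR -> Cyborg flow.
# All names and checks here are illustrative, not Facebook's actual code.

def machine_checker(host):
    """Run a few health checks and return the names of the failed ones."""
    checks = {
        "host_ping": lambda h: True,               # stand-ins for real probes
        "oob_ping": lambda h: True,
        "ssh_access": lambda h: h != "bad-host",   # simulate one failing host
    }
    return [name for name, check in checks.items() if not check(host)]

def fbar_remediate(host, failures):
    """Run customizable remediation scripts, then re-verify the failure."""
    for failure in failures:
        pass  # e.g. restart a service, reset a NIC, clear a stuck sensor
    # Rerunning MachineChecker is part of the workflow: the failure must
    # still reproduce for the alert to stay open.
    return machine_checker(host)

def handle_alert(host):
    failures = machine_checker(host)
    if not failures:
        return "healthy"
    if not fbar_remediate(host, failures):
        return "auto-remediated"
    # Cyborg-style lower-level remediation would run here; if that also
    # fails, a ticket is logged for manual intervention.
    return "ticketed"
```

The key design point the sketch preserves is that both remediation layers re-run the detection step rather than trusting the original alert, which filters out failures that have already cleared.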
The authors of the paper note that some failures can be missed because of their transient nature. These include failures triggered by high load, which disappear once the load lightens, so subsequent reruns of MachineChecker do not catch them. To get around this, they suggest creating synthetic load on the CPU, memory, and network.
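One way to picture the suggestion is to rerun a health check while deliberately saturating the CPU, so a load-dependent fault has a chance to reappear. The sketch below is an assumption about how such a harness might look (the names `burn_cpu` and `check_under_load` are invented); the paper's proposal also covers memory and network load, which are omitted here for brevity.

```python
# Illustrative sketch: rerun a health check under synthetic CPU load so
# transient, load-triggered failures can reproduce. Hypothetical names.
import concurrent.futures
import time

def burn_cpu(seconds):
    """Spin in a tight loop to keep one core busy for `seconds`."""
    end = time.monotonic() + seconds
    n = 0
    while time.monotonic() < end:
        n += 1
    return n

def check_under_load(check, seconds=1.0, workers=4):
    """Run `check()` while `workers` threads generate CPU load."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        burners = [pool.submit(burn_cpu, seconds) for _ in range(workers)]
        result = check()          # the health check runs under load
        for b in burners:
            b.result()            # wait for the load generators to finish
    return result
```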
Typically, a hardware failure results in a CPU interrupt, and handling that interrupt can affect application performance. The authors suggest a middle ground: a hybrid approach that combines two methods to preserve accuracy without impacting performance.
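The article does not spell out the two methods, so the sketch below is an assumption for illustration only: one common hybrid handles rare errors immediately (interrupt-style, accurate but disruptive) and batches frequent errors for periodic polling (cheap but delayed). The class and threshold are hypothetical.

```python
# Hypothetical hybrid error-reporting sketch. Combining immediate
# (interrupt-style) handling with batched polling is an assumed design
# to illustrate the accuracy/performance trade-off, not Facebook's.

class HybridErrorReporter:
    def __init__(self, interrupt_threshold=10):
        self.interrupt_threshold = interrupt_threshold  # errors per interval
        self.buffered = []

    def on_error(self, error, rate):
        if rate < self.interrupt_threshold:
            return self.handle_now(error)   # rare error: handle immediately
        self.buffered.append(error)         # frequent error: defer cheaply
        return None

    def handle_now(self, error):
        return f"handled:{error}"

    def poll(self):
        """Periodic sweep: process all buffered errors in one batch."""
        batch, self.buffered = self.buffered, []
        return [self.handle_now(e) for e in batch]
```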
New hardware and software configurations mean the auto-remediation system needs to be updated with new rules from time to time. Tickets can end up undiagnosed or misdiagnosed if new software and hardware are deployed before the corresponding rules are added. To avoid this, the team built a machine learning framework that “learns from how failures have been fixed in the past and tries to predict what repairs would be necessary” for current tickets. The input data for the ML models includes failure signals collected from the MachineChecker checks, server metadata, and the server’s recent (around six months of) repair tickets.
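To make the idea concrete, a toy version of “predict the repair from past tickets” might vote among historical tickets that share the most failure signals. This is a deliberately crude nearest-neighbour sketch with invented signal and repair names; Facebook’s actual models and features are not described at this level in the article.

```python
# Toy sketch of predicting a repair from past tickets by majority vote
# among the tickets sharing the most failure signals. Signal and repair
# names are invented for illustration.
from collections import Counter

def predict_repair(ticket_signals, past_tickets):
    """Vote among past tickets with the largest signal overlap."""
    def overlap(t):
        return len(ticket_signals & t["signals"])
    best = max(overlap(t) for t in past_tickets)
    neighbours = [t for t in past_tickets if overlap(t) == best]
    votes = Counter(t["repair"] for t in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical repair history for one server
past = [
    {"signals": {"memory_error", "dmesg_mce"}, "repair": "swap_dimm"},
    {"signals": {"nic_speed_low"}, "repair": "reseat_nic"},
    {"signals": {"memory_error"}, "repair": "swap_dimm"},
]
```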
The last tool in the group is an RCA system that correlates server hardware logs with software and tooling logs. It uses a pattern mining algorithm to find correlations among millions of log entries.
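A minimal flavour of such correlation mining: count how often a hardware-log event and a software-log event occur on the same host within a short time window, and rank the pairs. Real pattern-mining algorithms (e.g. frequent-itemset approaches) are far more sophisticated; this brute-force sketch, with invented log formats, only illustrates the idea of correlating log sources.

```python
# Hypothetical sketch of cross-log correlation: count (hardware event,
# software event) pairs seen on the same host within a time window.
# Log tuples are (timestamp_seconds, host, event); formats are invented.
from collections import Counter
from itertools import product

def co_occurring_pairs(hw_logs, sw_logs, window=60):
    """Return a Counter of event pairs co-occurring within `window` seconds.
    Brute-force O(n*m); a real pattern miner would scale much better."""
    pairs = Counter()
    for (t1, h1, e1), (t2, h2, e2) in product(hw_logs, sw_logs):
        if h1 == h2 and abs(t1 - t2) <= window:
            pairs[(e1, e2)] += 1
    return pairs
```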
Apart from the systems discussed here, Facebook has others in place, such as Maelstrom, to mitigate outages at different layers of the stack.