# Hardware Fault Monitor

Author: Claire Weinan (cweinan@google.com), daylight22

Other contributors: Heinz Boehmer Fiehn (heinzboehmer@google.com), Drew Walton
(acwalton@google.com)

Created: Aug 5, 2021

## Problem Description

The goal is to create a new hardware fault monitor which will provide a
framework for collecting various fault and sensor information and making it
available externally via Redfish for data center monitoring and management
purposes. The information logged would include a wide variety of chipset
registers and data from manageability hardware. In addition to collecting
information through BMC interfaces, the hardware fault monitor will also
receive information via Redfish from the associated host kernel (specifically
for cases in which the desired information cannot be collected directly by the
BMC, for example when accessing registers that are read and cleared by the
host kernel).

Future expansion of the hardware fault monitor would include adding the means
to locally analyze fault and sensor information and then, based on specified
criteria, trigger repair actions in the host BIOS or kernel. In addition, the
hardware fault monitor could receive repair action requests via Redfish from
external data center monitoring software.

## Background and References

The following are a few related existing OpenBMC modules:

- Host Error Monitor logs CPU error information such as CATERR details and
  takes appropriate actions such as performing resets and collecting
  crashdumps: https://github.com/openbmc/host-error-monitor

- bmcweb implements a Redfish webserver for OpenBMC:
  https://github.com/openbmc/bmcweb. The Redfish LogService schema is
  available for logging purposes and the EventService schema is available for
  a Redfish server to send event notifications to clients.

- Phosphor Debug Collector (phosphor-debug-collector) collects various debug
  dumps and saves them into files:
  https://github.com/openbmc/phosphor-debug-collector

- dbus-sensors reads and saves sensor values and makes them available to
  other modules via D-Bus: https://github.com/openbmc/dbus-sensors

- SEL logger logs to the IPMI and Redfish system event logs when certain
  events happen, such as sensor readings going beyond their thresholds:
  https://github.com/openbmc/phosphor-sel-logger

- FRU fault manager controls the blinking of LEDs when faults occur:
  https://github.com/openbmc/phosphor-led-manager/blob/master/fault-monitor/fru-fault-monitor.hpp

- Guard On BMC records and manages a list of faulty components for isolation.
  (Both the host and the BMC may identify faulty components and create guard
  records for them):
  https://github.com/openbmc/docs/blob/9c79837a8a20dc8e131cc8f046d1ceb4a731391a/designs/guard-on-bmc.md

There is an OpenCompute Fault Management Infrastructure proposal that also
recommends delivering error logs from the BMC:
https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

## Requirements

- The users of this solution are Redfish clients in data center software. The
  goal of the fault monitor is to enable rich error logging (OEM and CPU
  vendor specific) for data center tools to monitor servers, manage repairs,
  predict crashes, etc.

- The fault monitor must be able to handle receiving fault information that
  is polled periodically as well as fault information that may come in
  sporadically based on fault incidents (e.g. crash dumps).

- The fault monitor should allow for logging of a variety of sizes of fault
  information entries (on the order of bytes to megabytes). In general, more
  severe errors which require more fault information to be collected tend to
  occur less frequently, while less severe errors such as correctable errors
  require less logging but may happen more frequently.

- Fault information must be added to a Redfish LogService in a timely manner
  (within a few seconds of the original event) to be available to external
  data center monitoring software.

- The fault monitor must allow for custom overwrite rules for its log entries
  (e.g. on overflow, save first errors and more severe errors), or guarantee
  that enough space is available in its log such that all data from the most
  recent couple of hours is always kept intact. The log does not have to be
  stored persistently (though it can be).

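The custom overwrite requirement can be illustrated with a small sketch of a
bounded log that, on overflow, evicts the newest of the least-severe entries
while never evicting first errors. This shows one possible policy only; the
class and field names are hypothetical and not part of any OpenBMC interface:

```python
from dataclasses import dataclass


@dataclass
class FaultEntry:
    timestamp: float   # seconds since epoch
    severity: int      # higher value = more severe
    first_error: bool  # first error of its kind since boot
    data: bytes        # raw fault information


class BoundedFaultLog:
    """Keeps at most max_bytes of entry data; on overflow, evicts the
    newest of the least-severe entries, never evicting first errors."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.entries: list[FaultEntry] = []

    def _used(self) -> int:
        return sum(len(e.data) for e in self.entries)

    def add(self, entry: FaultEntry) -> None:
        self.entries.append(entry)
        while self._used() > self.max_bytes:
            victims = [e for e in self.entries if not e.first_error]
            if not victims:
                break  # only first errors remain; keep them all
            # Lowest severity first; among equals, newest first.
            victim = min(victims, key=lambda e: (e.severity, -e.timestamp))
            self.entries.remove(victim)
```

The alternative stated above (guaranteeing space for the most recent couple of
hours) would instead size `max_bytes` from the worst-case expected logging rate.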
## Proposed Design

A generic fault monitor will be created to collect fault information. First we
discuss a few example use cases:

- On CATERR, the Host Error Monitor requests a crash dump (this is an
  existing capability). The crash dump includes chipset registers but doesn't
  include platform-specific system-level data. The fault monitor would
  therefore additionally collect system-level data such as clock, thermal,
  and power information. This information would be bundled, logged, and
  associated with the crash dump so that it could be post-processed by data
  center monitoring tools without having to join multiple data sources.

- The fault monitor would monitor link level retries and link retrainings of
  high speed serial links such as UPI links. This isn't typically monitored
  by the host kernel at runtime and the host kernel isn't able to log it
  during a crash. The fault monitor in the BMC could check link level retries
  and link retrainings during runtime by polling over PECI. If an MCERR or
  IERR occurred, the fault monitor could then add additional information such
  as high speed serial link statistics to error logs.

- In order to monitor memory out of band, a system could be configured to
  give the BMC exclusive access to memory error logging registers (to prevent
  the host kernel from being able to access and clear the registers before
  the BMC could collect the register data). For corrected memory errors, the
  fault monitor could log error registers either through polling or
  interrupts. Data center monitoring tools would use the logs to determine
  whether memory should be swapped or a machine should be removed from usage.

The fault monitor will not have its own dedicated OpenBMC repository, but will
consist of components incorporated into the existing repositories
host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add
functionality needed for the fault monitor. For instance, based on the needs
of the OEM, the fault monitor will register to be notified of D-Bus signals of
interest in order to be alerted when fault events occur. The fault monitor
will also poll registers of interest and log their values to the fault log
(described in more detail later). In addition, the host will be able to write
fault information to the fault log (via a POST (Create) request to its
corresponding Redfish log resource collection). When the fault monitor becomes
aware of a new fault occurrence through any of these ways, it may add fault
information to the fault log. The fault monitor may also gather relevant
sensor data (read via D-Bus from the dbus-sensors services) and add it to the
fault log, with a reference to the original fault event information. The
EventGroupId property in a Redfish LogEntry could potentially be used to
associate multiple log entries related to the same fault event.

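For illustration, a host-side client creating a fault log entry might POST a
standard Redfish LogEntry payload to the FaultLog entry collection and use
EventGroupId to tie related entries together. A sketch only: the BMC hostname
is hypothetical, the URI follows the design above, and the exact set of
properties accepted by bmcweb's Create handler is an assumption:

```python
import json
from urllib import request

# Hypothetical BMC address; the URI path follows the proposed design.
FAULTLOG_ENTRIES = (
    "https://bmc.example.com/redfish/v1/Systems/system"
    "/LogServices/FaultLog/Entries"
)


def build_fault_entry(message: str, severity: str, group_id: int) -> dict:
    """Build a Redfish LogEntry payload; entries sharing an EventGroupId
    can be correlated by clients as parts of the same fault event."""
    return {
        "EntryType": "Oem",
        "Severity": severity,  # "OK", "Warning", or "Critical"
        "Message": message,
        "EventGroupId": group_id,
    }


def prepare_post(payload: dict, url: str = FAULTLOG_ENTRIES) -> request.Request:
    """Prepare the POST (Create) request; a real client would urlopen()
    this with appropriate Redfish session authentication."""
    return request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```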
The fault log for storing relevant fault information (and exposing it to
external data center monitoring software) will be a new Redfish LogService
(/redfish/v1/Systems/system/LogServices/FaultLog) with
`OverWritePolicy=Unknown`, in order to implement custom overwrite rules such
as prioritizing retaining first and/or more severe faults. The back end
implementation of the fault log, including saving and managing log files, will
be added into the existing Phosphor Debug Collector repository with an
associated D-Bus object (e.g. xyz/openbmc_project/dump/faultlog) whose
interface will include methods for writing new data into the log, retrieving
data from the log, and clearing the log. The fault log will be implemented as
a new dump type in an existing Phosphor Debug Collector daemon (specifically
the one whose main() function is in dump_manager_main.cpp). The new fault log
would contain dump files that are collected in a variety of ways in a variety
of formats. A new fault log dump entry class (deriving from the "Entry" class
in dump_entry.hpp) would be defined with an additional "dump type" member
variable to identify the type of data that a fault log dump entry's
corresponding dump file contains.

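The actual entry class would be C++ in phosphor-debug-collector, deriving from
the Entry class in dump_entry.hpp; the Python sketch below only illustrates
the proposed shape of the data and the write/retrieve/clear interface of the
faultlog D-Bus object. All names here are illustrative, not real OpenBMC APIs:

```python
from dataclasses import dataclass
from enum import Enum, auto


class FaultDataType(Enum):
    """Illustrative dump types; the real set would be OEM-defined."""
    CRASHDUMP = auto()
    CORRECTABLE_ERROR = auto()
    LINK_STATISTICS = auto()


@dataclass
class FaultLogDumpEntry:
    entry_id: int
    timestamp: int            # epoch seconds
    file_path: str            # dump file location on the BMC
    dump_type: FaultDataType  # format of the underlying dump file


class FaultLogManager:
    """Mirrors the three proposed D-Bus methods: write, retrieve, clear."""

    def __init__(self):
        self._entries: dict[int, FaultLogDumpEntry] = {}
        self._next_id = 1

    def write(self, timestamp: int, path: str,
              dump_type: FaultDataType) -> int:
        entry = FaultLogDumpEntry(self._next_id, timestamp, path, dump_type)
        self._entries[entry.entry_id] = entry
        self._next_id += 1
        return entry.entry_id

    def retrieve(self, entry_id: int) -> FaultLogDumpEntry:
        return self._entries[entry_id]

    def clear(self) -> None:
        self._entries.clear()
```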
bmcweb will be used as the associated Redfish webserver for external entities
to read and write the fault log. Functionality for handling a POST (Create)
request to a Redfish log resource collection will be added in bmcweb. When
delivering a Redfish fault log entry to a Redfish client, large-sized fault
information (e.g. crashdumps) can be specified as an attachment sub-resource
(AdditionalDataURI) instead of being inlined. Redfish events (EventService
schema) will be used to send external notifications, such as when the fault
monitor needs to notify external data center monitoring software of new fault
information being available. Redfish events may also be used to notify the
host kernel and/or BIOS of any repair actions that need to be triggered based
on the latest fault information.

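As a sketch of the notification path, an external monitor would subscribe via
the standard Redfish EventService. The destination URL below is hypothetical,
and narrowing the subscription with OriginResources to the FaultLog service is
an assumption about how bmcweb would filter the events it delivers:

```python
def build_event_subscription(destination: str) -> dict:
    """Payload for POST /redfish/v1/EventService/Subscriptions asking the
    BMC to push events (e.g. new fault log entries) to the client's own
    HTTPS listener."""
    return {
        "Destination": destination,
        "Protocol": "Redfish",
        # Limit delivery to events originating from the fault log service.
        "OriginResources": [
            {"@odata.id": "/redfish/v1/Systems/system/LogServices/FaultLog"}
        ],
    }
```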
## Alternatives Considered

We considered adding the fault logs into the main system event log
(/redfish/v1/Systems/system/LogServices/EventLog) or other logs already
existing in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump,
/redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a
separate custom overwrite policy to ensure the most important information
(such as first errors and most severe errors) is retained for local analysis.

## Impacts

There may be situations where external consumers of fault monitor logs (e.g.
data center monitoring tools) are running software that is newer or older
than the version matching the BMC software running on a machine. In such
cases, consumers can ignore any types of fault information provided by the
fault monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect
little to no performance impact.

## Testing

Error injection mechanisms or simulations may be used to artificially create
error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do
intend to add unit testing for the fault monitor.