# Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020 Updated: Apr 12, 2022

## Problem Description

Some groups, for example a manufacturing team, have a requirement for the BMC
firmware to halt a system if an error log is created which calls out a piece of
hardware. The reason behind this is to ensure a system is not shipped to a
customer if it has any type of hardware issue. It also ensures that when an
error is found, it is identified quickly and all activity stops until the issue
is fixed. If the system has a hardware issue once shipped from manufacturing,
then the BMC firmware behavior should be to report the error, but allow the
system to continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

## Background and References

Within IBM, this function has been enabled/disabled by what are called
manufacturing flags. They were bits the user could set in registry variables
which the firmware would then query. These registry variables were only settable
by someone with admin authority to the system. These flags were not used outside
of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come
through the standard phosphor-logging interfaces (for example logs sent down by
the host). In these cases the system must still halt if those logs contain
hardware callouts.

[This email thread][1] on the topic was sent to the OpenBMC mailing list.

## Requirements

- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
  phosphor-logging log is created with an inventory callout
- The mechanism to enable/disable this feature does not need to be an external
  API (e.g. Redfish). It can simply be a busctl command one runs in an SSH
  session on the BMC
- The halt must be obvious to the user when it occurs
- The log which causes the halt must be identifiable
- The halt must only stop the chassis/host instance that encountered the error
- The halt must allow the host firmware the opportunity to gracefully shut
  itself down
- The halt must stop the host (run obmc-host-stop@X.target) associated with
  the error and attempt to leave the system in the fail state (i.e. chassis
  power remains on if it is on)
- The chassis/host instance pair will not be allowed to power on until the log
  that caused the halt is resolved or deleted
  - A BMC reset will clear this power-on prevention
- Ensure the mechanism used to halt firmware on inventory callouts can also be
  utilized by phosphor-logging extensions to halt firmware for other causes
  - These causes will be defined within the extension documentation
- Quiesce the associated host during this failure

**Special Note:** Initially the associated host and chassis will be hard coded
to chassis0 and host0. More work throughout the BMC stack is required to handle
multiple chassis and hosts. This design allows that type of feature to be
enabled at a later time.

## Proposed Design

Create a [phosphor-settingsd][2] setting,
`xyz.openbmc_project.Logging.Settings`. Within this, create a boolean property
called `QuiesceOnHwError`. This property will be hosted under the
`xyz.openbmc_project.Settings` service.

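
As a rough illustration of the "busctl command one runs" requirement, the sketch
below builds (but does not execute) the `busctl` command lines for writing and
reading the property. The service, interface, and property names come from this
design; the object path `/xyz/openbmc_project/logging/settings` and the helper
names are assumptions made for the example and should be checked against the
settings configuration on a real system.

```python
import shlex

SETTINGS_SERVICE = "xyz.openbmc_project.Settings"
# Assumed object path; this design does not specify where the setting lives.
SETTINGS_PATH = "/xyz/openbmc_project/logging/settings"
SETTINGS_IFACE = "xyz.openbmc_project.Logging.Settings"
PROPERTY = "QuiesceOnHwError"


def set_quiesce_command(enabled):
    """Return the busctl argv that writes the boolean property."""
    return [
        "busctl", "set-property",
        SETTINGS_SERVICE, SETTINGS_PATH, SETTINGS_IFACE,
        PROPERTY, "b", "true" if enabled else "false",
    ]


def get_quiesce_command():
    """Return the busctl argv that reads the property back."""
    return [
        "busctl", "get-property",
        SETTINGS_SERVICE, SETTINGS_PATH, SETTINGS_IFACE, PROPERTY,
    ]


if __name__ == "__main__":
    # Print the command lines so they can be pasted into a shell on the BMC.
    print(shlex.join(set_quiesce_command(True)))
    print(shlex.join(get_quiesce_command()))
```

Building the argument list separately from running it keeps the example safe to
execute anywhere and makes the D-Bus type signature (`b` for boolean) explicit.
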
Define a new D-Bus interface which will indicate an error has been created which
will prevent the boot of a chassis/host instance:
`xyz.openbmc_project.Logging.ErrorBlocksTransition`

This interface will be hosted under an instance-based D-Bus object
`/xyz/openbmc_project/logging/blockX` where X is the instance of the
chassis/host pair being blocked.

When an error is created via a phosphor-logging interface, the software will
check to see if the error has a callout, and if so it will check the new
`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` property. If this is
true then phosphor-logging will create a `/xyz/openbmc_project/logging/blockX`
D-Bus object with the `xyz.openbmc_project.Logging.ErrorBlocksTransition`
interface on it. A mapper [association][3] between the log and this new D-Bus
object will be created. The corresponding host instance will be put in quiesce
by phosphor-logging.

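
The decision flow described above can be sketched as follows. This is a
simplified model under stated assumptions, not the phosphor-logging
implementation: `ErrorLog`, `Block`, and `check_for_block` are hypothetical
names, the mapper association is represented by a plain `log_path` field, and
the example entry and inventory paths are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_IFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"


@dataclass
class ErrorLog:
    """Hypothetical stand-in for a phosphor-logging entry."""
    entry_path: str
    callouts: List[str] = field(default_factory=list)


@dataclass
class Block:
    """Hypothetical model of the blocking D-Bus object."""
    path: str
    interface: str
    log_path: str  # stands in for the mapper association to the log


def block_path(instance: int) -> str:
    """Object path for the chassis/host instance being blocked."""
    return f"/xyz/openbmc_project/logging/block{instance}"


def check_for_block(log: ErrorLog, quiesce_on_hw_error: bool,
                    instance: int = 0) -> Optional[Block]:
    """Return the block to create, or None if the boot may continue."""
    if not log.callouts:
        return None  # no callout, so the setting is never consulted
    if not quiesce_on_hw_error:
        return None  # feature disabled: report the error but keep booting
    return Block(block_path(instance), BLOCK_IFACE, log.entry_path)


if __name__ == "__main__":
    log = ErrorLog("/xyz/openbmc_project/logging/entry/1",
                   ["/xyz/openbmc_project/inventory/system/chassis"])
    print(check_for_block(log, quiesce_on_hw_error=True))
```

The `instance` argument defaults to 0 to mirror the initial hard coding to
chassis0 and host0 noted in the requirements.
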
The blocked state can be exited by rebooting the BMC or clearing the log
responsible for the blocking. Other system-specific policies could be placed in
the appropriate targets (for example, if a chassis power off should clear the
block).

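
A minimal sketch of the block lifecycle just described: a block prevents power
on until its log is cleared, and a BMC reboot drops every block. `BlockTracker`
and its methods are hypothetical, the log paths are placeholders, and allowing
more than one blocking log per instance is an assumption this design does not
address.

```python
from collections import defaultdict


class BlockTracker:
    """Hypothetical in-memory model of which logs block which instance."""

    def __init__(self):
        # instance number -> set of log paths currently blocking it
        self._blocking = defaultdict(set)

    def add_block(self, instance, log_path):
        self._blocking[instance].add(log_path)

    def log_resolved_or_deleted(self, log_path):
        # Clearing the responsible log removes the block it created.
        for logs in self._blocking.values():
            logs.discard(log_path)

    def bmc_reset(self):
        # The design states a BMC reset clears the power-on prevention;
        # modeled here by dropping all tracked state.
        self._blocking.clear()

    def power_on_allowed(self, instance):
        return not self._blocking.get(instance)


if __name__ == "__main__":
    tracker = BlockTracker()
    tracker.add_block(0, "/example/log/1")
    print(tracker.power_on_allowed(0))
    tracker.log_resolved_or_deleted("/example/log/1")
    print(tracker.power_on_allowed(0))
```
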
See the phosphor-logging [callout][4] design for more information on callouts.

A new `obmc-host-graceful-quiesce@.target` systemd target will be started. This
new target will ensure a graceful shutdown of the host is initiated and then
start the `obmc-host-quiesce@.target` which will stop the host and move the host
state to Quiesced.

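
Both targets are systemd template units, so the host instance number is filled
in after the `@`. The helper below is a small sketch of that naming step and of
the order in which the two targets are reached; `instantiate` and
`quiesce_sequence` are hypothetical names, and nothing here talks to systemd.

```python
GRACEFUL_TARGET = "obmc-host-graceful-quiesce@.target"
QUIESCE_TARGET = "obmc-host-quiesce@.target"


def instantiate(template, instance):
    """Turn a systemd template unit name into a concrete instance name."""
    prefix, sep, suffix = template.partition("@.")
    if not sep:
        raise ValueError(f"not a template unit: {template}")
    return f"{prefix}@{instance}.{suffix}"


def quiesce_sequence(instance=0):
    """Units reached, in order, when a host instance is halted."""
    return [
        instantiate(GRACEFUL_TARGET, instance),  # graceful shutdown first
        instantiate(QUIESCE_TARGET, instance),   # then stop and mark Quiesced
    ]


if __name__ == "__main__":
    for unit in quiesce_sequence(0):
        print(unit)
```
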
obmcutil will be enhanced to look for these block interfaces and notify the user
via the `obmcutil state` command if a block is enabled and what log is
associated with it.

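
The lookup such a tool needs can be sketched as below: find every object that
implements the block interface and report the log associated with it. The input
dictionaries are hypothetical stand-ins for mapper query results, and the
message format is invented for the example; it is not the actual
`obmcutil state` output.

```python
BLOCK_IFACE = "xyz.openbmc_project.Logging.ErrorBlocksTransition"


def describe_blocks(objects, associations):
    """Return one line per active block.

    objects:      object path -> list of interfaces on that object
    associations: block object path -> path of the log that caused it
    """
    lines = []
    for path in sorted(objects):
        if BLOCK_IFACE in objects[path]:
            log = associations.get(path, "unknown")
            lines.append(f"Block active: {path} (log: {log})")
    return lines


if __name__ == "__main__":
    # Illustrative data only; the entry path is a placeholder.
    objects = {
        "/xyz/openbmc_project/logging/block0": [BLOCK_IFACE],
        "/xyz/openbmc_project/logging/entry/5": ["example.Entry"],
    }
    associations = {
        "/xyz/openbmc_project/logging/block0":
            "/xyz/openbmc_project/logging/entry/5",
    }
    for line in describe_blocks(objects, associations):
        print(line)
```
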
The goal is to build upon this concept when future design work is done to allow
developers to associate certain error logs with causing a halt to the system
until a log is handled.

## Host Errors

In certain scenarios, it is desirable to also halt the boot, and prevent the
system from rebooting, when the host sends down certain errors to the BMC.

These errors may be of SEL format, or may be OEM specific, such as the [PEL
format][5] used by IBM.

The interfaces provided within phosphor-logging to handle the hardware callout
scenarios can be repurposed for this use case.

## Alternatives Considered

Currently this feature is a part of the base phosphor-logging design. If no one
other than IBM sees value, we could roll this into the PEL-specific portion of
phosphor-logging.

## Impacts

This will require some additional checking on reported logs but should have
minimal overhead.

There will be no changes to system behavior unless a user turns on this new
setting.

## Testing

Unit tests will exercise the logic that detects error logs with callouts and
will cover both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle it
appropriately.

[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
[2]: https://github.com/openbmc/phosphor-settingsd
[3]:
  https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
[4]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Common/Callout/README.md
[5]:
  https://github.com/openbmc/phosphor-logging/blob/master/extensions/openpower-pels/README.md