# BMC Service Failure Debug and Recovery
Author: Andrew Jeffery <andrew@aj.id.au> @arj

Other contributors: Andrew Geissler <geissonator@yahoo.com> @geissonator

Created: 6th May 2021

## Problem Description
The capability to debug critical failures of the BMC firmware is essential to
meet the reliability and serviceability claims made for some platforms.

This design addresses a few classes of failures:

- A class of failure exists where a BMC service has entered a failed state but
  the BMC is still operational in a degraded mode.
- A class of failure exists under which we can attempt debug data collection
  despite being unable to communicate with the BMC via standard protocols.

This proposal argues for and proposes a software-driven debug data capture and
recovery of a failed BMC.

## Background and References
By necessity, BMCs are not self-contained systems. BMCs exist to service the
needs of both the host system, by providing in-band platform services such as
thermal and power management, and system operators, by providing out-of-band
system management interfaces such as error reporting, platform telemetry and
firmware management.

As such, failures of BMC subsystems may impact external consumers.

The BMC firmware stack is not trivial, in the sense that common implementations
are usually domain-specific Linux distributions with complex or highly coupled
relationships to platform subsystems.

Complexity and coupling drive concern around the risk of critical failures in
the BMC firmware. The BMC firmware design should provide for resilience and
recovery in the face of well-defined error conditions, but the need to mitigate
ill-defined error conditions or unintended software states remains.

The ability for a system to recover in the face of an error condition depends on
its ability to detect the failure. Thus, error conditions can be assigned to
various classes based on the ability to externally observe the error:
1. Continued operation: The service detects the error and performs the actions
   required to return to its operating state

2. Graceful exit: The service detects an error it cannot recover from, but
   gracefully cleans up its resources before exiting with an appropriate exit
   status

3. Crash: The service detects it is in an unintended software state and exits
   immediately, failing to gracefully clean up its resources

4. Unresponsive: The service fails to detect it cannot make progress and
   continues to run but is unresponsive

As the state transformations to enter the ill-defined or unintended software
state are unanticipated, the actions required to gracefully return to an
expected state are also not well defined. The general approaches to recover a
system or service to a known state in the face of entering an unknown state are:
1. Restart the affected service
2. Restart the affected set of services
3. Restart all services

In the face of continued operation due to internal recovery a service restart is
unnecessary, while in the case of an unresponsive service the need to restart
cannot be detected by service state alone. Implementation of resiliency by way
of service restarts via a service manager is only possible in the face of a
graceful exit or application crash. Handling of services that have entered an
unresponsive state can only begin upon receiving external input.

Like error conditions, services exposed by the BMC can be divided into several
external interface classes:
1. Providers of platform data
2. Providers of platform data transports

Examples of the first are applications that expose various platform sensors or
provide data about the firmware itself. Failure of the first class of
applications usually yields a system that can continue to operate in a reduced
capacity.

Examples of the second are the operating system itself and applications that
implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
covers implementation-specific data transports such as D-Bus, which requires a
broker service. Failure of a platform data transport may result in one or all
external interfaces becoming unresponsive and be viewed as a critical failure of
the BMC.

Like error conditions and services, the BMC's external interfaces can be divided
into several classes:
1. Out-of-band interfaces: Remote, operator-driven platform management
2. In-band interfaces: Local, host-firmware-driven platform management

Failures of platform data transports generally leave out-of-band interfaces
unresponsive to the point that the BMC cannot be recovered except via external
means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
the host can detect the BMC is unresponsive on the in-band interface(s), an
appropriate platform design can enable the host to reset the BMC without
disrupting its own operation.

### Analysis of eBMC Error State Management and Mitigation Mechanisms
Assessing OpenBMC userspace with respect to the error classes outlined above,
the system manages and mitigates error conditions as follows:

| Condition           | Mechanism                                           |
| ------------------- | --------------------------------------------------- |
| Continued operation | Application-specific error handling                 |
| Graceful exit       | Application-specific error handling                 |
| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
| Unresponsive        | None                                                |

These mechanisms inform systemd (the service manager) of an event, which it
handles according to the restart policy encoded in the unit file for the
service.

OpenBMC has a default behavior for all systemd services. That default is to
allow an OpenBMC systemd service to restart twice every 30 seconds. If a service
restarts more than twice within 30 seconds then that service will be considered
to be in a failed state by systemd and not restarted again until a BMC reboot.
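As an illustrative sketch, that policy corresponds to systemd settings along the
lines of the drop-in below; the unit name and drop-in path are hypothetical, and
the authoritative values live in the OpenBMC build's shared unit configuration.

```sh
# Illustrative drop-in only: the service name is made up, but these are the
# standard systemd directives for "restart on failure, at most twice in any
# 30-second window, after which the unit is left in the failed state".
mkdir -p /etc/systemd/system/phosphor-example.service.d
cat > /etc/systemd/system/phosphor-example.service.d/10-restart-policy.conf <<'EOF'
[Unit]
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
Restart=on-failure
EOF
systemctl daemon-reload
```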
Assessing the OpenBMC operating system with respect to the error classes, it
manages and mitigates error conditions as follows:

| Condition           | Mechanism                              |
| ------------------- | -------------------------------------- |
| Continued operation | ramoops, ftrace, `printk()`            |
| Graceful exit       | System reboot                          |
| Crash               | kdump or ramoops                       |
| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |

Crash conditions in the Linux kernel trigger panics, which are handled by kdump
(though may be handled by ramoops until kdump support is integrated). Kernel
lockup conditions can be configured to trigger panics, which in turn trigger
either ramoops or kdump.
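For reference, the lockup-to-panic behaviour is controlled by standard kernel
sysctls; whether they are enabled by default is a platform decision, so the
commands below are illustrative rather than a statement of OpenBMC's shipped
configuration.

```sh
# Example only: panic on detected lockups so that ramoops (or kdump, once
# integrated) captures debug data across the reset.
sysctl -w kernel.softlockup_panic=1
sysctl -w kernel.hardlockup_panic=1
# Persistent equivalents would normally be baked into the BMC image, for
# example via /etc/sysctl.d/ or the kernel command line.
```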
### Synthesis

In the context of the information above, handling of application lock-up error
conditions is not provided. For applications in the platform-data-provider class
of external interfaces, the system will continue to operate with reduced
functionality. For applications in the platform-data-transport-provider class,
this represents a critical failure of the firmware that must have accompanying
debug data.

## Handling platform-data-transport-provider failures

### Requirements

#### Recovery Mechanisms
The ability for external consumers to control the recovery behaviour of BMC
services is usually coarse; the nuanced handling is left to the BMC
implementation. Where available, the options for external consumers tend to be,
in ascending order of severity:

| Severity | BMC Recovery Mechanism  | Used for                                                               |
| -------- | ----------------------- | ---------------------------------------------------------------------- |
| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures  |
| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers           |
| 3        | External hardware reset | Unresponsive operating system                                          |

Of course it's not possible to issue these requests over interfaces that are
unresponsive. A robust platform design should be capable of issuing all three
restart requests over separate interfaces to minimise the impact of any one
interface becoming unresponsive. Further, the more severe the reset type, the
fewer dependencies should be in its execution path.

Given the out-of-band path is often limited to just the network, it's not
feasible for the BMC to provide any of the above in the event of some kind of
network or relevant data transport failure. The considerations here are
therefore limited to recovery of unresponsive in-band interfaces.

The need to escalate above mechanism 1 should come with data that captures why
it was necessary, i.e. dumps for services that failed in the path for 1.
However, by escalating straight to 3, the BMC will necessarily miss out on
capturing a debug dump because there is no opportunity for software to intervene
in the reset. Therefore, mechanism 2 should exist in the system design and its
implementation should capture any appropriate data needed to debug the need to
reboot and the inability to execute on approach 1.

The need to escalate to 3 would indicate that the BMC's own mechanisms for
detecting a kernel lockup have failed. Had they not failed, we would have
ramoops or kdump data to analyse. As data cannot be captured with an escalation
to method 3, the need to invoke it will require its own specialised debug
experience. Given this and the kernel's own lockup detection and data collection
mechanism, support for 2 can be implemented in BMC userspace.

Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
or PLDM. In order to avoid these in the implementation of mechanism 2, the host
needs an interface to the BMC that is dedicated to the role of BMC recovery,
with minimal dependencies on the BMC side for initiating the dump collection and
reboot. At its core, all that is needed is the ability to trigger a BMC IRQ,
which could be as simple as monitoring a GPIO.
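For a sense of how small that can be, a GPIO-based trigger might look like the
following sketch, assuming libgpiod v1 command-line tools and an arbitrary,
purely illustrative chip and line offset.

```sh
# Hypothetical example: block until a single rising edge on a dedicated
# "host requests BMC debug" line, then crash the kernel so the configured
# dump mechanism runs. The chip name and line offset are made up.
gpiomon --num-events=1 --rising-edge gpiochip0 7 && echo c > /proc/sysrq-trigger
```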
#### Behavioural Requirements for Recovery Mechanism 2

The system behaviour requirement for the mechanism is:

1. The BMC executes collection of debug data and then reboots once it observes a
   recovery message from the host

It's desirable that:

1. The host has some indication that the recovery process has been activated
2. The host has some indication that a BMC reset has taken place

It's necessary that:

1. The host make use of a timeout to escalate to recovery mechanism 3 as it's
   possible the BMC will be unresponsive to recovery mechanism 2

#### Analysis of BMC Recovery Mechanisms for Power10 Platforms
The implementation of recovery mechanism 1 is already accounted for in the
in-band protocols between the host and the BMC and so is considered resolved for
the purpose of the discussion.

To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
driven by the host to the BMC's EXTRST pin. If the host firmware detects that
the BMC has become unresponsive to its escalating recovery requests, it can
drive the hardware to forcefully reset the BMC.

However, host-side GPIOs are in short supply, and we do not have a dedicated pin
to implement recovery mechanism 2 in the platform designs.

#### Analysis of Implementation Methods on Power10 Platforms
The implementation of recovery mechanism 2 is limited to using existing
interfaces between the host and the BMC. These largely consist of:

1. FSI
2. LPC
3. PCIe

FSI is inappropriate because the host is the peripheral in its relationship with
the BMC. If the BMC has become unresponsive, it is possible it's in a state
where it would not accept FSI traffic (which it needs to drive in the first
place) and we would need a mechanism architected into FSI for the BMC to
recognise it is in a bad state. PCIe and LPC are preferable by comparison as the
BMC is the peripheral in this relationship, with the host driving cycles into it
over either interface. Comparatively, PCIe is more complex than LPC, so an
LPC-based approach is preferred.

The host already makes use of several LPC peripherals exposed from the BMC:

1. Mapped LPC FW cycles
2. iBT for IPMI
3. The VUARTs for system and debug consoles
4. A KCS device for a vendor-defined MCTP LPC binding

The host could take advantage of any of the following LPC peripherals for
implementing recovery mechanism 2:
1. The SuperIO-based iLPC2AHB bridge
2. The LPC mailbox
3. An otherwise unused KCS device

In ASPEED BMC SoCs prior to the AST2600, the LPC mailbox required configuration
via the SuperIO device, which exposes the unrestricted iLPC2AHB backdoor into
the BMC's physical address space. The iLPC2AHB capability could not be mitigated
without disabling SuperIO support entirely, and so the ability to use the
mailbox went with it. This security issue is resolved in the AST2600 design, so
the mailbox could be used in the Power10 platforms, but we have lower-complexity
alternatives for generating an IRQ on the BMC. We could use the iLPC2AHB from
the host to drive one of the watchdogs in the BMC to trigger a reset, but this
exposes a stability risk due to the unrestricted power of the interface, let
alone the security implications, and like the mailbox it is more complex than
the alternatives.

This draws us towards the use of a KCS device, which is best aligned with the
simple need of generating an IRQ on the BMC. The AST2600 has at least 4 KCS
devices, of which one is already in use for IBM's vendor-defined MCTP LPC
binding, leaving at least 3 from which to choose.

### Proposed Design
The proposed design is for a simple daemon started at BMC boot to invoke the
desired crash dump handler according to the system policy upon receiving the
external signal. The implementation should have no IPC dependencies or
interactions with `init`, as the reason for invoking the recovery mechanism is
unknown and any of these interfaces might be unresponsive.

A trivial implementation of the daemon is:
```sh
dd if=$path bs=1 count=1
echo c > /proc/sysrq-trigger
```

For systems with kdump enabled, this will result in a kernel crash dump
collection and the BMC being rebooted.

A more elegant implementation might be to invoke `kexec` directly, but this
requires that the support is already available on the platform.
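As a sketch of what that support involves, a dump-capture kernel can be
preloaded with kexec so that the sysrq-triggered crash above reboots into the
capture environment; the image names and command line here are assumptions for
illustration, not part of this design.

```sh
# Hypothetical kdump-style setup: preload a crash kernel at boot so a later
# `echo c > /proc/sysrq-trigger` (or any panic) lands in the dump-capture
# kernel instead of losing state.
kexec --load-panic /boot/zImage-dump \
    --initrd=/boot/initramfs-dump.cpio.gz \
    --append="console=ttyS4,115200n8 maxcpus=1 reset_devices"
```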
Other activities in userspace might be feasible if it can be assumed that
whatever failure has occurred will not prevent debug data collection, but no
statement about this can be made in general.

#### An Idealised KCS-based Protocol for Power10 Platforms
The proposed implementation provides for both the required and desired
behaviours outlined in the requirements section above.

The host and BMC protocol operates as follows, starting with the BMC application
invoked during the boot process:
1. Set the `Ready` bit in STR

2. Wait for an `IBF` interrupt

3. Read `IDR`. The hardware clears IBF as a result

4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
   collection process and reboot. Otherwise,

5. Go to step 2.

On the host:
1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
   Otherwise,

2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,

3. Start an escalation timer

4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
   sets IBF as a result

5. If `IBF` clears before expiry, restart the escalation timer

6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
   before expiry, restart the escalation timer

7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
   is complete. Otherwise,

8. Escalate to recovery mechanism 3 if the escalation timer expires at any point

A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
implementation is not required to emit one and the host implementation must
behave correctly without one. Recovery is only necessary if other paths have
failed, so STR can be read by the host when it decides recovery is required, and
read by time-based polling thereafter.

The host must be prepared to handle LPC SYNC errors when accessing the KCS
device IO addresses, particularly "No Response" aborts. It is not guaranteed
that the KCS device will remain available during BMC resets.

As STR is polled by the host it's not necessary for the BMC to write to ODR. The
protocol only requires the host to write to IDR and periodically poll STR for
changes to IBF and Ready state. This removes bi-directional dependencies.

The uni-directional writes and the lack of SerIRQ reduce the features required
for correct operation of the protocol and thus the surface area for failure of
the recovery protocol.

The layout of the KCS Status Register (STR) is as follows:

| Bit | Owner    | Definition               |
| --- | -------- | ------------------------ |
| 7   | Software |                          |
| 6   | Software |                          |
| 5   | Software |                          |
| 4   | Software | Ready                    |
| 3   | Hardware | Command / Data           |
| 2   | Software |                          |
| 1   | Hardware | Input Buffer Full (IBF)  |
| 0   | Hardware | Output Buffer Full (OBF) |

#### A Real-World Implementation of the KCS Protocol for Power10 Platforms
Implementing the protocol described above in userspace is challenging due to
available kernel interfaces[1], and implementing the behaviour in the kernel
falls afoul of the de facto "mechanism, not policy" rule of kernel support.

Realistically, on the host side the only requirements are the use of a timer and
writing the appropriate value to the Input Data Register (IDR). All the proposed
status bits can be ignored. With this in mind, the BMC's implementation can be
reduced to reading an appropriate value from IDR. Reducing requirements on the
BMC's behaviour in this way allows the use of the `serio_raw` driver (which has
the restriction that userspace can't access the status value).
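Under those reduced requirements the BMC-side daemon collapses to something like
the sketch below, where the `serio_raw` device node is an assumption for
illustration; the concrete daemon is the prototype linked in the following
section.

```sh
# Minimal sketch, assuming the KCS data register is exposed through serio_raw
# as /dev/serio_raw0 (the node name is illustrative). Each host write to IDR
# yields one byte; 0x44 ('D' for "Debug") triggers the crash/dump path.
path=/dev/serio_raw0
while byte="$(dd if="$path" bs=1 count=1 2>/dev/null | od -An -tx1 | tr -d ' ')"; do
    [ "$byte" = "44" ] && echo c > /proc/sysrq-trigger
done
```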
[1]
https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
#### Prototype Implementation Supporting Power10 Platforms

A concrete implementation of the proposal's userspace daemon is available on
GitHub:

https://github.com/amboar/debug-trigger/

Deployment requires additional kernel support in the form of patches at [2].

[2]
https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw

### Alternatives Considered
See the discussion in Background.

### Impacts

The proposal has some security implications. The mechanism provides an
unauthenticated means for the host firmware to crash and/or reboot the BMC,
which can itself become a concern for stability and availability. Use of this
feature requires that the host firmware is trusted, that is, that the host and
BMC firmware must be in the same trust domain. If a platform concept requires
that the BMC and host firmware remain in disjoint trust domains then this
feature must not be provided by the BMC.

As the feature might provide surprising system behaviour, there is an impact on
documentation for systems deploying this design: The mechanism must be
documented in such a way that rebooting the BMC in these circumstances isn't
surprising.

Developers are impacted in the sense that they may have access to better debug
data than might otherwise be possible. There are no obvious developer-specific
drawbacks.

Due to simplicity being a design-point of the proposal, there are no significant
API, performance or upgradability impacts.

### Testing
Generally, testing this feature requires complex interactions with host firmware
and platform-specific mechanisms for triggering the reboot behaviour.

For Power10 platforms this feature may be safely tested under QEMU by scripting
the monitor to inject values on the appropriate KCS device. Implementing this
for automated testing may need explicit support in CI.

## Handling platform-data-provider failures
### Requirements

As noted above, these types of failures usually yield a system that can continue
to operate in a reduced capacity. The desired behavior in this scenario can vary
from system to system so the requirements in this area need to be flexible
enough to allow system owners to configure their desired behavior.

The requirements for OpenBMC when a platform-data-provider service enters a
failure state are that the BMC:

- Logs an error indicating a service has failed
- Collects a BMC dump
- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Allows system owners to customize other behaviors (e.g. a BMC reboot)

### Proposed Design

This will build upon the existing [target-fail-monitoring][1] design. The
monitor service will be enhanced to also take JSON file(s) which list critical
services to monitor.
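Such a configuration file might look like the sketch below; the path, schema and
service names are illustrative assumptions, with the authoritative schema
belonging to the monitor implementation.

```sh
# Hypothetical example of a critical-services list consumed by the monitor;
# the file location, key name and unit names are illustrative only.
cat > /usr/share/phosphor-state-manager/critical-services.json <<'EOF'
{
    "services": [
        "xyz.openbmc_project.ObjectMapper.service",
        "xyz.openbmc_project.EntityManager.service"
    ]
}
EOF
```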
Define an "obmc-bmc-service-quiesce.target". System owners can install any other
services they wish in this new target.
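A minimal sketch of the target unit itself follows; only the unit name comes
from this design, and the description text is illustrative.

```sh
# Sketch of the new target; additional services would be pulled in by their
# own [Install] WantedBy=obmc-bmc-service-quiesce.target settings.
cat > /lib/systemd/system/obmc-bmc-service-quiesce.target <<'EOF'
[Unit]
Description=BMC quiesced due to a critical service failure
EOF
```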
phosphor-bmc-state-manager will monitor this target and enter a `Quiesced` state
when it is started. This state will be reported externally via the Redfish API
under the redfish/v1/Managers/bmc status property.
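On D-Bus the same condition would be visible through the existing BMC state
object; a hedged example of inspecting it, assuming the standard
phosphor-state-manager service and object names, is:

```sh
# Query the current BMC state; under this proposal a failed critical service
# is reflected as a Quiesced value of CurrentBMCState.
busctl get-property xyz.openbmc_project.State.BMC \
    /xyz/openbmc_project/state/bmc0 \
    xyz.openbmc_project.State.BMC CurrentBMCState
```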
This would look like the following:

- In a services-to-monitor configuration file, add all critical services
- The state-manager service-monitor will subscribe to signals for service
  failures and do the following when one fails from within the configuration
  file:
  - Log error with service failure information
  - Request a BMC dump
  - Start obmc-bmc-service-quiesce.target
- BMC state manager detects obmc-bmc-service-quiesce.target has started and puts
  the BMC state into Quiesced
- bmcweb looks at BMC state to return appropriate state to external clients

[1]:
https://github.com/openbmc/docs/blob/master/designs/target-fail-monitoring.md
### Alternatives Considered

One simpler option would be to just have the OnFailure result in a BMC reboot
but historically this has caused more problems than it solves:

- Rarely does a BMC reboot fix a service that was not fixed by simply restarting
  it.
- A BMC that continuously reboots itself due to a service failure is very
  difficult to debug.
- Some BMCs only allow a certain number of reboots, after which the BMC ends up
  stuck in the boot loader. The boot loader is inaccessible unless special debug
  cables are available, so for all intents and purposes the system is now
  unusable.

### Impacts
Currently nothing happens when a service enters the failed state. The changes
proposed in this document will ensure an error is logged, a dump is collected,
and the external BMC state reflects the failure when this occurs.

### Testing

A variety of services should be put into the failed state. The tester should
ensure the appropriate error is logged, a dump is collected, and the BMC state
is changed to reflect this.