# Design proposal for issuing NMI on servers that use OpenBMC

Author: Lakshminarayana Kammath

Other contributors: Jayanth Othayoth

Created: 2019-05-21

## Problem Description
Currently, servers that use OpenBMC cannot capture relevant debug data when the
host is unresponsive or hung. These systems need a way to diagnose the root
cause of a hang and to recover the host with the collected debug data
preserved.
## Background and References
In customer environments and labs, the host sometimes becomes unresponsive,
causing a system hang (https://github.com/ibm-openbmc/dev/issues/457). In that
state there is no way to determine what went wrong on the host, and the system
has to be recovered without any relevant debug data.

Whether the host is unresponsive or still running, the admin needs a way to
trigger an NMI (non-maskable interrupt) event, which in turn triggers an
architecture-dependent procedure that fires an interrupt on all available
processors in the system.
## Proposed Design for servers that use OpenBMC
This proposal aims to trigger an NMI, which in turn invokes an
architecture-specific procedure that collects debug data and then recovers the
host. This lets host/OS development teams analyze and fix issues that cause the
host to hang or become unresponsive.
### D-Bus
Introduce a new D-Bus interface in the Control.Host namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
and implement a D-Bus back-end for each processor-specific target.
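
The shape of such an interface, one action with a back-end per processor target, can be sketched in Python as follows. This is only an illustration under assumptions: the class names, the `nmi` method, and the `select_backend` helper are hypothetical and are not taken from phosphor-dbus-interfaces; only the interface name comes from this proposal.

```python
from abc import ABC, abstractmethod
from typing import Dict

# Interface name taken from this proposal; everything else is hypothetical.
NMI_INTERFACE = "xyz.openbmc_project.Control.Host.NMI"


class NmiBackend(ABC):
    """Models the proposed interface: a single action with no arguments."""

    @abstractmethod
    def nmi(self) -> None:
        """Fire a non-maskable interrupt on all host processors."""


class RecordingBackend(NmiBackend):
    """Stand-in for a processor-specific back-end; it only counts calls."""

    def __init__(self, arch: str) -> None:
        self.arch = arch
        self.calls = 0

    def nmi(self) -> None:
        # A real back-end would drive the architecture-specific mechanism here.
        self.calls += 1


def select_backend(arch: str, registry: Dict[str, NmiBackend]) -> NmiBackend:
    """Pick the back-end registered for the given processor architecture."""
    try:
        return registry[arch]
    except KeyError:
        raise ValueError(f"no NMI back-end registered for {arch!r}") from None
```

Keeping the interface to one parameterless action reflects the idea that an NMI is a request against the host rather than a state to be tracked.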
### BMC Support
Enable the NMI phosphor D-Bus interface and expose it through Redfish.
### Redfish Schema used
- Reference: DSP2046 2018.3
- The ComputerSystem 1.6.0 schema provides an action called
  #ComputerSystem.Reset, which is used to reset the system. The ResetType
  parameter indicates the type of reset to perform. In this case, we can use
  the Nmi type:
  - Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86 systems) to
    cease normal operations, perform diagnostic actions and typically halt the
    system.
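
As a rough illustration of how a client might select the reset type, the Python sketch below validates a ResetType value before building the action parameters. The `reset_parameters` helper is hypothetical, and the set of allowed values is an assumption based on the Redfish ResetType enumeration of that era; it should be checked against the schema version actually in use.

```python
from typing import Dict

# Assumed set of ResetType values; verify against the Redfish schema in use.
RESET_TYPES = frozenset({
    "On", "ForceOff", "GracefulShutdown", "GracefulRestart",
    "ForceRestart", "Nmi", "ForceOn", "PushPowerButton", "PowerCycle",
})


def reset_parameters(reset_type: str) -> Dict[str, str]:
    """Return the ResetType parameter for a #ComputerSystem.Reset action."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported ResetType: {reset_type!r}")
    return {"ResetType": reset_type}
```

Note that the value is case-sensitive in this sketch: `reset_parameters("Nmi")` is accepted, while `"NMI"` is rejected.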
## High-Level Flow
1. The host/OS is hung or unresponsive, or one needs to take a kernel dump to
   debug an error condition.
2. The admin/user sends a POST to the Redfish ComputerSystem.Reset action URI
   with the Action and ResetType properties set to
   {"Action":"ComputerSystem.Reset","ResetType":"Nmi"} to trigger the NMI.
3. The Redfish handler invokes the D-Bus NMI back-end, which uses an
   architecture-specific implementation of
   xyz.openbmc_project.Control.Host.NMI to trigger an NMI on all processors in
   the system.
4. On receiving the NMI, the host automatically invokes architecture-specific
   actions. One such action could be invoking kdump followed by a reboot.

- Note: An NMI can be sent to the host in any state, not just an unresponsive
  one.
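
Steps 2 and 3 of the flow above can be sketched in Python as a handler that parses the POST body and forwards the request to a D-Bus call. This is a sketch under stated assumptions, not the actual Redfish server implementation: the object path, the method name, the status codes, and the `handle_reset_action` function are hypothetical, and the D-Bus call is injected as a plain callable so the example runs without a bus.

```python
import json
from typing import Callable

# Request body as given in step 2 of the flow.
REQUEST_BODY = '{"Action":"ComputerSystem.Reset","ResetType":"Nmi"}'

# Interface name from step 3; the object path and method name are assumptions.
NMI_INTERFACE = "xyz.openbmc_project.Control.Host.NMI"
NMI_OBJECT_PATH = "/xyz/openbmc_project/control/host0/nmi"  # assumed
NMI_METHOD = "NMI"  # assumed

# (object path, interface, method) -> None; stands in for a real D-Bus call.
DbusCall = Callable[[str, str, str], None]


def handle_reset_action(body: str, dbus_call: DbusCall) -> int:
    """Handle a Reset POST and return an HTTP-style status code."""
    try:
        request = json.loads(body)
    except json.JSONDecodeError:
        return 400
    if not isinstance(request, dict) or request.get("ResetType") != "Nmi":
        # Other reset types are outside the scope of this sketch.
        return 400
    # Step 3: hand the request to the architecture-specific D-Bus back-end.
    dbus_call(NMI_OBJECT_PATH, NMI_INTERFACE, NMI_METHOD)
    return 204
```

Injecting the call keeps the Redfish-facing logic independent of the back-end, which mirrors the split between the Redfish layer and the processor-specific implementation described in the flow.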
## Alternatives Considered
One alternative was to extend the existing D-Bus interface in the State.Host
namespace
(/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml)
with a new RequestedHostTransition value called Nmi. The D-Bus back-end could
then internally invoke the processor-specific target to raise the NMI and
perform the associated actions.

There were strong reasons to move away from this approach.
phosphor-state-manager has always been focused on the states of the BMC,
chassis, and host. An NMI is more of an action against the host than a state.
## Impacts
This implementation only changes system state when an NMI is initiated,
irrespective of the state the host OS is in, so it has minimal impact on the
rest of the system.
## Testing
Depending on the platform hardware design, this test requires a host OS kernel
module that creates a hard lockup/hang, after which one verifies that the NMI
triggers the expected data collection and recovery. One can also invoke an NMI
to get a crash dump and confirm via the console logs that the host received the
NMI.
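
The console-log check could be automated along the following lines. This Python sketch assumes the console output is available as lines of text and that the tester supplies the marker string the platform prints on receiving an NMI; both the `find_nmi_lines` helper and the idea of a single marker string are illustrative assumptions, since the actual message depends on the host OS and architecture.

```python
import re
from typing import Iterable, List


def find_nmi_lines(console_lines: Iterable[str], marker: str) -> List[str]:
    """Return the console lines that contain the platform's NMI marker text."""
    # The marker is matched literally and case-insensitively.
    pattern = re.compile(re.escape(marker), re.IGNORECASE)
    return [line.rstrip("\n") for line in console_lines if pattern.search(line)]
```

An empty result after triggering the NMI would indicate that the host did not log the interrupt, or that the marker does not match what the platform prints.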