# Memory preserving reboot and System Dump extraction flow on POWER Systems.

Author: Dhruvaraj S <dhruvaraj@in.ibm.com>

Created: 11/06/2019

## Problem Description

On POWER based servers, hypervisor firmware manages and allocates resources to
the logical partitions running on the server. If the hypervisor encounters an
error and cannot continue with management operations, the server needs to be
restarted. A typical server reboot erases the contents of main memory, including
the current running configuration of the logical partitions and the data
required for debugging the fault. Some hypervisors on POWER based systems don't
have access to non-volatile storage in which to save this content after a
failure. On these servers, a warm reboot that preserves main memory is needed to
create the memory dump required for debugging. This document explains the
high-level flow of the warm reboot and the extraction of the resulting dump from
hypervisor memory.

## Glossary

- **Boot**: The process of initializing hardware components in a computer system
  and loading the operating system.

- **Hostboot**: The firmware that runs on the host processors and performs all
  processor, bus, and memory initialization on POWER based servers.
  [read more](https://github.com/open-power/docs/blob/master/hostboot/HostBoot_PG.md)

- **Self Boot Engine (SBE)**: A microcontroller built into the host processors
  of POWER systems to assist in initializing the processor during the boot. It
  also acts as an entry point for several hardware access operations to the
  processor. [read more](https://sched.co/SPZP)

- **Master Processor**: The processor that is initialized first to execute the
  boot firmware.

- **POWER Hardware Abstraction Layer (PHAL)**: A software component on the BMC
  providing access to the POWER hardware.

- **Hypervisor**: A hypervisor (or virtual machine monitor, VMM) is computer
  software, firmware, or hardware that creates and runs virtual machines.
  [read more](https://en.wikipedia.org/wiki/Hypervisor)

- **System Dump**: A dump of main memory and hardware states for debugging
  faults in the hypervisor.

- **Memory Preserving Reboot (MPR)**: A method of rebooting that preserves the
  contents of volatile memory.

- **Terminate Immediate (TI)**: A condition in which the hypervisor has
  encountered a fatal error and cannot continue with normal operations.

- **Attention**: The signal generated by the hardware or the firmware for a
  specific event.

- **Redfish**: The Redfish standard is a suite of specifications that deliver an
  industry-standard protocol providing a RESTful interface for the management of
  servers, storage, networking, and converged infrastructure.
  [Read More](<https://en.wikipedia.org/wiki/Redfish_(specification)>)

- **OCC**: An On-Chip Controller (OCC) is a co-processor that is embedded
  directly on the die of POWER processors. The OCC can be used to control the
  processor frequency, power consumption, and temperature to maximize
  performance and minimize energy usage.

  [Read More](https://openpowerfoundation.org/on-chip-controller-occ/)

- **Checkstop**: A severe error inside a processor core that causes the
  processor core to stop all processing activities.

- **PNOR**: PNOR is a host NOR flash where the firmware is stored.

## Background and References

When a POWER based server encounters a fault and needs a restart, it alerts the
BMC to initiate a memory preserving reboot. The BMC starts the reboot by
informing the SBE on each of the processors. Each SBE stops the running cores,
collects the hardware states in a specified format, and stores them in host
memory. Once the data is collected, the SBE returns control to the BMC. The BMC
then initiates a memory preserving reboot. Once the system has finished booting,
the hypervisor collects the hardware data and memory contents to create a dump
file in host memory.

## Requirements

### Primary Requirements

- System dump should be collected irrespective of the availability of an
  external entity to offload it at the time of a failure.

- The system should provide a mechanism for the user to request a system dump.

- The server should boot back to runtime.

- The hypervisor should send a special attention to the BMC to notify it about a
  severe fault.

- BMC should receive the special TI attention from the hypervisor.

- BMC should change the host state to 'DiagnosticMode'.

- BMC should inform SBE to start the memory preserving reboot and collect the
  hardware data.

- The error log associated with the dump needs to be part of the dump package.

- A dump summary should be created with the size and other details of the dump.

- Once the dump is generated, the hypervisor should notify BMC.

- Hypervisor should offload the dump to BMC to transfer to an external client.

- Provide Redfish interfaces to manage the dump.

- A tool to collect the dump from the server.

- A method to parse the content of the dump.

## Proposed Design

### The flow

The following diagram shows the flow of the memory preserving reboot and system
dump offloading.



#### 1 - Server fault and notification to BMC

When there is a fault, the hypervisor generates an attention. The attention
listener on the BMC detects the attention. In the case of OpenPOWER based Linux
systems, an additional s0 interrupt is sent to the SBE to stop the cores
immediately.

#### 2 - Analyze the error data.

The attention listener on the BMC calls a chip-op to analyze the reason for the
attention.

#### 3 - Initiate System Dump

The attention listener on the BMC sets the Diagnostic target for reboot to
initiate a memory preserving reboot.

#### 4 - Initiate Memory preserve transition

The following steps are executed as part of the reboot target:

- Set the system state to DiagnosticMode
- Stop OCC
- Disable checkstop monitoring
- Issue enter_mpipl chip-op to each SBE

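The ordering of these steps can be sketched as follows. This is only an
illustrative model: the `FakeSbe` class, the
`enter_memory_preserving_transition` function, and the log strings are
hypothetical names invented for this sketch, not the actual BMC service or PHAL
interfaces. Only the step order and the `enter_mpipl` chip-op name come from the
list above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSbe:
    """Hypothetical stand-in for one SBE; records the chip-ops it receives."""

    processor: int
    chip_ops: list = field(default_factory=list)

    def issue(self, chip_op: str) -> None:
        self.chip_ops.append(chip_op)


def enter_memory_preserving_transition(sbes):
    """Run the step 4 actions in order and return a log of what was done."""
    log = [
        "set host state: DiagnosticMode",
        "stop OCC",
        "disable checkstop monitoring",
    ]
    # The enter_mpipl chip-op goes to every SBE, not only the master processor.
    for sbe in sbes:
        sbe.issue("enter_mpipl")
        log.append(f"enter_mpipl -> SBE on processor {sbe.processor}")
    return log


sbes = [FakeSbe(0), FakeSbe(1)]
for line in enter_memory_preserving_transition(sbes):
    print(line)
```

The point of the sketch is the ordering: OCC and checkstop monitoring are
quiesced before any SBE is asked to enter the memory preserving path.
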
#### 5 - SBE collects the hardware data

Each SBE collects the architected states and stores them in a pre-defined
location.

#### 6 - BMC starts warm boot

Once the SBE finishes the hardware collection, the BMC does the following to
boot the system while preserving the memory:

- Reset VPNOR
- Enable watchdog
- Enable checkstop monitoring
- Run istep proc_select_boot_master
- Run istep sbe_config_update
- Issue continue_mpipl chip-op instead of start_cbs on the master processor

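As a rough illustration of the last item, the choice between the two chip-ops
can be modelled as a single switch. The function names and the plain-string step
labels below are assumptions made for this sketch; only the step names and the
`continue_mpipl` / `start_cbs` chip-op names are taken from the list above.

```python
WARM_BOOT_STEPS = (
    "Reset VPNOR",
    "Enable watchdog",
    "Enable checkstop monitoring",
    "Run istep proc_select_boot_master",
    "Run istep sbe_config_update",
)


def final_chip_op(memory_preserving: bool) -> str:
    """Pick the chip-op issued on the master processor to start the boot."""
    return "continue_mpipl" if memory_preserving else "start_cbs"


def warm_boot_plan() -> list:
    """Return the step 6 actions in the order they are listed."""
    last = f"Issue {final_chip_op(True)} chip-op on the master processor"
    return [*WARM_BOOT_STEPS, last]


for number, step in enumerate(warm_boot_plan(), start=1):
    print(number, step)
```
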
#### 7 - Hostboot booting

Once the SBE is started, it starts hostboot. Hostboot copies the architected
states to the right location and moves the memory contents to create the dump.

#### 8 - Hypervisor formats dump and sends notification to BMC

Once the hypervisor is started, it formats the dump and sends a notification,
including the dump size, to the BMC through PLDM. PLDM calls the dump manager
interface to notify it of the new dump. The dump manager creates a D-Bus object
for the new dump, with the status set to not offloaded and the dump size. BMCWeb
catches the object creation signal and notifies the HMC.

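The notification path in this step follows an observer pattern, which the
minimal in-memory sketch below tries to capture. It does not use D-Bus or PLDM;
`DumpEntry`, `DumpManager`, `subscribe`, and `notify_new_dump` are hypothetical
names chosen for illustration, and the listener stands in for BMCWeb reacting to
the object creation signal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DumpEntry:
    dump_id: int             # id supplied by the hypervisor
    size: int                # dump size as reported in the notification
    offloaded: bool = False  # new dumps start as "not offloaded"


class DumpManager:
    """Hypothetical in-memory model of the BMC dump manager."""

    def __init__(self) -> None:
        self.entries: Dict[int, DumpEntry] = {}
        self._listeners: List[Callable[[DumpEntry], None]] = []

    def subscribe(self, callback: Callable[[DumpEntry], None]) -> None:
        """Register a listener, playing the role of BMCWeb in this sketch."""
        self._listeners.append(callback)

    def notify_new_dump(self, dump_id: int, size: int) -> DumpEntry:
        """Create an entry for a new host dump and signal every listener."""
        entry = DumpEntry(dump_id, size)
        self.entries[dump_id] = entry
        for callback in self._listeners:
            callback(entry)
        return entry


manager = DumpManager()
seen = []
manager.subscribe(seen.append)
manager.notify_new_dump(dump_id=1, size=4096)
print(seen)
```
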
#### 9 - HMC sends dump offload request

Once the HMC is ready to offload, it creates an NBD (Network Block Device)
server and sends a dump offload request to the BMC. BMCWeb creates an NBD client
and an NBD proxy to offload the dump. The BMC dump manager makes a PLDM call
with the dump id provided by the hypervisor and the NBD device id. PLDM sends
the offload request to the hypervisor with the dump id.

#### 10 - Hypervisor starts dump offload

The hypervisor starts sending the dump packets down through DMA. PLDM reads the
dump and writes it to the NBD client endpoint. The data reaches the NBD server
on the HMC and is written to a dump file.

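The read-and-forward loop in this step can be sketched as a chunked copy. The
example below is a simplification under stated assumptions: `relay_dump` is a
hypothetical helper, the two `io.BytesIO` objects stand in for the DMA source
and the NBD client endpoint, and the chunk size is arbitrary. It does not model
the real PLDM or NBD transports.

```python
import io


def relay_dump(source, sink, chunk_size=4096) -> int:
    """Copy a dump from source to sink in chunks; return the bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for the dump in host memory and the NBD client endpoint.
dump_in_host_memory = io.BytesIO(b"\xab" * 10_000)
nbd_endpoint = io.BytesIO()

copied = relay_dump(dump_in_host_memory, nbd_endpoint, chunk_size=4096)
print(copied)
```

Copying in bounded chunks keeps memory use on the relaying side independent of
the dump size, which matters when the dump is far larger than the memory
available to the relay.
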
#### 11 - Hypervisor sends down offload complete message

The hypervisor sends an offload complete message down to the BMC, and the BMC
forwards it to the HMC. The NBD endpoints are then cleared.

#### 12 - HMC verifies dump and sends dump DELETE to BMC.

The HMC verifies the dump and sends a dump delete request to the BMC. The BMC
sends the dump delete message to the hypervisor. The hypervisor deletes the dump
in host memory.

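Steps 8 through 12 imply a simple lifecycle for a host dump. One way to picture
it is as a small state machine, sketched below. Only the "not offloaded" status
is named in this document; the other state names, the `DumpState` enum, and the
`advance` helper are assumptions made for illustration. The sketch models only
the flow described here.

```python
from enum import Enum


class DumpState(Enum):
    NOT_OFFLOADED = "not offloaded"  # step 8: entry created on the BMC
    OFFLOADING = "offloading"        # steps 9 and 10: transfer in progress
    OFFLOADED = "offloaded"          # step 11: offload complete message
    DELETED = "deleted"              # step 12: removed from host memory


# Transitions allowed by the flow described in steps 8 through 12.
ALLOWED = {
    DumpState.NOT_OFFLOADED: {DumpState.OFFLOADING},
    DumpState.OFFLOADING: {DumpState.OFFLOADED},
    DumpState.OFFLOADED: {DumpState.DELETED},
    DumpState.DELETED: set(),
}


def advance(current: DumpState, new: DumpState) -> DumpState:
    """Return the new state, or raise ValueError if the flow forbids it."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new


state = DumpState.NOT_OFFLOADED
for target in (DumpState.OFFLOADING, DumpState.OFFLOADED, DumpState.DELETED):
    state = advance(state, target)
    print(state.value)
```
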
### Memory preserve reboot sequence.



### Dump offload sequence



## Alternatives Considered

An alternative is to offload the dump from the hypervisor directly to an
external dump collection application instead of offloading through the BMC.
Offloading through the BMC was selected for the following reasons:

- The BMC provides a common point for offloading all dumps.
- During prototyping, offloading through the BMC gave better performance.
- Offloading through the BMC has less development impact on the host.

## Impacts

- PLDM on BMC and Host - Extensions to the PLDM implementation to pass the type
  of dump, and the notification of a new dump file to the dump manager.
  [PLDM Design](https://github.com/openbmc/docs/blob/7c8847e95203ebfaeddb82a6e1b9888bc6a11311/designs/pldm-stack.md)

- Dump manager on BMC - The BMC dump manager supports dumps stored on the BMC
  and needs to be expanded to support host dumps.

- The external dump offloading application needs to support NBD based offload.

- Proposing a new Redfish schema for dump operations.
  [Redfish Dump Proposal](https://lists.ozlabs.org/pipermail/openbmc/2019-December/019827.html)

- BMCWeb needs to implement the new Redfish specification for dump.

- Add support to openpower-hw-diags to catch the special attention and initiate
  a memory preserving reboot.

- SBE needs to support a new operation to analyze the attention received from
  the host. The interface update is yet to be published.

## Testing

- Unit test plans
  - Test dump manager interfaces using busctl
  - Test reboot by setting the diag mode target
  - Test the SBE chip-op using standalone calls
  - Test PLDM by using hypervisor debug commands
  - Test BMCWeb interfaces using curl

- Integration testing by

  - User-initiated dump testing, which invokes a memory preserving reboot to
    collect the dump.
  - Initiating a memory preserving reboot by injecting a host error.
  - Offloading the dump collected in the host.

- System Dump test plan
  - Automated tests to initiate and offload a dump as part of the test bucket.
  - Both user-initiated and error injection should be attempted.