Thursday, March 10 • 10:30am - 11:00am
Out of Band Management for OCP Server PCIe add-on-cards and SSDs

Live Stream - http://www.youtube.com/watch?v=_alt6Vx0n0I

The I2C sideband management feature developed for PCIe add-on-cards (AOCs) by the OCP community can be a very useful tool for running datacenters more efficiently, and it offers potential for advanced server management of SSDs. This session will look at how active monitoring of high-performance PCIe SSDs by the BMC in OCP servers delivered significant operational cost savings and SSD endurance management for large-scale PCIe flash deployments.

The Add-on-Card Thermal Interface Spec for Intel Motherboard V3.0 defines an I2C/SMBus interface between SSDs/PCIe AOCs and the BMC in OCP servers. With this interface, the BMC can make decisions that keep the server within its most efficient operating range, lowering the cost of operation. One such use case is temperature monitoring of these devices by the BMC so that it can dynamically control fan speed to provide adequate cooling. The SSD/PCIe AOC presents itself as an emulated temperature sensor, so the BMC obtains a temperature reading as if it were reading directly from a temperature sensor.

SSDs draw power based on the type of workload presented to them, and this power draw can vary over a surprisingly large range. Heavy writes (the worst-case workload) draw the maximum power and make the SSDs very hot, whereas other workloads draw significantly less power. These workloads do not necessarily stress other parts of the server, so in the absence of temperature monitoring of the SSDs and dynamic fan control, the server has to assume the worst thermal profile and run its fans at maximum speed. SSD endurance is limited to a certain number of petabytes written, so it is unlikely that this worst-case workload will run for more than a small fraction of an SSD's life. Additionally, the temperature of the NAND on an SSD is an important input into the endurance that the SSD can deliver, particularly for lower-cost NAND options. This combination of operational cost savings and lower-cost NAND makes this a particularly valuable feature for the OCP community.

This mechanism of out-of-band management can be further extended to monitor other aspects of SSDs/PCIe AOCs. The session will cover the potential for more advanced monitoring, such as:
• Power: power consumption, power state transitions, etc.
• SSD life cycle metrics: percentage of drive life used, SMART data, throttling information, and FW slot information. Some of this monitoring is already covered by the NVMe standard.
• Error logs and crash logs to provide adequate debugging information for failures.
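
As a rough illustration of the temperature-driven fan control described above, the following Python sketch maps the hottest reported SSD/AOC temperature to a fan duty cycle. Everything in it is an assumption made for illustration: the thresholds, the linear fan curve, and the names `fan_duty_for` and `control_step` are hypothetical and are not taken from the OCP specification or from any BMC firmware. The sideband read itself is represented by a caller-supplied function rather than a real I2C/SMBus transaction.

```python
from typing import Callable, Iterable

# Hypothetical thresholds chosen for illustration only;
# they are not taken from the OCP specification.
MIN_DUTY = 30      # percent, fan floor when devices are cool
MAX_DUTY = 100     # percent, worst-case cooling
T_LOW_C = 40.0     # at or below this temperature, run at MIN_DUTY
T_HIGH_C = 70.0    # at or above this temperature, run at MAX_DUTY


def fan_duty_for(temp_c: float) -> int:
    """Linearly interpolate the fan duty cycle between the two thresholds."""
    if temp_c <= T_LOW_C:
        return MIN_DUTY
    if temp_c >= T_HIGH_C:
        return MAX_DUTY
    span = (temp_c - T_LOW_C) / (T_HIGH_C - T_LOW_C)
    return round(MIN_DUTY + span * (MAX_DUTY - MIN_DUTY))


def control_step(read_temps: Callable[[], Iterable[float]]) -> int:
    """One control iteration: poll all devices and cool for the hottest one.

    read_temps stands in for the sideband reads of the emulated
    temperature sensors (hypothetical; supplied by the caller).
    """
    temps = list(read_temps())
    if not temps:
        # No readings available: fall back to the worst-case profile.
        return MAX_DUTY
    return fan_duty_for(max(temps))


if __name__ == "__main__":
    print(control_step(lambda: [35.0, 38.0]))  # light workload -> 30
    print(control_step(lambda: [38.0, 55.0]))  # one SSD under heavy writes -> 65
```

Falling back to maximum fan speed when no reading is available mirrors the behavior the abstract describes for servers without SSD temperature monitoring, which must assume the worst thermal profile.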


