Open source PMCI stack implementation for add-in-card manageability.

Formal Metadata

Title
Open source PMCI stack implementation for add-in-card manageability.
Series title
Number of parts
637
Author
Contributors
License
CC Attribution 2.0 Belgium:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Year of publication
Language

Content Metadata

Subject area
Genre
Abstract
Disaggregated computing today relies increasingly on add-in cards such as FPGAs, SmartNICs, and xPUs. Traditionally, add-in cards have relied on IPMI-based manageability solutions. However, the newer standards from DMTF (the PMCI protocol stack) provide more robust and scalable solutions for add-in card manageability. SPDM provides attestation and a secure communication channel between the BMC and the add-in cards. The MCTP/PLDM stack provides mechanisms for the BMC to auto-discover a card’s capabilities and carry out manageability functions like sensor monitoring, event logging, and firmware updates. This is a significant advantage over IPMI, which lacked secure communications, had limited support for advanced high-speed interfaces like PCIe, and was limited in the number of sensors it could support. We plan to present how add-in cards can be managed through PMCI protocols and how we model the add-in cards’ manageability functions in a way that data center orchestration software can consume (i.e. Redfish modelling of add-in cards). The implementation is planned for OpenBMC, and a variety of add-in cards can be supported through a standard manageability model.
Transcript: English (automatically generated)
Hello everyone, this is Sumant Bhatt, and today we will talk about an open source PMCI stack implementation for add-in card manageability. Some introduction about myself: I am a
BMC firmware engineer at Intel. I have 5.5 years of experience in embedded software development. I started off with IoT products, did some work on network operating systems, and then moved to Intel. I've been with Intel for the last 2.5 years
now, and I'm mainly working on the OpenBMC project. Today we'll talk about the manageability aspects of add-in cards. I would presume the audience has some basic
knowledge of the BMC, satellite management controllers, add-in cards in general, out-of-band manageability, and some of the DMTF standards like PMCI and Redfish. I do want to touch upon what PMCI is and how it is going to be
used for add-in card manageability. But if you want more information, I think there is an OpenBMC wiki page which talks in detail about each of these topics. Today's agenda.
We'll start off with a platform architecture overview with the add-in card, that is, how the add-in card is connected in the platform. Then we'll talk about the existing mechanisms for add-in card manageability and touch upon their shortcomings. Then we'll see an overview
of the PMCI architecture and see how the PMCI set of protocols can be used for add-in card manageability. We'll also talk about external interfaces for data center management software
to consume this management data. We'll see how OpenBMC is planning to implement this PMCI set of protocols, and provide certain references for the audience to go and have a look at. So this is how a typical platform looks with an add-in card, right?
We have a host CPU, and in Intel platforms we also have a PCH in place. We have a baseboard management controller, and between the host and the BMC we have a host interface. The BMC sits on the
motherboard, and it talks to the add-in card over this I2C bus. When we come to the add-in card, it typically has an SoC which carries out the primary functionality of the add-in card, right? It could be an FPGA, a NIC, a GPU, anything, right?
And the add-in card usually has a satellite management controller. The satellite management controller is like a mini BMC: it has a subset of the functionality of a BMC, and it carries out certain management aspects of the add-in card. The add-in card could have sensors
like thermal sensors and voltage sensors, and the management controller could be controlling a fan on the add-in card based on the temperature variations, right? The add-in card management controller could also be exposing all this management information through I2C.
And it is not a hard and fast rule that there should be a management controller. There are add-in cards without a management controller altogether; they could expose all the sensors and so on directly on the I2C bus. Those kinds of mechanisms are also possible.
When we talk about add-in card manageability, what do we mean? There are typical management functions like inventory management. What we mean by inventory management is at data
center scale. There are a lot of cards which go into these systems, and to keep track of the inventory we usually need the manufacturer name, manufacturing date, serial number,
part number, and so on. This comes under inventory management. Next is thermal and power management. Of course, if the cards heat up, that could be detrimental to the server's performance, and you usually would want to limit the power consumed by the
card. So thermal and power management is an important manageability function. The next one is firmware updates. The card's SoC and the management controller on the card
both usually have firmware, and updating that firmware is an important manageability function. And then typically a data center management software manages all the things in
the data center. It takes care of health monitoring of all the systems in the data center. Usually the sensors and events from the servers are sent out to the data center management software, and the telemetry stream comprises all this kind of data.
This is an important manageability function, and it is achieved through the external link provided by the BMC. Now, talking about today's add-in card manageability landscape. First is inventory management. It is carried out using
the IPMI FRU specification. Usually there are two ways to do this. One is to have the EEPROM directly on the I2C bus. If the EEPROM is present on the I2C bus, it usually implements
the IPMI FRU format. The IPMI FRU format dictates how the inventory information should be organized in the EEPROM. It is a standard, and the BMC can read this data directly from the EEPROM on the I2C bus. The BMC can then expose this inventory information
to the data center management software. The second way to do inventory management is if the system management controller on the add-in card supports the IPMB protocol, right? Then the BMC could go ahead and send IPMI FRU commands,
collect the inventory information from the add-in card, and expose the same information through external interfaces. Coming to the next manageability function,
thermal and power monitoring. Again, we have two options here. One is sensors being directly connected on the I2C bus between the BMC and the add-in card. On the add-in card, the sensors could be placed in the right spots, right? What I mean by right spots is
that the thermal information could be captured from the sensors, and the BMC could directly go ahead and read these thermal sensors on the I2C bus. The voltage regulators (VRs) could also be exposed on the same I2C bus so that the BMC can read them directly. The second way to do this
is again through the IPMB protocol. The system management controller on the add-in card could have private buses on which these I2C sensors and VRs are connected.
The system management controller could aggregate all this data and provide the same information through IPMB. So the BMC could query this system management controller and, using IPMI commands, get the latest sensor data. This way thermal and power monitoring can
be achieved. The issue with this is that the BMC needs to have some sort of prior knowledge of the add-in card it's going to manage. In a data center environment, we have so
many vendors providing add-in cards, and they have different generations and different versions. So it becomes difficult to have the BMCs pre-configured with all this add-in card information. The next manageability function is firmware update. Typically, how
it is implemented today is that the system management controller supports IPMI firmware update commands. These are OEM extensions to the IPMI commands, and the
implementation varies across OEMs. The vendors usually provide a host-based utility tool to carry out these firmware updates. So usually cards from two different
vendors don't support the same way of updating the firmware. There is no standard for this firmware update function. Coming to external exposure, the BMC collects all this
telemetry data from the add-in card and exposes it externally through IPMI. Coming to security aspects, private buses were considered to be secure, so there is no
standard security protocol in place. But with more security awareness, hardware security is also prominent today, and having no security between two devices on the I2C bus
is considered to be vulnerable and is no longer accepted today. So these are the shortcomings of today's mechanisms. The major one is that today the IPMI
specification body is no longer functional. What this means is that the IPMI specification will not be able to cater to any new requirement in this manageability space. For example, we have higher-speed buses like I3C and PCIe, and sending IPMI messages over these high-speed
buses is not possible today. Also, the sensor count in IPMI is an 8-bit value. This means that only 255 sensors can be represented in IPMI. But a typical add-in card can easily have 40 sensors. Given the number of PCIe slots and the other sensors
in the platform, the upper threshold in IPMI is easily hit, and many system integrators are already hitting it. Of course, there are workarounds, but they are all OEM ways of handling these things. And having more and more OEM functions doesn't
solve the common problems for all vendors. So this is a major shortcoming. The second one is, as I discussed, firmware updates for the cards. These are all
vendor-specific, and there is a need for a standard specification to carry out firmware updates. The third one is the security-related aspects. There is no standard security mechanism. Of course, a platform vendor can come up with some sort of security mechanism,
but the card vendor and the platform vendor may not be in agreement on these kinds of things. So there is no standard security mechanism in place today. And as we will see,
the PMCI protocol stack makes an attempt to address this problem; we will see that in later slides. The next one is that there is no plug-and-play solution for add-in cards. BMCs have to be configured for the add-in cards they're going to manage. So this
becomes a challenge when you consider data center scale, with many vendors providing add-in cards and different generations of add-in cards. Coming to the PMCI architecture,
PMCI is Platform Management Components Intercommunication. What it essentially provides is a bunch of protocols which are spread out across multiple layers. On the right side of this slide, I show how the PLDM and MCTP protocols stack up in these
layers, right? What the PMCI architecture essentially does is disaggregate the physical layer from the manageability functions. This is achieved through
the MCTP protocol, the Management Component Transport Protocol. It abstracts the physical layer from the manageability functions. PLDM is one of the major PMCI protocols, where manageability functions like sensor monitoring, inventory data, firmware updates, all those things
are defined. The PLDM protocol is totally agnostic of what the underlying physical layer transport is. Assume that today an add-in card provides its manageability interface through
SMBus. If tomorrow the card wants to migrate to I3C, the PLDM protocol will not change. The only thing that changes is that MCTP will now run over the I3C physical layer; for the PLDM protocol, everything remains the same. The picture on the left side is
circulated by DMTF. This is DMTF's vision of manageability interfaces for future generations. The external interface would be Redfish. This Redfish interface
would be for things like data center management software or some sort of orchestration software, which carries out aggregated manageability in a data center. The thing at the center
will typically be a BMC. The BMC talks to various platform components like the host, other satellite controllers on the platform, NIC cards, or any other
devices on the platform. The management controller talks to the platform components and aggregates their manageability information so that it can be streamed to the data center software, and the PMCI protocols come into the picture here. This is where things
like PLDM and MCTP come into the picture. All these protocols facilitate communication between platform components. So this is how the PMCI set of protocols can be used for add-in card
manageability. This is a high-level picture. Assume this is a host platform. It would
have a CPU, and in Intel platforms we have a PCH. There's a PCIe link between the add-in card and the CPU. The BMC and the system management controller on the add-in card talk to each other using SMBus. The MCTP protocol is built on top of SMBus, and the
upper-layer protocols like SPDM and PLDM provide the manageability functions on top of MCTP. There are MCTP specifics here. The BMC usually acts as the MCTP bus owner and goes
ahead and assigns endpoint IDs to the system management controller. This is how the MCTP network is established between the BMC and the add-in card, and this facilitates further communication for manageability functions like SPDM and PLDM. SPDM is used
for the security aspects. There are two aspects here: attestation of the add-in card, and secure messages. Attestation means that the add-in card is authenticated and we make
sure of the identity of the card before we talk to it. The second one is secure messages: we make sure that we send only encrypted packets over this SMBus link. PLDM is another protocol in the PMCI stack, which provides manageability functions like FRU,
firmware updates, sensor monitoring, and so on. We'll talk in detail about PLDM in the coming slides. Coming to the specifications which cover add-in card
manageability: most of these manageability functions are provided by the PLDM specifications.
PLDM is the Platform Level Data Model. It models how two different platform components can talk to each other. Inventory information is provided through PLDM for FRU. This has its own format.
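
To make the "own format" point concrete, here is a minimal Python sketch of a PLDM-style FRU record: a short header followed by type/length/value fields. The header layout (16-bit record set ID, record type, field count, encoding type) and the field-type codes in `GENERAL_FIELD_NAMES` reflect one reading of the PLDM for FRU specification and should be treated as assumptions to verify; `build_record` and `parse_record` are hypothetical helper names, not functions from any library.

```python
import struct
from typing import Dict, List, Tuple

# Assumed field-type codes for the "General" FRU record type.
# Verify against the PLDM for FRU specification before relying on them.
GENERAL_FIELD_NAMES = {
    0x03: "PartNumber",
    0x04: "SerialNumber",
    0x05: "Manufacturer",
}


def build_record(record_set_id: int, record_type: int, encoding: int,
                 fields: List[Tuple[int, bytes]]) -> bytes:
    """Pack one record: 5-byte header, then (type, length, value) fields."""
    out = struct.pack("<HBBB", record_set_id, record_type, len(fields), encoding)
    for ftype, value in fields:
        out += struct.pack("<BB", ftype, len(value)) + value
    return out


def parse_record(data: bytes) -> Tuple[int, int, Dict[str, str]]:
    """Unpack one record into (record_set_id, record_type, {name: value})."""
    record_set_id, record_type, num_fields, _encoding = struct.unpack_from(
        "<HBBB", data, 0)
    offset = 5
    fields: Dict[str, str] = {}
    for _ in range(num_fields):
        ftype, length = struct.unpack_from("<BB", data, offset)
        offset += 2
        value = data[offset:offset + length]
        offset += length
        # Unknown field types are kept under a generic name instead of dropped.
        name = GENERAL_FIELD_NAMES.get(ftype, f"field_0x{ftype:02x}")
        fields[name] = value.decode("ascii", errors="replace")
    return record_set_id, record_type, fields
```

The sketch treats every value as ASCII for simplicity; a real parser would honour the encoding type byte and handle several records in one table.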
It is not compliant with the IPMI FRU format. PLDM for FRU dictates that the system management controller on the add-in card is a responder to PLDM FRU commands. So the BMC can send out PLDM FRU commands and get the inventory information. Coming to the sensors,
PLDM provides the Platform Monitoring and Control specification. Here things like PDRs, sensors, effecters, events, all those things are defined. The innovative thing
the Platform M&C specification provides is PDRs (Platform Descriptor Records). The add-in card has a PDR repository, which provides information about all the sensors and effecters
on the add-in card. When the BMC detects an add-in card, it can go ahead and download these PDRs from the add-in card itself. The BMC, being the primary PDR repository maintainer, can add all those downloaded PDRs into its main repository and then carry out
the sensor PDR parsing and come to know the sensors and effecters on the add-in card. This way, it becomes a plug-and-play solution, and the BMC need not have any prior knowledge of the add-in card. It can simply download the PDRs from the card and carry
out the monitoring functions. The BMC can also act as an event receiver, so the add-in cards can generate events and notify the BMC asynchronously about any events which happen on the add-in card.
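
The plug-and-play flow just described can be sketched as a small simulation. Everything here is hypothetical: `SensorPDR` is a drastically simplified stand-in for a real PDR, and `AddInCard.get_pdrs` stands in for the PLDM commands a real BMC would use to walk the card's repository. The point is only that the BMC learns about the sensors from the card itself, with no per-card configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SensorPDR:
    """Simplified stand-in for a numeric sensor PDR (hypothetical fields)."""
    sensor_id: int
    name: str
    unit: str


@dataclass
class AddInCard:
    """Simulated card that hands out its own PDRs on request."""
    eid: int
    pdrs: List[SensorPDR]

    def get_pdrs(self) -> List[SensorPDR]:
        return list(self.pdrs)


@dataclass
class BMC:
    """Holds the primary PDR repository, keyed by (endpoint ID, sensor ID)."""
    repo: Dict[Tuple[int, int], SensorPDR] = field(default_factory=dict)

    def on_card_detected(self, card: AddInCard) -> int:
        """Download the card's PDRs and merge them; return how many were added."""
        added = 0
        for pdr in card.get_pdrs():
            self.repo[(card.eid, pdr.sensor_id)] = pdr
            added += 1
        return added

    def sensors_for(self, eid: int) -> List[str]:
        """List the sensor names the BMC now knows for one endpoint."""
        return sorted(p.name for (e, _), p in self.repo.items() if e == eid)
```

Keying the repository by endpoint ID as well as sensor ID is a design choice of this sketch, so that two cards can reuse the same sensor IDs without colliding.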
Coming to firmware update, PLDM standardizes the way firmware updates can happen. There is a
separate specification for it, which standardizes the firmware update mechanism across vendors. Coming to security, the PMCI protocol stack has the SPDM protocol. There is one specification which provides standards for card attestation and another standard which
provides for encrypted communication between the BMC and the SMC. Whatever we have talked about until now is
using this PMCI protocol set to collect the management information from the add-in card; the BMC should then expose this to the data center management tool. Usually we go for Redfish, which is a standard secure interface for data center management tools to collect
manageability data. When Redfish clients have to collect this kind of data, there are actually two ways in which add-in card manageability data can be collected. One is PLDM for RDE (Redfish Device Enablement). It's a specification which dictates how the BMC can
act as a proxy between the Redfish client and the add-in card. What the BMC does is convert the HTTP/HTTPS payload of the Redfish request into PLDM for
RDE packets and send them to the RDE device, and whatever response the RDE device gives is sent back to the Redfish client. So the BMC acts like a bridge between the Redfish client and the device, and
this is a way to transparently collect add-in card manageability data. The other way is, if the PLDM device does not support the RDE specification and supports only the standard M&C, FRU, and firmware update specifications, then the BMC can model this
PLDM device, and the BMC itself can expose the Redfish interface so that the Redfish client can directly collect this manageability information.
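
The choice between the two Redfish paths can be sketched as a simple decision on the PLDM types a card reports. The numeric type codes in `PldmType` are given from memory of the PLDM base specification and should be checked against it; `redfish_strategy` and its return strings are hypothetical names used only for illustration.

```python
from enum import IntEnum
from typing import Set


class PldmType(IntEnum):
    """PLDM type codes (assumed values; verify against the PLDM base spec)."""
    PLATFORM_MC = 2
    FRU = 4
    FW_UPDATE = 5
    RDE = 6


def redfish_strategy(supported: Set[int]) -> str:
    """Pick how the BMC exposes a card over Redfish.

    "proxy":        the card speaks RDE, so the BMC bridges Redfish requests.
    "bmc-modelled": the card only offers M&C/FRU data, so the BMC builds
                    the Redfish resources itself.
    "unmanaged":    nothing usable was reported.
    """
    if PldmType.RDE in supported:
        return "proxy"
    if PldmType.PLATFORM_MC in supported or PldmType.FRU in supported:
        return "bmc-modelled"
    return "unmanaged"
```

For example, a card reporting types `{2, 4, 5}` would be modelled by the BMC, while one that also reports type `6` would be proxied.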
Coming to how OpenBMC plans to implement this PMCI protocol stack: at Intel, we have the Intel-BMC repository, where an open source implementation of this
stack is already under way. This is a work in progress. The protocol daemons are written in C++; however, the core functionality is implemented as C libraries. This is done in order to encourage adoption of these libraries both on the BMC side and in
device firmware. So this is a C library which even the device manufacturers can use. Coming to the architecture diagram: in OpenBMC, all the daemons are usually written in C++, and the MCTP daemon, being the transport service, exposes D-Bus APIs.
The upper-layer protocols can make use of these D-Bus APIs to send out their payloads. MCTP encapsulates them and sends them down the corresponding physical medium.
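
What "encapsulate" means here can be illustrated with a small packetizer. This is a sketch under stated assumptions, not the daemon's D-Bus API: it assumes the 4-byte MCTP transport header (version, destination EID, source EID, flags), a message-type byte at the start of the message, and the 64-byte baseline transmission unit. The SMBus binding header and its PEC byte are deliberately left out, and `packetize` is a hypothetical name.

```python
from typing import List

MCTP_HDR_VERSION = 0x01
BASELINE_MTU = 64      # baseline transmission unit: payload bytes per packet
MSG_TYPE_PLDM = 0x01   # MCTP message type carried in the first payload byte


def packetize(dest_eid: int, src_eid: int, msg_type: int, body: bytes,
              tag: int = 0, mtu: int = BASELINE_MTU) -> List[bytes]:
    """Split one upper-layer message into MCTP packets (transport header only)."""
    # The message-type byte leads the message; the integrity-check bit is 0.
    message = bytes([msg_type & 0x7F]) + body
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    packets = []
    for seq, chunk in enumerate(chunks):
        som = 0x80 if seq == 0 else 0                # start of message
        eom = 0x40 if seq == len(chunks) - 1 else 0  # end of message
        # Bits 5:4 packet sequence, bit 3 tag owner (set: request), bits 2:0 tag.
        flags = som | eom | ((seq & 0x3) << 4) | 0x08 | (tag & 0x7)
        header = bytes([MCTP_HDR_VERSION, dest_eid, src_eid, flags])
        packets.append(header + chunk)
    return packets
```

Because the upper layer only hands over a message type and a body, the same PLDM or SPDM payload could be sent over a different physical binding without changing the caller.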
For the SPDM daemon, we plan to use something called openspdm. This again is an open source C library, and it is meant to be used across different devices. In OpenBMC, the SPDM daemon
is a C++ library which is going to consume this openspdm library and send out the SPDM packets. Here is the link; you can go ahead and have a look at it. We plan to sync with the main OpenBMC community and have all this
Intel implementation working in community OpenBMC as well. In short, the industry is moving towards PLDM; both BMC and add-in card vendors are more and more moving towards it.
What we see is that there is going to be a period where both BMCs and add-in cards need to support the older IPMI-based protocols and the newer PLDM protocols as well.
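
During such a transition, a BMC might prefer the newer path when a card offers it and fall back to the legacy one otherwise. The sketch below is one hypothetical way to express that for inventory: it assumes PLDM type code 4 means FRU, and it validates a legacy EEPROM by checking the 8-byte IPMI FRU common header (format version 1 in the low nibble, all bytes summing to zero modulo 256). `inventory_path` and its return strings are illustrative names only.

```python
from typing import Optional, Set

PLDM_TYPE_FRU = 4  # assumed PLDM type code for FRU data


def ipmi_fru_header_ok(header: bytes) -> bool:
    """Check an IPMI FRU common header: 8 bytes, version 1, zero checksum."""
    return (len(header) == 8
            and (header[0] & 0x0F) == 0x01
            and sum(header) % 256 == 0)


def inventory_path(pldm_types: Set[int],
                   eeprom_header: Optional[bytes] = None) -> str:
    """Prefer PLDM for FRU; fall back to a valid IPMI FRU EEPROM."""
    if PLDM_TYPE_FRU in pldm_types:
        return "pldm-fru"
    if eeprom_header is not None and ipmi_fru_header_ok(eeprom_header):
        return "ipmi-fru-eeprom"
    return "none"
```

A card that exposes both, like the combination products mentioned next, would simply take the first branch.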
So there will be some period where both protocols co-exist, and then people might abandon IPMI altogether and move towards PLDM. There are already products available with such combinations. One example is an accelerator card from Xilinx: it supports both EEPROM-based
FRU and the PLDM protocol. These are references on existing
standards and also on the PMCI specifications. For the Intel PMCI implementation, you can head over and have a look at the implementation. I also would like to acknowledge Richard Tamayar's work on add-in card modeling using PLDM. There's a YouTube video where
Richard talks about how add-in cards can be modeled using PLDM. It's an interesting topic as well; just go ahead and have a look at that one. That is all from my side. Thank you.
Okay, I think we are online now. Yeah, I think at the end some slides were missed,
so I just pinged about the slide deck. I see certain questions about security, but at this level I'm not a security expert. A colleague of mine is going to talk much more about this in the SPDM talk, which is up next. But regarding PMCI for manageability of
add-in cards: originally what we need is a lightweight protocol, which can be... I see certain questions about security, but I'm not a security expert.
Apologies about that. So the original requirement is that we need simple embedded devices to be able to handle this security protocol, and I think SPDM is more suited towards that. But when we talk about MCTP, PLDM, and the rest of the PMCI protocol stack in
general, there is a plan for a kernel-based implementation, in the sense that
these actually look like network interfaces. There are discussions in the PMCI community to implement some sort of socket interface in the kernel for the upper-layer protocols
to be implemented on. Any other questions? We have a lot of time, so
please feel free to ask more questions about the open source PMCI stack implementation.
Here is a question: would you like to consider some simpler language, maybe, for the specific specs, so they get easier to follow? Yes, my apologies for that. I mean,
there are too many protocols in this PMCI space. Yes, I will consider that. Probably we should call them the inventory management spec, the sensor monitoring spec, the firmware update spec,
and the security specification, respectively. So the SPDM spec is for security and PLDM is for the other management aspects, right? And MCTP is like the transport protocol. I think we could agree on that. Thanks. I see someone is writing right now.
Thank you. Thank you, Olaf. So if there are no other questions,
thank you for your presentation, Sumant. We also have a private room for this talk, so feel free to ask more questions there. Sure, definitely. And it will be available
backstage. It's still not late night in India, so I'll be able to answer any more questions backstage. Thank you.