System Management¶
The qaic-util command-line utility enables developers to query:
- card and SoC(s) health
- firmware version
- compute/memory/IO resources available vs in-use
- card power and temperature
- status of certain device capabilities, such as ECC
The Cloud AI Platform SDK must be installed to use qaic-util.
A QID, also known as deviceID, is an integer identifier assigned to each AI 100 SoC present in the system. Note that certain SKUs may contain more than one AI 100 SoC per card.
qaic-util displays information in two formats:
- vertical format, where cards/SoCs are queried once and the parameters are listed one per line
- tabular format, where certain parameters (compute, IO, power, temperature, etc.) are shown in a table that is refreshed every 'n' seconds (user input)
The -d flag can be used to display information for a specific QID.
Developers can grep for keywords such as Status, Capabilities, Nsp, temperature, and power to get specific information from the cards/SoCs.
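The same keyword filtering can be done from a script. The sketch below is a minimal Python equivalent of `grep -e`, assuming the qaic-util report has already been captured as text. `filter_report` is a hypothetical helper name for illustration, not part of the SDK, and the sample text only illustrates the `QID <n>` / `Status:<value>` layout.

```python
def filter_report(report: str, keywords: list[str]) -> list[str]:
    """Return the lines of a qaic-util report that contain any keyword.

    Matching is case-sensitive, like `grep -e KEY1 -e KEY2`.
    """
    return [
        line.strip()
        for line in report.splitlines()
        if any(key in line for key in keywords)
    ]


# Illustrative report text in the "QID <n>" / "Status:<value>" layout.
sample = """QID 0
    Status:Ready
QID 1
    Status:Ready
"""

for line in filter_report(sample, ["Status", "QID"]):
    print(line)
```

Because matching is case-sensitive, keywords should be spelled exactly as they appear in the report.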
Health¶
The Status field indicates the health of the card.
Ready indicates the card is in good health. Error indicates the card is in an error condition or the user lacks permissions (use sudo).
sudo /opt/qti-aic/tools/qaic-util -q | grep -e Status -e QID
QID 0
Status:Ready
QID 1
Status:Ready
QID 2
Status:Ready
QID 3
Status:Ready
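For automated health checks, the output above can be turned into a QID-to-status map. The following is a minimal sketch, assuming each `Status:` line appears beneath the `QID <n>` line it belongs to, as in the output shown. `parse_status` and `unhealthy_qids` are hypothetical helper names, and the `Error` entry in the sample is made up for illustration.

```python
import re

QID_LINE = re.compile(r"^\s*QID\s+(\d+)\s*$")
STATUS_LINE = re.compile(r"^\s*Status\s*:\s*(\S+)")


def parse_status(report: str) -> dict[int, str]:
    """Map each QID to the Status value reported beneath it."""
    statuses: dict[int, str] = {}
    current_qid = None
    for line in report.splitlines():
        qid_match = QID_LINE.match(line)
        if qid_match:
            current_qid = int(qid_match.group(1))
            continue
        status_match = STATUS_LINE.match(line)
        # Ignore a Status line that is not preceded by a QID line.
        if status_match and current_qid is not None:
            statuses[current_qid] = status_match.group(1)
    return statuses


def unhealthy_qids(statuses: dict[int, str]) -> list[int]:
    """Return the QIDs whose status is anything other than Ready."""
    return sorted(qid for qid, status in statuses.items() if status != "Ready")


# Illustrative input only; QID 1 is set to Error to exercise the check.
sample = """QID 0
Status:Ready
QID 1
Status:Error
"""

statuses = parse_status(sample)
print(statuses)
print(unhealthy_qids(statuses))
```

On a real system, the report text could come from running the command above through `subprocess.run(..., capture_output=True, text=True)` and reading its `stdout`.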
The steps in Verify the function can be used to run a sample workload on the QIDs to ensure that the HW/SW is functioning correctly.
SoC Reset¶
Developers can reset the QIDs using the soc_reset sysfs node to recover SoCs that are in the Error condition. The steps to issue a soc_reset are as follows.
- Identify the MHI ID associated with the QID. In the sample output below, MHI ID:0 is associated with QID 0 and so on.
  Note: MHI and QID do not always map to the same integer. It is imperative for developers to identify the mapping first before issuing the soc_reset.
- Issue soc_reset to the MHI ID identified in step 1.
Verify the health/function of the SoCs/Cards after a soc_reset.
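The mapping step can be scripted so that the MHI ID is never guessed from the QID. The sketch below assumes that the qaic-util report contains an `MHI ID:<n>` line beneath each `QID <n>` line; that layout, the helper name `qid_to_mhi`, and the sample text are assumptions for illustration. It only builds the mapping and does not write to any sysfs node.

```python
import re

QID_LINE = re.compile(r"^\s*QID\s+(\d+)\s*$")
MHI_LINE = re.compile(r"^\s*MHI ID\s*:\s*(\d+)")


def qid_to_mhi(report: str) -> dict[int, int]:
    """Map each QID to the MHI ID listed beneath it in the report."""
    mapping: dict[int, int] = {}
    current_qid = None
    for line in report.splitlines():
        qid_match = QID_LINE.match(line)
        if qid_match:
            current_qid = int(qid_match.group(1))
            continue
        mhi_match = MHI_LINE.match(line)
        if mhi_match and current_qid is not None:
            mapping[current_qid] = int(mhi_match.group(1))
    return mapping


# Illustrative input only: the IDs are deliberately swapped to show that
# a QID and its MHI ID need not be the same integer.
sample = """QID 0
MHI ID:1
QID 1
MHI ID:0
"""

mapping = qid_to_mhi(sample)
print(mapping)
```

Looking up the target with `mapping[qid]` raises a `KeyError` for a QID that is not in the report, which is a safer failure than resetting the wrong SoC.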
Advanced System Management¶
For advanced system management details, refer to Cloud AI Card Management.
This document is shared with System Integrators and covers the following topics:
- Boot and firmware management
- Security - Secure boot enablement and attestation
- BMC integration
- Platform validation tools
- Platform error management
Python APIs¶
The Python APIs also provide the ability to monitor the health and resources of the cards/SoCs. Refer to the Util class.