System Management¶
qaic-util
command line utility enables developers to query
- card and SoC(s) health
- firmware version
- compute/memory/IO resources available vs in-use
- card power and temperature
- status of certain device capabilities like ECC etc
Cloud AI Platform SDK installation is required for qaic-util
usage.
QID
a.k.a deviceID
are indentifiers (integers) assigned to each AI 100 SoC present in the system. Note that certain SKUs may contain more than one AI 100 SoC per Card.
qaic-util
displays information in 2 formats:
- vertical format where cards/SoCs are queried once and the parameters are listed one per line.
- tabular format where certain parameters (compute, IO, power, temperature etc) are listed in a tabular format, refreshed every 'n' seconds (user input)
-d
flag can be used to display information for a specific QID
qaic-util
, provides --filter(-f) option along with -q and -t options, which can filter by certain device properties. Also to dump output to the .json file by using -j option.
Examples:
To display information for a specific Card
.
Card
in tabular format.
To dump output from the qaic-util, option -j can be used,
sudo /opt/qti-aic/tools/qaic-util -j <output-file-name>.json -f "Board serial==<BOARD_SERIAL_OF_CARD>"
Developers can grep
for keywords like Status
, Capabilities
, Nsp
, temperature
, power
to get specific information from the cards/SoCs.
Health¶
Status
field indicates the health of the card.
Ready
indicates card is in good health.Error
indicates card is in error condition or user lacks permissions (usesudo
).
sudo /opt/qti-aic/tools/qaic-util -q | grep -e Status -e QID
QID 0
Status:Ready
QID 1
Status:Ready
QID 2
Status:Ready
QID 3
Status:Ready
Verify the function steps can be used to run a sample workload on QIDs
to ensure HW/SW is funtioning correctly.
SoC Reset¶
Developers can reset the QIDs
using soc_reset
sysfs node to recover the SoCs if they are in Error
condition. These are the steps to issue a soc_reset
.
-
Identify the
In the sample output below,MHI ID
associated with theQID
MHI ID:0
is associated withQID 0
and so on.Note
MHI and QID do not always map to the same integer. It is imperative for developers to identify the mapping first before issuing the
soc_reset
-
Issue
soc_reset
to theMHI ID
identified in step 1.Verify the health/function of the SoCs/Cards after a
soc_reset
.
Advanced System Management¶
For advanced system management details, refer to Cloud AI Card Management
This document is shared with System Integrators and covers the following topics.
- Boot and firmware management
- Security - Secure boot enablement and attestation
- BMC integration
- Platform validation tools
- Platform error management
Python APIs¶
Python APIs also provide the abilty to monitor the health and resources of the cards/SoCs. Refer to Util class