System Management
The qaic-util command-line utility enables developers to query:
- card and SoC(s) health
- firmware version
- compute/memory/IO resources available vs. in use
- card power and temperature
- status of certain device capabilities such as ECC
The Cloud AI Platform SDK must be installed to use qaic-util.
QIDs (a.k.a. deviceIDs) are integer identifiers assigned to each AI 100 SoC present in the system. Note that certain SKUs may contain more than one AI 100 SoC per card.
qaic-util displays information in three formats:
- Detailed view, where cards/SoCs are queried once and the parameters are listed one per line.
- Tabular format, where certain parameters (compute, IO, power, temperature, etc.) are listed in a table that is refreshed every 'n' seconds (user input).
- Tree format, where certain parameters (PCIe BDF address, MHI ID, device node name, status) are listed in a tree structure organized by card. This view is useful for understanding the PCIe topology and visualizing multi-SoC cards such as Cloud AI 100 Ultra.
qaic-util also provides a --filter (-f) option that can be combined with the -q and -t options to filter by certain device properties. The output can also be dumped to a .json file using the -j option.
Examples:
To display information for a specific card.
To display information for a specific card in tabular format.
To dump the qaic-util output to a JSON file, use the -j option:
sudo /opt/qti-aic/tools/qaic-util -j <output-file-name>.json -f "Board serial==<BOARD_SERIAL_OF_CARD>"
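When the dump needs to be scripted, the command above can be assembled programmatically. The following is a minimal sketch rather than an official API: the helper name build_dump_command is hypothetical, and only the binary path, the -j and -f options, and the "Board serial==" filter syntax are taken from the example above.

```python
import shlex

# Path used by the examples in this document.
QAIC_UTIL = "/opt/qti-aic/tools/qaic-util"


def build_dump_command(output_file, board_serial, use_sudo=True):
    """Return the argv list for a filtered JSON dump (hypothetical helper)."""
    # Assumption: the document only shows .json output names, so enforce that here.
    if not output_file.endswith(".json"):
        raise ValueError("output file is expected to end in .json")
    argv = [QAIC_UTIL, "-j", output_file, "-f", f"Board serial=={board_serial}"]
    return ["sudo"] + argv if use_sudo else argv


if __name__ == "__main__":
    # Placeholder values; substitute a real file name and board serial.
    cmd = build_dump_command("card.json", "<BOARD_SERIAL_OF_CARD>")
    # Print a shell-quoted form for review before running it.
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a host where the SDK is installed. Keeping the filter as a single argv element avoids shell quoting problems with the space in "Board serial".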
Developers can grep for keywords like Status, Capabilities, Nsp, temperature, and power to get specific information from the cards/SoCs.
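The same keyword filtering can be done in a script once the qaic-util output has been captured as text. This is a minimal sketch that mimics grep with several -e patterns using plain, case-sensitive substring matching; the function name filter_lines and the sample text are illustrative assumptions, not real device output.

```python
def filter_lines(text, keywords):
    """Return the lines of text that contain any keyword (like grep -e kw1 -e kw2)."""
    return [
        line
        for line in text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


if __name__ == "__main__":
    # Illustrative sample only; real output contains many more fields.
    sample = "QID 0\nStatus:Ready\nSome other field:value\n"
    for line in filter_lines(sample, ["Status", "QID"]):
        print(line)
```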
Health
The Status field indicates the health of the card. Ready indicates that the card is in good health. Error indicates that the card is in an error condition or that the user lacks permissions (use sudo).
sudo /opt/qti-aic/tools/qaic-util -q | grep -e Status -e QID
QID 0
Status:Ready
QID 1
Status:Ready
QID 2
Status:Ready
QID 3
Status:Ready
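For automated health checks, the filtered output above can be parsed into a QID-to-status mapping. This is a minimal sketch assuming the output has exactly the "QID n" / "Status:value" line shape shown above; the function names parse_status and unhealthy_qids are hypothetical.

```python
import re


def parse_status(text):
    """Map each QID (int) to its Status string from 'QID n' / 'Status:x' lines."""
    statuses = {}
    current_qid = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"QID\s+(\d+)$", line)
        if match:
            current_qid = int(match.group(1))
        elif line.startswith("Status:") and current_qid is not None:
            statuses[current_qid] = line.split(":", 1)[1].strip()
    return statuses


def unhealthy_qids(statuses):
    """Return the sorted QIDs whose status is anything other than Ready."""
    return sorted(qid for qid, status in statuses.items() if status != "Ready")


if __name__ == "__main__":
    sample = "QID 0\nStatus:Ready\nQID 1\nStatus:Error\n"
    print(unhealthy_qids(parse_status(sample)))
```

A non-empty result does not by itself distinguish a faulty card from a permissions problem, since Error is reported in both cases.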
The "Verify the function" steps can be used to run a sample workload on the QIDs to ensure that the HW/SW is functioning correctly.
SoC Reset
Developers can reset the QIDs to recover the SoCs if they are in the Error condition. A specific soc_reset can be done using either the MHI ID or the pci address of the QID. There is also an option to reset all the QIDs. Below are the steps to issue a soc_reset.
- Reset using the MHI ID associated with the QID.
  - Identify the MHI ID associated with the QID. In the sample output below, MHI ID:0 is associated with QID 0 and so on.
    Note
    MHI and QID do not always map to the same integer. It is imperative for developers to identify the mapping first before issuing the soc_reset.
    Output example:
  - Issue soc_reset using the MHI ID associated with the QID.
- Reset using the pci address associated with the QID.
  - Find the pci address associated with the QID.
    Output example:
  - Issue soc_reset using the pci address associated with the QID.
    Output example:
- Reset all QIDs.
Verify the health/function of the SoCs/Cards after a soc_reset.
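Because MHI and QID do not always map to the same integer, a script can look up the mapping explicitly before any reset is issued. This is a minimal sketch assuming the captured qaic-util text contains "QID n" lines followed by "MHI ID:m" lines, as described in the steps above; the function names and the sample text are hypothetical, and the reset command itself is not shown.

```python
import re


def qid_to_mhi(text):
    """Build a {QID: MHI ID} mapping from 'QID n' and 'MHI ID:m' lines."""
    mapping = {}
    current_qid = None
    for raw in text.splitlines():
        line = raw.strip()
        qid_match = re.match(r"QID\s+(\d+)$", line)
        mhi_match = re.match(r"MHI ID\s*:\s*(\d+)$", line)
        if qid_match:
            current_qid = int(qid_match.group(1))
        elif mhi_match and current_qid is not None:
            mapping[current_qid] = int(mhi_match.group(1))
    return mapping


def mhi_id_for(text, qid):
    """Return the MHI ID for a QID, refusing to guess when it is not listed."""
    mapping = qid_to_mhi(text)
    if qid not in mapping:
        raise KeyError(f"no MHI ID found for QID {qid}")
    return mapping[qid]


if __name__ == "__main__":
    # Made-up sample in which the two IDs deliberately differ.
    sample = "QID 0\nMHI ID:1\nQID 1\nMHI ID:0\n"
    print(mhi_id_for(sample, 0))
```

Raising an error for an unknown QID, instead of falling back to the QID value, keeps a script from resetting the wrong SoC.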
Advanced System Management
For advanced system management details, refer to Cloud AI Card Management.
This document is shared with System Integrators and covers the following topics:
- Boot and firmware management
- Security - Secure boot enablement and attestation
- BMC integration
- Platform validation tools
- Platform error management
Python APIs
The Python APIs also provide the ability to monitor the health and resources of the cards/SoCs. Refer to the Util class.