System Management
The `qaic-util` command-line utility enables developers to query:
- card and SoC(s) health
- firmware version
- compute/memory/IO resources available vs. in use
- card power and temperature
- status of certain device capabilities such as ECC
Cloud AI Platform SDK installation is required to use `qaic-util`.
`QID`s (a.k.a. device IDs) are integer identifiers assigned to each AI 100 SoC present in the system. Note that certain SKUs may contain more than one AI 100 SoC per card.
`qaic-util` displays information in three formats:
- Detailed view, where cards/SoCs are queried once and the parameters are listed one per line.
sudo /opt/qti-aic/tools/qaic-util -q
sudo /opt/qti-aic/tools/qaic-util -q -d <QID#> #To display information for a specific `QID`
- Tabular format, where certain parameters (compute, IO, power, temperature, etc.) are listed in a table that is refreshed every `n` seconds (user input).
sudo /opt/qti-aic/tools/qaic-util -t 1
sudo /opt/qti-aic/tools/qaic-util -t 1 -d <QID#> #To display information for a specific `QID`
- Tree format, where certain parameters (PCIe BDF address, MHI ID, device node name, status) are listed in a tree structure organized by card. This view is useful for understanding the PCIe topology and visualizing multi-SoC cards like Cloud AI 100 Ultra.
sudo /opt/qti-aic/tools/qaic-util -r
sudo /opt/qti-aic/tools/qaic-util -r -v # View detailed PCIe topology
`qaic-util` also provides a `-filter` (`-f`) option, used along with the `-q` and `-t` options, which filters by certain device properties. The output can also be dumped to a .json file with the `-j` option.
Examples:
To display information for a specific card:
sudo /opt/qti-aic/tools/qaic-util -q -f "Board serial==<BOARD_SERIAL_OF_CARD>"
To display information for a specific card in tabular format:
sudo /opt/qti-aic/tools/qaic-util -t 1 -f "Board serial==<BOARD_SERIAL_OF_CARD>"
To dump the output of `qaic-util` to a JSON file, use the `-j` option:
sudo /opt/qti-aic/tools/qaic-util -j <output-file-name>.json -f "Board serial==<BOARD_SERIAL_OF_CARD>"
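When these invocations are scripted, it can help to assemble the argument list programmatically instead of concatenating shell strings. The following is a minimal sketch and not part of the SDK: the helper name `build_query_cmd` is hypothetical, and it only combines the flags shown above (`-q`, `-t`, `-j`, `-f`) without running anything.

```python
import shlex

QAIC_UTIL = "/opt/qti-aic/tools/qaic-util"  # path used throughout this page


def build_query_cmd(board_serial=None, refresh_secs=None, json_path=None):
    """Assemble a qaic-util argument list from the options shown above.

    Hypothetical helper for illustration; it does not execute the command.
    """
    cmd = ["sudo", QAIC_UTIL]
    if json_path is not None:
        cmd += ["-j", json_path]          # dump output to a .json file
    elif refresh_secs is not None:
        cmd += ["-t", str(refresh_secs)]  # tabular view, refreshed every n seconds
    else:
        cmd += ["-q"]                     # detailed one-shot query
    if board_serial is not None:
        # Same filter expression as in the examples above.
        cmd += ["-f", f"Board serial=={board_serial}"]
    return cmd


if __name__ == "__main__":
    # <BOARD_SERIAL_OF_CARD> is a placeholder, as in the examples above.
    cmd = build_query_cmd(board_serial="<BOARD_SERIAL_OF_CARD>", refresh_secs=1)
    print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

Keeping the command as a list avoids quoting problems with the space inside the `Board serial==...` filter expression when the list is later passed to something like `subprocess.run`.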
Developers can `grep` for keywords like `Status`, `Capabilities`, `Nsp`, `temperature`, and `power` to get specific information from the cards/SoCs.
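The same keyword filtering can be done in a script once the output has been captured as text. This is a small sketch under the assumption that the output is plain text with one field per line; the function name `filter_lines` and the sample text are illustrative only.

```python
def filter_lines(text, keywords):
    """Return the lines that contain any of the keywords.

    Matching is case-sensitive substring matching, similar to
    `grep -e KEY1 -e KEY2` with fixed strings.
    """
    return [
        line
        for line in text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


if __name__ == "__main__":
    # Illustrative text only; real qaic-util output contains many more fields.
    sample = "QID 0\nStatus:Ready\nSome other field:value\n"
    for line in filter_lines(sample, ["Status", "QID"]):
        print(line)
```

In practice the text could come from `subprocess.run(cmd, capture_output=True, text=True).stdout`, where `cmd` is the `qaic-util -q` invocation shown earlier.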
Health
The `Status` field indicates the health of the card. `Ready` indicates the card is in good health. `Error` indicates the card is in an error condition or the user lacks permissions (use `sudo`).
sudo /opt/qti-aic/tools/qaic-util -q | grep -e Status -e QID
QID 0
Status:Ready
QID 1
Status:Ready
QID 2
Status:Ready
QID 3
Status:Ready
The "Verify the function" steps can be used to run a sample workload on the `QID`s to ensure the HW/SW is functioning correctly.
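For automated health checks, the filtered output above can be parsed into a per-QID status map. The sketch below assumes the exact line shapes shown in the sample output (`QID <n>` followed by `Status:<value>`); the function names are hypothetical, and the `Error` entry in the demo is invented for illustration.

```python
def parse_status(text):
    """Map each QID to its Status value.

    Expects output shaped like `qaic-util -q | grep -e Status -e QID`:
    a 'QID <n>' line followed by a 'Status:<value>' line.
    """
    statuses = {}
    current_qid = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("QID"):
            parts = line.split()
            # Ignore lines that mention QID without a numeric ID.
            if len(parts) > 1 and parts[1].isdigit():
                current_qid = int(parts[1])
            else:
                current_qid = None
        elif line.startswith("Status:") and current_qid is not None:
            statuses[current_qid] = line.split(":", 1)[1].strip()
    return statuses


def unhealthy_qids(statuses):
    """Return the QIDs whose status is anything other than Ready."""
    return sorted(qid for qid, status in statuses.items() if status != "Ready")


if __name__ == "__main__":
    # Invented sample: QID 1 is shown in Error purely for illustration.
    sample = "QID 0\nStatus:Ready\nQID 1\nStatus:Error\n"
    print(unhealthy_qids(parse_status(sample)))
```

A non-empty result only says that a QID is not `Ready`; as noted above, that can also mean the query was run without sufficient permissions.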
SoC Reset
Developers can reset the `QID`s to recover the SoCs if they are in the `Error` condition. A specific `soc_reset` can be issued using either the `MHI ID` or the `pci address` of the `QID`. There is also an option to reset all the `QID`s. Below are the steps to issue a `soc_reset`.
Reset using the `MHI ID` associated with the `QID`.
- Identify the `MHI ID` associated with the `QID`.
sudo /opt/qti-aic/tools/qaic-util -q | grep -e MHI -e QID
In the sample output below, `MHI ID:0` is associated with `QID 0`, and so on.
Note: MHI ID and QID do not always map to the same integer. Developers must identify the mapping before issuing the `soc_reset`.
Output example:
```
QID 0
MHI ID:0
QID 1
MHI ID:1
QID 2
MHI ID:2
QID 3
MHI ID:3
```
- Issue `soc_reset` using the `MHI ID` associated with the `QID`.
```
sudo su
echo 1 > /sys/bus/mhi/devices/mhi/soc_reset #MHI ID is 0,1,2…
```
Reset using the `pci address` associated with the `QID`.
- Find the `pci address` associated with the `QID`.
sudo /opt/qti-aic/tools/qaic-util -q -d 1 | grep -iw "pci address"
Output example:
PCI Address:0000:2d:00.0
- Issue `soc_reset` using the `pci address` associated with the `QID`.
sudo /opt/qti-aic/tools/qaic-util -s -p 0000:2d:00.0
Output example:
Resetting 0000:2d:00.0: 0000:2d:00.0 success
Reset all QIDs.
sudo /opt/qti-aic/tools/qaic-util -s
Verify the health/function of the SoCs/cards after a `soc_reset`.
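Because the MHI ID and QID do not always match, a script that automates resets should derive the mapping from the tool output rather than assume it. The following sketch parses the two output shapes shown in this section and builds (but does not run) the reset-by-PCI-address command. The function names are hypothetical, and the parsers assume the line formats match the output examples above exactly.

```python
import re

QAIC_UTIL = "/opt/qti-aic/tools/qaic-util"  # path used throughout this page


def parse_qid_to_mhi(text):
    """Parse 'QID <n>' / 'MHI ID:<m>' line pairs into {qid: mhi_id}."""
    mapping = {}
    current_qid = None
    for raw in text.splitlines():
        line = raw.strip()
        qid_match = re.fullmatch(r"QID\s+(\d+)", line)
        mhi_match = re.fullmatch(r"MHI ID:\s*(\d+)", line)
        if qid_match:
            current_qid = int(qid_match.group(1))
        elif mhi_match and current_qid is not None:
            mapping[current_qid] = int(mhi_match.group(1))
    return mapping


def parse_pci_address(text):
    """Return the first 'PCI Address:' value (domain:bus:device.function), or None."""
    match = re.search(
        r"PCI Address:\s*([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])",
        text,
    )
    return match.group(1) if match else None


def reset_cmd_for_pci(pci_address):
    """Build the reset-by-PCI-address command shown above; nothing is executed."""
    return ["sudo", QAIC_UTIL, "-s", "-p", pci_address]


if __name__ == "__main__":
    # Invented mapping in which QID 1 is backed by MHI ID 3, to show a mismatch.
    print(parse_qid_to_mhi("QID 0\nMHI ID:0\nQID 1\nMHI ID:3\n"))
    address = parse_pci_address("PCI Address:0000:2d:00.0")
    print(reset_cmd_for_pci(address))
```

Returning the command as a list leaves the decision to actually reset a device with the caller, which is usually preferable for a destructive operation.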
LLM Network Mode
Qmonitor can be used to enable or disable LLM Network Mode on the Ultra SKU. Enabling `MODE_LLM` helps LLM networks perform better, whereas `MODE_NON_LLM` benefits non-LLM networks. By default, `MODE_LLM` is enabled.
Below are example Qmonitor commands.
Disable the mode:
sudo /opt/qti-aic/tools/qaic-monitor-json -i ./setDisable.json
Here, `setDisable.json` is an input file:
{
  "request": [
    {
      "qid": 0,
      "power": {
        "set_network_mode_request": {
          "network_mode": "MODE_NON_LLM"
        }
      }
    }
  ]
}
Output response:
{
  "response": [
    {
      "qid": 0,
      "power": {
        "set_network_mode_response": {
          "status": "SUCCESS",
          "mode_status": "NETWORK_MODE_RESPONSE_SUCCESS"
        }
      }
    }
  ]
}
Check current mode:
sudo /opt/qti-aic/tools/qaic-monitor-json -i ./getReq.json
Here, `getReq.json` is an input file:
{
  "request": [
    {
      "qid": 0,
      "power": {
        "get_network_mode_request": {}
      }
    }
  ]
}
Output response:
{
  "response": [
    {
      "qid": 0,
      "power": {
        "get_network_mode_response": {
          "status": "SUCCESS",
          "network_mode": "MODE_NON_LLM",
          "mode_status": "NETWORK_MODE_RESPONSE_SUCCESS"
        }
      }
    }
  ]
}
Enable the mode:
sudo /opt/qti-aic/tools/qaic-monitor-json -i ./setEnableReq.json
Here, `setEnableReq.json` is an input file:
{
  "request": [
    {
      "qid": 0,
      "power": {
        "set_network_mode_request": {
          "network_mode": "MODE_LLM"
        }
      }
    }
  ]
}
Output response:
{
  "response": [
    {
      "qid": 0,
      "power": {
        "set_network_mode_response": {
          "status": "SUCCESS",
          "mode_status": "NETWORK_MODE_RESPONSE_SUCCESS"
        }
      }
    }
  ]
}
For more details on Qmonitor, refer to the Qmonitor documentation.
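The three request files above differ only in the `power` payload, so they can be generated rather than written by hand. The sketch below builds the request dictionaries and checks a response for `"status": "SUCCESS"`. It assumes only the JSON shapes shown in this section; the function names are hypothetical, and the mode check is limited to the two values named on this page.

```python
import json


def network_mode_request(qid, mode=None):
    """Build a request: set the mode when `mode` is given, otherwise query it."""
    if mode is None:
        power = {"get_network_mode_request": {}}
    else:
        # Only the two modes named on this page are accepted here.
        if mode not in ("MODE_LLM", "MODE_NON_LLM"):
            raise ValueError(f"unknown network mode: {mode}")
        power = {"set_network_mode_request": {"network_mode": mode}}
    return {"request": [{"qid": qid, "power": power}]}


def response_ok(response):
    """True if the response has at least one power entry and all report SUCCESS."""
    results = [
        body
        for entry in response.get("response", [])
        for body in entry.get("power", {}).values()
    ]
    return bool(results) and all(body.get("status") == "SUCCESS" for body in results)


if __name__ == "__main__":
    # Prints a payload equivalent to setDisable.json above.
    print(json.dumps(network_mode_request(0, "MODE_NON_LLM"), indent=2))
```

The printed JSON can be saved to a file and passed to `qaic-monitor-json` with `-i`, as in the commands above.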
Advanced System Management
For advanced system management details, refer to Cloud AI Card Management. This document is shared with System Integrators and covers the following topics:
- Boot and firmware management
- Security: secure boot enablement and attestation
- BMC integration
- Platform validation tools
- Platform error management
Python APIs
The Python APIs also provide the ability to monitor the health and resources of the cards/SoCs. Refer to the Util class.