Skip to content

System Management

qaic-util command line utility enables developers to query

  • card and SoC(s) health
  • firmware version
  • compute/memory/IO resources available vs in-use
  • card power and temperature
  • status of certain device capabilities like ECC etc

Cloud AI Platform SDK installation is required for qaic-util usage.

QID a.k.a deviceID are indentifiers (integers) assigned to each AI 100 SoC present in the system. Note that certain SKUs may contain more than one AI 100 SoC per Card.

qaic-util displays information in 3 formats:

  • detailed view where cards/SoCs are queried once and the parameters are listed one per line.
    sudo /opt/qti-aic/tools/qaic-util -q 
    sudo /opt/qti-aic/tools/qaic-util -q  -d <QID#> #To display information for a specific `QID`
    
  • tabular format where certain parameters (compute, IO, power, temperature etc) are listed in a tabular format, refreshed every 'n' seconds (user input)

    sudo /opt/qti-aic/tools/qaic-util -t 1 
    sudo /opt/qti-aic/tools/qaic-util -t 1 -d <QID#> #To display information for a specific `QID`
    

  • tree format where certain parameters (PCIe BDF address, MHI ID, device node name, status) are listed in a tree structure organized by card. This view is useful for understanding the PCIe topology and visualizing multi-soc cards like Cloud AI 100 Ultra.

    sudo /opt/qti-aic/tools/qaic-util -r
    sudo /opt/qti-aic/tools/qaic-util -r -v # View detailed PCIe topology
    

qaic-util, provides --filter(-f) option along with -q and -t options, which can filter by certain device properties. Also to dump output to the .json file by using -j option.

Examples:
To display information for a specific Card.

sudo /opt/qti-aic/tools/qaic-util -q -f "Board serial==<BOARD_SERIAL_OF_CARD>"
To display information for a specific Card in tabular format.
sudo /opt/qti-aic/tools/qaic-util -t 1 -f "Board serial==<BOARD_SERIAL_OF_CARD>"
To dump output from the qaic-util, option -j can be used,
sudo /opt/qti-aic/tools/qaic-util -j <output-file-name>.json -f "Board serial==<BOARD_SERIAL_OF_CARD>"

Developers can grep for keywords like Status, Capabilities, Nsp, temperature, power to get specific information from the cards/SoCs.

Health

Status field indicates the health of the card.

  • Ready indicates card is in good health.
  • Error indicates card is in error condition or user lacks permissions (use sudo).
sudo /opt/qti-aic/tools/qaic-util -q | grep -e Status -e QID
QID 0
        Status:Ready
QID 1
        Status:Ready
QID 2
        Status:Ready
QID 3
        Status:Ready

Verify the function steps can be used to run a sample workload on QIDs to ensure HW/SW is funtioning correctly.

SoC Reset

Developers can reset the QIDs to recover the SoCs if they are in Error condition. The specific soc_reset can be done using either MHI ID or pci address of the QID. Also, there is an option to reset all the QIDs. Below are the steps to issue a soc_reset.

  1. Reset using the MHI ID associated with the QID.

    • Identify the MHI ID associated with the QID
      sudo /opt/qti-aic/tools/qaic-util -q | grep -e MHI -e QID
      
      In the sample output below, MHI ID:0 is associated with QID 0 and so on.
    Note

    MHI and QID do not always map to the same integer. It is imperative for developers to identify the mapping first before issuing the soc_reset

    Output example:

    QID 0
            MHI ID:0
    QID 1
            MHI ID:1
    QID 2
            MHI ID:2
    QID 3
            MHI ID:3
    

    • Issue soc_reset using the MHI ID associated with the QID.
      sudo su 
      echo 1 > /sys/bus/mhi/devices/mhi<MHI ID>/soc_reset  #MHI ID is 0,1,2...  
      
  2. Reset using the pci address associated with the QID.

    • Find the pci address associated with the QID.

      sudo /opt/qti-aic/tools/qaic-util -q -d 1 | grep -iw "pci address"
      
      Output example:
      PCI Address:0000:2d:00.0
      

    • Issue soc_reset using the pci address associated with the QID.

      sudo /opt/qti-aic/tools/qaic-util -s -p 0000:2d:00.0
      
      Output example:
      Resetting 0000:2d:00.0:
      0000:2d:00.0 success
      

  3. Reset all QIDs.

    sudo /opt/qti-aic/tools/qaic-util -s
    

    Verify the health/function of the SoCs/Cards after a soc_reset.

Advanced System Management

For advanced system management details, refer to Cloud AI Card Management

This document is shared with System Integrators and covers the following topics.

  • Boot and firmware management
  • Security - Secure boot enablement and attestation
  • BMC integration
  • Platform validation tools
  • Platform error management

Python APIs

Python APIs also provide the abilty to monitor the health and resources of the cards/SoCs. Refer to Util class