Starting AICM

Install Dependencies

AICM files are located in:

/opt/qti-aic/tools/aic-manager

AICM requires several Python packages to run. We recommend installing the dependencies in a virtual environment:

python -m venv <path-to-the-virtual-environment>
source <path-to-the-virtual-environment>/bin/activate
pip install -r requirements.txt

Start Agent

After activating the virtual environment, start AICM by running python aicm_agent.py. The script accepts the following options:

usage: aicm_agent.py [-h] [-c CONFIG_FILE] [--ip IP] [--port PORT] [-u USERS] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--qmonitor-ip QMONITOR_IP] [--qmonitor-port QMONITOR_PORT] [--log LOG]
                     [--max-log-size MAX_LOG_SIZE] [-v] [--dump-default-config DUMP_DEFAULT_CONFIG] [--dump-default-users DUMP_DEFAULT_USERS] [--collection KEY=VALUE [KEY=VALUE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config_file CONFIG_FILE
                        Config file path
  --ip IP               IP address to bind to
  --port PORT           Port number to listen on
  -u USERS, --users USERS
                        Full path to users credentials file (.yaml)
  --ssl-key SSL_KEY, --ssl_key SSL_KEY
                        Path to the SSL key file. Required for HTTPS
  --ssl-cert SSL_CERT, --ssl_cert SSL_CERT
                        Path to the SSL certificate file. Required for HTTPS
  --qmonitor-ip QMONITOR_IP, --qmonitor_ip QMONITOR_IP
                        IP of QMonitor GRPC Server
  --qmonitor-port QMONITOR_PORT, --qmonitor_port QMONITOR_PORT
                        Port of QMonitor GRPC Server
  --log LOG             Path where to store logs
  --max-log-size MAX_LOG_SIZE, --max_log_size MAX_LOG_SIZE
                        Max sizes of logs in bytes
  -v, --verbose         Increase output verbosity
  --collection KEY=VALUE [KEY=VALUE ...]
                        Set a number of key-value pairs (do not put spaces before or after the = sign).

Dump default files:
  Dumps the default files and then exit.

  --dump-default-config DUMP_DEFAULT_CONFIG, --dump_default_config DUMP_DEFAULT_CONFIG
                        Dump the default config to the specified folder and exit
  --dump-default-users DUMP_DEFAULT_USERS, --dump_default_users DUMP_DEFAULT_USERS
                        Dump the default users file to the specified folder and exit

These settings can also be supplied in a configuration file, which is passed with the --config_file option. When a setting appears both in the configuration file and on the command line, the command-line value takes priority. An example configuration file:

# AICM Configuration

# IP to bind AICM
ip = 127.0.0.1

# Port to bind AICM
port = 9000

# Full path to users credentials file
# users =

# SSL Key path
# ssl_key =

# SSL Certificate path
# ssl_cert =

# IP of QMonitor GRPC Server
qmonitor_ip = localhost

# Port of QMonitor GRPC Server
qmonitor_port = 62472

# Path to directory where to store logs
# log =

# Max sizes of logs in bytes
max_log_size = 100000000

# Verbosity of AICM agent
verbose = 0

# KEY=VALUE for defining collection intervals in milliseconds for each collection
collection = [HEALTH=1000, DDR_BW=1000, PCI=1000, RAS_ECC_ERROR=10000]
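The precedence rule described above (command line over configuration file) can be illustrated with a short Python sketch. This is not AICM's own parser, which is not shown in this document. The helper names parse_config_text, parse_collection and effective_settings are hypothetical, and only the --ip and --port options are modeled.

```python
import argparse


def parse_config_text(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so the collection list stays intact.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def parse_collection(value):
    """Turn '[HEALTH=1000, PCI=1000]' into {'HEALTH': 1000, 'PCI': 1000}."""
    intervals = {}
    for pair in value.strip("[] ").split(","):
        pair = pair.strip()
        if not pair:
            continue
        name, _, millis = pair.partition("=")
        intervals[name] = int(millis)  # interval in milliseconds
    return intervals


def effective_settings(config_text, argv):
    """Merge file settings with CLI arguments; CLI values win when present."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip")
    parser.add_argument("--port", type=int)
    args = parser.parse_args(argv)

    settings = parse_config_text(config_text)
    for key, value in vars(args).items():
        if value is not None:  # option was given on the command line
            settings[key] = value
    if "port" in settings:
        settings["port"] = int(settings["port"])
    return settings
```

With a file that sets port = 9000, calling effective_settings(text, ["--port", "8443"]) yields port 8443, while an empty argument list keeps 9000.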

AICM can also be run as a service:

sudo bash scripts/start_aicm_agent.sh

Once the agent is running, the APIs can be explored and tested through the Swagger UI at <ip>:<port>/docs.

Stop Agent

If the agent was started with scripts/start_aicm_agent.sh, stop it with the following command:

sudo bash scripts/stop_aicm_agent.sh

Alternatively, if AICM was started directly with python aicm_agent.py, press Ctrl+C to stop it.

Setup Basic Auth

AICM uses HTTP Basic Auth as its authentication method, so every request to the API must carry valid credentials.
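As an illustration of what authenticating a request involves on the client side, the following Python sketch builds a request with a Basic Auth header using only the standard library. The function name build_authenticated_request is hypothetical. The address 127.0.0.1:9000 is taken from the example configuration, and /docs is used only because it is the one path named in this document; whether that path itself requires credentials is not stated here.

```python
import base64
import urllib.request


def build_authenticated_request(base_url, path, username, password):
    """Return a Request carrying an HTTP Basic Auth header."""
    # Basic Auth is "username:password", base64-encoded (encoded, not encrypted).
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + path)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials; replace with an entry from the users file.
request = build_authenticated_request(
    "http://127.0.0.1:9000", "/docs", "admin", "<password>"
)
# urllib.request.urlopen(request) would send it; omitted here so the
# sketch runs without a live agent.
```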

The accepted credentials are stored in the .users.yaml file.

You can add/modify the credentials using the following syntax:

credentials:
  - username: admin
    hash: $2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6
    note: "Example of credential"

These credentials are required in every request made to the HTTP REST endpoints. For security purposes, the password is hashed using bcrypt. A script that generates the hash is provided at scripts/hash_password.py. Replace <password> and run this command:

python ./scripts/hash_password.py <password>
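Before pasting a hash into the users file, it can help to check that it is a well-formed bcrypt string. The sketch below does this with the standard library only, assuming the script's output has the same shape as the example hash above (a version tag, a two-digit cost, a 22-character salt and a 31-character digest). The helper names describe_bcrypt_hash and credential_entry are hypothetical and are not part of AICM.

```python
import re

# $<version>$<cost>$<22-char salt><31-char digest>, bcrypt base64 alphabet.
BCRYPT_PATTERN = re.compile(
    r"^\$(2[abxy]?)\$(\d{2})\$([./A-Za-z0-9]{22})([./A-Za-z0-9]{31})$"
)


def describe_bcrypt_hash(value):
    """Split a bcrypt hash into its parts, or raise ValueError if malformed."""
    match = BCRYPT_PATTERN.match(value)
    if match is None:
        raise ValueError("not a well-formed bcrypt hash")
    version, cost, salt, digest = match.groups()
    return {"version": version, "cost": int(cost), "salt": salt, "digest": digest}


def credential_entry(username, hashed, note):
    """Render one list item in the layout of the credentials example."""
    describe_bcrypt_hash(hashed)  # refuse plain-text passwords or truncated hashes
    return (
        f"  - username: {username}\n"
        f"    hash: {hashed}\n"
        f'    note: "{note}"\n'
    )
```

Passing the example hash returns version "2b" and cost 12, while a plain-text password raises ValueError instead of being written to the file.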

HTTPS

Basic Auth is only a simple authentication mechanism; on its own it does not encrypt the credentials sent with each request. For added security, running AICM over HTTPS is recommended, which requires providing a certificate and key at startup. This can be done by passing the following arguments:

--ssl-key SSL_KEY    Path to the SSL key file. Needs to be provided for
                     HTTPS
--ssl-cert SSL_CERT  Path to the SSL certificate file. Needs to be provided
                     for HTTPS
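A launcher script might assemble these arguments as in the Python sketch below. The function name agent_command and the file paths in the usage note are hypothetical. The check that the key and certificate are supplied together is an assumption of this sketch, based on both being described as needed for HTTPS; the agent's own behavior when only one is given is not documented here.

```python
import sys


def agent_command(ip, port, ssl_key=None, ssl_cert=None):
    """Build the argument list for starting aicm_agent.py, optionally with HTTPS."""
    # Assumption: HTTPS needs both files, so reject a lone key or certificate.
    if (ssl_key is None) != (ssl_cert is None):
        raise ValueError("--ssl-key and --ssl-cert must be given together")
    command = [sys.executable, "aicm_agent.py", "--ip", ip, "--port", str(port)]
    if ssl_key is not None:
        command += ["--ssl-key", ssl_key, "--ssl-cert", ssl_cert]
    return command
```

For example, agent_command("127.0.0.1", 9000, "certs/server.key", "certs/server.crt") returns a list that can be passed to subprocess.run from the aic-manager directory.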

Metrics

AICM currently serves the metrics listed below (also stored in docs/metrics.csv). In addition to a set of core metrics, AICM provides Reliability, Availability, and Serviceability (RAS) error statuses.

Model                             Field Name                          Description
--------------------------------  ----------------------------------  -----------
HealthDataModel                   dev_status                          Status of the device
HealthDataModel                   mhi_id                              MHI ID
HealthDataModel                   pci_address                         PCI Address of the device
HealthDataModel                   pci_info                            PCI Info
HealthDataModel                   max_link_speed                      Max Link Speed
HealthDataModel                   max_link_width                      Max Link Width
HealthDataModel                   current_link_speed                  Current Link Speed
HealthDataModel                   current_link_width                  Current Link Width
HealthDataModel                   dev_link                            Dev Link Name
HealthDataModel                   hw_version                          Hardware version
HealthDataModel                   hw_serial_string                    HW Serial Number
HealthDataModel                   fw_version                          Firmware version
HealthDataModel                   fw_qc_image_version                 Qualcomm firmware identification string
HealthDataModel                   fw_oem_image_version                OEM custom firmware identification string
HealthDataModel                   fw_image_variant                    Firmware image variant, e.g. debug, release
HealthDataModel                   device_capabilities                 Device Firmware Features
HealthDataModel                   current_boot_interface              Boot Interface
HealthDataModel                   nsp_version                         NSP version
HealthDataModel                   nsp_qc_image_version                NSP Image string
HealthDataModel                   nsp_oem_image_version               Image string provided by OEM
HealthDataModel                   nsp_image_variant                   NSP image variant, e.g. debug, release
HealthDataModel                   dram_total_kb                       Total RAM in system in KB
HealthDataModel                   dram_free_kb                        Amount of RAM free in KB
HealthDataModel                   dram_fragmentation_percentage       Percentage of DRAM fragmentation
HealthDataModel                   vc_total                            Total number of virtual channels on the system
HealthDataModel                   vc_free                             Number of available virtual channels
HealthDataModel                   pc_total                            Total number of Physical Channels
HealthDataModel                   pc_reserved                         Number of reserved Physical Channels
HealthDataModel                   nsp_total                           Number of neural processors on the system
HealthDataModel                   nsp_free                            Number of available neural processors
HealthDataModel                   dram_bw_KBps                        DRAM bandwidth in Kbytes/second, averaged over the last ~100 ms
HealthDataModel                   mcid_total                          Total number of multicast IDs available on the system
HealthDataModel                   mcid_free                           Number of available multicast IDs
HealthDataModel                   semaphore_total                     Total number of semaphores available on the system
HealthDataModel                   semaphore_free                      Number of available semaphores
HealthDataModel                   num_constant_loaded                 Number of constants loaded; each load of constants increments the count by 1
HealthDataModel                   num_constant_in_use                 Number of loaded constants that are actively used by networks running on the system
HealthDataModel                   num_networks_loaded                 Number of neural networks loaded in memory on the system
HealthDataModel                   num_networks_active                 Number of neural networks currently actively computing on the system
HealthDataModel                   neural_processor_frequency_Mhz      Nominal operating frequency of the neural processors; all processors have the same max clock
HealthDataModel                   ddr_frequency_Mhz                   Nominal operating frequency of DDR memory
HealthDataModel                   compute_noc_frequency_Mhz           Nominal operating frequency of compute network on chip
HealthDataModel                   memory_noc_frequency_Mhz            Nominal operating frequency of memory network on chip
HealthDataModel                   system_noc_frequency_Mhz            Nominal operating frequency of system network on chip
HealthDataModel                   metadata_version                    Metadata version
HealthDataModel                   nnc_protocol_version                NNC protocol version
HealthDataModel                   sbl_image                           SBL image string
HealthDataModel                   pvs_image_version                   PVS image version
HealthDataModel                   nsp_defective_pg_mask               Defective NSP mask
HealthDataModel                   num_retired_ddr_pages               Number of retired DDR pages
HealthDataModel                   need_reset_to_retire_pages          Reset required to retire pending pages
HealthDataModel                   board_serial                        Board serial
HealthDataModel                   soc_temparature_degree_C            SOC temperature in degrees Celsius
HealthDataModel                   board_power_watts                   Board power in Watts
HealthDataModel                   tdp_cap_watts                       Thermal Design Power cap in Watts
HealthDataModel                   sku_type                            SKU Type
HealthDataModel                   complex_id                          Complex ID
HealthDataModel                   soc_power_watts                     SOC Power in Watts
HealthDataModel                   soc_tdp_cap_watts                   SOC Thermal Design Power cap in Watts
PciDataModel                      byte_count_rx                       Bytes received on PCIe
PciDataModel                      byte_count_tx                       Bytes sent on PCIe
DdrBwDataModel                    byte_count_total                    Sum of the individual NSP byte counts
DdrBwDataModel_NspDdrBwDataModel  byte_count                          DDR Byte Count for this NSP
RasErrorsDataModel                ras_ddr_correctable_error_count     Count of Correctable Errors received from ras_ddr
RasErrorsDataModel                ras_ddr_uncorrectable_error_count   Count of Uncorrectable Errors received from ras_ddr
RasErrorsDataModel                ras_mcw_correctable_error_count     Count of Correctable Errors received from ras_mcw
RasErrorsDataModel                ras_mcw_uncorrectable_error_count   Count of Uncorrectable Errors received from ras_mcw
RasErrorsDataModel                ras_imem_correctable_error_count    Count of Correctable Errors received from ras_imem
RasErrorsDataModel                ras_imem_uncorrectable_error_count  Count of Uncorrectable Errors received from ras_imem
RasErrorsDataModel                ras_nsp_correctable_error_count     Count of Correctable Errors received from ras_nsp
RasErrorsDataModel                ras_nsp_uncorrectable_error_count   Count of Uncorrectable Errors received from ras_nsp
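
As a sketch of how a client might combine these fields, the Python below derives DRAM utilization, NSP usage and a total uncorrectable error count. It assumes each model's data is already available as a flat dict keyed by the field names in the table; the actual response shape of the API is not described in this document, and the function names are hypothetical.

```python
def summarize_health(health):
    """Derive utilization figures from HealthDataModel fields."""
    total_kb = health["dram_total_kb"]
    free_kb = health["dram_free_kb"]
    return {
        # Share of DRAM currently in use, as a percentage.
        "dram_used_percent": 100.0 * (total_kb - free_kb) / total_kb,
        # Neural processors that are not reported as available.
        "nsp_in_use": health["nsp_total"] - health["nsp_free"],
    }


def total_uncorrectable_errors(ras):
    """Sum every *_uncorrectable_error_count field of RasErrorsDataModel."""
    return sum(
        count
        for name, count in ras.items()
        if name.endswith("_uncorrectable_error_count")
    )
```

A monitoring script could, for example, raise an alert whenever total_uncorrectable_errors returns a non-zero value.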