User guide
Starting AICM¶
Install Dependencies¶
AICM files are located in:
AICM requires several Python packages to run.
We recommend installing the dependencies in a virtual environment:
python -m venv <path-to-the-virtual-environment>
source <path-to-the-virtual-environment>/bin/activate
pip install -r requirements.txt
Start Agent¶
After activating the virtual environment, AICM can be started using the following command:
usage: aicm_agent.py [-h] [-c CONFIG_FILE] [--ip IP] [--port PORT] [-u USERS] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--qmonitor-ip QMONITOR_IP] [--qmonitor-port QMONITOR_PORT] [--log LOG]
                     [--max-log-size MAX_LOG_SIZE] [-v] [--dump-default-config DUMP_DEFAULT_CONFIG] [--dump-default-users DUMP_DEFAULT_USERS] [--collection KEY=VALUE [KEY=VALUE ...]]
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config_file CONFIG_FILE
                        Config file path
  --ip IP               IP address to bind to
  --port PORT           Port number to listen on
  -u USERS, --users USERS
                        Full path to users credentials file (.yaml)
  --ssl-key SSL_KEY, --ssl_key SSL_KEY
                        Path to the SSL key file. Required for HTTPS
  --ssl-cert SSL_CERT, --ssl_cert SSL_CERT
                        Path to the SSL certificate file. Required for HTTPS
  --qmonitor-ip QMONITOR_IP, --qmonitor_ip QMONITOR_IP
                        IP of QMonitor GRPC Server
  --qmonitor-port QMONITOR_PORT, --qmonitor_port QMONITOR_PORT
                        Port of QMonitor GRPC Server
  --log LOG             Path where to store logs
  --max-log-size MAX_LOG_SIZE, --max_log_size MAX_LOG_SIZE
                        Max sizes of logs in bytes
  -v, --verbose         Increase output verbosity
  --collection KEY=VALUE [KEY=VALUE ...]
                        Set a number of key-value pairs (do not put spaces before or after the = sign).
Dump default files:
  Dumps the default files and then exit.
  --dump-default-config DUMP_DEFAULT_CONFIG, --dump_default_config DUMP_DEFAULT_CONFIG
                        Dump the default config to the specified folder and exit
  --dump-default-users DUMP_DEFAULT_USERS, --dump_default_users DUMP_DEFAULT_USERS
                        Dump the default users file to the specified folder and exit
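The `--collection` option takes one or more `KEY=VALUE` tokens. Spaces around the `=` sign are not allowed because the shell would split `HEALTH = 1000` into three separate arguments. The following is a minimal sketch of how such tokens could be turned into a dictionary of intervals; the parser and the `parse_collection` helper are hypothetical stand-ins, not the actual code in `aicm_agent.py`.

```python
import argparse


def parse_collection(pairs):
    """Turn ["HEALTH=1000", "PCI=500"] into {"HEALTH": 1000, "PCI": 500}."""
    intervals = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        intervals[key] = int(value)  # collection interval in milliseconds
    return intervals


# Hypothetical stand-in parser; the real aicm_agent.py parser may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--collection", nargs="+", metavar="KEY=VALUE", default=[])

args = parser.parse_args(["--collection", "HEALTH=1000", "PCI=500"])
print(parse_collection(args.collection))
```

A token without an `=` sign, such as `HEALTH`, raises `ValueError` in this sketch rather than being silently ignored.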
These command-line settings can also be supplied via a configuration file, passed with the --config_file option.
If a setting is present both on the command line and in the configuration file, the command-line value takes priority.
Example configuration file:
# AICM Configuration
# IP to bind AICM
ip = 127.0.0.1
# Port to bind AICM
port = 9000
# Full path to users credentials file
# users =
# SSL Key path
# ssl_key =
# SSL Certificate path
# ssl_cert =
# IP of QMonitor GRPC Server
qmonitor_ip = localhost
# Port of QMonitor GRPC Server
qmonitor_port = 62472
# Path to directory where to store logs
# log =
# Max sizes of logs in bytes
max_log_size = 100000000
# Verbosity of AICM agent
verbose = 0
# KEY=VALUE for defining collection intervals in milliseconds for each collection
collection = [HEALTH=1000, DDR_BW=1000, PCI=1000, RAS_ECC_ERROR=10000]
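The precedence between the configuration file and the command line can be illustrated with a short sketch. It assumes the file has already been read into a dictionary (`file_settings` is a hypothetical subset) and that options not given on the command line default to `None`; how AICM actually merges the two sources is not documented here.

```python
import argparse

# Values as they might be read from a configuration file (hypothetical subset).
file_settings = {"ip": "127.0.0.1", "port": 9000, "verbose": 0}


def merge_settings(file_settings, cli_args):
    """Overlay command-line values on file values; the CLI wins when both are set."""
    merged = dict(file_settings)
    for key, value in vars(cli_args).items():
        if value is not None:  # None means the option was not given
            merged[key] = value
    return merged


# Hypothetical stand-in parser with two of the documented options.
parser = argparse.ArgumentParser()
parser.add_argument("--ip", default=None)
parser.add_argument("--port", type=int, default=None)

cli_args = parser.parse_args(["--port", "9443"])
settings = merge_settings(file_settings, cli_args)
print(settings["ip"], settings["port"])
```

Here `ip` keeps the value from the file, while `port` is replaced by the value given on the command line.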
AICM can also be run in a service-like manner:
Once running, the APIs can be tested through the Swagger UI at <ip>:<port>/docs.
Stop Agent¶
If the agent was started with scripts/start_aicm_agent.sh, stop it with the following command:
If AICM was started directly with the python aicm_agent.py command, press Ctrl+C to stop it.
Setup Basic Auth¶
AICM uses Basic Auth as its authentication method, so users must authenticate every request to the API.
The accepted credentials are stored in the .users.yaml file.
You can add or modify credentials using the following syntax:
credentials:
  - username: admin
    hash: $2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6
    note: "Example of credential"
The hash field is a bcrypt hash of the password.
A script for generating the hash is provided at /scripts/hash_password.py.
Replace <password> and run this command:
HTTPS¶
Basic Auth is a simple authentication mechanism and does not encrypt credentials on its own. For added security, running AICM over HTTPS is recommended, which requires providing a certificate and key at startup by passing the following arguments:
  --ssl-key SSL_KEY    Path to the SSL key file. Needs to be provided for
                       HTTPS
  --ssl-cert SSL_CERT  Path to the SSL certificate file. Needs to be provided
                       for HTTPS
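On the client side, a request to an agent running with HTTPS and Basic Auth could be prepared as in the sketch below. It uses only the Python standard library. The `build_request` helper, the host, the port, and the credentials are illustrative assumptions, and `/docs` is used only because it is the one path mentioned in this guide.

```python
import base64
import urllib.request


def build_request(host, port, username, password, path="/docs", https=True):
    """Build a urllib request that carries an HTTP Basic Auth header."""
    scheme = "https" if https else "http"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{scheme}://{host}:{port}{path}")
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_request("127.0.0.1", 9000, "admin", "example-password")
print(request.full_url)

# To send it against a running agent (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

If the agent uses a self-signed certificate, `urllib.request.urlopen` will reject it by default; in that case an `ssl.SSLContext` that trusts the certificate would need to be passed via the `context` argument.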
Metrics¶
In addition to a set of core metrics, AICM provides Reliability, Availability, and Serviceability (RAS) error statuses.
The metrics currently served by AICM (also stored in docs/metrics.csv) are:
| Model | Field Name | Description | 
|---|---|---|
| HealthDataModel | dev_status | Status of the device | 
| HealthDataModel | mhi_id | MHI ID | 
| HealthDataModel | pci_address | PCI Address of the device | 
| HealthDataModel | pci_info | PCI Info | 
| HealthDataModel | max_link_speed | Max Link Speed | 
| HealthDataModel | max_link_width | Max Link Width | 
| HealthDataModel | current_link_speed | Current Link Speed | 
| HealthDataModel | current_link_width | Current Link Width | 
| HealthDataModel | dev_link | Dev Link Name | 
| HealthDataModel | hw_version | Hardware version | 
| HealthDataModel | hw_serial_string | HW Serial Number | 
| HealthDataModel | fw_version | Firmware version | 
| HealthDataModel | fw_qc_image_version | Qualcomm firmware identification string | 
| HealthDataModel | fw_oem_image_version | OEM custom firmware identification string | 
| HealthDataModel | fw_image_variant | Firmware image variant, e.g. debug, release, etc | 
| HealthDataModel | device_capabilities | Device Firmware Features | 
| HealthDataModel | current_boot_interface | Boot Interface | 
| HealthDataModel | nsp_version | NSP version | 
| HealthDataModel | nsp_qc_image_version | NSP Image string | 
| HealthDataModel | nsp_oem_image_version | Image string provided by OEM | 
| HealthDataModel | nsp_image_variant | NSP image variant, e.g. debug, release | 
| HealthDataModel | dram_total_kb | Total RAM in system in KB | 
| HealthDataModel | dram_free_kb | Amount of RAM free in KB | 
| HealthDataModel | dram_fragmentation_percentage | Percentage of DRAM fragmentation | 
| HealthDataModel | vc_total | Total number of virtual channels on the system | 
| HealthDataModel | vc_free | Number of available virtual channels | 
| HealthDataModel | pc_total | Total number of Physical Channels | 
| HealthDataModel | pc_reserved | Number of reserved Physical Channels | 
| HealthDataModel | nsp_total | Number of neural processors on the system | 
| HealthDataModel | nsp_free | Number of available neural processors | 
| HealthDataModel | dram_bw_KBps | DRAM bandwidth in Kbytes/second, averaged over last ~100 ms | 
| HealthDataModel | mcid_total | Total number of multicast IDs available on the system | 
| HealthDataModel | mcid_free | Number of available multicast IDs | 
| HealthDataModel | semaphore_total | Total number of semaphores available on the system | 
| HealthDataModel | semaphore_free | Number of available semaphores | 
| HealthDataModel | num_constant_loaded | Number of constants loaded, each load of constants increments by 1 | 
| HealthDataModel | num_constant_in_use | Number of loaded constants that are actively used by networks running on the system | 
| HealthDataModel | num_networks_loaded | Number of neural networks loaded in memory on the system | 
| HealthDataModel | num_networks_active | Number of neural networks currently actively computing on the system | 
| HealthDataModel | neural_processor_frequency_Mhz | Nominal operating frequency of the neural processors; all processors have the same maximum clock | 
| HealthDataModel | ddr_frequency_Mhz | Nominal operating frequency of DDR memory | 
| HealthDataModel | compute_noc_frequency_Mhz | Nominal operating frequency of compute network on chip | 
| HealthDataModel | memory_noc_frequency_Mhz | Nominal operating frequency of memory network on chip | 
| HealthDataModel | system_noc_frequency_Mhz | Nominal operating frequency of system network on chip | 
| HealthDataModel | metadata_version | Metadata version | 
| HealthDataModel | nnc_protocol_version | NNC protocol version | 
| HealthDataModel | sbl_image | SBL image string | 
| HealthDataModel | pvs_image_version | PVS image version | 
| HealthDataModel | nsp_defective_pg_mask | Defective NSP mask | 
| HealthDataModel | num_retired_ddr_pages | Number of retired DDR pages | 
| HealthDataModel | need_reset_to_retire_pages | Reset required to retire pending pages | 
| HealthDataModel | board_serial | Board serial | 
| HealthDataModel | soc_temparature_degree_C | SoC temperature in degrees Celsius | 
| HealthDataModel | board_power_watts | Board power in Watts | 
| HealthDataModel | tdp_cap_watts | Thermal Design Power cap in Watts | 
| HealthDataModel | sku_type | SKU Type | 
| HealthDataModel | complex_id | Complex ID | 
| HealthDataModel | soc_power_watts | SOC Power in Watts | 
| HealthDataModel | soc_tdp_cap_watts | SOC Thermal Design Power cap in Watts | 
| PciDataModel | byte_count_rx | Bytes received over PCIe | 
| PciDataModel | byte_count_tx | Bytes sent over PCIe | 
| DdrBwDataModel | byte_count_total | Sum of the NSP individual byte count | 
| DdrBwDataModel_NspDdrBwDataModel | byte_count | DDR Byte Count for this NSP | 
| RasErrorsDataModel | ras_ddr_correctable_error_count | Count of Correctable Errors received from ras_ddr | 
| RasErrorsDataModel | ras_ddr_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_ddr | 
| RasErrorsDataModel | ras_mcw_correctable_error_count | Count of Correctable Errors received from ras_mcw | 
| RasErrorsDataModel | ras_mcw_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_mcw | 
| RasErrorsDataModel | ras_imem_correctable_error_count | Count of Correctable Errors received from ras_imem | 
| RasErrorsDataModel | ras_imem_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_imem | 
| RasErrorsDataModel | ras_nsp_correctable_error_count | Count of Correctable Errors received from ras_nsp | 
| RasErrorsDataModel | ras_nsp_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_nsp |
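
The metric list can also be processed programmatically. The following sketch groups field names by model using Python's `csv` module. It assumes that docs/metrics.csv uses the same three column headers as the table above (`Model`, `Field Name`, `Description`), which is not confirmed here; an inline sample stands in for the real file.

```python
import csv
import io
from collections import defaultdict

# Inline sample with the assumed column headers; rows copied from the table above.
sample = """Model,Field Name,Description
HealthDataModel,dev_status,Status of the device
PciDataModel,byte_count_rx,Bytes received on PCIE
PciDataModel,byte_count_tx,Bytes sent on PCIE
"""


def fields_by_model(csv_file):
    """Group metric field names by the model they belong to."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Model"]].append(row["Field Name"])
    return dict(grouped)


grouped = fields_by_model(io.StringIO(sample))
print(grouped)
```

To run it against the real file, replace `io.StringIO(sample)` with `open("docs/metrics.csv", newline="")`, adjusting the column names if they differ.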