User guide
Starting AICM
Install Dependencies
AICM files are located in:
AICM requires several Python packages to run.
We recommend installing the dependencies in a virtual environment:
python -m venv <path-to-the-virtual-environment>
source <path-to-the-virtual-environment>/bin/activate
pip install -r requirements.txt
Start Agent
After activating the virtual environment, AICM can be started using the following command:
usage: aicm_agent.py [-h] [-c CONFIG_FILE] [--ip IP] [--port PORT] [-u USERS] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--qmonitor-ip QMONITOR_IP] [--qmonitor-port QMONITOR_PORT] [--log LOG]
[--max-log-size MAX_LOG_SIZE] [-v] [--dump-default-config DUMP_DEFAULT_CONFIG] [--dump-default-users DUMP_DEFAULT_USERS] [--collection KEY=VALUE [KEY=VALUE ...]]
optional arguments:
-h, --help show this help message and exit
-c CONFIG_FILE, --config_file CONFIG_FILE
Config file path
--ip IP IP address to bind to
--port PORT Port number to listen on
-u USERS, --users USERS
Full path to users credentials file (.yaml)
--ssl-key SSL_KEY, --ssl_key SSL_KEY
Path to the SSL key file. Required for HTTPS
--ssl-cert SSL_CERT, --ssl_cert SSL_CERT
Path to the SSL certificate file. Required for HTTPS
--qmonitor-ip QMONITOR_IP, --qmonitor_ip QMONITOR_IP
IP of QMonitor GRPC Server
--qmonitor-port QMONITOR_PORT, --qmonitor_port QMONITOR_PORT
Port of QMonitor GRPC Server
--log LOG Path where to store logs
--max-log-size MAX_LOG_SIZE, --max_log_size MAX_LOG_SIZE
Max sizes of logs in bytes
-v, --verbose Increase output verbosity
--collection KEY=VALUE [KEY=VALUE ...]
Set a number of key-value pairs (do not put spaces before or after the = sign).
Dump default files:
Dumps the default files and then exit.
--dump-default-config DUMP_DEFAULT_CONFIG, --dump_default_config DUMP_DEFAULT_CONFIG
Dump the default config to the specified folder and exit
--dump-default-users DUMP_DEFAULT_USERS, --dump_default_users DUMP_DEFAULT_USERS
Dump the default users file to the specified folder and exit
These settings can also be supplied via a configuration file, passed with the --config_file option.
If a setting appears both in the configuration file and on the command line, the command-line value takes priority.
# AICM Configuration
# IP to bind AICM
ip = 127.0.0.1
# Port to bind AICM
port = 9000
# Full path to users credentials file
# users =
# SSL Key path
# ssl_key =
# SSL Certificate path
# ssl_cert =
# IP of QMonitor GRPC Server
qmonitor_ip = localhost
# Port of QMonitor GRPC Server
qmonitor_port = 62472
# Path to directory where to store logs
# log =
# Max sizes of logs in bytes
max_log_size = 100000000
# Verbosity of AICM agent
verbose = 0
# KEY=VALUE for defining collection intervals in milliseconds for each collection
collection = [HEALTH=1000, DDR_BW=1000, PCI=1000, RAS_ECC_ERROR=10000]
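The precedence rule and the KEY=VALUE collection syntax can be illustrated with a small Python sketch. This is not AICM's actual parsing code: the helper names `parse_collection` and `merge_settings` are hypothetical, and the sketch assumes that an option left off the command line is represented as `None`.

```python
from typing import Dict, Iterable, Optional


def parse_collection(pairs: Iterable[str]) -> Dict[str, int]:
    """Turn ["HEALTH=1000", ...] into {"HEALTH": 1000, ...} (intervals in ms)."""
    intervals = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        # Reject entries with no "=" or with spaces around the "=" sign.
        if not sep or not key or key != key.strip() or value != value.strip():
            raise ValueError(f"expected KEY=VALUE without spaces, got {pair!r}")
        intervals[key] = int(value)
    return intervals


def merge_settings(
    config_file: Dict[str, object], cli: Dict[str, Optional[object]]
) -> Dict[str, object]:
    """Start from config-file values; options given on the command line win."""
    merged = dict(config_file)
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


# Hypothetical inputs: values from the sample config above, plus a CLI
# invocation that only overrides the port.
file_settings = {"ip": "127.0.0.1", "port": 9000, "verbose": 0}
cli_settings = {"ip": None, "port": 9100, "verbose": None}
settings = merge_settings(file_settings, cli_settings)
intervals = parse_collection(
    ["HEALTH=1000", "DDR_BW=1000", "PCI=1000", "RAS_ECC_ERROR=10000"]
)
print(settings)
print(intervals)
```

Here only `port` changes, because it is the only option supplied on the (hypothetical) command line; `ip` and `verbose` keep their configuration-file values.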
AICM can also be run in a service-like manner:
Once running, the APIs can be tested through the Swagger UI at <ip>:<port>/docs.
Stop Agent
The following command stops the agent if it was started with scripts/start_aicm_agent.sh:
If AICM was started directly with python aicm_agent.py, press Ctrl+C to stop it.
Setup Basic Auth
AICM uses Basic Auth as its authentication method, so every request to the API must be authenticated.
The accepted credentials are stored in the .users.yaml file.
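As an illustration of what authenticating a request involves on the client side, the following Python sketch builds a request carrying a Basic Auth header using only the standard library. The helper name `build_basic_auth_request` is hypothetical, the address uses the default ip and port from the sample configuration, and /docs is used only because it is the one path named in this guide; whether that page itself requires credentials is not stated here.

```python
import base64
import urllib.request


def build_basic_auth_request(
    url: str, username: str, password: str
) -> urllib.request.Request:
    """Return a request whose Authorization header carries Basic credentials."""
    # Basic Auth sends "username:password" encoded as Base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Replace <password> with the real password for the chosen user.
request = build_basic_auth_request("http://127.0.0.1:9000/docs", "admin", "<password>")

# To send it against a running agent, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Base64 is an encoding, not encryption, so the password is readable by anyone who can observe the traffic unless HTTPS is used.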
You can add/modify the credentials using the following syntax:
credentials:
  - username: admin
    hash: $2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6
    note: "Example of credential"
The hash is a bcrypt hash.
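Before adding an entry to the credentials file, it can be useful to check that the value is shaped like a bcrypt hash rather than a plaintext password. The sketch below is a hypothetical helper (`bcrypt_cost` is not part of AICM) that checks the standard 60-character bcrypt layout and returns the cost factor. It validates only the format and does not verify any password.

```python
import re

# Standard bcrypt layout: $2a$/$2b$/$2y$, a two-digit cost, then 53 characters
# (22 for the salt, 31 for the digest) from the bcrypt Base64 alphabet.
BCRYPT_PATTERN = re.compile(r"\$2[aby]\$(\d{2})\$[./A-Za-z0-9]{53}")


def bcrypt_cost(hash_value: str) -> int:
    """Return the cost factor of a bcrypt hash, or raise if the shape is wrong."""
    match = BCRYPT_PATTERN.fullmatch(hash_value)
    if match is None:
        raise ValueError("value does not look like a bcrypt hash")
    return int(match.group(1))


example = "$2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6"
print(bcrypt_cost(example))  # prints 12
```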
A script for generating the hash is provided at /scripts/hash_password.py.
Replace <password> and run this command:
HTTPS
Basic Auth is only a simple authentication mechanism. For added security, running HTTPS is recommended, which requires providing a certificate and key at startup. This can be done by passing the following arguments:
--ssl-key SSL_KEY    Path to the SSL key file. Needs to be provided for HTTPS
--ssl-cert SSL_CERT  Path to the SSL certificate file. Needs to be provided for HTTPS
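Because HTTPS needs both files, a launcher script may want to check the pair before starting the agent. The following Python sketch shows one way to do that. The function `select_scheme` is hypothetical, and the assumption that the agent serves plain HTTP when neither option is given is not confirmed by this guide.

```python
from typing import Optional


def select_scheme(ssl_key: Optional[str], ssl_cert: Optional[str]) -> str:
    """Return "https" when both files are given, "http" when neither is."""
    if ssl_key and ssl_cert:
        return "https"
    if ssl_key or ssl_cert:
        # Only one of the two was supplied, so HTTPS cannot be set up.
        raise ValueError("--ssl-key and --ssl-cert must be provided together")
    # Assumption: without a key and certificate the agent runs over HTTP.
    return "http"


print(select_scheme("server.key", "server.crt"))
print(select_scheme(None, None))
```

The file names `server.key` and `server.crt` are placeholders for illustration only.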
Metrics
These are the metrics currently served by AICM (stored in docs/metrics.csv):
In addition to a set of core metrics, AICM provides Reliability, Availability, and Serviceability (RAS) error statuses.
Model | Field Name | Description |
---|---|---|
HealthDataModel | dev_status | Status of the device |
HealthDataModel | mhi_id | MHI ID |
HealthDataModel | pci_address | PCI Address of the device |
HealthDataModel | pci_info | PCI Info |
HealthDataModel | max_link_speed | Max Link Speed |
HealthDataModel | max_link_width | Max Link Width |
HealthDataModel | current_link_speed | Current Link Speed |
HealthDataModel | current_link_width | Current Link Width |
HealthDataModel | dev_link | Dev Link Name |
HealthDataModel | hw_version | Hardware version |
HealthDataModel | hw_serial_string | HW Serial Number |
HealthDataModel | fw_version | Firmware version |
HealthDataModel | fw_qc_image_version | Qualcomm firmware identification string |
HealthDataModel | fw_oem_image_version | OEM custom firmware identification string |
HealthDataModel | fw_image_variant | Firmware image variant, e.g. debug, release, etc |
HealthDataModel | device_capabilities | Device Firmware Features |
HealthDataModel | current_boot_interface | Boot Interface |
HealthDataModel | nsp_version | NSP version |
HealthDataModel | nsp_qc_image_version | NSP Image string |
HealthDataModel | nsp_oem_image_version | Image string provided by OEM |
HealthDataModel | nsp_image_variant | NSP image variant, e.g. debug, release |
HealthDataModel | dram_total_kb | Total RAM in system in KB |
HealthDataModel | dram_free_kb | Amount of RAM free in KB |
HealthDataModel | dram_fragmentation_percentage | Percentage of DRAM fragmentation |
HealthDataModel | vc_total | Total number of virtual channels on the system |
HealthDataModel | vc_free | Number of available virtual channels |
HealthDataModel | pc_total | Total number of Physical Channels |
HealthDataModel | pc_reserved | Number of reserved Physical Channels |
HealthDataModel | nsp_total | Number of neural processors on the system |
HealthDataModel | nsp_free | Number of available neural processors |
HealthDataModel | dram_bw_KBps | DRAM bandwidth in Kbytes/second, averaged over last ~100 ms |
HealthDataModel | mcid_total | Total number of multicast IDs available on the system |
HealthDataModel | mcid_free | Number of available multicast IDs |
HealthDataModel | semaphore_total | Total number of semaphores available on the system |
HealthDataModel | semaphore_free | Number of available semaphores |
HealthDataModel | num_constant_loaded | Number of constants loaded, each load of constants increments by 1 |
HealthDataModel | num_constant_in_use | Number of loaded constants that are actively used by networks running on the system |
HealthDataModel | num_networks_loaded | Number of neural networks loaded in memory on the system |
HealthDataModel | num_networks_active | Number of neural networks currently actively computing on the system |
HealthDataModel | neural_processor_frequency_Mhz | Nominal operating frequency of the neural processors; all processors share the same maximum clock |
HealthDataModel | ddr_frequency_Mhz | Nominal operating frequency of DDR memory |
HealthDataModel | compute_noc_frequency_Mhz | Nominal operating frequency of compute network on chip |
HealthDataModel | memory_noc_frequency_Mhz | Nominal operating frequency of memory network on chip |
HealthDataModel | system_noc_frequency_Mhz | Nominal operating frequency of system network on chip |
HealthDataModel | metadata_version | Metadata version |
HealthDataModel | nnc_protocol_version | NNC protocol version |
HealthDataModel | sbl_image | SBL image string |
HealthDataModel | pvs_image_version | PVS image version |
HealthDataModel | nsp_defective_pg_mask | Defective NSP mask |
HealthDataModel | num_retired_ddr_pages | Number of retired DDR pages |
HealthDataModel | need_reset_to_retire_pages | Reset required to retire pending pages |
HealthDataModel | board_serial | Board serial |
HealthDataModel | soc_temparature_degree_C | SOC temperature in Degree Celsius |
HealthDataModel | board_power_watts | Board power in Watts |
HealthDataModel | tdp_cap_watts | Thermal Design Power cap in Watts |
HealthDataModel | sku_type | SKU Type |
HealthDataModel | complex_id | Complex ID |
HealthDataModel | soc_power_watts | SOC Power in Watts |
HealthDataModel | soc_tdp_cap_watts | SOC Thermal Design Power cap in Watts |
PciDataModel | byte_count_rx | Bytes received on PCIe |
PciDataModel | byte_count_tx | Bytes sent on PCIe |
DdrBwDataModel | byte_count_total | Sum of the NSP individual byte count |
DdrBwDataModel_NspDdrBwDataModel | byte_count | DDR Byte Count for this NSP |
RasErrorsDataModel | ras_ddr_correctable_error_count | Count of Correctable Errors received from ras_ddr |
RasErrorsDataModel | ras_ddr_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_ddr |
RasErrorsDataModel | ras_mcw_correctable_error_count | Count of Correctable Errors received from ras_mcw |
RasErrorsDataModel | ras_mcw_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_mcw |
RasErrorsDataModel | ras_imem_correctable_error_count | Count of Correctable Errors received from ras_imem |
RasErrorsDataModel | ras_imem_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_imem |
RasErrorsDataModel | ras_nsp_correctable_error_count | Count of Correctable Errors received from ras_nsp |
RasErrorsDataModel | ras_nsp_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_nsp |
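To work with this list programmatically, the metrics file can be grouped by model. The Python sketch below is a minimal example that assumes docs/metrics.csv is a comma-separated file whose header row uses the same column names as the table above (Model, Field Name, Description); that layout is an assumption, so the sketch parses a small inline sample rather than the real file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Inline sample in the assumed layout; rows are copied from the table above.
SAMPLE_CSV = """Model,Field Name,Description
HealthDataModel,dram_total_kb,Total RAM in system in KB
HealthDataModel,dram_free_kb,Amount of RAM free in KB
RasErrorsDataModel,ras_ddr_correctable_error_count,Count of Correctable Errors received from ras_ddr
"""


def fields_by_model(csv_text: str) -> Dict[str, List[str]]:
    """Map each model name to the list of its field names, in file order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Model"]].append(row["Field Name"])
    return dict(grouped)


grouped_fields = fields_by_model(SAMPLE_CSV)
for model, fields in grouped_fields.items():
    print(model, fields)
```

To use the real file instead, read its contents with `open("docs/metrics.csv", newline="")` and pass the text to `fields_by_model`, adjusting the column names if the actual header differs.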