User guide
Starting AICM
Install Dependencies
AICM files are located in:
AICM requires several Python packages to run.
We recommend installing the dependencies in a virtual environment:
python -m venv <path-to-the-virtual-environment>
source <path-to-the-virtual-environment>/bin/activate
pip install -r requirements.txt
Start Agent
After activating the virtual environment, AICM can be started using the following command:
python aicm_agent.py
optional arguments:
  -h, --help            Show this help message and exit.
  --config_file         Path to the config file.
  --dump-default-config, --dump_default_config_users
                        Dump the default config to the specified folder and exit.
  --dump-default-users, --dump_default_users
                        Dump the default users file to the specified folder and exit.
  --ip                  IP address to bind to.
  --port                Port number to listen on.
  --log                 Path where to store logs.
  --max-log-size, --max_log_size
                        Max size of logs in bytes.
  --ssl-key, --ssl_key  Path to the SSL key file. Needs to be provided for HTTPS.
  --ssl-cert, --ssl_cert
                        Path to the SSL certificate file. Needs to be provided for HTTPS.
  --qmonitor-ip, --qmonitor_ip
                        IP of QMonitor GRPC Server.
  --qmonitor-port, --qmonitor_port
                        Port of QMonitor GRPC Server.
  -u, --users           Full path to the users credentials file (.yaml).
  -v, --verbose         Increase output verbosity.
These settings can also be supplied via a configuration file, passed with the --config_file option.
If a setting appears both in the configuration file and on the command line, the command line argument takes priority.
# AICM Configuration
# IP to bind AICM
ip = 127.0.0.1
# Port to bind AICM
port = 9000
# Full path to users credentials file
# users =
# SSL Key path
# ssl_key =
# SSL Certificate path
# ssl_cert =
# Path to directory where to store logs
# log =
# IP of QMonitor GRPC Server
qmonitor_ip = localhost
# Port of QMonitor GRPC Server
qmonitor_port = 62472
# Max size of logs in bytes
max_log_size = 100000000
# Verbosity of AICM agent
verbose = 0
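The priority rule above (command line arguments override the configuration file) can be illustrated with a minimal sketch. This is not AICM's actual loading code: the helper names `parse_config_text` and `merge_settings` are hypothetical, and the parser only covers two of the documented options. It assumes the configuration file uses plain `key = value` lines with `#` comments, as in the example above.

```python
import argparse


def parse_config_text(text):
    """Parse simple `key = value` lines, skipping blank lines and `#` comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that contain no "="
        settings[key.strip()] = value.strip()
    return settings


def merge_settings(file_settings, cli_args):
    """Overlay command line values that were actually supplied on top of file values."""
    merged = dict(file_settings)
    for key, value in vars(cli_args).items():
        if value is not None:  # None means the option was not given on the command line
            merged[key] = value
    return merged


# Hypothetical parser covering only --ip and --port for illustration.
parser = argparse.ArgumentParser()
parser.add_argument("--ip", default=None)
parser.add_argument("--port", type=int, default=None)

config_text = """
# IP to bind AICM
ip = 127.0.0.1
# Port to bind AICM
port = 9000
"""

# Simulate `python aicm_agent.py --port 9100` combined with the file above.
cli_args = parser.parse_args(["--port", "9100"])
settings = merge_settings(parse_config_text(config_text), cli_args)
print(settings)
```

In this sketch the port comes from the command line while the IP falls back to the value in the file.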
AICM can also be run in a service-like manner:
Once running, the APIs can be tested through the Swagger UI at <ip>:<port>/docs.
Stop Agent
The following command stops the agent when scripts/start_aicm_agent.sh was used to start it:
If AICM was started directly with the python aicm_agent.py command, press Ctrl+C to stop it.
Setup Basic Auth
Since Basic Auth is used as the authentication method, users will need to authenticate all requests to our API.
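As an illustration of what authenticating a request involves on the client side, the following sketch builds a standard HTTP Basic `Authorization` header using only the Python standard library. The helper names `basic_auth_header` and `build_request` are hypothetical, the URL reuses the default IP and port from the configuration example, and `<password>` is a placeholder to replace.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(url, username, password):
    """Create a request object that carries Basic Auth credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Assumed address: the default ip/port from the configuration example.
request = build_request("http://127.0.0.1:9000/docs", "admin", "<password>")

# Sending is left out so the sketch runs without a live agent:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Basic Auth only encodes the credentials and does not encrypt them, which is why the HTTPS section recommends running over TLS.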
The accepted credentials are stored in the .users.yaml file.
You can add or modify credentials using the following syntax:
credentials:
  - username: admin
    hash: $2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6
    note: "Example of credential"
The hash is generated with bcrypt.
A script for generating the hash is provided at /scripts/hash_password.py. Replace <password> and run this command:
HTTPS
Basic Auth is only a simple authentication mechanism. For added security, running HTTPS is recommended, which requires a certificate and key to be provided at startup. This can be done by passing the following arguments:
  --ssl-key SSL_KEY     Path to the SSL key file. Needs to be provided for HTTPS.
  --ssl-cert SSL_CERT   Path to the SSL certificate file. Needs to be provided for HTTPS.
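On the client side, connecting to an agent that runs with HTTPS requires trusting its certificate. The sketch below shows one way a Python client might prepare a verifying TLS context with the standard `ssl` module. The helper name `make_client_context` and the `ca_file` parameter are hypothetical; nothing here is part of AICM itself.

```python
import ssl


def make_client_context(ca_file=None):
    """Return a TLS context that verifies the server certificate.

    With ca_file=None the system trust store is used. For a self-signed
    certificate, pass the path to that certificate (or to its CA) instead.
    """
    return ssl.create_default_context(cafile=ca_file)


context = make_client_context()
```

The resulting context can be passed as the `context` argument of `urllib.request.urlopen` when requesting an `https://<ip>:<port>` URL.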
Metrics
In addition to a set of core metrics, AICM provides Reliability, Availability, Serviceability (RAS) error statuses.
These are the metrics currently served by AICM (stored in docs/metrics.csv):
Model | Field Name | Description |
---|---|---|
HealthDataModel | dev_status | Status of the device |
HealthDataModel | mhi_id | MHI ID |
HealthDataModel | pci_address | PCI Address of the device |
HealthDataModel | pci_info | PCI Info |
HealthDataModel | max_link_speed | Max Link Speed |
HealthDataModel | max_link_width | Max Link Width |
HealthDataModel | current_link_speed | Current Link Speed |
HealthDataModel | current_link_width | Current Link Width |
HealthDataModel | dev_link | Dev Link Name |
HealthDataModel | hw_version | Hardware version |
HealthDataModel | hw_serial_string | HW Serial Number |
HealthDataModel | fw_version | Firmware version |
HealthDataModel | fw_qc_image_version | Qualcomm firmware identification string |
HealthDataModel | fw_oem_image_version | OEM custom firmware identification string |
HealthDataModel | fw_image_variant | Firmware image variant, e.g. debug, release, etc |
HealthDataModel | device_capabilities | Device Firmware Features |
HealthDataModel | current_boot_interface | Boot Interface |
HealthDataModel | nsp_version | NSP version |
HealthDataModel | nsp_qc_image_version | NSP Image string |
HealthDataModel | nsp_oem_image_version | Image string provided by OEM |
HealthDataModel | nsp_image_variant | NSP image variant, e.g. debug, release |
HealthDataModel | dram_total_kb | Total RAM in system in KB |
HealthDataModel | dram_free_kb | Amount of RAM free in KB |
HealthDataModel | dram_fragmentation_percentage | Percentage of DRAM fragmentation |
HealthDataModel | vc_total | Total number of virtual channels on the system |
HealthDataModel | vc_free | Number of available virtual channels |
HealthDataModel | pc_total | Total number of Physical Channels |
HealthDataModel | pc_reserved | Number of reserved Physical Channels |
HealthDataModel | nsp_total | Number of neural processors on the system |
HealthDataModel | nsp_free | Number of available neural processors |
HealthDataModel | dram_bw_KBps | DRAM bandwidth in Kbytes/second, averaged over last ~100 ms |
HealthDataModel | mcid_total | Total number of multicast IDs available on the system |
HealthDataModel | mcid_free | Number of available multicast IDs |
HealthDataModel | semaphore_total | Total number of semaphores available on the system |
HealthDataModel | semaphore_free | Number of available semaphores |
HealthDataModel | num_constant_loaded | Number of constants loaded, each load of constants increments by 1 |
HealthDataModel | num_constant_in_use | Number of loaded constants that are actively used by networks running on the system |
HealthDataModel | num_networks_loaded | Number of neural networks loaded in memory on the system |
HealthDataModel | num_networks_active | Number of neural networks currently actively computing on the system |
HealthDataModel | neural_processor_frequency_Mhz | Nominal operating frequency of the neural processors; all processors have the same max clock |
HealthDataModel | ddr_frequency_Mhz | Nominal operating frequency of DDR memory |
HealthDataModel | compute_noc_frequency_Mhz | Nominal operating frequency of compute network on chip |
HealthDataModel | memory_noc_frequency_Mhz | Nominal operating frequency of memory network on chip |
HealthDataModel | system_noc_frequency_Mhz | Nominal operating frequency of system network on chip |
HealthDataModel | metadata_version | Metadata version |
HealthDataModel | nnc_protocol_version | NNC protocol version |
HealthDataModel | sbl_image | SBL image string |
HealthDataModel | pvs_image_version | PVS image version |
HealthDataModel | nsp_defective_pg_mask | Defective NSP mask |
HealthDataModel | num_retired_ddr_pages | Number of retired DDR pages |
HealthDataModel | need_reset_to_retire_pages | Reset required to retire pending pages |
HealthDataModel | board_serial | Board serial |
HealthDataModel | soc_temparature_degree_C | SOC temperature in Degree Celsius |
HealthDataModel | board_power_watts | Board power in Watts |
HealthDataModel | tdp_cap_watts | Thermal Design Power cap in Watts |
HealthDataModel | sku_type | SKU Type |
HealthDataModel | complex_id | Complex ID |
HealthDataModel | soc_power_watts | SOC Power in Watts |
HealthDataModel | soc_tdp_cap_watts | SOC Thermal Design Power cap in Watts |
PciDataModel | byte_count_rx | Bytes received on PCIe |
PciDataModel | byte_count_tx | Bytes sent on PCIe |
DdrBwDataModel | byte_count_total | Sum of the individual NSP byte counts |
DdrBwDataModel_NspDdrBwDataModel | byte_count | DDR Byte Count for this NSP |
RasErrorsDataModel | ras_ddr_correctable_error_count | Count of Correctable Errors received from ras_ddr |
RasErrorsDataModel | ras_ddr_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_ddr |
RasErrorsDataModel | ras_mcw_correctable_error_count | Count of Correctable Errors received from ras_mcw |
RasErrorsDataModel | ras_mcw_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_mcw |
RasErrorsDataModel | ras_imem_correctable_error_count | Count of Correctable Errors received from ras_imem |
RasErrorsDataModel | ras_imem_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_imem |
RasErrorsDataModel | ras_nsp_correctable_error_count | Count of Correctable Errors received from ras_nsp |
RasErrorsDataModel | ras_nsp_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_nsp |
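The fields in the table can be combined into derived values on the consumer side. The following sketch computes DRAM usage from `dram_total_kb` and `dram_free_kb` and sums the uncorrectable RAS error counters. It assumes the metrics are available as plain dictionaries keyed by field name; the actual response shape of the AICM API is not described here. The function names are hypothetical and the sample values are made up purely for illustration.

```python
def dram_used_percent(health):
    """Percentage of DRAM in use, derived from dram_total_kb and dram_free_kb."""
    total = health["dram_total_kb"]
    free = health["dram_free_kb"]
    if total <= 0:
        return 0.0  # avoid division by zero when no total is reported
    return 100.0 * (total - free) / total


def uncorrectable_error_total(ras):
    """Sum every RAS counter whose field name ends in _uncorrectable_error_count."""
    return sum(
        value
        for field, value in ras.items()
        if field.endswith("_uncorrectable_error_count")
    )


# Made-up sample values, not real device readings.
sample_health = {"dram_total_kb": 16_000_000, "dram_free_kb": 4_000_000}
sample_ras = {
    "ras_ddr_correctable_error_count": 3,
    "ras_ddr_uncorrectable_error_count": 1,
    "ras_nsp_correctable_error_count": 0,
    "ras_nsp_uncorrectable_error_count": 2,
}

print(dram_used_percent(sample_health))
print(uncorrectable_error_total(sample_ras))
```

Matching on the field name suffix keeps the helper working for all four RAS sources (ddr, mcw, imem, nsp) without listing them individually.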