Starting AICM
Install Dependencies
AICM files are located in:
/opt/qti-aic/tools/aic-manager
AICM requires several Python packages to run. We recommend installing the dependencies in a virtual environment:
python -m venv <path-to-the-virtual-environment>
source <path-to-the-virtual-environment>/bin/activate
pip install -r requirements.txt
Start Agent
After activating the virtual environment, start AICM by running aicm_agent.py with Python. The script accepts the following options:
usage: aicm_agent.py [-h] [-c CONFIG_FILE] [--ip IP] [--port PORT] [-u USERS] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--qmonitor-ip QMONITOR_IP] [--qmonitor-port QMONITOR_PORT] [--log LOG]
                     [--max-log-size MAX_LOG_SIZE] [-v] [--dump-default-config DUMP_DEFAULT_CONFIG] [--dump-default-users DUMP_DEFAULT_USERS] [--collection KEY=VALUE [KEY=VALUE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config_file CONFIG_FILE
                        Config file path
  --ip IP               IP address to bind to
  --port PORT           Port number to listen on
  -u USERS, --users USERS
                        Full path to users credentials file (.yaml)
  --ssl-key SSL_KEY, --ssl_key SSL_KEY
                        Path to the SSL key file. Required for HTTPS
  --ssl-cert SSL_CERT, --ssl_cert SSL_CERT
                        Path to the SSL certificate file. Required for HTTPS
  --qmonitor-ip QMONITOR_IP, --qmonitor_ip QMONITOR_IP
                        IP of QMonitor GRPC Server
  --qmonitor-port QMONITOR_PORT, --qmonitor_port QMONITOR_PORT
                        Port of QMonitor GRPC Server
  --log LOG             Path where to store logs
  --max-log-size MAX_LOG_SIZE, --max_log_size MAX_LOG_SIZE
                        Max sizes of logs in bytes
  -v, --verbose         Increase output verbosity
  --collection KEY=VALUE [KEY=VALUE ...]
                        Set a number of key-value pairs (do not put spaces before or after the = sign).

Dump default files:
  Dumps the default files and then exit.

  --dump-default-config DUMP_DEFAULT_CONFIG, --dump_default_config DUMP_DEFAULT_CONFIG
                        Dump the default config to the specified folder and exit
  --dump-default-users DUMP_DEFAULT_USERS, --dump_default_users DUMP_DEFAULT_USERS
                        Dump the default users file to the specified folder and exit
These settings can also be supplied through a configuration file, passed with the --config_file option. If a setting appears both in the configuration file and on the command line, the command-line value takes priority. An example configuration file:
# AICM Configuration
# IP to bind AICM
ip = 127.0.0.1
# Port to bind AICM
port = 9000
# Full path to users credentials file
# users =
# SSL Key path
# ssl_key =
# SSL Certificate path
# ssl_cert =
# IP of QMonitor GRPC Server
qmonitor_ip = localhost
# Port of QMonitor GRPC Server
qmonitor_port = 62472
# Path to directory where to store logs
# log =
# Max sizes of logs in bytes
max_log_size = 100000000
# Verbosity of AICM agent
verbose = 0
# KEY=VALUE for defining collection intervals in milliseconds for each collection
collection = [HEALTH=1000, DDR_BW=1000, PCI=1000, RAS_ECC_ERROR=10000]
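The precedence rule described above (command line over configuration file) and the KEY=VALUE format of the collection option can be illustrated with a minimal sketch. This is not AICM's own parsing code: parse_collection and resolve_settings are hypothetical helpers, and the fallback values are simply copied from the example configuration.

```python
# Hypothetical illustration only; AICM's real option handling may differ.

# Fallback values copied from the example configuration file.
EXAMPLE_DEFAULTS = {
    "ip": "127.0.0.1",
    "port": 9000,
    "qmonitor_ip": "localhost",
    "qmonitor_port": 62472,
    "max_log_size": 100000000,
    "verbose": 0,
}


def parse_collection(pairs):
    """Turn ['HEALTH=1000', 'PCI=500'] into {'HEALTH': 1000, 'PCI': 500}.

    Values are collection intervals in milliseconds. Spaces around the
    '=' sign are rejected, matching the note in the help text.
    """
    intervals = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key or key != key.strip() or not value.isdigit():
            raise ValueError(f"expected KEY=VALUE with an integer value, got {pair!r}")
        intervals[key] = int(value)
    return intervals


def resolve_settings(file_settings, cli_settings):
    """Merge settings so that command-line values win over file values.

    A CLI value of None means the option was not given on the command line.
    """
    settings = dict(EXAMPLE_DEFAULTS)
    settings.update(file_settings)
    settings.update({k: v for k, v in cli_settings.items() if v is not None})
    return settings
```

For example, resolve_settings({"port": 9100}, {"port": 9200, "ip": None}) keeps port 9200 from the command line and falls back to 127.0.0.1 for the IP.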
AICM can also be run in a service-like manner:
sudo bash scripts/start_aicm_agent.sh
Once running, the APIs can be tested through the Swagger UI at <ip>:<port>/docs
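When the agent is started from a script, it can be useful to wait until it accepts connections before opening the Swagger UI. The following is a minimal sketch using only the Python standard library; wait_for_port is a hypothetical helper that is not part of AICM, and the address and port in the usage example are the values from the example configuration.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.25):
    """Return True once a TCP connection to host:port succeeds.

    Retries until `timeout` seconds have elapsed, then returns False.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Address and port taken from the example configuration (assumption).
    if wait_for_port("127.0.0.1", 9000):
        print("AICM is listening; open <ip>:<port>/docs in a browser")
    else:
        print("AICM did not start listening within the timeout")
```

This only checks that the port is open. It does not confirm that the process behind the port is AICM.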
Stop Agent
The following command stops the agent when it was started with scripts/start_aicm_agent.sh:
sudo bash scripts/stop_aicm_agent.sh
Alternatively, pressing Ctrl+C stops AICM if it was started with the python aicm_agent.py command.
Setup Basic Auth
AICM uses Basic Auth as its authentication method, so every request to the API must be authenticated.
The accepted credentials are stored in the .users.yaml file.
You can add or modify credentials using the following syntax:
credentials:
  - username: admin
    hash: $2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6
    note: "Example of credential"
These credentials are required for every request made to the HTTP REST endpoints. For security purposes, the password is hashed using bcrypt. A script that generates the hash is provided at scripts/hash_password.py. Replace <password> and run this command:
python ./scripts/hash_password.py <password>
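Before pasting a generated hash into the users file, it may be worth checking that it survived copying intact, since the $ characters are easy to lose in a shell. The sketch below validates the standard bcrypt string layout ($2b$, a two-digit cost, then 53 characters of salt and digest) and returns the cost factor. bcrypt_cost is a hypothetical helper and is not part of AICM or of hash_password.py.

```python
import re

# bcrypt string layout: $2<variant>$<2-digit cost>$<22-char salt><31-char digest>
# using the alphabet ./A-Za-z0-9 for the final 53 characters.
BCRYPT_PATTERN = re.compile(r"^\$2[abxy]?\$(\d{2})\$[./A-Za-z0-9]{53}$")


def bcrypt_cost(hash_value):
    """Return the cost factor of a bcrypt hash string, or None if it is malformed."""
    match = BCRYPT_PATTERN.match(hash_value)
    if match is None:
        return None
    cost = int(match.group(1))
    # bcrypt accepts cost factors from 4 to 31.
    return cost if 4 <= cost <= 31 else None


if __name__ == "__main__":
    example = "$2b$12$rEQTKF4IVHKPyeX6miseJ.xOjhmI5OFqlLuwE2OB4CuEIvHC2IFP6"
    print(bcrypt_cost(example))
```

This check only confirms the format. It cannot tell whether the hash corresponds to the intended password.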
HTTPS
Basic Auth is only a simple authentication mechanism. For added security, we recommend running AICM over HTTPS, which requires a certificate and key at startup. Pass them with the following arguments:
--ssl-key SSL_KEY     Path to the SSL key file. Needs to be provided for HTTPS
--ssl-cert SSL_CERT   Path to the SSL certificate file. Needs to be provided for HTTPS
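On the client side, a request to an HTTPS-enabled agent combines the Basic Auth header with certificate verification. The following sketch uses only the Python standard library. build_request and fetch are hypothetical helpers, the /docs path is used only because it is the one path named in this guide (whether it requires credentials is not stated here), and ca_file is an assumed path to the certificate or CA bundle that signed the agent's certificate.

```python
import base64
import ssl
import urllib.request


def build_request(base_url, username, password, path="/docs"):
    """Build a request carrying an HTTP Basic Auth header."""
    # Basic Auth is "username:password", base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + path)
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch(request, ca_file=None):
    """Send the request, verifying the server certificate.

    Pass ca_file when the agent uses a self-signed or private-CA certificate;
    with ca_file=None the system trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.status, response.read()


if __name__ == "__main__":
    # Host, port, credentials, and certificate path are placeholders (assumptions).
    req = build_request("https://127.0.0.1:9000", "admin", "<password>")
    status, _body = fetch(req, ca_file="<path-to-certificate>")
    print(status)
```

Because Basic Auth credentials are only base64-encoded, not encrypted, sending them over plain HTTP exposes them on the network. This is the main reason to enable HTTPS.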
Metrics
AICM currently serves the metrics listed below (also stored in docs/metrics.csv). In addition to a set of core metrics, AICM provides Reliability, Availability, Serviceability (RAS) error statuses.
Model | Field Name | Description |
---|---|---|
HealthDataModel | dev_status | Status of the device |
HealthDataModel | mhi_id | MHI ID |
HealthDataModel | pci_address | PCI Address of the device |
HealthDataModel | pci_info | PCI Info |
HealthDataModel | max_link_speed | Max Link Speed |
HealthDataModel | max_link_width | Max Link Width |
HealthDataModel | current_link_speed | Current Link Speed |
HealthDataModel | current_link_width | Current Link Width |
HealthDataModel | dev_link | Dev Link Name |
HealthDataModel | hw_version | Hardware version |
HealthDataModel | hw_serial_string | HW Serial Number |
HealthDataModel | fw_version | Firmware version |
HealthDataModel | fw_qc_image_version | Qualcomm firmware identification string |
HealthDataModel | fw_oem_image_version | OEM custom firmware identification string |
HealthDataModel | fw_image_variant | Firmware image variant, e.g. debug, release, etc |
HealthDataModel | device_capabilities | Device Firmware Features |
HealthDataModel | current_boot_interface | Boot Interface |
HealthDataModel | nsp_version | NSP version |
HealthDataModel | nsp_qc_image_version | NSP Image string |
HealthDataModel | nsp_oem_image_version | Image string provided by OEM |
HealthDataModel | nsp_image_variant | NSP image variant, e.g. debug, release |
HealthDataModel | dram_total_kb | Total RAM in system in KB |
HealthDataModel | dram_free_kb | Amount of RAM free in KB |
HealthDataModel | dram_fragmentation_percentage | Percentage of DRAM fragmentation |
HealthDataModel | vc_total | Total number of virtual channels on the system |
HealthDataModel | vc_free | Number of available virtual channels |
HealthDataModel | pc_total | Total number of Physical Channels |
HealthDataModel | pc_reserved | Number of reserved Physical Channels |
HealthDataModel | nsp_total | Number of neural processors on the system |
HealthDataModel | nsp_free | Number of available neural processors |
HealthDataModel | dram_bw_KBps | DRAM bandwidth in Kbytes/second, averaged over last ~100 ms |
HealthDataModel | mcid_total | Total number of multicast IDs available on the system |
HealthDataModel | mcid_free | Number of available multicast IDs |
HealthDataModel | semaphore_total | Total number of semaphores available on the system |
HealthDataModel | semaphore_free | Number of available semaphores |
HealthDataModel | num_constant_loaded | Number of constants loaded, each load of constants increments by 1 |
HealthDataModel | num_constant_in_use | Number of loaded constants that are actively used by networks running on the system |
HealthDataModel | num_networks_loaded | Number of neural networks loaded in memory on the system |
HealthDataModel | num_networks_active | Number of neural networks currently actively computing on the system |
HealthDataModel | neural_processor_frequency_Mhz | Nominal operating frequency of the neural processors, all processors are having the same max clock |
HealthDataModel | ddr_frequency_Mhz | Nominal operating frequency of DDR memory |
HealthDataModel | compute_noc_frequency_Mhz | Nominal operating frequency of compute network on chip |
HealthDataModel | memory_noc_frequency_Mhz | Nominal operating frequency of memory network on chip |
HealthDataModel | system_noc_frequency_Mhz | Nominal operating frequency of system network on chip |
HealthDataModel | metadata_version | Metadata version |
HealthDataModel | nnc_protocol_version | NNC protocol version |
HealthDataModel | sbl_image | SBL image string |
HealthDataModel | pvs_image_version | PVS image version |
HealthDataModel | nsp_defective_pg_mask | Defective NSP mask |
HealthDataModel | num_retired_ddr_pages | Number of retired ddr pages |
HealthDataModel | need_reset_to_retire_pages | Reset required to retire pending pages |
HealthDataModel | board_serial | Board serial |
HealthDataModel | soc_temparature_degree_C | SOC temperature in Degree Celsius |
HealthDataModel | board_power_watts | Board power in Watts |
HealthDataModel | tdp_cap_watts | Thermal Design Power cap in Watts |
HealthDataModel | sku_type | SKU Type |
HealthDataModel | complex_id | Complex ID |
HealthDataModel | soc_power_watts | SOC Power in Watts |
HealthDataModel | soc_tdp_cap_watts | SOC Thermal Design Power cap in Watts |
PciDataModel | byte_count_rx | Bytes received on PCIE |
PciDataModel | byte_count_tx | Bytes sent on PCIE |
DdrBwDataModel | byte_count_total | Sum of the NSP individual byte count |
DdrBwDataModel_NspDdrBwDataModel | byte_count | DDR Byte Count for this NSP |
RasErrorsDataModel | ras_ddr_correctable_error_count | Count of Correctable Errors received from ras_ddr |
RasErrorsDataModel | ras_ddr_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_ddr |
RasErrorsDataModel | ras_mcw_correctable_error_count | Count of Correctable Errors received from ras_mcw |
RasErrorsDataModel | ras_mcw_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_mcw |
RasErrorsDataModel | ras_imem_correctable_error_count | Count of Correctable Errors received from ras_imem |
RasErrorsDataModel | ras_imem_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_imem |
RasErrorsDataModel | ras_nsp_correctable_error_count | Count of Correctable Errors received from ras_nsp |
RasErrorsDataModel | ras_nsp_uncorrectable_error_count | Count of Uncorrectable Errors received from ras_nsp |
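Since the same list is shipped as docs/metrics.csv, it can be loaded programmatically, for instance to see which fields each data model exposes. The sketch below is an illustration under one assumption: that the CSV header row uses the column names shown in the table (Model, Field Name, Description). fields_by_model is a hypothetical helper, not part of AICM.

```python
import csv
from collections import defaultdict


def fields_by_model(csv_file):
    """Group metric field names by data model.

    `csv_file` is any open text file or file-like object whose first row is
    the header "Model,Field Name,Description" (assumed column names).
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Model"]].append(row["Field Name"])
    return dict(grouped)


if __name__ == "__main__":
    # Path relative to the AICM directory, as given in this section.
    with open("docs/metrics.csv", newline="") as handle:
        for model, fields in fields_by_model(handle).items():
            print(f"{model}: {len(fields)} fields")
```

If the actual header names differ, only the two dictionary keys in fields_by_model need to change.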