Model Sharding¶
Cloud AI SDK enables model sharding, which provides the benefits of running larger models and improved throughput/latency/batch-size support across SoCs/cards connected to the same host. Topologies with and without a PCIe switch are supported. Cards connected to a PCIe switch with peer-to-peer (P2P) communication enabled provide the best performance.
Use Cases¶
There are two primary use cases for model sharding via tensor slicing.
- Execute models that do not fit in the memory footprint of a single SoC.
- Optimize performance (latency/throughput) for models that can fit within a single SoC but still benefit from tensor-slicing.
Architecture¶
For tensor slicing to achieve the best performance (latency/throughput), the server architecture, in particular the accelerator card interconnect performance, is critical. The image below shows 8 AI 100 Ultra accelerator cards connected to the host via a PCIe switch. There are two approaches to card-to-card communication.
- P2P communication between the cards through a PCIe switch. This architecture provides the best performance.
- Multi-device through host: card-to-card communication happens through the host. This approach has inferior performance compared to P2P.
This sample configuration allows model sharding via tensor slicing across 8 cards (typically used for > 15B parameter models).
Tensor Slicing¶
Model operations are split across multiple SoCs (up to 16 SoCs in a P2P configuration). The image provides a sample graph execution that is tensor sliced across 4 AI 100 accelerator cards. As seen from the image, there is substantial inter-card traffic across model layers, so the available inter-card data bandwidth plays a critical role in performance. Hence the need to enable P2P inter-card communication via a PCIe switch. The AI 100 Ultra card has a PCIe switch between the 4 SoCs on the card. In a server with many AI 100 accelerators, the PCIe hierarchy is equally critical to performance.
Platform setup¶
Pre-requisites¶
- The host must support large BAR sizes. Each AI 100 accelerator card requires 2+ GB of BAR space per SoC.
- BAR region 4 for every AI 100 SoC must be 2G (size=2G in the lspci output, as shown in the sketch below). If region 4 for any SoC is not 2G, contact your System Integrator.
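A quick way to check is with lspci. This is a minimal sketch; the 17cb PCI vendor ID used to match the AI 100 devices is an assumption, so adjust the filter for your platform.

# List PCI devices with Qualcomm's vendor ID (17cb, assumed) and print BAR region 4 for each.
for BDF in $(lspci -d 17cb: | awk '{print $1}'); do
    echo "=== $BDF ==="
    sudo lspci -vv -s "$BDF" | grep "Region 4"
done

Each SoC should report a Region 4 line ending in [size=2G].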
Card configuration¶
1. Disable PCIe switch ACS to enable P2P communication between PCIe ports. See instructions here.
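ACS is typically disabled by clearing the ACS control register on the switch's downstream ports. The sketch below uses setpci's named ECAP_ACS capability; the BDF shown is hypothetical, and which ports to target depends on your PCIe topology, so follow the linked instructions for your system.

# Clear the ACS Control register (offset 6 in the ACS extended capability).
BDF=0000:17:08.0   # hypothetical switch downstream port; replace with yours
sudo setpci -v -s "$BDF" ECAP_ACS+6.w=0000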
2. Enable multi-device partitioning on all the SoCs. Download enable_mdp.json from here, then run:

systemd-run --unit=qmonitor-proxy /opt/qti-aic/tools/qaic-monitor-grpc-server
/opt/qti-aic/tools/qaic-monitor-json -i enable_mdp.json

Reset all the SoCs in the server.
3. Verify that the multi-device partitioning (MDP) feature is enabled for all devices. MDP+ in a device's feature list indicates that the multi-device feature is enabled on that device. See the sketch below.
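As a sketch, assuming the SDK's qaic-util query tool reports a feature string containing MDP+ (the -q flag and output format are assumptions; check the tool's help on your installation):

# Query all devices and filter for the MDP feature flag.
/opt/qti-aic/tools/qaic-util -q | grep -i mdp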
Compilation¶
Model partitioning across multiple devices is done by the compiler. The user is required to specify the number of SoCs/devices and the connectivity between the devices. Below are a few examples of device partition config files based on the connectivity and the number of devices. The device partition config file is passed to the qaic-exec compiler CLI.
Example 1: Model tensor sliced across 4 SoCs with P2P communication between the SoCs.
mdp_4soc_p2p.json
{
"connections": [
{
"devices": [0,1,2,3],
"type": "p2p"
}
],
"partitions": [
{
"name": "Partition0",
"devices": [
{
"deviceId": 0,
"numCores": 16
},
{
"deviceId": 1,
"numCores": 16
},
{
"deviceId": 2,
"numCores": 16
},
{
"deviceId": 3,
"numCores": 16
}
]
}
]
}
Example 2: Model tensor sliced across 2 SoCs with communication through the host between the SoCs. If no connections are defined, the connectivity is assumed to be through the host.
mdp_2soc_host.json
{
"partitions": [
{
"name": "Partition0",
"devices": [
{
"deviceId": 0,
"numCores": 16
},
{
"deviceId": 1,
"numCores": 16
}
]
}
]
}
To compile the model with the tensor-sliced configuration, pass the device partitioning config file to qaic-exec using the -mdp-load-partition-config flag as shown below.
/opt/qti-aic/exec/qaic-exec \
    -m=$model_path \
    -aic-hw \
    -aic-hw-version=2.0 \
    -network-specialization-config=specializations.json \
    -retained-state \
    -convert-to-fp16 \
    -aic-num-cores=${CORES} \
    -custom-IO-list-file=${model_name}/custom_io.yaml \
    -compile-only \
    -aic-binary-dir=qpc/${model_name}-${BS}bs-${PL}pl-${CL}cl-${CORES}c-${SOCS}soc-${MX} \
    -mdp-load-partition-config=mdp.json
Execution¶
Refer to the Cloud-ai-sdk examples for executing inference across multiple SoCs.
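For a quick standalone check that a multi-SoC QPC loads and runs, the SDK's qaic-runner test harness can be used. This is a sketch under assumptions: the -t (QPC path), -d (device ID), and -n (iteration count) flags are recollections of the tool's interface, not confirmed here, so consult the tool's help output on your installation.

# Run a compiled multi-SoC QPC for a fixed number of inferences (flags assumed).
/opt/qti-aic/exec/qaic-runner \
    -t qpc/${model_name}-${BS}bs-${PL}pl-${CL}cl-${CORES}c-${SOCS}soc-${MX} \
    -d 0 \
    -n 100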
Recommendations¶
For very large models that are compiled for inter-SoC communication through the host, the host memory requirements can be large. If inference fails due to host or device resource exhaustion, try the options below.

- Increase the maximum number of memory mappings allowed for a process from the default value of ~65k, and verify the new setting (see the sketch after this list).
- Increase system memory (RAM) to 1 TB and the CPU count to 32 cores or higher.
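A minimal sketch using the Linux vm.max_map_count sysctl, whose default (65530) matches the ~65k mentioned above; the target value below is an assumption, so size it to your model.

# Raise the per-process memory-map limit (value is an assumption; tune as needed).
sudo sysctl -w vm.max_map_count=1048576
# Verify the new setting.
sysctl vm.max_map_count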