Model Sharding¶
Cloud AI SDK enables model sharding, which provides the benefits of running larger models and improved throughput/latency/batch-size support across SoCs/cards connected to the same host. Topologies with and without a PCIe switch are supported. Cards connected to a PCIe switch with peer-to-peer (P2P) communication enabled provide the best performance.
Use Cases¶
There are two primary use cases for model sharding via tensor slicing.
- Execute models that do not fit in the memory footprint of a single SoC.
- Optimize performance (latency/throughput) for models that can fit within a single SoC but still benefit from tensor-slicing.
Architecture¶
For tensor slicing to achieve the best performance (latency/throughput), the server architecture, in particular the accelerator card interconnect performance, is critical. The image below shows 8 AI 100 Ultra accelerator cards connected to the host via a PCIe switch. There are two approaches to card-to-card communication.
- P2P communication between the cards through a PCIe switch. This architecture provides the best performance.
- Multi-device through host: card-to-card communication happens through the host. This approach has inferior performance compared to P2P.
This sample configuration allows model sharding via tensor slicing across 8 cards (typically used for > 15B parameter models).
Tensor Slicing¶
Model operations are split across multiple SoCs (up to 16 SoCs in a P2P configuration). The image provides a sample graph execution that is tensor sliced across 4 AI 100 accelerator cards. As seen from the image, there is substantial inter-card traffic across model layers, so the available inter-card data bandwidth plays a critical role in performance. Hence the need to enable P2P inter-card communication via a PCIe switch. The AI 100 Ultra card has a PCIe switch between the 4 SoCs on the card. In a server with many AI 100 accelerators, the PCIe hierarchy is equally critical to performance.
Platform setup¶
Pre-requisites¶
- The host must support large BAR sizes. Each AI 100 accelerator card requires 2+ GB of BAR space per SoC.
- BAR region 4 for every AI 100 SoC must be 2G (size=2G in the lspci output, as shown in the sketch below). If region 4 for any SoC is not 2G, contact your System Integrator.
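A quick way to check is with lspci. This is a minimal sketch; the 17cb PCI vendor ID used to match the AI 100 devices is an assumption, so adjust the filter for your platform.

# List PCI devices with Qualcomm's vendor ID (17cb, assumed) and print BAR region 4 for each.
for BDF in $(lspci -d 17cb: | awk '{print $1}'); do
    echo "=== $BDF ==="
    sudo lspci -vv -s "$BDF" | grep "Region 4"
done

Each SoC should report a Region 4 line ending in [size=2G].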
Card configuration¶
1. Disable PCIe switch ACS to enable P2P communication between PCIe ports. See instructions here.
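ACS is typically disabled by clearing the ACS control register on the switch's downstream ports. The sketch below uses setpci's named ECAP_ACS capability; the BDF shown is hypothetical, and which ports to target depends on your PCIe topology, so follow the linked instructions for your system.

# Clear the ACS Control register (offset 6 in the ACS extended capability).
BDF=0000:17:08.0   # hypothetical switch downstream port; replace with yours
sudo setpci -v -s "$BDF" ECAP_ACS+6.w=0000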
2. Enable multi-device partitioning on all the SoCs. Download enable_mdp.json from here, then run:

systemd-run --unit=qmonitor-proxy /opt/qti-aic/tools/qaic-monitor-grpc-server
/opt/qti-aic/tools/qaic-monitor-json -i enable_mdp.json

Reset all the SoCs in the server.
3. Verify that the multi-device partitioning (MDP) feature is enabled for all devices. MDP+ in a device's feature list indicates that the multi-device feature is enabled on that device. See the sketch below.
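As a sketch, assuming the SDK's qaic-util query tool reports a feature string containing MDP+ (the -q flag and output format are assumptions; check the tool's help on your installation):

# Query all devices and filter for the MDP feature flag.
/opt/qti-aic/tools/qaic-util -q | grep -i mdp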
Compilation¶
Model partitioning across multiple devices is done by the compiler. The user is required to specify the number of SoCs/devices and the connectivity between the devices. Below are a few examples of device partition config files based on the connectivity and the number of devices. The device partition config file is passed to the qaic-exec compiler CLI.
Example 1: Model tensor sliced across 4 SoCs with P2P communication between the SoCs.
mdp_4soc_p2p.json
{
"connections": [
{
"devices": [0,1,2,3],
"type": "p2p"
}
],
"partitions": [
{
"name": "Partition0",
"devices": [
{
"deviceId": 0,
"numCores": 16
},
{
"deviceId": 1,
"numCores": 16
},
{
"deviceId": 2,
"numCores": 16
},
{
"deviceId": 3,
"numCores": 16
}
]
}
]
}
Example 2: Model tensor sliced across 2 SoCs with communication through the host between the SoCs. If no connections are defined, the connectivity is assumed to be through the host.
mdp_2soc_host.json
{
"partitions": [
{
"name": "Partition0",
"devices": [
{
"deviceId": 0,
"numCores": 16
},
{
"deviceId": 1,
"numCores": 16
}
]
}
]
}
To compile the model with the tensor-sliced configuration, pass the device partitioning config file to qaic-exec using the -mdp-load-partition-config flag as shown below.
/opt/qti-aic/exec/qaic-exec \
    -m=$model_path \
    -aic-hw \
    -aic-hw-version=2.0 \
    -network-specialization-config=specializations.json \
    -retained-state \
    -convert-to-fp16 \
    -aic-num-cores=${CORES} \
    -custom-IO-list-file=${model_name}/custom_io.yaml \
    -compile-only \
    -aic-binary-dir=qpc/${model_name}-${BS}bs-${PL}pl-${CL}cl-${CORES}c-${SOCS}soc-${MX} \
    -mdp-load-partition-config=mdp.json
Execution¶
Refer to the Cloud-ai-sdk examples for executing inference across multiple SoCs.
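For a quick standalone check that a multi-SoC QPC loads and runs, the SDK's qaic-runner test harness can be used. This is a sketch under assumptions: the -t (QPC path), -d (device ID), and -n (iteration count) flags are recollections of the tool's interface, not confirmed here, so consult the tool's help output on your installation.

# Run a compiled multi-SoC QPC for a fixed number of inferences (flags assumed).
/opt/qti-aic/exec/qaic-runner \
    -t qpc/${model_name}-${BS}bs-${PL}pl-${CL}cl-${CORES}c-${SOCS}soc-${MX} \
    -d 0 \
    -n 100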
Recommendations¶
For very large models that are compiled for inter-SoC communication through the host, the host memory requirements can be large. If inference fails due to host or device resource exhaustion, try the options below.

- Increase the maximum number of memory mappings allowed for a process from the default value of ~65k, and verify the new setting (see the sketch after this list).
- Increase system memory (RAM) to 1 TB and the CPU count to 32 cores or higher.
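A minimal sketch using the Linux vm.max_map_count sysctl, whose default (65530) matches the ~65k mentioned above; the target value below is an assumption, so size it to your model.

# Raise the per-process memory-map limit (value is an assumption; tune as needed).
sudo sysctl -w vm.max_map_count=1048576
# Verify the new setting.
sysctl vm.max_map_count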