Model Sharding

Cloud AI SDK enables model sharding, which makes it possible to run larger models and improves throughput/latency/batch-size support across SoCs/cards connected to the same host. Supported topologies are with or without a PCIe switch. Cards connected to a PCIe switch with peer-to-peer (P2P) communication enabled provide the best performance.

Use Cases

There are two primary use cases for model sharding via tensor slicing.

  • Execute models that do not fit in the memory footprint of a single SoC.
  • Optimize performance (latency/throughput) for models that can fit within a single SoC but still benefit from tensor-slicing.

Architecture

For tensor slicing to achieve the best performance (latency/throughput), the server architecture, and in particular the accelerator card interconnect performance, is critical. The image below shows 8 AI 100 Ultra accelerator cards connected via PCIe switches to the host. There are two approaches to card-to-card communication.

  • P2P communication between the cards through a PCIe switch. This architecture provides the best performance.
  • Multi-device through host: card-to-card communication happens through the host. This approach has inferior performance compared to P2P.

This sample configuration allows model sharding via tensor slicing across 8 cards (typically used for > 15B parameter models).
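
To determine whether the cards on a given host share a PCIe switch, the standard lspci tool can show the PCIe tree. Below is a minimal sketch; the 17cb:a100 vendor:device ID is the one registered by the upstream Linux qaic driver for AI 100 SoCs, so confirm it matches your cards.

# Show the PCIe tree to see which cards sit behind a common switch
lspci -tv

# List all AI 100 SoCs (vendor 17cb, device a100 per the Linux qaic driver)
lspci -d 17cb:a100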

Tensor Slicing

Model operations are split across multiple SoCs (up to 16 SoCs in a P2P configuration). The image below provides a sample graph execution that is tensor sliced across 4 AI 100 Ultra accelerator cards. As seen in the image, there is significant inter-card traffic across model layers, so the available inter-card data bandwidth plays a critical role in performance. Hence the need to enable P2P inter-card communication via a PCIe switch. The AI 100 Ultra card has a PCIe switch between the 4 SoCs on the card. In a server with many AI 100 accelerators, the PCIe hierarchy plays a critical role in performance.

Platform setup

Pre-requisites

  • A minimum of 4 AI 100 Ultra cards is recommended per PCIe switch.
  • The PCIe switch must meet the maximum per-lane bandwidth requirements for all cards connected to it.
  • The host must support large BAR sizes. Each AI 100 accelerator card requires 2+ GB of BAR space per SoC.
  • BAR region 4 for every AI 100 SoC must be 2G (size=2G, as shown below). A sketch for checking all SoCs at once follows this list.
    lspci -s <PCIe address of AI 100 SoC> -vv | grep "Region 4"
    Region 4: Memory at xyz (64-bit, prefetchable) [size=2G]
    
If region 4 for any SoC is not 2G, contact your System Integrator.
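
A minimal sketch for checking Region 4 on every AI 100 SoC at once (again assuming the 17cb:a100 vendor:device ID from the Linux qaic driver):

# Print the BAR Region 4 line for each AI 100 SoC in the system
for bdf in $(lspci -d 17cb:a100 | awk '{print $1}'); do
    echo -n "$bdf: "
    sudo lspci -s "$bdf" -vv | grep "Region 4"
done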

Card configuration

Enable multi-device partitioning (MDP) on all the SoCs using the --setup_mdp all option while installing the platform SDK.

sudo ./install.sh --setup_mdp all

Example output:

Enabling MDP support system wide.
Disabling ACS. Required to enable MDP P2P.
Increasing mmap limit (max_map_count) to 2048000.
vm.max_map_count = 2048000 #qaic
Increasing openfiles limit to 1048576.
Installation is successful.

Note: This will enable MDP, disable ACS, and increase the mmap limit and ulimit values.

Multi-device partitioning (MDP) on all the SoCs can also be enabled using qaic-util. For more details, please refer to the qaic-util user guide (qaic_util).

Note: This will enable MDP and disable ACS.
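
The settings applied by the installer can be double-checked with standard tools, and device health queried with qaic-util. A sketch, assuming the default SDK install location for qaic-util; the expected values are those reported in the install log above:

# Verify limits set by the installer
sysctl vm.max_map_count      # expect: vm.max_map_count = 2048000
ulimit -n                    # expect: 1048576

# Query the status of all AI 100 devices
sudo /opt/qti-aic/tools/qaic-util -q | grep -i status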

Compilation

Model partitioning across multiple devices is done by the compiler. The user is required to specify the number of SoCs/devices and the connectivity between the devices. Here are a few examples of device partition config files based on the connectivity and number of devices. The device partition config file is passed to the compiler via the qaic-exec CLI.

Example 1: Model tensor sliced across 4 SoCs with P2P communication between the SoCs.

mdp_4soc_p2p.json 

{
    "connections": [
        {
            "devices": [0,1,2,3],
            "type": "p2p"
        }
    ],
    "partitions": [
        {
            "name": "Partition0",
            "devices": [
                {
                    "deviceId": 0,
                    "numCores": 16
                },
                {
                    "deviceId": 1,
                    "numCores": 16
                },
                {
                    "deviceId": 2,
                    "numCores": 16
                },
                {
                    "deviceId": 3,
                    "numCores": 16
                }
            ]
        }
    ]
}

Example 2: Model tensor sliced across 2 SoCs with communication through the host between the SoCs. If no connections are defined, the connectivity is assumed to be through the host.

mdp_2soc_host.json 

{
    "partitions": [
        {
            "name": "Partition0",
            "devices": [
                {
                    "deviceId": 0,
                    "numCores": 16
                },
                {
                    "deviceId": 1,
                    "numCores": 16
                }
            ]
        }
    ]
}
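
Writing the partition config by hand gets tedious for larger SoC counts. Below is a minimal shell sketch that generates an N-SoC P2P config in the same shape as the examples above; the script, its defaults, and the output file name are illustrative, not part of the SDK.

#!/bin/bash
# Usage: ./gen_mdp.sh [num_socs] [cores_per_soc]
N=${1:-8}          # number of SoCs (e.g., 8 = two Ultra cards)
CORES=${2:-16}     # NSP cores per SoC

# "0,1,...,N-1" device list for the p2p connection
devices=$(seq -s ',' 0 $((N-1)))

# One device entry per SoC; strip the trailing comma on the last line
parts=$(for i in $(seq 0 $((N-1))); do
    printf '                { "deviceId": %d, "numCores": %d },\n' "$i" "$CORES"
done | sed '$ s/,$//')

cat > "mdp_${N}soc_p2p.json" <<EOF
{
    "connections": [
        { "devices": [${devices}], "type": "p2p" }
    ],
    "partitions": [
        {
            "name": "Partition0",
            "devices": [
${parts}
            ]
        }
    ]
}
EOF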

To compile the model with the tensor-sliced configuration, pass the device partitioning config file to qaic-exec using the -mdp-load-partition-config flag as shown below.

/opt/qti-aic/exec/qaic-exec \
    -m=$model_path \
    -aic-hw \
    -aic-hw-version=2.0 \
    -network-specialization-config=specializations.json \
    -retained-state \
    -convert-to-fp16 \
    -mxfp6-matmul \
    -aic-num-cores=${CORES} \
    -custom-IO-list-file=${model_name}/custom_io.yaml \
    -compile-only \
    -aic-binary-dir=qpc/${model_name}-${BS}bs-${PL}pl-${CL}cl-${CORES}c-${SOCS}soc-mxfp6 \
    -mdp-load-partition-config=mdp.json

Where:

  • CORES is the number of NSP cores per AI 100 SoC, typically 16
  • BS is batch size
  • PL is the prompt length
  • CL is the context length
  • SOCS is the number of AI 100 SoCs (4 per Ultra Accelerator, or 1 per Std/Pro Accelerator)
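
As a usage sketch, the variables might be set as follows before invoking the command above (the model name and paths are placeholders):

model_name=example-model                # placeholder model directory
model_path=${model_name}/model.onnx     # placeholder ONNX export
CORES=16   # NSP cores per SoC
BS=1       # batch size
PL=128     # prompt length
CL=2048    # context length
SOCS=4     # must match the devices in mdp.json (e.g., one Ultra card)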

Execution

Refer to the Cloud AI SDK examples for executing inference across multiple SoCs.

Recommendations

For very large models compiled for inter-SoC communication through the host, the host memory requirements can be large. If inference fails due to host or device resource exhaustion, try the options below.

  • Increase system memory (RAM) to 1TB and CPU count to 32 cores or higher.
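
A quick check that the host meets these recommendations:

free -h    # total system memory
nproc      # online CPU count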