Finetune Infra
This repository provides the infrastructure for finetuning models using different hardware accelerators such as QAIC. The same CLI can be used to run finetuning on GPU by changing the value of the device flag (for finetuning on GPU, install the CUDA-specific build of torch).
Installation
Installation is the same as for QEfficient, along with QAIC PyTorch eager mode.
For the QEfficient library: https://github.com/quic/efficient-transformers
For torch_qaic, assuming QEfficient is already installed:
pip install /opt/qti-aic/integrations/torch_qaic/py310/torch_qaic-0.1.0-cp310-cp310-linux_x86_64.whl
If the qeff-env inside Docker is used, then the torch_qaic and accelerate packages are already installed.
Finetuning
Export the ENV variable below to download and enable private datasets:
export HF_DATASETS_TRUST_REMOTE_CODE=True
Export the ENV variables below to get device and HW traces and debugging logs:
export QAIC_DEVICE_LOG_LEVEL=0 # For Device level logs
export QAIC_DEBUG=1 # To understand the CPU fallback ops
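For example, a typical shell session might set all of these before launching a run (the launch command is the single-SOC example from the Usage section below):

```bash
# Enable remote-code datasets and turn on device/debug logging for this shell.
export HF_DATASETS_TRUST_REMOTE_CODE=True
export QAIC_DEVICE_LOG_LEVEL=0   # Device level logs
export QAIC_DEBUG=1              # To understand the CPU fallback ops

# Launch finetuning (same command as in the Usage section).
python -m QEfficient.cloud.finetune --device qaic:0 --model_name "meta-llama/Llama-3.2-1B"
```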
Dataset Details
To download the Alpaca dataset, use the command below. Place the dataset under the dataset directory and make sure to update the training configuration accordingly.
wget -c https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json -P dataset/
To download the grammar dataset, visit this link. Place the dataset under the datasets_grammar directory and make sure to update the training configuration accordingly.
Usage
Single SOC finetuning on QAIC
python -m QEfficient.cloud.finetune --device qaic:0 --model_name "meta-llama/Llama-3.2-1B"
You can also configure various training parameters. Below is an example command line:
python -m QEfficient.cloud.finetune --device qaic:0 --use-peft --output_dir ./meta-sam --num_epochs 2 --context_length 256
For more details on the usage of the training parameters, use the command below:
python -m QEfficient.cloud.finetune -h
Distributed training (DDP) on QAIC
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 4 -m QEfficient.cloud.finetune --device qaic --enable_ddp --num_epochs 2 --model_name "meta-llama/Llama-3.2-1B"
Note: nproc-per-node is the number of workers (QAIC devices) running locally.
Visualization
TensorBoard logs are generated inside the runs/ directory with a date and time stamp. To visualize the data:
tensorboard --logdir runs/<file> --bind_all
Some features/functionalities of the fine-tuning stack:
1) Gradient accumulation: By default, gradient accumulation happens for 4 steps. To update this value, pass the command line argument gradient_accumulation_steps. (Example: '--gradient_accumulation_steps 8')
2) Gradient checkpointing: By default, gradient checkpointing is disabled. To enable it, pass the command line argument gradient_checkpointing. An example combining both options is shown below.
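As a sketch, both options can be combined on the command line. The gradient_checkpointing argument is shown as a bare switch here, which is an assumption; check python -m QEfficient.cloud.finetune -h for the exact form it expects.

```bash
# Accumulate gradients over 8 steps and enable gradient checkpointing
# (bare-flag form assumed; verify with the -h output).
python -m QEfficient.cloud.finetune --device qaic:0 --model_name "meta-llama/Llama-3.2-1B" \
    --gradient_accumulation_steps 8 --gradient_checkpointing
```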
Steps to Fine-Tune with a Custom Dataset
Launching Fine-Tuning with a Custom Dataset
Use the following command-line arguments to begin fine-tuning using a custom dataset:
--dataset custom_dataset --dataset_config data_config.json
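For example, a full launch line combining these arguments with the single-SOC command from the Usage section might look like this (the config filename is the sample used below):

```bash
python -m QEfficient.cloud.finetune --device qaic:0 --model_name "meta-llama/Llama-3.2-1B" \
    --dataset custom_dataset --dataset_config data_config.json
```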
The --dataset_config argument is mandatory when --dataset custom_dataset is specified. The data_config.json file contains essential parameters used during dataset preprocessing.

Example data_config.json file:

```json
{
    "train_split": "train",
    "test_split": "test",
    "test_split_ratio": 0.15,
    "preproc_file": "sample_dataset_preproc.py:preprocessing_fn",
    "collate_file": "sample_dataset_preproc.py:data_collate_fn",
    "disc_style": "sarcasm_more"
}
```
Specifying the Preprocessing Function
In data_config.json, include the mandatory "preproc_file" key to define the path to your preprocessing Python file and the function within it. Use the format "filename.py:function_name"; the filename and the function name are both required. Example: "preproc_file": "sample_dataset_preproc.py:preprocessing_fn"

The preprocessing function must follow the structure below. The function parameters and the return type of the function should not be altered. The sample illustrates apply_prompt_template and tokenize as sub-functions, but you can define your own sub-functions as needed. For reference, check the example files in the ./QEfficient/finetune/dataset/ directory.

```python
def preprocessing_fn(dataset_config, tokenizer, split, context_length=None):
    # Load the dataset or read from the disk
    # ...
    # Split the dataset into train and test splits if needed,
    # and use the appropriate split based on the 'split' argument.
    # ...

    def apply_prompt_template(example):
        # Apply prompt formatting to each datapoint (e.g., example)
        # ...
        return example  # Return the processed example

    def tokenize(example):
        # Tokenize the formatted datapoint (e.g., example)
        # ...
        return tokenizer(example["text"], truncation=True, max_length=context_length)  # Example tokenization

    # Apply the prompt template to preprocess the data in accordance with the dataset and task.
    dataset = dataset.map(apply_prompt_template, ...)

    # Finally, tokenize the dataset
    dataset = dataset.map(tokenize, batched=True, remove_columns=['text'])  # Example batched tokenization

    # Each sample in the dataset should have keys acceptable by the HF
    # model and the loss function.
    # Typically, for CausalLM models used with 'generation' task_mode,
    # the keys should be 'input_ids', 'attention_mask', and 'labels'.
    return dataset
```
In the sample preprocessing function above, the split variable takes its value from data_config.json. For the training dataset, the value is taken from the "train_split" key, and for the evaluation/test dataset, it is taken from the "test_split" key.

Additional arguments needed for the preprocessing function can be passed in data_config.json and are available via the dataset_config variable within the function. For instance, in the sample config above, the "test_split_ratio" and "disc_style" keys can be used in the preprocessing function to define the test split ratio and the style of the dataset. These values are accessed through the dataset_config variable. Check out the sample preprocessing file at ./QEfficient/finetune/dataset/custom_dataset/sample_dataset_preproc.py.
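As a minimal illustration (again assuming attribute-style access on dataset_config, consistent with the sketch above), those extra keys can be read inside the preprocessing function like this:

```python
# Hypothetical reads of the extra keys defined in the sample data_config.json.
test_split_ratio = getattr(dataset_config, "test_split_ratio", 0.15)
disc_style = getattr(dataset_config, "disc_style", None)  # e.g., "sarcasm_more"
```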
Custom Collate Function for Batching
When using a batch size greater than 1, you may need to override the default collate behavior (batching different samples together in a batch) by including a "collate_file" key in data_config.json. Use the same "file.py:function" format. If omitted, the default Hugging Face DataCollatorForSeq2Seq is typically used, which pads sequences to the longest length in the batch.

A custom collate function must follow the structure below. The function parameters and the return type of the function should not be altered:
```python
def get_data_collator(tokenizer):
    # Define and return a custom collate_fn here
    # ...
    # This function should take a list of samples and return a batch.
    # Example:
    # from transformers import DataCollatorForLanguageModeling
    # return DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```
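For instance, a minimal collator, mirroring the commented-out example above, can simply wrap a Hugging Face collator. DataCollatorForLanguageModeling with mlm=False pads each batch and derives causal-LM labels from input_ids (this assumes the tokenizer has a pad token configured):

```python
from transformers import DataCollatorForLanguageModeling

def get_data_collator(tokenizer):
    # Pads to the longest sequence in the batch and builds labels from
    # input_ids with pad positions ignored (mlm=False => causal LM).
    return DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```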