# Distributed training
Tip

The code from this article is available in the `distributed_data.py` and `distributed_training.py` scripts shown below. Follow the instructions in the article to run the example.
## Overview
tinytopics >= 0.7.0 supports distributed training using Hugging Face Accelerate. This article demonstrates how to run distributed training on a single node with multiple GPUs.
The example utilizes Distributed Data Parallel (DDP) for distributed training. This approach assumes that the model parameters fit within the memory of a single GPU, because each GPU maintains a synchronized copy of the model; the input data, however, can exceed the memory of any single GPU. This is generally a reasonable assumption for topic modeling tasks, since the factorized matrices are far smaller than the count matrix being factorized. For example, with the 100k x 100k matrix and 20 topics used below, the count matrix holds 10^10 entries (about 40 GB at 32-bit precision), while the document-topic and topic-term matrices together hold only 4 million entries (about 16 MB).
Hugging Face Accelerate also supports other distributed training strategies such as Fully Sharded Data Parallel (FSDP) and DeepSpeed, which distribute model tensors across different GPUs and allow training larger models at the cost of speed.
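To make the DDP mechanics concrete, here is a minimal sketch of the generic Accelerate training pattern that distributed training in tinytopics builds on conceptually. It is not the tinytopics implementation: the model, data, and hyperparameters below are placeholders for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, standing in for the topic model and count matrix.
model = torch.nn.Linear(100, 20)
dataset = TensorDataset(torch.randn(1_000, 100), torch.randn(1_000, 20))
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accelerator = Accelerator()
# prepare() wraps the model in DDP, moves it to the local device, and shards
# the dataloader so each process works on a different slice of the data.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(5):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        # The backward pass triggers gradient synchronization across GPUs (handled by DDP).
        accelerator.backward(loss)
        optimizer.step()
```

Each GPU holds a full copy of the model and trains on its own shard of the batches; only gradients are communicated, which is why the model must fit on one GPU while the data does not have to.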
## Generate data
We will use a 100k x 100k count matrix with 20 topics for distributed training. To generate the example data, save the following code to `distributed_data.py` and run:
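```bash
python distributed_data.py
```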
```python
import os
import numpy as np
import tinytopics as tt


def main():
    n, m, k = 100_000, 100_000, 20
    data_path = "X.npy"

    if os.path.exists(data_path):
        print(f"Data already exists at {data_path}")
        return

    print("Generating synthetic data...")
    tt.set_random_seed(42)
    X, true_L, true_F = tt.generate_synthetic_data(
        n=n, m=m, k=k, avg_doc_length=256 * 256
    )

    print(f"Saving data to {data_path}")
    X_numpy = X.cpu().numpy()
    np.save(data_path, X_numpy)


if __name__ == "__main__":
    main()
```
Generating the data is time-consuming (about 10 minutes), so running it as a standalone script helps avoid potential timeout errors during distributed training. You can also execute it on an instance type suitable for your data ingestion pipeline, rather than using valuable GPU instance hours.
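If you want to sanity-check the saved file before spending GPU hours, one option (an illustrative snippet, not part of the article's scripts) is to open it memory-mapped, so the matrix, which is tens of gigabytes depending on dtype, is never read into RAM in full:

```python
import numpy as np

# mmap_mode="r" reads only the .npy header; array data stays on disk
# until specific elements are accessed.
X = np.load("X.npy", mmap_mode="r")
print(X.shape, X.dtype)  # expect (100000, 100000)
print(X[:5, :5])         # reads only a small block from disk
```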
## Run distributed training
First, configure the distributed environment by running:
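```bash
accelerate config
```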
You will be prompted to answer questions about the distributed training environment and strategy. The answers will be saved to a configuration file at:

`~/.cache/huggingface/accelerate/default_config.yaml`

You can rerun `accelerate config` at any time to update the configuration. For distributed data parallel on a 4-GPU node, select the single-node multi-GPU training options with the number of GPUs set to 4, and use the default settings for the remaining questions (mostly "no").
Next, save the following code to `distributed_training.py` and run:
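```bash
accelerate launch distributed_training.py
```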
```python
import os

from accelerate import Accelerator
from accelerate.utils import set_seed

import tinytopics as tt


def main():
    accelerator = Accelerator()
    set_seed(42)

    k = 20
    data_path = "X.npy"

    if not os.path.exists(data_path):
        raise FileNotFoundError(
            f"{data_path} not found. Run distributed_data.py first."
        )

    print(f"Loading data from {data_path}")
    X = tt.NumpyDiskDataset(data_path)

    # All processes should have the data before proceeding
    accelerator.wait_for_everyone()

    model, losses = tt.fit_model_distributed(X, k=k)

    # Only the main process should plot the loss
    if accelerator.is_main_process:
        tt.plot_loss(losses, output_file="loss.png")


if __name__ == "__main__":
    main()
```
This script uses `fit_model_distributed()` (added in tinytopics 0.7.0) to train the model. Since distributed training on large datasets can take considerably longer, `fit_model_distributed()` displays a more detailed progress bar for each epoch, advancing through all batches within the epoch.
## Sample runs
We ran the distributed training example on 1-GPU, 4-GPU, and 8-GPU setups with H100 and A100 GPUs. The table below shows the training time per epoch, total time, GPU utilization, VRAM usage, instance cost, and total cost.
| Metric | 1x H100 (80 GB SXM5) | 4x H100 (80 GB SXM5) | 8x A100 (40 GB SXM4) |
|---|---|---|---|
| Time per epoch (s) | 24 | 6 | 6 |
| Total time (min) | 80 | 20 | 20 |
| GPU utilization | 16% | 30-40% | 30-50% |
| VRAM usage | 1% | 4% | 4% |
| Instance cost (USD/h) | 3.29 | 12.36 | 10.32 |
| Total cost (USD) | 4.38 | 4.12 | 3.44 |
Using 4x H100 GPUs is approximately 4x faster than using 1x H100 GPU, with a slightly lower total cost. Using 8x A100 GPUs offers similar speed compared to 4x H100 GPUs, but at an even lower total cost due to the lower instance price.
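The total cost column follows directly from the hourly instance cost and the total training time; the short check below (a throwaway calculation, with numbers copied from the table above) reproduces it up to rounding:

```python
# Total cost (USD) = instance cost (USD/h) * total time (h)
setups = {
    "1x H100": (3.29, 80),   # (USD per hour, total minutes)
    "4x H100": (12.36, 20),
    "8x A100": (10.32, 20),
}
for name, (usd_per_hour, minutes) in setups.items():
    print(f"{name}: {usd_per_hour * minutes / 60:.2f} USD")
```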
The loss plot and real-time GPU utilization monitoring via `nvtop` on the 4x H100 GPU instance are shown below.
For more technical details on distributed training, please refer to the Hugging Face Accelerate documentation, as this article covers only the basics.