Kubeflow Trainer Quick Start
A minimal distributed PyTorch training setup on Alauda AI using Kubeflow Trainer v2: a custom runtime image, a ClusterTrainingRuntime, and an MNIST example notebook.
Runtime image
Use the prebuilt image alaudadockerhub/torch-distributed:v2.9.1-aml2, or build your own image from torch_distributed.Containerfile.
ClusterTrainingRuntime
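Apply kf-torch-distributed.yaml as a cluster administrator, since ClusterTrainingRuntime is a cluster-scoped resource. The pod spec sets a restrictive security context so that its pods are admitted under Alauda AI's default Pod Security Admission (PSA) policy.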
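For orientation, here is a minimal Python sketch of what a PSA-friendly ClusterTrainingRuntime can look like. It is not the content of kf-torch-distributed.yaml. The runtime name torch-distributed is a hypothetical placeholder, the field layout is modeled on the upstream torch-distributed runtime and should be checked against the CRD installed in your cluster, and the security context shown is the one required by the Kubernetes "restricted" Pod Security Standard, which may be stricter than what Alauda AI's default actually demands.

```python
import json

# Assumptions: the runtime name is a hypothetical placeholder; the image is the
# prebuilt one named in this guide.
RUNTIME_NAME = "torch-distributed"
IMAGE = "alaudadockerhub/torch-distributed:v2.9.1-aml2"

# Container settings required by the Kubernetes "restricted" Pod Security Standard.
RESTRICTED_SECURITY_CONTEXT = {
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}


def build_runtime(name=RUNTIME_NAME, image=IMAGE):
    """Return a ClusterTrainingRuntime manifest as a plain dict."""
    container = {
        "name": "node",
        "image": image,
        # Copy so callers can modify one manifest without touching the constant.
        "securityContext": json.loads(json.dumps(RESTRICTED_SECURITY_CONTEXT)),
    }
    replicated_job = {
        "name": "node",
        "template": {
            # Label used by upstream runtimes to mark the trainer step (assumed here).
            "metadata": {
                "labels": {"trainer.kubeflow.org/trainjob-ancestor-step": "trainer"}
            },
            "spec": {"template": {"spec": {"containers": [container]}}},
        },
    }
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "ClusterTrainingRuntime",
        "metadata": {
            "name": name,
            "labels": {"trainer.kubeflow.org/framework": "torch"},
        },
        "spec": {
            "mlPolicy": {"numNodes": 1, "torch": {"numProcPerNode": "auto"}},
            "template": {"spec": {"replicatedJobs": [replicated_job]}},
        },
    }
```

Writing json.dumps(build_runtime(), indent=2) to a file gives a manifest that kubectl apply -f accepts, because kubectl reads JSON as well as YAML. Building the manifest in code is only a convenience for inspecting the structure; the YAML file referenced above remains the source of truth.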
Run the example notebook
The notebook installs Python packages and downloads MNIST, so the workbench needs outbound network access.
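A quick way to check outbound access from a workbench terminal or notebook cell is to attempt a TCP connection to the hosts involved. The sketch below is an illustration under assumptions: can_reach and check_hosts are hypothetical helper names, and the host list in the usage note covers only the public PyPI endpoints. The host that serves MNIST depends on the notebook, so add it yourself.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_hosts(hosts, port=443, timeout=5.0):
    """Map each host name to whether it is reachable on the given port."""
    return {host: can_reach(host, port, timeout) for host in hosts}
```

For example, check_hosts(["pypi.org", "files.pythonhosted.org"]) reports whether the default package index is reachable. If the workbench reaches the internet only through an HTTP proxy, this direct connection test can fail even though pip works, so treat a False result as a hint rather than a verdict.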
Download kubeflow-trainer-mnist.ipynb and upload it to your workbench, then follow it to submit the TrainJob.
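To show roughly what submitting a TrainJob amounts to, the sketch below builds a TrainJob manifest that references a ClusterTrainingRuntime. It is not taken from the notebook, which may use the Kubeflow Python SDK instead. The job name, namespace, runtime name, training command, and resource figures are all hypothetical placeholders, and the field names should be verified against the TrainJob CRD in your cluster.

```python
import json


def build_trainjob(name, namespace, runtime_name, num_nodes=2):
    """Return a TrainJob manifest (as a dict) that references a cluster runtime."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Points the job at the ClusterTrainingRuntime applied earlier.
            "runtimeRef": {
                "apiGroup": "trainer.kubeflow.org",
                "kind": "ClusterTrainingRuntime",
                "name": runtime_name,
            },
            "trainer": {
                "numNodes": num_nodes,
                # Hypothetical entry point; replace with your own script.
                "command": ["torchrun", "/workspace/train_mnist.py"],
                # Placeholder sizing; adjust to your cluster.
                "resourcesPerNode": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                },
            },
        },
    }


# Hypothetical names, for illustration only.
manifest = build_trainjob("mnist-demo", "my-namespace", "torch-distributed")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f. Keeping the runtime reference separate from the trainer settings mirrors the Trainer v2 split between a platform-managed runtime and a per-job override.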
For background on Trainer v2 features, see the upstream Kubeflow Trainer docs.