This guide shows you how to use the NVIDIA NeMo ecosystem on a Managed Training cluster for end-to-end generative AI model development. It provides step-by-step instructions for the following distinct but related workflows, each covered in its own dedicated section:
- NVIDIA NeMo: For foundational model development, follow these instructions to perform large-scale pre-training, continuous pre-training (CPT), and supervised fine-tuning (SFT).
- NVIDIA NeMo-RL: For model alignment and preference tuning, use this section to apply advanced techniques like Reinforcement Learning (RL) to align your model with human instructions and preferences.
Whether you're building a model from scratch or refining an existing one, this document guides you through setting up your environment, managing containerized jobs, and launching training scripts on the cluster.
NVIDIA NeMo
The NVIDIA NeMo framework is an end-to-end platform for building, customizing, and deploying generative AI models. This section of the guide is specifically for developers and researchers focused on the foundational stages of model development. It provides step-by-step instructions for using NeMo to perform large-scale pre-training, continuous pre-training (CPT), and supervised fine-tuning (SFT) on a Managed Training cluster.
This section provides the complete workflow for running a training job with the NeMo framework. The process is divided into two main parts: a one-time environment setup and the recurring steps for launching a job.
Set up your environment
Before launching a job, you need to prepare your environment by ensuring you have a container image and the necessary training scripts.
Prepare a container image
You have two options for the container image: use a prebuilt image (recommended) or build a custom one.
Use a prebuilt image (recommended)
Prebuilt container images are provided in the .squashfs format. Copy the
appropriate image for your region into your working directory.
# Example for the US region
gcloud storage cp gs://managed-containers-us/nemo_squashfs/nemo-20250721.sqsh .
Build a customized container (advanced)
Follow these steps only if the prebuilt containers don't meet your needs.
This procedure guides you through converting a custom container image into the
.squashfs format using enroot.
Step 1: Authenticate with Google Cloud.
Use the following commands to ensure that both your Google Cloud user account and the Docker registry where your image is hosted are authenticated:
gcloud auth login
gcloud auth configure-docker us-docker.pkg.dev
Step 2: Create the conversion script.
Create a file named enroot-convert.sh and add the following script
content. Before running this script, you must update the REMOTE_IMG and
LOCAL_IMG variables to point to your container image and your chosen output path.
#!/bin/bash
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --ntasks-per-node=1
# Run this script on the slurm login node:
# sbatch -N 1 enroot-convert.sh
set -x
set -e
# The remote docker image URI.
REMOTE_IMG="docker://us-docker.pkg.dev/{YOUR_CONTAINER_IMG_URI}:{YOUR_CONTAINER_IMAGE_TAG}"
# The local path where the imported enroot squashfs file will be written.
LOCAL_IMG="${HOME}/my_nemo.sqsh"
# The path to the enroot config file.
TMP_ENROOT_CONFIG_PATH="/tmp/\$(id -u --name)/config/enroot"
# Download the docker image to each node.
srun -l -N "${SLURM_NNODES}" \
bash -c "
mkdir -p ${TMP_ENROOT_CONFIG_PATH};
echo 'machine us-docker.pkg.dev login oauth2accesstoken password $(gcloud auth print-access-token)' > ${TMP_ENROOT_CONFIG_PATH}/.credentials;
rm -f ${LOCAL_IMG};
ENROOT_CONFIG_PATH=${TMP_ENROOT_CONFIG_PATH} ENROOT_MAX_PROCESSORS=$(( $(nproc) / 2 )) enroot import -o ${LOCAL_IMG} ${REMOTE_IMG};
"
Step 3: Run the script and verify the output.
Execute the script on the Slurm login node.
sbatch -N 1 enroot-convert.sh
After the job completes, find the conversion logs in a file named
slurm-<JOB_ID>.out and the final container image at the path you specified
for LOCAL_IMG.
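For example, a quick check from the login node might look like the following sketch; the job ID is a placeholder, and the output path matches the LOCAL_IMG value from the script above.
# Inspect the end of the conversion log and confirm the squashfs file exists.
tail slurm-<JOB_ID>.out
ls -lh ${HOME}/my_nemo.sqsh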
Download the training recipes
The training recipes are stored in a private googlesource.com repository.
To access them with the Git command line, you must first generate authentication
credentials.
Generate authentication credentials.
Visit the following URL and follow the on-screen instructions. This configures your local environment to authenticate with the repository. https://www.googlesource.com/new-password
Clone the repository.
After you've generated the credentials, run the following command to download the recipes.
git clone https://vertex-model-garden.googlesource.com/vertex-oss-training
Launch a training job
Once your environment is set up, you can launch a training job.
Step 1: Set environment variables
The following environment variables may be required for your job:
- The HF_TOKEN is required to download models and datasets from Hugging Face.
- The WANDB_API_KEY is required to use Weights & Biases for experiment analysis.
export HF_TOKEN=YOUR_HF_TOKEN
export WANDB_API_KEY=YOUR_WANDB_API_KEY
Step 2: Run the launch script
Navigate to your working directory and run the run.py script to start a
job. This example kicks off a demo training job with Llama 3.1-2b.
# Set the working directory
export WORK_DIR=$HOME/vertex-oss-training/nemo
cd $WORK_DIR
# Launch the training job
export NEMORUN_HOME=$WORK_DIR && \
python3 run.py -e slurm --slurm-type hcc-a3m --partition a3m \
-d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
-s pretrain/llama3p1_2b_pt.py -n 2 \
--experiment-name nemo-demo-run
Launch parameters
- --slurm-type is set based on the cluster type (for example, hcc-a3m, hcc-a3u, hcc-a4).
- --partition must be set to an available partition. You can check partition names with the sinfo command.
- The run.py script automatically mounts several directories to the Docker container, including --log-dir, --cache-dir, and --data-dir, if they are set (see the example after this list).
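For example, a launch that sets the partition explicitly and mounts a log and data directory might look like the following sketch; the partition name and the directory paths are illustrative placeholders.
# Illustrative launch with an explicit partition and mounted directories.
python3 run.py -e slurm --slurm-type hcc-a3u --partition a3u \
-d $WORK_DIR -i $WORK_DIR/nemo-demo.sqsh \
-s pretrain/llama3p1_2b_pt.py -n 2 \
--log-dir $WORK_DIR/logs --data-dir $WORK_DIR/data \
--experiment-name nemo-demo-run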
Monitoring job status and logs
After you launch the job, a status block is displayed:
Experiment Status for nemo-demo-run_1753123402
Task 0: nemo-demo-run
- Status: RUNNING
- Executor: SlurmExecutor on @localhost
- Job id: 75
- Local Directory: $NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run
The execution logs are written to the path shown in the Local Directory field
from the status output. For example, you can find the log files at a path
similar to this:
$NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run/<JOB_ID>.log
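To follow a running job, you can tail that log file. For example:
# Stream the training log for the job shown in the status block.
tail -f $NEMORUN_HOME/experiments/nemo-demo-run/nemo-demo-run_1753123402/nemo-demo-run/<JOB_ID>.log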
Common errors and solutions
This section describes common issues that may arise during job execution and provides recommended steps to resolve them.
Invalid partition error
By default, jobs attempt to launch on the general partition. If the general partition doesn't exist or isn't available, the job will fail with the following error:
sbatch: error: invalid partition specified: general
sbatch: error: Batch job submission failed: Invalid partition name specified
Solution:
Specify an available partition using the --partition or -p argument in your
launch command.
To see a list of available partitions, run the sinfo command on the
Slurm login node.
sinfo
The output shows the available partition names, such as a3u in this example:
| PARTITION | AVAIL | TIMELIMIT | NODES | STATE | NODELIST |
|---|---|---|---|---|---|
| a3u* | up | infinite | 2 | idle~ | alice-a3u-[2-3] |
| a3u* | up | infinite | 2 | idle | alice-a3u-[0-1] |
Tokenizer download error
You may encounter an OSError related to a cross-device link when the script attempts to download the GPT2 tokenizer:
OSError: [Errno 18] Invalid cross-device link: 'gpt2-vocab.json' -> '/root/.cache/torch/megatron/megatron-gpt-345m_vocab'
Solutions:
You have two options to resolve this issue:
- Option #1: Rerun the job. This error is often transient. Rerunning the job using the same --cache-dir may resolve the issue.
- Option #2: Manually download the tokenizer files. If rerunning the job fails, follow these steps (see the sketch after this list):
  - Download the following two files:
    - gpt2-vocab.json
    - gpt2-merges.txt
  - Move the downloaded files into the torch/megatron/ subdirectory within your cache directory (for example, <YOUR_CACHE_DIR>/torch/megatron/).
  - Rename the files as follows:
    - Rename gpt2-vocab.json to megatron-gpt-345m_vocab.
    - Rename gpt2-merges.txt to megatron-gpt-345m_merges.
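The following sketch shows one way to script Option #2 from the login node. The Hugging Face download URLs and the <YOUR_CACHE_DIR> placeholder are assumptions; adjust them to match your environment and the --cache-dir you pass to run.py.
# Assumed download source: the public gpt2 repository on Hugging Face.
CACHE_DIR=<YOUR_CACHE_DIR>   # the directory you pass as --cache-dir
mkdir -p ${CACHE_DIR}/torch/megatron
# Download the files directly to their renamed destinations.
wget -O ${CACHE_DIR}/torch/megatron/megatron-gpt-345m_vocab https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O ${CACHE_DIR}/torch/megatron/megatron-gpt-345m_merges https://huggingface.co/gpt2/resolve/main/merges.txt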
NVIDIA NeMo-RL
The NVIDIA NeMo-RL framework is designed to align large language models with human preferences and instructions. This section guides you through using NeMo-RL on a cluster to perform advanced alignment tasks, including supervised fine-tuning (SFT), preference-tuning (such as Direct Preference Optimization, or DPO), and Reinforcement Learning (RL).
The guide covers two primary workflows: running a standard batch training job and using the interactive development environment for debugging.
Prerequisites
Before you begin, create a cluster by following the instructions on the Create cluster page, or use an existing Managed Training cluster, if you have one.
Connect to the cluster login node
To connect to the cluster's login node, find the correct Google Cloud CLI command by navigating to the Compute Engine VM instances page in the Google Cloud console and clicking SSH > View Google Cloud CLI command. It will look similar to this:
ssh $USER_NAME@machine-addr
Example:
ssh $USER_NAME@nic0.sliua3m1-login-001.europe-north1-c.c.infinipod-shared-dev.internal.gcpnode.com
Use the prebuilt Docker image
Converted .sqsh files are provided for prebuilt container images. You can
select a container for your region and either set it directly as the container
image parameter or download it to the cluster's file system.
To set it directly as the container image parameter, use the following path, replacing <region> with your specific region (for example, europe, asia, us):
/gcs/managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh
To download the image to the cluster's Lustre storage, use the following command:
gcloud storage cp gs://managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh DESTINATION
Download code
To access the training recipes with the Git command line, first generate authentication credentials by visiting https://www.googlesource.com/new-password and following the on-screen instructions. Then download the recipes with the following commands:
cd $HOME
git clone https://vertex-model-garden.googlesource.com/vertex-oss-training
Launch jobs
Step 1: Set environment variables.
Set HF_TOKEN to pull models and data from Hugging Face, and set WANDB_API_KEY to use Weights & Biases for experiment analysis. Update these variables in the following file:
File to update: $HOME/vertex-oss-training/nemo_rl/configs/auth.sh
If you don't want to use Weights & Biases, set logger.wandb_enabled to False in your launch script.
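As a minimal sketch, the auth.sh file might contain export statements like the following; the placeholder values are yours to fill in.
# $HOME/vertex-oss-training/nemo_rl/configs/auth.sh (sketch)
export HF_TOKEN=YOUR_HF_TOKEN            # pulls models and data from Hugging Face
export WANDB_API_KEY=YOUR_WANDB_API_KEY  # enables Weights & Biases logging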
Step 2: Download or copy the container file to your launch folder.
Here are some examples.
gcloud storage cp \
gs://managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh \
$HOME/vertex-oss-training/nemo_rl/nemo_rl-h20250923.sqsh
# OR
cp /gcs/managed-containers-<region>/nemo_rl_squashfs/nemo_rl-h20250923.sqsh \
$HOME/vertex-oss-training/nemo_rl/nemo_rl-h20250923.sqsh
cd $HOME/vertex-oss-training/nemo_rl/
Step 3: Prepare or clone the NeMo-RL repository.
Create a clone of the NeMo-RL code if not already present. Note that you may
need to use git submodule update --init --recursive if you've already cloned
the repository without the --recursive flag.
git clone https://github.com/NVIDIA-NeMo/RL --recursive
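If you had already cloned the repository without the --recursive flag, you can fetch the submodules afterwards from inside the clone; the RL directory name assumes the default clone location.
# Fetch submodules for an existing clone made without --recursive.
cd RL
git submodule update --init --recursive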
Step 4: Launch the training job.
sbatch -N <num_nodes> launch.sh --cluster_type hcc-a3m --job_script algorithms/dpo.sh
Where:
- --cluster_type is set based on the cluster type:
  - A3-Mega: hcc-a3m
  - A3-Ultra: hcc-a3u
  - A4: hcc-a4
  - A3H: hcc-a3h
- --partition should be set to an available partition; you can use sinfo to check the Slurm partitions (see the example after this list).
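For example, a launch on an A3-Ultra cluster might look like the following sketch; the node count, partition name, and job script are illustrative, and --partition is passed to sbatch itself as a standard Slurm flag.
# Illustrative DPO launch on an A3-Ultra cluster.
sbatch -N 2 --partition a3u launch.sh --cluster_type hcc-a3u --job_script algorithms/dpo.sh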
After your job starts, a new directory named after its Slurm job ID is created in your current location. Inside it, you'll find all the logs and checkpoints belonging to this job. More precisely, it contains the following directories and files:
- checkpoints/ → This directory is mounted inside the NeMo-RL container and contains all of the checkpoints from the training.
- ray-logs/ → This directory contains the logs from the ray head and ray workers.
- nemo_rl_output.log → This file contains the Slurm logs from your submitted job.
- attach.sh (interactive jobs only) → This is a bash script which lets you attach to an interactive job. If your job is launched successfully, it might take a couple of minutes for this file to be created.
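For example, you can monitor a submitted job from the login node with standard shell commands, using the Slurm job ID as the directory name:
# Follow the Slurm output of the job and list its checkpoints.
tail -f <job_id>/nemo_rl_output.log
ls <job_id>/checkpoints/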
Development with NeMo-RL
Interactive Setup
Two options are available for quick interactive development with NeMo-RL.
nemorlinteractive
This is a straightforward helper command that lets you choose a GPU node from the cluster (say, node number 5) and then drops you into a running NeMo-RL container on your selected node. This command is helpful for single-node workflows.
To use nemorlinteractive, you need to follow these prerequisite steps:
- Provide all auth tokens you want loaded into the job (for example, HF and WandB) in the configs/auth.sh file.
- Set the CLUSTER_TYPE environment variable according to the following guideline:
  export CLUSTER_TYPE="hcc-a3m"  # if you have an A3-Mega cluster
  export CLUSTER_TYPE="hcc-a3u"  # if you have an A3-Ultra cluster
  export CLUSTER_TYPE="hcc-a4"   # if you have an A4 cluster
  export CLUSTER_TYPE="hcc-a3h"  # if you have an A3H cluster
- Import nemorlinteractive in your bash terminal by sourcing bash_utils.sh:
  source bash_utils.sh
- Run the nemorlinteractive command. For example:
  # Assuming you want to use compute node number 5.
  nemorlinteractive 5
Interactive launch
This option lets you run workloads interactively on multiple compute nodes. Interactive jobs are most suitable for debugging and verification use cases. These workloads reserve the nodes indefinitely until the developer decides that debugging has concluded and releases the resources (see the note after the steps below).
Follow these steps for this option:
1. Provide all auth tokens you want loaded into the job (for example, HF and WandB) in the configs/auth.sh file.
2. Launch the interactive job:
   sbatch -N <num_nodes> launch.sh --cluster_type hcc-a3m --interactive
3. Wait 2-5 minutes and you should see <job_id>/attach.sh created.
4. To monitor the launch, check <job_id>/nemo_rl_output.log for the progress of the launch script and <job_id>/ray_logs/ for the progress of the ray head and workers.
5. Connect to the interactive job. This script lets you connect again even if you lose connection:
   bash <job_id>/attach.sh
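When debugging is done, release the reserved nodes by cancelling the job. A minimal sketch using the standard Slurm scancel command:
# Cancel the interactive job and release its nodes.
scancel <job_id>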