This tutorial provides a step-by-step guide for running
supervised fine-tuning (SFT) on a single v6e-8
Tensor Processing Unit (TPU) virtual machine (VM)
instance on Google Cloud by using MaxText, a high-performance JAX-based training
stack for large language models (LLMs).
Objectives
- Set up a Cloud TPU VM instance.
- Install MaxText and its dependencies.
- Convert a Hugging Face model to MaxText format.
- Run an SFT training workload on the TPU.
- Convert the fine-tuned model back to Hugging Face format for serving.
Costs
In this document, you use Cloud TPU, which is a billable component of Google Cloud.
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
You need a Hugging Face access token to use this tutorial. You can sign up for a free account at Hugging Face. Once you have an account, generate an access token:
- On the Welcome to Hugging Face page, click your account avatar and select Access tokens.
- On the Access tokens page, click Create new token.
- Select the Read token type and enter a name for your token.
- Your access token is displayed. Save the token in a safe place.
- On the Hugging Face website, accept the license agreement for the model that you plan to train. This tutorial uses the gemma3-4b model.
To get the permissions that you need to complete this tutorial, ask your administrator to grant you the following IAM roles on your project:
- TPU Admin (roles/tpu.admin)
- Service Account User (roles/iam.serviceAccountUser)
- Compute Editor (roles/compute.editor)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Set up the environment
Set the environment variables that the commands in this tutorial use. Supply the following values:
- YOUR_PROJECT_ID: your Google Cloud project ID
- ZONE_NAME: the zone that you want to use
- RESERVATION_NAME: your capacity reservation
- TPU_MACHINE_NAME: the name of your Cloud TPU VM instance
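A minimal sketch of how these values might be assigned follows. The variable names (PROJECT_ID, ZONE, RESERVATION_NAME, TPU_NAME) are assumptions chosen for illustration; only the placeholder values come from the list above.

```shell
# Sketch only: the variable names are illustrative assumptions.
# Replace each placeholder value before you run the exports.
export PROJECT_ID="YOUR_PROJECT_ID"          # your Google Cloud project ID
export ZONE="ZONE_NAME"                      # the zone that you want to use
export RESERVATION_NAME="RESERVATION_NAME"   # your capacity reservation
export TPU_NAME="TPU_MACHINE_NAME"           # name of your Cloud TPU VM instance
```

Exporting the values once keeps later commands free of hard-coded project details.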
Authenticate with Google Cloud by running the following command:
gcloud auth login
Create your Cloud TPU VM
Create a Cloud TPU VM instance with 8 v6e TPU chips, bound to your capacity
reservation.
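One way to issue this request is with gcloud compute tpus tpu-vm create, sketched below. The variable names (TPU_NAME, PROJECT_ID, ZONE) and the v2-alpha-tpuv6e runtime version are assumptions. The --reserved flag requests reserved capacity; selecting a specific named reservation may need an additional flag, so check the gcloud reference for your release. The command is wrapped in a function so that nothing is created until you call create_tpu_vm.

```shell
# Sketch only: variable names and the runtime version are assumptions.
# Each ${VAR:?} expansion stops with an error if the variable is unset.
create_tpu_vm() {
  gcloud compute tpus tpu-vm create "${TPU_NAME:?set TPU_NAME first}" \
    --project="${PROJECT_ID:?set PROJECT_ID first}" \
    --zone="${ZONE:?set ZONE first}" \
    --accelerator-type=v6e-8 \
    --version=v2-alpha-tpuv6e \
    --reserved
}
```

After you set the three variables, run create_tpu_vm to send the request.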
After the VM instance has been created, connect to it by using SSH.
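A possible way to open the SSH session is gcloud compute tpus tpu-vm ssh, as in this sketch. The variable names are assumptions, and the function wrapper only keeps the command from running until you call connect_to_tpu_vm.

```shell
# Sketch only: variable names are assumptions.
# Opens an interactive SSH session to the TPU VM when called.
connect_to_tpu_vm() {
  gcloud compute tpus tpu-vm ssh "${TPU_NAME:?set TPU_NAME first}" \
    --project="${PROJECT_ID:?set PROJECT_ID first}" \
    --zone="${ZONE:?set ZONE first}"
}
```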
Complete the steps that follow within your TPU VM instance.
Install MaxText
Update the system packages within the TPU VM instance.
Install Python 3.12, which MaxText requires, and its virtual environment package.
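The two steps above might look like the following sketch. It assumes a Debian or Ubuntu based image whose configured repositories provide the python3.12 and python3.12-venv packages; if they do not, Python 3.12 has to come from another source. The function name is hypothetical.

```shell
# Sketch only: assumes apt-get and repositories that carry python3.12.
update_and_install_python() {
  # Refresh the package index (add "sudo apt-get upgrade -y" to upgrade packages).
  sudo apt-get update &&
  # Install Python 3.12 and its virtual environment package.
  sudo apt-get install -y python3.12 python3.12-venv
}
```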
Install uv to speed up the installation of Python packages.
Create a virtual environment named maxtext_venv and activate it.
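As a rough sketch, the environment can be created with the standard venv module and uv installed with pip. Installing uv inside the environment after activating it is a choice made here, not something the tutorial prescribes, and the function name is hypothetical.

```shell
# Sketch only: assumes python3.12 with the venv module is already installed.
# Call the function directly (not in a subshell) so that the activation
# applies to your current shell.
create_maxtext_venv() {
  python3.12 -m venv maxtext_venv &&
  . maxtext_venv/bin/activate &&
  python -m pip install uv
}
```

After activation, later installs can go through uv pip install instead of pip.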
Install MaxText and the dependencies that it requires for post-training tasks.
Install the remaining required dependencies.
Convert the model to MaxText format
MaxText trains from checkpoints in its own format, so you must first convert the model from Hugging Face format to MaxText format.
Specify your environment variables, such as your Hugging Face access token, the name of the model that you want to use, and the directory where you want to save the model in MaxText format.
Replace YOUR_HF_TOKEN with the Hugging Face access token that you previously created.
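These settings could be written as in the sketch below. The variable names and the output directory under /dev/shm are hypothetical; only the token placeholder and the gemma3-4b model name come from this tutorial.

```shell
# Sketch only: variable names and the directory layout are assumptions.
export HF_TOKEN="YOUR_HF_TOKEN"                          # Hugging Face access token
export MODEL_NAME="gemma3-4b"                            # model used in this tutorial
export MODEL_CKPT_DIR="/dev/shm/${MODEL_NAME}/maxtext"   # hypothetical output directory
```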
Run the MaxText conversion script to convert the model from Hugging Face format to MaxText format. The conversion takes about five minutes to complete.
Start the training workload
After the conversion process has completed, you can start the SFT workload.
Configure the SFT workload training parameters.
Start the training job. This takes about 10 minutes on a
v6e-8 VM instance.
Convert the trained model back into Hugging Face format
After the training workload has completed, convert the model back to Hugging Face format.
Set the paths for export and the trained parameters.
Run the conversion back to Hugging Face format.
After the conversion has completed, your tuned model, which is stored in
/dev/shm/gemma3-4b/hf-trained, is ready to use. Because /dev/shm is an
in-memory file system whose contents are lost when the VM reboots, move the
tuned model to persistent storage or upload it to the Hugging Face Hub.
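For example, the model directory could be copied to a Cloud Storage bucket with gcloud storage cp, as sketched here. The BUCKET_NAME variable and the destination prefix are hypothetical, and the sketch assumes that the bucket already exists and that the VM's credentials can write to it.

```shell
# Sketch only: BUCKET_NAME and the destination prefix are assumptions.
# Copies the tuned model directory recursively to Cloud Storage when called.
save_tuned_model() {
  gcloud storage cp --recursive /dev/shm/gemma3-4b/hf-trained \
    "gs://${BUCKET_NAME:?set BUCKET_NAME first}/gemma3-4b/"
}
```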
Clean up
To avoid incurring additional charges, delete the resources created during this tutorial.
Delete your TPU VM instance
Exit your Cloud TPU VM instance, and then delete it.
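The deletion might be done with gcloud compute tpus tpu-vm delete, as in this sketch, run from your local machine after you exit the SSH session. The variable names are assumptions; the command is left without --quiet so that gcloud still asks for confirmation.

```shell
# Sketch only: variable names are assumptions.
# Deletes the TPU VM when called; gcloud prompts for confirmation.
delete_tpu_vm() {
  gcloud compute tpus tpu-vm delete "${TPU_NAME:?set TPU_NAME first}" \
    --project="${PROJECT_ID:?set PROJECT_ID first}" \
    --zone="${ZONE:?set ZONE first}"
}
```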
What's next
- For more information about Cloud TPU, see Introduction to Cloud TPU.
- For architecture and configuration details for the
v6e-8 TPU, see TPU v6e.