With Managed Training, you can visualize your training logs in near real-time using Vertex AI TensorBoard. Simply configure your workload to save logs to a Cloud Storage bucket, and they will be automatically streamed to the TensorBoard interface for analysis.
Prerequisites
Before you begin, ensure you have the following:
- A running Managed Training cluster.
- A Cloud Storage bucket to store your TensorBoard logs. This bucket must be in the same region as your TensorBoard instance. For setup instructions, see Create a Cloud Storage bucket.
- A Vertex AI TensorBoard instance. For creation instructions, see Create a Vertex AI TensorBoard instance.
- The correct IAM permissions. To allow Cloud Storage FUSE to read from and write to the storage bucket, the service account used by your cluster's VMs requires the Storage Object User (roles/storage.objectUser) role. You can grant this role with a command like the one shown after this list.
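One way to grant the role is at the bucket level with the Google Cloud CLI. In the following sketch, BUCKET_NAME and SERVICE_ACCOUNT_EMAIL are placeholders for your own bucket and your cluster's service account:

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/storage.objectUser"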
Enabling TensorBoard upload
To configure the TensorBoard integration for your job, pass the following
arguments using the --extra flag in your Slurm job submission:
- tensorboard_base_output_dir: Specifies the Cloud Storage path to upload logs to. For example, gs://my-bucket/my-logs.
- tensorboard_url: Specifies the Vertex AI TensorBoard instance, experiment, or run URL. If only an instance is provided, a new experiment and run are created. If omitted, the default TensorBoard instance for the project is used. For example, projects/123/locations/us-central1/tensorboards/456.
Example
# Using specific tensorboard instance
sbatch --extra="tensorboard_base_output_dir=<your-cloud-storage-dir>,tensorboard_url=projects/<project-id>/locations/<location>/tensorboards/<tensorboard-instance-id>" your_script.sbatch
Writing logs from your training job
Within your training script, access the AIP_TENSORBOARD_LOG_DIR environment
variable. This variable provides the unique Cloud Storage path where your
script should write its TensorBoard logs.
The path follows this structure:
gs://<your-cloud-storage-path>/<cluster-id>-<cluster-uuid>/tensorboard/job-<job-id>/
The following example shows a complete workflow with two key components: the Slurm submission script that configures the job, and the Python training script that reads the environment variable to write its logs.
Slurm Job Script (simple_job.sbatch):
#!/bin/bash
#SBATCH --job-name=tensorboard-simple-test
#SBATCH --output=tensorboard-simple-test-%j.out
# Activate your Python virtual environment if needed
# source /path/to/your/venv/bin/activate
python3 simple_logger.py
Python Script (simple_logger.py):
import tensorflow as tf
import os
# Get the log directory from the environment variable
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
print(f"Writing TensorBoard logs to: {log_dir}")
writer = tf.summary.create_file_writer(log_dir)
with writer.as_default():
    for step in range(10):
        # Simulate some metrics
        loss = 1.0 - (step * 0.1)
        accuracy = 0.6 + (step * 0.04)
        # Log the metrics
        tf.summary.scalar('loss', loss, step=step)
        tf.summary.scalar('accuracy', accuracy, step=step)
        writer.flush()
        print(f"Step {step}: loss={loss:.4f}, accuracy={accuracy:.4f}")
writer.close()
print(f"--- Finished writing metrics to {log_dir} ---")
Real-time Log Synchronization
To visualize metrics from a running job, you must periodically close and
recreate the summary writer in your training code. This is necessary because
gcsfuse only syncs log files to Cloud Storage once they are closed. This
close-and-recreate pattern ensures that intermediate results are visible in
the TensorBoard console before the job completes.
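The following sketch shows one way to apply this pattern. It reuses the AIP_TENSORBOARD_LOG_DIR setup from the example above; the step count and the FLUSH_EVERY interval are illustrative placeholders, not values required by Managed Training.

import os
import tensorflow as tf

log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
FLUSH_EVERY = 50  # illustrative interval; tune for your workload

writer = tf.summary.create_file_writer(log_dir)
for step in range(500):
    # Simulate a training step that produces a metric.
    loss = 1.0 / (step + 1)
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)
    # Closing the writer finalizes the current event file so gcsfuse uploads it;
    # a new writer then continues logging to the same directory.
    if (step + 1) % FLUSH_EVERY == 0:
        writer.close()
        writer = tf.summary.create_file_writer(log_dir)
writer.close()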
Viewing Vertex AI TensorBoard
Once your job is submitted, you can monitor its progress on the Vertex AI Experiments page in the Google Cloud console.