SAX on Cloud TPU v5e

SAX cluster (SAX cell)

The SAX admin server and the SAX model server are the two essential components of a SAX cluster.

SAX admin server

The SAX admin server monitors and coordinates all SAX model servers in a SAX cluster. You can launch multiple SAX admin servers in a cluster, but only one is active at a time; it is chosen through leader election, and the others act as standby servers. When the active admin server fails, a standby admin server becomes active. The active SAX admin server assigns model replicas and inference requests to available SAX model servers.
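The failover behavior described above can be pictured with a small toy model. This sketch is not how SAX implements leader election (this document does not describe that mechanism); the `AdminServer` and `AdminGroup` classes and the "first healthy server leads" rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AdminServer:
    name: str
    healthy: bool = True


@dataclass
class AdminGroup:
    """Toy model of several admin servers with at most one active leader."""
    servers: list = field(default_factory=list)

    def active(self):
        # Hypothetical election rule: the first healthy server in the list leads.
        for server in self.servers:
            if server.healthy:
                return server
        return None  # no healthy admin server is left


group = AdminGroup([AdminServer("admin-0"), AdminServer("admin-1")])
print(group.active().name)  # admin-0 leads while admin-1 stands by

group.servers[0].healthy = False  # simulate a failure of the active server
print(group.active().name)  # the standby server takes over
```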

SAX admin storage bucket

Each SAX cluster requires a Cloud Storage bucket to store the configurations and locations of SAX admin servers and SAX model servers in the SAX cluster.

SAX model server

The SAX model server loads a model checkpoint and runs inference with GSPMD. A SAX model server runs on a single TPU VM worker. Single-host TPU model serving requires a single SAX model server on a single-host TPU VM. Multi-host TPU model serving requires a group of SAX model servers on a multi-host TPU slice. Multi-host model serving is currently not available, but this document provides a preview example with a 175B test model.

SAX model serving

The following section walks through the workflow for serving language models using SAX. It uses the GPT-J 6B model as an example for single-host model serving.

Before starting, install the Cloud TPU SAX Docker images on your TPU VM:

sudo usermod -a -G docker ${USER}
newgrp docker

gcloud auth configure-docker us-docker.pkg.dev

SAX_ADMIN_SERVER_IMAGE_NAME="us-docker.pkg.dev/cloud-tpu-images/inference/sax-admin-server"
SAX_MODEL_SERVER_IMAGE_NAME="us-docker.pkg.dev/cloud-tpu-images/inference/sax-model-server"
SAX_UTIL_IMAGE_NAME="us-docker.pkg.dev/cloud-tpu-images/inference/sax-util"

SAX_VERSION=v1.0.0

export SAX_ADMIN_SERVER_IMAGE_URL=${SAX_ADMIN_SERVER_IMAGE_NAME}:${SAX_VERSION}
export SAX_MODEL_SERVER_IMAGE_URL=${SAX_MODEL_SERVER_IMAGE_NAME}:${SAX_VERSION}
export SAX_UTIL_IMAGE_URL=${SAX_UTIL_IMAGE_NAME}:${SAX_VERSION}

docker pull ${SAX_ADMIN_SERVER_IMAGE_URL}
docker pull ${SAX_MODEL_SERVER_IMAGE_URL}
docker pull ${SAX_UTIL_IMAGE_URL}

Set some other variables you will use later:

export SAX_ADMIN_SERVER_DOCKER_NAME="sax-admin-server"
export SAX_MODEL_SERVER_DOCKER_NAME="sax-model-server"
export SAX_CELL="/sax/test"
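
The commands in the rest of this guide rely on these exported shell variables. As an optional sanity check, a short Python sketch like the following could report variables that are unset, or image URLs whose tag is empty (which happens when the version variable is unset or misspelled). The helper names `missing_vars` and `malformed_image_urls` are hypothetical and are not part of SAX.

```python
import os

# Variables exported in the setup steps above.
REQUIRED_VARS = (
    "SAX_ADMIN_SERVER_IMAGE_URL",
    "SAX_MODEL_SERVER_IMAGE_URL",
    "SAX_UTIL_IMAGE_URL",
    "SAX_ADMIN_SERVER_DOCKER_NAME",
    "SAX_MODEL_SERVER_DOCKER_NAME",
    "SAX_CELL",
)


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def malformed_image_urls(env):
    """Return image URL variables whose value lacks a non-empty ':tag' suffix."""
    bad = []
    for name in REQUIRED_VARS:
        if name.endswith("_IMAGE_URL"):
            _, sep, tag = env.get(name, "").rpartition(":")
            if not sep or not tag:
                bad.append(name)
    return bad


print("missing:", missing_vars(os.environ))
print("malformed image URLs:", malformed_image_urls(os.environ))
```

Only exported variables are visible to the Python process, so run the check from the same shell session in which you ran the export commands.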

GPT-J 6B single-host model serving example

Single-host model serving applies to single-host TPU slices, that is, v5litepod-1, v5litepod-4, and v5litepod-8.
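
As an illustration of this rule, the following sketch classifies an accelerator type as single-host or multi-host. It encodes only what this document states (v5litepod-1, v5litepod-4, and v5litepod-8 are single-host; v5litepod-16 and above are multi-host); the function name `is_single_host` is hypothetical, and any other size is rejected rather than guessed.

```python
SINGLE_HOST_CHIP_COUNTS = {1, 4, 8}  # v5litepod-1, v5litepod-4, v5litepod-8


def is_single_host(accelerator_type: str) -> bool:
    """Return True for single-host v5e slices, False for multi-host ones."""
    prefix, _, chips = accelerator_type.rpartition("-")
    if prefix != "v5litepod" or not chips.isdigit():
        raise ValueError(f"unexpected accelerator type: {accelerator_type!r}")
    count = int(chips)
    if count in SINGLE_HOST_CHIP_COUNTS:
        return True
    if count >= 16:
        return False
    # Sizes not mentioned in this document are rejected instead of guessed.
    raise ValueError(f"unrecognized slice size: {accelerator_type!r}")


for name in ("v5litepod-1", "v5litepod-8", "v5litepod-32"):
    kind = "single-host" if is_single_host(name) else "multi-host"
    print(f"{name}: {kind}")
```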

Create a SAX cluster

Create a Cloud Storage bucket for the SAX cluster:

SAX_ADMIN_STORAGE_BUCKET=${your_admin_storage_bucket}

gcloud storage buckets create gs://${SAX_ADMIN_STORAGE_BUCKET} \
--project=${PROJECT_ID}

You might need another Cloud Storage bucket to store the checkpoint:

SAX_DATA_STORAGE_BUCKET=${your_data_storage_bucket}

SSH into your TPU VM in a terminal to launch the SAX admin server:

docker run \
--name ${SAX_ADMIN_SERVER_DOCKER_NAME} \
-it \
-d \
--rm \
--network host \
--env GSBUCKET=${SAX_ADMIN_STORAGE_BUCKET} \
${SAX_ADMIN_SERVER_IMAGE_URL}

You can check the Docker log with:

docker logs -f ${SAX_ADMIN_SERVER_DOCKER_NAME}

The output in the log will look similar to the following:

I0829 01:22:31.184198       7 config.go:111] Creating config fs_root: "gs://test_sax_admin/sax-fs-root"
I0829 01:22:31.347883       7 config.go:115] Created config fs_root: "gs://test_sax_admin/sax-fs-root"
I0829 01:22:31.360837      24 admin_server.go:44] Starting the server
I0829 01:22:31.361420      24 ipaddr.go:39] Skipping non-global IP address 127.0.0.1/8.
I0829 01:22:31.361455      24 ipaddr.go:39] Skipping non-global IP address ::1/128.
I0829 01:22:31.361462      24 ipaddr.go:39] Skipping non-global IP address fe80::4001:aff:fe8e:fc8/64.
I0829 01:22:31.361469      24 ipaddr.go:39] Skipping non-global IP address fe80::42:bfff:fef9:1bd3/64.
I0829 01:22:31.361474      24 ipaddr.go:39] Skipping non-global IP address fe80::20fb:c3ff:fe5b:baac/64.
I0829 01:22:31.361482      24 ipaddr.go:56] IPNet address 10.142.15.200
I0829 01:22:31.361488      24 ipaddr.go:56] IPNet address 172.17.0.1
I0829 01:22:31.456952      24 admin.go:305] Loaded config: fs_root: "gs://test_sax_admin/sax-fs-root"
I0829 01:22:31.609323      24 addr.go:105] SetAddr /gcs/test_sax_admin/sax-root/sax/test/location.proto "10.142.15.200:10000"
I0829 01:22:31.656021      24 admin.go:325] Updated config: fs_root: "gs://test_sax_admin/sax-fs-root"
I0829 01:22:31.773245      24 mgr.go:781] Loaded manager state
I0829 01:22:31.773260      24 mgr.go:784] Refreshing manager state every 10s
I0829 01:22:31.773285      24 admin.go:350] Starting the server on port 10000
I0829 01:22:31.773292      24 cloud.go:506] Starting the HTTP server on port 8080

Launch a single-host SAX model server into the SAX cluster

At this point, the SAX cluster contains only the SAX admin server. You can connect to your TPU VM over SSH in a second terminal to launch a SAX model server in your SAX cluster:

docker run \
    --privileged  \
    -it \
    -d \
    --rm \
    --network host \
    --name ${SAX_MODEL_SERVER_DOCKER_NAME} \
    --env SAX_ROOT=gs://${SAX_ADMIN_STORAGE_BUCKET}/sax-root \
    ${SAX_MODEL_SERVER_IMAGE_URL} \
       --sax_cell=${SAX_CELL} \
       --port=10001 \
       --platform_chip=tpuv4 \
       --platform_topology=1x1

Convert the model checkpoint

You need to install PyTorch and Transformers to download the GPT-J checkpoint from EleutherAI:

pip3 install accelerate
pip3 install torch
pip3 install transformers

To convert the checkpoint to a SAX checkpoint, you need to install paxml:

pip3 install paxml==1.1.0

The following script converts the GPT-J checkpoint to a SAX checkpoint:

python3 -m convert_gptj_ckpt --base EleutherAI/gpt-j-6b --pax pax_6b

After the conversion is done, list the contents of the checkpoint directory:

ls checkpoint_00000000/

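Set CHECKPOINT_PATH to the Cloud Storage destination for the checkpoint (for example, a path in the data storage bucket named earlier). Then copy the checkpoint there, create a commit_success.txt file, and place a copy in the checkpoint directory and in its metadata and state subdirectories: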

gcloud storage cp checkpoint_00000000 ${CHECKPOINT_PATH} --recursive

touch commit_success.txt
gcloud storage cp commit_success.txt ${CHECKPOINT_PATH}/
gcloud storage cp commit_success.txt ${CHECKPOINT_PATH}/metadata/
gcloud storage cp commit_success.txt ${CHECKPOINT_PATH}/state/
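
For reference, the three marker destinations used in the commands above can be derived from the checkpoint path. The sketch below only builds the destination URIs and uploads nothing; the function name `commit_marker_uris` and the bucket name are hypothetical placeholders.

```python
def commit_marker_uris(checkpoint_path, marker="commit_success.txt"):
    """Return the marker file URIs for the checkpoint root, metadata/, and state/."""
    base = checkpoint_path.rstrip("/")
    return [f"{base}/{sub}{marker}" for sub in ("", "metadata/", "state/")]


# Hypothetical bucket and path, for illustration only.
for uri in commit_marker_uris("gs://my-data-bucket/gptj/checkpoint_00000000"):
    print(uri)
```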

Publish the model to the SAX cluster

You can now publish GPT-J with the checkpoint converted in the previous step.

MODEL_NAME=gptjtokenizedbf16bs32
MODEL_CONFIG_PATH=saxml.server.pax.lm.params.gptj.GPTJ4TokenizedBF16BS32
REPLICA=1

To publish the GPT-J model (and to run the steps that follow), use SSH to connect to your TPU VM in a third terminal:

docker run \
 ${SAX_UTIL_IMAGE_URL} \
   --sax_root=gs://${SAX_ADMIN_STORAGE_BUCKET}/sax-root \
   publish \
     ${SAX_CELL}/${MODEL_NAME} \
     ${MODEL_CONFIG_PATH} \
     ${CHECKPOINT_PATH} \
     ${REPLICA}
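
If you prefer to script this step, the argument order of the publish command can be captured in a small helper. This sketch only assembles the command line shown above and does not run Docker; the function name `sax_publish_args` and the bucket names are hypothetical.

```python
import shlex


def sax_publish_args(util_image, admin_bucket, cell, model_name,
                     config_path, checkpoint_path, replicas=1):
    """Build the `docker run ... publish` argument list used in this guide."""
    return [
        "docker", "run", util_image,
        f"--sax_root=gs://{admin_bucket}/sax-root",
        "publish",
        f"{cell}/{model_name}",  # model ID: the SAX cell followed by the model name
        config_path,
        checkpoint_path,
        str(replicas),
    ]


args = sax_publish_args(
    util_image="us-docker.pkg.dev/cloud-tpu-images/inference/sax-util:v1.0.0",
    admin_bucket="my-admin-bucket",  # hypothetical bucket name
    cell="/sax/test",
    model_name="gptjtokenizedbf16bs32",
    config_path="saxml.server.pax.lm.params.gptj.GPTJ4TokenizedBF16BS32",
    checkpoint_path="gs://my-data-bucket/gptj/checkpoint_00000000",  # hypothetical
)
print(shlex.join(args))
```

To execute the command from Python, you could pass `args` to `subprocess.run(args, check=True)`.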

The model server Docker log shows a lot of activity until a line similar to the following indicates that the model has loaded successfully:

I0829 01:33:49.287459 139865140229696 servable_model.py:697] loading completed.
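
To detect this line automatically, you could scan the model server log for it. The following sketch assumes the log line layout shown in this document (level and date, time, thread ID, source location, message); the regular expression and the function name `model_loaded` are illustrative assumptions, not a SAX API.

```python
import re

# Matches lines such as:
# I0829 01:33:49.287459 139865140229696 servable_model.py:697] loading completed.
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<source>[^\]]+)\] (?P<message>.*)$"
)


def model_loaded(log_lines):
    """Return True if any log line carries the 'loading completed.' message."""
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("message") == "loading completed.":
            return True
    return False


sample = [
    "I0829 01:33:49.287459 139865140229696 servable_model.py:697] loading completed.",
]
print(model_loaded(sample))
```

You could feed it the output of `docker logs ${SAX_MODEL_SERVER_DOCKER_NAME}`, split into lines.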

Generate inference results

For GPT-J, input and output must be formatted as a comma-separated token ID string, so you need to tokenize the text input.

TEXT = (
    "Below is an instruction that describes a task, paired with "
    "an input that provides further context. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\nSummarize the "
    "following news article:\n\n### Input:\nMarch 10, 2015 . We're truly "
    "international in scope on Tuesday. We're visiting Italy, Russia, the "
    "United Arab Emirates, and the Himalayan Mountains. Find out who's "
    "attempting to circumnavigate the globe in a plane powered partially by the "
    "sun, and explore the mysterious appearance of craters in northern Asia. "
    "You'll also get a view of Mount Everest that was previously reserved for "
    "climbers. On this page you will find today's show Transcript and a place "
    "for you to request to be on the CNN Student News Roll Call. TRANSCRIPT . "
    "Click here to access the transcript of today's CNN Student News program. "
    "Please note that there may be a delay between the time when the video is "
    "available and when the transcript is published. CNN Student News is "
    "created by a team of journalists who consider the Common Core State "
    "Standards, national standards in different subject areas, and state "
    "standards when producing the show. ROLL CALL . For a chance to be "
    "mentioned on the next CNN Student News, comment on the bottom of this page "
    "with your school name, mascot, city and state. We will be selecting "
    "schools from the comments of the previous show. You must be a teacher or a "
    "student age 13 or older to request a mention on the CNN Student News Roll "
    "Call! Thank you for using CNN Student News!\n\n### Response:"
)

You can obtain the token ID string using the EleutherAI/gpt-j-6b tokenizer:

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6b")

Tokenize the input text:

encoded_example = tokenizer(TEXT)
input_ids = encoded_example.input_ids
INPUT_STR = ",".join([str(input_id) for input_id in input_ids])

You can expect a token ID string similar to the following:

>>> INPUT_STR
'21106,318,281,12064,326,8477,257,4876,11,20312,351,281,5128,326,3769,2252,4732,13,19430,257,2882,326,20431,32543,262,2581,13,198,198,21017,46486,25,198,13065,3876,1096,262,1708,1705,2708,25,198,198,21017,23412,25,198,16192,838,11,1853,764,775,821,4988,3230,287,8354,319,3431,13,775,821,10013,8031,11,3284,11,262,1578,4498,24880,11,290,262,42438,22931,21124,13,9938,503,508,338,9361,284,2498,4182,615,10055,262,13342,287,257,6614,13232,12387,416,262,4252,11,290,7301,262,11428,5585,286,1067,8605,287,7840,7229,13,921,1183,635,651,257,1570,286,5628,41336,326,373,4271,10395,329,39311,13,1550,428,2443,345,481,1064,1909,338,905,42978,290,257,1295,329,345,284,2581,284,307,319,262,8100,13613,3000,8299,4889,13,48213,6173,46023,764,6914,994,284,1895,262,14687,286,1909,338,8100,13613,3000,1430,13,4222,3465,326,612,743,307,257,5711,1022,262,640,618,262,2008,318,1695,290,618,262,14687,318,3199,13,8100,13613,3000,318,2727,416,257,1074,286,9046,508,2074,262,8070,7231,1812,20130,11,2260,5423,287,1180,2426,3006,11,290,1181,5423,618,9194,262,905,13,15107,3069,42815,764,1114,257,2863,284,307,4750,319,262,1306,8100,13613,3000,11,2912,319,262,4220,286,428,2443,351,534,1524,1438,11,37358,11,1748,290,1181,13,775,481,307,17246,4266,422,262,3651,286,262,2180,905,13,921,1276,307,257,4701,393,257,3710,2479,1511,393,4697,284,2581,257,3068,319,262,8100,13613,3000,8299,4889,0,6952,345,329,1262,8100,13613,3000,0,198,198,21017,18261,25'

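To generate a summary for your article, first make the token ID string available in your shell as INPUT_STR (it was computed in Python above), then run: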

docker run \
  ${SAX_UTIL_IMAGE_URL} \
    --sax_root=gs://${SAX_ADMIN_STORAGE_BUCKET}/sax-root \
    lm.generate \
      ${SAX_CELL}/${MODEL_NAME} \
      ${INPUT_STR}

You can expect something similar to:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
|                                                                                                                                                    GENERATE                                                                                                                                                    |    SCORE     |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| 1212,2443,3407,262,905,42978,764,198,11041,262,42978,284,1037,2444,351,3555,35915,290,25818,764,198,2953,262,4220,286,262,2443,11,2912,329,257,2863,284,307,4750,319,8100,13613,3000,13,220,921,1276,307,257,4701,393,257,3710,2479,1511,393,4697,284,2581,257,3068,319,262,8100,13613,3000,8299,4889,13,50256 | -0.023136413 |
| 1212,2443,3407,262,905,42978,764,198,11041,262,42978,284,1037,2444,351,3555,35915,290,25818,764,198,2953,262,4220,286,262,2443,11,2912,329,257,2863,284,307,4750,319,8100,13613,3000,13,220,921,1276,307,257,4701,393,257,3710,2479,1511,393,4697,284,2581,257,3068,319,262,8100,13613,3000,8299,4889,0,50256  |  -0.91842502 |
| 1212,2443,3407,262,905,42978,764,198,11041,262,42978,284,1037,2444,351,3555,35915,290,25818,764,198,2953,262,4220,286,262,2443,11,2912,329,257,2863,284,307,4750,319,8100,13613,3000,13,921,1276,307,257,4701,393,257,3710,2479,1511,393,4697,284,2581,257,3068,319,262,8100,13613,3000,8299,4889,13,50256     |   -1.1726116 |
| 1212,2443,3407,262,905,42978,764,198,11041,262,42978,284,1037,2444,351,3555,35915,290,25818,764,198,2953,262,4220,286,262,2443,11,2912,329,257,2863,284,307,4750,319,8100,13613,3000,13,220,921,1276,307,1511,393,4697,284,2581,257,3068,319,262,8100,13613,3000,8299,4889,13,50256                            |   -1.2472695 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
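
If you capture this output in a script, the rows can be parsed back into token IDs and scores. The sketch below assumes the two-column table layout shown above (a GENERATE column of comma-separated token IDs and a SCORE column); the function name `parse_generate_table` is hypothetical, and the sample table is abbreviated for illustration.

```python
def parse_generate_table(table_text):
    """Parse the text table into a list of (token_ids, score) tuples."""
    results = []
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # border lines and blank lines
        ids_cell, score_cell = cells
        try:
            score = float(score_cell)
            token_ids = [int(token) for token in ids_cell.split(",")]
        except ValueError:
            continue  # header row
        results.append((token_ids, score))
    return results


# Abbreviated sample in the same layout as the output above.
sample = """
+-----------+-------+
| GENERATE  | SCORE |
+-----------+-------+
| 1212,2443 | -0.02 |
| 1212,3407 | -0.91 |
+-----------+-------+
"""
rows = parse_generate_table(sample)
best_ids, best_score = max(rows, key=lambda row: row[1])  # highest score
OUTPUT_STR = ",".join(str(token) for token in best_ids)
print(OUTPUT_STR, best_score)
```

Picking the highest-scoring row is one possible choice; any row's token ID string can be detokenized in the same way.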

To detokenize an output token ID string, assign one value from the GENERATE column to OUTPUT_STR in Python and decode it:

output_token_ids = [int(token_id) for token_id in OUTPUT_STR.split(',')]
OUTPUT_TEXT = tokenizer.decode(output_token_ids, skip_special_tokens=True)

The detokenized text looks similar to the following:

>>> OUTPUT_TEXT
'This page includes the show Transcript.\nUse the Transcript to help
students with reading comprehension and vocabulary.\nAt the bottom of
the page, comment for a chance to be mentioned on CNN Student News.
You must be a teacher or a student age 13 or older to request a mention on the CNN Student News Roll Call.'

Clean up your Docker containers and Cloud Storage buckets.

175B multi-host model serving preview

Some large language models require a multi-host TPU slice, that is, v5litepod-16 and above. In those cases, every host in the multi-host TPU slice needs to run a SAX model server, and all of these model servers function together as a SAX model server group to serve the large model on the slice.

Create a new SAX cluster

You can follow the same steps as in Create a SAX cluster in the GPT-J walkthrough to create a new SAX cluster and a SAX admin server.

Alternatively, if you already have a SAX cluster, you can launch a multi-host model server into it.

Launch a multi-host SAX model server into a SAX cluster

Use the same command to create a multi-host TPU slice as you use for a single-host TPU slice, but specify the appropriate multi-host accelerator type:

ACCELERATOR_TYPE=v5litepod-32
ZONE=us-east1-c

gcloud alpha compute tpus queued-resources create ${QUEUED_RESOURCE_ID} \
  --node-id ${TPU_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --accelerator-type ${ACCELERATOR_TYPE} \
  --runtime-version ${RUNTIME_VERSION} \
  --service-account ${SERVICE_ACCOUNT} \
  --reserved

To pull the SAX model server image onto all TPU hosts (workers) and launch a model server on each:

gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --worker=all \
  --command="
    gcloud auth configure-docker \
      us-docker.pkg.dev
    # Pull sax model server image
    docker pull ${SAX_MODEL_SERVER_IMAGE_URL}
    # Run model server
    docker run \
      --privileged  \
      -it \
      -d \
      --rm \
      --network host \
      --name ${SAX_MODEL_SERVER_DOCKER_NAME} \
      --env SAX_ROOT=gs://${SAX_ADMIN_STORAGE_BUCKET}/sax-root \
      ${SAX_MODEL_SERVER_IMAGE_URL} \
        --sax_cell=${SAX_CELL} \
        --port=10001 \
        --platform_chip=tpuv4 \
        --platform_topology=1x1"

Publish the model to the SAX cluster

This example uses the LmCloudSpmd175B32Test model:

MODEL_NAME=lmcloudspmd175b32test
MODEL_CONFIG_PATH=saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd175B32Test
CHECKPOINT_PATH=None
REPLICA=1

To publish the test model:

docker run \
 ${SAX_UTIL_IMAGE_URL} \
   --sax_root=gs://${SAX_ADMIN_STORAGE_BUCKET}/sax-root \
   publish \
     ${SAX_CELL}/${MODEL_NAME} \
     ${MODEL_CONFIG_PATH} \
     ${CHECKPOINT_PATH} \
     ${REPLICA}

Generate inference results

docker run \
  ${SAX_UTIL_IMAGE_URL} \
    --sax_root=gs://${SAX_ADMIN_STORAGE_BUCKET}/sax-root \
    lm.generate \
      ${SAX_CELL}/${MODEL_NAME} \
      "Q:  Who is Harry Porter's mother? A\: "

Note that since this example uses a test model with random weights, the output may not be meaningful.

Clean Up

Stop the Docker containers:

docker stop ${SAX_ADMIN_SERVER_DOCKER_NAME}
docker stop ${SAX_MODEL_SERVER_DOCKER_NAME}

Delete your admin storage bucket and any data storage bucket with the gcloud CLI:

gcloud storage rm gs://${SAX_ADMIN_STORAGE_BUCKET} --recursive --continue-on-error
gcloud storage rm gs://${SAX_DATA_STORAGE_BUCKET} --recursive --continue-on-error