Serve Qwen2-7B with vLLM on TPUs

This tutorial shows you how to serve the Qwen/Qwen2-7B model using the vLLM TPU serving framework on a v6e TPU VM.

Objectives

Set up your environment.
Run vLLM with Qwen2-7B.
Send an inference request.
Run a benchmark workload.
Clean up.

Costs

This tutorial uses billable components of Google Cloud, including Cloud TPU.

To generate a cost estimate based on your projected usage, use the pricing calculator.

Before you begin

Before going through this tutorial, follow the instructions in the Set up the Cloud TPU environment page. The instructions guide you through the steps needed to create a Google Cloud project and configure it to use Cloud TPU. You may also use an existing Google Cloud project. If you choose to do so, you can skip the create a Google Cloud project step and start with Set up your environment to use Cloud TPU.

You need a Hugging Face access token to use this tutorial. You can sign up for a free account at Hugging Face. Once you have an account, generate an access token:

On the Welcome to Hugging Face page, click your account avatar and select Access tokens.
On the Access Tokens page, click Create new token.
Select the Read token type and enter a name for your token.
Your access token is displayed. Save the token in a safe place.

Set up your environment

Queued Resources

Create a Cloud TPU v6e VM using the queued resources API. For Qwen2-7B, we recommend using a v6e-4 TPU.

export PROJECT_ID=<PROJECT>
export TPU_NAME=<TPU_NAME>
export ZONE=<ZONE>
export QR_ID=<QR_ID>
export TPU_TYPE=<TPU_TYPE>

Set the variables:

PROJECT - The name of your project.
TPU_NAME - Name of the TPU VM machine you will create.
ZONE - The cloud zone in which you create the new VM.
TPU_TYPE - The type of TPU VM you create. For example: v6e-1 or v6e-4.
QR_ID - Name of the Queued Resource you create.
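
Before creating any resources, it can help to confirm that none of the variables above are empty or still hold a placeholder such as <ZONE>. The following is a minimal sketch of such a check. The function name require_vars is hypothetical and not part of gcloud.

```shell
# Hypothetical helper: fail if any named variable is unset, empty,
# or still set to a <PLACEHOLDER> value copied from the tutorial.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    case "$value" in
      ""|"<"*">")
        echo "unset or placeholder: $name" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Example call:
#   require_vars PROJECT_ID TPU_NAME ZONE QR_ID TPU_TYPE
```

The function returns a non-zero status and names each offending variable on standard error, so you can stop before running a gcloud command with incomplete arguments.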

Create the Queued Resource request:

gcloud alpha compute tpus queued-resources create $QR_ID \
 --node-id $TPU_NAME \
 --project $PROJECT_ID \
 --zone $ZONE \
 --accelerator-type $TPU_TYPE \
 --runtime-version v2-alpha-tpuv6e

Check to make sure your TPU VM is ready.

gcloud compute tpus queued-resources describe $QR_ID \
  --project $PROJECT_ID \
  --zone $ZONE

When the TPU VM is ready, the state field of the output shows ACTIVE, as in the following example:

name: projects/your-project-id/locations/your-zone/queuedResources/your-queued-resource-id
state:
  state: ACTIVE
tpu:
  nodeSpec:
  - node:
      acceleratorType: v6e-4
      bootDisk: {}
      networkConfig:
        enableExternalIps: true
      queuedResource: projects/your-project-number/locations/your-zone/queuedResources/your-queued-resource-id
      runtimeVersion: v2-alpha-tpuv6e
      schedulingConfig: {}
      serviceAccount: {}
      shieldedInstanceConfig: {}
      useTpuVm: true
    nodeId: your-node-id
    parent: projects/your-project-number/locations/your-zone
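
A queued resource can take a while to reach the ACTIVE state, so you might prefer to poll instead of rerunning the describe command by hand. The following is a minimal sketch that reads only the state.state field shown in the output above. The function name wait_for_queued_resource and its retry defaults are hypothetical. Adjust them to your needs.

```shell
# Hypothetical helper: poll the queued resource until it is ACTIVE.
# $1: maximum number of attempts (assumed default: 60)
# $2: seconds to sleep between attempts (assumed default: 30)
wait_for_queued_resource() {
  max_attempts="${1:-60}"
  sleep_seconds="${2:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Print only the nested state value instead of the full YAML.
    state="$(gcloud compute tpus queued-resources describe "$QR_ID" \
      --project "$PROJECT_ID" \
      --zone "$ZONE" \
      --format="value(state.state)")"
    echo "attempt ${attempt}: state=${state}"
    case "$state" in
      ACTIVE) return 0 ;;
      FAILED|SUSPENDED) return 1 ;;  # stop early, the request will not become ACTIVE
    esac
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  return 1
}

# Example call: up to 40 attempts, 15 seconds apart.
#   wait_for_queued_resource 40 15
```

The function returns 0 once the state is ACTIVE and a non-zero status if the request fails, is suspended, or the attempts run out.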

Reservation

Create a Cloud TPU v6e VM using a reservation. For Qwen2-7B, we recommend using a v6e-4 TPU. Start by setting environment variables:

export PROJECT_ID=<PROJECT>
export TPU_NAME=<TPU_NAME>
export ZONE=<ZONE>
export TPU_TYPE=<TPU_TYPE>
export HF_TOKEN=<HF_TOKEN>
export RESERVATION=<RESERVATION>

Set the variables:

PROJECT - The name of your project.
TPU_NAME - Name of the TPU VM machine you will create.
ZONE - The cloud zone in which you create the new VM.
TPU_TYPE - The type of TPU VM you create. For example: v6e-1 or v6e-4.
RESERVATION - Name of the reservation with your TPUs.

Create the TPU VM using your reservation:

gcloud alpha compute tpus tpu-vm create $TPU_NAME \
    --zone=$ZONE \
    --project $PROJECT_ID \
    --accelerator-type=$TPU_TYPE \
    --version=v2-alpha-tpuv6e \
    --provisioning-model=reservation-bound \
    --reservation=$RESERVATION

Connect to the TPU VM.

gcloud compute tpus tpu-vm ssh $TPU_NAME \
  --project $PROJECT_ID \
  --zone $ZONE

Run vLLM with Qwen2-7B

Set your Hugging Face token and model name variables.

  export HF_TOKEN="YOUR_HF_TOKEN"
  export MODEL_NAME="Qwen/Qwen2-7B"

Inside the TPU VM, run the vLLM Docker container in detached mode and start the vLLM server. This command uses a shared memory size of 10 GB.

export DOCKER_URI="vllm/vllm-tpu:v0.18.0"
export CONTAINER_NAME="${USER}-vllm"
export MAX_MODEL_LEN=4096
export TP=1 # number of chips

sudo docker run -d --name "${CONTAINER_NAME}" \
    --privileged --net=host \
    -v /dev/shm:/dev/shm \
    --shm-size 10gb \
    -e "HF_HOME=/dev/shm" \
    -e "HF_TOKEN=${HF_TOKEN}" \
    -p 8000:8000 "${DOCKER_URI}" \
        vllm serve ${MODEL_NAME} \
            --seed 42 \
            --gpu-memory-utilization 0.98 \
            --max-num-batched-tokens 1024 \
            --max-num-seqs 128 \
            --tensor-parallel-size $TP \
            --max-model-len $MAX_MODEL_LEN

Check the server logs to confirm it's running.

sudo docker logs -f "${CONTAINER_NAME}"

When the vLLM server is running, the logs end with output similar to the following. After this output appears, press CTRL+C to stop following the logs and return to the terminal. The container keeps running in the background.

(APIServer pid=7) INFO:     Started server process [7]
(APIServer pid=7) INFO:     Waiting for application startup.
(APIServer pid=7) INFO:     Application startup complete.
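
As an alternative to watching the logs, you can poll the server's /health endpoint, which the vLLM OpenAI-compatible server exposes. The following is a minimal sketch that runs curl inside the container, as the request examples in this tutorial do. The function name wait_for_vllm and its retry defaults are hypothetical.

```shell
# Hypothetical helper: wait until the vLLM server answers on /health.
# $1: maximum number of attempts (assumed default: 60)
# $2: seconds to sleep between attempts (assumed default: 10)
wait_for_vllm() {
  max_attempts="${1:-60}"
  sleep_seconds="${2:-10}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
    if sudo docker exec "${CONTAINER_NAME}" \
        curl -sf -o /dev/null http://localhost:8000/health; then
      echo "server ready after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  echo "server not ready after ${max_attempts} attempts" >&2
  return 1
}

# Example call:
#   wait_for_vllm 60 10
```

Model download and compilation can take several minutes on first start, so choose the attempt count and sleep interval generously.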

Send an inference request

Once the vLLM server is running, you can send requests to the API. For more information, see the vLLM API reference documentation.

Send a test request to the server using curl.

sudo docker exec "${CONTAINER_NAME}" \
  curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "The future of AI is",
        "max_tokens": 200,
        "temperature": 0
      }'

The response is returned in JSON format.
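
In a completions response, the generated text is in the text field of the first entry of the choices array. The following is a minimal sketch that prints only that text. It assumes python3 is available on the TPU VM host, and the function name extract_completion_text is hypothetical.

```shell
# Hypothetical helper: read a /v1/completions JSON response on stdin
# and print the generated text of the first choice.
# Assumes python3 is installed where this function runs.
extract_completion_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}

# Example call: pipe the curl output into the helper.
#   sudo docker exec "${CONTAINER_NAME}" \
#     curl -s http://localhost:8000/v1/completions \
#       -H "Content-Type: application/json" \
#       -d '{"prompt": "The future of AI is", "max_tokens": 200, "temperature": 0}' \
#     | extract_completion_text
```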

Run a benchmark workload

Because the container runs in detached mode, you can run benchmarks against the running server from the same terminal on the TPU VM.

Inside the container, install the datasets library.

sudo docker exec "${CONTAINER_NAME}" pip install datasets

Inside the container, run the vllm bench serve command.

sudo docker exec "${CONTAINER_NAME}" \
    vllm bench serve \
        --backend vllm \
        --dataset-name random \
        --num-prompts 1000 \
        --seed 100

The benchmark results look like the following:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  45.35
Total input tokens:                      1024000
Total generated tokens:                  126848
Request throughput (req/s):              22.05
Output token throughput (tok/s):         2797.15
Peak output token throughput (tok/s):    4258.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          25377.57
---------------Time to First Token----------------
Mean TTFT (ms):                          21332.46
Median TTFT (ms):                        21330.37
P99 TTFT (ms):                           42436.47
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.36
Median TPOT (ms):                        38.56
P99 TPOT (ms):                           38.69
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.35
Median ITL (ms):                         38.55
P99 ITL (ms):                            39.43
==================================================
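
To compare runs, it can be useful to pull single values out of the report. The following is a minimal sketch that assumes you saved the benchmark output to a file. The file name bench.txt and the function name bench_metric are hypothetical. The function matches the label text at the start of a report line, exactly as printed above.

```shell
# Hypothetical helper: read a benchmark report on stdin and print the
# value of the first line that starts with the given label.
# $1: label prefix, for example "Request throughput"
bench_metric() {
  awk -F': *' -v label="$1" 'index($0, label) == 1 { print $2; exit }'
}

# Example calls, assuming the report was saved to bench.txt:
#   bench_metric "Request throughput" < bench.txt
#   bench_metric "Mean TTFT" < bench.txt
```

As a rough cross-check of the sample report, 1000 successful requests over 45.35 seconds is about 22.05 requests per second, which matches the reported request throughput.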

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

In your terminal, type exit to disconnect from the TPU VM.

Delete your resources

Deleting the project removes every resource in it. If you want to keep the project, delete only the TPU resources instead.

Delete your project

To delete your Google Cloud project and all associated resources, run:

    gcloud projects delete $PROJECT_ID

Delete TPU resources

Queued Resources

Delete your Cloud TPU resources. With the --force parameter, the following command deletes both the queued resource request and the TPU VM.

gcloud alpha compute tpus queued-resources delete $QR_ID \
  --project=$PROJECT_ID \
  --zone=$ZONE \
  --force

Reservation

Delete your Cloud TPU VM. Use the following command to terminate the VM, which releases the TPUs back to your reservation.

gcloud compute tpus tpu-vm delete $TPU_NAME --zone $ZONE --project $PROJECT_ID --quiet
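
After deleting, you might want to confirm that the TPU VM is gone so that it no longer incurs charges. The following is a minimal sketch that relies on the describe command exiting with a non-zero status when the VM does not exist. The function name tpu_vm_deleted is hypothetical.

```shell
# Hypothetical helper: report whether the TPU VM still exists.
# Note: describe also fails on authentication or permission errors,
# so treat "not found" as a hint and check the console if in doubt.
tpu_vm_deleted() {
  if gcloud compute tpus tpu-vm describe "$TPU_NAME" \
      --zone "$ZONE" \
      --project "$PROJECT_ID" >/dev/null 2>&1; then
    echo "TPU VM ${TPU_NAME} still exists"
    return 1
  fi
  echo "TPU VM ${TPU_NAME} not found"
  return 0
}

# Example call:
#   tpu_vm_deleted
```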

What's next

Learn more about vLLM on Cloud TPU.
Learn more about Cloud TPU.
