Serve Qwen3-8B-Base with vLLM on TPUs

This tutorial shows you how to serve the Qwen/Qwen3-8B-Base model using the vLLM TPU serving framework on a v6e TPU VM.

Objectives

  1. Set up your environment.
  2. Run vLLM with Qwen3-8B-Base.
  3. Send an inference request.
  4. Run a benchmark workload.
  5. Clean up.

Costs

This tutorial uses billable components of Google Cloud, including Cloud TPU.

To generate a cost estimate based on your projected usage, use the pricing calculator.

Before you begin

Before starting this tutorial, follow the instructions on the Set up the Cloud TPU environment page, which walk you through creating a Google Cloud project and configuring it to use Cloud TPU. You can also use an existing Google Cloud project; if you do, skip the create a Google Cloud project step and start with Set up your environment to use Cloud TPU.

You need a Hugging Face access token to use this tutorial. You can sign up for a free account at Hugging Face. Once you have an account, generate an access token:

  1. On the Welcome to Hugging Face page, click your account avatar and select Access tokens.
  2. On the Access Tokens page, click Create new token.
  3. Select the Read token type and enter a name for your token.
  4. Your access token is displayed. Save the token in a safe place.

Set up your environment

  1. Create a Cloud TPU v6e VM using the queued resources API. For Qwen3-8B-Base, we recommend using a v6e-1 TPU.

    export PROJECT_ID=YOUR_PROJECT_ID
    export TPU_NAME=Qwen3-8B-Base-tutorial
    export ZONE=us-east5-a
    export QR_ID=Qwen3-8B-Base-qr
    
    gcloud alpha compute tpus queued-resources create $QR_ID \
     --node-id $TPU_NAME \
     --project $PROJECT_ID \
     --zone $ZONE \
     --accelerator-type v6e-1 \
     --runtime-version v2-alpha-tpuv6e
    
  2. Check to make sure your TPU VM is ready.

    gcloud compute tpus queued-resources describe $QR_ID \
      --project $PROJECT_ID \
      --zone $ZONE
    

    When your TPU VM has been created, the state of the queued resource request is set to ACTIVE. For example:

    name: projects/your-project-id/locations/your-zone/queuedResources/your-queued-resource-id
    state:
      state: ACTIVE
    tpu:
      nodeSpec:
      - node:
          acceleratorType: v6e-1
          bootDisk: {}
          networkConfig:
            enableExternalIps: true
          queuedResource: projects/your-project-number/locations/your-zone/queuedResources/your-queued-resource-id
          runtimeVersion: v2-alpha-tpuv6e
          schedulingConfig: {}
          serviceAccount: {}
          shieldedInstanceConfig: {}
          useTpuVm: true
        nodeId: your-node-id
        parent: projects/your-project-number/locations/your-zone
    
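    If you are scripting the setup, you can poll the queued resource until it reaches the ACTIVE state instead of re-running the describe command by hand. The following is a minimal sketch; it assumes the nested state.state field shown in the example output above.

    while [[ "$(gcloud compute tpus queued-resources describe $QR_ID \
        --project $PROJECT_ID \
        --zone $ZONE \
        --format='value(state.state)')" != "ACTIVE" ]]; do
      # Still provisioning; check again in 30 seconds.
      echo "Waiting for the queued resource to become ACTIVE..."
      sleep 30
    done
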
  3. Connect to the TPU VM.

      gcloud compute tpus tpu-vm ssh $TPU_NAME \
        --project $PROJECT_ID \
        --zone $ZONE
    

Run vLLM with Qwen3-8B-Base

  1. Inside the TPU VM, run the vLLM TPU Docker container. This command uses host networking and a shared memory size of 10 GB.

      export DOCKER_URI=vllm/vllm-tpu:latest
    
      sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
        -v /dev/shm:/dev/shm \
        --shm-size 10gb \
        -p 8000:8000 \
        --entrypoint /bin/bash ${DOCKER_URI}
    
  2. Inside the container, set your Hugging Face token. Replace YOUR_HF_TOKEN with your Hugging Face token.

    export HF_HOME=/dev/shm
    export HF_TOKEN=YOUR_HF_TOKEN
    
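    Optionally, you can pre-download the model weights so the first server startup does not block on the download. This is a sketch that assumes the huggingface-cli tool (installed with the huggingface_hub package) is available in the container:

    # Downloads the weights into the HF_HOME cache set above (/dev/shm).
    huggingface-cli download Qwen/Qwen3-8B-Base
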
  3. Start the vLLM server using the vllm serve command.

    export MAX_MODEL_LEN=4096
    export TP=1 # number of chips
    
    vllm serve Qwen/Qwen3-8B-Base \
        --seed 42 \
        --disable-log-requests \
        --gpu-memory-utilization 0.98 \
        --max-num-batched-tokens 1024 \
        --max-num-seqs 128 \
        --tensor-parallel-size $TP \
        --max-model-len $MAX_MODEL_LEN
    

    When the vLLM server is running, you will see output similar to the following:

    (APIServer pid=7) INFO:     Started server process [7]
    (APIServer pid=7) INFO:     Waiting for application startup.
    (APIServer pid=7) INFO:     Application startup complete.
    
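    If you created a larger TPU slice, set TP to the number of chips in the slice before starting the server so that the model is sharded across all of them. For example, on a v6e-4 slice you would use:

    export TP=4 # one tensor-parallel rank per chip on a v6e-4 slice
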

Send an inference request

Once the vLLM server is running, you can send requests to it from a new shell.

  1. Open a new shell and connect to your TPU VM.

      export PROJECT_ID=YOUR_PROJECT_ID
      export TPU_NAME=Qwen3-8B-Base-tutorial
      export ZONE=us-east5-a
    
      gcloud compute tpus tpu-vm ssh $TPU_NAME \
        --project $PROJECT_ID \
        --zone=$ZONE
    
  2. Open a shell into the running Docker container.

      sudo docker exec -it $USER-vllm /bin/bash
    
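      Optionally, confirm that the server is up and the model has finished loading before you send a request. This is a quick sanity check that assumes the default endpoints of vLLM's OpenAI-compatible server:

      # Returns HTTP 200 once the server is ready to accept requests.
      curl -i http://localhost:8000/health

      # Lists the served models; Qwen/Qwen3-8B-Base should appear in the response.
      curl http://localhost:8000/v1/models
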
  3. Send a test request to the server using curl.

      curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "Qwen/Qwen3-8B-Base",
            "prompt": "The future of AI is",
            "max_tokens": 200,
            "temperature": 0
          }'
    

The output from the request appears as follows:

{
  "id": "cmpl-8ac19b8ab39d0383",
  "object": "text_completion",
  "created": 1765321405,
  "model": "Qwen/Qwen3-8B-Base",
  "choices": [
    {
      "index": 0,
      "text": " a topic of much debate and speculation. While some fear that AI
       will take over the world and lead to the end of humanity, others believe
       that AI will bring about a new era of prosperity and progress. In this
       article, we will explore the potential future of AI and what it could
       mean for humanity.\nThe Rise of AI\nAI has already made significant
       strides in recent years, with advancements in machine learning, natural
       language processing, and computer vision. These technologies have enabled
       AI to perform tasks that were once thought to be the exclusive domain of
       humans, such as recognizing objects in images, translating languages, and
       even playing complex games like Go.\nAs AI continues to evolve, it is
       likely that we will see even more impressive feats of intelligence. For
       example, AI could be used to develop new drugs, design more efficient
       buildings, and even create art. The possibilities are endless, and it is
       clear that AI will play an increasingly important role in our lives in
       the years to come.\nThe Potential Benefits of",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 205,
    "completion_tokens": 200,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
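
You can also request a streamed completion so that tokens are returned as they are generated rather than in a single response. The following sketch assumes the server supports the OpenAI-style stream parameter, in which case the reply arrives as a series of server-sent events (data: ... chunks) instead of a single JSON object:

    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
          "model": "Qwen/Qwen3-8B-Base",
          "prompt": "The future of AI is",
          "max_tokens": 200,
          "temperature": 0,
          "stream": true
        }'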

Run a benchmark workload

You can run benchmarks against the running server from your second shell, inside the Docker container.

  1. Inside the container, install the datasets library.

    pip install datasets
    
  2. Run the vllm bench serve command.

    export HF_HOME=/dev/shm
    cd /workspace/vllm
    
    vllm bench serve \
        --backend vllm \
        --model "Qwen/Qwen3-8B-Base"  \
        --dataset-name random \
        --num-prompts 1000 \
        --seed 100
    

The benchmark results appear as follows:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  73.97
Total input tokens:                      1024000
Total generated tokens:                  128000
Request throughput (req/s):              13.52
Output token throughput (tok/s):         1730.38
Peak output token throughput (tok/s):    2522.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          15573.42
---------------Time to First Token----------------
Mean TTFT (ms):                          34834.97
Median TTFT (ms):                        34486.19
P99 TTFT (ms):                           70234.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          47.30
Median TPOT (ms):                        48.57
P99 TPOT (ms):                           48.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.31
Median ITL (ms):                         53.49
P99 ITL (ms):                            54.58
==================================================
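
In this run, all 1,000 requests are issued at once (peak concurrent requests is 1000), so the time-to-first-token numbers are dominated by queueing. To model a steadier load, you can control the request rate and the synthetic prompt and output lengths. The following sketch assumes the standard vllm bench serve flags for the random dataset; adjust the values to match your expected traffic:

    # Sends roughly 5 requests per second with 1024 input and 128 output tokens per request.
    vllm bench serve \
        --backend vllm \
        --model "Qwen/Qwen3-8B-Base" \
        --dataset-name random \
        --random-input-len 1024 \
        --random-output-len 128 \
        --request-rate 5 \
        --num-prompts 1000 \
        --seed 100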

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. In the second shell, type exit to exit from the vLLM container.
  2. In the second shell, type exit again to disconnect from the TPU VM.
  3. In the first shell, press Ctrl+C to stop the vLLM server.
  4. In the first shell, type exit to exit from the vLLM container.
  5. In the first shell, type exit to disconnect from the TPU VM.

Delete your resources

You can delete the project, which deletes all of its resources, or you can keep the project and delete the individual resources.

Delete your project

To delete your Google Cloud project and all associated resources, run:

    gcloud projects delete $PROJECT_ID

Delete TPU resources

Delete your Cloud TPU resources. The following command deletes both the queued resource request and the TPU VM using the --force parameter.

  gcloud alpha compute tpus queued-resources delete $QR_ID \
    --project=$PROJECT_ID \
    --zone=$ZONE \
    --force
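
To verify that the deletion completed, list the queued resources in the zone; the request you created should no longer appear once cleanup finishes:

  gcloud compute tpus queued-resources list \
    --project=$PROJECT_ID \
    --zone=$ZONE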

What's next