This document describes how to dynamically increase the storage capacity of Managed Lustre volumes for your stateful workloads in Google Kubernetes Engine (GKE) without disrupting your applications.
For example, if your long-running AI/ML training jobs have dynamic and unpredictable storage requirements, enable Managed Lustre volume expansion to increase the storage capacity of your existing Managed Lustre PersistentVolume (PV).
This document is for Platform admins and operators, DevOps, Storage administrators, and Machine learning (ML) engineers who manage storage for stateful workloads on GKE.
When you expand a volume, your costs increase based on the new, larger capacity, according to the standard Google Cloud Managed Lustre pricing.
Before you begin
Requirements
Make sure that you meet the following requirements:
- You must have a GKE cluster version 1.35.0-gke.2331000 or later.
- You must enable the Managed Lustre CSI driver on an existing cluster. The driver is disabled by default in Standard and Autopilot clusters.
Limitations
- You can only increase the size of an existing volume, and you can't decrease its size.
- You can't use volume expansion with the `ReadOnlyMany` access mode.
- When you resize Lustre volumes, follow the minimum and maximum capacity limits and step sizes set by your volume's performance tier. For more information, see Performance considerations.
- Specify Lustre volume sizes in GiB as multiples of 1000. Kubernetes translates units such as Ti into binary values (for example, 18 Ti is interpreted as 18,432 GiB), which results in the Lustre API rejecting the request.
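The unit pitfall in the last limitation can be illustrated with a short sketch; the `to_gib` helper is hypothetical and exists only to show the arithmetic:

```python
# Sketch: why "18Ti" is rejected while "18000Gi" is accepted.
# Kubernetes resource quantities use binary suffixes: 1 Ti = 1024 Gi.

def to_gib(value: int, suffix: str) -> int:
    """Convert a Kubernetes quantity to GiB (illustrative helper)."""
    factors = {"Gi": 1, "Ti": 1024}
    return value * factors[suffix]

# 18Ti becomes 18432 GiB, which is not a multiple of 1000,
# so the Lustre API rejects the request.
print(to_gib(18, "Ti"))           # 18432
print(to_gib(18, "Ti") % 1000)    # 432 -> not a multiple of 1000

# Specifying 18000Gi directly keeps the size a multiple of 1000.
print(to_gib(18000, "Gi") % 1000)  # 0 -> valid
```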
Enable volume expansion for a StorageClass
Verify whether your StorageClass supports volume expansion:
```
kubectl get sc STORAGECLASS_NAME -o jsonpath='{.allowVolumeExpansion}{"\n"}'
```

Replace `STORAGECLASS_NAME` with the name of your StorageClass.

If the command outputs nothing or returns `false`, you must explicitly update the StorageClass configuration to allow expansion.

Open the StorageClass configuration for editing:

```
kubectl edit storageclass STORAGECLASS_NAME
```

In the editor, add the `allowVolumeExpansion: true` field to your StorageClass configuration:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lustre-sc
provisioner: lustre.csi.storage.gke.io
...
allowVolumeExpansion: true
```
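If you prefer to construct the manifest programmatically rather than in an editor, a minimal sketch follows. It assumes only that `kubectl apply -f -` accepts a JSON-serialized manifest (JSON is a subset of YAML); the `lustre-sc` name mirrors the example above.

```python
import json

# Sketch: the same StorageClass built programmatically. The serialized
# dict can be piped to `kubectl apply -f -`. Field names mirror the
# YAML example above.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "lustre-sc"},
    "provisioner": "lustre.csi.storage.gke.io",
    "allowVolumeExpansion": True,  # the field that enables expansion
}

print(json.dumps(storage_class, indent=2))
```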
Expand a PersistentVolumeClaim
To initiate volume expansion, edit your PersistentVolumeClaim (PVC) to request an increase in volume size.
- Identify the new, valid size for expansion as described in Determine valid expansion sizes.
Open the PVC configuration for editing:
```
kubectl edit pvc PVC_NAME
```

Replace `PVC_NAME` with the name of your PVC.

In the editor, update the `spec.resources.requests.storage` field with the valid expansion size. For example, to expand a volume from `9000Gi` to `18000Gi`, modify the `storage` field as follows:

```yaml
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 18000Gi # Changed from 9000Gi
```
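As a non-interactive alternative to `kubectl edit`, `kubectl patch` can apply the same change. The following sketch builds the merge-patch payload; the `18000Gi` value is the example size from above.

```python
import json

# Sketch: a merge patch that raises the PVC's requested storage,
# equivalent to editing spec.resources.requests.storage in the editor.
new_size = "18000Gi"  # must be a valid size for the volume's tier
patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}

payload = json.dumps(patch)
print(payload)
# Usage (illustrative): kubectl patch pvc PVC_NAME -p '<payload>'
```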
Verify the volume expansion
Monitor the expansion progress by reviewing the PVC's events:
```
kubectl describe pvc PVC_NAME
```

The following events in the PVC's output indicate the current progress or outcome of the volume expansion request:

- `ExternalExpanding`: indicates that Kubernetes is waiting for the `external-resizer` to expand the PVC.
- `Resizing`: indicates that the resize operation is in progress. This operation can take up to 90 minutes for larger capacity increases.
- `VolumeResizeSuccessful`: confirms that the volume has been successfully expanded.
- `VolumeResizeFailed`: indicates that an error occurred. The event message contains details from the Google Cloud Managed Lustre API. This state might be transient and might resolve on its own.

After the expansion is complete, verify the PVC's updated configuration:

```
kubectl get pvc PVC_NAME -o yaml
```

Ensure the `status.capacity` field reflects the new, incremented size.
If you encounter any issues during the expansion process, see Troubleshooting.
Determine valid expansion sizes
To determine your new volume size, first identify your volume's performance tier and its corresponding step size.
Identify the volume's performance tier
You can find your volume's performance tier using either of the following options:
StorageClass
Run the following command and look for the `perUnitStorageThroughput` value (for example, `1000`). This value indicates the performance tier in MBps per TiB.

```
kubectl get sc STORAGECLASS_NAME -o yaml
```

Replace `STORAGECLASS_NAME` with the name of your StorageClass.
Lustre instance
Identify your volume's performance tier by checking the underlying Managed Lustre instance's properties directly:
Find the name of the PV bound to your PVC:
```
kubectl get pvc PVC_NAME
```

Replace `PVC_NAME` with the name of your PVC.

The output is similar to the following. Note the PV name in the `VOLUME` column, for example, `pv-lustre`.

```
NAME         STATUS   VOLUME      CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
pvc-lustre   Bound    pv-lustre   9000Gi     RWX            lustre-rwx     <unset>                 26m
```

Find the volume's location and instance name in the `volumeHandle` field:

```
kubectl get pv PV_NAME -o yaml
```

Replace `PV_NAME` with the PV name from the previous step.

The `volumeHandle` value is formatted as `PROJECT_ID/LOCATION/INSTANCE_NAME`. Note the `INSTANCE_NAME` and `LOCATION` for the next step.

Check the performance tier properties by describing the Managed Lustre instance:

```
gcloud lustre instances describe INSTANCE_NAME --location=LOCATION
```

Replace `INSTANCE_NAME` and `LOCATION` with the values from the previous step.

In the output, look for the `perUnitStorageThroughput` field. This value indicates the performance tier in MBps per TiB.
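The `volumeHandle` split described above can be sketched as follows; the handle value used here is hypothetical, and the helper name is illustrative.

```python
# Sketch: split a volumeHandle of the form PROJECT_ID/LOCATION/INSTANCE_NAME
# into its parts.
def parse_volume_handle(handle: str) -> dict:
    project, location, instance = handle.split("/", 2)
    return {"project": project, "location": location, "instance": instance}

parts = parse_volume_handle("my-project/us-central1-a/my-lustre-instance")
print(parts["instance"], parts["location"])
# These values feed into:
#   gcloud lustre instances describe INSTANCE_NAME --location=LOCATION
```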
Capacity limits and step sizes
After you identify the performance tier, refer to the following table to find the associated capacity limits and required step size.
| Tier (perUnitStorageThroughput) | Min capacity | Max capacity | Step size |
|---|---|---|---|
| 1,000 MBps per TiB | 9,000 GiB | 954,000 GiB (~1 PiB) | 9,000 GiB |
| 500 MBps per TiB | 18,000 GiB | 1,908,000 GiB (~2 PiB) | 18,000 GiB |
| 250 MBps per TiB | 36,000 GiB | 3,816,000 GiB (~4 PiB) | 36,000 GiB |
| 125 MBps per TiB | 72,000 GiB | 7,632,000 GiB (~8 PiB) | 72,000 GiB |
Volumes must be increased according to the step size assigned to their tier. Any increase in capacity must be a multiple of that specific step size. For example, if your 1,000 MBps tier volume has a capacity of 9,000 GiB, you can increase it to 18,000 GiB, 27,000 GiB, and other multiples.
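The step-size rule above can be sketched as a small helper that lists the next valid capacities for a tier; the function name is illustrative, not part of any API.

```python
# Sketch: compute valid expansion targets per the table above.
# Example tier parameters (1,000 MBps per TiB): min 9000 GiB,
# step 9000 GiB, max 954000 GiB.
def valid_expansion_sizes(current_gib: int, step_gib: int,
                          max_gib: int, count: int = 5) -> list:
    """Return the next few capacities larger than current_gib that are
    multiples of the tier's step size and within the tier's maximum."""
    sizes = []
    size = (current_gib // step_gib + 1) * step_gib  # next multiple
    while size <= max_gib and len(sizes) < count:
        sizes.append(size)
        size += step_gib
    return sizes

# A 9000 GiB volume on the 1,000 MBps tier can grow to:
print(valid_expansion_sizes(9000, 9000, 954000, count=3))
# [18000, 27000, 36000]
```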
Troubleshooting
This section provides solutions for common issues you might encounter when you expand Lustre volumes.
Expansion fails with an "Invalid Argument" error
Symptom
- The PVC enters a `Resizing` state but then fails.
- When you run the `kubectl describe pvc PVC_NAME` command, you see an error similar to `VolumeResizeFailed: rpc error: code = InvalidArgument desc = ...`.
Cause
This error typically means that the requested storage size is not valid for the Lustre volume's performance tier for one of the following reasons:
- The requested size is not a multiple of the required step size for the tier.
- The requested size is below the minimum or above the maximum capacity for the tier.
Resolution
- Review Capacity limits and step sizes to find the valid step sizes and capacity limits for your volume's performance tier.
- Edit the PVC again to request a valid storage size that meets the step size and capacity limits.
Expansion fails with an "Internal Error"
Symptom
- The PVC resize fails.
- When you run the `kubectl describe pvc PVC_NAME` command, you might see a `VolumeResizeFailed` event with an error message containing `code = Internal`.
Cause
This error indicates a problem with the underlying Managed Lustre service.
Resolution
- Retry the expansion by applying the PVC manifest again with the new, requested size. This might resolve transient backend issues.
- If retrying fails, contact Cloud Customer Care.
Expansion is stuck in "Resizing" state
Symptom
- The PVC remains in the `Resizing` state for an extended period (more than 30 minutes for smaller expansions, or more than 90 minutes for larger expansions).
- You might see a `VolumeResizeFailed` event with a `DEADLINE_EXCEEDED` error message.
Cause
This issue can happen with large capacity increases, which can take up to 90 minutes to complete. The csi-external-resizer component might time out while waiting for the Google Cloud Managed Lustre API to respond, even though the underlying expansion operation is still in progress.
Resolution
- The `csi-external-resizer` automatically retries the operation after a backoff period. Continue to monitor the PVC events for a `VolumeResizeSuccessful` event.
- If the PVC remains in the `Resizing` state for more than 90 minutes, contact Cloud Customer Care.
Expansion does not start, or is stuck in an "ExternalExpanding" state
Symptom
- You update the `spec.resources.requests.storage` field in your PVC, but the PVC status does not change to `Resizing`.
- When you run the `kubectl describe pvc PVC_NAME` command, the event log only shows the `ExternalExpanding` state and does not progress to the `Resizing` state:
```
Events:
  Type    Reason             Age                From           Message
  ----    ------             ----               ----           -------
  Normal  ExternalExpanding  21m (x2 over 58m)  volume_expand  waiting for an external controller to expand this PVC
```
Cause
This behavior typically indicates one of the following issues:
- The StorageClass associated with the PVC does not permit volume expansion.
- There is a problem with the `csi-external-resizer` sidecar container, which is the component responsible for initiating the expansion.
Resolution
Check the StorageClass configuration and verify that the `allowVolumeExpansion: true` field is set:

```
kubectl get sc STORAGECLASS_NAME -o jsonpath='{.allowVolumeExpansion}{"\n"}'
```

If `allowVolumeExpansion` is missing or set to `false`, update the StorageClass to allow volume expansion.

If the StorageClass is configured correctly, the issue is likely with the GKE control plane components that manage the resize operation. Contact Cloud Customer Care for assistance.
Expansion fails due to a quota or capacity issue
Symptom
- The PVC resize fails, and a `VolumeResizeFailed` event appears on the PVC.
- When you run the `kubectl describe pvc PVC_NAME` command, the event message from the Managed Lustre backend indicates a quota or capacity issue.
Cause
The requested expansion, even if valid for the volume's tier, can't be fulfilled because it exceeds the overall capacity or quota available for the Managed Lustre service in your project or region.
Resolution
- Edit your PVC again and request a smaller storage increase.
- Contact your organization's Google Cloud administrator to request an increase to the overall Lustre service quotas or capacity for your project.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this document, delete the PVC. This operation also deletes the associated PV and the underlying Managed Lustre instance if the `reclaimPolicy` is set to `Delete`.

```
kubectl delete pvc PVC_NAME
```

Replace `PVC_NAME` with the name of your PVC.
What's next
- Learn more about Storage for GKE clusters.
- Read the documentation about GKE persistent volumes and provisioning.