Migrate Hive managed tables to Google Cloud
This document shows you how to migrate your Hive managed tables to Google Cloud.
You can use the Hive managed tables migration connector in the BigQuery Data Transfer Service to migrate tables managed by Hive metastore, in both Hive and Iceberg formats, from on-premises and cloud environments to Google Cloud. The Hive managed tables migration connector supports files stored in the following data sources:
- HDFS
- Amazon S3
- Azure Blob Storage or Azure Data Lake Storage Gen2
With the Hive managed tables migration connector, you can register your Hive managed tables with Dataproc Metastore or BigLake metastore Iceberg REST Catalog while using Cloud Storage as the file storage.
This connector supports both full and metadata-only transfers. A full transfer moves both the data and the metadata from your source tables to your target metastore. You can run a metadata-only transfer if your data is already migrated to Cloud Storage.
The following diagram provides an overview of the table migration process from a Hadoop cluster.

Limitations
Hive managed tables transfers are subject to the following limitations:
- To migrate Apache Iceberg tables, you must register the tables with the BigLake metastore Iceberg REST catalog to allow write access for open-source engines (such as Apache Spark or Flink).
- To migrate Hive managed tables, you must register the tables with Dataproc Metastore to allow write access for open-source engines, and to allow read access for BigQuery.
- File names must comply with Cloud Storage object naming requirements.
- Storage Transfer Service has specific behaviors when data is changed at the source while a transfer is in progress. We don't recommend writing to a table while it is being actively migrated.
- Cloud Storage has a 5 TiB limit for single objects. Files within your Apache Hive tables that are larger than 5 TiB fail to transfer.
For a list of other limitations, see Storage Transfer Service Known Limitations.
Quotas and concurrency limits
When you migrate data from Hive managed tables, Storage Transfer Service quotas and limits apply. For more information, see Quotas & Limits.
Before you begin
Before you schedule a Hive managed tables transfer, you must perform the following:
Generate a metadata file for Apache Hive
Run the dwh-migration-dumper tool to extract metadata
for Apache Hive. The tool generates a file named hive-dumper-output.zip
in a Cloud Storage bucket, referred to in this document as DUMPER_BUCKET.
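For example, the following is a minimal sketch of the extraction step, based on the flags that the orchestration script later in this document passes to the tool. HIVE_HOST and the port are placeholders for your cluster (9083 is a common Hive metastore Thrift port), and the local output path is illustrative:

# Extract Hive metadata to a local ZIP file.
dwh-migration-dumper \
  --connector hiveql \
  --host HIVE_HOST \
  --port 9083 \
  --output /tmp/hive-dumper-output.zip

# Upload the dump to the bucket referenced by the transfer configuration.
gsutil cp /tmp/hive-dumper-output.zip gs://DUMPER_BUCKET/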
Enable APIs
Enable the following APIs in your Google Cloud project:
- Data Transfer API
- Storage Transfer API
A service agent is created when you enable the Data Transfer API.
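For example, you can enable both APIs with the gcloud CLI (a sketch, assuming the standard service names for these two APIs):

gcloud services enable \
  bigquerydatatransfer.googleapis.com \
  storagetransfer.googleapis.com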
Configure permissions
- Create a service account and grant it the BigQuery Admin role (roles/bigquery.admin). This service account is used to create the transfer configuration.
- A service agent (P4SA) is created when you enable the Data Transfer API. Grant it the following roles:
  - roles/metastore.metadataOwner
  - roles/storagetransfer.admin
  - roles/serviceusage.serviceUsageConsumer
  - roles/storage.objectAdmin
  - roles/storage.admin
  - If you are migrating metadata to the BigLake metastore Iceberg REST catalog, you must also grant the roles/biglake.admin role.
- Grant the service agent the roles/iam.serviceAccountTokenCreator role with the following command:
  gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT \
    --member serviceAccount:service-PROJECT_NUMBER@gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com \
    --role roles/iam.serviceAccountTokenCreator
- Grant the Storage Transfer Service service agent (project-PROJECT_NUMBER@storage-transfer-service.iam.gserviceaccount.com) the roles/storage.admin role in the project. If you are migrating from HDFS, you must also grant the roles/storagetransfer.serviceAgent role.
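For example, the following sketch grants one of the preceding roles to the Data Transfer service agent with the gcloud CLI; repeat the command for each required role:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com" \
  --role="roles/storagetransfer.admin"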
You can also configure more granular permissions. For more information, see the source-specific permissions guides:
Configure your Storage Transfer Agent for HDFS data lakes
Required when your files are stored in HDFS. To set up the storage transfer agents required for an HDFS data lake transfer, do the following:
- Install Docker on on-premises agent machines.
- Create a Storage Transfer Service agent pool in your Google Cloud project.
- Install agents on your on-premises agent machines.
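For example, the following is a sketch of steps 2 and 3 using the gcloud transfer command group. HDFS_POOL_NAME and the agent count are placeholders, and Docker must already be installed on the machines where the agents run:

# Create an agent pool, then install agents on the on-premises machine.
gcloud transfer agent-pools create HDFS_POOL_NAME
gcloud transfer agents install --pool=HDFS_POOL_NAME --count=3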
Configure Storage Transfer Service permissions for Amazon S3
Required when your files are stored in Amazon S3. Transfers from Amazon S3 are agentless transfers, which require specific permissions. To configure Storage Transfer Service for an Amazon S3 transfer, do the following:
- Set up access credentials for Amazon S3.
- Note the access key ID and secret access key after setting up your access credentials.
- Add IP ranges used by Storage Transfer Service workers to your list of allowed IPs if your AWS project uses IP restrictions.
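For example, if you manage credentials with the AWS CLI, the following sketch creates an access key for an IAM user (TRANSFER_USER is a hypothetical user with read access to the source bucket); the response contains the access key ID and secret access key to note down:

aws iam create-access-key --user-name TRANSFER_USER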
Configure Storage Transfer Service permissions for Microsoft Azure Storage
Required when your files are stored in Azure Blob Storage or Azure Data Lake Storage Gen2. Transfers from Microsoft Azure Storage are agentless transfers, which require specific permissions. To configure Storage Transfer Service for a Microsoft Azure Storage transfer, do the following:
- Generate a Shared Access Signature (SAS) token for your Microsoft Azure storage account.
- Note the SAS token after generating it.
- Add IP ranges used by Storage Transfer Service workers to your list of allowed IPs if your Microsoft Azure storage account uses IP restrictions.
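For example, the following is a sketch of generating a container-level SAS token with the Azure CLI. The account name, account key, container name, and expiry date are placeholders, and read plus list permissions are assumed to be sufficient for an agentless copy:

az storage container generate-sas \
  --account-name STORAGE_ACCOUNT \
  --account-key ACCOUNT_KEY \
  --name CONTAINER_NAME \
  --permissions rl \
  --expiry 2027-01-01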
Schedule Hive managed tables transfer
Select one of the following options:
Console
Go to the Data transfers page in the Google Cloud console.
Click Create transfer.
In the Source type section, select Hive Managed Tables from the Source list.
For Location, select a location type, and then select a region.
In the Transfer config name section, for Display name, enter a name for the data transfer.
In the Schedule options section, do the following:
- In the Repeat frequency list, select an option to specify how often this data transfer runs. To specify a custom repeat frequency, select Custom. If you select On-demand, then this transfer runs when you manually trigger the transfer.
- If applicable, select either Start now or Start at set time, and provide a start date and run time.
In the Data source details section, do the following:
- For Transfer strategy, select one of the following:
  - FULL_TRANSFER: Transfer all data and register metadata with the target metastore. This is the default option.
  - METADATA_ONLY: Register metadata only. You must have data already present in the correct Cloud Storage location referenced in the metadata.
- For Table name patterns, specify HDFS data lake tables to transfer by providing table names or patterns that match tables in the HDFS database. You must use Java regular expression syntax to specify table patterns. For example:
  - db1..* matches all tables in db1.
  - db1.table1;db2.table2 matches table1 in db1 and table2 in db2.
- For BQMS discovery dump gcs path, enter the path to the hive-dumper-output.zip file that you generated when creating a metadata file for Apache Hive. If you are using dumper output orchestration with cron, provide the Cloud Storage folder path configured in --gcs-base-path, which contains the dumper output ZIP files.
- Choose the Metastore type from the drop-down list:
  - DATAPROC_METASTORE: Select this option to store your metadata in Dataproc Metastore. You must provide the URL for the Dataproc Metastore in Dataproc metastore url.
  - BIGLAKE_REST_CATALOG: Select this option to store your metadata in the BigLake metastore Iceberg REST catalog.
For Destination gcs path, enter a path to a Cloud Storage bucket to store your migrated data.
Optional: For Service account, enter a service account to use with this data transfer. The service account should belong to the same Google Cloud project where the transfer configuration and destination dataset are created.
For Storage type, select one of the following options. This field is only available if Transfer strategy is set to FULL_TRANSFER:
  - HDFS: Select this option if your file storage is HDFS. In the STS agent pool name field, you must provide the name of the agent pool that you created when you configured your Storage Transfer Agent.
  - S3: Select this option if your file storage is Amazon S3. In the Access key ID and Secret access key fields, you must provide the access key ID and secret access key that you created when you set up your access credentials.
  - AZURE: Select this option if your file storage is Azure Blob Storage. In the SAS token field, you must provide the SAS token that you created when you set up your access credentials.
Optional: For Partition Filter gcs path, enter a full Cloud Storage path to a custom filter JSON file to filter partitions.
bq
To schedule a Hive managed tables transfer, enter the bq mk
command and supply the transfer creation flag --transfer_config:
bq mk --transfer_config \
  --data_source=hadoop \
  --display_name='TRANSFER_NAME' \
  --service_account_name='SERVICE_ACCOUNT' \
  --project_id='PROJECT_ID' \
  --location='REGION' \
  --params='{
    "transfer_strategy":"TRANSFER_STRATEGY",
    "table_name_patterns":"LIST_OF_TABLES",
    "table_metadata_path":"gs://DUMPER_BUCKET/hive-dumper-output.zip",
    "target_gcs_file_path":"gs://MIGRATION_BUCKET",
    "metastore":"METASTORE",
    "destination_dataproc_metastore":"DATAPROC_METASTORE_URL",
    "destination_bigquery_dataset":"BIGLAKE_METASTORE_DATASET",
    "translation_output_gcs_path":"gs://TRANSLATION_OUTPUT_BUCKET/metadata/config/default_database/",
    "storage_type":"STORAGE_TYPE",
    "agent_pool_name":"AGENT_POOL_NAME",
    "aws_access_key_id":"AWS_ACCESS_KEY_ID",
    "aws_secret_access_key":"AWS_SECRET_ACCESS_KEY",
    "azure_sas_token":"AZURE_SAS_TOKEN",
    "partition_filter_gcs_path":"FILTER_GCS_PATH"
  }'
Replace the following:
- TRANSFER_NAME: the display name for the transfer configuration. The transfer name can be any value that lets you identify the transfer if you need to modify it later.
- SERVICE_ACCOUNT: the service account name used to authenticate your transfer. The service account should be owned by the same project_id used to create the transfer, and it should have all of the required permissions.
- PROJECT_ID: your Google Cloud project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
- REGION: the location of this transfer configuration.
- TRANSFER_STRATEGY: (Optional) Specify one of the following values:
  - FULL_TRANSFER: Transfer all data and register metadata with the target metastore. This is the default value.
  - METADATA_ONLY: Register metadata only. You must have data already present in the correct Cloud Storage location referenced in the metadata.
- LIST_OF_TABLES: a list of entities to transfer. Use a hierarchical naming spec: database.table. This field supports RE2 regular expressions to specify tables. For example:
  - db1..*: specifies all tables in the database db1.
  - db1.table1;db2.table2: a list of tables.
- DUMPER_BUCKET: the Cloud Storage bucket containing the hive-dumper-output.zip file. If you are using dumper output orchestration with cron, then change table_metadata_path to be the Cloud Storage folder path configured with --gcs-base-path in the cron setup, for example: "table_metadata_path":"GCS_PATH_TO_UPLOAD_DUMPER_OUTPUT".
- MIGRATION_BUCKET: the destination Cloud Storage path to which all underlying files are loaded. Available only if transfer_strategy is FULL_TRANSFER.
- METASTORE: the type of metastore to migrate to. Set this to one of the following values:
  - DATAPROC_METASTORE: to transfer metadata to Dataproc Metastore.
  - BIGLAKE_REST_CATALOG: to transfer metadata to the BigLake metastore Iceberg REST catalog.
- DATAPROC_METASTORE_URL: the URL of your Dataproc Metastore. Required if metastore is DATAPROC_METASTORE.
- BIGLAKE_METASTORE_DATASET: the BigQuery dataset for your BigLake metastore. Required if metastore is BIGLAKE_REST_CATALOG and transfer_strategy is FULL_TRANSFER.
- STORAGE_TYPE: the underlying file storage for your tables. Supported types are HDFS, S3, and AZURE. Required if transfer_strategy is FULL_TRANSFER.
- AGENT_POOL_NAME: the name of the agent pool used for creating agents. Required if storage_type is HDFS.
- AWS_ACCESS_KEY_ID: the access key ID from your access credentials. Required if storage_type is S3.
- AWS_SECRET_ACCESS_KEY: the secret access key from your access credentials. Required if storage_type is S3.
- AZURE_SAS_TOKEN: the SAS token from your access credentials. Required if storage_type is AZURE.
- FILTER_GCS_PATH: (Optional) a full Cloud Storage path to a custom filter JSON file to filter partitions.
Run this command to create the transfer configuration and start the Hive managed tables transfer. Transfers are scheduled to run every 24 hours by default, but can be configured with transfer scheduling options.
When the transfer is complete, your tables in the Hadoop cluster are
migrated to MIGRATION_BUCKET.
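If you created the configuration with an on-demand schedule, you can start a run manually. The following is a sketch using the standard bq transfer-run syntax; the resource name and run time are placeholders, with the resource name being the transferConfigs path returned by bq mk:

bq mk --transfer_run \
  --run_time='2026-03-01T00:00:00Z' \
  projects/PROJECT_NUMBER/locations/REGION/transferConfigs/CONFIG_ID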
Data ingestion options
The following sections provide more information about how you can configure your Hive managed tables transfers.
Incremental transfers
When a transfer configuration is set up with a recurring schedule, every subsequent transfer updates the table on Google Cloud with the latest updates made to the source table. For example, all insert, delete, or update operations, along with schema changes, are reflected in Google Cloud with each transfer.
Transfer scheduling options
By default, transfers are scheduled to
run every 24 hours. To configure how often transfers are run,
add the --schedule flag to the transfer configuration, and specify a transfer
schedule using the schedule syntax.
Hive managed tables transfers must have a minimum of 24 hours
between transfer runs.
For one-time transfers, you can add the
end_time flag to the transfer configuration to run the
transfer only once.
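For example, the following sketch changes an existing configuration to run every two days; the resource name placeholder is the transferConfigs path returned when the configuration was created:

bq update --transfer_config \
  --schedule='every 48 hours' \
  projects/PROJECT_NUMBER/locations/REGION/transferConfigs/CONFIG_ID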
Filter partitions
You can transfer a subset of partitions from your Hive
managed tables by providing a custom filter JSON file stored in
Cloud Storage. When scheduling the transfer, supply the full
Cloud Storage path to this JSON file using the partition_filter_gcs_path
parameter.
The following is an example of the filter JSON file structure:
{
"filters": [
{
"table": "db1.table1",
"condition": "IN",
"partition": ["partition1=value1/partition2=value2"]
},
{
"table": "db1.table2",
"condition": "LESS_THAN",
"partition": ["partition1;value1"]
},
{
"table": "db1.table3",
"condition": "GREATER_THAN",
"partition": ["partition1;value1"]
},
{
"table": "db1.table4",
"condition": "RANGE",
"partition": ["partition1;value1;value2"]
}
]
}
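Upload the filter file to Cloud Storage before you schedule the transfer. The file name and folder in this sketch are placeholders:

gsutil cp partition-filter.json gs://MIGRATION_BUCKET/filters/partition-filter.json

Then set the partition_filter_gcs_path parameter (or the Partition Filter gcs path console field) to gs://MIGRATION_BUCKET/filters/partition-filter.json.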
Filter conditions
The condition field in the JSON file supports the following values, each with
a specific format for the partition array:
- IN: Specifies the exact partition paths to include. The partition array contains strings representing the exact directory structure of the partitions relative to the table base path (for example, ["partition_key1=value1/partition_key2=value2"]). You can specify multiple paths in the array.
- LESS_THAN: Includes partitions where the primary partition key value is less than or equal to the specified value. The partition array must contain a single string in the format ["<partition_key>;<value>"].
- GREATER_THAN: Includes partitions where the primary partition key value is greater than or equal to the specified value. The partition array must contain a single string in the format ["<partition_key>;<value>"].
- RANGE: Includes partitions where the primary partition key value falls within the specified range (inclusive). The partition array must contain a single string in the format ["<partition_key>;<start_value>;<end_value>"].
The filter conditions are subject to the following rules and restrictions:
- Inclusive values: Filter conditions for GREATER_THAN, LESS_THAN, and RANGE are inclusive of the values provided. For example, a LESS_THAN filter with a value of 2023 includes partitions up to and including 2023.
- Partition deletion: If an existing destination partition satisfies the partition filter and is no longer present at the source, then it is dropped from the destination metastore. However, the underlying data files for that partition aren't deleted from the Cloud Storage destination bucket.
- Single table restrictions:
- Multiple filters on the same table aren't allowed.
  - You can't mix different condition types (for example, GREATER_THAN and IN) on the same table.
- Target partition column: Filter conditions like GREATER_THAN, LESS_THAN, and RANGE must target the primary partition column.
- Prefix limitations: The specified filter combination must not resolve to more than 1000 prefixes per table. For example, a filter like year>2020 on a table partitioned by year/month/day must result in fewer than 1000 unique year= prefixes.
Configure Translation output
You can configure a unique Cloud Storage path and database for each migrated table. To do so, perform the following steps to generate a tables mapping YAML file that you can use in your transfer configuration.
Create a configuration YAML file (suffixed with config.yaml) in the DUMPER_BUCKET that contains the following:

type: object_rewriter
relation:
- match:
    relationRegex: ".*"
  external:
    location_expression: "'gs://MIGRATION_BUCKET/' + table.schema + '/' + table.name"

Replace MIGRATION_BUCKET with the name of the Cloud Storage bucket that is the destination for your migrated table files. The location_expression field is a Common Expression Language (CEL) expression.
Create another configuration YAML file (suffixed with config.yaml) in the DUMPER_BUCKET that contains the following:

type: experimental_object_rewriter
relation:
- match:
    schema: SOURCE_DATABASE
  outputName:
    database: null
    schema: TARGET_DATABASE

Replace SOURCE_DATABASE and TARGET_DATABASE with the source database name and the Dataproc Metastore database or BigQuery dataset name, depending on the chosen metastore. Ensure that the BigQuery dataset exists if you are configuring the database for BigLake metastore.

For more information about these configuration YAML files, see Guidelines to create a configuration YAML file.
Generate the tables mapping YAML file using the following command:

curl -d '{
  "tasks": {
    "string": {
      "type": "HiveQL2BigQuery_Translation",
      "translation_details": {
        "target_base_uri": "TRANSLATION_OUTPUT_BUCKET",
        "source_target_mapping": {
          "source_spec": {
            "base_uri": "DUMPER_BUCKET"
          }
        },
        "target_types": ["metadata"]
      }
    }
  }
}' \
-H "Content-Type:application/json" \
-H "Authorization: Bearer TOKEN" \
-X POST https://bigquerymigration.googleapis.com/v2alpha/projects/PROJECT_ID/locations/LOCATION/workflows
Replace the following:
- TRANSLATION_OUTPUT_BUCKET: (Optional) Specify a Cloud Storage bucket for the translation output. For more information, see Using Translation output.
- DUMPER_BUCKET: the base URI for the Cloud Storage bucket that contains the hive-dumper-output.zip and configuration YAML files.
- TOKEN: the OAuth token. You can generate this in the command line with the command gcloud auth print-access-token.
- PROJECT_ID: the project to process the translation.
- LOCATION: the location where the job is processed. For example, eu or us.
Monitor the status of this job. When it is complete, a mapping file is generated for each table in the database within a predefined path in
TRANSLATION_OUTPUT_BUCKET.
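One way to monitor the job is to poll the workflow resource that the create call returns. The following is a sketch, where WORKFLOW_ID is a placeholder for the identifier in the create response:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://bigquerymigration.googleapis.com/v2alpha/projects/PROJECT_ID/locations/LOCATION/workflows/WORKFLOW_ID"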
Orchestrate dumper execution by using the cron command
You can automate incremental transfers by using a cron job to execute the dwh-migration-dumper tool. By automating metadata extraction, you ensure that an up-to-date dump from Hadoop is available for subsequent incremental transfer runs.
Before you begin
Before using this automation script, you must do the following:
- Complete all dumper installation prerequisites, including installing the dwh-migration-dumper tool and configuring IAM permissions.
- Install the Google Cloud CLI. The script uses the gsutil command-line tool to upload dumper output to Cloud Storage.
- Authenticate with Google Cloud to allow gsutil to upload files to Cloud Storage with the following command:
  gcloud auth application-default login
Schedule the automation
Save the following script to a local file. This script is designed to be configured and executed by a cron daemon to automate the extraction and upload of dumper output:

#!/bin/bash

# Exit immediately if a command exits with a non-zero status.
set -e
# Treat unset variables as an error when substituting.
set -u
# Pipelines return the exit status of the last command to exit with a non-zero status.
set -o pipefail

# These values are used if not overridden by command-line options.
DUMPER_EXECUTABLE="DUMPER_PATH/dwh-migration-dumper"
GCS_BASE_PATH="gs://PATH_TO_DUMPER_OUTPUT"
LOCAL_BASE_DIR="LOCAL_BASE_DIRECTORY_PATH"

# Optional arguments for cloud environments
DUMPER_HOST=""
DUMPER_PORT=""
HIVE_KERBEROS_URL=""
HIVEQL_RPC_PROTECTION=""
KERBEROS_AUTHENTICATION="false"

# Function to display usage information
usage() {
  echo "Usage: $0 [options]"
  echo ""
  echo "Runs the dwh-migration-dumper tool and uploads its output to provided Cloud Storage path."
  echo ""
  echo "Required Options:"
  echo "  --dumper-executable        The full path to the dumper executable."
  echo "  --gcs-base-path            The base Cloud Storage folder to upload dumper output files to."
  echo "                             The script generates timestamped ZIP files in this folder."
  echo "  --local-base-dir           The local base directory for logs and temp files."
  echo ""
  echo "Optional Hive connection options:"
  echo "  --host                     The hostname for the dumper connection."
  echo "  --port                     The port number for the dumper connection."
  echo ""
  echo "To use Kerberos authentication, include the following options."
  echo "If --kerberos-authentication is specified, then --host, --port,"
  echo "--hive-kerberos-url and --hiveql-rpc-protection are all required:"
  echo ""
  echo "  --kerberos-authentication  Enable Kerberos authentication."
  echo "  --hive-kerberos-url        The Hive Kerberos URL."
  echo "  --hiveql-rpc-protection    The hiveql-rpc-protection level, equal to the value of"
  echo "                             'hadoop.rpc.protection' in '/etc/hadoop/conf/core-site.xml',"
  echo "                             with one of the following values:"
  echo "                             - authentication"
  echo "                             - integrity"
  echo "                             - privacy"
  echo ""
  echo "Other Options:"
  echo "  -h, --help                 Display this help message and exit."
  exit 1
}

# This loop processes command-line options and overrides the default configuration.
while [[ "$#" -gt 0 ]]; do
  case $1 in
    --dumper-executable)
      DUMPER_EXECUTABLE="$2"
      shift # past argument
      shift # past value
      ;;
    --gcs-base-path)
      GCS_BASE_PATH="$2"
      shift
      shift
      ;;
    --local-base-dir)
      LOCAL_BASE_DIR="$2"
      shift
      shift
      ;;
    --host)
      DUMPER_HOST="$2"
      shift
      shift
      ;;
    --port)
      DUMPER_PORT="$2"
      shift
      shift
      ;;
    --hive-kerberos-url)
      HIVE_KERBEROS_URL="$2"
      shift
      shift
      ;;
    --hiveql-rpc-protection)
      HIVEQL_RPC_PROTECTION="$2"
      shift
      shift
      ;;
    --kerberos-authentication)
      KERBEROS_AUTHENTICATION="true"
      shift
      ;;
    -h|--help)
      usage
      ;;
    *)
      echo "Unknown option: $1"
      usage
      ;;
  esac
done

# This runs AFTER parsing arguments to ensure no placeholder values are left.
if [[ "$DUMPER_EXECUTABLE" == "DUMPER_PATH"* || "$GCS_BASE_PATH" == "gs://PATH_TO_DUMPER_OUTPUT" || "$LOCAL_BASE_DIR" == "LOCAL_BASE_DIRECTORY_PATH" ]]; then
  echo "ERROR: One or more configuration variables have not been set. Please provide them as command-line arguments or edit the script." >&2
  echo "Run with --help for more information." >&2
  exit 1
fi

# If Kerberos authentication is enabled, check for required fields.
if [[ "$KERBEROS_AUTHENTICATION" == "true" ]]; then
  if [[ -z "$DUMPER_HOST" || -z "$DUMPER_PORT" || -z "$HIVE_KERBEROS_URL" || -z "$HIVEQL_RPC_PROTECTION" ]]; then
    echo "ERROR: If --kerberos-authentication is enabled, --host, --port, --hive-kerberos-url and --hiveql-rpc-protection must be provided." >&2
    echo "Run with --help for more information." >&2
    exit 1
  fi
fi

# Remove trailing slashes from GCS_BASE_PATH, if any.
GCS_BASE_PATH=$(echo "${GCS_BASE_PATH}" | sed 's:/*$::')

# Create unique timestamp and directories for this run
EPOCH=$(date +%s)
LOCAL_LOG_DIR="${LOCAL_BASE_DIR}/logs"
mkdir -p "${LOCAL_LOG_DIR}" # Ensures the base and logs directories exist

# Define the unique log and zip file path for this run
LOG_FILE="${LOCAL_LOG_DIR}/dumper_execution_${EPOCH}.log"
ZIP_FILE_NAME="dts-cron-dumper-output_${EPOCH}.zip"
LOCAL_ZIP_PATH="${LOCAL_BASE_DIR}/${ZIP_FILE_NAME}"

echo "Script execution started. All subsequent output will be logged to: ${LOG_FILE}"

# --- Helper Functions ---
log() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $@" >> "${LOG_FILE}"
}

cleanup() {
  local path_to_remove="$1"
  log "Cleaning up local file/directory: ${path_to_remove}..."
  rm -rf "${path_to_remove}"
}

# This function is called when the script exits to ensure cleanup and logging happen reliably.
handle_exit() {
  local exit_code=$?
  # Only run the failure logic if the script is exiting with an error
  if [[ ${exit_code} -ne 0 ]]; then
    log "ERROR: Script is exiting with a failure code (${exit_code})."
    local gcs_log_path_on_failure="${GCS_BASE_PATH}/logs/$(basename "${LOG_FILE}")"
    log "Uploading log file to ${gcs_log_path_on_failure} for debugging..."
    # Attempt to upload the log file on failure, but don't let this command cause the script to exit.
    gsutil cp "${LOG_FILE}" "${gcs_log_path_on_failure}" > /dev/null 2>&1 || log "WARNING: Failed to upload log file to Cloud Storage."
  else
    # SUCCESS PATH
    log "Script finished successfully. Now cleaning up local zip file...."
    # Clean up the local zip file ONLY on success
    cleanup "${LOCAL_ZIP_PATH}"
  fi
  log "*****Script End*****"
  exit ${exit_code}
}

# Trap the EXIT signal to run the handle_exit function, ensuring cleanup always happens.
trap handle_exit EXIT

# Validates the dumper log file based on a strict set of rules.
validate_dumper_output() {
  local log_file_to_check="$1"
  # Check for the specific success message from the dumper tool.
  if grep -q "Dumper execution: SUCCEEDED" "${log_file_to_check}"; then
    log "Validation Successful: Found 'Dumper execution: SUCCEEDED' message."
    return 0 # Success
  else
    log "ERROR: Validation failed. The 'Dumper execution: SUCCEEDED' message was not found."
    return 1 # Failure
  fi
}

# --- Main Script Logic ---
log "*****Script Start*****"
log "Dumper Executable: ${DUMPER_EXECUTABLE}"
log "Cloud Storage Base Path: ${GCS_BASE_PATH}"
log "Local Base Directory: ${LOCAL_BASE_DIR}"

# Use an array to build the command safely
dumper_command_args=(
  "--connector" "hiveql"
  "--output" "${LOCAL_ZIP_PATH}"
)

# Add optional arguments if they are provided
if [[ -n "${DUMPER_HOST}" ]]; then
  dumper_command_args+=("--host" "${DUMPER_HOST}")
  log "Using Host: ${DUMPER_HOST}"
fi

if [[ -n "${DUMPER_PORT}" ]]; then
  dumper_command_args+=("--port" "${DUMPER_PORT}")
  log "Using Port: ${DUMPER_PORT}"
fi

if [[ -n "${HIVE_KERBEROS_URL}" ]]; then
  dumper_command_args+=("--hive-kerberos-url" "${HIVE_KERBEROS_URL}")
  log "Using Hive Kerberos URL: ${HIVE_KERBEROS_URL}"
fi

if [[ -n "${HIVEQL_RPC_PROTECTION}" ]]; then
  dumper_command_args+=("-Dhiveql.rpc.protection=${HIVEQL_RPC_PROTECTION}")
  log "Using HiveQL RPC Protection: ${HIVEQL_RPC_PROTECTION}"
fi

log "Starting dumper tool execution..."
log "COMMAND: JAVA_OPTS=\"-Djavax.security.auth.useSubjectCredsOnly=false\" ${DUMPER_EXECUTABLE} ${dumper_command_args[*]}"

JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false" "${DUMPER_EXECUTABLE}" "${dumper_command_args[@]}" >> "${LOG_FILE}" 2>&1
log "Dumper process finished."

# Validate the output from the dumper execution for success or failure.
validate_dumper_output "${LOG_FILE}"

# Upload the ZIP file to Cloud Storage
gcs_zip_path="${GCS_BASE_PATH}/${ZIP_FILE_NAME}"
log "Uploading ${LOCAL_ZIP_PATH} to ${gcs_zip_path}..."

if [ ! -f "${LOCAL_ZIP_PATH}" ]; then
  log "ERROR: Expected ZIP file ${LOCAL_ZIP_PATH} not found after dumper execution."
  # The script will exit here with an error code, and the trap will run.
  exit 1
fi

gsutil cp "${LOCAL_ZIP_PATH}" "${gcs_zip_path}" >> "${LOG_FILE}" 2>&1
log "Upload to Cloud Storage successful."

# The script will now exit with code 0. The trap will call cleanup and log the script end.

Run the following command to make the script executable:
chmod +x PATH_TO_SCRIPT
Schedule the script using crontab, replacing the variables with appropriate values for your job. Add an entry to schedule the job. The following examples run the script every day at 2:30 AM.

If you are running on a host that has direct access to Apache Hive and doesn't require Kerberos authentication, use the following entry:

Without Kerberos authentication

# Run the Hive dumper daily at 2:30 AM for incremental BigQuery transfer.
30 2 * * * PATH_TO_SCRIPT --dumper-executable PATH_TO_DUMPER_EXECUTABLE --gcs-base-path GCS_PATH_TO_UPLOAD_DUMPER_OUTPUT --local-base-dir LOCAL_PATH_TO_SAVE_INTERMEDIARY_FILES

If your Apache Hive instance requires Kerberos authentication, use the following entry:
With Kerberos authentication
# Run the Hive dumper daily at 2:30 AM for incremental BigQuery transfer with Kerberos authentication.
30 2 * * * PATH_TO_SCRIPT --dumper-executable PATH_TO_DUMPER_EXECUTABLE --gcs-base-path GCS_PATH_TO_UPLOAD_DUMPER_OUTPUT --local-base-dir LOCAL_PATH_TO_SAVE_INTERMEDIARY_FILES --kerberos-authentication --host HIVE_HOST --port HIVE_PORT --hive-kerberos-url HIVE_KERBEROS_URL --hiveql-rpc-protection HIVEQL_RPC_PROTECTION

When creating the transfer, ensure that the table_metadata_path field is set to the same Cloud Storage path that you configured for GCS_PATH_TO_UPLOAD_DUMPER_OUTPUT. This is the path containing the dumper output ZIP files.
Scheduling considerations
To avoid data staleness, the metadata dump must be ready before your scheduled
transfer begins. Configure the cron job frequency accordingly.
We recommend performing a few trial runs of the script manually to determine the
average time it takes for the dumper tool to generate its output. Use this
timing to set a cron schedule that safely precedes your DTS transfer run and
ensures freshness.
Monitor and view transfer status
You can monitor resource-level transfers for individual tables to track progress, view granular error details, and query the state of specific resources being migrated.
To view the progress and status of your resources, select one of the following options:
Console
In the Google Cloud console, go to the Data transfers page.
Click your transfer configuration from the list.
On the Transfer details page, click the Tables transferred tab.
View the list of resources being transferred. You can see details like the following:
- Last transfer status: the current state of the resource based on the latest resource transfer, including completion progress.
- Table name: the name of the resource being transferred. Click the resource name to see a detailed view of the resource.
- Latest run: the last transfer run that updated the resource.
- Status summary: granular progress metrics or error messages if the transfer failed.
- Last successful run: the last run that successfully transferred the resource.
Use the filter bar to search for specific resources by name or filter by
their current status, for example, Failed transfers. The Table name
filter supports wildcard matching (for example, using *), but wildcard
matching isn't supported for other filter fields.
API
You can query the status of transfer resources using the BigQuery Data Transfer Service API.
List all resources and their statuses
To list all resources and their statuses, use the
projects.locations.transferConfigs.transferResources.list method.
Run the API request with the following information:
GET https://bigquerydatatransfer.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID/transferResources

Example response (abridged):

{
  "transferResources": [
    {
      "name": "projects/.../transferResources/table1",
      "latestStatusDetail": {
        "state": "RESOURCE_TRANSFER_SUCCEEDED",
        "completedPercentage": 100.0
      },
      "updateTime": "2026-02-03T22:42:06Z"
    }
  ]
}
curl command:
curl -X GET \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID/transferResources" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Accept: application/json"
You can filter the results by resource name or state. For example, to find
all failed transfers, add
?filter=latest_status_detail.state="RESOURCE_TRANSFER_FAILED"
to the request URL.
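Because the filter value contains quotes and an equals sign, it must be URL-encoded. For example, you can let curl encode it for you; a sketch using curl's -G and --data-urlencode options, which append the encoded value to a GET request's query string:

curl -G \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID/transferResources" \
  --data-urlencode 'filter=latest_status_detail.state="RESOURCE_TRANSFER_FAILED"' \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Accept: application/json"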
Replace the following:
- CONFIG_ID: the ID of the transfer configuration.
- LOCATION: the location where the transfer configuration was created.
- PROJECT_ID: the ID of the Google Cloud project that's running the transfers.
Get a specific resource
To get the status of a specific table or partition, use the
projects.locations.transferConfigs.transferResources.get method.
Run the API request with the following information:
GET https://bigquerydatatransfer.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID/transferResources/RESOURCE_ID
curl command:
curl -X GET \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/transferConfigs/CONFIG_ID/transferResources/RESOURCE_ID" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Accept: application/json"
Replace the following:
- CONFIG_ID: the ID of the transfer configuration.
- LOCATION: the location where the transfer configuration was created.
- PROJECT_ID: the ID of the Google Cloud project that's running the transfers.
- RESOURCE_ID: the ID of the resource, for example, the table name.