This page shows you how to resolve issues with Cloud Data Fusion batch pipelines.
Pipeline error: Text file busy
The following error occurs when you run a batch pipeline, causing it to fail:
error=26, Text file busy
To resolve this issue, set up a trigger that automatically retries a pipeline when it fails.
- Stop the pipeline.
- Create a trigger. In this case, when you select an event to execute, choose Fails. For more information, see Create an inbound trigger on a downstream pipeline.
- Start the pipeline.
Concurrent pipeline is stuck
In Cloud Data Fusion, running many concurrent batch pipelines can put a
strain on the instance, causing jobs to get stuck in Starting, Provisioning,
or Running states. As a result, pipelines can't be stopped through the web
interface or API calls. When you run many pipelines concurrently, the web
interface can become slow or unresponsive. This issue occurs due to multiple UI
requests made to the HTTP handler in the backend.
To resolve this issue, control the number of new requests using Cloud Data Fusion flow control.
SSH connection times out while running a pipeline
The following error occurs when you run a batch pipeline:
java.io.IOException: com.jcraft.jsch.JSchException:
java.net.ConnectException: Connection timed out (Connection timed out)
To resolve this issue, do the following:
- Check for a missing firewall rule (typically port 22). To create a new firewall rule, see Dataproc cluster network configuration.
- Check that the Compute Engine enforcer allows the connection between your Cloud Data Fusion instance and the Dataproc cluster.
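As a rough first check for the firewall item above, the following Python sketch tests whether a TCP connection to port 22 can be opened at all. The function name `can_reach` and the example address are illustrative assumptions, not part of Cloud Data Fusion. The result only reflects the network path from the machine where you run the script, so run it from a host in the same network as the instance where possible.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


# Hypothetical usage: replace the address with the internal IP of a cluster node.
# print(can_reach("10.128.0.5"))
```

A False result is consistent with a missing firewall rule, but it can also mean that the host is down or the address is wrong, so treat it as a hint rather than a diagnosis.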
Response code: 401. Error: unknown error
The following error occurs when you run a batch pipeline:
java.io.IOException: Failed to send message for program run program_run:
Response code: 401. Error: unknown error
To resolve this issue, you must
grant the Cloud Data Fusion Runner role (roles/datafusion.runner)
to the service account used by Dataproc.
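One way to grant the role is with the `gcloud projects add-iam-policy-binding` command. The sketch below only assembles that command so that you can review it before running it. It assumes a project-level binding, and the helper name, project ID, and service account email are placeholders.

```python
import shlex
from typing import List


def runner_role_binding_command(project_id: str, service_account_email: str) -> List[str]:
    """Build the gcloud command that grants roles/datafusion.runner on a project."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account_email}",
        "--role=roles/datafusion.runner",
    ]


# Placeholder values: substitute your own project ID and Dataproc service account.
cmd = runner_role_binding_command(
    "my-project", "dataproc-sa@my-project.iam.gserviceaccount.com"
)
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch free of side effects. Pass the list to `subprocess.run` only after you confirm the values.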
Pipeline with BigQuery plugin fails with Access Denied error
There is a known issue where a pipeline fails with an Access Denied error when
running BigQuery jobs. This impacts pipelines that use the
following plugins:
- BigQuery sources
- BigQuery sinks
- BigQuery Multi Table sinks
- Transformation Pushdown
Example error in the logs (might differ depending on the plugin you are using):
POST https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "Access Denied: Project xxxx: User does not have bigquery.jobs.create permission in project PROJECT_ID",
"reason" : "accessDenied"
} ],
"message" : "Access Denied: Project PROJECT_ID: User does not have bigquery.jobs.create permission in project PROJECT_ID.",
"status" : "PERMISSION_DENIED"
}
In this example, PROJECT_ID is the project ID that you specified in the plugin. The service account for the project specified in the plugin doesn't have permission to do at least one of the following:
- Run a BigQuery job
- Read a BigQuery dataset
- Create a temporary bucket
- Create a BigQuery dataset
- Create the BigQuery table
To resolve this issue, grant the missing roles to the service account on the project (PROJECT_ID) that you specified in the plugin:
- To run a BigQuery job, grant the BigQuery Job User role (roles/bigquery.jobUser).
- To read a BigQuery dataset, grant the BigQuery Data Viewer role (roles/bigquery.dataViewer).
- To create a temporary bucket, grant the Storage Admin role (roles/storage.admin).
- To create a BigQuery dataset or table, grant the BigQuery Data Editor role (roles/bigquery.dataEditor).
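To work out which role is missing, it can help to pull the permission name out of the logged error body. The following Python sketch does that for a body shaped like the example error above. The function name and the one-entry lookup table are illustrative assumptions, and the lookup covers only the permission that appears in the example.

```python
import json
import re
from typing import Optional, Tuple

# Only the permission from the example error is mapped; extend as needed.
ROLE_FOR_PERMISSION = {"bigquery.jobs.create": "roles/bigquery.jobUser"}

_PATTERN = re.compile(r"User does not have (\S+) permission in project (\S+?)\.?$")


def missing_permission(error_body: str) -> Optional[Tuple[str, str]]:
    """Return (permission, project) parsed from the top-level message, or None."""
    message = json.loads(error_body).get("message", "")
    match = _PATTERN.search(message)
    return (match.group(1), match.group(2)) if match else None


body = json.dumps({
    "code": 403,
    "message": "Access Denied: Project PROJECT_ID: User does not have "
               "bigquery.jobs.create permission in project PROJECT_ID.",
    "status": "PERMISSION_DENIED",
})
found = missing_permission(body)
print(found, ROLE_FOR_PERMISSION.get(found[0]) if found else None)
```

If the message has a different shape, the function returns None rather than guessing, because the log format might differ depending on the plugin.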
For more information, see the plugin's troubleshooting documentation (Google BigQuery Multi Table Sink Troubleshooting).
Pipeline doesn't stop at the error threshold
A pipeline might not stop after multiple errors, even if you set the error
threshold to 1.
The error threshold applies only to exceptions that a directive raises on failure and that are not otherwise handled. If the directive already uses the emitError API, then the error threshold is not activated.
To design a pipeline that fails when a certain threshold is met, use the
FAIL directive.
Whenever the condition passed to the FAIL directive is satisfied, it counts
against the error threshold and the pipeline fails after the threshold is
reached.
Delete an ephemeral Dataproc cluster
When Cloud Data Fusion creates an ephemeral Dataproc cluster during pipeline run provisioning, the cluster gets deleted after the pipeline run is finished. In rare cases, the cluster deletion fails.
Strongly recommended: Upgrade to the most recent Cloud Data Fusion version to ensure proper cluster maintenance.
Set Max Idle Time
To resolve this issue, configure the Max Idle Time
option. This lets Dataproc delete clusters automatically, even if
an explicit call on the pipeline finish fails.
Max Idle Time is available in Cloud Data Fusion versions 6.4 and later.
Recommended: For versions prior to 6.6, set Max Idle Time manually to 30
minutes or greater.
Delete clusters manually
If you can't upgrade your version or configure the Max Idle Time option,
instead delete stale clusters manually:
- Get each project ID where the clusters were created:
  - In the pipeline's runtime arguments, check if the Dataproc project ID is customized for the run.
  - If a Dataproc project ID is not specified explicitly, determine which provisioner is used, and then check for a project ID:
    - In the pipeline runtime arguments, check the system.profile.name value.
    - Open the provisioner settings and check if the Dataproc project ID is set. If the setting is not present or the field is empty, the project that the Cloud Data Fusion instance is running in is used.
- For each project:
  - Open the project in the Google Cloud console and go to the Dataproc Clusters page.
  - Sort the clusters by the date that they were created, from oldest to newest.
  - If the info panel is hidden, click Show info panel and go to the Labels tab.
  - For every cluster that is not in use (for example, more than a day has elapsed), check if it has a Cloud Data Fusion version label. That label indicates that the cluster was created by Cloud Data Fusion.
  - Select the checkbox by the cluster name and click Delete.
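The selection rule in the per-project steps can be sketched in Python as follows. The sketch works on simplified records with hypothetical `name`, `created`, and `labels` fields, so mapping real cluster listings into that shape is left to you. The label key is passed in as a parameter because the exact key is not stated here, and `example-version-label` is a placeholder.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_candidates(
    clusters: List[Dict],
    version_label: str,
    now: datetime,
    max_age: timedelta = timedelta(days=1),
) -> List[str]:
    """Return names of clusters older than max_age that carry version_label, oldest first."""
    oldest_first = sorted(clusters, key=lambda c: c["created"])
    return [
        c["name"]
        for c in oldest_first
        if version_label in c["labels"] and now - c["created"] > max_age
    ]


# Hypothetical sample data; the label key and values are placeholders.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
clusters = [
    {"name": "cluster-a", "created": now - timedelta(days=3),
     "labels": {"example-version-label": "placeholder"}},
    {"name": "cluster-b", "created": now - timedelta(hours=2),
     "labels": {"example-version-label": "placeholder"}},
    {"name": "cluster-c", "created": now - timedelta(days=5), "labels": {}},
]
candidates = stale_candidates(clusters, "example-version-label", now)
print(candidates)  # prints ['cluster-a']
```

Age alone doesn't prove that a cluster is unused, so treat the result as a list to review, not a list to delete automatically.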
Pipelines fail when run on Dataproc clusters with primary or secondary workers
In Cloud Data Fusion versions 6.8 and 6.9, the following issue causes pipelines to fail when they run on Dataproc clusters:
ERROR [provisioning-task-2:i.c.c.i.p.t.ProvisioningTask@161] - PROVISION task failed in REQUESTING_CREATE state for program run program_run:default.APP_NAME.UUID.workflow.DataPipelineWorkflow.RUN_ID due to
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
Caused by: com.google.protobuf.GeneratedMessageV3$Builder.parseUnknownField(Lcom/google/protobuf/CodedInputStream;Lcom/google/protobuf/ExtensionRegistryLite;I)Z.
To resolve this issue,
upgrade to the patch
revision 6.8.3.1, 6.9.2.1, or later.
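The version rule above can be expressed as a small check. This Python sketch assumes that version strings consist only of dot-separated integers, and the function names are illustrative.

```python
from typing import Tuple

# First fixed patch revision for each affected minor release line.
FIXED_IN = {(6, 8): (6, 8, 3, 1), (6, 9): (6, 9, 2, 1)}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '6.8.3.1' into (6, 8, 3, 1)."""
    return tuple(int(part) for part in version.split("."))


def needs_patch(version: str) -> bool:
    """Return True if the version is in an affected line and older than its fix."""
    parsed = parse_version(version)
    fixed = FIXED_IN.get(parsed[:2])
    return fixed is not None and parsed < fixed


for v in ("6.8.3", "6.8.3.1", "6.9.2", "6.10.1"):
    print(v, needs_patch(v))
```

Tuple comparison treats a shorter version such as 6.8.3 as older than 6.8.3.1, which matches the patch numbering used here.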
Cloud Storage plugin intermittently fails with regular expression on Dataproc 2.0
Pipelines might intermittently fail in Cloud Data Fusion version 6.10.1
when the Cloud Storage plugin uses a * regular expression pattern in
the path and the execution environment is Dataproc 2.0.
To resolve this issue, perform one of the following:
- Update the Dataproc image to version 2.1 or later.
- Revert to an earlier version of the Cloud Storage plugin.
- Increase the memory allocated for the Dataproc executor.
Pipelines fail with NoSuchMethodError on Dataproc 2.2
In Cloud Data Fusion version 6.10.1.1 and later, pipelines might fail with the following error when running on Dataproc 2.2:
java.lang.NoSuchMethodError: 'org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
org.apache.spark.sql.catalyst.encoders.RowEncoder.apply(org.apache.spark.sql.types.StructType)'
To resolve this issue, perform one of the following:
- Reconfigure the Dataproc image to version 2.1. For more information, see Change the Dataproc image version in Cloud Data Fusion.
- Upgrade your Cloud Data Fusion instance to version 6.11.
Multiple database tables plugin fails if Reference Name contains spaces
In Cloud Data Fusion version 6.10.1 and later, pipelines using the Multiple database tables batch source might fail when the Reference Name field contains space characters.
To resolve this issue, update the Multiple database tables plugin to version 1.4.1 or later from the Hub. The updated version doesn't allow spaces in the Reference Name field.