Running classic templates  |  Cloud Dataflow  |  Google Cloud

After you create and stage your Dataflow template, run the template with the Google Cloud console, REST API, or the Google Cloud CLI. You can deploy Dataflow template jobs from many environments, including App Engine standard environment, Cloud Functions, and other constrained environments.

Use the Google Cloud console

You can use the Google Cloud console to run Google-provided and custom Dataflow templates.

Google-provided templates

To run a Google-provided template:

  1. Go to the Dataflow page in the Google Cloud console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select the Google-provided template that you want to run from the Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field.
  5. Enter your parameter values in the provided parameter fields. You don't need the Additional Parameters section when you use a Google-provided template.
  6. Click Run Job.

Custom templates

To run a custom template:

  1. Go to the Dataflow page in the Google Cloud console.
  2. Click CREATE JOB FROM TEMPLATE.
  3. Select Custom Template from the Dataflow template drop-down menu.
  4. Enter a job name in the Job Name field.
  5. Enter the Cloud Storage path to your template file in the Template Cloud Storage path field.
  6. If your template needs parameters, click ADD PARAMETER in the Additional parameters section. Enter the Name and Value of the parameter. Repeat this step for each needed parameter.
  7. Click Run Job.

Use the REST API

To run a template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

See the REST API reference for projects.locations.templates.launch to learn more about the available parameters.

Create a custom template batch job

This example projects.locations.templates.launch request creates a batch job from a template that reads a text file and writes an output text file. If the request is successful, the response body contains an instance of LaunchTemplateResponse.

Modify the following values:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace LOCATION with the Dataflow region of your choice.
  • Replace JOB_NAME with a job name of your choice.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Set gcsPath to the Cloud Storage location of the template file.
  • Set parameters to your list of key-value pairs.
  • Set tempLocation to a location where you have write permission. This value is required to run Google-provided templates.
 POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://YOUR_BUCKET_NAME/templates/TemplateName
 {
     "jobName": "JOB_NAME",
     "parameters": {
         "inputFile": "gs://YOUR_BUCKET_NAME/input/my_input.txt",
         "output": "gs://YOUR_BUCKET_NAME/output/my_output"
     },
     "environment": {
         "tempLocation": "gs://YOUR_BUCKET_NAME/temp",
         "zone": "us-central1-f"
     }
 }
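If you want to send this request from code rather than from an HTTP tool, the following is a minimal Python sketch that posts the same body. It is an illustration, not part of the official API surface: it assumes the google-auth package is installed and Application Default Credentials are configured, and it keeps the placeholder values above for you to substitute.

# Minimal sketch: launch a classic template with a direct REST call.
# Assumes google-auth is installed and Application Default Credentials
# are configured (for example, via `gcloud auth application-default login`).
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

project_id = "YOUR_PROJECT_ID"  # replace with your project ID
location = "LOCATION"           # replace with your Dataflow region
gcs_path = "gs://YOUR_BUCKET_NAME/templates/TemplateName"

url = (
    f"https://dataflow.googleapis.com/v1b3/projects/{project_id}"
    f"/locations/{location}/templates:launch"
)
body = {
    "jobName": "JOB_NAME",
    "parameters": {
        "inputFile": "gs://YOUR_BUCKET_NAME/input/my_input.txt",
        "output": "gs://YOUR_BUCKET_NAME/output/my_output",
    },
    "environment": {
        "tempLocation": "gs://YOUR_BUCKET_NAME/temp",
        "zone": "us-central1-f",
    },
}

# gcsPath is passed as a query parameter, as in the POST example above.
response = session.post(url, params={"gcsPath": gcs_path}, json=body)
response.raise_for_status()
print(response.json())  # the LaunchTemplateResponse for the created job

On success, the returned JSON is the LaunchTemplateResponse described above; its job field contains the created job, including the job ID that appears in the monitoring interface.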

Create a custom template streaming job

This example projects.locations.templates.launch request creates a streaming job from a classic template that reads from a Pub/Sub subscription and writes to a BigQuery table. If you want to launch a Flex Template, use projects.locations.flexTemplates.launch instead. The example template is a Google-provided template. You can modify the path in the template to point to a custom template. The same logic is used to launch Google-provided and custom templates. In this example, the BigQuery table must already exist with the appropriate schema. If successful, the response body contains an instance of LaunchTemplateResponse.

Modify the following values:

  • Replace YOUR_PROJECT_ID with your project ID.
  • Replace LOCATION with the Dataflow region of your choice.
  • Replace JOB_NAME with a job name of your choice.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Replace GCS_PATH with the Cloud Storage location of the template file. The location should start with gs://
  • Set parameters to your list of key-value pairs. The parameters listed are specific to this template example. If you're using a custom template, modify the parameters as needed. If you're using the example template, replace the following variables.
    • Replace YOUR_SUBSCRIPTION_NAME with your Pub/Sub subscription name.
    • Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
  • Set tempLocation to a location where you have write permission. This value is required to run Google-provided templates.
 POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=GCS_PATH
 {
     "jobName": "JOB_NAME",
     "parameters": {
         "inputSubscription": "projects/YOUR_PROJECT_ID/subscriptions/YOUR_SUBSCRIPTION_NAME",
         "outputTableSpec": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME"
     },
     "environment": {
         "tempLocation": "gs://YOUR_BUCKET_NAME/temp",
         "zone": "us-central1-f"
     }
 }

Update a custom template streaming job

This example projects.locations.templates.launch request shows you how to update a template streaming job. If you want to update a Flex Template, use projects.locations.flexTemplates.launch instead.

  1. Run the Create a custom template streaming job example above to start a streaming template job.
  2. Send the following HTTP POST request, with the following modified values:
    • Replace YOUR_PROJECT_ID with your project ID.
    • Replace LOCATION with the Dataflow region of the job that you're updating.
    • Replace JOB_NAME with the exact name of the job that you want to update.
    • Replace GCS_PATH with the Cloud Storage location of the template file. The location should start with gs://
    • Set parameters to your list of key-value pairs. The parameters listed are specific to this template example. If you're using a custom template, modify the parameters as needed. If you're using the example template, replace the following variables.
      • Replace YOUR_SUBSCRIPTION_NAME with your Pub/Sub subscription name.
      • Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
    • Use the environment parameter to change environment settings, such as the machine type. This example uses the n2-highmem-2 machine type, which has more memory and CPU per worker than the default machine type.
      POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=GCS_PATH
      {
          "jobName": "JOB_NAME",
          "parameters": {
              "inputSubscription": "projects/YOUR_PROJECT_ID/subscriptions/YOUR_SUBSCRIPTION_NAME",
              "outputTableSpec": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME"
          },
          "environment": {
              "machineType": "n2-highmem-2"
          },
          "update": true
      }
  3. Access the Dataflow monitoring interface and verify that a new job with the same name was created. This job has the status Updated.
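You can also send the update request from step 2 programmatically. The following Python sketch mirrors that request under the same assumptions as the earlier batch sketch (google-auth installed, Application Default Credentials configured); the scaffolding is repeated here only so the snippet stays self-contained, and the placeholders are yours to substitute.

# Minimal sketch: update a running streaming template job over REST.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

url = (
    "https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID"
    "/locations/LOCATION/templates:launch"
)
body = {
    # Must exactly match the name of the running job that you want to update.
    "jobName": "JOB_NAME",
    "parameters": {
        "inputSubscription": "projects/YOUR_PROJECT_ID/subscriptions/YOUR_SUBSCRIPTION_NAME",
        "outputTableSpec": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME",
    },
    "environment": {"machineType": "n2-highmem-2"},
    "update": True,  # ask Dataflow to update the running job in place
}

response = session.post(url, params={"gcsPath": "GCS_PATH"}, json=body)
response.raise_for_status()
print(response.json())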

Use the Google API Client Libraries

Consider using the Google API Client Libraries to easily make calls to the Dataflow REST APIs. This sample script uses the Google API Client Library for Python.

In this example, you must set the following variables:

  • project: Set to your project ID.
  • job: Set to a unique job name of your choice.
  • template: Set to the Cloud Storage location of the template file.
  • parameters: Set to a dictionary with the template parameters.

To set the region, include the location parameter.

from googleapiclient.discovery import build

# project = 'your-gcp-project'
# job = 'unique-job-name'
# template = 'gs://dataflow-templates/latest/Word_Count'
# parameters = {
#     'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
#     'output': 'gs://<your-gcs-bucket>/wordcount/outputs',
# }

dataflow = build("dataflow", "v1b3")
request = (
    dataflow.projects()
    .templates()
    .launch(
        projectId=project,
        gcsPath=template,
        body={
            "jobName": job,
            "parameters": parameters,
        },
    )
)
response = request.execute()
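To set the region as noted above, you can call the regional form of the method, projects.locations.templates.launch, and pass a location argument. The following sketch shows that variant; it assumes the same project, job, template, and parameters placeholders as the example above, plus a region string such as 'us-central1'.

from googleapiclient.discovery import build

# Same placeholders as above, plus the Dataflow region to run in.
# project = 'your-gcp-project'
# job = 'unique-job-name'
# template = 'gs://dataflow-templates/latest/Word_Count'
# parameters = {...}
# region = 'us-central1'

dataflow = build("dataflow", "v1b3")
request = (
    dataflow.projects()
    .locations()
    .templates()
    .launch(
        projectId=project,
        location=region,   # pins the job to this Dataflow region
        gcsPath=template,
        body={
            "jobName": job,
            "parameters": parameters,
        },
    )
)
response = request.execute()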

For more information about the available options, see the projects.locations.templates.launch method in the Dataflow REST API reference.

Use gcloud CLI

The gcloud CLI can run either a custom or a Google-provided template using the gcloud dataflow jobs run command. Examples of running Google-provided templates are documented in the Google-provided templates page.

For the following custom template examples, set the following values:

  • Replace JOB_NAME with a job name of your choice.
  • Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
  • Set --gcs-location to the Cloud Storage location of the template file.
  • Set --parameters to the comma-separated list of parameters to pass to the job. Spaces between commas and values are not allowed.
  • To prevent VMs from accepting SSH keys that are stored in project metadata, use the additional-experiments flag with the block_project_ssh_keys service option: --additional-experiments=block_project_ssh_keys.

Create a custom template batch job

This example creates a batch job from a template that reads a text file and writes an output text file.

 gcloud dataflow jobs run JOB_NAME \
     --gcs-location gs://YOUR_BUCKET_NAME/templates/MyTemplate \
     --parameters inputFile=gs://YOUR_BUCKET_NAME/input/my_input.txt,output=gs://YOUR_BUCKET_NAME/output/my_output

The request returns a response with the following format.

 id: 2016-10-11_17_10_59-1234530157620696789
 projectId: YOUR_PROJECT_ID
 type: JOB_TYPE_BATCH

Create a custom template streaming job

This example creates a streaming job from a template that reads from a Pub/Sub topic and writes to a BigQuery table. The BigQuery table must already exist with the appropriate schema.

 gcloud dataflow jobs run JOB_NAME \
     --gcs-location gs://YOUR_BUCKET_NAME/templates/MyTemplate \
     --parameters topic=projects/project-identifier/topics/resource-name,table=my_project:my_dataset.my_table_name

The request returns a response with the following format.

 id: 2016-10-11_17_10_59-1234530157620696789
 projectId: YOUR_PROJECT_ID
 type: JOB_TYPE_STREAMING

For a complete list of flags for the gcloud dataflow jobs run command, see the gcloud CLI reference.

Monitoring and troubleshooting

The Dataflow monitoring interface lets you monitor your Dataflow jobs. If a job fails, you can find troubleshooting tips, debugging strategies, and a catalog of common errors in the Troubleshooting Your Pipeline guide.
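If you prefer to check on jobs from code as well as from the monitoring interface, the following is a minimal Python sketch that lists active jobs in one region with the same client library used earlier; the project ID and region placeholders are yours to substitute.

from googleapiclient.discovery import build

# Minimal sketch: list active Dataflow jobs in a region and print their states.
dataflow = build("dataflow", "v1b3")
result = (
    dataflow.projects()
    .locations()
    .jobs()
    .list(projectId="YOUR_PROJECT_ID", location="LOCATION", filter="ACTIVE")
    .execute()
)
for job in result.get("jobs", []):
    print(job["name"], job["currentState"])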


FAQs

What is the difference between classic and flex templates in Dataflow? ›

While classic templates have a static job graph, Flex templates can dynamically construct the job graph.

Which Dataflow template is used in the lab to run the pipeline? ›

Run a streaming pipeline using the Google-provided Pub/Sub to BigQuery template. The pipeline gets incoming data from the input topic. Go to the Dataflow Jobs page. Click Create job from template.

How do I run a GCP Dataflow pipeline from a local machine? ›

  1. Before you begin.
  2. Set up your environment.
  3. Get the Apache Beam SDK.
  4. Run the pipeline locally.
  5. Run the pipeline on the Dataflow service.
  6. View your results.
  7. Modify the pipeline code.

How do I run Dataflow in GCP? ›

For this GCP Dataflow tutorial, you must create a new GCP project for BigQuery and your GCP Dataflow pipeline.
  1. Step 1: Signup/Login. Through any supported web browser, navigate to the Google Cloud website, and create a new account if you don't have an existing one. ...
  2. Step 2: Creating the Project. ...
  3. Step 3: Add your APIs.

How do I create a custom template in Dataflow? ›

Create a JSON-formatted file named TEMPLATE_NAME_metadata using the parameters in Metadata parameters and the format in Example metadata file. Replace TEMPLATE_NAME with the name of your template. Ensure the metadata file does not have a filename extension.

Why use flex template in Dataflow? ›

Dataflow Flex Templates allow you to package a Dataflow pipeline for deployment. This tutorial shows you how to build a Dataflow Flex Template and then run a Dataflow job using that template.

When to use Dataflow flex template? ›

The Flex Template allows you to import a dataset into a Neo4j database through a Dataflow job, sourcing data from CSV files hosted in Google Cloud Storage buckets. It also lets you manipulate and transform the data at various steps of the import. You can use the template for both first-time and incremental imports.

What is the difference between cloud composer and Dataflow? ›

Dataflow allows you to build scalable data processing pipelines (Batch & Stream). Composer is used to schedule, orchestrate and manage data pipelines.

Is Dataflow an ETL? ›

Dataflow can also run custom ETL solutions since it has: building blocks for Operational Data Store and data warehousing; pipelines for data filtering and enrichment; pipelines to de-identify PII datasets; features to detect anomalies in financial transactions; and log exports to external systems.

Does dataflow require gateway? ›

In order to create a dataflow that queries an on premise data source, you need one of the following: Administrator permissions on a gateway. Connection creator permissions on the gateway. A gateway connection for the data source(s) you intend to use already created on the gateway for which you are a user.

Is Google Dataflow an ETL tool? ›

Learn about Google Cloud's portfolio of services enabling ETL, including Cloud Data Fusion, Dataflow, and Dataproc.

Why use Google Cloud Dataflow? ›

Dataflow uses the same programming model for both batch and stream analytics. Streaming pipelines can achieve very low latency. You can ingest, process, and analyze fluctuating volumes of real-time data. By default, Dataflow guarantees exactly-once processing of every record.

What is the difference between GCP Dataflow and airflow? ›

Airflow relies on task parallelism, where multiple tasks can be executed simultaneously, while Google Cloud Dataflow leverages data parallelism, which allows processing multiple chunks of data in parallel. This makes Google Cloud Dataflow highly scalable for processing large datasets.

Is GCP Dataflow serverless? ›

Unified stream and batch data processing that's serverless, fast, and cost-effective.

How do I run Azure Dataflow? ›

Getting started. Data flows are created from the factory resources pane like pipelines and datasets. To create a data flow, select the plus sign next to Factory Resources, and then select Data Flow. This action takes you to the data flow canvas, where you can create your transformation logic.

What does Dataflow run on? ›

Dataflow is built on the open source Apache Beam project. Apache Beam lets you write pipelines using a language-specific SDK. Apache Beam supports Java, Python, and Go SDKs, as well as multi-language pipelines. Dataflow executes Apache Beam pipelines.

How do I create a flex template in Dataflow? ›

  1. Prepare the environment.
  2. Create a Cloud Storage bucket.
  3. Create an Artifact Registry repository.
  4. Build the Flex Template.
  5. Run the Flex Template.
