How to Create a Cloud Dataflow Pipeline Using Java and Apache Maven
Jose Miguel Arrieta
Jan 6
Cloud Dataflow is a managed service for executing a wide variety of data processing patterns.
This post explains how to create a simple Maven project with the Apache Beam SDK and run a pipeline on the Google Cloud Dataflow service.
One advantage of using Maven is that it manages the external dependencies of the Java project, which makes it well suited to automated builds.
The project runs a very simple example: two strings, “Hello” and “World”, are the inputs, they are transformed to upper case on GCP Dataflow, and the output is shown in the console log.
Disclaimer: the purpose of this post is to present the steps to create a data pipeline using Dataflow on GCP. Java code syntax is beyond its scope and is not discussed.
I hope to write some specific tutorials on that in the future.
Pre-requisites
Before you start, you will need to enable the APIs, set up authentication, and set the Google Application Credentials.
Install Apache Maven http://maven.
html
Google Cloud Platform account https://console.
com
Enable the APIs and select the project or create a new one
Create buckets: (See https://cloud.
com/storage/docs/creating-buckets)
These buckets will contain JAR files and temporary files if necessary.
Set up authentication: go to APIs & Services -> Credentials -> Create Credentials -> Service Account Key.
a. Under the Service Account option, select New Service Account.
b. Enter a name for the service account; in this case it is dataflow-service.
c. Set the role to Owner.
Set the Google Application Credentials: using the previously downloaded JSON file, which contains the service account key, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of that file:
export GOOGLE_APPLICATION_CREDENTIALS="my/path/dataflow-test.json"
If you don’t set the Google Application Credentials properly, you might not be able to access the Google buckets, and you will probably see the following error:
An exception occured while executing the Java class.
Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.
PipelineOptions): InvocationTargetException: DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path …
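Because the error above does not mention credentials at all, it can help to check the environment variable before launching the pipeline. The following is a minimal sketch of such a check in plain Java. The class name CredentialsCheck and its method are hypothetical helpers, not part of the Apache Beam SDK or the archetype, and the check only confirms that the variable points to a readable local file; it does not validate the key itself.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of the Beam SDK or the archetype. */
final class CredentialsCheck {

    /** Name of the environment variable set by the export command above. */
    static final String ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS";

    /**
     * Returns null when the value points to a readable regular file,
     * otherwise a short description of the problem.
     */
    static String problemWith(String value) {
        if (value == null || value.trim().isEmpty()) {
            return ENV_VAR + " is not set";
        }
        Path keyFile = Paths.get(value.trim());
        if (!Files.isRegularFile(keyFile)) {
            return ENV_VAR + " does not point to a file: " + keyFile;
        }
        if (!Files.isReadable(keyFile)) {
            return ENV_VAR + " points to an unreadable file: " + keyFile;
        }
        return null;
    }

    public static void main(String[] args) {
        // Reads the variable from the current environment and reports the result.
        String problem = problemWith(System.getenv(ENV_VAR));
        System.out.println(problem == null ? "Credentials file found." : problem);
    }
}
```

Running this class in the same shell session as the Maven command shows whether the export actually took effect in that session.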
Use the Java Dataflow archetype
The Maven Archetype Plugin lets you create a Maven project from an existing template, called an archetype.
The following command generates a new project from google-cloud-dataflow-java-archetypes-starter:
mvn archetype:generate -DarchetypeArtifactId=google-cloud-dataflow-java-archetypes-starter -DarchetypeGroupId=com.
example -DartifactId=dataflow-example -Dversion="[1.
0]" -DinteractiveMode=false
This command generates an example Java class named StarterPipeline.java that contains the Apache Beam Java code defining the pipeline steps.
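To make the result of the pipeline concrete without going into Beam syntax, here is a plain-Java sketch of what the example computes: the two input strings are mapped to upper case and printed. It deliberately does not use the Apache Beam SDK, so it runs locally without Dataflow. The class name UpperCaseSketch and its method are made up for illustration; the generated StarterPipeline.java expresses these steps as pipeline steps instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Plain-Java stand-in for the example's logic; hypothetical, not the generated class. */
final class UpperCaseSketch {

    /** Maps every input word to upper case, one output element per input element. */
    static List<String> toUpperCase(List<String> words) {
        return words.stream()
                .map(word -> word.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Same inputs as the example in this post.
        for (String word : toUpperCase(Arrays.asList("Hello", "World"))) {
            System.out.println(word); // prints HELLO, then WORLD
        }
    }
}
```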
Run Java main from Maven
Pre-requisites: the buckets for the staging and temp locations must already exist.
com/storage/docs/creating-buckets)
To compile the project and run the main method of the Java class with arguments, execute the following command:
mvn compile exec:java -e -Dexec.
args="--project=dataflow-test-227715 --stagingLocation=gs://example-dataflow-stage/staging/ --tempLocation=gs://example-dataflow-stage/temp/ --runner=DataflowRunner"
Arguments:
--project: the project ID, in this case dataflow-test-227715.
--stagingLocation: staging folder in a GCP bucket.
--tempLocation: temp folder in a GCP bucket.
--runner: set to DataflowRunner to run on GCP.
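The arguments listed above can be sanity-checked before they are handed to Maven. The sketch below, in plain Java, parses arguments of the form --name=value and reports missing options and locations that do not look like gs:// paths. Everything in it is an assumption made for illustration: the class PipelineArgsCheck is hypothetical, it treats the four options from this post as required, and its path pattern is a simplified shape check rather than the validation Beam or Cloud Storage actually performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical pre-flight check; not part of the Beam SDK. */
final class PipelineArgsCheck {

    // Simplified shape check: gs://<bucket>[/<path>]. Real bucket naming rules are stricter.
    private static final Pattern GCS_PATH =
            Pattern.compile("^gs://[a-z0-9][a-z0-9._-]{1,61}[a-z0-9](/.*)?$");

    // Assumption: the four options used in this post are treated as required.
    private static final String[] REQUIRED =
            {"project", "stagingLocation", "tempLocation", "runner"};

    /** Collects --name=value pairs; anything else is ignored. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                options.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return options;
    }

    static boolean isGcsPath(String value) {
        return value != null && GCS_PATH.matcher(value).matches();
    }

    /** Returns one message per problem; an empty list means the arguments look plausible. */
    static List<String> problems(String[] args) {
        Map<String, String> options = parse(args);
        List<String> found = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = options.get(name);
            if (value == null) {
                found.add("missing --" + name);
            } else if (name.endsWith("Location") && !isGcsPath(value)) {
                found.add("--" + name + " is not a gs:// path: " + value);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> found = problems(args);
        System.out.println(found.isEmpty() ? "Arguments look plausible." : String.join("\n", found));
    }
}
```

Because the parser only accepts names that start with two ASCII hyphens, an argument pasted with a typographic dash (for example –project) is reported as missing, which makes that copy-and-paste problem easy to spot.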
Check that the job is created
Go to the Dataflow dashboard; you should see a new job created and running.
Open the job
You should see the different steps and, when the job finishes, the words ‘HELLO’ and ‘WORLD’ in upper case in the log console.