StreamSets Transformer pipeline to Dataproc in 5 mins
This tutorial is for Spark developers who want a quick and easy way to run a Transformer pipeline on a Dataproc cluster via StreamSets Transformer.
In my previous article, I explained what StreamSets Transformer is and how to deploy it on Minishift. Now we will look at how to run a Transformer pipeline (a Spark application) on a Dataproc cluster.
To get hands-on: Try StreamSets Transformer now
StreamSets Transformer currently supports the following cluster managers:
- None (local) — Run the pipeline locally on the Transformer machine.
- Apache Spark for HDInsight — Run a pipeline on an HDInsight cluster.
- Databricks — Run the pipeline on a Databricks cluster.
- EMR — Run the pipeline on an EMR cluster.
- Hadoop YARN — Run the pipeline on a Hadoop YARN cluster.
- Kubernetes — Run the pipeline on a Kubernetes cluster.
- Spark Standalone — Run the pipeline on a Spark standalone cluster.
- SQL Server 2019 Big Data Cluster — Run the pipeline on SQL Server 2019 BDC.
What is Dataproc?
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them.
To better understand why you should use Dataproc, please check out this.
Tutorial
You can configure a pipeline to run on an existing Dataproc interactive cluster or provision a new Dataproc job cluster on the initial run of the pipeline. In this tutorial, we will use an existing Dataproc cluster.
Step 1: Create a Dataproc cluster. You can follow the steps mentioned in the document below.
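If you prefer the command line, a Dataproc cluster can also be created with the gcloud CLI. This is only a sketch: the cluster name, region, machine types, and worker count below are placeholder values you should adjust for your own project.

```shell
# Create a small Dataproc cluster (all names and sizes are example values).
gcloud dataproc clusters create transformer-demo-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2
```

Running this requires the Google Cloud SDK to be installed and authenticated against your project.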
Step 2: Transformer Pipeline Configuration
- Go to Pipeline Configuration > Cluster tab
- Set “Cluster Manager Type” to “Dataproc”
- Select the “Dataproc” tab and provide the required information
- To generate credentials, please follow the steps mentioned here
- You can use an existing cluster or provision a new cluster on pipeline start; here we select an existing cluster and provide its name
- We also need to provide a GCS staging URI; I will share more details below on why this staging URI is needed
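As a rough sketch of the credentials and staging setup, you can create a service account, grant it Dataproc and storage permissions, download a JSON key for Transformer to use, and create the staging bucket with the gcloud/gsutil CLIs. The service account name, project ID, and bucket name below are placeholders, not values from this tutorial:

```shell
# Placeholder project, service account, and bucket names.
gcloud iam service-accounts create transformer-sa \
    --display-name="StreamSets Transformer"

# Grant the account permission to submit Dataproc jobs and write to GCS.
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:transformer-sa@my-project-id.iam.gserviceaccount.com" \
    --role="roles/dataproc.editor"
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:transformer-sa@my-project-id.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

# JSON key that Transformer uses to authenticate to GCP.
gcloud iam service-accounts keys create transformer-key.json \
    --iam-account=transformer-sa@my-project-id.iam.gserviceaccount.com

# Bucket used as the GCS staging URI in the Dataproc tab.
gsutil mb gs://my-transformer-staging
```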
Step 3: Fine-tune the Spark configuration for your pipeline. You can also add any additional configuration properties your workload requires.
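For example, these are typical Spark properties you might tune here; the values are purely illustrative, not recommendations for this pipeline:

```
spark.executor.memory=4g
spark.executor.cores=2
spark.dynamicAllocation.enabled=true
```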
Step 4: Create the pipeline
In this pipeline, we read data from GCS, perform some Spark transformations, and finally write the results to GCS and BigQuery.
You can get this pipeline from my GitHub repo: Streamsets On Dataproc.zip
Step 5: Start the pipeline
Let’s understand the magic that happens behind the scenes:
- Click “Start” to start the pipeline in Transformer.
- Transformer copies all required JAR files to the GCS staging directory.
- The staging directory structure in GCS:
/streamsets/<version>/
- All required Transformer JAR files (only the stage libraries used in the pipeline)
- External resources such as JDBC drivers
/staging/<pipelineId>/<unique directory for each run>/
- Pipeline.json
- Offset.json
- An archive of the etc folder
- An archive of the resources folder
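You can see this staging layout for yourself by listing the bucket, and confirm the pipeline was submitted as a Spark job by listing Dataproc jobs. The bucket name, region, and cluster name below are placeholders:

```shell
# List everything Transformer copied into the staging bucket.
gsutil ls -r gs://my-transformer-staging/streamsets/
gsutil ls -r gs://my-transformer-staging/staging/

# Confirm the pipeline is running as a Dataproc job.
gcloud dataproc jobs list --region=us-central1 --cluster=transformer-demo-cluster
```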
Conclusion
In this tutorial, we learned how to run a Transformer pipeline on a Dataproc cluster.