StreamSets Transformer pipeline to Databricks in 5 mins
This tutorial is for Spark developers who want a quick and easy way to run a pipeline on a Databricks cluster via StreamSets Transformer.
In my previous article, I explained what StreamSets Transformer is and how to deploy it on Minishift. Now we will look at how to run a Transformer pipeline (a Spark application) on a Databricks cluster.
To get hands-on: Try StreamSets Transformer now
StreamSets Transformer currently supports the following cluster managers:
- None (local) — Run the pipeline locally on the Transformer machine.
- Apache Spark for HDInsight — Run a pipeline on an HDInsight cluster.
- Databricks — Run the pipeline on a Databricks cluster.
- EMR — Run the pipeline on an EMR cluster.
- Hadoop YARN — Run the pipeline on a Hadoop YARN cluster.
- Spark Standalone — Run the pipeline on a Spark standalone cluster.
- SQL Server 2019 Big Data Cluster — Run the pipeline on SQL Server 2019 BDC.
Tutorial
You can configure a pipeline to run on an existing Databricks interactive cluster or provision a new Databricks job cluster upon the initial run of a pipeline. In this tutorial, we will provision a new cluster.
Step 1: Generate a personal access token for secure authentication to the Databricks API
- Go to User Settings > Access Tokens
- Click on Generate New Token
- Copy the generated token to a text file (the sketch below shows a quick way to verify it against the Databricks REST API).
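Before wiring the token into Transformer, you can confirm it works with any read-only Databricks REST call. This sketch is not part of Transformer itself; it assumes the Python `requests` library and uses placeholder values for the workspace URL and token, calling the clusters/list endpoint.

```python
import requests

DATABRICKS_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

# List clusters visible to the token's user; a 200 response means the token works.
resp = requests.get(
    f"{DATABRICKS_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```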
Step 2: Transformer Pipeline Configuration
- Go to Pipeline Configuration > Cluster Tab
- Set “Cluster Manager Type” to “Databricks”
- Enter your Databricks workspace URL in “URL to connect to Databricks”
- Set “Credential Type” to “Token”
- Paste the personal access token generated in Step 1 into the “Token” property
- You can use an existing cluster or provision a new cluster on pipeline start. In this demo, we choose to provision a new cluster.
- Update “node_type_id” and the other Cluster Configuration properties as needed (a sample cluster definition is sketched below).
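Under the hood, the provisioned cluster is described with the same fields as the Databricks new_cluster specification. The values below are placeholders, not Transformer defaults; the JSON this prints is the kind of content that goes into the Cluster Configuration property.

```python
import json

# Minimal sketch of a cluster definition; adjust worker count, Spark version,
# and node type to match your workspace.
new_cluster = {
    "num_workers": 2,
    "spark_version": "5.3.x-scala2.11",  # example Databricks runtime version
    "node_type_id": "i3.xlarge",         # example AWS node type
}
print(json.dumps(new_cluster, indent=2))
```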
Step 3: Fine-tune the Spark configuration for your pipeline. You can also add any other configuration properties you need; a couple of common examples are sketched below.
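Purely illustrative: a few standard Spark properties you might tune, shown here as a Python dict. In Transformer these are entered as name/value pairs in the pipeline's Spark configuration.

```python
# Example Spark properties for a pipeline that shuffles a lot of data.
extra_spark_config = {
    "spark.sql.shuffle.partitions": "200",  # match partition count to data volume
    "spark.executor.memory": "4g",          # per-executor memory
    "spark.driver.memory": "2g",            # driver memory
}
```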
Step 4: Start Pipeline
Let’s understand what happens behind the scenes:
- Click “Start” to start the pipeline in Transformer.
- Transformer copies all required JAR files to Databricks DBFS using the Databricks REST API (https://docs.databricks.com/api/latest/dbfs.html#put)
- The default DBFS root path is “/streamsets”. It can be changed via Pipeline Configuration > Staging Directory
- The staging directory structure in DBFS:
  - /streamsets/<version>/
    - All required Transformer JAR files (only the stage libraries used in the pipeline)
    - External resources such as JDBC drivers
  - /staging/<pipelineId>/<unique directory for each run>/
    - Pipeline.json
    - Offset.json
    - etc folder archive
    - resources folder archive
- Once all files are uploaded to DBFS, Transformer creates a Databricks SparkSubmitTask job (https://docs.databricks.com/api/latest/jobs.html#sparksubmittask)
- Once the Databricks job is created, Transformer runs it using the run-now REST API (https://docs.databricks.com/api/latest/jobs.html#run-now)
Clicking the Databricks Job URL takes you to the Databricks Jobs page, where you can monitor the run status. The sketch below condenses the REST sequence described above.
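For reference, here is a condensed sketch of the same REST sequence, using the documented Databricks endpoints (dbfs/put, jobs/create with a spark_submit_task, and jobs/run-now). It is an illustration only: the paths, JAR names, cluster values, and spark-submit parameters are placeholders, not the exact values Transformer stages.

```python
import base64
import requests

DATABRICKS_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}      # placeholder

# 1. Upload a file to the DBFS staging directory (works for small files).
with open("pipeline.json", "rb") as f:
    requests.post(
        f"{DATABRICKS_URL}/api/2.0/dbfs/put",
        headers=HEADERS,
        json={
            "path": "/streamsets/staging/<pipelineId>/<runId>/pipeline.json",
            "contents": base64.b64encode(f.read()).decode("ascii"),
            "overwrite": True,
        },
    ).raise_for_status()

# 2. Create a job with a spark_submit_task pointing at the staged artifacts.
job = requests.post(
    f"{DATABRICKS_URL}/api/2.0/jobs/create",
    headers=HEADERS,
    json={
        "name": "transformer-pipeline-demo",
        "new_cluster": {
            "num_workers": 2,
            "spark_version": "5.3.x-scala2.11",
            "node_type_id": "i3.xlarge",
        },
        "spark_submit_task": {
            "parameters": [
                "--class", "<main-class>",
                "dbfs:/streamsets/<version>/<transformer-jar>.jar",
            ]
        },
    },
).json()

# 3. Trigger the job and capture the run id for monitoring.
run = requests.post(
    f"{DATABRICKS_URL}/api/2.0/jobs/run-now",
    headers=HEADERS,
    json={"job_id": job["job_id"]},
).json()
print("run_id:", run["run_id"])
```

The run_id returned by run-now corresponds to the run you can monitor from the Databricks Jobs page mentioned above.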
Conclusion
In this tutorial, we learned how to launch a StreamSets Transformer pipeline on a Databricks cluster.