StreamSets Transformer pipelines to AWS EMR in 10 mins

Rishi Jain
5 min read · Apr 13, 2020


This tutorial is for Spark developers who don't have much knowledge of Amazon Web Services and want to learn a quick and easy way to run a Spark job on Amazon EMR via StreamSets Transformer.

In my previous article, I explained what StreamSets Transformer is and how to deploy it in Minishift. Now we will look at how to run a Transformer pipeline (a Spark application) on AWS EMR.

StreamSets Transformer currently supports several cluster managers; in this tutorial, we will use Amazon EMR.

AWS and Amazon EMR

AWS is one of the most widely used cloud services platforms: a lot of services are available, it is very well documented, and it is easy to use.

A cloud services platform allows users to access on-demand resources (compute power, memory, storage) and services (databases, monitoring, workflow, etc.) via the internet with pay-as-you-go pricing.

Among all the cool services offered by AWS, we will use only two of them:

  • Simple Storage Service (S3), a massively scalable object storage service
  • Elastic MapReduce (EMR), a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark. It can be viewed as Hadoop-as-a-Service: you start a cluster with the number of nodes you want, run any job you want, and pay only for the time the cluster is actually up.

Tutorial

In this tutorial, we will launch a very simple pipeline (a Spark application) on AWS EMR.

First, let's set up the EMR side of things.

Create an AWS account

First of all, create a free-tier AWS account on the AWS website. The form is simple and quick to fill in; you will need to enter your credit card information, which will be used for Amazon EMR pricing and for other services if you use more than the free-tier quotas.

Choose a region in which to deploy the cluster

Depending on the date you signed up for an AWS account, the default region you see when you access a resource from the AWS Management Console differs. Each region has its own set of resources and pricing.

You can change the region by clicking the region selector in the upper-right corner of each of the AWS Management Console pages.
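If you later script any of these steps instead of using the console (as sketched in the examples further below), the region is chosen explicitly when you create an SDK client. A minimal sketch with boto3, the AWS SDK for Python; the region name is only an example:

```python
import boto3

# "eu-west-1" is just an example; use the region you picked in the console.
emr = boto3.client("emr", region_name="eu-west-1")
s3 = boto3.client("s3", region_name="eu-west-1")
```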

Create an Amazon S3 Bucket

In this use case, we will use an Amazon S3 bucket to store our Spark application JAR, logs, and input and output files.

  1. Open the Amazon S3 console
  2. Choose Create bucket
  3. Type a name for your bucket (e.g., my-first-emr-bucket) and choose its AWS Region, then click Next.
  4. On the Set properties page, you can configure properties for the bucket. In this tutorial we don't need any specific properties, so click Next.
  5. On the Set permissions page, you manage the permissions that are set on the bucket you are creating. We will use the default permissions, so click Next.
  6. On the Review page, verify the settings and choose Create bucket.
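If you prefer to create the bucket programmatically, the console steps above boil down to a single API call. A minimal boto3 sketch, assuming your AWS credentials are configured; the bucket name and region are the example values from above (bucket names are globally unique, so pick your own):

```python
import boto3

REGION = "eu-west-1"  # example region; match the one you chose above
s3 = boto3.client("s3", region_name=REGION)

# "my-first-emr-bucket" is the example name from step 3 and may already
# be taken. Outside us-east-1 the region must be passed as a
# LocationConstraint.
s3.create_bucket(
    Bucket="my-first-emr-bucket",
    CreateBucketConfiguration={"LocationConstraint": REGION},
)
```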

Create an Amazon EMR cluster

In this step, we will launch a sample cluster.

  1. Open the Amazon EMR console
  2. In the upper-right corner, change the region to the one in which you want to deploy the cluster
  3. Choose Create cluster
  4. In the General Configuration section, enter the cluster name, choose the S3 bucket you created (the logs will be stored in this bucket), and set the launch mode to Cluster
  5. Select the software release and Spark version as per your requirements
  6. Go to the advanced options and make sure the cluster enters a waiting state after the last step completes; otherwise the cluster will auto-terminate
  7. Click Create cluster! Wait a few minutes and you are all set
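For reference, the same kind of cluster can be provisioned with a single boto3 call. This is only a rough sketch, not the exact configuration the console generates: the cluster name, release label, instance types, bucket name, and the default EMR roles (which must already exist in your account) are all example values:

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="my-first-emr-cluster",                # example name
    ReleaseLabel="emr-5.29.0",                  # pick a release with the Spark version you need
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-first-emr-bucket/logs/",    # logs go to the bucket created earlier
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Keep the cluster in the WAITING state so it is not
        # auto-terminated (the API equivalent of step 6 above).
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    # Default roles; create them first with `aws emr create-default-roles`
    # if your account doesn't have them yet.
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```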

Now let's go back to our StreamSets Transformer UI.

  1. Create a simple pipeline: Dev Raw Data → AWS S3

To run a pipeline on an EMR cluster, in the pipeline properties, you configure the pipeline to use EMR as the cluster manager type, then configure the EMR properties.

When you configure a pipeline to run on an EMR cluster, you can specify an existing Spark cluster to use or you can have EMR provision a Spark cluster to run the pipeline. When provisioning a cluster, you can optionally enable logging, make the cluster visible to all users, and have EMR terminate the cluster after the pipeline stops.

Yes, you read that correctly: we can provision a new EMR cluster through StreamSets Transformer to run this pipeline, and after the job completes successfully, the cluster will be terminated. Isn't that cool?!

You can configure the pipeline to use IAM roles or AWS access keys to connect to the EMR cluster. You also define an S3 staging URI and a staging directory within the cluster to store the Transformer libraries and resources needed to run the pipeline.

The EMR tab configured to run the pipeline on an existing EMR Spark cluster: get the Cluster ID from the EMR console.
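If you don't want to copy the Cluster ID from the console by hand, you can also look it up with boto3. A small sketch that lists clusters that are currently up; the region is, again, just an example:

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# The Id field (e.g. j-XXXXXXXXXXXXX) is the Cluster ID that the
# Transformer EMR tab asks for.
for cluster in emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]:
    print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])
```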

Configure our AWS S3 destination

Now run the pipeline and, once it completes, check the S3 bucket to confirm that a new directory has been created with the data.
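You can do this check in the S3 console, or with a short boto3 sketch like the one below; the bucket name and the output/ prefix are examples, so use whatever you configured in the S3 destination:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# List whatever the pipeline wrote under the example "output/" prefix.
resp = s3.list_objects_v2(Bucket="my-first-emr-bucket", Prefix="output/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```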

You can also check the Spark application logs: go to the EMR cluster → Application history tab.

Conclusion

Easy, isn't it? You have learned how to use AWS EMR to run a Spark job and how to launch a StreamSets Transformer pipeline on AWS EMR. For best practices for configuring a cluster, see the Amazon EMR documentation.
