MySQL is an open-source relational database system that is part of most enterprise tech stacks today. In a typical architecture, MySQL serves as the transactional database, with a separate data warehouse supporting analytics, reporting, and other downstream application requirements.

Change Data Capture (CDC) captures committed changes from a database in real time and propagates those changes to downstream applications or other target databases.

Snowflake is gaining tremendous popularity these days, since it provides all of the functionality of an enterprise analytic database along with many additional special features and unique capabilities.

In this blog, we…


What is Presto?

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Installation

The only dependency for installing Presto is Java 8+ (Homebrew will take care of this dependency):

brew install presto

As part of the install, the following files are created in the etc/ folder under the install directory:

* node.properties
* jvm.config
* config.properties
* log.properties

It also creates a sample JMX connector under the etc/catalog folder.
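To verify the install, you can start the server in the foreground:

presto-server run

Then, in another terminal, point the CLI at the bundled JMX catalog and run SHOW TABLES; to list the MBeans exposed as tables:

presto --server localhost:8080 --catalog jmx --schema current

This is only a minimal sketch: the presto-server and presto wrappers and the default 8080 port come from the Homebrew formula as I installed it, so check your config.properties if the CLI cannot connect.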


In this article, I will share my experience with the Calico Operator certification course.

What do you learn in this course?

This free and self-paced course will arm you with the knowledge you need to understand how Kubernetes networking works, how to configure and manage a Calico network, and how to secure your cluster following today’s best practices, with the confidence you need to run mission-critical workloads in production.

Some key highlights:

  • Self-paced
  • Lab-based
  • Free
  • Certification after completion of the course
  • Deep dive into K8s networking and security policies
  • Calico CNI

What is the Calico project?

Calico is a container…


What is ulimit?

The purpose of ulimits is to limit a program's resource utilization to prevent a runaway bug or security breach from bringing the whole system down. It is easy for modern applications to exceed the default open-files limit and other limits very quickly.

How to check the hard limit and soft limit?

Check the hard limit:

ulimit -a -H

Check the soft limit:

ulimit -a -S

Configuring limits in Docker containers

Controlling the limits becomes a bit trickier when Docker is involved. The Docker daemon runs as the root user. By default, the same limits apply to the application running within a container as they would to the Docker daemon.
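To see what a container actually inherits, you can print the open-files limit from inside a throwaway container (the ubuntu image here is just an example):

docker run --rm ubuntu sh -c 'ulimit -n'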

Adjusting limits within a container, however, requires privileges…


This tutorial is for Spark developers who want to learn an easy and quick way to run a Transformer pipeline on a Dataproc cluster via StreamSets Transformer.

In my previous article, I explained what StreamSets Transformer is and how to deploy it in Minishift. Now we will look at how to run a Transformer pipeline (a Spark application) on a Dataproc cluster.

To get hands-on: Try StreamSets Transformer now

StreamSets Transformer currently supports the following cluster managers:


In this article, we are going to deploy a Hadoop cluster to our local K8s cluster.

First things first, let's create a local K8s cluster. For this purpose, we will install Minikube.

Install Minikube

Prerequisites

  1. Install Docker for Mac. Docker is used to create, manage, and run our containers. It lets us construct containers that will run in Kubernetes Pods.
  2. Install VirtualBox for Mac using Homebrew. Run brew cask install virtualbox in your Terminal. VirtualBox lets you run virtual machines on your Mac (like running Windows inside macOS, except for a Kubernetes cluster).

If the Brew package manager is installed:

brew install minikube

If…
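Once Minikube is installed, a quick smoke test looks like this (the --driver flag is an assumption; older releases used --vm-driver):

minikube start --driver=virtualbox

kubectl get nodes

If the node reports Ready, the local cluster is up and kubectl is talking to it.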


Spark can analyze data stored in files in many different formats: plain text, JSON, XML, Parquet, and more. But just because you can get a Spark job to run on a given input format doesn't mean you'll get the same performance with all of them. Actually, the performance difference can be quite substantial.

When you are designing the datasets for your Spark application, you need to ensure that you are making the best use of the file formats available to Spark.

Spark file format considerations

  • Spark is optimized for Apache Parquet and ORC for read throughput. Spark has vectorization support that reduces disk I/O…
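As a quick illustration of the columnar advantage, here is a minimal spark-shell sketch; the file paths and the event_type column are made up for the example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ParquetDemo").master("local[*]").getOrCreate()

// CSV is row-oriented and re-parsed on every read
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/events.csv")

// Parquet is columnar, compressed, and splittable
df.write.mode("overwrite").parquet("data/events.parquet")

// A query that touches one column only scans that column's data
spark.read.parquet("data/events.parquet").select("event_type").distinct().show()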

Whether your Kafka is provisioned in the cloud or on-premises, you might want to push to a subset of Pub/Sub topics. Why? For the flexibility of having Pub/Sub as your GCP event notifier, or to use topics to trigger Cloud Functions. Or maybe your organization plans to migrate from Apache Kafka to managed Google Cloud Pub/Sub.

So how do you exchange messages between Kafka and Pub/Sub? Are you exploring the option of a Kafka connector?

This is where StreamSets Data Collector comes in handy. In this post, you will learn the basic steps to start working with Data Collector…


Recently I took a Spark Scala course on one of the online education platforms. During that course, we had to solve data challenges by writing small Scala programs. I thought of extending that task and decided to solve the same problems with different methods, including StreamSets Transformer.

Challenge 1: You are given a fake friends dataset from a social networking platform in CSV format, as below, and you have to perform various tasks.

Note: You can find all the resources in the Git repo below.

git clone https://github.com/rishi871/SparkScala.git

id, Name, Age, Number of Friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
3,Deanna,40,465
4,Quark,68,21
5,Weyoun,59,318
6,Gowron,37,220
7,Will,54,307…
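As a sketch of one such task, here is how the average number of friends by age could be computed with the DataFrame API. The file name fakefriends.csv and the exact column names are assumptions based on the sample above (watch out for stray spaces in the header):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object FriendsByAge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("FriendsByAge")
      .master("local[*]")
      .getOrCreate()

    // Read the CSV, using the header row for column names
    val friends = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("fakefriends.csv")

    // Average number of friends for each age, sorted by age
    friends.groupBy("Age")
      .agg(avg("Number of Friends").as("avg_friends"))
      .orderBy("Age")
      .show()

    spark.stop()
  }
}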

In certain scenarios, you are required to change the default ulimit. For example, an application fails to start with the error below.

Configuration of maximum open file limit is too low: 1024 (expected at least 32768). Please consult

On Unix systems, you can increase the limit with the following command:

$ ulimit -n 32768

To achieve the same in Docker, there are two options.

1. Set ulimits in the container (--ulimit)

Since changing ulimit settings from inside a container requires extra privileges that are not available by default, you can instead set them at container start using the --ulimit flag.
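For example, to raise the open-files limit to match the error above (myapp is a placeholder image name; the two values are the soft and hard limits):

docker run --ulimit nofile=32768:32768 myapp

If every container needs the higher limit, the daemon-wide --default-ulimit option of dockerd achieves the same without per-container flags.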

Rishi Jain

Software Support Engineer @StreamSets | Hadoop | DataOps | RHCA | Ex-RedHatter | Ex-Cloudera
