ITNEXT

ITNEXT is a platform for IT developers & software engineers to share knowledge, connect, collaborate, learn and experience next-gen technologies.

Follow publication

Member-only story

Building Real-time communication with Apache Spark through Apache Livy

Ramses Alexander Coraspe Valdez
ITNEXT
Published in
5 min readJun 12, 2022

--

Dockerizing and Consuming an Apache Livy environment

We all know what Apache Spark is, There are two approaches to submit jobs to an Apache Spark cluster programmatically and each of them comes with some limitations in order to achieve a real time interaction, spark-submit and spark-shell are the only options available to submit spark apps to an Apache Spark Cluster, but, what would happen in the cases when you want to submit spark-jobs interactively from a web or mobile application?

There are cases where your Apache Spark cluster could be hosted in an on-premise infrastructure, and you would need many users consuming and running heavy aggregations against your organization’s data sources concurrently, from their mobile phones, web or desktop applications, create a “Spark-as-a-Service” environment to solve what I mentioned above is not as difficult as it sounds, one solution could be to expose your JDBC/ODBC data sources via Spark thrift server, another alternative would be to use Apache Livy.

I don’t know if Apache Livy should now be seen as a Workaround due to Apache Spark’s aggressive foray into the cloud with technologies like Google Cloud Dataproc or AWS EMR, but, this article shows and explains a dockerized environment that you can use as a template to quickly deploy a consumable Apache Livy environment.

I will try to be brief explaining what Apache Livy is: is a service that enables easy interaction with an Apache Spark cluster over a REST interface, check the image below:

Basic example of an Apache Livy interface

As you can see, In order to reproduce a real example we would need three components:

  1. Apache Spark Cluster
  2. Apache Livy Server
  3. Apache Livy Client

As an additional component I would add docker for a faster implementation, and a PostgreSQL database server to simulate an external data source available for Apache Spark.

let’s start

Apache Spark Cluster

--

--

Published in ITNEXT

ITNEXT is a platform for IT developers & software engineers to share knowledge, connect, collaborate, learn and experience next-gen technologies.

Written by Ramses Alexander Coraspe Valdez

Very passionate about data engineering and technology, love to design, create, test and write ideas, I hope you like my articles.

Write a response