Introduction to Apache Spark
APACHE SPARK
Apache Spark is a tool widely used by data engineers, data scientists, and machine learning engineers. Designed and built in 2009 at UC Berkeley, it is the evolution of the older paradigm that used Hadoop with the MapReduce algorithm. Apache Spark acts as a distributed system, spreading workloads across the memory of the different nodes of a cluster. It is simpler and easier to use, 10 to 20 times faster than Hadoop MapReduce, and modular, meaning it can serve different kinds of workloads. It focuses on speed and parallel computing instead of storage, and that is the main difference from Apache Hadoop.
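To make the MapReduce paradigm that Spark evolved from concrete, here is a minimal sketch of a word count in the classic map → shuffle → reduce style. This is plain Python for illustration only, not Hadoop or Spark itself; the function names are made up for this sketch.

```python
from collections import defaultdict
from functools import reduce

# Toy word count in the classic MapReduce style popularized by Hadoop.
# Plain Python for illustration; not actual Hadoop or Spark code.

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted counts by their key (the word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: reduce(lambda a, b: a + b, counts)
            for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is disk based", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"])  # 2
print(counts["is"])     # 3
```

In Hadoop, each phase reads from and writes to disk between steps; Spark keeps these intermediate results in memory across the cluster, which is where much of its speedup comes from.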
MEMORY PROCESSING
- The price of memory decreases each year, so adding more memory to each server in a cluster does not significantly affect infrastructure expenses.
- Many datasets fit into the memory of a modern computer.
- Memory is fast; using disk is not a good idea when you want to run many small operations.
- There is quantitative evidence of the speed of Apache Spark versus Hadoop MapReduce.
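The points above can be sketched with a toy example of why keeping intermediate results in memory pays off, in the spirit of Spark caching a dataset with `rdd.cache()` and reusing it across queries. This is plain Python under stated assumptions, not Spark's API; `expensive_transform` is a hypothetical stand-in for a costly job stage.

```python
# Toy illustration of in-memory reuse, analogous to Spark's cache():
# compute an intermediate result once, then serve several queries from it.

call_count = 0

def expensive_transform(records):
    # Hypothetical stand-in for a costly stage (e.g. parse + filter a dataset).
    global call_count
    call_count += 1
    return [r * r for r in records if r % 2 == 0]

data = list(range(10))

# Without caching: the transform reruns for every query.
uncached_sum = sum(expensive_transform(data))
uncached_max = max(expensive_transform(data))

# With caching: the transform runs once, and both queries reuse the result,
# the way Spark reuses a cached dataset held in cluster memory.
cached = expensive_transform(data)
cached_sum = sum(cached)
cached_max = max(cached)

print(call_count)  # 3 calls total: two uncached, one cached
```

In a real Spark job the cached dataset lives in the memory of the cluster's executors, so every action after the first avoids re-reading from disk entirely.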
COMPONENTS
A general overview of the architecture and its components is shown in the image below; all of its components are divided into their APIs and…