Spark For Handling Big Data
Intro
Spark is reported to perform up to 100x faster than Hadoop MapReduce.
- Spark covers the data processing part of Big Data Engineering.
- Apache Spark is written in the Scala programming language.
- Spark is a memory-based solution that keeps as much as it can in RAM cache, which contributes to its speed.
- Spark can run both on a local machine and in a cluster environment, either on local infrastructure or in the cloud (e.g. Amazon EC2).
- The core idea behind Spark is MapReduce: mapping your data set into a collection of (key, value) pairs, and then reducing over all pairs with the same key.
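To make the (key, value) idea concrete, here is a minimal word-count sketch with PySpark RDDs. The sample sentences, the app name, and the `sc` context name are illustrative assumptions, not part of the project code.

```python
# Minimal MapReduce sketch with PySpark RDDs (illustrative word count).
from pyspark import SparkContext

sc = SparkContext("local[*]", "mapreduce-demo")

lines = sc.parallelize(["spark handles big data", "spark runs in memory"])

counts = (
    lines.flatMap(lambda line: line.split())   # split each line into words
         .map(lambda word: (word, 1))          # map each word into a (key, value) pair
         .reduceByKey(lambda a, b: a + b)      # reduce all pairs sharing the same key
)

print(counts.collect())   # e.g. [('spark', 2), ('handles', 1), ...]
```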
PySpark is a Python library that enables using Spark with Python instead of Scala.
- With Python, code readability, maintenance, and refactoring are far easier than in Scala.
- I apply functional programming for building the data model, e.g. passing functions (both lambdas and named ones) as parameters to other functions.
- PySpark operations work like Python generators: evaluation is delayed until the result is actually requested (see the short sketch after this list).
- This way, the full data frame is never pulled into memory, enabling efficient processing across a cluster of machines (here I use a single local machine).
- This is in contrast to Pandas dataframes, where everything is pulled into memory at once.
- The main data type in PySpark is the Spark dataframe, equivalent to dataframes in R and Pandas.
- To apply distributed computation with PySpark, you need to perform your operations on Spark dataframes.
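The lazy-evaluation point can be illustrated with a short sketch. The sample dataframe, column names, and app name below are assumptions made purely for illustration.

```python
# Small sketch of lazy evaluation on a Spark dataframe.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Transformations are only recorded here -- nothing is computed yet.
adults = df.filter(df.age > 30).select("name")

# The action below triggers the actual (potentially distributed) computation.
adults.show()
```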
Features
The app includes the following features:
Demo
Here you can find a general description of each step. For a detailed description along with the code itself, check the Jupyter Notebook linked as the source code.
Steps:
- Importing all necessary libraries and initializing Spark features.
- Setting up variables and functions to be passed later as parameters.
- Importing the data set.
- Building a so-called RDD data model that the data is passed through (see the sketch after this list):
  - filtering data with a lambda function that checks a condition on every RDD line,
  - mapping each RDD line into key-value pairs,
  - reducing the key-value pairs by key, aggregating their values,
  - sorting pairs by key.
- Converting the RDD into a Spark Data Frame.
- Converting the RDD into a Python list of tuples.
- Converting the Spark Data Frame into a Pandas Data Frame (converting an RDD directly into a Pandas DF is not possible).
- Exporting the Pandas DF to a CSV flat file.
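A condensed sketch of these steps is shown below. The input file name, column layout, and the specific filter/aggregation logic are assumptions; the real code lives in the linked Jupyter Notebook.

```python
# Condensed sketch of the pipeline: RDD -> filter/map/reduce/sort -> Spark DF -> Pandas DF -> CSV.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-pipeline").getOrCreate()
sc = spark.sparkContext

# Importing the data set as an RDD of text lines (file name is hypothetical).
rdd = sc.textFile("data.csv")

pairs = (
    rdd.filter(lambda line: line.strip() != "")       # filter lines with a lambda condition
       .map(lambda line: (line.split(",")[0], 1))     # map each line into a (key, value) pair
       .reduceByKey(lambda a, b: a + b)               # reduce pairs by key, aggregating values
       .sortByKey()                                   # sort pairs by key
)

spark_df = pairs.toDF(["key", "count"])   # RDD -> Spark Data Frame
as_tuples = pairs.collect()               # RDD -> Python list of tuples
pandas_df = spark_df.toPandas()           # Spark DF -> Pandas DF (RDD -> Pandas is not direct)
pandas_df.to_csv("output.csv", index=False)
```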
Setup
- Install Java version 8
- The Java JDK folder will be created inside:
   C:\Program Files\Java
- Add the JAVA_HOME environment variable as shown in the video:
   JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
- Add the Path variable (pointing to the bin folder; use the System Variables section if the User Variables section is not recognized):
   PATH = C:\Program Files\Java\jdk1.8.0_201\bin
- Create a subfolder named "spark" on the C: drive, e.g.
   C:\spark
- Keep your Spark installation folder inside C:\spark, e.g.
   C:\spark\spark-2.3.1-bin-hadoop2.7
- Set up the SPARK_HOME environment variable:
   SPARK_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
- Set up the HADOOP_HOME environment variable:
   HADOOP_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
- Add the Path variable (pointing to the bin folder; use the System Variables section if the User Variables section is not recognized):
   PATH = C:\spark\spark-2.3.1-bin-hadoop2.7\bin
- Copy and paste the winutils file into:
   C:\spark\spark-2.3.1-bin-hadoop2.7\bin
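After the setup, a minimal sanity check could look like the sketch below. The optional `findspark` helper is an assumption (install it with `pip install findspark`); the paths match the ones used above and should be adjusted to your own installation.

```python
# Quick sanity check that the Spark installation and environment variables work.
import os

# Assumed paths from the setup above; adjust if your installation differs.
os.environ.setdefault("SPARK_HOME", r"C:\spark\spark-2.3.1-bin-hadoop2.7")
os.environ.setdefault("HADOOP_HOME", r"C:\spark\spark-2.3.1-bin-hadoop2.7")

import findspark
findspark.init()   # makes the PySpark installation importable

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
print(spark.version)   # should print the installed Spark version, e.g. 2.3.1
spark.stop()
```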