Spark For Handling Big Data

Intro

Reportedly, Spark can perform up to 100x faster than Hadoop MapReduce.

  • Spark covers the data processing part of Big Data Engineering.
  • Apache Spark is written in the Scala programming language.
  • Spark is a memory-based solution: it keeps as much data as it can in RAM cache, which contributes to its speed.
  • Spark can run both on a local machine and in a cluster environment, either on local infrastructure or in the cloud (e.g. Amazon EC2).
  • The core idea behind Spark is MapReduce: mapping your data set into a collection of (key, value) pairs, and then reducing over all pairs with the same key (see the sketch below).
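
A minimal sketch of that MapReduce idea in PySpark; the word-count data and app name below are illustrative, not part of this project:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "mapreduce-sketch")

    # Map each word to a (key, value) pair, then reduce over all pairs sharing a key.
    counts = (
        sc.parallelize(["spark handles big data", "spark runs in memory"])
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # map: emit (key, value) pairs
          .reduceByKey(lambda a, b: a + b)      # reduce: sum values per key
    )

    print(counts.collect())  # e.g. [('spark', 2), ('handles', 1), ...]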

PySpark is a Python library that enables using Spark with Python instead of Scala.

  • With Python, code readability, maintenance, and refactoring are far easier than in Scala.
  • I apply functional programming to build the data model, e.g. passing functions (both lambdas and named functions) as parameters to other functions.
  • PySpark operations work like Python generators: evaluation is delayed until the result is actually requested (see the sketch after this list).
  • This avoids pulling the full data frame into memory and enables efficient processing across a cluster of machines (here I use a single local machine).
  • This is in contrast to Pandas dataframes, where everything is pulled into memory at once.
  • The main data type in PySpark is the Spark dataframe - equivalent to dataframes in R and Pandas.
  • To apply distributed computation using PySpark, you need to perform operations on Spark dataframes.
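
A minimal sketch of this lazy evaluation, assuming a local SparkSession and an illustrative two-column data frame (neither belongs to this project):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("lazy-sketch").getOrCreate()

    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    # Transformations are lazy: this only builds an execution plan, nothing runs yet.
    grouped = df.filter(F.col("value") > 1).groupBy("key").sum("value")

    # An action (show/collect/count) triggers the actual computation.
    grouped.show()

    # Only at this point is the (small) result pulled into driver memory as Pandas.
    pandas_df = grouped.toPandas()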

Features

The app includes the following features:

  • PySpark API
  • Jupyter Notebook
  • MapReduce

Demo

Here you can find a general description of each step. For the detailed description along with the code itself, check the Jupyter Notebook linked as the source code. A condensed sketch of the whole pipeline follows the list of steps.

Steps:
  1. Importing all necessary libraries and initializing Spark.
  2. Setting up variables and functions to be passed later as parameters.
  3. Importing the data set.
  4. Building the so-called RDD data model that the data flows through:
    - filtering data with a lambda function that checks a condition on every RDD line,
    - mapping each RDD line into key-value pairs,
    - reducing the key-value pairs by key, aggregating their values,
    - sorting the pairs by key.
  5. Converting the RDD into a Spark Data Frame.
  6. Converting the RDD into a Python list of tuples.
  7. Converting the Spark Data Frame into a Pandas Data Frame (converting an RDD directly into a Pandas DF is not possible).
  8. Exporting the Pandas DF into a CSV flat file.
Setup

  1. Install Java version 8
  2. A Java JDK folder will be created inside:
        C:\Program Files\Java
  3. Add the JAVA_HOME environment variable:
        JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
  4. Add the bin folder to the Path variable (use the System Variables section if the User Variables section is not picked up):
        PATH = C:\Program Files\Java\jdk1.8.0_201\bin
  5. Create a subfolder named "spark" in the C: drive e.g.
        C:\spark
  6. Keep your Spark installation folder inside C:\spark e.g.
        C:\spark\spark-2.3.1-bin-hadoop2.7
  7. Set up the SPARK_HOME environment variable:
        SPARK_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
  8. Set up the HADOOP_HOME environment variable:
        HADOOP_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
  9. Add the bin folder to the Path variable (use the System Variables section if the User Variables section is not picked up):
        PATH = C:\spark\spark-2.3.1-bin-hadoop2.7\bin
  10. Copy and paste the winutils file into
        C:\spark\spark-2.3.1-bin-hadoop2.7\bin
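
Once the steps above are done, a quick sanity check (a minimal sketch, assuming PySpark is installed in the active Python environment) is to confirm the variables are visible and that a local SparkSession starts:

    import os
    from pyspark.sql import SparkSession

    # The environment variables configured in the steps above should be visible here.
    for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        print(var, "=", os.environ.get(var))

    # If the paths are correct, a local SparkSession should start without errors.
    spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
    print("Spark version:", spark.version)
    spark.stop()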

Source Code

You can view the source code: HERE