Spark For Handling Big Data

Intro

Reportedly, Spark can perform up to 100x faster than Hadoop MapReduce.

  • Spark covers the data processing part of Big Data Engineering.
  • Apache Spark is written in the Scala programming language.
  • Spark is a memory-based solution: it keeps as much data as it can in RAM cache, which contributes to its speed.
  • Spark can run both on a local machine and in a cluster environment, either on local infrastructure or in the cloud (e.g. Amazon EC2).
  • The core idea behind Spark is MapReduce: mapping your data set into a collection of (key, value) pairs, and then reducing over all pairs with the same key (see the sketch below).
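
A minimal sketch of that MapReduce idea in PySpark; the word-count data and app name below are illustrative, not part of this project:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "mapreduce-sketch")

    # Map each word to a (key, value) pair, then reduce over all pairs sharing a key.
    counts = (
        sc.parallelize(["spark handles big data", "spark runs in memory"])
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # map: emit (key, value) pairs
          .reduceByKey(lambda a, b: a + b)      # reduce: sum values per key
    )

    print(counts.collect())  # e.g. [('spark', 2), ('handles', 1), ...]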

PySpark is a Python library that enables using Spark with Python instead of Scala.

  • With Python, code readability, maintenance, and refactoring are far easier than in Scala.
  • I apply functional programming to build the data model, e.g. passing functions (both lambdas and named functions) as parameters to other functions.
  • PySpark operations work like Python generators: evaluation is delayed until the result is actually requested (see the sketch after this list).
  • This avoids pulling the full data frame into memory and enables efficient processing across a cluster of machines (here I use a single local machine).
  • This is in contrast to Pandas dataframes, where everything is pulled into memory at once.
  • The main data type in PySpark is the Spark dataframe - equivalent to dataframes in R and Pandas.
  • To apply distributed computation using PySpark, you need to perform operations on Spark dataframes.
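
A minimal sketch of this lazy evaluation, assuming a local SparkSession and an illustrative two-column data frame (neither belongs to this project):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("lazy-sketch").getOrCreate()

    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    # Transformations are lazy: this only builds an execution plan, nothing runs yet.
    grouped = df.filter(F.col("value") > 1).groupBy("key").sum("value")

    # An action (show/collect/count) triggers the actual computation.
    grouped.show()

    # Only at this point is the (small) result pulled into driver memory as Pandas.
    pandas_df = grouped.toPandas()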

Features

The app includes the following features:

  • PySpark API
  • Jupyter Notebook
  • MapReduce

Demo

Here you can find a general description of each step. For the detailed description along with the code itself, check the Jupyter Notebook linked as the source code. A condensed sketch of the whole pipeline follows the list of steps.

Steps:
  1. Importing all necessary libraries and initializing Spark.
  2. Setting up variables and functions to be passed later as parameters.
  3. Importing the data set.
  4. Building the so-called RDD data model that the data flows through:
    - filtering data with a lambda function that checks a condition on every RDD line,
    - mapping each RDD line into key-value pairs,
    - reducing the key-value pairs by key, aggregating their values,
    - sorting the pairs by key.
  5. Converting the RDD into a Spark Data Frame.
  6. Converting the RDD into a Python list of tuples.
  7. Converting the Spark Data Frame into a Pandas Data Frame (converting an RDD directly into a Pandas DF is not possible).
  8. Exporting the Pandas DF into a CSV flat file.
Setup

  1. Install Java version 8
  2. A Java JDK folder will be created inside:
        C:\Program Files\Java
  3. Add the JAVA_HOME environment variable:
        JAVA_HOME = C:\Program Files\Java\jdk1.8.0_201
  4. Add the bin folder to the Path variable (use the System Variables section if the User Variables section is not picked up):
        PATH = C:\Program Files\Java\jdk1.8.0_201\bin
  5. Create a subfolder named "spark" in the C: drive e.g.
        C:\spark
  6. Keep your Spark installation folder inside C:\spark e.g.
        C:\spark\spark-2.3.1-bin-hadoop2.7
  7. Set up the SPARK_HOME environment variable:
        SPARK_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
  8. Set up the HADOOP_HOME environment variable:
        HADOOP_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
  9. Add the bin folder to the Path variable (use the System Variables section if the User Variables section is not picked up):
        PATH = C:\spark\spark-2.3.1-bin-hadoop2.7\bin
  10. Copy and paste the winutils file into
        C:\spark\spark-2.3.1-bin-hadoop2.7\bin
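
Once the steps above are done, a quick sanity check (a minimal sketch, assuming PySpark is installed in the active Python environment) is to confirm the variables are visible and that a local SparkSession starts:

    import os
    from pyspark.sql import SparkSession

    # The environment variables configured in the steps above should be visible here.
    for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
        print(var, "=", os.environ.get(var))

    # If the paths are correct, a local SparkSession should start without errors.
    spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
    print("Spark version:", spark.version)
    spark.stop()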

Source Code

You can view the source code: HERE