Decision Tree as Supervised Learning

Intro

Datasets:
  • Data consists of samples.
  • Each sample is a collection of features.
  • Depending on its features, each sample can be labeled with a defined category (a minimal code representation follows this list):
    sample_1: ['feature_1', 'feature_2', 'feature_3', 'feature_4']
    sample_2: ['feature_5', 'feature_6', 'feature_7', 'feature_8']
    labels: ['label_1','label_2']
    sample_1 can be labeled with label_1
    sample_2 can be labeled with label_2
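  • In code, such a dataset is often represented as a feature matrix X and a label vector y; a minimal sketch using the placeholder names above:
        # Each row of X is one sample; y holds the matching labels.
        X = [
            ['feature_1', 'feature_2', 'feature_3', 'feature_4'],  # sample_1
            ['feature_5', 'feature_6', 'feature_7', 'feature_8'],  # sample_2
        ]
        y = ['label_1', 'label_2']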
Algorithm:
  • A decision tree starts with the root node, which represents all samples, already labeled with two different categories.
  • On each node of decision tree, there is an entropy being calculated.
  • Entropy is a measure of randomness in a group of samples:
    - the lower the entropy, the lower the uncertainty in a group of samples,
    - when a group of samples is uniform (all samples can be classified with the same label) there is no randomness -> entropy equals 0,
    - when the group is split 50/50 (50% of samples are labeled as I. and the other 50% as II.) -> entropy equals 1,
    - the more a split lowers the entropy of the resulting groups, the higher the information gain from it (see the sketch after this list).
  • We want to gain as much information as possible at each node (group of samples) of the decision tree.
  • This is achieved by keeping entropy as low as possible from the root all the way down to the leaves (the last nodes, where a branch ends).
  • We reach the lowest possible entropy node by node by finding the best split value (for a text feature) or the best split threshold (for a numeric feature).
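  • A minimal sketch of the entropy and information-gain calculations described above (function names are illustrative, not taken from the app's code):

        import math

        def entropy(labels):
            # Shannon entropy of a group of sample labels, in bits.
            total = len(labels)
            result = 0.0
            for label in set(labels):
                p = labels.count(label) / total
                result -= p * math.log2(p)
            return result

        def information_gain(parent, children):
            # Entropy drop achieved by splitting parent into child groups.
            weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
            return entropy(parent) - weighted

        print(entropy(['yes', 'yes', 'yes', 'yes']))  # uniform group -> 0.0
        print(entropy(['yes', 'yes', 'no', 'no']))    # 50/50 group   -> 1.0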
Model Learning:
  • When a new sample comes in, we start from the root node and traverse the tree.
  • The sample goes through a set of conditions (split values) until it reaches a leaf node, where we can assign it a label (a traversal sketch follows).
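  • A rough sketch of that traversal over a hypothetical node structure (the real classifier handles this internally):

        class Node:
            def __init__(self, feature=None, threshold=None,
                         left=None, right=None, label=None):
                self.feature = feature      # index of the feature tested here
                self.threshold = threshold  # split threshold for that feature
                self.left = left            # subtree for values <= threshold
                self.right = right          # subtree for values > threshold
                self.label = label          # set only on leaf nodes

        def predict(node, sample):
            # Walk from the root down to a leaf and return its label.
            while node.label is None:
                node = node.left if sample[node.feature] <= node.threshold else node.right
            return node.label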

Features

The app includes the following features:

  • DecisionTreeClassifier
  • SkLearn
  • Pandas

Demo

Dataset:
  • The dataset is a collection of candidates' skills.
  • There are 4 features in each sample that impact the final decision on hiring a candidate (this example is completely made up):
    - first feature: education,
    - second feature: english,
    - third feature: experience,
    - fourth feature: interview.
  • Depending on the values of these features, the decision tree returns yes or no.
Features:
  • For 'education' feature there are values:
        {'Bachelor': 0, 'Master': 1, 'Student': 2, 'Technician': 3}
  • For 'english' feature there are values:
        {'Advanced': 0, 'Intermediate': 1, 'Native': 2}
  • For 'experience' feature there are values:
        {'1 - 2 years': 0, '1 year': 1, 'No experience': 2, 'above 2 years': 3}
  • For 'interview' feature there are values:
        {'excellent': 0, 'good': 1, 'neutral': 2, 'under expectation': 3}
  • The curly braces above are so-called dictionaries, which pair each feature value with a number.
  • The numbers are important because we need to convert the text values into numeric ones (see the sketch after this list):
    - 'Bachelor' in the education column will be replaced with 0,
    - 'Advanced' in the english column will be replaced with 0,
    - 'above 2 years' in the experience column will be replaced with 3, etc.
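  • A sketch of that conversion with pandas, reusing the dictionary names from the prediction example below; the DataFrame rows here are made up for illustration:

        import pandas as pd

        u_education = {'Bachelor': 0, 'Master': 1, 'Student': 2, 'Technician': 3}
        u_english = {'Advanced': 0, 'Intermediate': 1, 'Native': 2}
        u_experience = {'1 - 2 years': 0, '1 year': 1, 'No experience': 2, 'above 2 years': 3}
        u_interview = {'excellent': 0, 'good': 1, 'neutral': 2, 'under expectation': 3}

        # Two made-up rows standing in for the raw historical data.
        df = pd.DataFrame({
            'education': ['Student', 'Master'],
            'english': ['Intermediate', 'Advanced'],
            'experience': ['1 year', 'above 2 years'],
            'interview': ['neutral', 'good'],
        })

        # Replace each text value with its numeric code.
        df['education'] = df['education'].map(u_education)
        df['english'] = df['english'].map(u_english)
        df['experience'] = df['experience'].map(u_experience)
        df['interview'] = df['interview'].map(u_interview)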
Raw Data:
  • Notice that at the end of each row we can see the decision that was already made.
  • These are the historical records of past candidates.
  • The historical dataset shows how decisions were made for different candidates with different sets of features.
  • From these historical decisions we can teach the model how to make future decisions for the next candidates.
  • When the model receives a new candidate's features, it makes a decision relying on what it learned from the past, as recorded in the data input above.
Converted Data:
  • Text values are replaced with their corresponding numeric values according to the dictionaries above.
  • This conversion is necessary because the machine learning model accepts only numbers (a training sketch follows).
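  • Training on the converted data could look like this, continuing from the conversion sketch above; the 'decision' column name and its labels are assumptions for illustration:

        from sklearn.tree import DecisionTreeClassifier

        # Made-up historical decisions for the two illustrative rows above.
        df['decision'] = ['No', 'Yes']

        X = df[['education', 'english', 'experience', 'interview']]
        y = df['decision']

        clf = DecisionTreeClassifier()  # Gini impurity criterion by default
        clf.fit(X, y)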
Making Decision:
  • Providing a new candidate's features like below:
        new_data = [
           u_education['Student'],
           u_english['Intermediate'],
           u_experience['1 year'],
           u_interview['neutral']
        ]
  • We get the decision from the tree:
        ['No']
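  • That output would come from a call like the following, assuming clf is the fitted classifier from the training sketch above:

        prediction = clf.predict([new_data])
        print(prediction)  # ['No']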
Decision Tree graph:
  • Each tree node provides a set of information (a rendering sketch follows this list).
  • In the root, we can see that candidates with an english level <= 1.5 go to the left node; the rest go to the right node.
  • DecisionTreeClassifier uses Gini impurity as its split criterion by default (entropy can be selected with criterion='entropy').
  • We can also check the sample count, which in our example is the number of candidates.
  • value=[14, 16] shows the class counts at the node; since the classifier orders classes alphabetically, this means 14 'No' and 16 'Yes'.
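  • The graph can be rendered with pydotplus and IPython (matching the Setup section below); a sketch, assuming clf is the fitted classifier from the training step:

        from sklearn.tree import export_graphviz
        import pydotplus
        from IPython.display import Image

        # Export the fitted tree to DOT format, then render it as a PNG.
        dot_data = export_graphviz(
            clf,
            out_file=None,
            feature_names=['education', 'english', 'experience', 'interview'],
            class_names=clf.classes_,
            filled=True,
        )
        graph = pydotplus.graph_from_dot_data(dot_data)
        Image(graph.create_png())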

Setup

The script requires the following libraries to be installed:

pip install pandas
pip install scikit-learn
pip install pydotplus
pip install ipython

Source Code

You can view the source code: HERE