Decision Tree as Supervised Learning
Intro
Datasets:
- Data consists of samples.
- Each sample is a collection of features.
- Based on its features, each sample can be labeled with a defined category, for example (a code sketch of this structure follows the list):
sample_1: ['feature_1', 'feature_2', 'feature_3', 'feature_4']
sample_2: ['feature_5', 'feature_6', 'feature_7', 'feature_8']
labels: ['label_1','label_2']
sample_1 can be labeled with label_1
sample_2 can be labeled with label_2
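- A minimal sketch of this structure using pandas (the feature and label names are placeholders, not real data):
   import pandas as pd

   # Two samples, four features each, plus the label assigned to each sample.
   samples = pd.DataFrame(
       [['feature_1', 'feature_2', 'feature_3', 'feature_4'],   # sample_1
        ['feature_5', 'feature_6', 'feature_7', 'feature_8']],  # sample_2
       index=['sample_1', 'sample_2'])
   labels = pd.Series(['label_1', 'label_2'], index=['sample_1', 'sample_2'])
   print(samples)
   print(labels)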
Algorithm:
- The Decision Tree starts with a root node that represents all samples, already labeled with two different categories.
- At each node of the decision tree, entropy is calculated.
- Entropy is a measure of randomness in a group of samples:
- the lower the entropy, the lower the uncertainty in a group of samples,
- when a group of samples is uniform (all samples carry the same label) there is no randomness -> entropy equals 0,
- when the group is split 50/50 (50% of samples are labeled as I. and the other 50% as II.) -> entropy equals 1.
- The lower the entropy of the groups produced by a split, the higher the information gain of that split.
- We want to gain as much information as possible at each node (group of samples) in the decision tree.
- This is achieved by keeping node entropy as low as possible from the top of the tree all the way down to the leaves (the last nodes, where a branch ends).
- We reach the lowest possible entropy, node by node, by finding the best split value (for a text feature) or the best split threshold (for a numeric feature); a small computational sketch follows this list.
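- A minimal sketch (not part of the app) of how entropy and information gain can be computed for a group of binary labels:
   from collections import Counter
   from math import log2

   def entropy(labels):
       # Shannon entropy of the label distribution in a group of samples.
       total = len(labels)
       return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

   def information_gain(parent, left, right):
       # Drop in entropy achieved by splitting `parent` into `left` and `right`.
       return entropy(parent) - (len(left) / len(parent) * entropy(left)
                                 + len(right) / len(parent) * entropy(right))

   print(entropy(['Yes', 'Yes', 'Yes', 'Yes']))   # 0.0 -> uniform group, no uncertainty
   print(entropy(['Yes', 'Yes', 'No', 'No']))     # 1.0 -> 50/50 group, maximum uncertainty
   print(information_gain(['Yes', 'Yes', 'No', 'No'],
                          ['Yes', 'Yes'], ['No', 'No']))  # 1.0 -> a perfect split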
Model Learning:
- When a new sample comes in, we start from the root node and traverse the tree.
- The sample passes through a set of conditions (split values) until it reaches a leaf node, where a label is assigned to it (a toy traversal is sketched below).
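- A toy illustration of such a traversal (the conditions and thresholds here are made up, not the demo's real tree):
   def predict_label(sample):
       # `sample` holds numerically encoded features, e.g. {'english': 1, 'experience': 0}.
       if sample['english'] <= 1.5:          # root split
           if sample['experience'] <= 0.5:   # left-branch split
               return 'Yes'                  # leaf
           return 'No'                       # leaf
       return 'No'                           # leaf

   print(predict_label({'english': 1, 'experience': 0}))  # -> Yes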
Features
The app includes the following features:
Demo
Dataset:
- The dataset is a collection of candidates' skills.
- There are 4 features in each sample that impact the final hiring decision (this example is completely arbitrary):
- first feature: education,
- second feature: english,
- third feature: experience,
- fourth feature: interview.
- Depending on the value of each feature, the decision tree returns Yes or No.
Features:
- For 'education' feature there are values:
   {'Bachelor': 0, 'Master': 1, 'Student': 2, 'Technician': 3}
- For 'english' feature there are values:
   {'Advanced': 0, 'Intermediate': 1, 'Native': 2}
- For 'experience' feature there are values:
   {'1 - 2 years': 0, '1 year': 1, 'No experience': 2, 'above 2 years': 3}
- For 'interview' feature there are values:
   {'excellent': 0, 'good': 1, 'neutral': 2, 'under expectation': 3}
- The curly braces above are so-called dictionaries, which pair each feature value with a number.
- The numbers are important because we need to convert text values into numeric ones (a conversion sketch follows this list):
- 'Bachelor' in the education column will be replaced with 0,
- 'Advanced' in the english column will be replaced with 0,
- 'above 2 years' in the experience column will be replaced with 3, etc.
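- A minimal conversion sketch with pandas, reusing the dictionaries above (the two rows of raw data are made up for illustration):
   import pandas as pd

   u_education = {'Bachelor': 0, 'Master': 1, 'Student': 2, 'Technician': 3}
   u_english = {'Advanced': 0, 'Intermediate': 1, 'Native': 2}
   u_experience = {'1 - 2 years': 0, '1 year': 1, 'No experience': 2, 'above 2 years': 3}
   u_interview = {'excellent': 0, 'good': 1, 'neutral': 2, 'under expectation': 3}

   df = pd.DataFrame({
       'education': ['Bachelor', 'Student'],
       'english': ['Advanced', 'Intermediate'],
       'experience': ['above 2 years', '1 year'],
       'interview': ['good', 'neutral'],
   })

   # Replace text values with their numeric codes, column by column.
   df['education'] = df['education'].map(u_education)
   df['english'] = df['english'].map(u_english)
   df['experience'] = df['experience'].map(u_experience)
   df['interview'] = df['interview'].map(u_interview)
   print(df)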
Raw Data:
- Notice that at the end of each row we can see the decision that was already made.
- These are the historical records of past candidates.
- The historical dataset shows how decisions were made for different candidates with different sets of features.
- Having these historical decisions, we can teach the model how to make future decisions for new candidates (a training sketch follows this list).
- When the model receives a new candidate's features, it makes a decision relying on what it learned from the past, which is recorded in the data input above.
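- A minimal training sketch (the numbers below are toy records, not the real dataset; the 'decision' column name is an assumption):
   import pandas as pd
   from sklearn.tree import DecisionTreeClassifier

   history = pd.DataFrame({
       'education':  [0, 2, 1, 3],
       'english':    [0, 1, 2, 1],
       'experience': [3, 1, 0, 2],
       'interview':  [0, 2, 1, 3],
       'decision':   ['Yes', 'No', 'Yes', 'No'],
   })

   X = history[['education', 'english', 'experience', 'interview']]
   y = history['decision']

   clf = DecisionTreeClassifier()   # uses Gini impurity as split criterion by default
   clf.fit(X, y)                    # the model learns split thresholds from past decisions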
Converted Data:
- Text values are replaced with the corresponding numeric values according to the dictionaries above.
- This conversion is necessary because the machine learning model accepts only numbers.
Making Decision
- Providing a new candidate's features like below:
   new_data = [
      u_education['Student'],
      u_english['Intermediate'],
      u_experience['1 year'],
      u_interview['neutral']
   ]
- We get the decision from the tree:
   ['No']
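- A minimal sketch of the call behind this result, assuming `clf` is a DecisionTreeClassifier fitted on the converted data and `new_data` is the encoded feature list above:
   decision = clf.predict([new_data])  # route the sample through the learned splits
   print(decision)                     # e.g. ['No']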
Decision Tree graph:
- Each tree node provides a set of information.
- In the root, we can see that candidates with an english level below 1.5 go to the left node, and the rest go to the right node.
- DecisionTreeClassifier uses Gini impurity as its splitting criterion by default (as an alternative to entropy).
- We can also check the sample size, which in our example is the number of candidates in the node.
- From value=[14, 16] we can already see that there are 14 yeses and 16 noes (a rendering sketch follows this list).
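- A minimal rendering sketch for such a graph, assuming `clf` is the fitted classifier (graphviz must also be available on the system):
   from sklearn.tree import export_graphviz
   from IPython.display import Image
   import pydotplus

   dot_data = export_graphviz(
       clf,
       feature_names=['education', 'english', 'experience', 'interview'],
       class_names=clf.classes_,
       filled=True,
       out_file=None,
   )
   graph = pydotplus.graph_from_dot_data(dot_data)
   Image(graph.create_png())   # display inline, or graph.write_png('tree.png') to save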
Setup
The script requires installing the following libraries:
pip install pandas
pip install scikit-learn
pip install pydotplus
pip install ipython