Decision Tree as Supervised Learning
Intro
Datasets:
- Data consists of samples.
- Each sample is a collection of features.
- Based on its features, each sample can be labeled with a defined category, for example (a code sketch of this structure follows the list):
sample_1: ['feature_1', 'feature_2', 'feature_3', 'feature_4']
sample_2: ['feature_5', 'feature_6', 'feature_7', 'feature_8']
labels: ['label_1','label_2']
sample_1 can be labeled with label_1
sample_2 can be labeled with label_2
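- A minimal sketch of this structure using pandas (the feature and label names are placeholders, not real data):
   import pandas as pd

   # Two samples, four features each, plus the label assigned to each sample.
   samples = pd.DataFrame(
       [['feature_1', 'feature_2', 'feature_3', 'feature_4'],   # sample_1
        ['feature_5', 'feature_6', 'feature_7', 'feature_8']],  # sample_2
       index=['sample_1', 'sample_2'])
   labels = pd.Series(['label_1', 'label_2'], index=['sample_1', 'sample_2'])
   print(samples)
   print(labels)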
Algorithm:
- The Decision Tree starts with a root node that represents all samples, already labeled with two different categories.
- At each node of the decision tree, entropy is calculated.
- Entropy is a measure of randomness in a group of samples:
- the lower the entropy, the lower the uncertainty in a group of samples,
- when a group of samples is uniform (all samples carry the same label) there is no randomness -> entropy equals 0,
- when the group is split 50/50 (50% of samples are labeled as I. and the other 50% as II.) -> entropy equals 1.
- The lower the entropy of the groups produced by a split, the higher the information gain of that split.
- We want to gain as much information as possible at each node (group of samples) in the decision tree.
- This is achieved by keeping node entropy as low as possible from the top of the tree all the way down to the leaves (the last nodes, where a branch ends).
- We reach the lowest possible entropy, node by node, by finding the best split value (for a text feature) or the best split threshold (for a numeric feature); a small computational sketch follows this list.
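- A minimal sketch (not part of the app) of how entropy and information gain can be computed for a group of binary labels:
   from collections import Counter
   from math import log2

   def entropy(labels):
       # Shannon entropy of the label distribution in a group of samples.
       total = len(labels)
       return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

   def information_gain(parent, left, right):
       # Drop in entropy achieved by splitting `parent` into `left` and `right`.
       return entropy(parent) - (len(left) / len(parent) * entropy(left)
                                 + len(right) / len(parent) * entropy(right))

   print(entropy(['Yes', 'Yes', 'Yes', 'Yes']))   # 0.0 -> uniform group, no uncertainty
   print(entropy(['Yes', 'Yes', 'No', 'No']))     # 1.0 -> 50/50 group, maximum uncertainty
   print(information_gain(['Yes', 'Yes', 'No', 'No'],
                          ['Yes', 'Yes'], ['No', 'No']))  # 1.0 -> a perfect split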
Model Learning:
- When a new sample comes in, we start from the root node and traverse the tree.
- The sample passes through a set of conditions (split values) until it reaches a leaf node, where a label is assigned to it (a toy traversal is sketched below).
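- A toy illustration of such a traversal (the conditions and thresholds here are made up, not the demo's real tree):
   def predict_label(sample):
       # `sample` holds numerically encoded features, e.g. {'english': 1, 'experience': 0}.
       if sample['english'] <= 1.5:          # root split
           if sample['experience'] <= 0.5:   # left-branch split
               return 'Yes'                  # leaf
           return 'No'                       # leaf
       return 'No'                           # leaf

   print(predict_label({'english': 1, 'experience': 0}))  # -> Yes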
Features
The app includes the following features:
Demo
Dataset:
- The dataset is a collection of candidates' skills.
- There are 4 features in each sample that impact the final hiring decision (this example is completely arbitrary):
- first feature: education,
- second feature: english,
- third feature: experience,
- fourth feature: interview.
- Depending on the value of each feature, the decision tree returns Yes or No.
Features:
- For 'education' feature there are values:
   {'Bachelor': 0, 'Master': 1, 'Student': 2, 'Technician': 3}
- For 'english' feature there are values:
   {'Advanced': 0, 'Intermediate': 1, 'Native': 2}
- For 'experience' feature there are values:
   {'1 - 2 years': 0, '1 year': 1, 'No experience': 2, 'above 2 years': 3}
- For 'interview' feature there are values:
   {'excellent': 0, 'good': 1, 'neutral': 2, 'under expectation': 3}
- The curly braces above are so-called dictionaries, which pair each feature value with a number.
- The numbers are important because we need to convert text values into numeric ones (a conversion sketch follows this list):
- 'Bachelor' in the education column will be replaced with 0,
- 'Advanced' in the english column will be replaced with 0,
- 'above 2 years' in the experience column will be replaced with 3, etc.
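- A minimal conversion sketch with pandas, reusing the dictionaries above (the two rows of raw data are made up for illustration):
   import pandas as pd

   u_education = {'Bachelor': 0, 'Master': 1, 'Student': 2, 'Technician': 3}
   u_english = {'Advanced': 0, 'Intermediate': 1, 'Native': 2}
   u_experience = {'1 - 2 years': 0, '1 year': 1, 'No experience': 2, 'above 2 years': 3}
   u_interview = {'excellent': 0, 'good': 1, 'neutral': 2, 'under expectation': 3}

   df = pd.DataFrame({
       'education': ['Bachelor', 'Student'],
       'english': ['Advanced', 'Intermediate'],
       'experience': ['above 2 years', '1 year'],
       'interview': ['good', 'neutral'],
   })

   # Replace text values with their numeric codes, column by column.
   df['education'] = df['education'].map(u_education)
   df['english'] = df['english'].map(u_english)
   df['experience'] = df['experience'].map(u_experience)
   df['interview'] = df['interview'].map(u_interview)
   print(df)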
Raw Data:
- Notice that at the end of each row we can see the decision that was already made.
- These are the historical records of past candidates.
- The historical dataset shows how decisions were made for different candidates with different sets of features.
- Having these historical decisions, we can teach the model how to make future decisions for new candidates (a training sketch follows this list).
- When the model receives a new candidate's features, it makes a decision relying on what it learned from the past, which is recorded in the data input above.
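- A minimal training sketch (the numbers below are toy records, not the real dataset; the 'decision' column name is an assumption):
   import pandas as pd
   from sklearn.tree import DecisionTreeClassifier

   history = pd.DataFrame({
       'education':  [0, 2, 1, 3],
       'english':    [0, 1, 2, 1],
       'experience': [3, 1, 0, 2],
       'interview':  [0, 2, 1, 3],
       'decision':   ['Yes', 'No', 'Yes', 'No'],
   })

   X = history[['education', 'english', 'experience', 'interview']]
   y = history['decision']

   clf = DecisionTreeClassifier()   # uses Gini impurity as split criterion by default
   clf.fit(X, y)                    # the model learns split thresholds from past decisions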
Converted Data:
- Text values are replaced with the corresponding numeric values according to the dictionaries above.
- This conversion is necessary because the machine learning model accepts only numbers.
Making Decision
- Providing a new candidate's features like below:
   new_data = [
      u_education['Student'],
      u_english['Intermediate'],
      u_experience['1 year'],
      u_interview['neutral']
   ]
- We get the decision from the tree:
   ['No']
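- A minimal sketch of the call behind this result, assuming `clf` is a DecisionTreeClassifier fitted on the converted data and `new_data` is the encoded feature list above:
   decision = clf.predict([new_data])  # route the sample through the learned splits
   print(decision)                     # e.g. ['No']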
Decision Tree graph:
- Each tree node provides a set of information.
- In the root, we can see that candidates with an english level below 1.5 go to the left node, and the rest go to the right node.
- DecisionTreeClassifier uses Gini impurity as its splitting criterion by default (as an alternative to entropy).
- We can also check the sample size, which in our example is the number of candidates in the node.
- From value=[14, 16] we can already see that there are 14 yeses and 16 noes (a rendering sketch follows this list).
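- A minimal rendering sketch for such a graph, assuming `clf` is the fitted classifier (graphviz must also be available on the system):
   from sklearn.tree import export_graphviz
   from IPython.display import Image
   import pydotplus

   dot_data = export_graphviz(
       clf,
       feature_names=['education', 'english', 'experience', 'interview'],
       class_names=clf.classes_,
       filled=True,
       out_file=None,
   )
   graph = pydotplus.graph_from_dot_data(dot_data)
   Image(graph.create_png())   # display inline, or graph.write_png('tree.png') to save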
Setup
The script requires installing the following libraries:
pip install pandas
pip install scikit-learn
pip install pydotplus
pip install ipython