Disclaimer: This repository is a sketchbook for learning the background of decision tree algorithms. It is neither clean nor readable. Please refer to the Chefboost repository for a clean implementation.
This is the repository of the Decision Trees for Machine Learning online course published on Udemy. The course covers the following algorithms. The whole project is developed in Python (3.6.4); no out-of-the-box library or framework is used to build the decision trees.
1- ID3
2- C4.5
3- CART (Classification And Regression Trees)
4- Regression Trees (CART for regression)
5- Random Forest
6- Gradient Boosting Decision Trees for Regression
7- Gradient Boosting Decision Trees for Classification
8- Adaboost
To stay up to date, you can check the posts about decision trees on my blog.
Here is a step-by-step example of how to run the decision tree script:
- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Run the main script:

  ```
  python python/decision.py
  ```

- Sample input: The script uses the dataset in `dataset/golf.txt` by default. You can change the dataset by editing the `df = pd.read_csv(...)` line in `python/decision.py`. Example of `golf.txt` (first few lines):

  ```
  Outlook,Temperature,Humidity,Wind,Decision
  Sunny,Hot,High,Weak,No
  Sunny,Hot,High,Strong,No
  Overcast,Hot,High,Weak,Yes
  Rain,Mild,High,Weak,Yes
  Rain,Cool,Normal,Weak,Yes
  Rain,Cool,Normal,Strong,No
  Overcast,Cool,Normal,Strong,Yes
  Sunny,Mild,High,Weak,No
  Sunny,Cool,Normal,Weak,Yes
  Rain,Mild,Normal,Weak,Yes
  Sunny,Mild,Normal,Strong,Yes
  Overcast,Mild,High,Strong,Yes
  Overcast,Hot,Normal,Weak,Yes
  Rain,Mild,High,Strong,No
  ```
- Sample output: After running the script, a file named `rules.py` will be generated in the `python/` directory. This file contains the decision rules as a Python function. You will also see console output similar to:

  ```
  C4.5 tree is going to be built...
  finished in 0.02 seconds
  ```

  Example of a generated rule (in `python/rules.py`):

  ```python
  def findDecision(obj):
      if obj[0] == 'Overcast':
          return 'Yes'
      if obj[0] == 'Rain':
          if obj[3] == 'Weak':
              return 'Yes'
          if obj[3] == 'Strong':
              return 'No'
      if obj[0] == 'Sunny':
          if obj[2] == 'High':
              return 'No'
          if obj[2] == 'Normal':
              return 'Yes'
  ```
- Changing the algorithm or dataset: Edit the variables at the top of `python/decision.py` to select a different algorithm or dataset, as sketched below.
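If you prefer editing the script over passing flags, swapping the dataset comes down to changing one pandas call. A minimal sketch, assuming your CSV keeps the same layout as `golf.txt` (feature columns first, target labels last):

```python
import pandas as pd

# Point this at any CSV with feature columns first and the target
# labels ("Decision" in golf.txt) in the last column.
df = pd.read_csv("dataset/golf.txt")
print(df.columns.tolist())
print(df.head())
```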
When you run the decision tree script, it generates Python files containing the decision rules. The exact filename depends on the algorithm and settings used:
- Standard decision tree: `rules.py`
- Random Forest: `rule_0.py`, `rule_1.py`, `rule_2.py`, etc. (one file per tree)
- Gradient Boosting: `rules0.py`, `rules1.py`, `rules2.py`, etc. (one file per iteration)
- Adaboost: `rules_0.py`, `rules_1.py`, `rules_2.py`, etc. (one file per round)
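To see which files a given run actually produced, you can glob for the patterns above. A minimal sketch, assuming the files are written into `python/` as described earlier:

```python
import glob

# rules.py, rule_0.py, rules0.py and rules_0.py all share the "rule"
# prefix, so one pattern catches every variant listed above.
for path in sorted(glob.glob("python/rule*.py")):
    print(path)
```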
The generated files contain a Python function called `findDecision(obj)` that implements the decision tree as a series of if-else statements. For example:
```python
def findDecision(obj):
    if obj[0] == 'Sunny':
        if obj[2] == 'High':
            return 'No'
        if obj[2] == 'Normal':
            return 'Yes'
    if obj[0] == 'Rain':
        if obj[3] == 'Weak':
            return 'Yes'
        if obj[3] == 'Strong':
            return 'No'
    if obj[0] == 'Overcast':
        return 'Yes'
```
- Import the rules file:

  ```python
  import rules  # or whatever the generated filename is
  ```

- Make predictions:

  ```python
  # Create a feature vector (in the same order as your dataset columns)
  features = ['Sunny', 'Hot', 'High', 'Weak']  # Example for golf dataset

  # Get prediction
  prediction = rules.findDecision(features)
  print(f"Prediction: {prediction}")
  ```
The `obj` parameter in `findDecision(obj)` is a list where each element corresponds to a feature column in your dataset, in the same order:

- `obj[0]` = First feature column
- `obj[1]` = Second feature column
- `obj[2]` = Third feature column
- And so on...

Example for the golf dataset:

- `obj[0]` = Outlook (Sunny, Overcast, Rain)
- `obj[1]` = Temperature (Hot, Mild, Cool)
- `obj[2]` = Humidity (High, Normal)
- `obj[3]` = Wind (Weak, Strong)
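Because a misordered feature vector fails silently, one safeguard is to build the vectors straight from the dataset itself. The sketch below replays `golf.txt` through the generated rules and counts how many training rows they reproduce; it assumes `rules.py` has already been generated in `python/`:

```python
import sys
sys.path.append("python")  # directory holding the generated rules.py

import pandas as pd
import rules  # generated by python/decision.py

df = pd.read_csv("dataset/golf.txt")

correct = 0
for _, row in df.iterrows():
    features = row.iloc[:-1].tolist()  # every column except Decision
    prediction = rules.findDecision(features)
    correct += int(prediction == row.iloc[-1])

print(f"{correct}/{len(df)} training rows reproduced")
```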
When you run the script, you'll see output like:
```
C4.5 tree is going to be built...
finished in 0.022043228149414062 seconds
```
This shows:
- Which algorithm was used
- How long the tree building process took
For ensemble methods (Random Forest, Gradient Boosting, Adaboost), multiple rule files are generated. Each file represents one tree or iteration in the ensemble. To use these (see the sketch after this list):
- Random Forest: Use all generated files and take a majority vote
- Gradient Boosting: Apply the files sequentially; each tree's output adjusts the running prediction
- Adaboost: Use all files, weighting each one's vote by its round weight
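If you want to consume the generated files from your own code, the sketch below shows the Random Forest case: load each `rule_*.py` module and take a majority vote. The `load_module` helper is my own illustration, not something the repository provides:

```python
import glob
import importlib.util
from collections import Counter

def load_module(path):
    # Illustrative helper: import a generated rule file by its path.
    spec = importlib.util.spec_from_file_location("rule_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

features = ['Sunny', 'Hot', 'High', 'Weak']

# Random Forest: every tree votes and the majority wins.
votes = [load_module(path).findDecision(features)
         for path in sorted(glob.glob("python/rule_*.py"))]
print(Counter(votes).most_common(1)[0][0])
```

Gradient Boosting and Adaboost would follow the same loading pattern, but accumulate each file's output into a running prediction (scaled by the learning rate) or weight each round's vote, as described above.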
- No output files generated: Check that `dump_to_console = False` in the script
- Wrong predictions: Ensure your feature vector matches the dataset column order
- Import errors: Make sure the generated rules file is in your Python path (see the sketch below)
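For the import error in particular, the generated files live in `python/`, so the quickest fix is to put that directory on `sys.path` before importing. A minimal sketch:

```python
import sys

sys.path.append("python")  # directory that holds the generated rules.py

import rules
print(rules.findDecision(['Sunny', 'Hot', 'High', 'Weak']))
```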
You can now configure the script without editing the code by using command-line arguments:
```
python python/decision.py [OPTIONS]
```
Available options:
- `--algorithm` (`ID3`, `C4.5`, `CART`, `Regression`): algorithm to use (default: `C4.5`)
- `--dataset`: path to the dataset file (default: `dataset/golf.txt`)
- `--random-forest`: enable Random Forest
- `--num-trees`: number of trees for Random Forest (default: 3)
- `--multitasking`: enable multitasking for Random Forest
- `--adaboost`: enable Adaboost
- `--gradient-boosting`: enable Gradient Boosting
- `--epochs`: number of epochs for boosting (default: 10)
- `--learning-rate`: learning rate for boosting (default: 1)
- `--dump-to-console`: print rules to the console instead of a file
Example:
```
python python/decision.py --algorithm ID3 --dataset dataset/golf.txt --dump-to-console
```
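For orientation, flags like these are typically declared with `argparse`. The sketch below is one plausible wiring of the documented options and defaults; it is an illustration, not necessarily the actual contents of `python/decision.py`:

```python
import argparse

# Hypothetical parser mirroring the documented flags and defaults.
parser = argparse.ArgumentParser(description="Build decision trees from a CSV dataset")
parser.add_argument("--algorithm", default="C4.5",
                    choices=["ID3", "C4.5", "CART", "Regression"])
parser.add_argument("--dataset", default="dataset/golf.txt")
parser.add_argument("--random-forest", action="store_true")
parser.add_argument("--num-trees", type=int, default=3)
parser.add_argument("--multitasking", action="store_true")
parser.add_argument("--adaboost", action="store_true")
parser.add_argument("--gradient-boosting", action="store_true")
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--learning-rate", type=float, default=1.0)
parser.add_argument("--dump-to-console", action="store_true")

args = parser.parse_args()
print(f"Building a {args.algorithm} tree from {args.dataset}")
```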
- Python 3.6 or higher
- Clone the repository:

  ```
  git clone https://github.com/yourusername/decision-trees-for-ml.git
  cd decision-trees-for-ml
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```
You can run the main script with various options using command-line arguments.
Basic usage:

```
python python/decision.py
```

Specify algorithm and dataset:

```
python python/decision.py --algorithm ID3 --dataset dataset/golf.txt
```

Enable Random Forest:

```
python python/decision.py --random-forest --num-trees 5
```

Print rules to console:

```
python python/decision.py --dump-to-console
```
For a full list of options, see the Running with Command-Line Arguments section above.
- The script will print the algorithm being used and the time taken to build the tree.
- By default, it will generate a Python file (e.g., `rules.py`) containing the decision rules.
- If `--dump-to-console` is used, rules will be printed to the terminal instead of being saved to a file.
See the Output Files and Results section for more details.
This repository is licensed under the MIT License - see LICENSE for more details.