Specialized Consulting for Integrated Project: Data Mining

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics

⚠️ Important Notes

Whenever possible, projects and deliverables developed during the course will be made publicly accessible.
The course emphasizes practical, hands-on experience with real datasets to emulate professional consulting scenarios in the field of Data Mining.
All activities and materials will strictly adhere to the academic and ethical guidelines of PUC-SP. Any content not authorized for public disclosure will remain confidential and stored in private repositories.

Table of Contents

Course Overview
- I - class_1 - Introduction and Assessment
- II - class_2 - Introduction - Data Mining With Python
- III - class_3 - Stats Review
Objectives
Syllabus
Weekly Schedule
Tools and Technologies
Installation and Setup
Assessment
Bibliography
- Basic Bibliography
- Complementary Bibliography
Notes

Course Overview

This course introduces data mining techniques with a focus on unsupervised learning methods, including:

Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
Principal Component Analysis (PCA)
Dictionary Learning
Novelty and outlier detection

Students will work on practical projects inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in open repositories and made available to the broader community, schools, libraries, and non-profits.

Objectives

Enable students to plan, conduct, and complete a research project applying key data mining concepts, algorithms, and methodologies.

Syllabus

Fundamentals of Data Mining
Data cleaning and preparation
Predictive analysis
Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
Principal Component Analysis (PCA)
Dictionary Learning
Novelty and outlier detection
Application of concepts to real-world consulting scenarios

Weekly Schedule

Week	Topic	Methodology	Tools
1	Course introduction	Active methodology	–
2–3	Review of statistical methods	Active methodology	Python
4	Fundamentals of Data Mining	Active methodology	Python
5–6	Data cleaning and preparation	Active methodology	Python
7	Predictive analysis	Active methodology	Python
8, 10	Clustering techniques	Active methodology	Python
9	P1 Exam	Written (Individual)	–
11	K-Means algorithm	Active methodology	Python
12	Affinity Propagation	Active methodology	Python
13	Mean-Shift algorithm	Active methodology	Python
14	Principal Component Analysis (PCA)	Active methodology	Python
15	Dictionary Learning	Active methodology	Python
16	P2 Exam	Written (Individual)	–
17	P3 Exam & Grade Closure	Written (Individual)	–
18	Final grade submission	–	–

Tools and Technologies

Programming Language: Python
Libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
Environment: Jupyter Notebook or other Python IDEs

Installation and Setup

Follow these steps to set up your local environment for the course projects:

1. Clone the repository

git clone https://github.com/<username>/<repository-name>.git
cd <repository-name>

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows

3. Install dependencies

Make sure pip is updated:

pip install --upgrade pip

Then install the required packages:


pip install -r requirements.txt

(If requirements.txt is not provided, install manually:)


pip install numpy pandas scikit-learn matplotlib seaborn jupyter

4. Run Jupyter Notebook

jupyter notebook

5. Open course notebooks and start practicing.
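
To confirm that the setup worked, a short check like the one below can report which of the course libraries are installed. This is a minimal sketch: the helper name `installed_versions` is hypothetical, and it assumes Python 3.8+ (for the standard-library `importlib.metadata` module) and the pip distribution names used in the manual install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# pip distribution names taken from the manual install command above
REQUIRED = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn", "jupyter"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(REQUIRED)
for name, found in report.items():
    print(f"{name:<14} {found if found else 'NOT INSTALLED'}")
```

Run it inside the activated virtual environment; any line marked NOT INSTALLED points to a package that still needs `pip install`.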

I - Introduction and Assessment

Exam	Date	Format	Weight
P1	01/10/2025	Written – Individual	Arithmetic mean
P2	19/11/2025	Written – Individual	Arithmetic mean
P3	Substitution exam	Written – Individual	Replaces lowest score

Final Grade: Arithmetic mean of assessments.
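
The grading rule can be sketched in a few lines of Python. The function name `final_grade` and the example scores are hypothetical. The sketch also assumes that P3, when taken, replaces the lowest of P1 and P2 even if P3 is lower; the table above does not say whether the replacement is conditional, so check with the professor.

```python
def final_grade(p1, p2, p3=None):
    """Arithmetic mean of P1 and P2; if P3 was taken, it replaces the lowest score.

    Assumption: P3 replaces the lowest score whenever it is taken, even if
    P3 is lower. The course text does not state which rule applies.
    """
    scores = [p1, p2]
    if p3 is not None:
        scores[scores.index(min(scores))] = p3
    return sum(scores) / len(scores)


print(final_grade(6.0, 8.0))          # 7.0
print(final_grade(4.0, 8.0, p3=7.0))  # 7.5 (P3 replaces the 4.0)
```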

II - class_2 - Introduction - Data Mining With Python

☞ Access Booklet

Example 1

The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.

Data:

20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38

Step 1: Determine Range and Number of Classes

Minimum value: 2
Maximum value: 120
Number of classes ($k$): 8 (given)

Step 2: Calculate Class Width

$$ \huge w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = 15 $$
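
Steps 1 and 2 can be reproduced with the standard library alone. The sketch below is only an illustration; the variable names are chosen for this example.

```python
import math

# The 60 observations from Example 1 (minutes watched)
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]
k = 8  # number of classes (given)

minimum, maximum = min(data), max(data)
# Round up so that k classes are guaranteed to cover the whole range
width = math.ceil((maximum - minimum) / k)

print(len(data), minimum, maximum, width)  # 60 2 120 15
```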

Step 3: Construct Class Intervals (from minimum value)

Class Interval	Explanation
2 - 16	Starts from minimum 2
17 - 31	16 + 1 to 31
32 - 46	Next range
47 - 61	Next range
62 - 76	Next range
77 - 91	Next range
92 - 106	Next range
107 - 121	Covers maximum 120
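
The same intervals can be generated programmatically. This is a small sketch with a hypothetical helper `class_intervals`; it assumes integer data, so each class is an inclusive range holding `width` consecutive integers.

```python
def class_intervals(minimum, width, k):
    """Return k inclusive integer intervals (lower, upper) of the given width."""
    intervals = []
    lower = minimum
    for _ in range(k):
        upper = lower + width - 1
        intervals.append((lower, upper))
        lower = upper + 1  # the next class starts right after this one ends
    return intervals


intervals = class_intervals(minimum=2, width=15, k=8)
for lower, upper in intervals:
    print(f"{lower} - {upper}")
```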

Step 4: Frequency Distribution Table

Class Interval	Frequency
2 - 16	5
17 - 31	14
32 - 46	8
47 - 61	11
62 - 76	4
77 - 91	7
92 - 106	6
107 - 121	5

Step 5: Calculate Midpoints for Each Class

$$ \Huge x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2} $$

Class Interval	Midpoint ($x_i$)
2 - 16	9
17 - 31	24
32 - 46	39
47 - 61	54
62 - 76	69
77 - 91	84
92 - 106	99
107 - 121	114

Step 6: Calculate Mean Using Frequency and Midpoints

The mean ($\bar{x}$) is calculated by:

$$ \Huge \bar{x} = \frac{\sum f_i x_i}{\sum f_i} $$

Where: $f_i$ = frequency, $x_i$ = midpoint.

Calculate each product:

Class Interval	$f_i$	$x_i$	$f_i \times x_i$
2 - 16	5	9	45
17 - 31	14	24	336
32 - 46	8	39	312
47 - 61	11	54	594
62 - 76	4	69	276
77 - 91	7	84	588
92 - 106	6	99	594
107 - 121	5	114	570

Sum of frequencies: $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5 = 60$, which matches the sample size of 60 users.

Sum of products: $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570 = 3315$

Calculate mean:

$$ \huge \bar{x} = \frac{3315}{60} = 55.25 $$
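
As a cross-check on the hand tally, the sketch below recomputes the class frequencies, midpoints, and grouped mean directly from the raw data. It uses only plain Python; the variable names are chosen for this illustration.

```python
# The 60 observations from Example 1 (minutes watched)
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]
# Inclusive class intervals from Step 3
intervals = [(2, 16), (17, 31), (32, 46), (47, 61),
             (62, 76), (77, 91), (92, 106), (107, 121)]

# f_i: how many observations fall inside each inclusive interval
frequencies = [sum(lower <= x <= upper for x in data) for lower, upper in intervals]
# x_i: midpoint of each class
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

total = sum(frequencies)
# Every observation must land in exactly one class
assert total == len(data)

grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / total

for (lower, upper), f, m in zip(intervals, frequencies, midpoints):
    print(f"{lower:>3} - {upper:<3}  f={f:<2}  x={m:<5}  f*x={f * m}")
print("Sum of frequencies:", total)
print("Grouped mean:", grouped_mean)
```

The check that the frequencies add up to the number of observations is a quick way to catch tallying mistakes in a hand-built table.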

Step 7: Histogram, Bar Plot and Time Series Frequency Distribution Over Time

Construct a histogram, a bar plot, and a time series plot. In the histogram and bar plot, class intervals go on the x-axis and frequencies on the y-axis, so the height of each bar equals the frequency of its class.
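
For the 60-value sample of Example 1, a histogram with the 8 classes can be sketched as follows. This illustration is separate from the linked notebook. It assumes Matplotlib is installed, and because the data are integers, the half-open bins `[2, 17)`, `[17, 32)`, and so on contain the same values as the inclusive classes 2 - 16, 17 - 31, and so on.

```python
import matplotlib.pyplot as plt

# The 60 observations from Example 1 (minutes watched)
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# 9 bin edges for the 8 classes of width 15: 2, 17, 32, ..., 107, 122
edges = list(range(2, 123, 15))

fig, ax = plt.subplots(figsize=(10, 4))
# ax.hist returns the per-bin counts, the bin edges, and the bar patches
counts, _, _ = ax.hist(data, bins=edges, color="turquoise", edgecolor="black")

ax.set_xticks(edges)
ax.set_xlabel("Minutes watched")
ax.set_ylabel("Frequency")
ax.set_title("Frequency Distribution")

# In a notebook the figure renders inline; in a script, call plt.show()
# or fig.savefig("histogram.png") to see it.
```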

☞ Access Code

☞ Access Dataset

☞ Access Plots

Frequency Analysis and Time Series Visualization

This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.

1. Install and Import Libraries

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
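import seaborn as sns
from IPython.display import display  # available by default in Jupyter; explicit import also works in scripts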

2. Load Dataset

# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('choose your dataset', sep=';')

# Select only the "day" column
df1 = df['day']

3. Calculate Frequencies

# Calculate absolute frequency (ascending order); df1 is already a Series
freq_abs = df1.value_counts(ascending=True)

# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = df1.value_counts(normalize=True).round(3)

# Create a DataFrame with both measures
df_freq = pd.DataFrame({
    'Absolute Frequency': freq_abs,
    'Relative Frequency': freq_rel
})

# Display the frequency table
display(df_freq)

4. Histogram (Dark Theme)

# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)

# Customize labels and ticks
plt.xlabel("Values", color='white')      # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()

5. Bar Plot (Dark Theme)

# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)

# Customize labels and ticks
plt.xlabel("Values", color='white')      # white text stays readable on the black background
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')

# Show plot
plt.show()

6. Time Series Preparation

# Inspect available columns
print(df.columns)

# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()

# Add dummy year (if year column is missing)
df_time_series['year'] = 2022

# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)

# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')

# Set "date" as index
df_time_series = df_time_series.set_index('date')

# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()

# Display first rows
display(daily_counts.head())

7. Time Series Plot (Dark Theme)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot time series
plt.plot(daily_counts, color='turquoise')

# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()

Summary

Dummy year: 2022 is used as a placeholder when the dataset has no year column.

Visualizations: Histograms, bar plots, and time series chart.

III - class_3 - Stats Review

Bibliography

Basic Bibliography

CASTRO, L. N. Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva, 2016.
PIRIM, H. Recent Applications in Data Clustering. IntechOpen, 2018.
SEN, J. Machine Learning: Artificial Intelligence. IntechOpen, 2021.

Complementary Bibliography

THOMAS, C. Data Mining. IntechOpen, 2018.
HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.

💌 Let the data flow... Ping Me !

🛸๋ My Contacts Hub

Copyright 2025 Quantum Software Development. Code released under the MIT License.
