Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)
- Whenever possible, projects and deliverables developed during the course will be made publicly accessible.
- The course emphasizes practical, hands-on experience with real datasets to emulate professional consulting scenarios in the field of Data Mining.
- All activities and materials will strictly adhere to the academic and ethical guidelines of PUC-SP. Any content not authorized for public disclosure will remain confidential and stored in private repositories.
- Course Overview
- Objectives
- Syllabus
- Weekly Schedule
- Tools and Technologies
- Installation and Setup
- Assessment
- Bibliography
- Notes
This course introduces data mining techniques with a focus on unsupervised learning methods, including:
- Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
Students will work on practical projects inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in open repositories and made available to the broader community, schools, libraries, and non-profits.
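As a rough illustration of the unsupervised methods listed above, the following sketch clusters synthetic data with K-Means and projects it to two dimensions with PCA using scikit-learn. The synthetic dataset, the choice of three clusters, and the variable names are assumptions made for illustration and are not taken from the course materials.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data (assumption): 300 points in 5 dimensions around 3 centers.
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# K-Means with k=3; n_init and random_state are fixed for reproducibility.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# PCA keeps the two directions of largest variance, e.g. for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Cluster sizes:", [int((labels == c).sum()) for c in range(3)])
print("Explained variance ratio:", pca.explained_variance_ratio_)
```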
Enable students to plan, conduct, and complete a research project applying key data mining concepts, algorithms, and methodologies.
- Fundamentals of Data Mining
- Data cleaning and preparation
- Predictive analysis
- Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
- Principal Component Analysis (PCA)
- Dictionary Learning
- Novelty and outlier detection
- Application of concepts to real-world consulting scenarios
Week | Topic | Methodology | Tools |
---|---|---|---|
1 | Course introduction | Active methodology | – |
2–3 | Review of statistical methods | Active methodology | Python |
4 | Fundamentals of Data Mining | Active methodology | Python |
5–6 | Data cleaning and preparation | Active methodology | Python |
7 | Predictive analysis | Active methodology | Python |
8, 10 | Clustering techniques | Active methodology | Python |
9 | P1 Exam | Written (Individual) | – |
11 | K-Means algorithm | Active methodology | Python |
12 | Affinity Propagation | Active methodology | Python |
13 | Mean-Shift algorithm | Active methodology | Python |
14 | Principal Component Analysis (PCA) | Active methodology | Python |
15 | Dictionary Learning | Active methodology | Python |
16 | P2 Exam | Written (Individual) | – |
17 | P3 Exam & Grade Closure | Written (Individual) | – |
18 | Final grade submission | – | – |
- Programming Language: Python
- Libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
- Environment: Jupyter Notebook or other Python IDEs
Follow these steps to set up your local environment for the course projects:
1. Clone the repository
git clone https://github.com/<username>/<repository-name>.git
cd <repository-name>
2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Mac/Linux
venv\Scripts\activate  # Windows
3. Install dependencies
Make sure pip is updated:
pip install --upgrade pip
Then install the required packages:
pip install -r requirements.txt
(If requirements.txt is not provided, install manually:)
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
4. Run Jupyter Notebook
jupyter notebook
5. Open course notebooks and start practicing.
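Once the steps above are complete, one quick way to confirm the environment is a short script that reports which of the course libraries are installed. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list mirrors the manual `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by pip in the manual install command.
PACKAGES = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn", "jupyter"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Any package reported as not installed can be added with `pip install <package>` inside the activated virtual environment.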
Exam | Date | Format | Weight |
---|---|---|---|
P1 | 01/10/2025 | Written – Individual | Arithmetic mean |
P2 | 19/11/2025 | Written – Individual | Arithmetic mean |
P3 | Substitution exam | Written – Individual | Replaces lowest score |
Final Grade: Arithmetic mean of assessments.
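The grading rule can be sketched in a few lines. This is only an illustration under one stated assumption: the substitution exam P3 replaces the lower of P1 and P2 only when it improves that score (the table does not say what happens if P3 is lower). The function name `final_grade` is hypothetical.

```python
def final_grade(p1, p2, p3=None):
    """Arithmetic mean of P1 and P2, with optional P3 substitution.

    Assumption: P3 replaces the lowest of P1/P2 only if P3 is higher.
    """
    scores = [p1, p2]
    if p3 is not None:
        lowest = scores.index(min(scores))
        scores[lowest] = max(scores[lowest], p3)
    return sum(scores) / len(scores)


print(final_grade(6.0, 8.0))        # 7.0
print(final_grade(4.0, 8.0, 7.0))   # 7.5
```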
The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.
Data:
20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38
Step 1: Determine Range and Number of Classes
- Minimum value: 2
- Maximum value: 120
- Number of classes ($k$): 8 (given)
Step 2: Calculate Class Width
$h = \dfrac{\text{max} - \text{min}}{k} = \dfrac{120 - 2}{8} = 14.75$, rounded up to $h = 15$.
Step 3: Construct Class Intervals (from minimum value)
Class Interval | Explanation |
---|---|
2 - 16 | Starts from minimum 2 |
17 - 31 | 16 + 1 to 31 |
32 - 46 | Next range |
47 - 61 | Next range |
62 - 76 | Next range |
77 - 91 | Next range |
92 - 106 | Next range |
107 - 121 | Covers maximum 120 |
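Steps 1 to 3 can be reproduced with a short script. This is a minimal sketch using only the standard library; it assumes the width is rounded up to a whole number of minutes and that classes are closed integer intervals starting at the minimum, which matches the table above. Variable names are illustrative.

```python
import math

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]
k = 8  # number of classes, given in the exercise

low, high = min(data), max(data)
width = math.ceil((high - low) / k)  # range / k, rounded up to a whole minute

# Closed integer intervals: each class starts one above the previous upper limit.
intervals = [(low + i * width, low + (i + 1) * width - 1) for i in range(k)]

print(f"min={low}, max={high}, width={width}")
for lower, upper in intervals:
    print(f"{lower} - {upper}")
```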
Step 4: Frequency Distribution Table
Class Interval | Frequency |
---|---|
2 - 16 | 5 |
17 - 31 | 14 |
32 - 46 | 8 |
47 - 61 | 11 |
62 - 76 | 4 |
77 - 91 | 7 |
92 - 106 | 6 |
107 - 121 | 5 |
Step 5: Calculate Midpoints for Each Class
Class Interval | Midpoint ($x_i$) |
---|---|
2 - 16 | 9 |
17 - 31 | 24 |
32 - 46 | 39 |
47 - 61 | 54 |
62 - 76 | 69 |
77 - 91 | 84 |
92 - 106 | 99 |
107 - 121 | 114 |
Step 6: Calculate Mean Using Frequency and Midpoints
The mean ($\bar{x}$) is calculated by:
$\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$
Class Interval | Frequency ($f_i$) | Midpoint ($x_i$) | $f_i \cdot x_i$ |
---|---|---|---|
2 - 16 | 5 | 9 | 45 |
17 - 31 | 14 | 24 | 336 |
32 - 46 | 8 | 39 | 312 |
47 - 61 | 11 | 54 | 594 |
62 - 76 | 4 | 69 | 276 |
77 - 91 | 7 | 84 | 588 |
92 - 106 | 6 | 99 | 594 |
107 - 121 | 5 | 114 | 570 |
Sum of frequencies: $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5$ = 60
Sum of products: $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570$ = 3315
Mean: $\bar{x} = 3315 / 60 = 55.25$ minutes
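The frequency table, midpoints, and grouped mean can also be recomputed directly from the 60 raw values, which is a useful check on hand tabulation. The sketch below uses only the standard library and assumes the eight closed integer classes of width 15 from Step 3; names such as `grouped_mean` are illustrative.

```python
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# Class limits from Step 3: 2-16, 17-31, ..., 107-121.
intervals = [(2 + 15 * i, 16 + 15 * i) for i in range(8)]

# Count the values falling inside each closed interval.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in intervals]
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
products = [f * m for f, m in zip(frequencies, midpoints)]

n = sum(frequencies)  # should equal len(data)
grouped_mean = sum(products) / n

for (lo, hi), f, m, p in zip(intervals, frequencies, midpoints, products):
    print(f"{lo:>3} - {hi:<3} | f={f:>2} | x={m:>5} | f*x={p:>6}")
print(f"n = {n}, sum(f*x) = {sum(products)}, grouped mean = {grouped_mean}")
```

If the total `n` differs from the sample size of 60, at least one value has been counted twice or missed.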
Step 7: Histogram, Bar Plot, and Frequency Distribution Over Time
- Construct a histogram and a bar plot with class intervals on the x-axis and frequencies on the y-axis.
- Each bar height corresponds to the frequency of the class.
- For datasets with a date field, plot the frequency over time as a time series.
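For the exercise data, a histogram with the eight classes might be drawn as follows. This is a sketch assuming Matplotlib is installed; the half-integer bin edges (1.5, 16.5, ..., 121.5) are a choice made here so that each whole-minute class falls inside exactly one bar, and the function name `plot_class_histogram` is hypothetical.

```python
import matplotlib.pyplot as plt

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]


def plot_class_histogram(values, start=2, width=15, k=8):
    """Draw a histogram whose bars match the integer classes of the exercise."""
    # Half-integer edges keep every whole-minute value strictly inside one bin.
    edges = [start - 0.5 + i * width for i in range(k + 1)]
    fig, ax = plt.subplots(figsize=(10, 4))
    counts, _, _ = ax.hist(values, bins=edges, color="turquoise", edgecolor="black")
    # Put one tick at the center of each bar, labeled with the class limits.
    ax.set_xticks([start + i * width + (width - 1) / 2 for i in range(k)])
    ax.set_xticklabels(
        [f"{start + i * width}-{start + (i + 1) * width - 1}" for i in range(k)]
    )
    ax.set_xlabel("Minutes watched (class interval)")
    ax.set_ylabel("Frequency")
    ax.set_title("Cable TV viewing time, 60 users")
    return fig, ax, counts


fig, ax, counts = plot_class_histogram(data)
```

In a Jupyter notebook the figure is rendered at the end of the cell; in a plain script, call `plt.show()` or `fig.savefig(...)` afterwards.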
### Frequency Analysis and Time Series Visualization
This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.
1. Install and Import Libraries
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Load Dataset
# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('path/to/your_dataset.csv', sep=';')
# Select only the "day" column
df1 = df['day']
3. Calculate Frequencies
# Calculate absolute frequency (sorted by count, ascending)
freq_abs = df1.value_counts(ascending=True)
# Calculate relative frequency (normalized, 3 decimal places)
freq_rel = df1.value_counts(normalize=True).round(3)
# Create a DataFrame with both measures
df_freq = pd.DataFrame({
'Absolute Frequency': freq_abs,
'Relative Frequency': freq_rel
})
# Display the frequency table
display(df_freq)
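To try the frequency calculation without a CSV file, the same pattern can be applied to a small in-memory Series. The values below are made up for illustration and only stand in for the `day` column.

```python
import pandas as pd

# Made-up sample values standing in for the "day" column (assumption).
days = pd.Series([5, 12, 5, 30, 5, 12], name="day")

freq_abs = days.value_counts()                          # count per distinct value
freq_rel = days.value_counts(normalize=True).round(3)   # share of the total

# Both Series are indexed by the distinct values, so they align by label.
df_freq = pd.DataFrame({
    "Absolute Frequency": freq_abs,
    "Relative Frequency": freq_rel,
})
print(df_freq)
```

Because the two Series share the same index labels, the DataFrame stays consistent even if they were sorted differently.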
4. Histogram (Dark Theme)
# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
5. Bar Plot (Dark Theme)
# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)
# Customize labels and ticks
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')
# Show plot
plt.show()
6. Time Series Preparation
# Inspect available columns
print(df.columns)
# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()
# Add dummy year (if year column is missing)
df_time_series['year'] = 2022
# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)
# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')
# Set "date" as index
df_time_series = df_time_series.set_index('date')
# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()
# Display first rows
display(daily_counts.head())
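The date-building step can be tried on a tiny in-memory frame before running it on real data. This sketch assumes, as the `%d-%b-%Y` format above implies, that `month` holds three-letter English month abbreviations; the sample rows and the dummy year 2022 are illustrative.

```python
import pandas as pd

# Made-up rows (assumption): day of month plus abbreviated month name.
df = pd.DataFrame({"day": [5, 5, 12, 30], "month": ["Mar", "Mar", "Mar", "Aug"]})

# Build "5-Mar-2022"-style strings, then parse them in one vectorized call.
date_text = df["day"].astype(str) + "-" + df["month"] + "-2022"
dates = pd.to_datetime(date_text, format="%d-%b-%Y")

# Number of records per calendar date, sorted chronologically.
daily_counts = dates.value_counts().sort_index()
print(daily_counts)
```

Passing an explicit `format` makes parsing fail loudly on unexpected values instead of guessing a layout.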
7. Time Series Plot (Dark Theme)
# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')
# Plot time series
plt.plot(daily_counts, color='turquoise')
# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')
# Show plot
plt.show()
Dummy year: 2022 is used when the dataset has no year column.
Visualizations: histogram, bar plot, and time series chart.
III - class_3 - Stats Review
- CASTRO, L. N. Introdução a mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva, 2016.
- PIRIM, H. Recent Applications in Data Clustering. IntechOpen, 2018.
- SEN, J. Machine Learning: Artificial Intelligence. IntechOpen, 2021.
- THOMAS, C. Data Mining. IntechOpen, 2018.
- HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
- NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
- RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
- SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.
Copyright 2025 Quantum Software Development. Code released under the MIT License.