
A data mining course focusing on unsupervised learning methods (clustering, PCA, dictionary learning, anomaly detection) applied to real-world projects for third-sector organizations. Results are shared publicly in open repositories and community platforms.


Quantum-Software-Development/specialized-consulting-data-mining






Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics





Table of Contents


  1. Course Overview
  2. Objectives
  3. Syllabus
  4. Weekly Schedule
  5. Tools and Technologies
  6. Installation and Setup
  7. Assessment
  8. Bibliography
  9. Notes




Course Overview

This course introduces data mining techniques with a focus on unsupervised learning methods, including:

  • Clustering algorithms (K-Means, Affinity Propagation, Mean-Shift)
  • Principal Component Analysis (PCA)
  • Dictionary Learning
  • Novelty and outlier detection

Students will work on practical projects inspired by real-world problem-solving in third-sector organizations. Final deliverables will be shared in open repositories and made available to the broader community, schools, libraries, and non-profits.



Objectives

Enable students to plan, conduct, and complete a research project applying key data mining concepts, algorithms, and methodologies.




Syllabus

  • Fundamentals of Data Mining
  • Data cleaning and preparation
  • Predictive analysis
  • Clustering methods (K-Means, Affinity Propagation, Mean-Shift)
  • Principal Component Analysis (PCA)
  • Dictionary Learning
  • Novelty and outlier detection
  • Application of concepts to real-world consulting scenarios
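As a rough illustration of how several syllabus topics fit together, the sketch below reduces synthetic data with PCA and then runs the three clustering algorithms listed above using scikit-learn. The dataset, the parameter values (such as `n_clusters=3`), and the variable names are assumptions chosen for illustration only, not course material.

```python
from sklearn.cluster import AffinityPropagation, KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data: 300 points in 5 dimensions around 3 centers (illustrative only)
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# Reduce to 2 principal components before clustering
X_2d = PCA(n_components=2).fit_transform(X)

# K-Means needs the number of clusters up front; the other two estimate it
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Affinity Propagation": AffinityPropagation(random_state=0),
    "Mean-Shift": MeanShift(),
}
for name, model in models.items():
    labels = model.fit_predict(X_2d)
    print(f"{name}: {len(set(labels))} clusters found")
```

The number of clusters reported by Affinity Propagation and Mean-Shift depends on their default parameters and may differ from the three centers used to generate the data.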




Weekly Schedule

| Week | Topic | Methodology | Tools |
|---|---|---|---|
| 1 | Course introduction | Active methodology | |
| 2–3 | Review of statistical methods | Active methodology | Python |
| 4 | Fundamentals of Data Mining | Active methodology | Python |
| 5–6 | Data cleaning and preparation | Active methodology | Python |
| 7 | Predictive analysis | Active methodology | Python |
| 8, 10 | Clustering techniques | Active methodology | Python |
| 9 | P1 Exam | Written (Individual) | |
| 11 | K-Means algorithm | Active methodology | Python |
| 12 | Affinity Propagation | Active methodology | Python |
| 13 | Mean-Shift algorithm | Active methodology | Python |
| 14 | Principal Component Analysis (PCA) | Active methodology | Python |
| 15 | Dictionary Learning | Active methodology | Python |
| 16 | P2 Exam | Written (Individual) | |
| 17 | P3 Exam & Grade Closure | Written (Individual) | |
| 18 | Final grade submission | | |




Tools and Technologies

  • Programming Language: Python
  • Libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn
  • Environment: Jupyter Notebook or other Python IDEs



Installation and Setup


Follow these steps to set up your local environment for the course projects:


1. Clone the repository

git clone https://github.com/<username>/<repository-name>.git
cd <repository-name>

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate   # Mac/Linux
venv\Scripts\activate      # Windows

3. Install dependencies

Make sure pip is up to date:


pip install --upgrade pip

Then install the required packages:


pip install -r requirements.txt

(If requirements.txt is not provided, install manually:)


pip install numpy pandas scikit-learn matplotlib seaborn jupyter

4. Run Jupyter Notebook

jupyter notebook

5. Open course notebooks and start practicing.
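Before opening the notebooks, it can help to confirm that the libraries from step 3 are importable in the active environment. The following is a minimal sketch, not part of the original course material; the helper name `check_packages` is hypothetical, and the import names are the usual ones for these packages (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the libraries installed in step 3
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]

def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:12s} {'OK' if found else 'MISSING'}")
```

Any package reported as missing can be installed with `pip install` inside the activated virtual environment.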




Assessment

| Exam | Date (DD/MM/YYYY) | Format | Weight |
|---|---|---|---|
| P1 | 01/10/2025 | Written – Individual | Arithmetic mean |
| P2 | 19/11/2025 | Written – Individual | Arithmetic mean |
| P3 | Substitution exam | Written – Individual | Replaces lowest score |

Final Grade: Arithmetic mean of assessments.
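The grading rule can be sketched in a few lines. This assumes the final grade is the arithmetic mean of P1 and P2, and that P3, when taken, replaces the lower of the two only if it is higher. The source does not state that last condition, and the function name `final_grade` is hypothetical.

```python
from statistics import mean
from typing import Optional

def final_grade(p1: float, p2: float, p3: Optional[float] = None) -> float:
    """Arithmetic mean of P1 and P2; an optional P3 replaces the lowest score.

    Assumption: P3 only replaces the lowest score when it is higher.
    """
    scores = [p1, p2]
    if p3 is not None and p3 > min(scores):
        scores[scores.index(min(scores))] = p3
    return mean(scores)

print(final_grade(6.0, 8.0))       # mean of P1 and P2
print(final_grade(4.0, 8.0, 7.0))  # P3 replaces the 4.0
```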







The following sample lists the number of minutes that 60 cable TV users watched content from their package in the last two hours. Construct a frequency distribution with 8 classes and build a histogram.


Data:

20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38



Step 1: Determine Range and Number of Classes

  • Minimum value: 2
  • Maximum value: 120
  • Number of classes ($k$): 8 (given)



Step 2: Calculate Class Width



$$ \huge w = \left\lceil \frac{\text{max} - \text{min}}{k} \right\rceil = \left\lceil \frac{120 - 2}{8} \right\rceil = 15 $$
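The same calculation can be checked in Python. This is a small sketch using only the standard library; the function name `class_width` is hypothetical.

```python
import math

def class_width(minimum: int, maximum: int, k: int) -> int:
    """Class width rounded up to the next integer: ceil((max - min) / k)."""
    return math.ceil((maximum - minimum) / k)

w = class_width(2, 120, 8)
print(w)  # 118 / 8 = 14.75, rounded up to 15
```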



Step 3: Construct Class Intervals (from minimum value)

| Class Interval | Explanation |
|---|---|
| 2 - 16 | Starts from minimum 2 |
| 17 - 31 | 16 + 1 to 31 |
| 32 - 46 | Next range |
| 47 - 61 | Next range |
| 62 - 76 | Next range |
| 77 - 91 | Next range |
| 92 - 106 | Next range |
| 107 - 121 | Covers maximum 120 |
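The intervals follow a simple pattern: each class starts one unit above the previous upper limit and spans 15 integer values. A minimal sketch that generates them, with the hypothetical helper name `class_intervals`:

```python
def class_intervals(minimum: int, width: int, k: int):
    """Return k inclusive integer intervals (lower, upper) of the given width."""
    return [(minimum + i * width, minimum + (i + 1) * width - 1) for i in range(k)]

for lower, upper in class_intervals(2, 15, 8):
    print(f"{lower} - {upper}")
```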

Step 4: Frequency Distribution Table

| Class Interval | Frequency |
|---|---|
| 2 - 16 | 5 |
| 17 - 31 | 14 |
| 32 - 46 | 8 |
| 47 - 61 | 11 |
| 62 - 76 | 4 |
| 77 - 91 | 7 |
| 92 - 106 | 6 |
| 107 - 121 | 5 |

The frequencies add up to 60, the sample size.

Step 5: Calculate Midpoints for Each Class

$$ \Huge x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2} $$

| Class Interval | Midpoint ($x_i$) |
|---|---|
| 2 - 16 | 9 |
| 17 - 31 | 24 |
| 32 - 46 | 39 |
| 47 - 61 | 54 |
| 62 - 76 | 69 |
| 77 - 91 | 84 |
| 92 - 106 | 99 |
| 107 - 121 | 114 |

Step 6: Calculate Mean Using Frequency and Midpoints

The mean ($\bar{x}$) is calculated by:

$$ \Huge \bar{x} = \frac{\sum f_i x_i}{\sum f_i} $$

Where: $f_i$ = frequency, $x_i$ = midpoint.

| Class Interval | $f_i$ | $x_i$ | $f_i \times x_i$ |
|---|---|---|---|
| 2 - 16 | 5 | 9 | 45 |
| 17 - 31 | 14 | 24 | 336 |
| 32 - 46 | 8 | 39 | 312 |
| 47 - 61 | 11 | 54 | 594 |
| 62 - 76 | 4 | 69 | 276 |
| 77 - 91 | 7 | 84 | 588 |
| 92 - 106 | 6 | 99 | 594 |
| 107 - 121 | 5 | 114 | 570 |

Sum of frequencies: $5 + 14 + 8 + 11 + 4 + 7 + 6 + 5 = 60$

Sum of products: $45 + 336 + 312 + 594 + 276 + 588 + 594 + 570 = 3315$

$$ \huge \bar{x} = \frac{3315}{60} = 55.25 $$
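The frequency table and the grouped mean can be cross-checked directly against the raw data. The sketch below uses only the standard library and the class limits from Step 3; the helper names `grouped_table` and `grouped_mean` are hypothetical. A useful sanity check is that the frequencies add up to the sample size of 60.

```python
data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

def grouped_table(values, minimum, width, k):
    """Return rows (lower, upper, frequency, midpoint) for k inclusive classes."""
    rows = []
    for i in range(k):
        lower = minimum + i * width
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        rows.append((lower, upper, freq, (lower + upper) / 2))
    return rows

def grouped_mean(rows):
    """Mean estimated from class midpoints: sum(f_i * x_i) / sum(f_i)."""
    total = sum(freq for _, _, freq, _ in rows)
    return sum(freq * mid for _, _, freq, mid in rows) / total

rows = grouped_table(data, minimum=2, width=15, k=8)
for lower, upper, freq, mid in rows:
    print(f"{lower:>3} - {upper:<3}  f={freq:<2}  x={mid:g}")
print("n =", sum(r[2] for r in rows))
print("grouped mean =", grouped_mean(rows))
```

The grouped mean is an approximation, since every observation is treated as if it sat at its class midpoint; it will usually differ slightly from the mean of the raw values.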



Step 7: Histogram, Bar Plot, and Time Series (Frequency Distribution Over Time)

  • Construct a histogram and a bar plot with class intervals on the x-axis and frequencies on the y-axis, plus a time series plot with dates on the x-axis.
  • Each bar height corresponds to the frequency of its class.
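For the cable TV sample specifically, a histogram with the eight classes from Step 3 might be drawn as follows. This is a sketch using Matplotlib only; the bin edges are placed halfway between consecutive integer classes (1.5, 16.5, ..., 121.5) so that each integer value falls into its intended class, and the axis labels and title are assumptions.

```python
import matplotlib.pyplot as plt

data = [
    20, 55, 5, 64, 78, 49, 91, 87, 18, 83, 33, 39, 30, 31, 59, 85, 102, 24, 27, 28,
    92, 108, 98, 67, 85, 109, 48, 19, 32, 69, 24, 59, 6, 49, 116, 37, 92, 43, 101, 60,
    55, 107, 25, 33, 57, 25, 17, 49, 24, 101, 14, 45, 73, 120, 91, 2, 11, 47, 21, 38,
]

# Nine edges define eight classes of width 15, starting just below the minimum
edges = [1.5 + 15 * i for i in range(9)]

fig, ax = plt.subplots(figsize=(8, 4))
counts, _, _ = ax.hist(data, bins=edges, edgecolor="black")

# Label each bar with its inclusive class interval, e.g. "2-16"
centers = [(edges[i] + edges[i + 1]) / 2 for i in range(8)]
labels = [f"{int(edges[i] + 0.5)}-{int(edges[i + 1] - 0.5)}" for i in range(8)]
ax.set_xticks(centers)
ax.set_xticklabels(labels)
ax.set_xlabel("Minutes watched (class interval)")
ax.set_ylabel("Frequency")
ax.set_title("Cable TV viewing time, 8 classes")
plt.show()
```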




### Frequency Analysis and Time Series Visualization

This notebook demonstrates how to perform frequency analysis on a CSV dataset, visualize results with histograms and bar plots, and create a time series chart using Python.


1. Install and Import Libraries

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built into Jupyter; imported explicitly for clarity

2. Load Dataset

# Load CSV file (semicolon-separated); replace the placeholder with the path to your dataset
df = pd.read_csv('your_dataset.csv', sep=';')

# Select only the "day" column
df1 = df['day']

3. Calculate Frequencies

# Calculate absolute frequency, ordered by day value
freq_abs = df1.value_counts().sort_index()

# Calculate relative frequency (proportion of rows, 3 decimal places)
freq_rel = df1.value_counts(normalize=True).sort_index().round(3)

# Create a DataFrame with both measures
df_freq = pd.DataFrame({
    'Absolute Frequency': freq_abs,
    'Relative Frequency': freq_rel
})

# Display the frequency table
display(df_freq)

4. Histogram (Dark Theme)

# Create figure and axes with dark background
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 4))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot histogram
sns.histplot(df1, color='turquoise', ax=ax)

# Customize labels and ticks (white text stays readable on the black background)
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()


5. Bar Plot (Dark Theme)

# Create figure and axes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(10, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Bar plot of absolute frequency
df_freq['Absolute Frequency'].plot(kind='bar', color="turquoise", ax=ax)

# Customize labels and ticks (white text stays readable on the black background)
plt.xlabel("Values", color='white')
plt.ylabel("Frequency", color='white')
plt.title("Frequency Distribution", color='white')
plt.xticks(rotation=0, color='white')
plt.yticks(color='white')

# Show plot
plt.show()


6. Time Series Preparation

# Inspect available columns
print(df.columns)

# Create a new DataFrame for time series analysis
df_time_series = df[['day', 'month']].copy()

# Add dummy year (if year column is missing)
df_time_series['year'] = 2022

# Convert to strings for concatenation
df_time_series['day'] = df_time_series['day'].astype(str)
df_time_series['year'] = df_time_series['year'].astype(str)

# Create "date" column in dd-MMM-yyyy format
df_time_series['date'] = df_time_series['day'] + '-' + df_time_series['month'] + '-' + df_time_series['year']
df_time_series['date'] = pd.to_datetime(df_time_series['date'], format='%d-%b-%Y')

# Set "date" as index
df_time_series = df_time_series.set_index('date')

# Count occurrences per day
daily_counts = df_time_series.groupby(df_time_series.index).size()

# Display first rows
display(daily_counts.head())

7. Time Series Plot (Dark Theme)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('black')
ax.set_facecolor('black')

# Plot time series
plt.plot(daily_counts, color='turquoise')

# Customize labels and ticks
plt.title("Frequency Distribution Over Time", color='white')
plt.xlabel("Date", color='white')
plt.ylabel("Frequency", color='white')
plt.tick_params(axis='x', colors='white')
plt.tick_params(axis='y', colors='white')

# Show plot
plt.show()


Dummy year: 2022 is used as a placeholder when the dataset has no year column.

Visualizations: histogram, bar plot, and time series chart.
















Bibliography

  • CASTRO, L. N. Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva, 2016.
  • PIRIM, H. Recent Applications in Data Clustering. IntechOpen, 2018.
  • SEN, J. Machine Learning: Artificial Intelligence. IntechOpen, 2021.

  • THOMAS, C. Data Mining. IntechOpen, 2018.
  • HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
  • NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
  • RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
  • SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.






Copyright 2025 Quantum Software Development. Code released under the MIT License.
