Class 2 and 3: Introduction to Data Mining with Python and Stats Review

Specialized Consulting for Integrated Project: Data Mining - Full Repository Access

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)

⚠️ Important Notes

Whenever possible, projects and deliverables developed during the course will be made publicly accessible.
The course emphasizes practical, hands-on experience with real datasets to emulate professional consulting scenarios in the field of Data Mining.
All activities and materials will strictly adhere to the academic and ethical guidelines of PUC-SP. Any content not authorized for public disclosure will remain confidential and stored in private repositories.

Overview

This repository contains materials and examples for Class 1 of the Introduction to Data Mining with Python course. It focuses on the fundamental statistical concepts and data analysis techniques that data mining applications rely on.

Repository Structure

├── data/                 # Sample datasets
├── notebooks/           # Jupyter notebooks with examples
├── scripts/             # Python scripts for analysis
├── images/              # Generated plots and visualizations
└── docs/                # Additional documentation

Getting Started

Prerequisites:

Python 3.7+
Required libraries: pandas, numpy, matplotlib, seaborn, scikit-learn

Installation:

pip install pandas numpy matplotlib seaborn scikit-learn jupyter

Quick Start:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load sample data
data = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
        41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
        18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=7, edgecolor='black')
plt.title('Internet Usage Distribution')
plt.xlabel('Minutes Online')
plt.ylabel('Frequency')
plt.show()

# Calculate statistics
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
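# np.std defaults to the population formula (ddof=0); pass ddof=1 for the sample standard deviation
print(f"Standard Deviation: {np.std(data):.2f}")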

Key Learning Outcomes

After completing this course, students will be able to:

Construct and interpret frequency distributions from raw data
Create various types of histograms and understand their relationship to frequency distributions
Identify and handle outliers in datasets
Analyze distribution shapes and their implications
Calculate and interpret central tendency measures
Apply statistical concepts to data mining problems
Use Python tools for statistical analysis and visualization

Important Notes

Outliers require careful consideration - they may represent valuable insights or data quality issues
Histogram bins should be chosen thoughtfully - too few may hide patterns, too many may create noise
Frequency distributions are fundamental to understanding data structure before applying advanced data mining techniques
Visual analysis complements numerical statistics for comprehensive data understanding
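
One common starting point for choosing the number of bins is Sturges' rule, which the course material itself does not name; it is shown here only as a hedged rule of thumb. The helper name `sturges_bins` is hypothetical.

```python
import math


def sturges_bins(n):
    """Suggested number of classes by Sturges' rule: ceil(1 + log2(n))."""
    return math.ceil(1 + math.log2(n))


# The Quick Start sample has 50 observations.
print(sturges_bins(50))
```

For 50 observations the rule suggests 7 classes, which matches the `bins=7` used in the Quick Start histogram. Treat the result as a first guess and adjust it after looking at the plot.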

This material is part of the Introduction to Data Mining with Python course, focusing on fundamental statistical concepts essential for effective data analysis and mining.

Class_1 Content

Syllabus (Ementa)

Descriptive Statistics Review
Data Mining Concepts
Exploratory Data Analysis
Predictive Analysis
Clustering
Association Rules

Assessment Criteria

Minimum 75% attendance required
Final grade ≥ 5.0
Formula: MF = (N₁ + N₂)/2, where Nᵢ = (Pᵢ + Aᵢ)/2
- Pᵢ = Project grade for semester i
- Aᵢ = Activity/exam grade for semester i
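
As a worked illustration of the formula above, the sketch below computes MF from made-up grades. The function names and the sample values are assumptions for illustration only, not official course tooling, and the attendance requirement is checked separately.

```python
def period_grade(project, activity):
    """N_i = (P_i + A_i) / 2 for one grading period."""
    return (project + activity) / 2


def final_grade(p1, a1, p2, a2):
    """MF = (N_1 + N_2) / 2."""
    n1 = period_grade(p1, a1)
    n2 = period_grade(p2, a2)
    return (n1 + n2) / 2


# Hypothetical grades: N_1 = (7 + 5) / 2 = 6.0, N_2 = (6 + 4) / 2 = 5.0
mf = final_grade(p1=7.0, a1=5.0, p2=6.0, a2=4.0)
print(f"MF = {mf:.1f}; meets the grade threshold: {mf >= 5.0}")
```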

Key Topics Covered

1. Frequency Distribution

A frequency distribution is a table that shows classes or intervals of data with a count of the number of entries in each class. It's fundamental for understanding data patterns and is the foundation for creating histograms.

Components:

Class limits: Lower and upper boundaries of each class
Class size: The width of each class interval
Frequency (f): Number of data entries in each class
Relative frequency: Proportion of data in each class (f/n)
Cumulative frequency: Sum of frequencies up to a given class

Construction Steps:

Decide the number of classes (typically 5-20)
Calculate class size: (max - min) / number of classes, rounded up to the next convenient number
Determine class limits
Count frequencies for each class
Calculate additional measures (relative, cumulative frequencies)
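
The construction steps above can be sketched in plain Python. This is a minimal sketch that assumes integer data, so each class uses inclusive integer limits; the function name `build_frequency_table` is hypothetical, and the data is the Quick Start sample.

```python
# Sample data from the Quick Start section (minutes online, integer values)
data = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
        41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
        18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]


def build_frequency_table(values, num_classes):
    """Return one (lower, upper, f, relative, cumulative) row per class.

    Assumes integer data, so classes use inclusive integer limits.
    """
    lowest, highest = min(values), max(values)
    # Class size: range / number of classes, rounded up to the next whole
    # number (an exact quotient is also bumped up so the maximum still fits).
    class_size = (highest - lowest) // num_classes + 1
    n = len(values)
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = lowest + i * class_size
        upper = lower + class_size - 1  # inclusive upper class limit
        f = sum(1 for x in values if lower <= x <= upper)
        cumulative += f
        rows.append((lower, upper, f, f / n, cumulative))
    return rows


table = build_frequency_table(data, num_classes=7)
for lower, upper, f, rel, cum in table:
    print(f"{lower:>3}-{upper:<3} f={f:>2}  rel={rel:.2f}  cum={cum:>2}")
```

With seven classes, the sample yields a class size of 12, classes from 7-18 up to 79-90, and frequencies 6, 10, 13, 8, 5, 6, 2.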

2. Histograms and Their Relationship to Frequency Distributions

Histograms are directly related to frequency distributions: they are the graphical representation of frequency distribution tables.

Key Characteristics:

Bar chart representing frequency distribution
Horizontal axis: Quantitative data values (class boundaries)
Vertical axis: Frequencies or relative frequencies
Consecutive bars must touch (unlike regular bar charts)
Class boundaries: Numbers that separate classes without gaps

Types of Histograms:

Frequency Histogram: Shows absolute frequencies
Relative Frequency Histogram: Shows proportions/percentages
Frequency Polygon: Line graph emphasizing continuous change
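
The three variants differ only in what is plotted against the classes. The sketch below derives the values behind a relative frequency histogram and a frequency polygon. It assumes the class limits and frequencies obtained from the Quick Start sample with seven classes of width 12; all function names are hypothetical. The plotting helper imports matplotlib lazily, so the numeric helpers run without it.

```python
def relative_frequencies(frequencies):
    """Proportion of observations in each class (f / n)."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


def class_midpoints(lower_limits, upper_limits):
    """Midpoint of each class, used as x-coordinates of a frequency polygon."""
    return [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]


def plot_frequency_polygon(midpoints, frequencies):
    """Draw the polygon, closed with a zero-frequency point on each side."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it
    width = midpoints[1] - midpoints[0]
    xs = [midpoints[0] - width] + list(midpoints) + [midpoints[-1] + width]
    ys = [0] + list(frequencies) + [0]
    plt.plot(xs, ys, marker="o")
    plt.title("Frequency Polygon")
    plt.xlabel("Class midpoint")
    plt.ylabel("Frequency")
    plt.show()


lower = [7, 19, 31, 43, 55, 67, 79]
upper = [18, 30, 42, 54, 66, 78, 90]
freqs = [6, 10, 13, 8, 5, 6, 2]

rel = relative_frequencies(freqs)
mids = class_midpoints(lower, upper)
print(mids)
print([round(r, 2) for r in rel])
# plot_frequency_polygon(mids, freqs)  # uncomment to draw the polygon
```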

3. Outliers in Histograms

Outliers are values that lie far from the bulk of the data. By definition they are few in number, and they can arise from several causes:

What Outliers May Indicate:

Data entry errors (typing mistakes)
Measurement errors
Fraudulent activities
Genuine extreme values
Equipment malfunctions

Impact on Histograms:

Produce sparse, isolated bars with very low frequencies
Create gaps in the distribution
Skew the overall pattern
Affect central tendency measures
May require special handling in analysis

Outlier Detection in Histograms:

Visible as isolated bars far from main distribution
Large gaps between bars
Extremely tall or short bars at distribution extremes
Asymmetric patterns in otherwise normal distributions

4. Distribution Shapes

Understanding distribution shapes helps identify data characteristics:

Symmetric Distribution:

Mean ≈ Median ≈ Mode
Bell-shaped or uniform patterns
Equal spread on both sides

Left-Skewed (Negatively Skewed):

Typically Mean < Median < Mode
Tail extends to the left
Few extremely low values

Right-Skewed (Positively Skewed):

Typically Mode < Median < Mean
Tail extends to the right
Few extremely high values

Uniform Distribution:

All classes have equal frequencies
Rectangular shape in histogram
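
The mean-versus-median comparisons above can be turned into a rough numerical check. The sketch below uses Pearson's second skewness coefficient, 3 × (mean - median) / standard deviation. The `tolerance` threshold and the function name are assumptions, and the result is a heuristic that should be confirmed against the histogram; it cannot tell a uniform distribution from a bell-shaped one.

```python
import statistics


def describe_shape(values, tolerance=0.1):
    """Label a sample by the sign of 3 * (mean - median) / std."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return "constant"
    skew = 3 * (mean - median) / spread
    if skew > tolerance:
        return "right-skewed"
    if skew < -tolerance:
        return "left-skewed"
    return "approximately symmetric"


print(describe_shape([1, 2, 2, 3, 3, 3, 4, 20]))         # long tail to the right
print(describe_shape([1, 17, 18, 18, 18, 19, 19, 20]))   # long tail to the left
print(describe_shape([1, 2, 3, 4, 5]))                   # mean equals median
```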

5. Central Tendency Measures

Mean (μ or x̄):

Sum of all values divided by count
Most affected by outliers
Uses all data points

Median:

Middle value when data is ordered
Less affected by outliers
Robust measure

Mode:

Most frequently occurring value
May not exist or may be multiple
Good for categorical data
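
A small experiment makes the effect of outliers on these measures concrete. The values below are made up for illustration, and `modes` is a hypothetical helper that returns every most-frequent value, or an empty list when nothing repeats.

```python
import statistics
from collections import Counter


def modes(values):
    """All most-frequent values; empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


values = [20, 22, 22, 23, 25, 26, 27]
with_outlier = values + [200]  # one extreme value added

for label, sample in (("without outlier", values), ("with outlier", with_outlier)):
    print(f"{label}: mean={statistics.mean(sample):.2f}, "
          f"median={statistics.median(sample)}, modes={modes(sample)}")
```

Adding the single value 200 moves the mean from about 23.57 to 45.63, while the median only shifts from 23 to 24 and the mode stays at 22.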

6. Practical Applications

Data Mining Context:

Pattern Recognition: Identifying data distributions
Anomaly Detection: Finding outliers
Data Quality Assessment: Checking for errors
Feature Engineering: Understanding variable distributions
Model Selection: Choosing appropriate algorithms based on data distribution

Python Implementation Examples:

import matplotlib.pyplot as plt
import numpy as np

# Create frequency distribution
def create_frequency_distribution(data, num_classes=7):
    min_val, max_val = min(data), max(data)
    class_size = (max_val - min_val) / num_classes

    # Define class boundaries
    boundaries = [min_val + i * class_size for i in range(num_classes + 1)]

    # Count frequencies. Classes are half-open [lower, upper), except the
    # last one, which also includes the maximum so no value is dropped.
    frequencies = []
    for i in range(num_classes):
        if i == num_classes - 1:
            count = sum(1 for x in data if x >= boundaries[i])
        else:
            count = sum(1 for x in data if boundaries[i] <= x < boundaries[i+1])
        frequencies.append(count)

    return boundaries, frequencies

# Create histogram
def plot_histogram(data, title="Frequency Distribution", bins=7):
    plt.figure(figsize=(10, 6))
    plt.hist(data, bins=bins, edgecolor='black', alpha=0.7)
    plt.title(title)
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()

Bibliography

Primary References:

Castro, L. N. & Ferrari, D. G. (2016). Introdução à mineração de dados: conceitos básicos, algoritmos e aplicações. Saraiva.
Ferreira, A. C. P. L. et al. (2024). Inteligência Artificial - Uma Abordagem de Aprendizado de Máquina. 2nd Ed. LTC.
Larson, R. & Farber, B. (2015). Estatística Aplicada. Pearson.

💌 Let the data flow... Ping Me !

🛸๋ My Contacts Hub

────────────── 🔭⋆ ──────────────


Quantum-Software-Development/class_2-and-3-data-mining-python

Copyright 2025 Quantum Software Development. Code released under the MIT License.
