Skip to content

🐙 Lung cancer prediction with logistic regression using clinical features and feature selection, with robust evaluation metrics. Code and results live in the analysis notebook.

License

Notifications You must be signed in to change notification settings

Agasthi1212/lung-cancer-prediction-logistic-regression

Repository files navigation

https://github.com/Agasthi1212/lung-cancer-prediction-logistic-regression/releases

Release Assets

Lung Cancer Risk Prediction with Logistic Regression: Clinico-Lifestyle Insights

Lung cancer visualization banner

A practical project that shows how to predict lung cancer risk using logistic regression. The work combines clinical symptoms and lifestyle factors to build a robust classifier. It covers data cleaning, exploratory data analysis, feature selection, and model evaluation. All code and results live in an analysis notebook. The repo focuses on classification, data visualization, exploratory data analysis, healthcare, jupyter-notebook, logistic-regression, lung-cancer, machine-learning, python, and scikit-learn.

This README describes how the project is organized, how to reproduce the work, and how to extend it. It walks through data preparation, model design, and model validation. It also explains how to interpret the results and how to use the notebook to reproduce the experiments. The content is written so a data scientist can reuse the approach on similar health datasets.

If you want to inspect or reuse the release artifacts, visit the Releases page. The link is provided above and linked again in the badge. The analysis notebook is the central artifact for code, figures, and results.

Table of contents

  • Why this project exists
  • What you’ll find in the repository
  • How to reproduce the results
  • Data description and features
  • Data cleaning and preprocessing
  • Exploratory data analysis (EDA)
  • Feature selection strategies
  • Modeling approach
  • Evaluation metrics and interpretation
  • Reproducibility and environment
  • Repository structure
  • How to contribute
  • Ethics, fairness, and limitations
  • Roadmap and future work
  • References and further reading
  • Frequently asked questions
  • Appendix

Why this project exists Lung cancer remains a major health challenge. Early risk assessment can help guide screening decisions and lifestyle interventions. This project demonstrates a straightforward approach to estimating lung cancer risk using logistic regression. It combines clinical symptoms, lifestyle factors, and basic demographics to produce a practical classifier. The goal is not to replace medical advice but to provide a transparent, data-driven method to support decision making in healthcare research and educational settings.

What you’ll find in the repository

  • A single, comprehensive analysis notebook that contains all steps from data cleaning to evaluation. The notebook includes code cells, narrative explanations, figures, and results. It serves as the primary source of analysis and can be re-run with a fresh dataset.
  • Clear documentation on the data features used. You will read about clinical symptoms such as persistent cough, chest pain, shortness of breath, hemoptysis, and systemic factors like age, smoking history, occupational exposures, and other lifestyle attributes.
  • Visualization outputs that help you understand data distribution, relationships between features, and model performance. Visuals include histograms, bar charts, correlation heatmaps, ROC curves, and calibration plots.
  • A practical guide to feature selection, regularization choices, and model interpretation. The notebook demonstrates how to reduce feature space while preserving predictive power.
  • Reproducible environment instructions, including library versions and commands to install dependencies.

How to reproduce the results

  • Start by inspecting the release assets. The Releases page contains the analysis notebook and any produced figures or supplementary data. Use the link provided above to access the assets.
  • Install the required software. The notebook relies on Python and standard data science libraries. A typical setup includes:
    • Python 3.8 or newer
    • numpy
    • pandas
    • scikit-learn
    • matplotlib
    • seaborn
    • jupyter
    • optional: seaborn and plotly for enhanced visuals
  • Create an isolated environment to avoid conflicts. For example:
    • Using conda: conda create -n lung-cancer-prediction python=3.9
    • Then: conda activate lung-cancer-prediction
    • Or using venv: python -m venv venv; source venv/bin/activate
  • Install dependencies:
    • If a requirements file is provided: pip install -r requirements.txt
    • If not, install the core packages individually:
      • pip install numpy pandas scikit-learn matplotlib seaborn jupyter
  • Run the analysis notebook:
    • Start Jupyter: jupyter notebook
    • Open the notebook file in the analysis directory and run all cells in order
  • Reproducing results requires the same data and notebook. The notebook contains steps to load data, preprocess it, fit the model, and generate results. If you have a new dataset, you can adapt the notebook by adjusting the data loading cell and feature preprocessing steps.

Data description and features

  • The dataset centers on predicting lung cancer risk. It blends clinical symptoms with lifestyle factors to reflect the real-world decision context for screening and diagnosis.
  • Common features you will encounter include:
    • Demographic: age, sex, BMI, ethnicity (if available)
    • Smoking metrics: smoking status (current, former, never), pack-years, age at smoking initiation, cessation duration
    • Symptoms and medical history: persistent cough, coughing up blood, chest pain, shortness of breath, wheezing, recent weight loss, fatigue
    • Exposures and risk modifiers: occupational exposure to carcinogens, air pollution, family history of cancer, prior respiratory illness
    • Lab or imaging proxies (when available): basic bloodwork indicators or imaging-based observations as features or derived features
  • The notebook defines a consistent feature schema and includes feature engineering steps to transform categorical variables into numerical representations suitable for logistic regression.

Data cleaning and preprocessing

  • Data quality is critical for reliable predictions. The notebook implements a straightforward, transparent pipeline:
    • Handling missing values: choose simple, interpretable imputation strategies (for example, filling missing binary indicators with the mode, numeric features with median, or using a simple imputation strategy informed by domain knowledge).
    • Encoding categorical variables: convert categories to numeric codes. Use one-hot encoding only when necessary to preserve interpretability.
    • Feature scaling: standardize numeric features to ensure the logistic regression model converges smoothly and coefficients are comparable.
    • Outlier handling: assess extreme values, cap or transform outliers if needed, and document any decisions.
    • Data splitting: partition data into train and test sets with a stratified split to preserve class distribution. Consider a hold-out validation set if your dataset is large enough.
  • The notebook emphasizes reproducibility and traceability. Each preprocessing step is documented with rationale and a clear code snippet. You will find the exact transformation rules and how they apply to both training and test data.

Exploratory data analysis (EDA)

  • EDA is used to build intuition and to justify modeling choices. The notebook includes:
    • Distribution checks: how common each symptom is, how lifestyle factors vary with age, and how risk correlates with smoking intensity
    • Correlation exploration: simple correlations between features and the outcome, as well as pairwise feature relationships that could inform interactions
    • Visual storytelling: charts that help stakeholders understand the data, such as:
      • Bar charts showing the prevalence of risk factors
      • Box plots for continuous features across risk groups
      • Heatmaps to examine relationships among features
    • Initial insights about potential predictive features and their clinical relevance
  • The EDA step is designed to be reproducible and easy to extend with additional plots or datasets.

Feature selection strategies

  • The notebook demonstrates practical feature selection for logistic regression:
    • Regularization-based selection: L1 (lasso) and L2 (ridge) penalties to control model complexity and encourage sparse coefficients
    • Recursive feature elimination (RFE): iteratively removes the least important features based on model coefficients
    • Univariate screening: quick checks for features with strong individual associations with the target
    • Stability checks: across multiple train/test splits to identify features that consistently contribute to predictive performance
  • The goal is to balance interpretability with accuracy. A lean feature set makes the model easier to interpret by clinicians and easier to deploy in practice.

Modeling approach

  • Logistic regression serves as a transparent baseline model that yields interpretable coefficients. The notebook explores:
    • Binary outcome: cancer risk (yes/no) based on available labels
    • Regularization choices: L1 and L2 to manage multicollinearity and feature sparsity
    • Class balance handling: techniques such as class weights if the dataset is imbalanced
    • Cross-validation: to obtain robust estimates of model performance and to tune hyperparameters
    • Calibration: evaluation of probability estimates to ensure that predicted risks align with observed frequencies
  • Additional commentary covers when to prefer other models (for example, tree-based models) and how logistic regression remains a strong baseline due to interpretability and stability.

Evaluation metrics and interpretation

  • The notebook presents a thorough evaluation framework:
    • Discrimination metrics: ROC AUC, precision-recall AUC, accuracy, recall, precision
    • Confusion matrix: true positives, false positives, true negatives, false negatives
    • Threshold analysis: selecting decision thresholds that balance sensitivity and specificity according to clinical needs
    • Calibration assessment: reliability diagrams and Brier score to evaluate the quality of probability estimates
    • Feature coefficients: left-to-right interpretation of how each feature affects the log-odds of cancer
  • The evaluation is designed to help readers understand model behavior in real-world scenarios, including trade-offs when adjusting thresholds.

Interpretation and clinical relevance

  • The model is designed to be transparent. Coefficients map to increases or decreases in risk, making it easy to explain to clinicians.
  • The notebook discusses how to present results to a healthcare audience. It emphasizes:
    • Clear explanations of what features mean in plain language
    • The limitations of a statistical model in the context of medical decision making
    • How to use predicted risk in conjunction with clinical judgment and patient preferences
  • The project prioritizes responsible use. It highlights that predictions should support, not replace, medical expertise.

Reproducibility and environment

  • Reproducibility is central. The notebook includes:
    • Exact library versions used for each step
    • Clear environment setup instructions
    • Instructions to reproduce figures and metrics from the same data
  • If you plan to reuse this work with a different dataset, you can adapt the notebook by:
    • Replacing the data loading cell with your dataset
    • Ensuring your feature engineering steps align with the new data
    • Re-evaluating model performance with your data

Repository structure

  • analysis/ folder: contains the primary notebook with end-to-end steps: data cleaning, EDA, feature selection, modeling, evaluation, and results. It is the central artifact for code and results.
  • data/ folder: (if provided) includes raw and processed datasets. If this repository does not include raw data due to privacy concerns, the notebook documents how to handle data ingestion from trusted sources.
  • figures/ folder: stores generated plots and visualizations used in the notebook and this README.
  • docs/ folder: supplementary explanations, references, and methodological notes.
  • notebooks/ folder: supplementary notebooks that demonstrate specific steps or experiments.
  • requirements.txt or environment.yml: lists dependencies to reproduce the environment.
  • LICENSE: if present, states the usage terms for the work.
  • README.md: this file, which documents the project and how to use it.

How to contribute

  • If you want to improve the model, you can:
    • Add new features that reflect clinical knowledge and patient lifestyle
    • Experiment with alternative models while keeping the focus on interpretability
    • Extend the EDA with more robust visualizations and summaries
    • Improve documentation and comments in the notebook to aid future readers
  • To contribute, follow these steps:
    • Fork the repository
    • Create a feature branch with a concise name
    • Implement your changes in the analysis notebook or add supporting scripts
    • Run the notebook locally to verify results
    • Submit a pull request with a clear description of the changes and the motivation
  • The project welcomes collaboration from data scientists, clinicians, and students who want to learn about the application of logistic regression to health data.

Ethics, fairness, and limitations

  • The project discusses ethical considerations relevant to health data:
    • Privacy: handle any patient data with care and follow applicable laws and guidelines
    • Bias: be mindful of biases in the data that may affect model fairness across groups
    • Transparency: maintain clear documentation of feature definitions, preprocessing steps, and model assumptions
  • Limitations are acknowledged in the notebook:
    • The model is a statistical tool that complements clinical judgment
    • Data quality and representativeness limit generalizability
    • The model’s performance depends on the features available in the dataset
  • Readers should interpret results in the context of clinical practice and local guidelines. The notebook includes sections that discuss these considerations in plain language.

Roadmap and future work

  • Short-term goals:
    • Expand feature engineering to capture more nuanced lifestyle factors
    • Improve calibration for different demographic groups
    • Add alternative evaluation scenarios such as time-to-event analysis if data allows
  • Mid-term goals:
    • Compare logistic regression with other interpretable models (e.g., generalized additive models) to assess if performance can improve without sacrificing interpretability
    • Integrate model results into a simple decision-support tool for clinicians
    • Create lightweight, user-friendly visualizations to explain risk scores to patients
  • Long-term goals:
    • Validate the approach in external datasets and across healthcare settings
    • Explore integration with electronic health records (EHRs) for seamless data ingestion
    • Develop a dashboard that summarizes risk factors and model outputs for clinical teams

References and further reading

  • Classic logistic regression and binary classification materials
  • Tutorials on data cleaning and preprocessing for health data
  • Guides on interpretability and model explanation for clinical use
  • Basic papers on using logistic regression in healthcare contexts
  • Practical tutorials on feature selection methods such as L1/L2 regularization and recursive feature elimination
  • Resources on data visualization best practices for healthcare data

Frequently asked questions

  • What is the main goal of this project?
    • To demonstrate a clear, transparent approach to predicting lung cancer risk using logistic regression with clinical symptoms and lifestyle factors.
  • What is inside the analysis notebook?
    • The notebook contains data cleaning, EDA, feature selection, modeling, evaluation, and visualization steps. It is the central artifact for code and results.
  • How can I reproduce the results?
    • Follow the environment setup instructions, install dependencies, load the data, and run the notebook from start to finish. If you use a different dataset, adapt the data loading and feature engineering steps accordingly.
  • Where can I download the release artifacts?
    • The Releases page contains assets related to the analysis notebook and results. Use the link at the top of this README and the badge near the top to access it.
  • What if the data is not available in the repository?
    • The notebook explains how to load data from trusted sources and documents preprocessing steps so you can adapt the workflow to your dataset.

Appendix

  • Glossary
    • Logistic regression: a model that estimates the probability of a binary outcome by applying a logistic function to a linear combination of features.
    • ROC AUC: a measure of how well the model separates the two classes across all thresholds.
    • Calibration: how well predicted probabilities reflect actual outcomes.
    • Feature engineering: creating new features from existing data to improve model performance.
    • Regularization: a technique to prevent overfitting by penalizing large coefficients.
  • Useful commands
    • Install dependencies: pip install -r requirements.txt
    • Run Jupyter: jupyter notebook
    • Start a quick diagnostic: python -c "import sklearn; print('scikit-learn version:', sklearn.version)"
  • Licensing and reuse
    • If a license is present, follow its terms for reuse and distribution.
    • When reusing content, reference the original notebook and dataset provenance as appropriate.

Note

  • The analysis notebook is the authoritative source for code, steps, and results. This README mirrors the structure and topics of that notebook to help readers navigate the project without needing to search inside the file. The goal is to provide a clear, practical path to reproduce and extend the work using public data and standard Python tools.

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •