
PubMed RCT & Critical Care Scraper

This repository contains a Python-based scraper that queries PubMed for Randomised Controlled Trials (RCTs) in critical care journals, processes the resulting records, and saves them to an Excel file. The scraper is built on the NCBI Entrez API (via Biopython), pandas, and rich.

Table of Contents

  • Features
  • Directory Structure
  • Requirements
  • Installation
  • Usage
  • Customisation
  • Troubleshooting
  • Contributing
  • License

Features

  1. PubMed Query Construction
    Combines search terms for RCTs, critical care, date ranges, and human studies, then filters out meta-analyses and reviews (see the query sketch after this list).

  2. Journal-Focused
    Searches a predefined list of key journals by applying the [TA] field filter.

  3. Result Parsing
    Extracts essential publication information such as title, authors, DOI, and publication details.

  4. Excel Export
    Consolidates new results into an Excel workbook (with a timestamped filename) or appends to an existing one.

  5. Custom Logging
    Uses the rich library to provide stylised log messages in the console.
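
For orientation, here is a minimal sketch of how the query pieces and the [TA] journal filter might be combined into a single PubMed search string. The constant names mirror those used in config.py, but the search strings themselves are illustrative placeholders rather than the repository's actual values.

RCT_QUERY = '"randomized controlled trial"[Publication Type]'
CRITICAL_QUERY = '"critical care"[MeSH Terms] OR "intensive care"[All Fields]'
DATE_QUERY = '("2024/01/01"[PDAT] : "2024/12/31"[PDAT])'
EXCLUSION_QUERY = 'meta-analysis[Publication Type] OR review[Publication Type]'

def build_query(journal: str) -> str:
    """Combine all filters for a single journal using PubMed boolean syntax."""
    return (
        f'({RCT_QUERY}) AND ({CRITICAL_QUERY}) AND ({DATE_QUERY}) '
        f'AND humans[MeSH Terms] AND "{journal}"[TA] NOT ({EXCLUSION_QUERY})'
    )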


Directory Structure

.
├── config.py         # Configuration for queries, journal list, folder paths, etc.
├── main.py           # Main scraper script
├── requirements.txt  # Python package dependencies
└── README.md         # Project documentation
  • RES_DIR: The results directory is set in config.py (default: /results/), where Excel files will be saved.
  • config.py: Contains constants and query definitions used by the scraper.

Requirements

biopython
pandas
rich
openpyxl  # Required for Excel reading/writing

(You can install these via pip install -r requirements.txt.)


Installation

  1. Clone or Download the Repository

    git clone https://github.com/yourusername/pubmed-rct-scraper.git
    cd pubmed-rct-scraper
  2. Set Up a Python Virtual Environment (recommended)

    python -m venv venv
    source venv/bin/activate  # For macOS/Linux
    # or
    venv\Scripts\activate     # For Windows
  3. Install Dependencies

    pip install -r requirements.txt

    or manually:

    pip install biopython pandas rich openpyxl
  4. Create Results Directory (if needed)
    By default, the scraper uses /results/ (defined in config.py). On some operating systems, you may need to create this folder manually or adjust permissions.
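
    A quick way to create the folder up front (a sketch, assuming RES_DIR is importable from config.py):

    from pathlib import Path
    from config import RES_DIR

    Path(RES_DIR).mkdir(parents=True, exist_ok=True)  # no-op if it already exists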


Usage

Configuration

Open config.py and update the constants as necessary:

  • RES_DIR: Directory where Excel results will be saved.
  • NCBI_EMAIL: Your valid email address (required by NCBI).
  • RCT_QUERY, CRITICAL_QUERY, DATE_QUERY, EXCLUSION_QUERY: Search strings and filters that make up the PubMed query.
  • JOURNALS: List of journals to search.
  • OUTPUT_HEADERS: Columns for the output Excel sheet.

Make sure the date range and any other query constraints match the records you want to retrieve.
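
For illustration, a config.py might look like the following. Every value here is a placeholder, not the shipped default:

RES_DIR = "/results/"
NCBI_EMAIL = "you@example.com"  # NCBI requires a valid contact address
RCT_QUERY = '"randomized controlled trial"[Publication Type]'
CRITICAL_QUERY = '"critical care"[MeSH Terms]'
DATE_QUERY = '("2024/01/01"[PDAT] : "2024/12/31"[PDAT])'
JOURNALS = ["Intensive Care Med", "Crit Care Med", "Am J Respir Crit Care Med"]
OUTPUT_HEADERS = ["Title", "Authors", "Journal", "DOI", "Publication Date"]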

Running the Scraper

Simply execute main.py:

python main.py

This will:

  1. Read each journal name from JOURNALS.
  2. Construct a PubMed query for that journal.
  3. Fetch up to 200 results (adjustable in main.py).
  4. Parse each record, extract relevant info, and append it to a master list.
  5. Write (or append) the results to a timestamped Excel file in RES_DIR.
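
The core of steps 2-4 can be sketched with Biopython's Entrez and Medline modules. The helper name fetch_records matches the one mentioned under Customisation, but the query string and MEDLINE field mapping here are illustrative and may differ from main.py:

from Bio import Entrez, Medline

Entrez.email = "you@example.com"  # required by NCBI; normally config.NCBI_EMAIL

def fetch_records(query, retmax=200):
    """Search PubMed for the query, then fetch the matches in MEDLINE format."""
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()
    if not id_list:
        return []
    handle = Entrez.efetch(db="pubmed", id=",".join(id_list),
                           rettype="medline", retmode="text")
    records = list(Medline.parse(handle))  # each record is a dict of MEDLINE fields
    handle.close()
    return records

rows = []
for journal in ["Intensive Care Med"]:  # normally config.JOURNALS
    query = f'"{journal}"[TA] AND randomized controlled trial[Publication Type]'
    for rec in fetch_records(query):
        rows.append({
            "Title": rec.get("TI", ""),
            "Authors": "; ".join(rec.get("AU", [])),
            "Journal": rec.get("TA", ""),
            "DOI": next((aid.split(" ")[0] for aid in rec.get("AID", [])
                         if aid.endswith("[doi]")), ""),
        })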

Output

  • Excel File: A file named pubmed_results_YYYYMMDD_HHMMSS.xlsx (e.g., pubmed_results_20250304_134501.xlsx) will be created or updated in your specified RES_DIR.
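
A sketch of the export step, showing the timestamped filename and how appending can work by concatenating DataFrames (the rows list stands in for the parsed records):

import os
from datetime import datetime

import pandas as pd

rows = [{"Title": "Example trial", "Authors": "Doe J", "DOI": ""}]  # parsed records
df = pd.DataFrame(rows)

os.makedirs("results", exist_ok=True)  # normally config.RES_DIR
filename = f"pubmed_results_{datetime.now():%Y%m%d_%H%M%S}.xlsx"
path = os.path.join("results", filename)

if os.path.exists(path):  # append by merging with the existing sheet
    df = pd.concat([pd.read_excel(path), df], ignore_index=True)
df.to_excel(path, index=False)  # writing .xlsx requires openpyxl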

Customisation

  1. Journals
    Add or remove journal titles in config.JOURNALS.
  2. Search Terms
    Modify config.RCT_QUERY, config.CRITICAL_QUERY, or config.EXCLUSION_QUERY to fit your needs.
  3. Maximum Records
    Increase or decrease the retmax value in fetch_records() (within main.py); see the snippet below.
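
retmax is simply the cap passed to Entrez.esearch, so raising the limit is a one-line change. A sketch with an illustrative query:

from Bio import Entrez

Entrez.email = "you@example.com"
handle = Entrez.esearch(db="pubmed", term='"Crit Care Med"[TA]', retmax=500)
id_list = Entrez.read(handle)["IdList"]  # now up to 500 PMIDs instead of 200
handle.close()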

Troubleshooting

  • Permission Errors:
    Make sure your Python script has permission to create and write to RES_DIR.

  • Network Issues:
    The script relies on the NCBI Entrez API, so it requires internet access. If you have a firewall or proxy, ensure it's correctly configured.

  • No Records Found:
    If you see many "No records found" messages (printed in yellow via rich), check your query parameters: make sure the date ranges, journal names, and search terms are correct.

  • Biopython or Other Dependencies Missing:
    Double-check that you have installed all dependencies via pip or another package manager.


Contributing

Contributions are welcome! If you want to add features, fix bugs, or improve documentation:

  1. Fork this repository.
  2. Create a new branch for your changes.
  3. Submit a pull request.

For any major changes, please open an issue first to discuss what you would like to change.


License

MIT


Author: Dr Leila Janani (99%), Dr Amin Haghighatbin (1%)
Contact: aminhb@tutanota.com
