This repository contains a Python-based scraper that queries PubMed for Randomised Controlled Trials (RCTs) in critical care journals, processes the resulting records, and saves them in an Excel file. The scraper uses the NCBI Entrez API (via Biopython) together with the pandas and rich libraries.
- Features
- Directory Structure
- Requirements
- Installation
- Usage
- Customisation
- Troubleshooting
- Contributing
- License
- PubMed Query Construction: Combines search terms for RCTs, critical care, date ranges, and human studies, then filters out meta-analyses and reviews.
- Journal-Focused: Searches a predefined list of key journals by applying the `[TA]` field filter.
- Result Parsing: Extracts essential publication information such as title, authors, DOI, and publication details.
- Excel Export: Consolidates new results into an Excel workbook (with a timestamped filename) or appends to an existing one.
- Custom Logging: Uses the `rich` library to provide stylised log messages in the console.
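As a rough illustration of the query construction described above, the sketch below joins inclusion terms with `AND`, restricts the search to one journal with `[TA]`, and excludes unwanted publication types. The function name `build_query`, its parameters, and the example terms are assumptions made for illustration; the actual strings live in `config.py` and may differ.

```python
def build_query(journal, include_terms, exclude_terms):
    """Combine inclusion terms with AND, limit to one journal, drop excluded types."""
    # [TA] is PubMed's journal field tag; quoting keeps multi-word names together.
    journal_filter = f'"{journal}"[TA]'
    included = " AND ".join(f"({term})" for term in include_terms)
    excluded = " OR ".join(f"({term})" for term in exclude_terms)
    return f"{included} AND {journal_filter} NOT ({excluded})"


# Hypothetical terms, not the project's real configuration.
query = build_query(
    "Crit Care Med",
    include_terms=[
        "randomized controlled trial[pt]",
        "critical care[mh]",
        "2024/01/01:2024/12/31[dp]",
        "humans[mh]",
    ],
    exclude_terms=["meta-analysis[pt]", "review[pt]"],
)
print(query)
```

Wrapping each term in parentheses keeps multi-part terms from interacting with the surrounding boolean operators.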
.
├── config.py # Configuration for queries, journal list, folder paths, etc.
├── main.py # Main scraper script
├── requirements.txt # (Optional) Could include necessary Python packages
└── README.md # Project documentation
- `RES_DIR`: The results directory is set in `config.py` (default: `/results/`), where Excel files will be saved.
- `config.py`: Contains constants and query definitions used by the scraper.
biopython
pandas
rich
openpyxl # Required for Excel reading/writing
(You can install these via `pip install -r requirements.txt`.)
- Clone or Download the Repository

      git clone https://github.com/yourusername/pubmed-rct-scraper.git
      cd pubmed-rct-scraper

- Set Up a Python Virtual Environment (recommended)

      python -m venv venv
      source venv/bin/activate   # For macOS/Linux
      # or
      venv\Scripts\activate      # For Windows

- Install Dependencies

      pip install -r requirements.txt

  or manually:

      pip install biopython pandas rich openpyxl

- Create Results Directory (if needed)

  By default, the scraper uses `/results/` (defined in `config.py`). On some operating systems, you may need to create this folder manually or adjust permissions.
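If you would rather check the results folder from Python before running the scraper, something along these lines could work. The helper name `ensure_results_dir` is hypothetical and is not part of `main.py`; it only uses the standard library.

```python
import os
from pathlib import Path


def ensure_results_dir(path):
    """Create the results directory if missing and confirm it is writable."""
    directory = Path(path)
    # parents=True creates intermediate folders; exist_ok=True tolerates reruns.
    directory.mkdir(parents=True, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"Cannot write to {directory}")
    return directory
```

Calling it with the value of `RES_DIR` raises `PermissionError` early, either from `mkdir` or from the explicit check, instead of failing later at export time.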
Open `config.py` and update the constants as necessary:
- `RES_DIR`: Directory where Excel results will be saved.
- `NCBI_EMAIL`: Your valid email address (required by NCBI).
- `RCT_QUERY`, `CRITICAL_QUERY`, `DATE_QUERY`: Search strings for the PubMed query.
- `JOURNALS`: List of journals to search.
- `OUTPUT_HEADERS`: Columns for the output Excel sheet.
Ensure you have the right date range and any other query constraints you want.
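For orientation, a `config.py` along these lines would cover the constants listed above. Every value here is a placeholder chosen for illustration (the email, the search strings, the journal abbreviations, and the column names are all assumptions), so compare it with the file shipped in the repository rather than copying it verbatim.

```python
# Illustrative sketch of config.py; all values are placeholders.

RES_DIR = "/results/"              # where Excel files are written
NCBI_EMAIL = "you@example.com"     # NCBI asks API users to identify themselves

# PubMed search fragments, using standard field tags ([pt], [mh], [dp]).
RCT_QUERY = "randomized controlled trial[pt]"
CRITICAL_QUERY = "(critical care[mh] OR intensive care units[mh])"
DATE_QUERY = "2024/01/01:2024/12/31[dp]"
EXCLUSION_QUERY = "(meta-analysis[pt] OR review[pt])"

# Journal title abbreviations, matched with the [TA] field filter.
JOURNALS = ["Crit Care Med", "Intensive Care Med", "Crit Care"]

# Column order of the output sheet.
OUTPUT_HEADERS = ["Title", "Authors", "Journal", "Published", "DOI", "PMID"]
```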
Simply execute `main.py`:
python main.py
This will:
- Read each journal name from `JOURNALS`.
- Construct a PubMed query for that journal.
- Fetch up to 200 results (adjustable in `main.py`).
- Parse each record, extract relevant info, and append it to a master list.
- Write (or append) the results to a timestamped Excel file in `RES_DIR`.
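The parsing step in the list above might look roughly like the sketch below. It assumes records arrive as MEDLINE-style dictionaries (the shape produced by `Bio.Medline.parse`, with keys such as `TI`, `AU`, and `AID`); whether `main.py` uses that format or parses XML instead is not stated here, and the `parse_record` name and the sample record are made up for illustration.

```python
def parse_record(record):
    """Pull the fields of interest out of one MEDLINE-style record (a dict)."""
    # AID holds article identifiers such as "10.1000/example [doi]".
    doi = next(
        (aid.split(" ")[0] for aid in record.get("AID", []) if aid.endswith("[doi]")),
        "",
    )
    return {
        "Title": record.get("TI", ""),
        "Authors": "; ".join(record.get("AU", [])),
        "Journal": record.get("TA", ""),
        "Published": record.get("DP", ""),
        "DOI": doi,
        "PMID": record.get("PMID", ""),
    }


# Invented sample record, used only to exercise the function.
sample = {
    "PMID": "00000000",
    "TI": "An example trial title.",
    "AU": ["Doe J", "Roe R"],
    "TA": "Crit Care Med",
    "DP": "2024 Jan",
    "AID": ["S0000-0000(00)00000-0 [pii]", "10.1000/example [doi]"],
}

master_list = []
master_list.append(parse_record(sample))
print(master_list[0]["DOI"])
```

Using `.get()` with defaults means a record with a missing field yields an empty cell rather than stopping the run.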
- Excel File: A file named `pubmed_results_YYYYMMDD_HHMMSS.xlsx` (e.g., `pubmed_results_20250304_134501.xlsx`) will be created or updated in your specified `RES_DIR`.
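The timestamped filename pattern can be reproduced with the standard library, as in this small sketch. The helper name `results_path` and its `now` parameter are assumptions added for illustration and are not taken from `main.py`.

```python
from datetime import datetime
from pathlib import Path


def results_path(res_dir, now=None):
    """Build a pubmed_results_YYYYMMDD_HHMMSS.xlsx path inside res_dir."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(res_dir) / f"pubmed_results_{stamp}.xlsx"


# A fixed datetime reproduces the example filename given above.
path = results_path("/results/", datetime(2025, 3, 4, 13, 45, 1))
print(path.name)
```

With pandas and openpyxl installed, a list of row dictionaries could then be written with `pandas.DataFrame(rows).to_excel(path, index=False)`.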
- Journals: Add or remove journal titles in `config.JOURNALS`.
- Search Terms: Modify `config.RCT_QUERY`, `config.CRITICAL_QUERY`, or `config.EXCLUSION_QUERY` to fit your needs.
- Maximum Records: Increase or decrease the `retmax` value in `fetch_records()` (within `main.py`).
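To show where `retmax` takes effect, here is a minimal sketch of a fetch function built on Biopython's `Entrez.esearch`, `Entrez.efetch`, and `Entrez.read`. It is not the repository's `fetch_records()`: the signature, the `email` and `entrez` parameters, and the choice of XML output are assumptions made for illustration.

```python
def fetch_records(query, retmax=200, email="you@example.com", entrez=None):
    """Search PubMed and return parsed articles for up to `retmax` matches."""
    if entrez is None:
        # Imported lazily so a stand-in client can be passed in for testing.
        from Bio import Entrez as entrez
    entrez.email = email  # NCBI asks every client to identify itself

    # retmax caps how many matching PubMed IDs the search returns.
    search_handle = entrez.esearch(db="pubmed", term=query, retmax=retmax)
    id_list = entrez.read(search_handle)["IdList"]
    search_handle.close()
    if not id_list:
        return []

    # Fetch the full records for those IDs in one request.
    fetch_handle = entrez.efetch(db="pubmed", id=",".join(id_list), retmode="xml")
    articles = entrez.read(fetch_handle)["PubmedArticle"]
    fetch_handle.close()
    return articles
```

Raising `retmax` returns more records per journal at the cost of larger requests; very large values may be better split into several smaller fetches.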
- Permission Errors: Make sure your Python script has permission to create or write to the `RES_DIR`.
- Network Issues: The script relies on the NCBI Entrez API, so it requires internet access. If you have a firewall or proxy, ensure it's correctly configured.
- No Records Found: If you see a lot of `[yellow]No records found[/yellow]` messages, check your query parameters. Ensure that your date ranges, journal names, or search terms are correct.
- Biopython or Other Dependencies Missing: Double-check that you have installed all dependencies via `pip` or another package manager.
Contributions are welcome! If you want to add features, fix bugs, or improve documentation:
- Fork this repository.
- Create a new branch for your changes.
- Submit a pull request.
For any major changes, please open an issue first to discuss what you would like to change.
License: MIT
Authors: Dr Leila Janani (99%), Dr Amin Haghighatbin (1%)
Contact: aminhb@tutanota.com