This repository contains the code for pretraining a BERT model on domain-specific data.
- `com/mhire/data_processing/`: Contains scripts for data preparation, including PDF parsing, sentence chunking, and Next Sentence Prediction (NSP) data generation.
- `com/mhire/pre_training/`: Contains the core pretraining implementation using the `transformers` library.
- `com/mhire/pdf_processing_pipeline.py`: Orchestrates the PDF processing steps to prepare data for pretraining.
- `com/mhire/pre_training_runner.py`: The main script to initiate the pretraining process.
- `com/mhire/utility/`: Contains utility functions for directory management, Google Cloud Storage interactions, NLTK data handling, and zip operations.
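As an illustration of the kind of helper the `utility/` module describes, here is a minimal sketch of a Google Cloud Storage upload function using the `google-cloud-storage` client; the function name, bucket, and file paths are hypothetical and not taken from the repository.

```python
# Hypothetical GCS upload helper, sketching the "Google Cloud Storage
# interactions" mentioned above; bucket and object names are placeholders.
from google.cloud import storage

def upload_to_gcs(bucket_name: str, local_path: str, blob_name: str) -> str:
    """Upload a local file to a GCS bucket and return its gs:// URI."""
    client = storage.Client()                # uses application-default credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)    # streams the local file to the bucket
    return f"gs://{bucket_name}/{blob_name}"

# Example: upload_to_gcs("my-pretraining-bucket", "sentences.jsonl", "data/sentences.jsonl")
```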
The pretraining pipeline involves:
- PDF Processing: Parsing PDF documents to extract text and prepare it for model input.
- Data Preparation: Generating Masked Language Model (MLM) and Next Sentence Prediction (NSP) training examples (a sketch of NSP pair generation follows this list).
- Model Pretraining: Training a BERT model using the prepared data.
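For the data preparation step, a minimal sketch of NSP pair generation from chunked sentences is shown below; the function and field names are illustrative, and the repository's actual generator may sample and serialize examples differently.

```python
# Illustrative sketch of NSP example generation from sentence chunks;
# not the repository's actual implementation.
import random

def make_nsp_pairs(sentences: list[str], seed: int = 0) -> list[dict]:
    """Pair each sentence with its true successor half the time (label 0)
    and with a randomly drawn sentence otherwise (label 1)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            sent_b, label = sentences[i + 1], 0        # genuine next sentence
        else:
            sent_b, label = rng.choice(sentences), 1   # random sentence
            # A real generator would draw from a different document to avoid
            # accidentally picking the true successor.
        pairs.append({"sent_a": sentences[i], "sent_b": sent_b,
                      "next_sentence_label": label})
    return pairs
```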
After domain-adaptive pretraining, the model reached 89% validation accuracy on the combined MLM/NSP objective.
You can access the pretraining dataset on Kaggle:
➡️ Medical Domain Corpus
The dataset contains cleaned and parsed sentences from globally recognized medical textbooks and peer-reviewed journals.
- Python 3.x
- pip (Python package installer)
- Clone the repository:

  ```bash
  git clone https://github.com/syeda434am/Domain-Adaptive-BERT-Pretraining.git
  cd Domain-Adaptive-BERT-Pretraining
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
Ensure your PDF documents are placed in the designated input directory (as configured in `pdf_processing_pipeline.py`). The pipeline will process these PDFs and generate the necessary JSONL files for pretraining.
To run the PDF processing pipeline:

```bash
python com/mhire/pdf_processing_pipeline.py
```
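For context, here is a hypothetical sketch of the PDF-to-JSONL step, assuming `pypdf` for text extraction and NLTK for sentence splitting; the file names and JSONL schema are placeholders and may not match the pipeline's actual format.

```python
# Hypothetical sketch of a PDF -> JSONL preparation step; the schema
# {"sentence": ...} and the file names are illustrative only.
import json
import nltk
from pypdf import PdfReader

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def pdf_to_jsonl(pdf_path: str, jsonl_path: str) -> None:
    """Extract text from each page, split it into sentences, and write
    one JSON object per sentence."""
    reader = PdfReader(pdf_path)
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for page in reader.pages:
            text = page.extract_text() or ""
            for sentence in nltk.sent_tokenize(text):
                out.write(json.dumps({"sentence": sentence.strip()}) + "\n")

pdf_to_jsonl("textbook.pdf", "sentences.jsonl")
```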
Once the data is prepared, you can start the pretraining process. The `pre_training_runner.py` script handles the entire pretraining workflow.
To run the pretraining:

```bash
python com/mhire/pre_training_runner.py
```
Configuration parameters for pretraining (e.g., model name, batch size, epochs, output directories) can be adjusted within `pre_training_runner.py` and `pre_training/pre_training.py`.
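As a rough guide to what these parameters control, here is a minimal sketch of MLM+NSP pretraining with the `transformers` Trainer API; the corpus file, hyperparameters, and output directory are illustrative and not the repository's actual configuration.

```python
# Minimal MLM + NSP pretraining sketch with Hugging Face transformers;
# file names and hyperparameters below are placeholders.
from transformers import (
    BertForPreTraining,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Builds sentence-pair (NSP) examples from a plain-text corpus in which
# documents are separated by blank lines (legacy transformers utility).
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="domain_corpus.txt",   # hypothetical corpus file
    block_size=128,
)

# Dynamically masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-domain-adapted",   # illustrative output directory
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```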
Upon completion of the domain-adaptive pretraining, the model achieved an 89% accuracy on the validation set for the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks.
If you use this repository for your research or projects, please consider citing it. This project is licensed under the Apache 2.0 License.
```bibtex
@misc{domain_adaptive_bert_pretraining,
  title={Domain-Adaptive BERT Pretraining},
  author={Syeda Aunanya Mahmud},
  year={2025},
  publisher={GitHub},
  url={https://github.com/syeda434am/Domain-Adaptive-BERT-Pretraining},
  license={Apache 2.0},
}
```