LLM_NER_MultiNERD uses the MultiNERD Named Entity Recognition (NER) dataset to train and evaluate NER models for English with BERT and XLNet. It is built on top of the familiar 🤗 Transformers library.
Fine-tune the chosen models, bert-base-cased and xlnet-base-cased, on the English subset of the training set.
Train a model that predicts only five entity types plus the O tag (i.e. tokens that are not part of an entity). The necessary pre-processing steps should therefore be performed on the dataset: all examples remain, but entity tags not belonging to one of the following five types should be set to the O tag (id 0): PERSON (PER), ORGANIZATION (ORG), LOCATION (LOC), DISEASES (DIS), ANIMAL (ANIM). Fine-tune the chosen models on the filtered dataset; a preprocessing sketch follows.
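The filtering can be done with 🤗 Datasets. The sketch below is a minimal example, assuming the dataset is loaded from the Hub as Babelscape/multinerd, that each example carries a `lang` field and integer `ner_tags`, and that the tag ids are exposed through a ClassLabel feature; adjust the names to match the loading code actually used in this repository.

```python
# Preprocessing sketch (dataset name, field names, and ClassLabel feature are assumptions).
from datasets import load_dataset

KEEP = {"PER", "ORG", "LOC", "DIS", "ANIM"}

dataset = load_dataset("Babelscape/multinerd")
# Keep only the English subset
dataset = dataset.filter(lambda ex: ex["lang"] == "en")

# id -> name mapping from the dataset's own label list
label_names = dataset["train"].features["ner_tags"].feature.names

def keep_five_types(example):
    new_tags = []
    for tag_id in example["ner_tags"]:
        name = label_names[tag_id]          # e.g. "B-PER", "I-DIS", "O"
        entity_type = name.split("-")[-1]   # strip the B-/I- prefix
        new_tags.append(tag_id if entity_type in KEEP else 0)  # 0 == "O"
    return {"ner_tags": new_tags}

dataset = dataset.map(keep_five_types)
```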
BERT (Bidirectional Encoder Representations from Transformers) employs a bidirectional attention mechanism to capture contextual information from both left and right contexts. It uses pre-training tasks, such as masked language modeling, to learn contextualized embeddings.
XLNet improves upon BERT by introducing permutation language modeling. It captures bidirectional context like BERT but allows for a more flexible information flow. In Named Entity Recognition (NER) tasks, these models excel at understanding the relationships between words and recognizing entities such as persons, organizations, and locations. Their deep contextual embeddings enable them to capture nuanced patterns, improving accuracy in identifying named entities within text.
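To make the link between these backbones and the NER task concrete, the sketch below shows how a token-classification head covering the five kept entity types (plus O) can be attached to either checkpoint. The label id order shown here is an assumption and may differ from the one used in main_A.py / main_B.py.

```python
# Sketch: attaching a token-classification head for 5 entity types + O.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O",
          "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC",
          "B-DIS", "I-DIS", "B-ANIM", "I-ANIM"]
id2label = {i: l for i, l in enumerate(labels)}
label2id = {l: i for i, l in id2label.items()}

model_ckpt = "bert-base-cased"  # or "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```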
Go to the folder docker/.
docker build -f Dockerfile -t ner-multinerd \
--build-arg username=$(username) .
docker run -it --shm-size 60G --gpus all \
-v /path/to/dir/:/home/username/NER-MultiNERD/ \
-v /path/to/storage/:/storage/ ner-multinerd
Install the following dependencies to run the tasks in the environment:
pip install -r requirements.txt
The input follows the BIO tag scheme, with one token and its label per line. Sentences are separated by a blank line.
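For illustration, a sentence in this format might look like the following (the tokens and tags here are made up, not taken from MultiNERD):

```
Rabies B-DIS
is O
spread O
by O
dogs B-ANIM
in O
Kenya B-LOC
. O
```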
To fine-tune BERT for System A:
python main_A.py --MODEL_CKPT bert-base-cased
To fine-tune BERT for System B:
python main_B.py --MODEL_CKPT bert-base-cased
To fine-tune XLNet for System A:
python main_A.py --MODEL_CKPT xlnet-base-cased
To fine-tune XLNet for System B:
python main_B.py --MODEL_CKPT xlnet-base-cased
I have uploaded the fine-tuned models to Hugging Face; you can load them or run inference directly through the API. Here is an example:
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("medxiaorudan/bert-base-cased-finetuned-MultiNERD-SystemA")
model = AutoModelForTokenClassification.from_pretrained("medxiaorudan/bert-base-cased-finetuned-MultiNERD-SystemA")
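Continuing from the snippet above, a quick way to run inference is the token-classification pipeline. This is a minimal sketch; the sample sentence is arbitrary.

```python
from transformers import pipeline

# "simple" aggregation merges word pieces back into whole entity spans
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

print(ner("Marie Curie worked at the University of Paris."))
# -> a list of dicts with entity_group, score, word, start, end
```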
The overall performance of BERT and XLNet on the dev set is shown below (more detailed validation results and visualizations can be found in the Notebooks):
| Model | Accuracy (entity) | Recall (entity) | Precision (entity) | F1 score (entity) |
|---|---|---|---|---|
| BERT+SystemA | 0.9861 | 0.9685 | 0.8699 | 0.9165 |
| BERT+SystemB | 0.9922 | 0.9740 | 0.9206 | 0.9466 |
| XLNET+SystemA | 0.9759 | 0.9548 | 0.7967 | 0.8687 |
| XLNET+SystemB | 0.9915 | 0.9741 | 0.9145 | 0.9434 |