
Data Version History

Shakleen Ishfar edited this page May 7, 2024 · 5 revisions

Data Versions

Version 1

  1. First 512 tokens of a sequence with truncation and padding.
  2. Cleaning: Strip extra space from the end.
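The two Version 1 steps can be sketched as follows. This is a hypothetical stand-in: `PAD_ID` and the encoding function are illustrative, and the real pipeline would use the model tokenizer's own truncation and padding (e.g. `truncation=True, padding="max_length", max_length=512` in Hugging Face Transformers).

```python
MAX_LEN = 512
PAD_ID = 0  # illustrative padding id; the real value comes from the tokenizer


def clean(text: str) -> str:
    """Version 1 cleaning: strip extra space from the end of the text."""
    return text.rstrip()


def encode(token_ids: list[int]) -> list[int]:
    """Keep the first MAX_LEN tokens, then pad with PAD_ID up to MAX_LEN."""
    ids = token_ids[:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))
```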

Version 2

  1. Sequence length of 512 with truncation and padding.
  2. Cleaning:
    • Replace newline characters with spaces.
    • Collapse multiple spaces into a single space.
  3. Sliding window technique: split long sequences into consecutive chunks of at most 512 tokens.
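A minimal sketch of the Version 2 cleaning and chunking, assuming the window advances by a full 512 tokens (the page does not specify any overlap between windows):

```python
import re

MAX_LEN = 512


def clean(text: str) -> str:
    """Replace newlines with spaces, then collapse runs of spaces into one."""
    text = text.replace("\n", " ")
    return re.sub(r" {2,}", " ", text)


def sliding_window(token_ids: list[int], max_len: int = MAX_LEN) -> list[list[int]]:
    """Split a long token sequence into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```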

Version 3

All Version 2 properties, plus:

  1. Negative sampling: scores 2, 3, and 4 (negative) have far more samples than scores 1, 5, and 6 (positive). We therefore split the negative samples randomly into 3 chunks and add all positive samples to each chunk.
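The split above can be sketched as below. The `"score"` field and the seed are illustrative assumptions; the actual sample format is not specified on this page.

```python
import random


def negative_chunk_datasets(samples: list[dict], seed: int = 0) -> list[list[dict]]:
    """Split negatives (scores 2-4) into 3 random chunks and pair each chunk
    with *all* positives (scores 1, 5, 6), yielding 3 balanced datasets."""
    negatives = [s for s in samples if s["score"] in (2, 3, 4)]
    positives = [s for s in samples if s["score"] in (1, 5, 6)]
    rng = random.Random(seed)
    rng.shuffle(negatives)          # randomize before chunking
    k = 3
    chunks = [negatives[i::k] for i in range(k)]  # 3 near-equal chunks
    return [chunk + positives for chunk in chunks]
```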

Version 4

Same as Version 3, except the splitting of negative samples is done using StratifiedKFold from scikit-learn with K=3.
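A minimal sketch of how the K=3 stratified split could look. The score array and random seed are illustrative; only the use of `StratifiedKFold` itself comes from this page. Each test fold serves as one negative chunk, so every chunk preserves the 2/3/4 score ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative labels: the negative scores (2-4) that Version 4 splits.
scores = np.array([2, 3, 4] * 30)
X = np.zeros(len(scores))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Each fold's test indices form one stratified chunk of the negatives.
chunks = [test_idx for _, test_idx in skf.split(X, scores)]
```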

Tokenizer Versions

Version 1

Base tokenizer with no added tokens.

Version 2

Base tokenizer with two added tokens:

  1. Newline token (`\n`)
  2. Double-space token (`  `)
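In Hugging Face Transformers, this would typically be `tokenizer.add_tokens(["\n", "  "])` followed by `model.resize_token_embeddings(len(tokenizer))`. A dependency-free toy illustrating the effect on the vocabulary (the vocab contents here are made up):

```python
def add_tokens(vocab: dict[str, int], new_tokens: list[str]) -> int:
    """Append tokens not already in the vocab; return how many were added
    (mirroring the return value of Hugging Face's `tokenizer.add_tokens`)."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free id
            added += 1
    return added


vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2}
num_added = add_tokens(vocab, ["\n", "  "])  # newline and double-space tokens
```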

Model Versions

Version 1

DeBERTa-V3 model with mean pooling for classification.
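Mean pooling averages the encoder's token embeddings into one sequence vector, skipping padding positions via the attention mask. A minimal NumPy sketch (the real model would do this on the DeBERTa-V3 last hidden state, typically in PyTorch):

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings per sequence, ignoring padding.

    hidden: (batch, seq_len, dim) last-layer hidden states
    mask:   (batch, seq_len) attention mask, 1 for real tokens, 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    return summed / counts
```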
