Data Version History
Shakleen Ishfar edited this page May 7, 2024 · 5 revisions
- First 512 tokens of a sequence with truncation and padding.
- Cleaning: Strip trailing whitespace.
- Sequence length of 512, with truncation and padding.
- Cleaning:
  - Replace newline characters with a space.
  - Replace multiple consecutive spaces with a single space.
- Sliding window technique: Divide long sequences into smaller sequences of at most 512 tokens each.
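The cleaning, sliding-window, and padding steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the function names, the `stride` parameter (the page does not say whether windows overlap, so the default here is no overlap), and the `pad_id=0` value are all assumptions.

```python
import re


def clean_text(text):
    """Replace newlines with spaces, then collapse runs of spaces into one."""
    text = text.replace("\n", " ")
    return re.sub(r" {2,}", " ", text)


def sliding_windows(token_ids, max_len=512, stride=512):
    """Split a token id list into windows of at most max_len ids.

    stride is hypothetical: stride == max_len gives non-overlapping windows,
    a smaller stride gives overlapping ones.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the sequence
    return windows


def pad_to_length(token_ids, max_len=512, pad_id=0):
    """Truncate to max_len, right-pad with pad_id, and build an attention mask."""
    kept = token_ids[:max_len]
    n_pad = max_len - len(kept)
    return kept + [pad_id] * n_pad, [1] * len(kept) + [0] * n_pad
```

In practice the token ids would come from the tokenizer, and each window would be padded before batching, for example `[pad_to_length(w) for w in sliding_windows(ids)]`.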
All Version 2 properties, plus:
- Negative sampling: Scores 2, 3, and 4 (negative) have far more samples than scores 1, 5, and 6 (positive). The negative samples are therefore split randomly into 3 chunks, and the positive samples are added to each chunk.
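One way to picture the chunking described above is the sketch below. It assumes samples are `(text, score)` pairs and that every chunk receives all positive samples; the function name, the seed, and the round-robin assignment after shuffling are illustrative choices, not details taken from this page.

```python
import random

POSITIVE_SCORES = {1, 5, 6}  # scores 2, 3, 4 are treated as negative


def build_chunks(samples, n_chunks=3, seed=0):
    """Randomly split negatives into n_chunks and add every positive to each chunk."""
    positives = [s for s in samples if s[1] in POSITIVE_SCORES]
    negatives = [s for s in samples if s[1] not in POSITIVE_SCORES]

    rng = random.Random(seed)
    rng.shuffle(negatives)  # random assignment of negatives to chunks

    # Every n_chunks-th shuffled negative goes to the same chunk.
    return [negatives[i::n_chunks] + positives for i in range(n_chunks)]
```

Each chunk then has roughly a third of the negative samples alongside the full positive set, which reduces the imbalance within any single chunk.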
Same as Version 3, except that the negative samples are split using StratifiedKFold from scikit-learn with K=3.
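A minimal sketch of a stratified split, assuming the negative samples are stratified by their score and that the held-out indices of each fold serve as one chunk. The function name and the `shuffle`/`random_state` settings are assumptions; the page only states that `StratifiedKFold` is used with K=3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def stratified_negative_chunks(negative_scores, n_chunks=3, seed=0):
    """Return n_chunks index arrays, each with a similar mix of scores."""
    scores = np.asarray(negative_scores)
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=seed)
    # split() only needs X for its length, so a placeholder array is enough.
    placeholder = np.zeros((len(scores), 1))
    # The held-out fold of each split is used as one chunk (an assumption).
    return [test_idx for _, test_idx in skf.split(placeholder, scores)]
```

Unlike a purely random split, each chunk keeps approximately the same proportion of scores 2, 3, and 4 as the full negative set.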
Base tokenizer with no added tokens.
Base tokenizer with two added tokens:
- New line token (\n)
- Double space token (two consecutive space characters)
DeBERTa-v3 model with mean pooling for classification.
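Mean pooling is commonly implemented as a mask-weighted average of the encoder's token embeddings, so that padding positions do not affect the result. The NumPy sketch below shows that idea under assumed array shapes; it is not the project's model code, and the function name is hypothetical.

```python
import numpy as np


def mean_pool(hidden_states, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden_states: array of shape (batch, seq_len, hidden_size)
    attention_mask: array of shape (batch, seq_len), 1 for tokens, 0 for padding
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)       # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # guard against division by zero
    return summed / counts
```

The pooled vector of shape `(batch, hidden_size)` would then be passed to a classification head.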