
Data Version History

Shakleen Ishfar edited this page May 7, 2024 · 5 revisions

Data Versions

Version 1

  1. First 512 tokens of a sequence with truncation and padding.
  2. Cleaning: Strip extra space from the end.
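The two Version 1 steps can be sketched as follows. This is a hypothetical stand-in: `PAD_ID` and the encoding function are illustrative, and the real pipeline would use the model tokenizer's own truncation and padding (e.g. `truncation=True, padding="max_length", max_length=512` in Hugging Face Transformers).

```python
MAX_LEN = 512
PAD_ID = 0  # illustrative padding id; the real value comes from the tokenizer


def clean(text: str) -> str:
    """Version 1 cleaning: strip extra space from the end of the text."""
    return text.rstrip()


def encode(token_ids: list[int]) -> list[int]:
    """Keep the first MAX_LEN tokens, then pad with PAD_ID up to MAX_LEN."""
    ids = token_ids[:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))
```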

Version 2

  1. Sequence length of 512 with truncation and padding.
  2. Cleaning:
    • Replace newline characters with spaces.
    • Collapse multiple spaces into a single space.
  3. Sliding window technique: split long sequences into consecutive chunks of at most 512 tokens.
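A minimal sketch of the Version 2 cleaning and chunking, assuming the window advances by a full 512 tokens (the page does not specify any overlap between windows):

```python
import re

MAX_LEN = 512


def clean(text: str) -> str:
    """Replace newlines with spaces, then collapse runs of spaces into one."""
    text = text.replace("\n", " ")
    return re.sub(r" {2,}", " ", text)


def sliding_window(token_ids: list[int], max_len: int = MAX_LEN) -> list[list[int]]:
    """Split a long token sequence into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```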

Version 3

All Version 2 properties, plus:

  1. Negative sampling: scores 2, 3, and 4 (negative) have far more samples than scores 1, 5, and 6 (positive). We therefore split the negative samples randomly into 3 chunks and add all positive samples to each chunk.
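The split above can be sketched as below. The `"score"` field and the seed are illustrative assumptions; the actual sample format is not specified on this page.

```python
import random


def negative_chunk_datasets(samples: list[dict], seed: int = 0) -> list[list[dict]]:
    """Split negatives (scores 2-4) into 3 random chunks and pair each chunk
    with *all* positives (scores 1, 5, 6), yielding 3 balanced datasets."""
    negatives = [s for s in samples if s["score"] in (2, 3, 4)]
    positives = [s for s in samples if s["score"] in (1, 5, 6)]
    rng = random.Random(seed)
    rng.shuffle(negatives)          # randomize before chunking
    k = 3
    chunks = [negatives[i::k] for i in range(k)]  # 3 near-equal chunks
    return [chunk + positives for chunk in chunks]
```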

Version 4

Same as Version 3, except the splitting of negative samples is done using StratifiedKFold from scikit-learn with K=3.
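A minimal sketch of how the K=3 stratified split could look. The score array and random seed are illustrative; only the use of `StratifiedKFold` itself comes from this page. Each test fold serves as one negative chunk, so every chunk preserves the 2/3/4 score ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative labels: the negative scores (2-4) that Version 4 splits.
scores = np.array([2, 3, 4] * 30)
X = np.zeros(len(scores))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# Each fold's test indices form one stratified chunk of the negatives.
chunks = [test_idx for _, test_idx in skf.split(X, scores)]
```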

Tokenizer Versions

Version 1

Base tokenizer with no added tokens.

Version 2

Base tokenizer with two added tokens:

  1. Newline token (`\n`)
  2. Double-space token (`  `)
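In Hugging Face Transformers, this would typically be `tokenizer.add_tokens(["\n", "  "])` followed by `model.resize_token_embeddings(len(tokenizer))`. A dependency-free toy illustrating the effect on the vocabulary (the vocab contents here are made up):

```python
def add_tokens(vocab: dict[str, int], new_tokens: list[str]) -> int:
    """Append tokens not already in the vocab; return how many were added
    (mirroring the return value of Hugging Face's `tokenizer.add_tokens`)."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free id
            added += 1
    return added


vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2}
num_added = add_tokens(vocab, ["\n", "  "])  # newline and double-space tokens
```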

Model Versions

Version 1

DeBERTa-V3 model with mean pooling for classification.
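Mean pooling averages the encoder's token embeddings into one sequence vector, skipping padding positions via the attention mask. A minimal NumPy sketch (the real model would do this on the DeBERTa-V3 last hidden state, typically in PyTorch):

```python
import numpy as np


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings per sequence, ignoring padding.

    hidden: (batch, seq_len, dim) last-layer hidden states
    mask:   (batch, seq_len) attention mask, 1 for real tokens, 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid divide-by-zero
    return summed / counts
```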
