Hi @warner-benjamin @ohmeow,
Nice work on the ModernBERT project — I’ve been learning a lot from it.
I have one question regarding the masking logic here:
ModernBERT/src/sequence_packer.py, lines 264 to 281 at commit 8c57a0f:
```python
(masked_batch, labels) = SequencePacker.mlm_masking(
    batch, self.mask_prob, self.mask_token_id, self.pad_token_id, self.ignore_token_id, self.np_rng
)
yieldval = {
    "input_ids": torch.from_numpy(masked_batch),
    "labels": torch.from_numpy(labels),
    "cu_seqlens": cu_seq_lens,
    "max_seqlen": max_seq_lens,
    "attention_mask": torch.from_numpy(np.where(batch == self.pad_token_id, 0, 1)),
}
self._token_count += yieldval["attention_mask"].sum().item()
# # assert isinstance(yieldval[0], torch.Tensor), f"Unexpected {type(yieldval[0])=}"
# if not self.suppress_masking:
#     assert isinstance(yieldval[1], torch.Tensor), f"Unexpected {type(yieldval[1])=}"
#     assert isinstance(yieldval[2], list), f"Unexpected {type(yieldval[2])=}"
#     if yieldval[2]:
#         assert isinstance(yieldval[2][0], torch.Tensor), f"Unexpected {type(yieldval[2][0])=}"
yield yieldval
```
From what I understand, MLM masking is applied after the sequence-packing step.
This means the masking probability is applied uniformly across the entire packed sequence (pseq), without regard to the original sample boundaries. As a result, some original samples inside a packed sequence may end up with no masked tokens at all, in which case those samples contribute nothing to the MLM loss for that step.
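To make the concern concrete, here is a minimal sketch (the sample lengths and loop below are illustrative only, not from the repo) showing how Bernoulli masking at the packed-sequence level can leave a short sample with zero masked tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
mask_prob = 0.30

# Hypothetical packed sequence built from three samples of lengths 4, 2, and 10.
sample_lengths = [4, 2, 10]
packed_len = sum(sample_lengths)

# Bernoulli mask drawn over the whole packed sequence, ignoring sample boundaries.
mask = rng.random(packed_len) < mask_prob

# Count masked tokens per original sample.
start = 0
for i, n in enumerate(sample_lengths):
    print(f"sample {i} (len={n}): {int(mask[start:start + n].sum())} masked tokens")
    start += n

# A length-2 sample has a (1 - 0.30) ** 2 = 49% chance of receiving no masks
# at all in a given step.
```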
I was curious about the intent behind applying masking at the packed-sequence level rather than per original sample.
Could you share the reasoning or trade-offs for this design choice?
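For reference, the per-sample variant I have in mind would look roughly like this (a hypothetical sketch for a single 1D packed sequence, using the cu_seqlens boundaries; `mask_per_sample` is my own name, not part of the repo):

```python
import numpy as np

def mask_per_sample(packed_len, cu_seqlens, mask_prob, rng):
    """Draw a Bernoulli mask independently inside each sample's boundaries,
    guaranteeing at least one masked token per sample."""
    mask = np.zeros(packed_len, dtype=bool)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        m = rng.random(end - start) < mask_prob
        if not m.any():
            m[rng.integers(end - start)] = True  # force one mask in this sample
        mask[start:end] = m
    return mask

rng = np.random.default_rng(0)
cu_seqlens = np.array([0, 4, 6, 16])  # sample boundaries in the packed sequence
mask = mask_per_sample(16, cu_seqlens, 0.30, rng)
```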
Thanks,
- Youngjoon Jang