
Question about masking - why mask on unpadded sequence? #242

@yjoonjang


Hi @warner-benjamin @ohmeow,
Nice work on the ModernBERT project — I’ve been learning a lot from it.

I have one question regarding the masking logic here:

(masked_batch, labels) = SequencePacker.mlm_masking(
    batch, self.mask_prob, self.mask_token_id, self.pad_token_id, self.ignore_token_id, self.np_rng
)
yieldval = {
    "input_ids": torch.from_numpy(masked_batch),
    "labels": torch.from_numpy(labels),
    "cu_seqlens": cu_seq_lens,
    "max_seqlen": max_seq_lens,
    "attention_mask": torch.from_numpy(np.where(batch == self.pad_token_id, 0, 1)),
}
self._token_count += yieldval["attention_mask"].sum().item()
# # assert isinstance(yieldval[0], torch.Tensor), f"Unexpected {type(yieldval[0])=}"
# if not self.suppress_masking:
#     assert isinstance(yieldval[1], torch.Tensor), f"Unexpected {type(yieldval[1])=}"
#     assert isinstance(yieldval[2], list), f"Unexpected {type(yieldval[2])=}"
#     if yieldval[2]:
#         assert isinstance(yieldval[2][0], torch.Tensor), f"Unexpected {type(yieldval[2][0])=}"
yield yieldval

From what I understand, masking is applied after the sequence packing step.
This means that the masking probability is applied across the entire packed sequence (pseq), without regard to the original sample boundaries. As a result, some original samples inside a packed sequence might end up with no masked tokens at all.
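To make that concern concrete, here is a small toy check of my own (not code from this repo; the mask probability and sequence lengths are made-up numbers):

import numpy as np

# Toy setup (hypothetical numbers): a packed sequence of 4 short samples.
rng = np.random.default_rng(0)
seq_lens = [12, 8, 20, 6]                   # original sample lengths
cu_seqlens = np.cumsum([0] + seq_lens)      # sample boundaries inside the packed sequence
packed_len = cu_seqlens[-1]
mask_prob = 0.3

# Packed-level masking: one Bernoulli draw over the whole packed sequence,
# ignoring sample boundaries (as I understand the current behaviour).
mask = rng.random(packed_len) < mask_prob

# Count masked tokens per original sample.
for i, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
    n_masked = mask[start:end].sum()
    print(f"sample {i}: length {end - start}, masked tokens {n_masked}")
# For short samples, n_masked can be 0, i.e. that sample contributes no MLM loss
# for this step.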

I was curious about the intent behind applying masking at the packed-sequence level rather than per original sample.
Could you share the reasoning or trade-offs for this design choice?
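For reference, what I had in mind by "per original sample" masking is roughly the following (again just my own sketch to illustrate the question, not a proposal for the actual implementation; the helper name is hypothetical):

import numpy as np

def per_sample_mlm_mask(packed_len, cu_seqlens, mask_prob, rng):
    """Draw the MLM mask independently per original sample, e.g. guaranteeing
    at least one masked token per sample (hypothetical helper, not from the repo)."""
    mask = np.zeros(packed_len, dtype=bool)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        sample_mask = rng.random(end - start) < mask_prob
        if not sample_mask.any():                      # force at least one masked token
            sample_mask[rng.integers(0, end - start)] = True
        mask[start:end] = sample_mask
    return mask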

Thanks,

  • Youngjoon Jang
