
Question about masking - why mask on unpadded sequence? #242

@yjoonjang


Hi @warner-benjamin @ohmeow,
Nice work on the ModernBERT project — I’ve been learning a lot from it.

I have one question regarding the masking logic here:

(masked_batch, labels) = SequencePacker.mlm_masking(
    batch, self.mask_prob, self.mask_token_id, self.pad_token_id, self.ignore_token_id, self.np_rng
)
yieldval = {
    "input_ids": torch.from_numpy(masked_batch),
    "labels": torch.from_numpy(labels),
    "cu_seqlens": cu_seq_lens,
    "max_seqlen": max_seq_lens,
    "attention_mask": torch.from_numpy(np.where(batch == self.pad_token_id, 0, 1)),
}
self._token_count += yieldval["attention_mask"].sum().item()
# # assert isinstance(yieldval[0], torch.Tensor), f"Unexpected {type(yieldval[0])=}"
# if not self.suppress_masking:
#     assert isinstance(yieldval[1], torch.Tensor), f"Unexpected {type(yieldval[1])=}"
#     assert isinstance(yieldval[2], list), f"Unexpected {type(yieldval[2])=}"
#     if yieldval[2]:
#         assert isinstance(yieldval[2][0], torch.Tensor), f"Unexpected {type(yieldval[2][0])=}"
yield yieldval

From what I understand, masking is applied after the sequence packing step.
This means that the masking probability is applied across the entire packed sequence (pseq), without regard to the original sample boundaries. As a result, some original samples inside a packed sequence might end up with no masked tokens at all.
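To make that concern concrete, here is a small toy check of my own (not code from this repo; the mask probability and sequence lengths are made-up numbers):

import numpy as np

# Toy setup (hypothetical numbers): a packed sequence of 4 short samples.
rng = np.random.default_rng(0)
seq_lens = [12, 8, 20, 6]                   # original sample lengths
cu_seqlens = np.cumsum([0] + seq_lens)      # sample boundaries inside the packed sequence
packed_len = cu_seqlens[-1]
mask_prob = 0.3

# Packed-level masking: one Bernoulli draw over the whole packed sequence,
# ignoring sample boundaries (as I understand the current behaviour).
mask = rng.random(packed_len) < mask_prob

# Count masked tokens per original sample.
for i, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
    n_masked = mask[start:end].sum()
    print(f"sample {i}: length {end - start}, masked tokens {n_masked}")
# For short samples, n_masked can be 0, i.e. that sample contributes no MLM loss
# for this step.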

I was curious about the intent behind applying masking at the packed-sequence level rather than per original sample.
Could you share the reasoning or trade-offs for this design choice?
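For reference, what I had in mind by "per original sample" masking is roughly the following (again just my own sketch to illustrate the question, not a proposal for the actual implementation; the helper name is hypothetical):

import numpy as np

def per_sample_mlm_mask(packed_len, cu_seqlens, mask_prob, rng):
    """Draw the MLM mask independently per original sample, e.g. guaranteeing
    at least one masked token per sample (hypothetical helper, not from the repo)."""
    mask = np.zeros(packed_len, dtype=bool)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        sample_mask = rng.random(end - start) < mask_prob
        if not sample_mask.any():                      # force at least one masked token
            sample_mask[rng.integers(0, end - start)] = True
        mask[start:end] = sample_mask
    return mask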

Thanks,

  • Youngjoon Jang
