To optimize zero-shot performance, we are taking our MLM models through LM adaptation (see #5). For now, we are considering doing this for ~10% of the pre-training steps (around ~3GT). This matches the T5 setup, but the choice is fairly arbitrary.
Ideally, we should explore what the optimal ratio of MLM to (C)LM training is: 5%, 10%, 20%, 40%? For a fixed number of tokens (~30GT), we should plot the end-task performance at different ratios of MLM to CLM training. That will give an idea of where the optimum is, if there is one.
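As a rough illustration of what the sweep looks like, here is a minimal sketch (not from the issue) that splits a fixed ~30GT budget between MLM pre-training and CLM adaptation at each candidate ratio; the constants and names are assumptions, and the actual runs would be configured in t5x.

```python
# Illustrative sketch: token split per candidate CLM-adaptation ratio,
# assuming a fixed total budget of ~30GT (30e9 tokens).
TOTAL_TOKENS = 30e9                        # assumed total training budget
CLM_FRACTIONS = [0.05, 0.10, 0.20, 0.40]   # candidate shares of CLM adaptation

for clm_frac in CLM_FRACTIONS:
    clm_tokens = TOTAL_TOKENS * clm_frac
    mlm_tokens = TOTAL_TOKENS - clm_tokens
    print(f"CLM share {clm_frac:>4.0%}: "
          f"MLM {mlm_tokens / 1e9:.1f}GT, CLM {clm_tokens / 1e9:.1f}GT")
```

Each setting would then be evaluated on the end tasks to see whether performance peaks at some intermediate ratio.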
Note that this is a nice-to-have that we should only pursue if we have enough compute budget/bandwidth.