To optimize zero-shot performance, we are taking our MLM models through LM adaptation (see #5). For now, we are considering doing this for ~10% of the pre-training steps (around ~3GT). This matches the T5 setup, but the choice is fairly arbitrary.
Ideally, we should explore what the optimal ratio of MLM to (C)LM training is: 5%, 10%, 20%, 40%? For a fixed number of tokens (~30GT), we should plot the end-task performance at different ratios of MLM to CLM training. That will give an idea of where the optimum is, if there is one.
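As a rough illustration of what the sweep looks like, here is a minimal sketch (not from the issue) that splits a fixed ~30GT budget between MLM pre-training and CLM adaptation at each candidate ratio; the constants and names are assumptions, and the actual runs would be configured in t5x.

```python
# Illustrative sketch: token split per candidate CLM-adaptation ratio,
# assuming a fixed total budget of ~30GT (30e9 tokens).
TOTAL_TOKENS = 30e9                        # assumed total training budget
CLM_FRACTIONS = [0.05, 0.10, 0.20, 0.40]   # candidate shares of CLM adaptation

for clm_frac in CLM_FRACTIONS:
    clm_tokens = TOTAL_TOKENS * clm_frac
    mlm_tokens = TOTAL_TOKENS - clm_tokens
    print(f"CLM share {clm_frac:>4.0%}: "
          f"MLM {mlm_tokens / 1e9:.1f}GT, CLM {clm_tokens / 1e9:.1f}GT")
```

Each setting would then be evaluated on the end tasks to see whether performance peaks at some intermediate ratio.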
Note that this is a nice-to-have that we should only pursue if we have enough compute budget/bandwidth.