
Determine optimal amount of LM adaptation needed #6

@slippylolo

Description


To optimize zero-shot performance, we are taking our MLM models through LM adaptation (see #5). For now, we are considering doing this for ~10% of the pre-training steps (~3GT). This is the same setup as T5, but the choice is fairly arbitrary.

Ideally, we should explore the optimal ratio of MLM to (C)LM training: 5%, 10%, 20%, or 40%? For a fixed number of tokens (~30GT), we should plot end-task performance at different ratios of MLM to CLM training. That will give us an idea of the optimum, if there is one (see the sketch below).
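
To make the sweep concrete, here is a minimal sketch of the configurations we would compare, assuming the ~30GT total budget mentioned above. The fractions are just the candidates listed in this issue, and the actual MLM pre-training, LM adaptation, and zero-shot evaluation would use the pipeline from #5:

```python
# Sweep plan: for a fixed ~30GT budget, split tokens between the MLM phase
# and the (C)LM adaptation phase at several candidate ratios.
TOTAL_TOKENS_GT = 30.0  # fixed total token budget (~30GT), as in the issue text

# Candidate fractions of the budget spent on (C)LM adaptation.
clm_fractions = [0.05, 0.10, 0.20, 0.40]

for frac in clm_fractions:
    mlm_tokens_gt = TOTAL_TOKENS_GT * (1.0 - frac)
    clm_tokens_gt = TOTAL_TOKENS_GT * frac
    # Each configuration would be trained with the setup from #5 and then
    # evaluated on the zero-shot end tasks; plotting score vs. `frac`
    # shows whether an optimum exists.
    print(f"CLM fraction {frac:.0%}: MLM {mlm_tokens_gt:.1f}GT, CLM {clm_tokens_gt:.1f}GT")
```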

Note that this is a nice-to-have that we should only pursue if we have enough compute budget/bandwidth.
