Replace AdamW with Muon as the optimizer

PyTorch core appears to be adding Muon as a built-in optimizer option, according to [Soumith Chintala's X post](https://x.com/soumithchintala/status/1945297225988354315).

It might be worth trying once it lands: Muon is reportedly almost twice as training-efficient as Adam, though that figure is unverified.
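
One thing to plan for: as far as I understand, Muon is meant for 2D hidden-layer weight matrices, while embeddings, the output head, biases, and norm parameters usually stay on AdamW. So the swap is likely a parameter split rather than a one-line change. Below is a rough sketch of that split in plain Python. The helper name `split_params_for_muon`, the name hints, and the example parameter names and shapes are all hypothetical; in a real model the names and shapes would come from something like `model.named_parameters()`.

```python
from typing import Dict, List, Sequence, Tuple


def split_params_for_muon(
    shapes: Dict[str, Tuple[int, ...]],
    adamw_name_hints: Sequence[str] = ("embed", "lm_head"),  # assumed naming scheme
) -> Tuple[List[str], List[str]]:
    """Route 2D hidden weight matrices to Muon and everything else to AdamW."""
    muon: List[str] = []
    adamw: List[str] = []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        # Embeddings and the output head are 2D but are kept on AdamW here.
        is_excluded = any(hint in name for hint in adamw_name_hints)
        if is_matrix and not is_excluded:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


# Hypothetical parameter names and shapes, for illustration only.
example_shapes = {
    "embed.weight": (50257, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.attn.qkv.bias": (2304,),
    "blocks.0.ln.weight": (768,),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "lm_head.weight": (50257, 768),
}

muon_names, adamw_names = split_params_for_muon(example_shapes)
print("Muon: ", muon_names)
print("AdamW:", adamw_names)
```

The two name lists would then feed two optimizers (Muon for the first group, the existing AdamW for the second), with the exact constructor arguments taken from whatever API ships in PyTorch.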