
In brief: Moonshot AI (月之暗面) has open-sourced a MoE model with 15.29B total parameters and 2.24B activated parameters, trained with the Muon optimizer on 5.7T tokens, achieving strong results.
Github:https://github.com/MoonshotAI/Moonlight
HF:https://huggingface.co/moonshotai/Moonlight-16B-A3B
Paper:https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf
Results: in scaling-law experiments comparing Muon against Adam, Muon achieved roughly 2× the sample efficiency of Adam.
The core of the Muon optimizer is orthogonalizing each 2-D weight update: it keeps an SGD-style momentum buffer and runs a Newton-Schulz iteration on it before applying the update.
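A minimal NumPy sketch of this idea, assuming the quintic Newton-Schulz coefficients from the open-source Muon reference implementation and the 0.2·sqrt(max(n, m)) update scaling plus decoupled weight decay described for Moonlight (function and variable names here are illustrative, not the repo's API):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes every singular value of G
    # toward 1 without an explicit (expensive) SVD. Coefficients follow
    # the open-source Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # keep the short side on the left for a cheap X @ X.T
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(weight, grad, momentum, lr=2e-2, beta=0.95, wd=0.1):
    # One Muon step (sketch): momentum accumulation, Nesterov-style lookahead,
    # orthogonalization, then a 0.2 * sqrt(max(n, m)) scaling and decoupled
    # weight decay, as in Moonlight's variant of Muon.
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(grad + beta * momentum)
    scale = 0.2 * np.sqrt(max(weight.shape))  # match AdamW's typical update RMS
    weight = weight * (1.0 - lr * wd) - lr * scale * update
    return weight, momentum
```

After a few iterations the singular values of the update all land near 1, so every direction in the gradient contributes at a similar magnitude, which is the intuition behind Muon's sample-efficiency gain.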
Moonlight-16B-A3B uses the same model architecture as DeepSeek-V3.
Quick start with Hugging Face Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "moonshotai/Moonlight-16B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
    {"role": "user", "content": "Is 123 a prime?"},
]
# Render the chat template and generate a response
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
This article is reposted from NLP工作站; author: 刘聪NLP.