✨
AI Summary
- DeepSeek V3 and Mistral Large both deploy 128-expert MoE architectures with shared vocabulary (129K) and embeddings (7,168)
- DeepSeek V3 activates 1 shared + 6 experts per token (37B active parameters) versus alternative allocation strategies
- Initial dense FFN blocks precede MoE layers in both 670+ billion parameter models, optimizing early computation