AI Summary
- Introduces Group Sequence Policy Optimization (GSPO) for LLM training.
- Contrasts GSPO with GRPO, whose token-level importance sampling can make training unstable.
- Defines the importance ratio from the likelihood of the entire sequence, which improves training stability.
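
To make the contrast in the summary concrete, here is a minimal sketch of the two kinds of importance ratio, computed from per-token log-probabilities under a new and an old policy. The function names and the `length_normalize` flag are hypothetical, chosen for illustration only. The length normalization (dividing the summed log-ratio by the number of tokens) is an assumption not stated in the summary above.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """Return one importance ratio per token (the token-level quantity)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps, length_normalize=True):
    """Return a single importance ratio for the whole sequence.

    The summed per-token log-probability difference is the log of the
    sequence likelihood ratio. Dividing by the length (an assumption in
    this sketch) turns it into the geometric mean of the token ratios.
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob lists must be non-empty and equal in length")
    log_ratio = sum(n - o for n, o in zip(new_logps, old_logps))
    if length_normalize:
        log_ratio /= len(new_logps)
    return math.exp(log_ratio)


if __name__ == "__main__":
    # Hypothetical log-probabilities for a three-token response.
    new = [-1.0, -2.0, -0.5]
    old = [-1.2, -1.5, -0.8]
    print(token_level_ratios(new, old))
    print(sequence_level_ratio(new, old))
```

In this toy input the per-token ratios swing above and below 1, while the per-token log differences cancel, so the sequence-level ratio stays close to 1. That is the intuition behind using one ratio per sequence: noise in individual tokens averages out instead of being applied token by token.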