AI Summary
- The Medmarks v0.1 benchmark includes reasoning-heavy tasks such as MedXpertQA, now that models have saturated earlier medical AI benchmarks
- Open-weight models such as Qwen3 match frontier accuracy but use 5-6x as many tokens; reasoning post-training yields Pareto improvements
- Order bias was found even in frontier models; medical-specialized models were compared against large generalist models in RL environments