Episodes (Page 8)
✨
NS-NAC algorithm for non-stationary environments.
✨
Offline RL for personalized policies from diverse data.
✨
SeRA mitigates spurious correlations in RLHF.
✨
AXIOM uses object-centric models and active inference.
✨
Policy entropy declines rapidly in RL for LLMs, limiting exploration.
✨
FLEX enables robot-agnostic, force-based manipulation learning.
✨
ZeroTIR trains LLMs to use Python for math via RL.
✨
RLVR may not fundamentally improve LLM reasoning beyond base models.
✨
Reward model quality, not just accuracy, impacts RLHF efficiency.
✨
Graph RL optimizes power grid control with masked actions.
✨
LGTC-IPPO uses dynamic cluster agreements for decentralized resource allocation.
✨
RL enables humanoid robots for dexterous manipulation using vision.
✨
µCODE generates code iteratively using single-step execution rewards.
✨
CRPO improves machine translation data selection for LLMs.
✨
MiCRo framework learns diverse human preferences for LLMs.
✨
ProRL enhances LLM reasoning with KL divergence control.
✨
Open CaptchaWorld benchmarks multimodal AI agents.
✨
ProxyThinker guides large models with small reasoners.
✨
DexMachina enables functional bimanual manipulation.
✨
3DMEM-BENCH advances embodied AI with a dual-memory system.