VLA Memory
RL-trained memory for long-horizon robot manipulation
A hierarchical VLM + VLA system for robots that have to remember across tasks. A Qwen2.5-VL-7B planner is fine-tuned with GRPO to pick the keyframes that actually matter and issue the next subtask; a frozen π₀.₅ policy carries out the low-level motion.
- Swaps the imitation learning used in MemER (ICLR 2026) for reinforcement learning. The planner is rewarded for whether the task gets done, not for copying which keyframes a human looked at.
- Same architecture, backbone, and benchmark as the prior work, so the training algorithm is the only thing that changes.
- GRPO loop: sample several rollouts per prompt with the frozen policy, score each on episode success, and update the planner using a group-normalized advantage and a KL penalty that anchors it to the supervised model.
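
The update in the last bullet can be sketched in a few lines of plain Python. This is a minimal illustration and not the project's training code. It assumes binary episode-success rewards, one summed log-probability per rollout (real implementations usually work per token), and hypothetical values for `clip_eps` and `kl_coef`. All function names are made up for the example.

```python
import math
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def kl_to_reference(logp: float, logp_ref: float) -> float:
    """Non-negative single-sample KL estimate: exp(d) - d - 1 with d = logp_ref - logp."""
    diff = logp_ref - logp
    return math.exp(diff) - diff - 1.0


def grpo_objective(
    logp_new: Sequence[float],   # rollout log-probs under the planner being updated
    logp_old: Sequence[float],   # log-probs under the planner that sampled the rollouts
    logp_ref: Sequence[float],   # log-probs under the supervised reference model
    rewards: Sequence[float],    # assumed: 1.0 for episode success, 0.0 otherwise
    clip_eps: float = 0.2,       # hypothetical clipping range
    kl_coef: float = 0.04,       # hypothetical KL weight
) -> float:
    """Clipped surrogate minus KL penalty, averaged over the group (to be maximized)."""
    advantages = group_normalized_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * adv, clipped * adv)
        total += surrogate - kl_coef * kl_to_reference(lp_new, lp_ref)
    return total / len(rewards)


if __name__ == "__main__":
    # Four rollouts for one prompt, two of which completed the task.
    print(group_normalized_advantages([1.0, 0.0, 0.0, 1.0]))
```

With two successes out of four rollouts, the successful ones get an advantage close to +1 and the failed ones close to -1. A group where every rollout has the same outcome yields all-zero advantages, so it contributes no learning signal beyond the KL term.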
GRPO · Reinforcement Learning · Qwen2.5-VL · π₀.₅ · JAX / openpi · Modal · A100