Figure: given the compound instruction "put all food items in the bowl," TRA adds a time-contrastive loss for learning task representations to use with a goal- and language-conditioned policy. While TRA implicitly decomposes the task into steps and executes them one by one, the behavioral cloning (BC) and offline RL (AWR) baselines fail at this compositional task. The structured representations learned by TRA enable this compositional behavior without explicit planning or hierarchical structure.
Abstract
Effective task representations should facilitate compositionality: after learning a variety of basic tasks, an agent should be able to perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. The resulting approach generalizes to novel composite tasks specified as goal images or language instructions, without any additional reward supervision. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
Method
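TRA learns task representations by aligning the representation of the current state with representations of states reached later in the same trajectory, using a time-contrastive loss; the aligned representations then condition a goal- and language-conditioned policy. Below is a minimal sketch of such a temporal alignment loss, written as an InfoNCE-style objective. This is an illustration under stated assumptions, not the authors' implementation: the encoder outputs `phi_s` and `psi_f` and the in-batch negative scheme are assumptions.

```python
# Illustrative sketch of a time-contrastive (temporal alignment) loss.
# Assumed inputs (not from the paper's code): `phi_s` holds encoded
# current states and `psi_f` holds encoded future states, where row i
# of each array comes from the same trajectory and forms a positive
# pair; all other rows in the batch serve as negatives.
import jax
import jax.numpy as jnp


def temporal_alignment_loss(phi_s: jnp.ndarray, psi_f: jnp.ndarray) -> jnp.ndarray:
    """InfoNCE-style loss aligning current-state representations with
    representations of future states from the same trajectory.

    Args:
        phi_s: (batch, dim) representations of current states.
        psi_f: (batch, dim) representations of future states, where
            psi_f[i] is sampled from the future of the trajectory
            containing phi_s[i].
    """
    # Pairwise similarities; diagonal entries are the positive pairs.
    logits = phi_s @ psi_f.T  # (batch, batch)
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Maximize the log-likelihood of each state's own future representation.
    return -jnp.mean(jnp.diagonal(log_probs))
```

Pairing each state with a future state from its own trajectory and treating the rest of the batch as negatives is the standard InfoNCE form of a time-contrastive objective; the resulting representations carry the temporal structure that the paper credits for compositional behavior.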
Real-World Results
Take the mushroom out of the drawer
Put the food items into the drawer
Put the blue objects into the pan
Fold the towel into its center
Move the pepper to the right, and then sweep the towel to the top
Put the corn on the plate, and then put the sushi in the pot
Put everything into the bowl
Close the drawer
Simulation Results
We evaluate TRA in OGBench, a challenging benchmark for offline goal-conditioned RL. TRA outperforms behavior cloning across the board and performs strongly against other non-hierarchical reinforcement learning methods on the stitch datasets, which specifically test the ability to compose segments of different trajectories.
BibTeX
@misc{myers2025temporal,
  author = {Myers, Vivek and Zheng, Bill Chunyuan and Dragan, Anca and Fang, Kuan and Levine, Sergey},
  eprint = {2502.05454},
  eprinttype = {arXiv},
  howpublished = {arXiv:2502.05454},
  title = {{Temporal Representation Alignment}: {Successor Features Enable Emergent Compositionality} in {Robot Instruction Following}},
  url = {https://arxiv.org/abs/2502.05454},
  year = {2025},
}