Temporal Representation Alignment: Emergent Compositionality in Instruction Following with Successor Features

Vivek Myers*¹, Bill Chunyuan Zheng*¹, Anca Dragan¹, Kuan Fang², Sergey Levine¹
*Equal contribution  ¹University of California, Berkeley  ²Cornell University
Figure: We show our Temporal Representation Alignment (TRA) method performing the language-specified task "put all food items in the bowl." TRA adds a time-contrastive loss for learning task representations to use with a goal- and language-conditioned policy. While TRA implicitly decomposes the task into steps and executes them one by one, the behavioral cloning (BC) and offline RL (AWR) baselines fail at this compositional task. The structured representations learned by TRA enable this compositional behavior without explicit planning or hierarchical structure.

Abstract

Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. This approach generalizes to novel composite tasks specified as goal images or language instructions, without requiring any additional reward supervision. We evaluate our approach on a diverse set of real-world robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.

Method
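TRA learns task representations with a temporal alignment objective: a time-contrastive loss that pulls the representation of the current state toward the representations of future states from the same trajectory, in the spirit of successor features. A goal- and language-conditioned policy is then trained on top of these aligned representations. Below is a minimal sketch of such a time-contrastive (InfoNCE-style) alignment loss; the function name, array shapes, and the use of in-batch negatives are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def temporal_alignment_loss(phi_s: np.ndarray, psi_f: np.ndarray,
                            temperature: float = 1.0) -> float:
    """Time-contrastive (InfoNCE-style) alignment loss -- illustrative sketch.

    phi_s: (B, D) embeddings of current states.
    psi_f: (B, D) embeddings of future states; row i is sampled from the
           same trajectory as row i of phi_s (the positive pair), and the
           remaining rows act as in-batch negatives.
    """
    # Pairwise similarities between current- and future-state embeddings.
    logits = phi_s @ psi_f.T / temperature               # (B, B)
    # Softmax cross-entropy against the diagonal: each state should score
    # its own trajectory's future above futures drawn from other rows.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss aligns each state's representation with the representations of states reachable from it; the paper argues that this temporal structure is what lets a compound instruction be executed step by step at test time, without explicit planning or hierarchy.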

Real-World Results

Language-specified tasks evaluated on the real robot:

- Take the mushroom out of the drawer
- Put the food items into the drawer
- Put the blue objects into the pan
- Fold the towel into its center
- Move the pepper to the right, and then sweep the towel to the top
- Put the corn on the plate, and then put the sushi in the pot
- Put everything into the bowl
- Close the drawer

Simulation Results

We evaluate TRA in OGBench, a benchmark of challenging simulated environments for offline goal-conditioned RL. TRA outperforms behavior cloning across the standard datasets, and performs strongly against other non-hierarchical offline RL methods on the stitch datasets, which require composing behavior segments from different trajectories.

Figure: Left: Comparison of TRA with baselines in OGBench. Right: The $\texttt{cube-single}$ and $\texttt{humanoidmaze}$ environments.
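For context, OGBench ships each environment together with its offline datasets through a single helper. A minimal sketch of loading one of the environments shown above; the dataset variant name follows OGBench's naming scheme and is an assumption here, not necessarily the exact split used in the paper:

```python
import ogbench

# Load an OGBench environment plus its offline train/validation datasets.
# 'cube-single-play-v0' follows OGBench's naming scheme; the paper's exact
# dataset variant is an assumption here.
env, train_dataset, val_dataset = ogbench.make_env_and_datasets('cube-single-play-v0')

obs, info = env.reset()                     # Gymnasium-style interface
print(train_dataset['observations'].shape)  # (num_transitions, obs_dim)
```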

BibTeX

@misc{myers2025temporal,
    author       = {Myers, Vivek and Zheng, Bill Chunyuan and Dragan, Anca and Fang, Kuan
                    and Levine, Sergey},
    eprint       = {2502.05454},
    eprinttype   = {arXiv},
    howpublished = {arXiv:2502.05454},
    title        = {{Temporal Representation Alignment}: {Successor Features Enable
                    Emergent Compositionality} in {Robot Instruction Following}},
    url          = {https://arxiv.org/abs/2502.05454},
    year         = {2025},
}