Self-Improving Embodied Foundation Models

Appearing at NeurIPS 2025: Two-stage post-training for Embodied Foundation Models (EFMs)

  • Stage 1) Supervised Fine-Tuning (behavioral cloning + steps-to-go prediction)
  • Stage 2) Self-Improvement via online RL with self-predicted rewards and success detection
Method overview: SFT (BC + steps-to-go) followed by Self-Improvement with reward shaping and success detection
Source: Figure 1 in paper.
Abstract

Foundation models trained on web-scale data have revolutionized robotics, but their application to low-level control remains largely limited to behavioral cloning. Drawing inspiration from the success of the reinforcement learning stage in fine-tuning large language models, we propose a two-stage post-training approach for robotics. The first stage, Supervised Fine-Tuning (SFT), fine-tunes pretrained foundation models using both: a) behavioral cloning, and b) steps-to-go prediction objectives. In the second stage, Self-Improvement, steps-to-go prediction enables the extraction of a well-shaped reward function and a robust success detector, enabling a fleet of robots to autonomously practice downstream tasks with minimal human supervision. Through extensive experiments on real-world and simulated robot embodiments, our novel post-training recipe unveils significant results on Embodied Foundation Models. First, we demonstrate that the combination of SFT and Self-Improvement is significantly more sample-efficient than scaling imitation data collection for supervised learning, and that it leads to policies with significantly higher success rates. Further ablations highlight that the combination of web-scale pretraining and Self-Improvement is the key to this sample-efficiency. Next, we demonstrate that our proposed combination uniquely unlocks a capability that current methods cannot achieve: autonomously practicing and acquiring novel skills that generalize far beyond the behaviors observed in the imitation learning datasets used during training. These findings highlight the transformative potential of combining pretrained foundation models with online Self-Improvement to enable autonomous skill acquisition in robotics.

Overview

Method — Two-stage post-training

Stage 1 — Supervised Fine-Tuning (SFT)

Fine-tune an Embodied Foundation Model (EFM) initialized from a web-scale pretrained multimodal foundation model with two objectives:

Behavioral Cloning (BC) loss. \[\mathcal{L}_{\mathrm{BC}}(\mathrm{EFM})\;=\; -\,\mathbb{E}_{(o_t,a_t,g_{t'})\sim\mathcal{D}}\,\Big[\,\log p^{\mathrm{EFM}}_{\mathrm{action}}(a_t\mid o_t, g_{t'})\,\Big]\]

Steps-to-go loss. \[\mathcal{L}_{\mathrm{steps\text{-}to\text{-}go}}(\mathrm{EFM})\;=\; -\,\mathbb{E}_{(o_t,a_t,g_{t'})\sim\mathcal{D}}\,\Big[\,\log p^{\mathrm{EFM}}_{\mathrm{steps\text{-}to\text{-}go}}(\,t' - t\mid o_t, g_{t'}\,)\,\Big]\]
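
To make the two objectives concrete, here is a minimal sketch of the combined SFT losses, assuming a hypothetical efm object that exposes log-probability heads for actions and for discretized steps-to-go; the interface and names are illustrative, not the paper's implementation.

import numpy as np

def sft_losses(efm, batch):
    """Monte Carlo estimates of the two SFT objectives over one batch.

    Assumed (hypothetical) interface:
      efm.action_log_prob(obs, goal, action)         -> log p_action(a_t | o_t, g_t')
      efm.steps_to_go_log_prob(obs, goal, num_steps) -> log p_steps-to-go(t' - t | o_t, g_t')
    Each element of `batch` is (o_t, a_t, g_t', t, t_prime) sampled from the dataset D.
    """
    bc_terms, steps_terms = [], []
    for obs, action, goal, t, t_prime in batch:
        bc_terms.append(-efm.action_log_prob(obs, goal, action))               # L_BC term
        steps_terms.append(-efm.steps_to_go_log_prob(obs, goal, t_prime - t))  # L_steps-to-go term
    return float(np.mean(bc_terms)), float(np.mean(steps_terms))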

Stage 2 — Self-Improvement (Online RL)

Self-predicted rewards and a success detector enable robots to autonomously practice downstream tasks and improve online with minimal supervision.

Expected steps-to-go. \[ d(o,g)\;:=\;\mathbb{E}_{\;p^{\mathrm{EFM}}_{\mathrm{steps\text{-}to\text{-}go}}(\,\text{steps-to-go}\mid o,g\,)}\big[\,\text{steps-to-go}\,\big] \]

Reward function. \[ r(o_t, a_t, o_{t+1}, g) = d(o_t, g) - d(o_{t+1}, g) \]

Success detector. \[ \mathrm{success}(o, g) = \mathbb{1}[\, d(o, g) \le s \,] \]

This eliminates manual reward engineering and, combined with web-scale pretraining, enables behavioral generalization beyond imitation data.
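
As a minimal sketch (our assumptions, not the released code): given the categorical distribution predicted by the steps-to-go head, the expected steps-to-go, shaped reward, and success signal can be computed as follows; the threshold value is illustrative.

import numpy as np

def expected_steps_to_go(probs):
    """d(o, g): expectation of the predicted categorical steps-to-go distribution.
    probs[k] is the predicted probability that k steps remain."""
    return float(np.dot(np.arange(len(probs)), probs))

def shaped_reward(probs_t, probs_tp1):
    """r(o_t, a_t, o_{t+1}, g) = d(o_t, g) - d(o_{t+1}, g)."""
    return expected_steps_to_go(probs_t) - expected_steps_to_go(probs_tp1)

def is_success(probs, threshold_s=2.0):
    """success(o, g) = 1[d(o, g) <= s]; the threshold here is an illustrative value."""
    return expected_steps_to_go(probs) <= threshold_s

# Example: predictions sharpen toward zero steps-to-go as the policy makes progress.
p_t   = np.array([0.0, 0.1, 0.2, 0.7])   # mostly 3 steps remaining, d ≈ 2.6
p_tp1 = np.array([0.2, 0.6, 0.2, 0.0])   # mostly 1 step remaining,  d = 1.0
print(shaped_reward(p_t, p_tp1))          # positive reward: 1.6
print(is_success(p_tp1))                  # True under the illustrative threshold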

Key Results

Stage 2 Self-Improvement deltas.

KR1 Self-Improvement significantly improves policy performance beyond Supervised Fine-Tuning (SFT), and the combination of SFT + Self-Improvement is much more sample-efficient than supervised learning alone.

In the Simulated LanguageTable domain, starting from a behavioral cloning policy with a success rate of 45%, we observe:

  • Collecting 10% additional episodes during Online Self-Improvement (Active RL): 45% → 75%
  • vs. Collecting 8x Imitation Learning Data for Supervised Fine-Tuning (Passive Behavioral Cloning): 45% → 60%

KR2 Self-Improvement is robust and effective for real-world robot learning.

In the Real-World LanguageTable domain, starting from a behavioral cloning policy with a success rate of 63%, we observe:

  • Collecting 15% additional episodes during Online Self-Improvement (Active RL): 63% → 87.5%
  • vs. Collecting 4x Imitation Learning Data for Supervised Fine-Tuning (Passive Behavioral Cloning): 63% → 62%

KR3 Online Self-Improvement + web-scale pretraining enables policies to rapidly acquire new skills that generalize far beyond imitation datasets.

Real2Sim

Generalization across simulation and the real world with Self-Improvement.

Real2Sim evaluation environments.

Strong Generalization — BananaTable

Acquiring novel skills beyond the imitation datasets: LanguageTable → BananaTable.

We start from a policy and reward model that have only seen the LanguageTable dataset, and have never seen a banana, nor the LanguageTable environment without blocks. After 8 hours of Self-Improvement, the model rapidly acquires the challenging novel skill of effectively moving the banana around the table.

Before Self-Improvement (~63% success rate)
After 8 hours of Self-Improvement (~85% success rate)

KR4 Multimodal pretraining is a key enabler of sample-efficiency and stronger Self-Improved policies.

We ablate alternative reward-model variants. Our results demonstrate the significant value of web-scale multimodal pretraining, particularly at smaller dataset sizes.

  • PaLI / PaLI-X: pretrained multimodal foundation models (arXiv).
  • Uni-PaLI: vision and language components trained unimodally, without joint training.
  • Scratch: no pretraining, same architecture.
Effect of multimodal pretraining of the reward model on Simulated LanguageTable.
Effect of multimodal pretraining of the reward model on Real2Sim transfer.
Intuition

Visual Intuition

We formulate steps-to-go prediction as a discrete prediction task. The first two figures below visualize the model's predicted distribution over steps-to-go at interesting moments, and the third plots the expectation \( \mathbb{E}[\text{steps-to-go}] \) over time for a full episode. All figures are from the Aloha Single Insertion task.
In the first frame, the policy is about to successfully insert the peg and complete the task, so the model predicts that with high likelihood the policy will succeed soon. However, in the next frame the policy lets go of the peg too soon and the peg is about to fall. Thus the predicted steps-to-go widens drastically into a multimodal distribution, considering the spectrum of possibilities from a quick recovery to longer recovery times. As the policy recovers in the fourth and fifth frames, the model's prediction narrows back to a unimodal distribution, with high likelihood of success in the near horizon.
In the first two frames the policy is on track to successfully complete the task, so the model predicts that with high likelihood the policy will succeed soon. However, in the third frame the socket begins to slip out of the left gripper. Despite this slippage being barely visible from the left wrist camera, and not visible in any of the other camera views, the model immediately picks up on this event and its predictions widen significantly with multiple modes. Specifically, the model places some probability mass on an immediate save, and distributes the rest of the probability mass over a range of possible recovery times. In the fourth and fifth frames the socket fully slips out of the gripper, so the model removes the probability mass on the immediate save outcome.
An example trajectory from the Aloha Single Insertion Task and a plot representing \( \mathbb{E}[\text{steps-to-go}] \) under the model's prediction (i.e., \( d(o, g) \)). Key moments: 1) Model believes the episode is about to complete successfully, 2) Policy accidentally drops the peg and \( d(o, g) \) increases, 3) Policy regrasps the peg from a bad angle not suitable for insertion so \( d(o, g) \) remains high, 4) Policy drops the peg, providing an opportunity to regrasp correctly which reduces \( d(o, g) \), 5) Policy is pushing the peg inside and \( d(o, g) \) marks that the policy is about to succeed, 6) The right hand knocks the socket out of the left hand's grip which increases \( d(o, g) \).
Aloha Steps-to-Go Predictions
LanguageTable Steps-to-Go Predictions
BananaTable Steps-to-Go Predictions
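
A small numerical illustration (ours, not from the paper) of the behavior described above: once a failure becomes plausible, probability mass spreads over longer recovery horizons and the expectation \( d(o, g) \) jumps.

import numpy as np

support = np.arange(50)  # discretized steps-to-go bins

def gaussian(mu, sigma):
    p = np.exp(-0.5 * ((support - mu) / sigma) ** 2)
    return p / p.sum()

unimodal = gaussian(4, 1.5)                                 # confident: about 4 steps left
bimodal  = 0.4 * gaussian(6, 2.0) + 0.6 * gaussian(30, 5.0)  # quick save vs. slow recovery
print(support @ unimodal)   # ~4:  d(o, g) is small, success is imminent
print(support @ bimodal)    # ~20: d(o, g) jumps once longer recoveries become likely
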
Intuition

Mathematical Intuition

We use the expectation of the model's steps-to-go predictions:

\[ d(o, g) := \mathbb{E}_{\;p_{\text{steps-to-go}}(\,\text{steps-to-go}\mid o, g\,)}\big[\,\text{steps-to-go}\,\big] \]

To define the reward function as the improvement in steps-to-go:

\[ r(o_t, a_t, o_{t+1}, g) = d(o_t, g) - d(o_{t+1}, g) \]

This also lets us define success detection via thresholding steps-to-go predictions:

\[ \mathrm{success}(o, g) = \mathbb{1}\,[\, d(o, g) \le s \,] \]

Letting \( \mu \) be the policy corresponding to the imitation dataset (e.g., human dataset), we can define the value function of \( \mu \) as follows:

\[ V^{\mu}(o_t, g) = \mathbb{E}_{\mu}\Big[ \sum_{i=t}^{T} -\, \mathbb{1}\big[\, o_i\ \text{does not satisfy}\ g \,\big] \Big] = \mathbb{E}_{\mu}\big[ -\, \text{steps-to-go} \big] =: -\, d(o_t, g) \]

This enables us to decompose our proposed reward function:

\[ r(o_t, a_t, o_{t+1}, g) = V^{\mu}(o_{t+1}, g) - V^{\mu}(o_t, g) = \underbrace{(1 - \gamma)\, V^{\mu}(o_{t+1}, g)}_{\text{core reward}} \;+\; \underbrace{\left[ \gamma\, V^{\mu}(o_{t+1}, g) - V^{\mu}(o_t, g) \right]}_{\text{reward shaping}} \]

Self-Improvement leads to policies that achieve intended goals more efficiently than the dataset policy \( \mu \), while being implicitly regularized to stay close to regions of the state space where \( \mu \) is proficient.

Simplifying the Monte Carlo returns we observe a built-in baseline subtraction:

\[ R_t = \sum_{i=t}^{T} \gamma^{\, i-t}\; r(o_i, a_i, o_{i+1}, g) \;=\; \Big[\, (1-\gamma)\, \sum_{i=t}^{T} \gamma^{\, i-t}\; V^{\mu}(o_{i+1}, g) \,\Big] \; - \; \underbrace{V^{\mu}(o_t, g)}_{\text{baseline}} \]

The built-in baseline subtraction leads to lower-variance estimates, enabling us to use REINFORCE for policy improvement.
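
A minimal sketch (not the paper's training code) of how the self-predicted shaped rewards and discounted returns enter a REINFORCE-style update; log_probs and d_values are hypothetical inputs, coming from the policy and the steps-to-go model respectively.

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_i gamma^(i - t) * r_i, computed backwards over one episode."""
    returns, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_objective(log_probs, d_values, gamma=0.99):
    """Monte Carlo REINFORCE objective on self-predicted shaped rewards.

    d_values[t] approximates d(o_t, g); the shaped reward
    r_t = d(o_t, g) - d(o_{t+1}, g) telescopes so that -d(o_t, g)
    effectively plays the role of the built-in baseline discussed above.
    """
    rewards = np.asarray(d_values[:-1]) - np.asarray(d_values[1:])
    returns = discounted_returns(rewards, gamma)
    # Maximize E[ sum_t R_t * log pi(a_t | o_t, g) ] with respect to policy parameters.
    return float(np.mean(returns * np.asarray(log_probs)))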

Intuition

PointMass Domain

We provide a self-contained Colab notebook that demonstrates Self-Improvement on a pointmass navigation domain. Each episode starts from a random position and aims for a randomly sampled goal. We intentionally construct a sub-optimal imitation dataset by using a PD-controller that visits five intermediate waypoints before heading to the goal. We then fine-tune an MLP policy and a steps-to-go prediction model with our two-stage recipe. As expected, Stage 1 BC mimics the dataset’s sub-optimalities, while Stage 2 Self-Improvement (without ground-truth rewards) quickly drives policies close to optimal. The figure below shows sample trajectories from the dataset, BC (Stage 1), and Self-Improved (Stage 2). The videos below show the Stage 1 and Stage 2 policies.

Stage 1 SFT Policy
Stage 2 Self-Improvement Policy
Pointmass Navigation Domain
Pointmass Navigation Domain. Sample trajectories from the imitation learning dataset, as well as BC (Stage 1) and Self-Improved (Stage 2) policies.
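
For concreteness, here is a minimal sketch of the kind of sub-optimal waypoint controller described above, written under our own assumptions rather than copied from the Colab; it generates one demonstration that detours through random waypoints before heading to the goal.

import numpy as np

def pd_action(pos, vel, target, kp=2.0, kd=0.5):
    """Simple PD-controller steering the pointmass toward `target`."""
    return kp * (target - pos) - kd * vel

def suboptimal_episode(start, goal, n_waypoints=5, dt=0.05, steps=400, rng=None):
    """One sub-optimal demonstration: visit random waypoints, then the goal."""
    rng = np.random.default_rng() if rng is None else rng
    waypoints = [w for w in rng.uniform(-1.0, 1.0, size=(n_waypoints, 2))] + [np.asarray(goal, float)]
    pos, vel, traj = np.asarray(start, float).copy(), np.zeros(2), []
    for _ in range(steps):
        if np.linalg.norm(waypoints[0] - pos) < 0.05 and len(waypoints) > 1:
            waypoints.pop(0)                     # reached a waypoint, move on to the next
        action = pd_action(pos, vel, waypoints[0])
        vel = vel + dt * action                  # simple Euler integration
        pos = pos + dt * vel
        traj.append((pos.copy(), action.copy()))
    return traj
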
Paper & BibTeX

Resources

PDF arXiv
Cite as
@inproceedings{self_improving_efms_2025,
  title={Self-Improving Embodied Foundation Models},
  author={Seyed Ghasemipour, Seyed Kamyar and Wahid, Ayzaan and Tompson, Jonathan and Sanketi, Pannag and Mordatch, Igor},
  booktitle={NeurIPS},
  year={2025},
  note={Appearing in NeurIPS 2025},
  url={https://arxiv.org/abs/2509.15155}
}
Team

Authors

Seyed Kamyar Seyed Ghasemipour (Generalist), Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, Igor Mordatch (Google DeepMind)

Contact: kamyar@generalistai.com

Equal supervision noted in the paper.

FAQ

Common questions

What is steps-to-go? The model's prediction of the number of remaining steps to reach the goal. Its decrease over time provides a dense progress signal.

How are rewards computed? By differences in steps-to-go: r = d(o, g) − d(o′, g).

How is success detected? When d(o, g) ≤ s, providing a principled termination signal.

Why not just use more imitation data? Shaped signals and online practice deliver larger gains with less data collection.

Which embodiments? LanguageTable and Aloha, in both simulation and the real world.

Limitations? Miscalibrated d estimates and inference-latency constraints, addressed via thresholds, filtering, and local inference (Infra v2).