RT-2

Model (active)

RT-2 (Robotic Transformer 2) is a vision-language-action (VLA) model developed by Google DeepMind that transfers web-scale knowledge to robotic control. Published in July 2023 as the successor to RT-1 (Robotic Transformer 1), it showed that a large pre-trained vision-language model (VLM) can be co-fine-tuned on web-scale vision-language data and robot trajectory data to produce a policy that controls a robot while retaining its internet-scale semantic knowledge.

The key insight of RT-2 is that robot actions can be represented as text tokens. Each continuous action dimension is discretized into 256 bins and the action is written as a string of integers (e.g., "1 128 91 241 5 101 127 217"), so a standard VLM can be fine-tuned to output action tokens in the same way it outputs language tokens. This lets the model apply its pre-trained understanding of objects, scenes, and instructions to manipulation tasks that never appeared in the robot data.

RT-2 showed emergent capabilities: it could follow commands involving objects and concepts absent from its robot demonstrations by drawing on its vision-language pretraining. This made RT-2 a landmark paper in the VLA approach to robot learning. The paper evaluates two model families, RT-2-PaLI-X (5B and 55B parameters) and RT-2-PaLM-E (12B parameters), with the largest variant based on PaLI-X (55B parameters).
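The action-as-text idea can be illustrated with a small sketch. The 256 uniform bins and the leading terminate flag follow the RT-2 paper; the function names, the normalized value range of [-1, 1], and the choice to decode to bin centers are assumptions made for illustration and are not taken from the RT-2 codebase, which is not public.

```python
from typing import List, Sequence, Tuple

# The RT-2 paper discretizes each continuous action dimension into 256 uniform bins.
NUM_BINS = 256


def encode_action(values: Sequence[float], terminate: bool = False,
                  low: float = -1.0, high: float = 1.0) -> str:
    """Turn continuous action values into a space-separated string of integers.

    The [low, high] range is an assumed normalization, not a documented one.
    """
    tokens: List[int] = [int(terminate)]  # first token: terminate flag (0 or 1)
    for v in values:
        clipped = min(max(v, low), high)          # keep the value inside the range
        frac = (clipped - low) / (high - low)     # map to [0, 1]
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))  # bin index 0..255
    return " ".join(str(t) for t in tokens)


def decode_action(text: str, low: float = -1.0,
                  high: float = 1.0) -> Tuple[bool, List[float]]:
    """Invert encode_action, returning the center of each bin (an assumed choice)."""
    parts = [int(p) for p in text.split()]
    width = (high - low) / NUM_BINS
    values = [low + (b + 0.5) * width for b in parts[1:]]
    return bool(parts[0]), values


if __name__ == "__main__":
    # Hypothetical 7-D action: 3 position deltas, 3 rotation deltas, gripper extension.
    action = [0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.30]
    text = encode_action(action)
    print(text)
    print(decode_action(text))
```

Because the encoded action is ordinary text, it can be appended to a training example as the target string, and a decoded model output can be mapped back to a continuous command with quantization error of at most half a bin width.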

Details

Updated: 6/6/2026

https://robotics-transformer2.github.io

open source: false

release date: 2023-07-28

paper url: https://arxiv.org/abs/2307.15818

model family: VLA (Vision-Language-Action)

Sources

https://robotics-transformer2.github.io

website

https://arxiv.org/abs/2307.15818