Teaching humanoid robots expressive, athletic behaviors has historically required expensive Motion Capture (MoCap) infrastructure — placing this capability out of reach for most academic and competition teams. This paper presents ViMoS, an open-source, end-to-end pipeline that transforms a single monocular video into a deployable robot control policy using only a consumer-grade GPU (tested on an NVIDIA RTX 4090).
By integrating GENMO for generative motion estimation, GMR for kinematic retargeting, and BeyondMimic for reinforcement learning inside NVIDIA Isaac Lab, ViMoS enables teams to deploy whole-body behaviors onto physical humanoids such as the Unitree G1 and Booster T1. We validate the pipeline by successfully training and transferring four qualitatively distinct motion skills. A fully containerized Docker workflow reduces setup time to under 30 minutes, making expressive humanoid motion accessible to the broader RoboCup community.
Expressive whole-body motion is a cornerstone of humanoid robotics, yet creating new behaviors remains prohibitively expensive and technically demanding. High-fidelity motion data traditionally relies on Motion Capture (MoCap) systems, which carry hardware costs in the tens of thousands of dollars and require calibrated laboratory environments — infrastructure unavailable to most academic and competition teams.
Beyond hardware, a significant software fragmentation barrier exists. Tools for video-based pose estimation, kinematic retargeting, and physics-based policy learning often reside in isolated repositories with incompatible dependencies. Integrating these into a functional pipeline typically requires weeks of engineering effort, a process often lost between competition cycles.
ViMoS addresses both barriers with a unified, containerized pipeline. No MoCap hardware is required — a single monocular smartphone video is sufficient. By integrating GENMO, GMR, and BeyondMimic within Isaac Lab, ViMoS provides a documented, end-to-end tool deployable with a single Docker command. Validated on a consumer GPU (NVIDIA RTX 4090) with peak VRAM under 12 GB, ViMoS enables the rapid creation of diverse skills — including dancing, handshaking, and running — fostering an agile research environment for the RoboCup community.
The ViMoS pipeline is composed of four specialized frameworks that bridge the gap from visual perception to physical execution. Each module addresses a specific challenge of the embodiment transfer process, from raw video to hardware-ready control policies.
The first stage employs GENMO (Generalist Model for Human Motion), a diffusion-based framework designed for robust pose estimation from monocular video. Unlike traditional regression-based models, GENMO utilizes an Asymmetric Diffusion Transformer (AsymmDiT) architecture with 16 layers of RoPE-based Transformer blocks.
Its primary innovation lies in Dual-Mode Training: an Estimation Mode that uses Maximum Likelihood Estimation (MLE) to produce precise SMPL-X parameters even under heavy occlusion, and a Global Trajectory Recovery mode that estimates global root velocities and camera poses — allowing the system to extract clean motion data from dynamic, handheld footage without fixed MoCap setups.
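To make the trajectory-recovery step concrete, the minimal sketch below integrates per-frame root velocities into a world-frame path. It is an illustration of the concept only, assuming a fixed 30 fps video and a placeholder velocity array; it is not code from the GENMO repository.

import numpy as np

# Illustrative only: recover a world-frame root path by integrating estimated root velocities.
num_frames = 300                              # placeholder clip length
dt = 1.0 / 30.0                               # assumed 30 fps video
root_vel = np.zeros((num_frames, 3))          # stand-in for GENMO's estimated root velocities (m/s)
root_pos = np.cumsum(root_vel * dt, axis=0)   # integrated global trajectory, frame by frame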
Once the human motion is extracted, GMR performs the kinematic mapping to the robot's specific topology. Retargeting is often plagued by foot sliding and self-collisions due to differing limb proportions. GMR resolves this through a two-stage process:
First, Non-Uniform Local Scaling dynamically adjusts the human reference to match the specific segment lengths of robots like the Unitree G1 or Booster T1. Second, Optimized Differential IK uses the Mink solver to minimize the error between human keypoints and robot end-effectors while strictly enforcing joint limits. GMR supports more than 18 humanoid platforms out of the box.
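As a rough illustration of the scaling step (a simplification, not GMR's implementation), each human bone vector can be rescaled so its length matches the robot's corresponding link before the IK solve; the helper below and its names are assumptions for exposition.

import numpy as np

# Simplified non-uniform local scaling: stretch or shrink one human segment to the robot's link length.
def scale_segment(parent_pos, child_pos, robot_link_length):
    bone = child_pos - parent_pos                      # human bone vector
    human_length = np.linalg.norm(bone)
    return parent_pos + bone * (robot_link_length / human_length)

# Example: retarget a human forearm (0.26 m) onto a robot forearm (0.21 m).
elbow, wrist = np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.26, 1.0])
scaled_wrist = scale_segment(elbow, wrist, 0.21)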
The control policy is trained with BeyondMimic inside NVIDIA's Isaac Lab simulation, leveraging GPU-parallelized Reinforcement Learning. Adaptive Sampling via within-motion failure-aware bin resampling identifies frames where the robot frequently loses balance and increases their sampling frequency during training, significantly accelerating convergence for agile tasks like kicks or spins. A compact tracking reward combines local joint imitation with world-frame global constraints, reducing long-term drift in locomotion tasks. The same reward formulation is applied to every motion without per-motion parameter tuning.
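The adaptive-sampling idea can be sketched as follows; the bin count and bookkeeping below are simplified assumptions, not BeyondMimic's internals. Episode start frames are drawn more often from the time bins where rollouts most frequently fail.

import numpy as np

# Simplified failure-aware bin resampling: count failures per motion bin and bias start-frame sampling toward them.
num_bins = 32
failure_counts = np.ones(num_bins)            # start uniform so every bin remains reachable

def sample_start_bin(rng):
    probs = failure_counts / failure_counts.sum()
    return rng.choice(num_bins, p=probs)      # hard bins are revisited more often

def record_episode(bin_idx, fell):
    if fell:
        failure_counts[bin_idx] += 1.0        # falling in a bin raises its future sampling weight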
For the Unitree G1, ViMoS exports policies in ONNX format and integrates them into a ROS 2 Jazzy controller node. For the Booster T1, the framework uses TorchScript JIT for high-frequency low-level control at up to 200 Hz. This dual-export capability ensures ViMoS is a generalist tool, capable of powering various humanoid platforms through a unified software interface.
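The dual export path reduces to two standard PyTorch calls, shown here on a stand-in network with placeholder dimensions; in practice the export is driven by the repository's play.py script.

import torch

# Sketch of the two export targets (placeholder network and dimensions, not the trained policy):
obs_dim, act_dim = 45, 12
policy_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ELU(), torch.nn.Linear(64, act_dim))
example_obs = torch.zeros(1, obs_dim)
torch.onnx.export(policy_net, example_obs, "policy.onnx")    # Unitree G1 path: loaded by the ROS 2 controller
torch.jit.trace(policy_net, example_obs).save("policy.pt")   # Booster T1 path: TorchScript for 200 Hz low-level control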
ViMoS is fully containerized with Docker, so the only requirements are a Linux machine with an NVIDIA GPU (RTX 3090+), Docker with nvidia-container-toolkit, and a free Weights & Biases account for the artifact registry. You will also need the GENMO checkpoint (s050000.ckpt) placed at retarget/GENMO/inputs/checkpoints/.
git clone --recurse-submodules https://github.com/AKCIT-RL/ViMoS.git
cd ViMoS
docker compose build # Stages 1–2 (GENMO + GMR)
docker compose --profile wbt build # Stage 3 (Isaac Lab training)
A single script runs GENMO and GMR end-to-end and outputs the retargeted motion as a CSV ready for training:
python scripts/video_to_robot_docker.py --video my_video.mp4 --robot booster_t1
# or: --robot unitree_g1
# add --headless when running over SSH; add --record_video to save a visualization
Convert the CSV to NPZ, upload to W&B, then launch PPO training in Isaac Lab:
cd train/whole_body_tracking
python scripts/csv_to_npz.py --input_file motion.csv --output_name my_motion --robot booster_t1 --headless
python scripts/rsl_rl/train.py --task=Tracking-Flat-T1-Wo-State-Estimation-v0 --num_envs 4096 --headless
For the Unitree G1, use --task=Tracking-Flat-G1-v0. After training, export the policy with play.py to obtain a .pt (T1) or .onnx (G1) file.
# Unitree G1 via ROS 2
ros2 launch motion_tracking_controller mujoco.launch.py policy_path:=/path/to/policy.onnx
For Booster T1, see deploy/booster_deploy/README.md.
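For reference, the sketch below shows how a deployment node could query the exported ONNX policy with onnxruntime; the observation size and the single-input assumption are placeholders that must match the exported graph.

import numpy as np
import onnxruntime as ort

# Minimal inference-loop sketch (placeholder observation size; verify input/output names of the exported graph):
session = ort.InferenceSession("policy.onnx")
obs_dim = 45
obs = np.zeros((1, obs_dim), dtype=np.float32)            # would be filled from robot state at each control step
input_name = session.get_inputs()[0].name
actions = session.run(None, {input_name: obs})[0]         # joint targets passed to the low-level controller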
For detailed instructions, troubleshooting, and advanced options, visit the GitHub repository.
To demonstrate that ViMoS generalizes across qualitatively different motion types and robot platforms, we trained and evaluated three distinct skills on both the Booster T1 and Unitree G1: Aceno (wave), Boxe (boxing), and Macarena (dance). Each row below shows the full pipeline from raw video to physical deployment.
All tasks achieve comparable reward curves; differences reflect motion duration rather than task difficulty. Peak GPU memory consumption during training did not exceed 12 GB, confirming that the pipeline is viable on a single high-end consumer GPU. Full training logs are available on Weights & Biases.
Results grid: rows show Aceno (Wave), Boxe (Boxing), and Macarena; columns show the Booster T1 and Unitree G1.
ViMoS lowers three compounding barriers — cost, fragmentation, and hardware uncertainty — that currently prevent most teams from developing expressive humanoid behaviors. By unifying GENMO, GMR, and BeyondMimic into a single containerized workflow and validating it on consumer hardware across four qualitatively distinct motion skills (two dances, a handshake, and a running gait), we demonstrate that high-quality motion imitation is no longer the exclusive domain of well-resourced labs.
We release ViMoS as a fully open-source tool and invite the RoboCup community to extend and build upon it.
ViMoS integrates four open-source components. Please refer to each project for credits, licenses, and upstream documentation.
This work has been [partially/fully] funded by the project AKCIT-Robotics: Immersive Environments Accelerating Robot Learning, supported by the Advanced Knowledge Center in Immersive Technologies (AKCIT), with financial resources from the PPI IoT of the MCTI grant number 057/2023, signed with EMBRAPII. The authors are also grateful to the Fundação de Amparo à Pesquisa do Estado de Goiás (FAPEG) for the financial support provided for this research (Grant 64448878/2024). We thank the authors of GMR, GENMO, and BeyondMimic for their open-source contributions.