Teaching humanoid robots expressive, athletic behaviors has historically required expensive Motion Capture (MoCap) infrastructure — placing this capability out of reach for most academic and competition teams. This paper presents ViMoS, an open-source, end-to-end pipeline that transforms a single monocular video into a deployable robot control policy using only a consumer-grade GPU (tested on an NVIDIA RTX 4090).
By integrating GENMO for generative motion estimation, GMR for kinematic retargeting, and BeyondMimic for reinforcement learning inside NVIDIA Isaac Lab, ViMoS enables teams to deploy whole-body behaviors onto physical humanoids such as the Unitree G1 and Booster T1. We validate the pipeline by successfully training and transferring four qualitatively distinct motion skills. A fully containerized Docker workflow reduces setup time to under 30 minutes, making expressive humanoid motion accessible to the broader RoboCup community.
Expressive whole-body motion is a cornerstone of humanoid robotics, yet creating new behaviors remains prohibitively expensive and technically demanding. High-fidelity motion data traditionally relies on Motion Capture (MoCap) systems, which carry hardware costs in the tens of thousands of dollars and require calibrated laboratory environments — infrastructure unavailable to most academic and competition teams.
Beyond hardware, a significant software fragmentation barrier exists. Tools for video-based pose estimation, kinematic retargeting, and physics-based policy learning often reside in isolated repositories with incompatible dependencies. Integrating these into a functional pipeline typically requires weeks of engineering effort, a process often lost between competition cycles.
ViMoS addresses both barriers with a unified, containerized pipeline. No MoCap hardware is required — a single monocular smartphone video is sufficient. By integrating GENMO, GMR, and BeyondMimic within Isaac Lab, ViMoS provides a documented, end-to-end tool deployable with a single Docker command. Validated on a consumer GPU (NVIDIA RTX 4090) with peak VRAM under 12 GB, ViMoS enables the rapid creation of diverse skills — including dancing, handshaking, and running — fostering an agile research environment for the RoboCup community.
The ViMoS pipeline is composed of four specialized frameworks that bridge the gap from visual perception to physical execution. Each module addresses a specific challenge of the embodiment transfer process, from raw video to hardware-ready control policies.
The first stage employs GENMO (Generalist Model for Human Motion), a diffusion-based framework designed for robust pose estimation from monocular video. Unlike traditional regression-based models, GENMO utilizes an Asymmetric Diffusion Transformer (AsymmDiT) architecture with 16 layers of RoPE-based Transformer blocks.
Its primary innovation lies in Dual-Mode Training: an Estimation Mode that uses Maximum Likelihood Estimation (MLE) to produce precise SMPL-X parameters even under heavy occlusion, and a Global Trajectory Recovery mode that estimates global root velocities and camera poses — allowing the system to extract clean motion data from dynamic, handheld footage without fixed MoCap setups.
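To make the trajectory-recovery step concrete, the minimal sketch below integrates per-frame root velocities into a world-frame path. It is an illustration of the concept only, assuming a fixed 30 fps video and a placeholder velocity array; it is not code from the GENMO repository.

import numpy as np

# Illustrative only: recover a world-frame root path by integrating estimated root velocities.
num_frames = 300                              # placeholder clip length
dt = 1.0 / 30.0                               # assumed 30 fps video
root_vel = np.zeros((num_frames, 3))          # stand-in for GENMO's estimated root velocities (m/s)
root_pos = np.cumsum(root_vel * dt, axis=0)   # integrated global trajectory, frame by frame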
Once the human motion is extracted, GMR performs the kinematic mapping to the robot's specific topology. Retargeting is often plagued by foot sliding and self-collisions due to differing limb proportions. GMR resolves this through a two-stage process:
First, Non-Uniform Local Scaling dynamically adjusts the human reference to match the specific segment lengths of robots like the Unitree G1 or Booster T1. Second, Optimized Differential IK uses the Mink solver to minimize the error between human keypoints and robot end-effectors while strictly enforcing joint limits. GMR supports more than 18 humanoid platforms out of the box.
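As a rough illustration of the scaling step (a simplification, not GMR's implementation), each human bone vector can be rescaled so its length matches the robot's corresponding link before the IK solve; the helper below and its names are assumptions for exposition.

import numpy as np

# Simplified non-uniform local scaling: stretch or shrink one human segment to the robot's link length.
def scale_segment(parent_pos, child_pos, robot_link_length):
    bone = child_pos - parent_pos                      # human bone vector
    human_length = np.linalg.norm(bone)
    return parent_pos + bone * (robot_link_length / human_length)

# Example: retarget a human forearm (0.26 m) onto a robot forearm (0.21 m).
elbow, wrist = np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.26, 1.0])
scaled_wrist = scale_segment(elbow, wrist, 0.21)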
The control policy is trained with BeyondMimic inside NVIDIA's Isaac Lab simulation, leveraging GPU-parallelized Reinforcement Learning. Adaptive Sampling via within-motion failure-aware bin resampling identifies frames where the robot frequently loses balance and increases their sampling frequency during training, significantly accelerating convergence for agile tasks like kicks or spins. A compact tracking reward combines local joint imitation with world-frame global constraints, reducing long-term drift in locomotion tasks. The same reward formulation is applied to every motion without per-motion parameter tuning.
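The adaptive-sampling idea can be sketched as follows; the bin count and bookkeeping below are simplified assumptions, not BeyondMimic's internals. Episode start frames are drawn more often from the time bins where rollouts most frequently fail.

import numpy as np

# Simplified failure-aware bin resampling: count failures per motion bin and bias start-frame sampling toward them.
num_bins = 32
failure_counts = np.ones(num_bins)            # start uniform so every bin remains reachable

def sample_start_bin(rng):
    probs = failure_counts / failure_counts.sum()
    return rng.choice(num_bins, p=probs)      # hard bins are revisited more often

def record_episode(bin_idx, fell):
    if fell:
        failure_counts[bin_idx] += 1.0        # falling in a bin raises its future sampling weight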
For the Unitree G1, ViMoS exports policies in ONNX format and integrates them into a ROS 2 Jazzy controller node. For the Booster T1, the framework uses TorchScript JIT for high-frequency low-level control at up to 200 Hz. This dual-export capability ensures ViMoS is a generalist tool, capable of powering various humanoid platforms through a unified software interface.
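The dual export path reduces to two standard PyTorch calls, shown here on a stand-in network with placeholder dimensions; in practice the export is driven by the repository's play.py script.

import torch

# Sketch of the two export targets (placeholder network and dimensions, not the trained policy):
obs_dim, act_dim = 45, 12
policy_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ELU(), torch.nn.Linear(64, act_dim))
example_obs = torch.zeros(1, obs_dim)
torch.onnx.export(policy_net, example_obs, "policy.onnx")    # Unitree G1 path: loaded by the ROS 2 controller
torch.jit.trace(policy_net, example_obs).save("policy.pt")   # Booster T1 path: TorchScript for 200 Hz low-level control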
ViMoS is fully containerized with Docker, so the only requirements are a Linux machine with an NVIDIA GPU (RTX 3090+), Docker with nvidia-container-toolkit, and a free Weights & Biases account for the artifact registry. You will also need the GENMO checkpoint (s050000.ckpt) placed at retarget/GENMO/inputs/checkpoints/.
git clone --recurse-submodules https://github.com/AKCIT-RL/ViMoS.git
cd ViMoS
docker compose build # Stages 1–2 (GENMO + GMR)
docker compose --profile wbt build # Stage 3 (Isaac Lab training)
A single script runs GENMO and GMR end-to-end and outputs the retargeted motion as a CSV ready for training:
python scripts/video_to_robot_docker.py --video my_video.mp4 --robot booster_t1
# or: --robot unitree_g1
# add --headless when running over SSH; add --record_video to save a visualization
Convert the CSV to NPZ, upload to W&B, then launch PPO training in Isaac Lab:
cd train/whole_body_tracking
python scripts/csv_to_npz.py --input_file motion.csv --output_name my_motion --robot booster_t1 --headless
python scripts/rsl_rl/train.py --task=Tracking-Flat-T1-Wo-State-Estimation-v0 --num_envs 4096 --headless
For the Unitree G1, use --task=Tracking-Flat-G1-v0. After training, export the policy with play.py to obtain a .pt (T1) or .onnx (G1) file.
# Unitree G1 via ROS 2
ros2 launch motion_tracking_controller mujoco.launch.py policy_path:=/path/to/policy.onnx
For Booster T1, see deploy/booster_deploy/README.md.
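For reference, the sketch below shows how a deployment node could query the exported ONNX policy with onnxruntime; the observation size and the single-input assumption are placeholders that must match the exported graph.

import numpy as np
import onnxruntime as ort

# Minimal inference-loop sketch (placeholder observation size; verify input/output names of the exported graph):
session = ort.InferenceSession("policy.onnx")
obs_dim = 45
obs = np.zeros((1, obs_dim), dtype=np.float32)            # would be filled from robot state at each control step
input_name = session.get_inputs()[0].name
actions = session.run(None, {input_name: obs})[0]         # joint targets passed to the low-level controller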
For detailed instructions, troubleshooting, and advanced options, visit the GitHub repository.
To demonstrate that ViMoS generalizes across qualitatively different motion types and robot platforms, we trained and evaluated three distinct skills on both the Booster T1 and Unitree G1: Aceno (wave), Boxe (boxing), and Macarena (dance). Each row below shows the full pipeline from raw video to physical deployment.
All tasks achieve comparable reward curves; differences reflect motion duration rather than task difficulty. Peak GPU memory consumption during training did not exceed 12 GB, confirming that the pipeline is viable on a single high-end consumer GPU. Full training logs are available on Weights & Biases.
Results grid: rows show Aceno (Wave), Boxe (Boxing), and Macarena; columns show the Booster T1 and Unitree G1.
ViMoS lowers three compounding barriers — cost, fragmentation, and hardware uncertainty — that currently prevent most teams from developing expressive humanoid behaviors. By unifying GENMO, GMR, and BeyondMimic into a single containerized workflow and validating it on consumer hardware across four qualitatively distinct motion skills (two dances, a handshake, and a running gait), we demonstrate that high-quality motion imitation is no longer the exclusive domain of well-resourced labs.
We release ViMoS as a fully open-source tool and invite the RoboCup community to extend and build upon it.
ViMoS integrates four open-source components. Please refer to each project for credits, licenses, and upstream documentation.
This work has been [partially/fully] funded by the project AKCIT-Robotics: Immersive Environments Accelerating Robot Learning, supported by the Advanced Knowledge Center in Immersive Technologies (AKCIT), with financial resources from the PPI IoT of the MCTI grant number 057/2023, signed with EMBRAPII. The authors are also grateful to the Fundação de Amparo à Pesquisa do Estado de Goiás (FAPEG) for the financial support provided for this research (Grant 64448878/2024). We thank the authors of GMR, GENMO, and BeyondMimic for their open-source contributions.