TL;DR · Key Takeaways
  • GR00T N1.6 leads on speed — 3B params, 27.3 Hz inference, 76.8% real-world success, Apache 2.0 license, optimized for NVIDIA hardware
  • pi0.6 leads on manipulation — RECAP self-improvement loop, 2x throughput over pi0.5, ~80% manipulation success, $11B valuation
  • OpenVLA leads on openness — Fully open weights/code/data on the OXE dataset, but zero-shot generalization lags closed-source by 20-30 points
  • Neurosymbolic alternatives are gaining — A 2B NS-VLA hits 95% vs 34% for pure 7B VLAs, suggesting architecture beats scale for structured tasks

The race to build the “foundation model for robots” has produced three standout contenders by mid-2026. Each takes a fundamentally different approach to the same problem: how do you give a robot the ability to see, understand, and act in the physical world?

This guide breaks down NVIDIA GR00T N1, Physical Intelligence pi0, and OpenVLA across architecture, performance, licensing, and practical deployment considerations. New to the field? Start with our getting started with embodied AI guide first.

What Are VLA Models?

Vision-Language-Action (VLA) models are the embodied AI equivalent of large language models. Where GPT processes text, VLAs process visual input + language instructions and output physical actions (joint positions, gripper commands, navigation waypoints).

The key insight: instead of hand-coding robot behaviors, you train a single neural network to map from “what the robot sees + what you tell it” to “what the robot does.”

Field Update: Unitree G1 Is Becoming the Testbed

The most important practical shift in spring 2026 is not just another named VLA model. It is the rise of Unitree G1 as a common, affordable humanoid testbed for manipulation research. That matters because the field needs repeatable results on the same physical platform, not only model cards and demo videos.

Two recent papers show the pattern:

The takeaway for VLA buyers and builders: model choice is no longer separable from robot platform, sensing stack, and deployment workflow. A slower model with clean data collection and a calibrated robot can beat a larger model trapped behind a brittle integration pipeline. Unitree’s own product lineup now explicitly includes G1, R1, H1/H2, G1-D, and G1 application variants, which makes the ecosystem increasingly relevant for hands-on VLA evaluation.

The Three Contenders

NVIDIA GR00T N1.6

Architecture: 3B parameters, 32-layer Diffusion Transformer (DiT), paired with Cosmos-Reason-2B VLM for scene understanding.

Key specs:

  • Inference speed: 27.3 Hz (real-time capable)
  • Real-world success rate: 76.8%
  • License: Apache 2.0 (fully open)
  • Training: Large-scale simulation + real-world data

Strengths:

  • Best-in-class inference speed for real-time control
  • Open-source with NVIDIA ecosystem support (Isaac Sim, Jetson)
  • Strong sim-to-real transfer thanks to Cosmos pre-training

Weaknesses:

  • Requires NVIDIA hardware for optimal performance
  • 76.8% real-world success rate still leaves room for improvement
  • Relatively new, community ecosystem still developing

Physical Intelligence pi0.6

Architecture: RECAP three-stage self-improvement pipeline with diffusion policy action heads, 2x throughput over pi0.5.

Key specs:

  • Valuation: $11B (largest in embodied AI)
  • License: Open-source via openpi
  • Training: Large-scale real-world manipulation data

Strengths:

  • Self-improvement loop (gets better with more deployment data)
  • Backed by massive funding and top talent
  • Strong dexterous manipulation capabilities

Weaknesses:

  • Primarily focused on manipulation (not full-body humanoid control)
  • Self-improvement requires significant compute for each cycle
  • openpi is open but the full training pipeline is not

OpenVLA (Open-Source Community)

Architecture: Community-driven, building on OXE (Open X-Embodiment) dataset.

Key specs:

  • Fully open-source: model weights, training code, data
  • Multiple model sizes available
  • Active research community (Stanford, Berkeley, Google)

Strengths:

  • Completely open — no vendor lock-in
  • Largest diversity of training data (OXE covers 20+ robot embodiments)
  • Best for research and customization

Weaknesses:

  • Zero-shot generalization significantly lags closed-source models
  • OXE data quality is inconsistent
  • No single commercial entity driving production-readiness

Head-to-Head Comparison

FeatureGR00T N1.6pi0.6OpenVLA
Parameters3B~3B1-7B variants
Inference Speed27.3 Hz~10 Hz5-15 Hz
Real-World Success76.8%~80% (manipulation)40-70% (varies)
LicenseApache 2.0Partial openFully open
Best ForFull-body humanoidDexterous manipulationResearch & customization
Hardware RequiredNVIDIA GPU/JetsonGPUAny GPU
Self-ImprovementNoYes (RECAP)Manual fine-tuning

Which Should You Choose?

For production humanoid robots: GR00T N1, especially if you’re in the NVIDIA ecosystem. The 27.3 Hz inference speed is critical for real-time bipedal control.

For manipulation tasks: pi0, particularly if you need the self-improvement loop for continuous deployment optimization.

For research and experimentation: OpenVLA gives you full access to modify every component, and the OXE dataset provides the broadest coverage of robot embodiments.

The Neurosymbolic Alternative

It’s worth noting the emerging neurosymbolic VLA approach. Recent work from Tufts (ICRA 2026) showed that combining PDDL symbolic planning with diffusion policies achieved 95% success rates versus 34% for pure VLA models, with 100x better energy efficiency. The NS-VLA architecture (2B parameters) outperforms 7B pure VLA baselines on LIBERO and CALVIN benchmarks. We cover this paradigm shift in depth in our neurosymbolic VLA explainer.

This suggests the future may not be “bigger VLA models” but “smarter combinations of symbolic reasoning and neural control.”

What’s Next

The VLA space is evolving rapidly. Key trends to watch:

  • Discrete diffusion VLA — faster inference through discrete token generation
  • Reasoning VLA — models that “think before acting” using chain-of-thought
  • Memory-augmented VLA — persistent memory for long-horizon tasks

We’ll be tracking all of these developments. Subscribe to stay updated.

Source notes and update policy: This comparison is maintained as a technical field guide, not a vendor ranking. It draws on official model releases, project documentation, public research papers, and deployment reports. The May 2026 update added Unitree G1 field evidence from arXiv papers on autonomous humanoid grasping and ACT imitation learning. If a benchmark or model release changes, send a correction through the contact page.