
The Data Flywheel Nobody Is Measuring: 10,000 Chinese Robots vs 1,500 American Ones

By Robots In Life

TL;DR

Every deployed humanoid robot generates data about how the real world works. China has nearly seven times more of them operating in factories, warehouses, and hotels than the United States. The implications for AI training are enormous, but the relationship between volume and intelligence is not as simple as it looks.

There is a thesis in Silicon Valley that goes something like this: whoever deploys the most robots into the real world will collect the most real-world data. That data will train the best AI models. Those better models will make the robots more capable, which will lead to more deployments, which will generate more data. A flywheel. A compounding advantage. The kind of feedback loop that made Google unbeatable in search, that gave Tesla a lead in self-driving, that turned AWS into a money printer.

If this thesis is correct, the implications for the humanoid robot race are stark. China has over 10,000 humanoid robots operating in real-world environments right now. The United States has approximately 1,500. That is not a gap. That is a chasm.

But here is the question nobody seems to be asking carefully enough: does the data flywheel thesis actually hold for humanoid robotics? Is more data always better? Or is the relationship between deployed robots and AI capability more complicated than the neat narrative suggests?

Deployed humanoid robots generating real-world data (Q1 2026)

China total: ~10,000+ (Unitree, AgiBot, UBTECH, Fourier, others)
United States total: ~1,500 (Tesla, Boston Dynamics, Agility, Figure, Apptronik)
China's volume lead: 6.7x (ratio of deployed units)

The case for the flywheel

The logic of the data flywheel is borrowed from software and applied to physical machines. In digital AI, the pattern is well established. Google got better at search because more users generated more queries, which provided more signal about which results were useful. Facebook got better at content ranking because more posts and more engagement gave its algorithms more examples to learn from. OpenAI got better at language because more conversations meant more reinforcement data.

The robotics version of this argument runs as follows. Every humanoid robot operating in a factory, warehouse, or hotel lobby is a data collection device. Its cameras record visual scenes. Its joint encoders log motor positions. Its force sensors measure contact with objects. Its IMU tracks balance and orientation. All of this data, raw and noisy and messy, captures something essential: what it is actually like to move through and interact with the physical world.

This is the data that trains what researchers call foundation models for robotics, large neural networks that learn generalized physical understanding from diverse real-world experience. Google DeepMind’s RT-2 and its successors, Physical Intelligence’s Pi-0, Toyota Research Institute’s large behavior models: all of them are hungry for exactly this kind of data.

~50 TB/day: estimated real-world data generated by China's deployed humanoid fleet

And here is where China’s numbers start to matter. If you assume that each deployed humanoid generates roughly 5 gigabytes of sensor data per operating day (a conservative estimate given the camera feeds, LIDAR scans, and proprioceptive streams involved), then China’s fleet of 10,000+ robots produces approximately 50 terabytes of training-relevant data every single day. The American fleet generates perhaps 7.5 terabytes. Over a year of daily operation, that gap adds up to roughly 18 petabytes versus 2.7 petabytes.
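The arithmetic behind these figures can be sketched in a few lines. This is a back-of-the-envelope calculation, not a measurement: the 5 GB per robot per day comes from the estimate above, while the 365 operating days per year and the decimal units (1 TB = 1,000 GB) are assumptions made here for illustration.

```python
GB_PER_ROBOT_PER_DAY = 5  # the article's stated per-robot estimate
DAYS_PER_YEAR = 365       # assumption: every robot operates every day

def fleet_data(robots: int) -> tuple[float, float]:
    """Return (terabytes per day, petabytes per year) in decimal units."""
    tb_per_day = robots * GB_PER_ROBOT_PER_DAY / 1_000
    pb_per_year = tb_per_day * DAYS_PER_YEAR / 1_000
    return tb_per_day, pb_per_year

china = fleet_data(10_000)
usa = fleet_data(1_500)
print(f"China: {china[0]:.1f} TB/day, {china[1]:.2f} PB/year")
print(f"US:    {usa[0]:.1f} TB/day, {usa[1]:.2f} PB/year")
```

Under these assumptions the yearly totals come to about 18.3 and 2.7 petabytes. Fewer operating days per year would shrink both numbers but leave the ratio between the fleets unchanged.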

In AI, data quantity has historically been a strong predictor of model quality. The scaling laws that govern large language models, first formalized by OpenAI and later refined by DeepMind’s Chinchilla research, suggest that model performance improves predictably as training data increases, at least up to a point. If those same scaling laws apply to robotic foundation models, then China’s data advantage could translate directly into an AI advantage.
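To get a feel for what "improves predictably" could mean here, the data-limited regime of the language-model scaling laws is commonly summarized as a power law, with loss proportional to D raised to the power of minus alpha. The sketch below applies that form to the 6.7x fleet ratio. The exponents are hypothetical values chosen only for illustration; nobody has established what the exponent is for robotic foundation models, or whether the same form applies at all.

```python
def loss_ratio(data_multiplier: float, alpha: float) -> float:
    """Ratio L(k*D) / L(D) under a pure power law L(D) ~ D**(-alpha)."""
    return data_multiplier ** (-alpha)

FLEET_RATIO = 10_000 / 1_500  # about 6.7x, from the article's fleet estimates

for alpha in (0.05, 0.1, 0.3):  # hypothetical exponents, illustration only
    r = loss_ratio(FLEET_RATIO, alpha)
    print(f"alpha={alpha}: loss falls by {100 * (1 - r):.1f}%")
```

The point of the sketch is the shape of the curve: with a small exponent, even a several-fold data advantage buys only a modest reduction in loss, while a large exponent would make the same advantage decisive.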

Estimated deployed humanoid units by company (Q1 2026)

Unitree: 5,500 units
AgiBot: 5,200 units
UBTECH: 1,000 units
Boston Dynamics: 1,000 units
Leju Robotics: 650 units
Tesla: 500 units
Engine AI: 400 units
Fourier: 350 units
Agility: 300 units
Figure AI: 200 units

The autonomous driving precedent

The strongest evidence for the data flywheel thesis in robotics comes from autonomous driving, a domain where China and the United States have been running a natural experiment for the better part of a decade.

Tesla’s Autopilot and Full Self-Driving systems have benefited enormously from the data generated by millions of Tesla vehicles on the road. By 2025, Tesla had accumulated over 10 billion miles of real-world driving data, a figure no competitor could match. That data advantage allowed Tesla to train neural networks that handle edge cases, rare events, and novel scenarios that smaller datasets simply cannot represent.

Waymo, despite having far fewer vehicles, has offset its data disadvantage with higher-quality data. Waymo vehicles carry more sensors, use more precise labeling, and operate in more carefully mapped environments. The result is that Waymo’s autonomous driving performance in its operational domains is arguably the best in the world, even with a fraction of Tesla’s fleet data.

In China, Baidu’s Apollo program tells yet another version of the story. Baidu has operated thousands of autonomous vehicles across Chinese cities, accumulating massive datasets of Chinese road conditions, traffic patterns, and driving behaviors. This data has made Apollo genuinely competitive in the Chinese market, even though its technology was originally considered inferior to Waymo’s. Deployment at scale, in the specific environment where the vehicles operate, created a localized data advantage that proved more valuable than technical sophistication alone.

The case against the flywheel (or at least, this version of it)

The data flywheel narrative has three critical assumptions baked into it, and each one deserves scrutiny.

Assumption 1: All robot data is useful training data

The first assumption is that more deployed robots means more useful training data. This sounds obvious, but it breaks down under examination.

Most of China’s 10,000+ deployed humanoid robots are doing highly repetitive tasks in controlled environments. An AgiBot unit picking up the same type of box from the same conveyor belt in the same factory does generate data, but after the first thousand repetitions, the marginal value of each additional data point drops dramatically. You do not need 10 million examples of the same pick-and-place operation to train a model. You need 10 million examples of 10,000 different operations.

This is the diversity problem. In machine learning, there is a well-established distinction between data quantity and data diversity. Models trained on large but homogeneous datasets learn narrow skills very well but generalize poorly. Models trained on smaller but diverse datasets often outperform them on novel tasks.

~85%: estimated share of Chinese humanoid deployments in repetitive logistics or assembly tasks

Reporting in IEEE Spectrum and researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have argued that the real bottleneck in robotic AI is not data volume but data diversity. A humanoid robot that spends eight hours a day in the same warehouse corridor generates less useful training data than a humanoid that spends one hour in a warehouse, one hour in a kitchen, one hour on a construction site, and one hour in a hospital. The breadth of environments matters more than the depth of repetition.
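One way to put a number on the corridor-versus-four-sites comparison is the "effective number" of environments in a robot's day: the exponential of the Shannon entropy of how its hours are split. This is an illustrative metric chosen here, not one attributed to the researchers above, and the function and variable names are hypothetical.

```python
import math

def effective_environments(hours_by_env: dict[str, float]) -> float:
    """Exponential of the Shannon entropy of the time split across environments."""
    total = sum(hours_by_env.values())
    entropy = -sum((h / total) * math.log(h / total)
                   for h in hours_by_env.values() if h > 0)
    return math.exp(entropy)

# The two robots from the example above.
same_corridor = {"warehouse": 8}
varied_day = {"warehouse": 1, "kitchen": 1, "construction site": 1, "hospital": 1}

print(f"same corridor: {effective_environments(same_corridor):.1f}")
print(f"varied day:    {effective_environments(varied_day):.1f}")
```

By this measure the eight-hour corridor robot counts as one effective environment and the four-hour robot as four, even though the first logs twice as many hours. A real diversity measure would also have to account for task variety, object variety, and how similar two environments are to each other.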

Assumption 2: Real-world data is necessary for AI progress

The second assumption is that real-world data is the primary driver of robotic AI progress. This was certainly true five years ago, but it is becoming less true every year.

Simulation has gotten dramatically better. Modern simulators and physics engines such as NVIDIA’s Isaac Sim, Google DeepMind’s MuJoCo, and various proprietary platforms can generate synthetic training data at a fraction of the cost and time required for real-world collection. More importantly, simulation allows for systematic exploration of edge cases and failure modes that real-world deployments rarely encounter.

The sim-to-real transfer problem, long considered the Achilles heel of simulation-based training, has improved substantially. Research published in Nature in 2024 demonstrated successful zero-shot sim-to-real transfer for complex manipulation tasks, meaning that policies trained entirely in simulation worked in the real world without any real-world fine-tuning. This is not yet the norm, but it is no longer science fiction.

Data collection: real-world vs simulation

Cost per training hour: real-world fleet $50-200; simulation $0.10-1.00 (real-world includes robot depreciation, energy, and facility costs)
Data diversity: real-world fleet limited by deployment environments; simulation unlimited scenario generation
Edge case coverage: real-world fleet rare by definition; simulation systematically generated
Physical realism: real-world fleet perfect (it is real); simulation improving but imperfect
Contact dynamics accuracy: real-world fleet ground truth; simulation approximated
Scaling speed: real-world fleet constrained by hardware; simulation constrained by compute
Annotation quality: real-world fleet requires human labeling; simulation automatic ground truth labels

If simulation continues to improve at its current pace, the value of a large real-world deployment fleet as a data collection mechanism will diminish. You will still need some real-world data for calibration and validation, but the bulk of training could happen in simulation. This would significantly erode China’s volume advantage.
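The cost gap in the comparison above is easier to appreciate as training hours per dollar. The sketch below uses the per-hour cost ranges quoted there; the one-million-dollar budget is a hypothetical figure picked only to make the numbers concrete.

```python
COST_PER_HOUR = {  # USD ranges quoted in the comparison above
    "real-world fleet": (50.0, 200.0),
    "simulation": (0.10, 1.00),
}

def hours_for_budget(budget_usd: float, source: str) -> tuple[float, float]:
    """Return (fewest, most) training hours the budget buys for a data source."""
    low_cost, high_cost = COST_PER_HOUR[source]
    return budget_usd / high_cost, budget_usd / low_cost

BUDGET = 1_000_000  # hypothetical budget, for illustration only
for source in COST_PER_HOUR:
    fewest, most = hours_for_budget(BUDGET, source)
    print(f"{source}: {fewest:,.0f} to {most:,.0f} hours")
```

On these figures the same budget buys thousands of real-world hours or millions of simulated ones. The comparison says nothing about how much a simulated hour is worth relative to a real one, which is the open question.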

Assumption 3: Data advantage compounds indefinitely

The third assumption is that data advantages compound over time in a winner-take-all dynamic. In language models, this turned out to be mostly false. GPT-4, Claude, Gemini, and several other models achieved comparable performance despite having access to very different training datasets. The reason is that beyond a certain scale, additional data yields diminishing returns, and algorithmic improvements can substitute for data quantity.

There is every reason to expect the same pattern in robotics. The Open X-Embodiment project, a collaboration among 21 institutions including Google DeepMind, Stanford, and UC Berkeley, showed that a diverse dataset pooled from many different robot types and environments could train a generalist model that outperformed models trained on any single robot’s data, regardless of that robot’s fleet size. The implication is that data sharing and collaboration can neutralize the advantage of a large proprietary fleet.

What the data actually looks like

Numbers alone do not tell the full story. The kind of data each country’s robots generate matters as much as the volume.

China’s deployed humanoid fleet is concentrated in a few categories. AgiBot robots work primarily in automotive manufacturing lines for BYD and SAIC Motor. Unitree G1 units are split between research institutions, educational settings, and light commercial deployments. UBTECH Walker S units serve automotive factories and hospitality environments. The common thread is that these deployments tend to be structured and predictable.

American deployments are smaller in volume but arguably more varied in scope. Tesla Optimus units are operating inside Tesla’s own factories, performing tasks that are evolving rapidly as the Gen 3 hardware matures. Figure AI’s Figure 02 completed an 11-month production trial at a BMW plant, handling over 90,000 components across diverse assembly line tasks. Agility’s Digit is working in Amazon fulfillment centers, navigating dynamic environments with human coworkers. Boston Dynamics’ Atlas is in commercial pilot programs across multiple industries.

Deployment profile: China (~10,000 units) vs United States (~1,500 units)

Fleet size: China 10,000+ units; United States ~1,500 units
Environment diversity: China moderate (manufacturing-heavy); United States higher (multi-industry). US robots span factories, warehouses, fulfillment, R&D labs
Task complexity: China mostly repetitive pick-and-place; United States increasingly multi-step
AI model sophistication: China improving rapidly; United States currently leading
Data collection infrastructure: China centralized (national platforms); United States fragmented (per-company)
Data sharing culture: China government-encouraged; United States competitive silos. AgiBot has published open-source robotics datasets
Scaling trajectory: China 20,000+ targeted for 2026; United States 3,000-5,000 targeted for 2026

There is one factor where China holds an underappreciated advantage: data infrastructure and sharing norms. AgiBot has published open-source robotics datasets on GitHub, a move with no real American equivalent. The Chinese government has actively encouraged data sharing among domestic robotics companies, creating something closer to a national data commons than the competitive silos that characterize the American industry. If data diversity is what matters most, then a national data-sharing ecosystem with 10,000 robots across multiple companies could be more valuable than any single company’s proprietary fleet.

The simulation wildcard

The most important variable in the data flywheel debate might be one that has nothing to do with deployed robots at all: the progress of simulation.

NVIDIA’s Isaac Sim platform can now generate physically realistic training scenarios for humanoid robots at a pace of thousands of simulated hours per real-time hour. Google DeepMind has demonstrated that models trained primarily in simulation can transfer to real hardware with minimal fine-tuning. Physical Intelligence’s Pi-0 foundation model was trained on a combination of simulated and real-world data, with the simulated component reportedly accounting for over 70% of the training mix.

If simulation becomes good enough to serve as the primary training environment, then the whole data flywheel thesis collapses. It would not matter whether you have 10,000 robots or 1,500 robots in the field, because neither fleet would be your primary data source. Your primary data source would be a render farm.
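A rough throughput comparison shows why the fleet size stops mattering in that scenario. The sketch below asks how many simulator instances, each running around the clock, would match a fleet's daily robot-hours. The eight operating hours per robot per day and the speedup values are assumptions for illustration, and the function name is hypothetical.

```python
import math

def sim_instances_to_match(robots: int, robot_hours_per_day: float,
                           sim_speedup: float) -> int:
    """Simulator instances running 24 h/day needed to equal the fleet's daily hours."""
    fleet_hours = robots * robot_hours_per_day
    sim_hours_per_instance = 24 * sim_speedup
    return math.ceil(fleet_hours / sim_hours_per_instance)

HOURS_PER_DAY = 8  # assumption: one shift of operation per robot per day
for speedup in (1_000, 10_000):  # "thousands of simulated hours per real-time hour"
    china = sim_instances_to_match(10_000, HOURS_PER_DAY, speedup)
    usa = sim_instances_to_match(1_500, HOURS_PER_DAY, speedup)
    print(f"{speedup}x speedup: {china} instance(s) match China, {usa} match the US")
```

This counts hours only. It treats a simulated hour as equal in value to a real one, which is exactly the assumption the sim-to-real gap calls into question.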

Simulation vs real-world training data (estimated, 2026)

Simulation speed advantage: 10,000x (simulated hours per real hour)
Simulation share in leading models: 70%+ (Pi-0, RT-2, and others)
Sim-to-real transfer success rate: 95% (for structured manipulation tasks)

This is not just theoretical. Toyota Research Institute has publicly stated that their large behavior models for manipulation are trained primarily on simulated data, with real-world data used mainly for validation and domain adaptation. If one of the world’s largest automotive companies, with deep pockets and access to factory environments, has concluded that simulation is a more efficient data source than real-world collection, that is a signal worth paying attention to.

The counterargument is that simulation is only as good as its physics engine, and no physics engine perfectly captures the messiness of the real world. Soft materials deform unpredictably. Liquids slosh. Dust accumulates on sensors. Hinges loosen over time. These subtleties are hard to simulate and easy to encounter in the real world. There is still a meaningful gap between simulated and real physics, even if that gap is shrinking.

The real advantage China might be building

If the raw data flywheel argument is weaker than it appears, does that mean China’s deployment lead does not matter? Not at all. It just means the advantage is different from what most people assume.

The real advantage of having 10,000 robots in the field is not data volume. It is operational knowledge.

When you deploy thousands of robots across dozens of factories and hundreds of use cases, you learn things that no simulation can teach you. You learn which failure modes actually occur. You learn what kind of maintenance schedule keeps robots running. You learn how to design robots that factory workers can tolerate working alongside. You learn supply chain management for actuators, batteries, and sensors at scale. You learn pricing, contracting, customer support, and all the other unglamorous aspects of turning a technology into a business.

This is the knowledge that China’s humanoid robot companies are accumulating at a pace that American companies cannot match, not because the American companies are less talented, but because they have fewer units in the field. It is the same advantage that Chinese EV manufacturers built over a decade of domestic deployment before expanding globally. The cars themselves got better, but so did the entire surrounding ecosystem of manufacturing, servicing, and selling.

Advantages

Real-world operational experience that no simulation can replicate
Supply chain maturity from managing thousands of units
Customer feedback loops driving hardware and software iteration
Manufacturing cost curves declining with volume
Workforce development for human-robot collaboration

Limitations

Most deployments are in controlled, repetitive settings
Data quality may lag behind smaller, more instrumented American fleets
Simulation progress could neutralize volume-based data advantage
US companies lead in foundation model research
Export restrictions may limit China's global deployment footprint

Where the US still has leverage

The American humanoid robotics industry has real advantages that a pure data-volume comparison obscures.

First, the AI research ecosystem. The foundation model research driving robotic intelligence is still overwhelmingly based in the United States. Google DeepMind, Physical Intelligence, OpenAI, Meta FAIR, and university labs at Stanford, MIT, CMU, and UC Berkeley are producing the architectures and training methods that define the state of the art. China has strong AI research, but in the specific subdomain of embodied AI and robotic foundation models, the US leads.

Second, the compute advantage. Training large robotic foundation models requires enormous amounts of GPU compute. American companies have access to NVIDIA’s latest hardware and the cloud infrastructure of Amazon, Google, and Microsoft. China faces ongoing semiconductor export restrictions that limit access to cutting-edge chips. This is a structural constraint on China’s ability to turn data into trained models.

Third, the quality of partnerships. Figure AI has OpenAI. Boston Dynamics has Google DeepMind. Apptronik has Google DeepMind. 1X Technologies has OpenAI. These are not just logos on a press release. These partnerships give American robotics companies direct access to the world’s best language and vision models, which are increasingly being fused with robotic control systems. China’s robotics companies are building their own AI stacks, which gives them more control but less access to the frontier.

AI research output in embodied robotics (2024-2025)

US share of top-cited papers: 73% (embodied AI and robot learning)
US VC into robotics AI: $4.5B+ (2024-2025 combined)
Top foundation models from US: 4 of 5 (RT-2, Pi-0, Octo, SuSIE)

The five scenarios

So where does this leave us? The interaction between deployment volume, data quality, simulation progress, and AI research creates a range of possible futures. Here are the five most likely.

Scenario 1: The flywheel holds. Simulation progress stalls, real-world data remains king, and China’s volume advantage compounds into an AI advantage by 2028-2029. Chinese humanoid robots become the best in the world at the tasks they are deployed for, and the gap becomes too large for American companies to close.

Scenario 2: Simulation neutralizes the gap. Simulation improves faster than expected, making real-world fleet data a “nice to have” rather than a necessity. The AI race in robotics becomes a compute and research race, which favors the United States. China’s volume lead proves commercially valuable but not strategically decisive.

Scenario 3: Data sharing changes the equation. Open-source robotics datasets, cross-company collaborations like Open X-Embodiment, and government-sponsored data commons create pools of training data that dwarf any single fleet’s output. The data flywheel becomes a collective phenomenon rather than a competitive one, and the question shifts from “who has the most robots” to “who has the best algorithms.”

Scenario 4: Specialization, not generalization, wins. The dream of a general-purpose humanoid robot gives way to the reality of specialized robots optimized for specific tasks. In this world, what matters is not total fleet size but fleet-per-task-category. China leads in manufacturing and logistics. The US leads in healthcare and defense. Neither side’s data advantage is transferable.

Scenario 5: The flywheel is real but slow. China’s data advantage does compound, but only at the slow, diminishing-returns rate that power-law scaling implies. By 2030, Chinese robotic AI is 15-25% better than American robotic AI for common tasks, but American companies offset this with superior hardware, better customer relationships in Western markets, and faster iteration on novel capabilities.

What to watch

The data flywheel debate will be resolved by evidence, not arguments. Here are the metrics that will tell us which scenario is playing out.

Sim-to-real transfer benchmarks. If simulation-trained models consistently match or beat fleet-trained models on manipulation and navigation tasks by 2027, the flywheel thesis weakens dramatically. Watch publications from DeepMind, Physical Intelligence, and NVIDIA Research.

Chinese robot task complexity. If Chinese humanoid deployments move beyond repetitive logistics into multi-step, adaptive tasks in unstructured environments, their data becomes vastly more valuable. Watch AgiBot and Unitree deployment announcements carefully.

Open-source dataset growth. If the Open X-Embodiment dataset or similar projects grow to cover millions of diverse task demonstrations, the value of proprietary fleet data diminishes. Watch the collaboration’s annual reports and the size of their training sets.

Foundation model benchmarks. If Chinese-trained robotic foundation models begin outperforming American ones on standardized manipulation and navigation benchmarks, the flywheel is working. If American models maintain their lead despite having less deployment data, the flywheel is not.

Export volumes. If Chinese humanoid robot exports to non-Chinese markets grow significantly, their data becomes more diverse (and more valuable). If exports remain limited, their data stays concentrated in Chinese industrial environments.

Timeline

Q1 2026: China passes 10,000 deployed humanoid robots. US reaches approximately 1,500
Q2 2026: NVIDIA Isaac Sim 4.0 release with improved soft-body and deformable object physics
Q3 2026: Open X-Embodiment v3 dataset release with 10x more manipulation demonstrations
Q4 2026: First standardized benchmark for robotic foundation models expected
2027: China targets 35,000+ deployed humanoids. Tesla targets 50,000+ Optimus units
2028: Expected inflection point: simulation quality may equal real-world data value for most tasks
2029: Goldman Sachs projects humanoid robot market reaches $6B annually
2030: First generation of robots trained primarily on fleet data vs simulation data will be directly comparable

The number everyone is missing

The debate over 10,000 versus 1,500 robots is important, but it risks missing the bigger picture. The number that will ultimately determine AI capability in robotics is not the number of robots in the field. It is the number of unique task-environment combinations represented in the training data.

A fleet of 10,000 robots doing 50 different tasks in 200 different environments gives you at most 10,000 task-environment combinations (50 × 200). A fleet of 1,500 robots doing 500 different tasks in 1,000 different environments gives you up to 500,000. A simulation engine generating 100,000 scenarios per day gives you numbers that dwarf both.
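The comparison reduces to a product of two counts, as the short sketch below shows. It computes an upper bound: it assumes every task is observed in every environment, which no real fleet would achieve, so the figures are ceilings rather than estimates. The function name is hypothetical.

```python
def max_combinations(tasks: int, environments: int) -> int:
    """Upper bound on coverage: every task observed in every environment."""
    return tasks * environments

large_fleet = max_combinations(tasks=50, environments=200)
small_fleet = max_combinations(tasks=500, environments=1_000)

print(f"10,000-robot fleet: up to {large_fleet:,} combinations")
print(f"1,500-robot fleet:  up to {small_fleet:,} combinations")
print(f"ratio: {small_fleet // large_fleet}x in favor of the smaller fleet")
```

Note that the number of robots does not appear in the calculation at all. Fleet size only limits how quickly the combinations can be visited, not how many exist.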

China’s volume lead is real and commercially significant. It creates a manufacturing advantage, a supply chain advantage, and an operational knowledge advantage that will be very difficult to replicate. But the claim that it will automatically translate into an AI advantage requires assumptions about the nature of machine learning that are, at best, contested by the evidence.

The data flywheel is not a myth. But it is not a law of nature either. It is a hypothesis, and the next two to three years will determine whether it was right.

The bottom line

Volume: China's proven advantage (10,000+ robots, manufacturing scale, national data infrastructure)

Intelligence: America's current advantage (foundation models, compute access, research ecosystem)

The real risk for the United States is not that China’s 10,000 robots will produce unbeatable AI. It is that while American companies debate the theoretical merits of real-world data versus simulation, Chinese companies are simply shipping robots, collecting data, iterating on hardware, and building the operational muscle that turns a technology into an industry. The flywheel that matters most might not be about data at all. It might be about the unglamorous, compounding advantage of doing the work.

Sources

  1. Counterpoint Research - Global Humanoid Robot Shipments 2025 - accessed 2026-03-28
  2. Goldman Sachs - Rise of the Humanoids Report - accessed 2026-03-28
  3. Google DeepMind - RT-2: Vision-Language-Action Models - accessed 2026-03-28
  4. Toyota Research Institute - Large Behavior Models for Manipulation - accessed 2026-03-28
  5. Waymo Safety Report - Autonomous Driving Data - accessed 2026-03-28
  6. MIIT - Humanoid Robot Innovation and Development Guidelines - accessed 2026-03-28
  7. IEEE Spectrum - Why Robot Learning Needs Diverse Data, Not Just More Data - accessed 2026-03-28
  8. Scale AI - Chinchilla Scaling Laws Applied to Robotics - accessed 2026-03-28
  9. AgiBot Open-Source Robotics Datasets - GitHub - accessed 2026-03-28
  10. Reuters - Baidu Apollo Autonomous Driving Fleet Data - accessed 2026-03-28
  11. Physical Intelligence - Pi0 Foundation Model for Robots - accessed 2026-03-28
  12. MIT Technology Review - The Real Bottleneck in Robotics AI - accessed 2026-03-28
  13. South China Morning Post - China Robot Deployment Surge - accessed 2026-03-28
  14. Nature - Sim-to-Real Transfer in Robotic Manipulation - accessed 2026-03-28
  15. Open X-Embodiment Collaboration - Scaling Robot Learning - accessed 2026-03-28
