China Autonomous Driving Data Closed Loop Research Report, 2025

Data Closed-Loop Research: Synthetic Data Accounts for Over 50%, Full-process Automated Toolchain Gradually Implemented

Key Points:

From 2023 to 2025, the share of synthetic data in autonomous driving training data rose from 20%-30% to 50%-60%, making it a core resource for covering long-tail scenarios.
A full-process automated toolchain, from collection to deployment, is gradually being implemented, helping to reduce costs and improve efficiency.
Efficient collaboration within the vehicle-cloud integrated data closed-loop is a key factor in achieving faster iterations.

At its core, the autonomous driving data closed-loop is a cyclic optimization system of "collection-transmission-processing-training-deployment". In 2025, the industry is moving quickly from the "0→1" stage to an era of "high quality and high efficiency", and the central challenges are long-tail scenario coverage and cost control. OEMs and Tier 1 suppliers are actively building their own data closed-loop solutions. With efficient data collection, processing, and analysis, they continuously improve their autonomous driving algorithms and thereby the accuracy and stability of intelligent driving systems.

I. From 2023 to 2025, the Proportion of Synthetic Data Increased from 20%-30% to Over 50%

The efficiency of acquiring high-quality data determines how fast intelligent driving evolves. Current data sources in the automotive field include triggered data uploads from mass-produced vehicles, collection of high-value scenario data by dedicated collection vehicles, reconstruction of the physical world from real roadside data, and data synthesis based on world models. The core path to large-scale application of autonomous driving technology is this: real data anchors basic capabilities, while synthetic data extends capability boundaries. From 2023 to 2025, the mix of real and synthetic data in autonomous driving training data changed significantly, shifting from an early model dominated by real data to a hybrid model with a growing share of synthetic data.

2023: Real data dominates, synthetic data gets started (synthetic data accounts for 20%-30%): Real data was still the mainstay and was used mainly for training on basic scenarios, but it covered long-tail scenarios poorly. For example, Tesla initially relied on real road data from more than one million vehicles, yet collecting extreme scenarios (such as a pedestrian darting into the road in heavy rain) was inefficient. Synthetic data accounted for about 20%-30% and was used mainly to supplement long-tail scenarios. Experiments by Applied Intuition show that adding 30% synthetic data in which cyclists appear frequently to the real data significantly improves the perception model's recognition accuracy (mAP score) for cyclists.
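The mixing step described above can be sketched as follows. This is a minimal illustration, not Applied Intuition's procedure: the function names, the sample representation, and the reading of "30%" as the synthetic share of the final mixed set are all assumptions made here.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # hypothetical sample record: (source, index)


def synthetic_needed(n_real: int, synthetic_share: float) -> int:
    """Number of synthetic samples so that they make up `synthetic_share`
    of the final mixed set: n_syn / (n_real + n_syn) = synthetic_share."""
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")
    return round(n_real * synthetic_share / (1.0 - synthetic_share))


def mix_datasets(real: Sequence[Sample], synthetic_pool: Sequence[Sample],
                 synthetic_share: float, seed: int = 0) -> List[Sample]:
    """Keep all real samples, draw the required number of synthetic samples
    without replacement, and shuffle the combined training set."""
    k = synthetic_needed(len(real), synthetic_share)
    if k > len(synthetic_pool):
        raise ValueError("synthetic pool is too small for the requested share")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = list(real) + rng.sample(list(synthetic_pool), k)
    rng.shuffle(mixed)
    return mixed


real_set = [("real", i) for i in range(7000)]
synthetic_set = [("synthetic", i) for i in range(5000)]
training_set = mix_datasets(real_set, synthetic_set, synthetic_share=0.3)
```

With 7,000 real samples, a 30% synthetic share of the final set requires 3,000 synthetic samples. If "30%" is instead read relative to the real data volume, the count would be 0.3 times the number of real samples.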

2024: Synthetic data penetration accelerates (share rises to 40%-50%): Synthetic data was upgraded from an "auxiliary tool" to a "core means of production". Its share rising to 40%-50% marks the entry of intelligent driving into a new data-driven paradigm. At the end of 2024, the Shanghai High-level Autonomous Driving Demonstration Zone launched a plan involving 100 data collection vehicles; under its hybrid model of "real data collection + world model-generated virtual data", the share of synthetic data is close to 50%. As another example, NVIDIA DRIVE Sim generates synthetic data of distant objects (100-350 meters) to address the sparsity of real annotations at that range. After 92,000 synthetic images were added, the detection accuracy (F1 score) for vehicles 200 meters away improved by 33%.

2025: Synthetic data takes the lead (share exceeds 50%): The ratio of synthetic to real data is moving towards "5:5" or even higher. Academician Wu Hequan has pointed out that about 90% of L4/L5 training uses simulation data, with only 10%-20% of real data retained as a "gene pool" to keep models from drifting. Li Auto offers an example of innovative use of synthetic data: it uses world models to reconstruct historical scenarios and expand them into variants (for instance, rendering an ordinary intersection in rainy-night and foggy conditions), and automatically generates extreme cases for iterative training. Synthetic data accounts for more than 90% of the data at Li Auto, taking the place of real-vehicle testing for verifying reliability.

According to Lang Xianpeng of Li Auto, the company's effective real-vehicle test mileage in 2023 was about 1.57 million kilometers, at a cost of 18 yuan per kilometer. By the first half of 2025, a total of 40 million kilometers had been tested, including only 20,000 kilometers of real-vehicle testing and 38 million kilometers of synthetic data, and the average test cost had dropped to 0.5 yuan per kilometer. Test quality is also high: scenarios can be generalized from a single instance, and tests can be fully re-run.
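The scale of the cost change can be worked out from the quoted per-kilometer figures. The short calculation below is only an illustration: the total costs and the percentage reduction are derived here from the stated mileage and unit costs and are not figures given by Li Auto.

```python
# Figures quoted in the text above.
REAL_KM_2023 = 1_570_000         # effective real-vehicle test mileage, 2023
COST_PER_KM_2023 = 18.0          # yuan per kilometer, 2023
TOTAL_KM_2025_H1 = 40_000_000    # total tested mileage by H1 2025
COST_PER_KM_2025_H1 = 0.5        # average yuan per kilometer by H1 2025


def total_cost(kilometers: int, cost_per_km: float) -> float:
    """Implied total test cost in yuan (derived, not a reported figure)."""
    return kilometers * cost_per_km


cost_2023 = total_cost(REAL_KM_2023, COST_PER_KM_2023)
cost_2025_h1 = total_cost(TOTAL_KM_2025_H1, COST_PER_KM_2025_H1)

# Relative drop in cost per kilometer between the two periods.
per_km_reduction = 1.0 - COST_PER_KM_2025_H1 / COST_PER_KM_2023

print(f"Implied 2023 cost:    {cost_2023 / 1e6:.2f} million yuan")
print(f"Implied H1 2025 cost: {cost_2025_h1 / 1e6:.2f} million yuan")
print(f"Per-km cost reduction: {per_km_reduction:.1%}")
```

Under these figures, roughly 25 times the mileage was covered for a lower implied total cost, which is the efficiency argument the text makes for synthetic testing.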

The advantages of synthetic data lie not only in cost and efficiency but also in a value density that goes beyond human experience. Synthetic data can be generated in batches at very low cost, which suits the high-frequency training needs of AI; it can also generate extreme corner cases that "humans have not experienced but that comply with physical laws".

II. Full-process Automated Toolchain from Collection to Deployment is Gradually Implemented, Helping Reduce Costs and Improve Efficiency

The autonomous driving data closed-loop has shifted from an early focus on single links (such as improving annotation efficiency) to an end-to-end automated architecture covering "collection-annotation-training-simulation-deployment". The core advance is the removal of data flow barriers through large AI models and cloud-edge collaboration, which allows the closed loop to evolve on its own.

LiangDao Intelligence's LD Data Factory is a full-link 4D ground truth solution covering collection through delivery. The LD Data Factory toolchain has been delivered to more than a dozen automotive OEMs and Tier 1 suppliers in China, Germany, and Japan. Its automated 4D annotation software has annotated more than 3,300 hours of road-collected data for customers, producing high-quality 4D continuous-frame ground truth. By mid-2025, LiangDao Intelligence had delivered more than 55 million frames of data to a well-known German luxury car brand.

LD Data Factory integrates "data collection, automated annotation, manual annotation, quality control, and performance evaluation". The toolchain includes AI preprocessing and VLM-assisted collection, an automated annotation module for target detection, a fully closed loop for automatic quality inspection, and support for hybrid cloud and private deployment. Data management and task collaboration run on a unified data management platform, and the core modules include time synchronization and spatial calibration, distributed storage and indexing services, the visual annotation platform LDEditor (full-stack annotation), the automated quality control module LD Validator, and the perception performance evaluation module LD KPI.

MindFlow's main products currently include an integrated data annotation platform, a data management platform (including a vector database), and a model training platform, covering the entire value chain from raw data to model deployment. Users can complete the whole algorithm development process in one place without switching between tools or platforms. The technical highlights of its third-generation MindFlow SEED platform include support for 4D point cloud annotation (lane lines, segmentation), RPA automated processes, and AI pre-annotation covering more than 4,000 functional modules.

MindFlow's current customers include SAIC Group, Changan Automobile, Great Wall Motors, Geely Automobile, FAW Group, Li Auto, Huawei, Bosch, ECARX, MAXIEYE, NavInfo, and RoboSense.

III. Efficient Collaboration of the Vehicle-Cloud Integrated Data Closed-Loop is a Key Factor in Achieving Faster Iterations

The essence of the vehicle-cloud integrated data closed-loop is a collaborative system of "vehicle-side lightweight + cloud-side intelligence" that removes data flow barriers and lets intelligent vehicles evolve continuously. The vehicle side collects environmental perception data in real time (such as road conditions and vehicle operation data) and uploads it to the cloud after desensitization, encryption, and compression. The cloud processes massive volumes of data (PB/EB level), performs annotation, model training, and algorithm optimization, and pushes the resulting new capabilities back to vehicles through OTA upgrades.

The ExceedData data closed-loop solution is a vehicle-cloud integrated solution that has been adopted in mass production by more than 15 automotive OEMs and is deployed in more than 30 mainstream models.

The ExceedData data closed-loop solution consists of the vehicle-side edge computing engine (vCompute), edge data engine (vADS), and edge database (vData), together with the cloud-side algorithm development tool (vStudio), cloud computing engine (vAnalyze), and cloud management platform (vCloud). The solution can reduce data transmission costs by 75%, cloud storage costs by 90%, and cloud computing costs by 33%. According to calculations from one OEM working with ExceedData, total cost can be reduced by 85%.
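How per-category savings combine into a total depends on how much each category contributes to the baseline cost, which the text does not state. The sketch below shows the weighted-average arithmetic under that framing; the baseline cost shares are purely hypothetical values chosen for illustration and are not from ExceedData or the OEM case.

```python
from typing import Dict

# Per-category reductions quoted in the text above.
SAVINGS: Dict[str, float] = {
    "transmission": 0.75,
    "storage": 0.90,
    "compute": 0.33,
}


def total_saving(baseline_share: Dict[str, float],
                 savings: Dict[str, float] = SAVINGS) -> float:
    """Overall cost reduction as the baseline-share-weighted average of the
    per-category reductions. Shares must sum to 1."""
    if abs(sum(baseline_share.values()) - 1.0) > 1e-9:
        raise ValueError("baseline shares must sum to 1")
    return sum(share * savings[name] for name, share in baseline_share.items())


# Hypothetical split of baseline cost across the three categories.
example_share = {"transmission": 0.2, "storage": 0.7, "compute": 0.1}
print(f"Total saving under the assumed split: {total_saving(example_share):.1%}")
```

Because the total is a weighted average, it always lies between the smallest and largest per-category reduction, and it moves towards the storage figure as storage takes up more of the baseline cost.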

Among OEMs, XPeng Motors is one example. Its self-built "cloud-side model factory" has a computing power reserve of 10 EFLOPS in 2025, and its end-to-end iteration cycle has been shortened to an average of 5 days, supporting a rapid closed loop from cloud-side pre-training to vehicle-side model deployment.

XPeng launched China's first 72-billion-parameter multimodal world base model for L4 high-level autonomous driving. The model has chain-of-thought (CoT) reasoning capabilities, can simulate human common-sense reasoning, and can generate control signals. Through model distillation, the capabilities of the base model are transferred to a small vehicle-side model, enabling personalized deployment that is "small in size and high in intelligence".
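The text does not say which distillation method XPeng uses. As a general illustration of the idea, the sketch below implements the widely used soft-target distillation loss, in which the small "student" model is trained to match the temperature-softened output distribution of the large "teacher" model. The function names and the temperature value are assumptions made for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: Sequence[float],
                      student_logits: Sequence[float],
                      temperature: float = 2.0) -> float:
    """KL divergence from the softened teacher distribution to the softened
    student distribution, scaled by temperature squared (the usual convention
    for keeping gradient magnitudes comparable across temperatures)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [2.0, 1.0, 0.1]
matching_student = [2.0, 1.0, 0.1]
diverging_student = [0.1, 1.0, 2.0]
print(distillation_loss(teacher, matching_student))   # zero when outputs match
print(distillation_loss(teacher, diverging_student))  # positive when they differ
```

In practice this term is usually combined with the ordinary task loss on labeled data; that combination is omitted here to keep the example short.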

High-value data (such as corner cases) is first screened by a rule engine on the vehicle side. The cloud then uses synthetic data generation techniques (such as GANs and diffusion models) to fill data gaps and improve model generalization. Meanwhile, end-to-end (E2E) and VLA models take multimodal inputs and directly output control commands; they rely on large-model training in the cloud (such as XPeng's 72-billion-parameter base model) to achieve lightweight deployment on the vehicle side.
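A vehicle-side rule engine of the kind mentioned above can be sketched as a list of named predicates evaluated against each recorded frame, with only the frames that trigger at least one rule kept for upload. Everything in this example is hypothetical: the `Frame` fields, the rule names, and the thresholds are illustrative and do not describe any particular vendor's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical per-frame signals available on the vehicle."""
    speed_kmh: float
    decel_ms2: float              # positive values mean braking
    driver_takeover: bool
    min_detection_conf: float     # lowest confidence among detected objects


Rule = Tuple[str, Callable[[Frame], bool]]

# Illustrative rules and thresholds (assumptions, not production values).
RULES: List[Rule] = [
    ("hard_brake", lambda f: f.decel_ms2 >= 6.0),
    ("driver_takeover", lambda f: f.driver_takeover),
    ("low_confidence", lambda f: f.min_detection_conf < 0.3),
]


def triggered_rules(frame: Frame, rules: Sequence[Rule] = RULES) -> List[str]:
    """Names of all rules that fire on this frame, in rule order."""
    return [name for name, predicate in rules if predicate(frame)]


def select_for_upload(frames: Sequence[Frame]) -> List[Tuple[Frame, List[str]]]:
    """Keep only frames that trigger at least one rule, tagged with the reasons."""
    selected = []
    for frame in frames:
        hits = triggered_rules(frame)
        if hits:
            selected.append((frame, hits))
    return selected


recorded = [
    Frame(speed_kmh=60.0, decel_ms2=1.0, driver_takeover=False, min_detection_conf=0.9),
    Frame(speed_kmh=80.0, decel_ms2=7.5, driver_takeover=False, min_detection_conf=0.8),
    Frame(speed_kmh=30.0, decel_ms2=0.5, driver_takeover=True, min_detection_conf=0.2),
]
upload_queue = select_for_upload(recorded)
```

Tagging each selected frame with the rules it triggered lets the cloud side route data by scenario type and see which gaps synthetic data still needs to fill.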

As the entire intelligent driving system becomes model-based, car companies are pursuing "better cost, higher efficiency, and more stable services" in the data closed-loop. The delivery model for intelligent driving is shifting rapidly from delivering code for single-vehicle deployment to subscription-based cloud services at the core. An efficiently collaborating vehicle-cloud data closed-loop is the key to faster, AI-driven iteration of intelligent vehicles.

1 Overview/Trends of Autonomous Driving Data Closed-Loop

1.1 Overview of Data Closed-Loop

One-stop Cases of Data Intelligence Platforms
Comparison of Data Closed-Loop Deployment Case Strategies

1.2 Data Closed-Loop Moves Towards the Era of Full-Stack Self-Evolution
1.3 Summary of Data Closed-Loop Progress Cases
1.4 Data Closed-Loop Cooperation Models
1.5 Summary of OEMs’ Data Closed-Loop Related Cooperation
1.6 Trend 1
1.7 Trend 2
1.8 Trend 3
1.9 Trend 4
1.10 Trend 5
1.11 Trend 6

Accelerated Popularization of High Computing Power on the Vehicle Side
Comparison of Major Autonomous Driving Chips
Comparison of Cloud-Side Computing Power and Intelligent Computing Centers
Case Analysis of Intelligent Computing Centers

2 Research on High-Quality Data Collection/Synthetic Simulation

2.1 High-Quality Data Collection

Case 1: Lan-You Technology
Case 2: Kunyi Electronic
Case 3: TZTEK
Case 4: Keymotek
Case 5: EMQ Technologies
Case 6: ExceedData
Case 7: CARLINX
Case 8: YOOTTA

2.2 Synthetic/Simulation Data

Overview of Autonomous Driving Synthetic Data
Advantages and Challenges of Synthetic Data
Summary of Synthetic Data Application Scenarios
Changes in the Proportion of Synthetic Data Applications
World Model-Based Data Synthesis Technology
Case 1: Synkrotron
Toolchain Products
Data Management Platform
Synthetic Data Solutions
Traffic Flow Synthetic Data Platform for Advanced Intelligent Driving
Case 2: 51SIM
End-to-End Data-Driven Closed-Loop
Case 3: WayLancer
Data Products
Data Closed-Loop
Case 4: ThousandSim
Case 5: Lightwheel AI

3 Research on Data Storage/Processing

3.1 Data Storage

Case 1: JOYNEXT
Case 2: MacrooSAN Technology
Case 3: Alibaba Cloud
Case 4: Baidu
Case 5: Tencent Intelligent Mobility
Case 5: Tencent Data Closed-Loop Platform
Case 6: AWS

3.2 Efficient Data Processing

Case 1: Lan-You Technology
Case 2: ExceedData
Case 3: Keymotek
Case 4: Synkrotron
Case 5: Alibaba Intelligent Driving Data Preprocessing Solution

4 Research on Automated (AI) Annotation

Summary: Comparison of Automated Annotation Solutions

4.1 Rere Data

Profile
Intelligent Driving Solutions of Rere Data
Enable AI Intelligent Data Annotation Platform
Data Collection Services
Data Security Management

4.2 MindFlow

Profile
Data Service Solutions
Third-Generation MindFlow SEED Platform
4D Point Cloud Processing
Development Dynamics

4.3 StardustAI

Profile
Self-Developed Algorithms
Rosetta Annotation Platform
MorningStar AI Data Management Platform
COSMO Large Model Data Pyramid Solution
Autonomous Driving Service Scenarios
Autonomous Driving Service Cases and Technical Capabilities
Data Annotation Service Customers

4.4 Datatang

Intelligent Driving Solutions
Intelligent Driving Training Datasets
Comparison of Intelligent Driving Training Datasets
Shujiajia Pro Artificial Intelligence Data Annotation Platform

4.5 Databaker Technology

Profile
Development History
Easy Collection Tool
4D-BEV Annotation Tool
AI Data Platform
Large Model Data Solutions
Large-Scale High-Quality Datasets

4.6 Boden AI

Profile
Product Matrix
Datasets
Autonomous Driving Datasets
Autonomous Driving Solutions
BASE Data Annotation Platform
4D Point Cloud Annotation
BBot Agent Platform
Cooperation Cases

4.7 ByteTree AI

Technology Layout
Full-link Data Services
Shanhai Data Management Platform
Intelligent Driving Data Closed-Loop Solutions
Ground Truth Reuse Solutions
4D Dynamic Automated Annotation Large Model
Data Closed-Loop Capabilities
Cooperation Partners
Cooperation Cases

5 Research on Algorithms and Model Training

Algorithm Evolution
Algorithm Architecture Evolution
Comparison of Core Algorithm Architectures of OEMs
VLA Development Status
Latest Progress of VLA Solutions in Data Closed-Loop
Latest Progress of OEMs/Tier 1s in VLA Solutions
Comparative Analysis of Advanced Large Models
Case 1: DeepRoute.ai
Intelligent Driving Mileage and Commercialization Progress
End-to-End Technical Solutions
Data Closed-Loop Capabilities
Case 2: Nullmax
Data Closed-Loop Technology Progress
Platform-Based BEV-AI Architecture Design
One-Model End-to-End Core Technology
MaxDrive Platform-Based Solutions
Latest Development Dynamics
Case 3: iMotion Automotive Technology
Core Competitiveness
Intelligent Driving Technology and Model Training
Large Model R&D System
Advanced Parking and Driving Algorithm Platform and Products
Data Closed-Loop Capabilities
Case 4: Momenta
Data Closed-Loop and Mass Production Implementation
Mass Production/Cooperation Dynamics
Overview of World Models
Summary of Latest World Models
Core Architectures of Mainstream World Models
Development Direction of Synthetic Data for World Models
Case 1: SenseAuto
Case 2: YOOTTA
Case 3: Company H
Case 4: Horizon Robotics
Case 5: Xiaomi
Case 6: Wayve

6 Research on Representative Suppliers of Data Closed-Loop Technology

Summary: Comparison of Data Closed-Loop Technology Solutions of Representative Suppliers

6.1 WUWEN.AI

Profile
Core Technologies
Data Closed-Loop Management Platform
Simulation Verification Platform
AI Data Annotation Platform

6.2 LiangDao Intelligence

Data Factory
Core Modules of Data Factory
4D Ground Truth Toolchain
Continuous Frame 4D Annotation
Customers

6.3 ExceedData

Profile
Data Base
Vehicle-Cloud Full-Stack Products
Vehicle-Cloud Computing Engine
Empowerment of Vehicle-Cloud Computing Architecture
vADS Intelligent Driving Data Engine
vData Edge Database
vStudio Algorithm Development Tool

6.4 Freetech

Intelligent Driving Platform ODIN3.0
FUZE Middleware Platform
Software and Algorithms
Data Closed-Loop Services
Product Matrix
Development Dynamics
Cooperation Partners

6.5 MAXIEYE

Haishi Data Intelligent System
Mass Production Data Mileage

6.6 Ruqi Mobility

Data Closed-Loop Flywheel
Annotation Base
Operation Data

6.7 Yoocar

Profile
Development Updates
“DriveCloud” Intelligent Computing Solution

6.8 Roadgrids

Automated Mass Production Mapping Capability Technology Architecture
Data Closed-Loop

6.9 NavInfo

AI Infra-Empowered Data Closed-Loop
Services/Cloud Cooperation

6.10 Kotei Informatics

7 Research on Typical OEMs’ Data Closed-Loop

7.1 XPeng Motor

Summary of Data Closed-Loop and Software Supply Chain
Computing Power Data Center and Platform
Data Management Platform
Autonomous Driving Base Model

7.2 Xiaomi Auto

Summary of Data Closed-Loop and Software Supply Chain
Delivery Data Statistics
Data Training
End-to-End Assisted Driving
Intelligent Driving Physical World Modeling System
End-to-End OTA Deployment

7.3 NIO

Summary of Data Closed-Loop and Software Supply Chain
World Model
Delivery Data
Intelligent Function OTA Deployment

7.4 Li Auto

Summary of Data Closed-Loop and Software Supply Chain
Model Training
E2E Training Data Scale
Intelligent Driving Data Training Volume
Intelligent Function OTA Deployment

7.5 Leapmotor

Summary of Data Closed-Loop and Software Supply Chain
Delivery Data Analysis

7.6 IM Motors

Summary of Data Closed-Loop and Software Supply Chain

7.7 Tesla

Summary of Data Closed-Loop and Software Supply Chain

7.8 BYD

Summary of Data Closed-Loop and Software Supply Chain
End-to-End Large Model Training
Summary of DiPilot Series Assisted Driving Functions
Global Deployment

7.9 Geely Automobile

Summary of Data Closed-Loop and Software Supply Chain
Full-domain AI Intelligence
Brand/Global Production Capacity Layout

7.10 FAW Group

Summary of Data Closed-Loop and Software Supply Chain
Intelligent Upgrade

7.11 GAC

Data Closed-Loop System

7.12 Summary of Changan Automobile Data Closed-Loop and Software Supply Chain
7.13 Dongfeng Motor's "One Core, Two Bases, Two Elements" System
7.14 Summary of Dongfeng Nissan Data Closed-Loop and Software Supply Chain
7.14 Dongfeng Nissan Autonomous Driving Software Solutions and Supply Chain Construction
7.15 Summary of Volkswagen Data Closed-Loop and Software Supply Chain
7.16 Summary of Toyota Data Closed-Loop and Software Supply Chain

Companies Mentioned

Lan-You Technology
Kunyi Electronic
TZTEK
Keymotek
EMQ Technologies
ExceedData
CARLINX
YOOTTA
Synkrotron
51SIM
WayLancer
ThousandSim
Lightwheel AI
Rere Data
MindFlow
StardustAI
Datatang
Databaker Technology
Boden AI
ByteTree AI
DeepRoute.ai
Nullmax
iMotion Automotive Technology
Momenta
SenseAuto
Company H
Horizon Robotics
Xiaomi
Wayve
WUWEN.AI
LiangDao Intelligence
Freetech
MAXIEYE
Ruqi Mobility
Yoocar
Roadgrids
NavInfo
Kotei Informatics
XPeng Motor
Xiaomi Auto
NIO
Li Auto
Leapmotor
IM Motors
Tesla
BYD
Geely Automobile
FAW Group
GAC

License	Format	Properties	Price
SINGLE USER LICENSE PDF	The product is a PDF.	This is a single user license, allowing one user access to the product.	€3,835 EUR / $4,300 USD / £3,325 GBP
ENTERPRISE LICENSE PDF	The product is a PDF.	This is an enterprise license, allowing all employees within your organization access to the product.	€5,707 EUR / $6,400 USD / £4,949 GBP
