Silicon Valley startup taps India’s gig workers to train global AI and robotics systems

Human Archive, a startup founded by researchers from UC Berkeley and Stanford University, is leveraging India’s vast gig economy to collect physical training data for artificial intelligence and robotics development. The company is recruiting gig workers across India to wear camera-equipped caps and sensor devices that capture real-world visual and spatial information—data that AI labs and robotics companies globally are competing intensely to acquire for training autonomous systems.

The data collection model represents a significant shift in how AI and robotics companies source training datasets. Rather than relying solely on synthetic data or expensive in-house collection efforts, Human Archive is tapping into India’s established freelance workforce, where labor costs are substantially lower than in the United States or Europe. This approach mirrors the broader trend of offshoring data annotation and collection work to South Asia, a practice that has grown rapidly as generative AI and robotics companies race to build more sophisticated models requiring millions of hours of real-world video and sensor data.

The stakes in this competition are substantial. Physical AI—systems that can perceive and interact with the real world—represents one of the fastest-growing segments of artificial intelligence development. Companies like Tesla, Boston Dynamics, Figure AI, and numerous robotics startups require vast amounts of real-world video footage, depth sensing data, and environmental context to train models capable of performing complex physical tasks. The market for such training data is estimated to be worth billions of dollars, and access to quality datasets has become a critical competitive advantage in the emerging autonomous robotics sector.

India presents a particularly attractive location for such data collection efforts. The country’s gig economy, which spans global platforms like Upwork and Fiverr as well as India-specific services like Urban Company, involves millions of workers accustomed to task-based, flexible employment. Labor costs in India remain 60-70% lower than in comparable Western markets, allowing companies like Human Archive to deploy large numbers of data collectors. Additionally, India’s diverse urban and semi-urban environments provide varied real-world contexts that improve the robustness of AI training datasets. The country’s existing infrastructure for remote work and digital payments facilitates efficient management of distributed data collection teams.

The arrangement benefits multiple stakeholders, though to varying degrees. For gig workers in India, the opportunity represents additional income with relatively low barriers to entry: workers need only don the equipment and move through their daily environments. For AI and robotics companies, the approach provides access to high-volume, geographically diverse real-world data at significantly lower cost than the alternatives. For India’s technology sector, the development positions the country as a critical node in the global AI supply chain, extending beyond traditional IT services into the emerging AI training data economy. However, labor advocates have raised concerns about fair compensation, data privacy protections, and the concentration of wealth creation in Silicon Valley while collection labor occurs in emerging markets.

The broader implications of this model extend across multiple dimensions of the global technology landscape. It exemplifies how capital-intensive AI development is concentrating among well-funded Western companies while labor-intensive components are distributed to lower-cost jurisdictions, a pattern that echoes historical outsourcing dynamics, this time in the high-technology sector. For Indian workers and policymakers, the question is whether this represents genuine economic opportunity or a new form of resource extraction. The Government of India has been promoting the country as a destination for AI development and innovation, but this data collection model highlights how India may be positioned as a labor provider rather than as an innovator or intellectual property generator. Questions also arise regarding data sovereignty: whether video and sensor data collected from Indian citizens and environments should be considered Indian intellectual or economic property.

Looking ahead, the success or failure of Human Archive’s model will likely influence how other AI and robotics companies approach data collection at scale. If the arrangement proves profitable and produces high-quality training data, similar startups are likely to emerge, potentially creating a competitive market for India-based gig data collection. Conversely, regulatory scrutiny around data privacy, worker protections, and labor standards could constrain growth in this emerging sector. Indian regulators and policymakers may face pressure to establish frameworks governing such data collection activities, similar to how data localization policies have been implemented in other contexts. The development also underscores the growing importance of the AI training data economy, an often-invisible but strategically critical component of the global technology infrastructure that warrants closer attention from investors, technologists, and policymakers alike.

Vikram

Vikram is an independent journalist and researcher covering South Asian geopolitics, Indian politics, and regional affairs. He founded The Bose Times to provide independent, contextual news coverage for the subcontinent.