
AI Data Curator - Foundational Model
Soket AI
Job Description
Role Summary
Company: Soket AI
Location: Bangalore (On-Site)
Experience: 0–1 Years
Compensation: ₹18,00,000 – ₹25,00,000 per annum
Type: Full-Time
About Soket AI
Soket AI is an AI research company building next-generation foundation models in reasoning, mathematics, coding, multilingual intelligence, speech, and multimodal AI.
Supported by the IndiaAI Mission and driven by the goal of creating efficient, open, and accessible AI systems, Soket is focused on advancing frontier AI research for India, the Global South, and the world.
About the Role
Training great AI models starts with great data.
As an AI Data Curator, you'll work at the heart of frontier AI development: discovering, curating, evaluating, and improving the datasets that power large language models and reasoning systems.
You'll collaborate closely with research scientists, data engineers, and model teams to ensure training data is diverse, high-quality, safe, and aligned with evolving model capabilities.
This role is ideal for candidates who enjoy working with large datasets, have strong analytical skills, and are excited about shaping the data behind next-generation AI systems.
What You'll Do
- Source, curate, and organize large-scale datasets across web, code, documents, speech, and synthetic data sources
- Build datasets that improve capabilities in reasoning, coding, mathematics, multilingual understanding, and instruction following
- Support web crawling, data acquisition, and large-scale ingestion pipelines
- Evaluate dataset quality using manual review and automated validation methods
- Detect and remove duplicated, noisy, unsafe, low-quality, or contaminated data
- Design and execute annotation, labeling, and quality assurance workflows
- Curate synthetic datasets including reasoning traces, tool-use trajectories, and preference datasets
- Maintain dataset versioning, metadata, provenance, and reproducible workflows
- Monitor benchmark contamination and training-test overlap
- Collaborate with research and engineering teams to improve data quality and model performance
- Continuously enhance dataset diversity, coverage, and reliability
Requirements
- Bachelor's or Master's degree in Computer Science, Data Science, Artificial Intelligence, Computational Linguistics, Information Science, or related fields
- Strong analytical thinking and attention to detail
- Interest in AI, machine learning, and large language models
- Experience working with datasets, annotations, research corpora, or AI training data
- Ability to evaluate data quality and make nuanced decisions
- Strong collaboration and communication skills
Preferred Skills
Data & Programming
- Python
- Pandas
- Hugging Face Datasets
- Apache Spark
- Apache Arrow
- JSONL and Parquet workflows
- Dask
Data Collection & Processing
- Web crawling and data extraction
- API-based data acquisition
- HTML parsing
- Large-scale document processing
- Scrapy
- BeautifulSoup
- Selenium
Data Quality & Validation
- Deduplication and contamination detection
- Metadata enrichment
- Corpus diversity analysis
- FAISS
- Elasticsearch
- MinHash and LSH techniques
Annotation & Evaluation
- Human-in-the-loop annotation workflows
- Reasoning trace evaluation
- Code correctness validation
- Mathematical solution review
- Preference and ranking datasets
- Quality assurance systems
Speech & Multimodal Data (Bonus)
- Speech and audio dataset curation
- ASR and TTS datasets
- Familiarity with Common Voice, OpenSLR, or similar datasets
Collaboration Tools
- Git
- GitHub
Who Will Thrive Here?
- Candidates who enjoy organizing and improving large-scale datasets
- Individuals passionate about AI research and foundation models
- Detail-oriented problem solvers who care about data quality
- Builders who enjoy working at the intersection of research and engineering
- Curious learners eager to understand how state-of-the-art AI systems are trained
Why Join Soket AI?
- Work on frontier AI problems tackled by only a handful of teams globally
- Contribute directly to the development of large-scale foundation models
- Collaborate with top researchers and engineers in AI
- Gain exposure to cutting-edge research in reasoning, coding, multilingual AI, speech, and multimodal systems
- Work with supercomputing-scale infrastructure and large-scale datasets
- Help build open and accessible AI systems for India and the Global South
Who Should Apply?
Recent graduates and early-career professionals who are passionate about AI, data quality, machine learning, and building the datasets that power the next generation of intelligent systems.