AI Data Curator - Foundational Model

Soket AI

Bangalore, Karnataka
Full-time
₹18,00,000 – ₹25,00,000 per annum

Job Description

Role Summary

Company: Soket AI

Location: Bangalore (On-Site)

Experience: 0–1 Years

Compensation: ₹18,00,000 – ₹25,00,000 per annum

Type: Full-Time

About Soket AI

Soket AI is an AI research company building next-generation foundation models in reasoning, mathematics, coding, multilingual intelligence, speech, and multimodal AI.

Supported by the IndiaAI Mission and driven by the goal of creating efficient, open, and accessible AI systems, Soket is focused on advancing frontier AI research for India, the Global South, and the world.

About the Role

Training great AI models starts with great data.

As an AI Data Curator, you'll work at the heart of frontier AI development—discovering, curating, evaluating, and improving the datasets that power large language models and reasoning systems.

You'll collaborate closely with research scientists, data engineers, and model teams to ensure training data is diverse, high-quality, safe, and aligned with evolving model capabilities.

This role is ideal for candidates who enjoy working with large datasets, have strong analytical skills, and are excited about shaping the data behind next-generation AI systems.

What You'll Do

  • Source, curate, and organize large-scale datasets across web, code, documents, speech, and synthetic data sources
  • Build datasets that improve capabilities in reasoning, coding, mathematics, multilingual understanding, and instruction following
  • Support web crawling, data acquisition, and large-scale ingestion pipelines
  • Evaluate dataset quality using manual review and automated validation methods
  • Detect and remove duplicated, noisy, unsafe, low-quality, or contaminated data
  • Design and execute annotation, labeling, and quality assurance workflows
  • Curate synthetic datasets including reasoning traces, tool-use trajectories, and preference datasets
  • Maintain dataset versioning, metadata, provenance, and reproducible workflows
  • Monitor benchmark contamination and training-test overlap
  • Collaborate with research and engineering teams to improve data quality and model performance
  • Continuously enhance dataset diversity, coverage, and reliability

Requirements

  • Bachelor's or Master's degree in Computer Science, Data Science, Artificial Intelligence, Computational Linguistics, Information Science, or a related field
  • Strong analytical thinking and attention to detail
  • Interest in AI, machine learning, and large language models
  • Experience working with datasets, annotations, research corpora, or AI training data
  • Ability to evaluate data quality and make nuanced decisions
  • Strong collaboration and communication skills

Preferred Skills

Data & Programming

  • Python
  • Pandas
  • Hugging Face Datasets
  • Apache Spark
  • Apache Arrow
  • JSONL and Parquet workflows
  • Dask

Data Collection & Processing

  • Web crawling and data extraction
  • API-based data acquisition
  • HTML parsing
  • Large-scale document processing
  • Scrapy
  • BeautifulSoup
  • Selenium

Data Quality & Validation

  • Deduplication and contamination detection
  • Metadata enrichment
  • Corpus diversity analysis
  • FAISS
  • Elasticsearch
  • MinHash and LSH techniques

Annotation & Evaluation

  • Human-in-the-loop annotation workflows
  • Reasoning trace evaluation
  • Code correctness validation
  • Mathematical solution review
  • Preference and ranking datasets
  • Quality assurance systems

Speech & Multimodal Data (Bonus)

  • Speech and audio dataset curation
  • ASR and TTS datasets
  • Familiarity with Common Voice, OpenSLR, or similar datasets

Collaboration Tools

  • Git
  • GitHub

Who Will Thrive Here?

  • Candidates who enjoy organizing and improving large-scale datasets
  • Individuals passionate about AI research and foundation models
  • Detail-oriented problem solvers who care about data quality
  • Builders who enjoy working at the intersection of research and engineering
  • Curious learners eager to understand how state-of-the-art AI systems are trained

Why Join Soket AI?

  • Work on frontier AI problems tackled by only a handful of teams globally
  • Contribute directly to the development of large-scale foundation models
  • Collaborate with top researchers and engineers in AI
  • Gain exposure to cutting-edge research in reasoning, coding, multilingual AI, speech, and multimodal systems
  • Work with supercomputing-scale infrastructure and large-scale datasets
  • Help build open and accessible AI systems for India and the Global South

Who Should Apply?

Recent graduates and early-career professionals who are passionate about AI, data quality, machine learning, and building the datasets that power the next generation of intelligent systems.

Required Skills

  • Python
  • Pandas
  • Hugging Face Datasets
  • Web crawling and data extraction

Job Insights

Deadline: 6/28/2026
Application Status: Active
