Introducing the Poseidon Voice AI Dataset

January 28, 2026
Why We Need Under-Represented Language Datasets for Voice AI
Public voice datasets suffer from structural imbalances. Most prioritize Western languages, offer limited coverage of real-world acoustic environments, and often lack transparent quality-control standards. As a result, speech-to-text (STT) and text-to-speech (TTS) systems frequently underperform for the majority of global users who do not speak English or who communicate in environments far noisier than studio conditions. Addressing this gap requires large, rights-cleared corpora capturing linguistic, cultural, and acoustic diversity at scale.

Our Dataset at a Glance
The Poseidon Voice AI Dataset contains over 33,000 hours of prompted speech collected in approximately three weeks from thousands of globally distributed contributors. Languages include Hindi, Urdu, Indonesian, Vietnamese, Korean, Mandarin, and additional under-represented languages with limited existing training data. In several of these languages, our corpus exceeds the scale of public datasets built over many years. Figure 1 illustrates Poseidon’s language coverage relative to major public datasets.

Figure 1. Hours of recorded audio per language in Poseidon compared with Common Voice, FLEURS, MLS, and VoxPopuli. Poseidon achieves an order-of-magnitude increase for multiple low-resource languages in a fraction of the collection time.

Figure 2. Standard voice datasets, like Mozilla’s Common Voice, took years to collect. Poseidon’s crowdsourced approach scaled globally in just weeks, as seen above.
Benefits of Crowdsourcing via Incentivized Participation
Leveraging decentralized, incentive-driven participation for data collection provides three advantages. First, contributors can join globally, enabling rapid scaling. Second, real-world acoustic settings (background noise, varied microphones, spontaneous behavior) are naturally captured rather than artificially filtered. Third, tracking data contributions through secure provenance infrastructure ensures transparent data lineage, enabling compliant licensing for AI training workflows.
Challenges of Crowdsourcing
Open participation inevitably introduces wide variation in recording quality. Contributors may submit incomplete readings of the transcript, speech in the wrong language, or recordings with severe background interference. Detecting and filtering such anomalies requires a rigorous, multi-stage validation pipeline.
The Poseidon App – How We Collected Data
Contributors recorded 30–60 second utterances by reading ground-truth text in their native language. Authentication occurred through a unique seed phrase tied to each contributor’s secure account, establishing a persistent voice anchor. Users consented to dataset usage for AI workflows and submitted recordings across topics including banking, customer support, sports, and general conversation. Integration with World, a proof-of-humanity protocol, made Poseidon the top trending AI app on the World app store. Web-based uploads were also supported.

Figure 3. Example of the Poseidon voice collection app and transcripts in the Korean campaign.
The Data Collection Process and Data Rights
Our initial campaign (“Season 1”) produced over 33,000 hours of raw audio with associated transcripts, metadata, and authentication samples. All data is rights-cleared, with provenance anchored via Story’s Intellectual Property (IP) framework to support downstream research and enterprise use.
Challenge 1: Filtering for High Semantic Quality – The Poseidon Score
The key question is: “Does a user’s uploaded audio faithfully follow the intended transcript, allowing for a few misspoken words?”
To determine whether a spoken utterance faithfully matches its assigned transcript, we compute a composite Poseidon Score combining character error rate, word error rate, and semantic similarity. The resulting distribution is bimodal, separating faithful readings from incomplete or incorrect samples and enabling language-specific thresholds for filtering.
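As a rough illustration of how such a composite could be assembled, the sketch below combines the three ingredients named above. It is not the actual Poseidon Score formula: the equal weights, the function names, and the use of word-set overlap as a stand-in for semantic similarity are all assumptions made for this example, and the hypothesis string is assumed to come from an ASR system that has already transcribed the upload.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length, capped at 1.0."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return min(1.0, edit_distance(ref, hyp) / len(ref))


def token_overlap(ref_words, hyp_words):
    """Jaccard overlap of word sets.

    Assumption: a crude placeholder for semantic similarity; a real
    pipeline would more likely compare sentence embeddings.
    """
    a, b = set(ref_words), set(hyp_words)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def composite_score(reference, hypothesis, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical composite in [0, 1]; higher means a more faithful reading."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    cer = error_rate(reference, hypothesis)   # character error rate
    wer = error_rate(ref_words, hyp_words)    # word error rate
    sim = token_overlap(ref_words, hyp_words)
    w_cer, w_wer, w_sim = weights             # assumed equal weighting
    return w_cer * (1 - cer) + w_wer * (1 - wer) + w_sim * sim
```

Under these assumptions, a reading with one or two misspoken words still scores close to 1, while an empty or unrelated recording scores near 0, which is the kind of separation a bimodal distribution would reflect.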

Figure 4. Distribution of Poseidon Scores across four languages (Hindi, Korean, Mandarin, and Urdu). The bimodal structure reflects clear separation between high-fidelity readings and lower-quality submissions, which are often caused by background noise or similar acoustic variation.
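One way a per-language cutoff could be derived from a bimodal score distribution is Otsu's method, which picks the split that maximizes the variance between the two resulting groups. The sketch below is an illustrative choice, not a description of how Poseidon's thresholds were actually set; the function names, the bin count, and the assumption that scores lie in [0, 1] are all hypothetical.

```python
def otsu_threshold(scores, bins=50):
    """Return the cut in [0, 1] that maximizes between-class variance."""
    # Histogram the scores into equal-width bins over [0, 1].
    hist = [0] * bins
    for s in scores:
        idx = min(bins - 1, max(0, int(s * bins)))
        hist[idx] += 1
    total = len(scores)
    centers = [(i + 0.5) / bins for i in range(bins)]
    sum_all = sum(c * h for c, h in zip(centers, hist))

    best_var, best_cut = -1.0, 0.5  # 0.5 is an arbitrary fallback
    w_low, sum_low = 0, 0.0
    for i in range(bins - 1):
        w_low += hist[i]
        sum_low += centers[i] * hist[i]
        w_high = total - w_low
        if w_low == 0 or w_high == 0:
            continue  # one side is empty, so this is not a valid split
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var = var_between
            best_cut = (i + 1) / bins  # upper edge of bin i
    return best_cut


def thresholds_by_language(scores_by_language):
    """Compute one cutoff per language from a {language: [scores]} mapping."""
    return {lang: otsu_threshold(s) for lang, s in scores_by_language.items()}
```

Samples scoring at or above their language's cutoff would be kept. In practice a cutoff like this would still be checked against human annotations before being used for filtering.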
Challenge 2: Data Annotation
For selected languages, we employed three independent native-speaking validators per sample across seven annotation categories, including transcript correctness, language verification, and spoken accent. These annotations calibrate automated scores, identify ambiguous or adversarial recordings, and reveal corner cases where both humans and detectors exhibit uncertainty.
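With three validators per sample, one simple way to combine their labels for a single category is a majority vote that also records how strongly the validators agreed. The sketch below assumes this aggregation rule for illustration only; the function name and label strings are hypothetical, and the source does not say how annotations were actually reconciled.

```python
from collections import Counter


def aggregate_labels(labels):
    """Combine one sample's validator labels for a single category.

    Returns (majority_label, agreement). The label is None when no
    value wins a strict majority, flagging the sample as ambiguous.
    """
    if not labels:
        return None, 0.0
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    if votes * 2 <= len(labels):
        return None, agreement  # no strict majority among validators
    return label, agreement


# Example: two of three validators judge the transcript correct.
label, agreement = aggregate_labels(["correct", "correct", "incorrect"])
```

Samples that come back with no majority, or with low agreement, are natural candidates for the ambiguous corner cases described above.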

Figure 5. HuBERT embedding projections across Hindi, Korean, Mandarin, and Urdu. Dense central regions reflect typical, clean speech; peripheral clusters correspond to outliers such as noise-heavy recordings or non-native pronunciations.

Figure 6. HuBERT embedding projection for Urdu.
This Is Just the Start – We Are Scaling to Hundreds of Thousands of Hours
Applications include multilingual STT and TTS model development, noise-robust ASR tuning, and speaker verification. Future campaigns will scale to hundreds of thousands of hours per language, including diarized multi-speaker conversations and domain-specific speech (for example, medicine, finance, customer support). Demographic metadata such as age, gender, dialect, and native language will enrich downstream modeling.
Contact Below to Access the Dataset
The dataset release includes audio files, transcripts, embeddings, Poseidon Scores, and human annotations in a unified schema. Example loaders and model-training utilities are provided for straightforward use in PyTorch or other frameworks.
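To make the idea of a unified record concrete, the sketch below shows what one entry might look like once loaded. The class name, every field name, and the filtering helper are hypothetical, chosen only to mirror the components listed above; the released schema and the provided loaders may be organized differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VoiceRecord:
    """Hypothetical in-memory view of one dataset entry."""
    audio_path: str        # location of the audio file
    language: str          # language code or name for the recording
    transcript: str        # ground-truth text the contributor read
    poseidon_score: float  # composite quality score for the sample
    embedding: List[float] = field(default_factory=list)
    # Validator labels keyed by annotation category (assumed layout).
    annotations: Dict[str, List[str]] = field(default_factory=dict)


def filter_records(records, language, min_score):
    """Keep records in one language whose score meets a chosen cutoff."""
    return [
        r for r in records
        if r.language == language and r.poseidon_score >= min_score
    ]
```

A list filtered this way could then be wrapped in whatever dataset abstraction the training framework expects.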
Academic and Enterprise Licensing
For research collaborations, commercial licensing, or dataset samples, please contact the Poseidon team.