The Complexity of Multi-Speaker Audio Evaluation

Krishna Rupaakula, Sebastian Liu, LJW, and Sandeep Chinchali
March 17, 2026

The next wave of AI products runs on conversational voice data. Meeting assistants, customer support systems, and voice agents all depend on models trained from multi-speaker audio. That makes conversational voice data one of the most important inputs in AI.
On the surface, the problem sounds simple: everyone needs more data. The first instinct for many companies is to focus on volume: more hours, more speakers, more languages. However, more data does not necessarily mean better data, nor does it automatically lead to better model performance. If the underlying data is inconsistent or poorly transcribed, it can degrade performance rather than improve it.
Evaluating the quality of the underlying data is as important as the data itself. It turns out that building reliable pipelines for conversational audio is harder than it looks, and the difficulty isn't where most people expect it to be.
At Poseidon, we build validation infrastructure for voice datasets. We designed a scoring pipeline, applied it across both single-speaker and dual-speaker recordings, and expected the scores to reflect transcript quality.
They didn't.
Single-speaker audio scored well. Dual-speaker conversations fell significantly behind. But our human reviewers didn't hear that gap. To them, the conversations sounded clear and the words matched. The metrics told a different story.
So we had a choice. Trust the numbers and try to fix the data, or trust the humans and figure out why the numbers were wrong. We opted for the latter.

The Experiment
We had a dataset of conversational Bengali audio collected for voice model training. The dataset included both single-speaker recordings and dual-speaker conversations. Before any of it could be used, we needed to validate whether the transcripts met quality standards. That's what our pipeline was built to do.
To measure transcript quality, we built a unified metric called the Poseidon Score (PSDN). It combines three dimensions of ASR performance into a single number between 0.0 and 1.0: word accuracy, character accuracy, and semantic similarity. A single composite score gave us a consistent way to compare across recording types.
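As a rough illustration of how a composite like this could be assembled, here is a minimal sketch. It is not the actual PSDN formula: the equal weights, the function names, and the choice to derive word and character accuracy from normalized edit distance are all assumptions made for the example. Semantic similarity is passed in as a number, because the model used to compute it isn't specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """1 minus the normalized edit distance, clipped to the range [0, 1]."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def composite_score(reference, hypothesis, semantic_similarity,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mean of word accuracy, character accuracy, and semantic
    similarity. The equal default weights are an assumption."""
    word_acc = accuracy(reference.split(), hypothesis.split())
    char_acc = accuracy(reference, hypothesis)
    sem = min(1.0, max(0.0, semantic_similarity))
    w_word, w_char, w_sem = weights
    total = w_word + w_char + w_sem
    return (w_word * word_acc + w_char * char_acc + w_sem * sem) / total
```

Clipping matters because error rates can exceed 1.0 when the hypothesis is much longer than the reference, and the composite is meant to stay between 0.0 and 1.0.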
We applied the PSDN score to this dataset. Bengali is a low-resource language, where labeled data is difficult to produce and evaluation errors carry extra weight. The results diverged sharply: single-speaker clips averaged 0.86, while dual-speaker conversations came in at 0.71.
A gap of 0.15 (15 points on a 100-point scale) is hard to ignore, but our human reviewers weren't hearing it. Before assuming the conversations were lower quality, we wanted to test a simpler explanation: what if the structure of the audio itself was throwing off the score?

The Silence Problem
Dual-speaker audio has a structural property that single-speaker audio doesn't: silence. When each speaker's channel is isolated, long gaps appear wherever the other person is talking. We suspected these gaps, not the content, might be driving the score difference.
To test our hypothesis, we took high-quality single-speaker clips averaging a PSDN score of 0.95 and artificially inserted silence gaps to mimic the rhythm of a two-person conversation. We kept the same words, speaker, and audio quality; nothing about the speech itself changed.
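The manipulation can be sketched roughly as follows. This is an illustrative reconstruction, not the code used in the experiment: audio is modeled as a plain list of float samples, and the function name, the fixed chunk length, and the fixed gap length are all assumptions.

```python
def insert_silence(samples, sample_rate, speech_s, gap_s):
    """Insert gap_s seconds of zeros after every speech_s seconds of audio.

    Fixed-length chunks and gaps are a simplification; real conversational
    turns vary in length.
    """
    chunk = int(speech_s * sample_rate)
    if chunk <= 0:
        raise ValueError("speech_s * sample_rate must be at least 1 sample")
    gap = [0.0] * int(gap_s * sample_rate)

    out = []
    for start in range(0, len(samples), chunk):
        out.extend(samples[start:start + chunk])
        if start + chunk < len(samples):  # no trailing gap after the last chunk
            out.extend(gap)
    return out
```

Because only zeros are added, every original sample survives in order, which is what makes the comparison a test of structure rather than content.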
We tested two approaches for trimming silence:
One based on amplitude thresholds: energy trimming
One using known transcript boundaries: timestamp trimming
The scores dropped.

That confirmed it. Part of the gap between single-speaker and dual-speaker scores had nothing to do with transcription quality. It was an artifact of how the audio was structured: the silences, the turn-taking, the segmentation. The evaluation pipeline was penalizing the recording format, not the fidelity of the data itself.
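The two trimming approaches named above can be sketched as follows. This is a simplified illustration under stated assumptions: audio is a list of float samples, the frame length and RMS threshold are arbitrary parameters, and the segment boundaries are assumed to come from the transcript as (start, end) pairs in seconds. None of these names or values come from the actual pipeline.

```python
import math


def energy_trim(samples, frame_len, threshold):
    """Drop fixed-size frames whose RMS amplitude falls below threshold."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept


def timestamp_trim(samples, sample_rate, segments):
    """Keep only the audio inside known (start_s, end_s) transcript segments."""
    kept = []
    for start_s, end_s in sorted(segments):
        lo = max(0, int(start_s * sample_rate))
        hi = min(len(samples), int(end_s * sample_rate))
        kept.extend(samples[lo:hi])
    return kept
```

The two methods can disagree on the same clip. Energy trimming may discard quiet speech that falls under the threshold, while timestamp trimming keeps whatever lies inside the segment boundaries, including any pauses within a turn.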

No Preprocessing Strategy Wins Twice
Knowing that silence was skewing the scores, the natural next question was whether we could preprocess it away.
We tested six different strategies for dual-speaker audio, ranging from simple amplitude-based silence removal to neural voice activity detection to full-conversation diarization (see Appendix for details on each method). No single method consistently outperformed the others. A strategy that worked perfectly for one conversation would underperform on the next. The "best" approach depended entirely on the specific clip. This led us to a Best-of-N approach: for each conversation, the pipeline runs all six methods, scores each result, and keeps the highest score.
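The selection step can be sketched as below. The structure follows the description above (run every strategy, score each result, keep the highest), but the function signature is hypothetical: `strategies`, `transcribe`, and `score` are placeholders the caller supplies, standing in for the preprocessing methods, the ASR call, and the scoring function.

```python
def best_of_n(audio, reference, strategies, transcribe, score):
    """Run every preprocessing strategy and keep the highest-scoring result.

    strategies: dict mapping a strategy name to a callable audio -> audio
    transcribe: callable audio -> hypothesis transcript
    score:      callable (reference, hypothesis) -> float
    Returns (best_name, best_score, all_scores).
    """
    if not strategies:
        raise ValueError("at least one preprocessing strategy is required")

    all_scores = {}
    for name, preprocess in strategies.items():
        hypothesis = transcribe(preprocess(audio))
        all_scores[name] = score(reference, hypothesis)

    best_name = max(all_scores, key=all_scores.get)
    return best_name, all_scores[best_name], all_scores
```

Returning the full score table alongside the winner is a design choice for this sketch: the spread between the best and worst strategy on a clip shows how much of its score depends on preparation.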

Best-of-N brought dual-speaker scores to 0.856, near parity with single-speaker clips at 0.86. The approach doesn't modify the audio or the transcripts. It removes the noise introduced by preprocessing choices. If a clip scores poorly under one strategy but well under another, the difference is about preparation, not quality.
The Best-of-N scores aligned with what our human reviewers had been hearing all along.

Better Models, Fewer Workarounds
All of the above was tested using ElevenLabs Scribe v1. When we upgraded to Scribe v2, much of the instability disappeared on its own. Scribe v1 had three failure modes that disproportionately affected conversational audio:
Truncated transcripts: The model would stop before the speaker finished
Format sensitivity: Scores swung depending on how audio was prepared
Script confusion: Outputting Devanagari in Bengali transcripts
Scribe v2 reduced all three without any changes to our pipeline.
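Of the three failure modes, script confusion is the easiest to check mechanically, since Bengali and Devanagari occupy separate Unicode blocks (U+0980 to U+09FF and U+0900 to U+097F). The sketch below is one way such a check could look, not a description of our pipeline; the function names and the ratio threshold are assumptions. The danda and double danda (U+0964, U+0965) sit in the Devanagari block but are also used as punctuation in Bengali text, so they are not counted as evidence of confusion.

```python
SHARED_PUNCTUATION = {0x0964, 0x0965}  # danda and double danda


def script_counts(text):
    """Count Bengali-block and Devanagari-block characters in text."""
    counts = {"bengali": 0, "devanagari": 0}
    for ch in text:
        cp = ord(ch)
        if cp in SHARED_PUNCTUATION:
            continue
        if 0x0980 <= cp <= 0x09FF:
            counts["bengali"] += 1
        elif 0x0900 <= cp <= 0x097F:
            counts["devanagari"] += 1
    return counts


def has_script_confusion(text, max_devanagari_ratio=0.0):
    """Flag a transcript whose share of Devanagari characters exceeds the
    allowed ratio. The default of 0.0 flags any Devanagari at all."""
    counts = script_counts(text)
    total = counts["bengali"] + counts["devanagari"]
    if total == 0:
        return False
    return counts["devanagari"] / total > max_devanagari_ratio
```

A nonzero threshold may be more appropriate for datasets that legitimately contain code-switched speech.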

We only knew which improvements to look for because we had already diagnosed the underlying problem.
Without the silence experiment and the Best-of-N analysis, a model upgrade would have looked like a general improvement. With that context, we could see exactly what it fixed and what still required pipeline-level solutions.

Why Your Eval Pipeline Has a Blind Spot
What initially looked like a quality problem turned out to be a measurement problem. Our scores were reacting to audio structure (the silences, the turn-taking, the segmentation), not to transcription quality. The audio was fine, but a metric has to be built for the shape of the data it measures. Our human reviewers caught what the automated scores missed, and that feedback loop led us to the diagnosis. Without it, we would have spent time fixing audio that was already good enough.
The numbers said our conversational data was worse. The humans said it was fine. Both were right. They were just answering different questions.
This isn't unique to our pipeline. Most of the audio that AI products need to handle is conversational. To build models that work on real-world speech, you need training data that reflects it. And to know whether that data is any good, you need evaluation pipelines that can tell the difference between a quality problem and a structural one. Without that, it's garbage in, garbage out, and you won't know it until the model is in production.
Acknowledgements: The authors would like to thank Emma Joelle for her contributions to the piece.