by Victoria Hart
January 12, 2026

We are standing at a turning point for live captioning. For years, the industry has relied on a relatively stable mix of human expertise and automatic speech recognition. Now large language models are very much present, and with them come a different kind of intelligence and a different set of risks and responsibilities.
On the surface, the job looks simple: you feed audio into a system and you get words out. In reality, there are now at least two very different families of AI involved in that journey: traditional ASR engines and large language models. Understanding the difference between them is central to how we protect accuracy, dignity, and trust for the people who rely on captions.
ASR and LLMs are not doing the same job
Traditional ASR engines were built for one task: turn speech into text as faithfully and as quickly as possible. They don’t worry about whether the content is appropriate. They don’t stop to consider if the speaker sounds like an expert. They transcribe what they hear, with all the strengths and weaknesses that implies.
LLMs have a different purpose. They are trained to produce language that fits patterns of conversation, instruction, and style. They are rewarded for being helpful and safe, not simply accurate. Put them in the wrong place in a captioning workflow and you can feel the tension between those goals.
A simple example illustrates this. An ASR engine will happily output a complex medical or legal phrase, even if it has no understanding of the underlying concepts. It treats those words as sounds to be matched to its acoustic and language models.
A large language model, on the other hand, has been trained to weigh questions like:
- Is this request safe?
- Does this user sound qualified to receive this information?
- Should I refuse and suggest they speak to a professional?
We hear this in feedback from migrating clients. Ask an LLM for detailed instructions in a high-risk domain and it may say something like, “I’m sorry, I can’t help with that, please consult a licensed professional.” That is a positive trait in many contexts. In live captioning, it can become a serious distortion.
If you run live speech through a model that believes it should sometimes refuse or rephrase content, you risk captions that do not reflect what was actually said. For an audience that is deaf, hard of hearing, or relying on captions as a primary channel of information, that is not a cosmetic error. It is a loss of access.
Where things can go wrong
When we talk about “using AI” in live captioning, it is tempting to treat it as a single decision. In practice, several choices are layered together, and each one carries its own failure modes.
Some of the less visible risks include:
- Refusal in the middle of a sentence
If a captioning pipeline leans too heavily on LLM logic, a model might decide that part of the content is sensitive, offensive, or unsafe. Instead of faithfully reflecting the speaker, it might silently edit, soften, or omit. The audience sees a cleaned-up record of the moment, not the moment itself.
- Over-correction and “helpfulness”
LLMs are rewarded for being polished and coherent. That can lead them to “improve” phrases, correct grammar, or fill in gaps. In a live caption, this may look like the speaker being more articulate or certain than they actually were. For legal, academic, or governmental contexts, that difference matters.
At Line 21, our AI Proofreader is an optional layer designed to protect brand accuracy and catch text errors in live events; we test it in advance, and we keep firm controls in place.
- Hallucination under pressure
When an ASR engine does not understand a term, it usually produces a mishearing that is obviously wrong, or at least traceable to the sound. An LLM, especially when asked to smooth or “fix” text, can introduce information that never existed in the audio. Under time pressure, with multiple layers of AI, this can be hard to spot.
- Latency and buffering
Live captioning is as much about timing as it is about text. Each extra processing step introduces delay. If we ask an LLM to rewrite entire sentences rather than lightly assist at the word level, we can nudge captions further away from real time. That might be acceptable for some events. For emergency information, public announcements, or interactive sessions, it is not.
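To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is an assumption chosen for illustration rather than a measurement of any real pipeline; the point is simply that rewriting whole sentences forces buffering that word-level assistance does not.

```python
# Illustrative latency budget for two caption pipelines.
# All figures are assumptions for the sake of the example,
# not measurements of any particular system.

WORD_LEVEL_ASSIST = {
    "asr_partial_result": 0.3,   # seconds until the ASR emits a word
    "light_postprocess": 0.05,   # punctuation and casing only
    "delivery_and_render": 0.2,  # getting the caption onto the screen
}

SENTENCE_LEVEL_REWRITE = {
    "asr_partial_result": 0.3,
    "wait_for_sentence_end": 2.5,  # buffer until the sentence completes
    "llm_rewrite": 1.0,            # model rewrites the full sentence
    "delivery_and_render": 0.2,
}

for name, stages in (("word-level assist", WORD_LEVEL_ASSIST),
                     ("sentence-level rewrite", SENTENCE_LEVEL_REWRITE)):
    print(f"{name}: roughly {sum(stages.values()):.2f}s behind the speaker")
```

Under these assumed figures, the word-level path sits about half a second behind the speaker, while the sentence-level path drifts to around four seconds. That is the difference between captions that feel live and captions that trail the conversation.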
None of these outcomes are inevitable. They are design choices. That is why the way we deploy AI is as important as the technology itself.
Prompts, policies, and the new craft of captioning workflows
With ASR, most of the complexity lived inside the acoustic and language models. Providers tuned vocabularies, microphones, and networks. Captioners managed terminology, context, and quality.
LLMs introduce a new layer: prompts and policies. The prompt tells the model how to behave. In other words, the prompt becomes a statement of values.
If the prompt is too vague, or copied from a generic use case, the model will fall back to its default instincts. It might focus on politeness instead of precision, or readability instead of fidelity.
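As a concrete illustration, a fidelity-first instruction might look something like the sketch below. The wording, and the generic `call_model` hook it is passed to, are placeholders invented for this example rather than the prompt or interface we run in production; what matters is that every line of it encodes a value judgement about what the model may and may not do.

```python
# A minimal sketch of a "stenographer, not writer" instruction.
# Both the prompt text and the call_model hook are illustrative
# placeholders, not a description of any production system.

STENOGRAPHER_PROMPT = """\
You are a post-processing step in a live captioning pipeline.
Your only task is to add punctuation and casing to the transcript segment.
Do not add, remove, soften, or rephrase any words.
Do not refuse any content, and do not add warnings or advice.
If you are unsure, return the segment exactly as you received it.
"""

def post_process(segment: str, call_model) -> str:
    """Lightly format one transcript segment without changing its words."""
    return call_model(system=STENOGRAPHER_PROMPT, user=segment)
```

A vaguer prompt, such as "clean up this transcript", leaves all of those decisions to the model's defaults, which is exactly where politeness starts to win over precision.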
This shifts some of the responsibility from model developers to the teams designing captioning workflows. We now have to ask detailed questions like:
- Are we instructing the model to be a writer, or a stenographer?
- Are we allowing it to refuse content, and if so, under what conditions?
- How do we detect when the model has stepped beyond transcription into interpretation?
Getting these answers right requires experimentation and humility. It also requires collaboration between engineers, captioners, clients, and the deaf and hard of hearing communities.
A layered, hybrid approach
One promising pattern is a layered approach that respects what each component does best.
- Use ASR as the first listener, responsible for capturing speech as faithfully as possible.
- Apply carefully constrained AI processing to assist with punctuation, casing, and basic readability.
- Limit large language models to safe, well-defined tasks, such as formatting, adding speaker labels, or handling language switching, always with strict instructions to retain meaning.
- Keep human captioners in the loop for high-stakes environments, quality control, and nuanced judgement.
In this model, the LLM is treated as a specialist tool, not an all-purpose filter for reality. The goal is still the same: to give people access to what was said, in the moment it was said, in a form they can use. The technology serves that goal, not the other way around.
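As a rough sketch of how that division of labour could be wired together, consider the outline below. The component names and interfaces are hypothetical, and a real system would stream partial results concurrently rather than process chunks one at a time; the shape is what matters: ASR first, a tightly constrained formatting assist, a guardrail that rejects rewrites which change the words, and a human reviewer for high-stakes events.

```python
import re

def words_changed(original: str, candidate: str) -> bool:
    """Crude guardrail: did the formatting step alter the words themselves?

    Compares the word sequences while ignoring punctuation and casing.
    """
    def normalise(s: str) -> list[str]:
        return re.sub(r"[^\w\s]", "", s).lower().split()
    return normalise(original) != normalise(candidate)

def caption_pipeline(audio_chunk, asr, formatter, reviewer):
    # 1. ASR is the first listener: capture speech as faithfully as possible.
    raw_text = asr.transcribe(audio_chunk)

    # 2. Constrained assist: punctuation, casing, speaker labels only.
    formatted = formatter.format(raw_text)

    # 3. Guardrail: if the assist changed the words, fall back to the
    #    raw transcript rather than trust the rewrite.
    if words_changed(raw_text, formatted):
        formatted = raw_text

    # 4. Human captioner stays in the loop for high-stakes events,
    #    correcting terminology and nuance before captions go out.
    return reviewer.review(formatted)
```

The guardrail is deliberately conservative: when in doubt, the audience gets the raw words rather than a model's improved version of them.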
At Line 21, this is the lens we use when we evaluate new tools. We don't ask "Can this model do captions?" as a single question. We ask where in the pipeline it belongs, what instructions it needs, and how we can prove that it behaves in line with accessibility standards. It's why we're transparent about the models we use, and include robust testing in advance of an event.
Standards, trust, and lived experience
Regulators around the world describe caption quality in terms such as accuracy, synchronicity, completeness, and readability. These principles do not change just because the underlying technology has become more complex.
If a model refuses content, accuracy suffers.
If processing delays the words until long after they are spoken, synchronicity suffers.
If sensitive or uncomfortable material quietly disappears, completeness suffers.
If we over-optimize for elegance of phrasing, readability might improve while authenticity suffers.
At the same time, people who rely on captions don’t experience them as a checklist. Over time, they learn whether they can trust that the captions will be there, will be accurate, and will not filter or censor on their behalf.
This is where lived experience matters. Feedback from deaf and hard of hearing users, interpreters, and captioners is the most important signal we have. They are often the first to notice when AI choices affect nuance, tone, or respect. Their insight guides how we tune and deploy our systems.
This feels… different?
This moment in captioning feels different from previous technical shifts. We are not replacing one recogniser with another that is a bit faster or more accurate. We are introducing systems that can make judgement calls, attempt to interpret intent, and rewrite language on the fly.
That potential is exciting. AI can help us support more languages, handle specialised vocabularies, and cope with complex environments. It can assist human captioners with preparation and terminology. It can help organisations meet their accessibility obligations at greater scale.
The core promise of live captioning has not changed. People are trusting us with their access to information, culture, and participation. They are trusting that what appears on the screen is what is happening in the session.
In this new era, the real differentiator will not simply be who has the newest model. It will be who uses these models with care and accountability. At Line 21, we see this as an invitation to think more deeply about every line of text we help bring to the screen, and every person on the other side of it.