Upon its launch in 2022, OpenAI claimed that Whisper approached “human level robustness” in audio transcription accuracy. However, a researcher at the University of Michigan told the AP that Whisper created false text in 80 percent of the public meeting transcripts analyzed. Another developer, unnamed in the AP report, claimed to have found invented content in almost all of his 26,000 test transcriptions.
Such fabrications pose particular risks in healthcare settings. Despite OpenAI’s warnings against using Whisper in “high-risk domains,” over 30,000 medical workers now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among 40 health systems using a Whisper-powered AI copilot service from health tech firm Nabla that is fine-tuned on medical terminology.
Nabla acknowledges that Whisper can confabulate, but it also reportedly erases the original audio recordings “for data safety reasons.” This could cause additional problems, since doctors cannot verify accuracy against the source material. And deaf patients may be especially affected by mistaken transcripts, because they would have no way of knowing whether the audio behind a medical transcript is accurate or not.
Whisper’s potential problems extend beyond healthcare. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found Whisper adding nonexistent violent content and racial commentary to neutral speech. They found that 1 percent of samples included “entire hallucinated phrases or sentences that did not exist in any form in the underlying audio,” and that 38 percent of those included “explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority.”
In one case from the study cited by the AP, when a speaker described “two other girls and one lady,” Whisper added fictional text specifying that they “were Black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it as, “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.”
An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings and that it actively studies how to reduce fabrications and incorporates feedback in updates to the model.
Why Whisper Confabulates
The key to Whisper’s unsuitability in high-risk domains comes from its propensity to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren’t certain why Whisper and similar tools hallucinate,” but that isn’t true. We know exactly why Transformer-based AI models like Whisper behave this way.
Whisper is based on technology that is designed to predict the next most likely token (a chunk of data) that should appear after a sequence of tokens provided by a user. In the case of ChatGPT, the input tokens come in the form of a text prompt. In the case of Whisper, the input is tokenized audio data.
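To make that pipeline concrete, here is a minimal sketch using the open-source openai-whisper Python package (the file name audio.wav is a placeholder): the audio is converted into a log-Mel spectrogram, and the decoder then predicts text tokens one at a time, each conditioned on the tokens it has already produced.

```python
# Minimal sketch of Whisper's audio-in, tokens-out pipeline using the
# open-source openai-whisper package. "audio.wav" is a placeholder path.
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects.
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)

# The raw waveform becomes a log-Mel spectrogram -- the "tokenized" audio
# representation the encoder consumes.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder then predicts text tokens one at a time, each conditioned on
# the tokens it has already emitted, much like a language model.
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)
```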
The transcription output from Whisper is a prediction of what is most likely, not what is most accurate. Accuracy in Transformer-based outputs is typically proportional to the presence of relevant accurate data in the training dataset, but it is never guaranteed. If there is ever a case where there isn’t enough contextual information in its neural network for Whisper to make an accurate prediction about how to transcribe a particular segment of audio, the model will fall back on what it “knows” about the relationships between sounds and words it has learned from its training data.
According to OpenAI in 2022, Whisper learned those statistical relationships from “680,000 hours of multilingual and multitask supervised data collected from the web.” But we now know a little more about the source. Given Whisper’s well-known tendency to produce certain outputs like “thank you for watching,” “like and subscribe,” or “drop a comment in the section below” when fed silent or garbled inputs, it’s likely that OpenAI trained Whisper on thousands of hours of captioned audio scraped from YouTube videos. (The researchers needed audio paired with already-existing captions to train the model.)
There’s also a phenomenon in AI models called “overfitting,” where information (in this case, text found in audio transcriptions) encountered more frequently in the training data is more likely to be reproduced in an output. When Whisper encounters poor-quality audio in medical notes, the model will produce what its neural network predicts is the most likely output, even if it is incorrect. And the most likely output of any given YouTube video, since so many people say it, is “thanks for watching.”
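You can observe this tendency with a short experiment, sketched below under some assumptions: the exact output depends on the model size and version, and phrases like “thank you for watching” are a commonly reported result for silent or garbled input rather than a guarantee (the model may also return an empty string).

```python
# Sketch: feed Whisper pure silence and inspect what "most likely" output
# it produces. The result varies by model size and version; caption-style
# phrases have been reported for silent input, but an empty string is also
# possible -- this is an illustration, not a guaranteed reproduction.
import numpy as np
import whisper

model = whisper.load_model("base")

# 30 seconds of silence at Whisper's expected 16 kHz sample rate.
silence = np.zeros(16000 * 30, dtype=np.float32)

result = model.transcribe(silence, fp16=False)
print(repr(result["text"]))
```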
In other cases, Whisper seems to use the context of the conversation to fill in what should come next, which can cause problems because its training data could include racist commentary or inaccurate medical information. For example, if many examples in the training data featured speakers saying the phrase “crimes by Black criminals,” then when Whisper encounters a “crimes by [garbled audio] criminals” audio sample, it will be more likely to fill in the transcription with “Black.”
OpenAI researchers wrote about this phenomenon in the original Whisper model card, noting that because the models are trained in a weakly supervised manner using large-scale noisy data, their predictions may include text that is not actually spoken in the audio input (i.e., hallucination). The researchers hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
In other words, Whisper “knows” something about the content of what people are saying and keeps track of the conversation’s flow, which can lead to issues like the one where it specified that two women were Black even though that information was not in the original audio. Theoretically, this failure mode could be reduced by using a second AI model trained to identify stretches of confusing audio where Whisper is likely to confabulate and to flag the transcript in those locations, so that a human could manually check those instances for accuracy later.
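The open-source whisper package already exposes per-segment confidence statistics that hint at how such flagging might work in practice. The snippet below is a simplified heuristic along those lines, not the separate flagging model described above; the thresholds and the file name doctor_visit.wav are illustrative assumptions.

```python
# Rough sketch of the "flag uncertain spots for human review" idea using
# per-segment statistics the open-source whisper package returns. This is
# a simple heuristic, not a trained flagging model; the thresholds and the
# audio file name below are illustrative assumptions.
import whisper

LOGPROB_THRESHOLD = -1.0   # segments with lower avg log-probability get flagged
NO_SPEECH_THRESHOLD = 0.6  # segments the model suspects contain no speech

model = whisper.load_model("base")
result = model.transcribe("doctor_visit.wav", fp16=False)  # placeholder file

for seg in result["segments"]:
    suspicious = (
        seg["avg_logprob"] < LOGPROB_THRESHOLD
        or seg["no_speech_prob"] > NO_SPEECH_THRESHOLD
    )
    marker = "REVIEW" if suspicious else "  ok  "
    print(f"[{marker}] {seg['start']:7.2f}-{seg['end']:7.2f}  {seg['text']}")
```

A reviewer could then listen only to the flagged time ranges instead of re-checking the entire recording, which is the kind of human-in-the-loop step the hypothetical second model would support.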
Clearly, OpenAI’s advice not to use Whisper in high-risk domains, such as critical medical records, was a good one. However, health care providers are constantly motivated to lower costs by using “good enough” AI tools, as seen in UnitedHealth’s use of a flawed AI model for insurance decisions and Epic Systems’ use of GPT-4 for medical records. It’s entirely possible that people are already experiencing negative outcomes as a result of AI errors, and fixing them will likely require some form of regulation and certification of AI tools used in the medical field.
This story originally appeared on Ars Technica.