AI is no longer just a chat box you type a prompt into and get text back from. All the major public AI services, including ChatGPT, Google Gemini and Anthropic’s Claude, are now multimodal, meaning they accept inputs such as images, video, computer code and speech, and can answer with visuals, audio or plain old text.
These AIs, generally known as Large Language Models, or LLMs, are trained on massive amounts of data, much of it scraped from the public internet and, as several recent court cases have demonstrated, from copyrighted materials sourced in various, often shady, ways. It’s because of these grey areas around where training data is sourced that none of the AI companies disclose much about how their tools were trained.
We’ll gloss over the fact that LLMs are prone to making things up and parroting the biases in their training data; those problems are well known but not easily dealt with. The tougher issue is that, as we enter the era of multimodal and agentic AI, where we may rely on our voices to give models instructions, the models struggle with accents that aren’t predominantly white, urban and North American.
At best, speaking to an AI in your broad Far North Queensland accent could mean it doesn’t understand your intention and delivers a poor response. At worst, your accent could lead to you being discriminated against, potentially by companies and even governments correlating accent and dialect with other social, economic and demographic factors.
C’mon Aussie… C’mon?
“When AI models are trained, they propagate the biases and the patterns of the data they’re trained on,” says Kathy Reid, a PhD candidate at The Australian National University’s School of Cybernetics, as well as a current employee of the Mozilla Foundation (makers of the Firefox web browser), working on its Common Voice Project.
If anyone knows the ins and outs of AI voice models, it’s Reid.
One of the most widely used AI speech recognition engines is Whisper, built by OpenAI, the developer of ChatGPT. OpenAI won’t disclose where Whisper’s speech training data came from beyond saying “the internet”.
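Part of the reason Whisper is so widely used is how little code it takes to run. Here is a minimal sketch using the open-source whisper Python package; the file name is a placeholder, and accuracy will vary with accent, audio quality and the model size chosen.

```python
# Minimal sketch: transcribing an audio file with OpenAI's open-source Whisper.
# Assumes the openai-whisper package and ffmpeg are installed, and that
# "interview.mp3" is a placeholder path to your own recording.
import whisper

# Smaller models ("tiny", "base") run faster; larger ones ("medium", "large")
# are generally more accurate, including on less common accents.
model = whisper.load_model("base")

# Whisper detects the spoken language automatically by default.
result = model.transcribe("interview.mp3")

print(result["text"])
```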
Reid says these unclear training data sources (and there are many other closed-source speech models also relying on voice data scraped from the internet) could be problematic. The internet skews North American and English-speaking, so other accents and dialects could be patchy or missing altogether from scraped speech data.
That said, Reid says many LLMs perform reasonably well with Australian accents; where they trip up is on what’s known in the trade as “named entities”. Named entities are, as the name suggests, proper nouns: place names, product names and people’s names.
Because training data is thin on Australian named entities, particularly Indigenous place names like Wagga Wagga and Canberra, the speech engines will usually get them wrong.
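For the technically minded, the standard way researchers measure these slips is word error rate (WER), which counts how many words in a transcript come out wrong. A rough sketch using the open-source jiwer package, with made-up transcripts purely for illustration, looks like this:

```python
# Illustrative only: comparing a hypothetical transcription against a reference
# to see how badly an engine mangles Australian place names.
# The transcripts below are invented examples, not real model output.
# Assumes the jiwer package is installed (pip install jiwer).
from jiwer import wer

reference = "the meeting was held in wagga wagga before moving to canberra"
# A made-up mis-transcription of the kind Reid describes, where the engine
# swaps an unfamiliar named entity for more common-sounding words.
hypothesis = "the meeting was held in walker walker before moving to canberra"

error_rate = wer(reference, hypothesis)
print(f"Word error rate: {error_rate:.2f}")  # higher means more words wrong
```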
So far, so good. It looks like the speech AIs are doing okay if you’re an Aussie.
Not so fast, says Dr Mostafa Shahin, from UNSW’s School of Electrical Engineering. Dr Shahin’s work includes research aimed at detecting speech issues in children from an early age, as well as detecting Alzheimer’s disease early from a patient’s speech patterns.
“There is evidence the current [AI speech systems] exhibit biases because of their training data,” Dr Shahin says.
For Australians, this creates a real issue because we’re a nation of migrants, all with different accents and dialects. Mass media might teach us that all Aussies speak the same, but the reality is different on the streets and in towns across the country.
The systems may also exhibit low accuracy with different Australian dialects, he says, which could create issues around equity of access, or even security, if a voice biometric is being used to validate access to a secure system.
Shahin says solving the equity and inclusion problems with speech engines is thorny, for several reasons. First, clean data containing deep examples of different accents and dialects may simply not exist, or may not be of good enough quality to thoroughly train an AI model. Second, the big AI companies may have no real incentive to include diverse sources in their models, figuring the cost-benefit equation simply doesn’t stack up.
This is why, he says, sovereign Australian AI models are needed, particularly to power government services like Centrelink and the ATO, and for major companies with broad, diverse customer bases, such as banks and energy providers, to tap into.
Automating bias and discrimination?
AI voice systems are getting smarter and more capable, and it’s likely that multimodal systems will, in the not-too-distant future, be able to distinguish between different accents and dialects.
And this could lead to real problems of bias and discrimination, says ANU’s Reid.
“Once we can discriminate between accents, then it’s possible to automate the accent discrimination people face in real life,” she says.
Be honest with yourself. Consciously or not, you probably judge someone based on their accent. All humans do it and, as Reid says, we give some accents more prestige than others.
Reid says it’s possible AI hiring bots could assess someone not only on their facial expressions, but also on where their accent is from and what it is taken to indicate about their place in society.
However, there’s a flipside. AI that can detect accents could also lead to better government services for marginalised groups, Reid says, because the system would “understand” that someone with English as a second language may need more robust or tailored support.
“There are positive ways this technology could be used, but overall, I’m quite concerned about our ability to classify and distinguish accents and speech data,” she says.