In case you haven’t noticed, "voice" is taking over the tech world. It won’t be long before your keyboard is in a museum alongside floppy drives, cathode ray monitors, and parallel printer cables.
Voice recognition is a tremendous technical challenge, and a lot of very talented people are employed trying to get it to work properly. "Working properly" means that a machine will interpret what you say as well as (or even better than) a human being would.
We’re getting better at it all the time, but I personally wish we’d stop trying so hard before we erode what little is left of our culture. Let me explain.
Typical explanations for how voice recognition works break the process into two parts: recognizing the words and divining their meanings.
The first seems to be a relatively straightforward task of interpreting audio waveforms as phonemes and stringing them together to get words. So if I say "light on," the computer recognizes all the aural bits and pieces and figures out that it’s two words, both of which are in its database, so no problem there.
Then, it has to figure out what I mean by "light on." Am I asking for it to be turned on, or asking if it’s already on? So maybe the machine tries to detect the intonation of the final syllable: Is it harsh and sharply inflected ("light on!") or is there an upward lilt ("light on?")?
That’s not an easy task, and most voice recognition systems don’t bother trying to figure it out. Instead, they use context, and as experts in artificial intelligence (AI) will tell you, the more constrained the context, the smarter the system will seem to be. In the above example, if it’s a home automation system (like Amazon’s Echo/Alexa), it’s not likely that the homeowner who just walked into a room needs to ask if the light is on. So the system can reasonably assume it’s a command.
If it’s a safety officer monitoring a nuclear power plant who just heard something go clunk in the cooling tower, he might be asking the system if a warning light just came on.
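To make the point concrete, here is a minimal sketch of how a restricted domain can settle an ambiguous utterance. The domain names, the lookup table, and the `interpret` function are all hypothetical, invented for illustration; no real assistant is claimed to work this way.

```python
# Hypothetical table: the most likely reading of an ambiguous
# utterance in each deployment context (an assumption for illustration).
DEFAULT_INTENT = {
    "home_automation": "command",    # homeowner walking into a dark room
    "plant_monitoring": "question",  # safety officer checking a warning light
}

def interpret(utterance: str, domain: str) -> str:
    """Return 'command', 'question', or 'unknown' for an utterance."""
    text = utterance.strip()
    # If the transcript already carries explicit punctuation, trust it.
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "command"
    # Otherwise fall back on what the domain makes most plausible.
    return DEFAULT_INTENT.get(domain, "unknown")

print(interpret("light on", "home_automation"))
print(interpret("light on", "plant_monitoring"))
```

The same two words come back as a command in the living room and a question in the control room, and the system never had to analyze intonation at all.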
So context counts, which is why specific-use ("restricted domain") AI systems are more accurate than general purpose ones. As it happens, advanced general purpose AI systems try to use the context of a conversation to assist in deriving meaning. If you say to Siri, "I need to get the brakes on my car relined," she’ll spell it "brakes," but if you say, "Some people get all the breaks," she’ll correctly resolve the homophone and spell it "breaks." (Try it yourself. Quite amazing if you think about it.)
In other words, Siri uses context to restrict the domain to either the automotive or self-pitying. That’s a critical step in correctly interpreting a spoken sentence.
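How Siri actually does this isn't public, so what follows is only a toy sketch of the general idea: pick the spelling whose typical neighbors show up in the rest of the sentence. The cue-word lists, the `<brayks>` placeholder, and the `spell_homophone` function are my own assumptions, not anything from Apple.

```python
import re

# Hypothetical cue words that tend to appear near each spelling.
CUES = {
    "brakes": {"car", "truck", "relined", "pedal", "mechanic"},
    "breaks": {"some", "people", "all", "lucky", "luck"},
}

def spell_homophone(sentence: str, slot: str = "<brayks>") -> str:
    """Replace the placeholder with the spelling best supported by context."""
    context = set(re.findall(r"[a-z']+", sentence.lower()))
    # Score each spelling by how many of its cue words occur in the sentence.
    # On a tie, max() keeps the first candidate, here "brakes".
    best = max(CUES, key=lambda word: len(CUES[word] & context))
    return sentence.replace(slot, best)

print(spell_homophone("I need to get the <brayks> on my car relined"))
print(spell_homophone("Some people get all the <brayks>"))
```

A production system would rely on statistical language models rather than hand-written word lists, but the principle is the same: the surrounding words narrow the domain, and the domain picks the spelling.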
In both of those examples, the speech being dictated was properly formulated. Now try to imagine what a general purpose voice recognition system would do with this: "So, I’m like, is it totally, like, gonna, you know, rain or whatever? Know’m sayin?"
Back in the day I’d have cracked wise about your teenaged daughter after a sentence like that, but as most of you are already aware, that kind of language is now ubiquitous. Your teenaged daughter’s teachers probably speak like that, as do television news people, on-air sports reporters, and tweeting politicians.
And what are our AI experts doing about it? They’re scrambling to get their machines to deal with it, along with slurred speech, upward inflections at the end of every sentence ("So I’m with Billy? And we’re going to the movies? And there’s this horror flick? And I’m, like, so scared? And Billy’s making fun of me? Because he’s not?" Aaaargh…!) and slang words and phrases with all the linguistic longevity of a fruit fly.
The greatest success is achieved when the device, no matter how awkwardly or ungrammatically or seemingly incomprehensibly a query is formulated, manages to interpret it correctly anyway.
In other words, cyberlinguists, who as a rule are highly skilled in the nuances of language and generally appreciative of its beauty, are bending over backwards to reinforce the bad behavior of people bent on killing it. If a machine says, "Shall I turn on the lights?" and the answer is, "Duh, it’s like totes dark, yo?" and the machine turns on the lights, what reason does Gina have to suspect that all is not right with her verbal world?
The net effect is that poor language use is consistently reinforced instead of corrected. And the smarter we get about machine learning, the more our use of language deteriorates.
Technology to the Rescue
So here’s a modest proposal for those brave souls not afraid to make trouble in the high school cafeteria: Instead of trying so hard to make machines understand us, we should, in selected situations and environments, turn that paradigm on its head and teach machines to teach us instead.
Imagine an optional, on-demand capability (were I developing it, I’d call it "Loquitur") that can be added to such devices as Echo/Alexa, Google Voice, and Siri to prevent the device from complying with requests unless they’re properly formulated. Loquitur would gently prompt for a corrected version of the request, or provide coaching if the user was having difficulty.
Several levels of "strictness" could be offered, ranging from "It has to be perfect" to "It just has to be reasonable." As an example, "None of them are going to the ballgame" would be acceptable as "reasonable," but the "strict" setting would demand the use of "is" rather than "are."
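Since Loquitur exists only in my imagination, the following is purely a sketch of how strictness levels might be wired up: each usage rule carries the minimum level at which it applies. The `Strictness` levels, the single rule, and the `check` function are all hypothetical.

```python
import re
from enum import IntEnum
from typing import List

class Strictness(IntEnum):
    REASONABLE = 1  # "It just has to be reasonable"
    STRICT = 2      # "It has to be perfect"

# Each rule: (pattern, coaching hint, lowest level at which it is enforced).
# One illustrative rule only; a real rule set would be far larger.
RULES = [
    (re.compile(r"\bnone of \w+ are\b", re.IGNORECASE),
     'Try "is" rather than "are" after "none."',
     Strictness.STRICT),
]

def check(request: str, level: Strictness) -> List[str]:
    """Return coaching hints for every rule violated at this level."""
    return [hint for pattern, hint, min_level in RULES
            if level >= min_level and pattern.search(request)]

sentence = "None of them are going to the ballgame"
print(check(sentence, Strictness.REASONABLE))  # no hints at this level
print(check(sentence, Strictness.STRICT))      # one hint about "is"
```

An empty list means the device goes ahead and complies; anything else means it gently prompts for a corrected version first.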
Options could also be provided to detect certain phrases that should be actively discouraged, such as "whatever" and "totally," overuse of the word "like" and whatever obscure usage happens to be in vogue at the moment.
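A crude sketch of such an option might look like the following. The word list, the threshold for "like," and the `flag_fillers` function are assumptions made up for illustration, and simple word counting can't tell a filler "like" from a legitimate one, which is why a single use is allowed through.

```python
import re
from collections import Counter
from typing import List

DISCOURAGED = {"whatever", "totally"}  # hypothetical, user-configurable
MAX_LIKE = 1                           # hypothetical tolerance for "like"

def flag_fillers(utterance: str) -> List[str]:
    """Return the discouraged words found, plus 'like' if it is overused."""
    counts = Counter(re.findall(r"[a-z']+", utterance.lower()))
    flagged = sorted(word for word in DISCOURAGED if counts[word])
    if counts["like"] > MAX_LIKE:
        flagged.append("like")
    return flagged

print(flag_fillers("So, I'm like, is it totally, like, gonna, you know, rain or whatever?"))
print(flag_fillers("I like rain"))
```

The first request trips all three alarms; the second passes untouched.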
Analytics could be employed to inform parents as to how their children are faring. Or even the parents themselves.
It wouldn’t be easy. Challenges include recognizing who’s doing the speaking and coming up with a universally acceptable set of criteria for what "proper language and usage" means. Those criteria would have to accommodate regionally acceptable dialect (e.g., "y’all" or "eh?") and acknowledge that "none is/none are" in the above example is in fact a matter of debate.
Loquitur might be one good way of making sure that there’s a hopeful answer to the question, "Is our children learning?"
Lee Gruenfeld is a Principal with the TechPar Group in New York, a boutique consulting firm consisting exclusively of former C-level executives and "Big Four" partners. He was Vice President of Strategic Initiatives for Support.com, Senior Vice President and General Manager of a SaaS division he created for a technology company in Las Vegas, national head of professional services for computing pioneer Tymshare, and a Partner in the management consulting practice of Deloitte in New York and Los Angeles. Lee is also the award-winning author of fourteen critically-acclaimed, best-selling works of fiction and non-fiction.
© 2023 Newsmax. All rights reserved.