Maybe, but I don’t think so.
I’ve done several searches using a variety of terms that should be relevant and using a variety of platforms from regular Google search, to Google Scholar searches, to Semantic Scholar searches.
If there’s a research paper, patent, blog post, or just about any other publicly accessible resource out there, then I feel like I should have found it; I usually find what I’m looking for.
If I missed something key, I’d love to hear about it.
I found plenty of resources on Active Noise Control (ANC), and plenty more on AI voice assistants, but the closest match I could find for this application was a paper on noise cancellation for voice assistants. Alas, its scope was limited to external noise, as discussed above for multi-mic setups: https://pdfs.
Diagram by Silentium

A quick overview of terms:

Active Noise Control (ANC), sometimes referred to as Active Noise Cancellation, filters out unwanted noise by recording the unwanted noise and then playing back its inverse (essentially the same waveform with its polarity flipped, which for a pure tone is the same as shifting it back half a cycle), thus canceling out the unwanted noise while keeping the wanted audio.
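To make the phase-inversion idea concrete, here's a minimal sketch in Python (the tone, sample rate, and variable names are illustrative, not from any real ANC system):

```python
import numpy as np

# Core ANC idea: adding an inverted copy of a waveform cancels it.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate

noise = np.sin(2 * np.pi * 440 * t)   # unwanted tone
anti_noise = -noise                   # same waveform, polarity-inverted
residual = noise + anti_noise         # what the listener would hear

print(np.max(np.abs(residual)))  # → 0.0 (perfect cancellation in this ideal case)
```

Real ANC is harder than this toy case, of course, since the anti-noise has to reach the listener's ear aligned in time and level with the noise.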
Our problem effectively boils down to audio feedback, which is where a system is picking up its own audio output within the audio input.
In a live sound system like in a theatre, feedback quickly shifts into an ear-splitting screech because of a feedback loop amplifying the high frequencies.
Fortunately, our voice assistants don’t succumb to feedback loops because they don’t immediately output their input audio, but whenever your assistant triggers itself, it is reacting to feedback.
My background in A/V systems is likely why I previously assumed that this method was already widely used.
How it’d work:

This system could be implemented entirely within the software of most voice assistant systems, and thus deployed with an update to current-generation voice hardware such as Google Homes and Amazon Echos.
Instead of using a separate external microphone as the input to the ANC filter, I propose using the audio feed that is being sent to the speaker.
Since the device is generating that audio feed, we can know with 100% certainty that we don’t need to use any of that audio as input.
This should not only prevent feedback from accidentally triggering the assistant, but it’d also help the assistant focus on the human’s commands while audio is being outputted, thus making it easier to shout over your music when you want to talk to your assistant.
Best of all, this approach doesn’t ever require pinging the cloud for verification, unlike Amazon’s approach, and shouldn’t add undue processing overhead to the assistant device as ANC can be a relatively simple calculation.
Just like most people are unable to tickle themselves because our brains know what inputs are being generated by our own actions, this method would make AI assistants unable to trigger themselves.
The orange elements represent the software bits that I propose adding to the system.
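As a rough sketch of the proposal: in practice the speaker-to-microphone path adds delay and coloration, so the output feed can't just be inverted and summed; an adaptive filter can estimate that path and subtract the device's own audio from the mic signal. Here's a toy NLMS canceller (the algorithm choice, names, and numbers are my own illustration, not any vendor's implementation):

```python
import numpy as np

def nlms_cancel(reference, mic, filter_len=64, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` (the audio the
    device is playing) from `mic` (what the microphone hears).
    A sketch of the idea, not a production echo canceller."""
    w = np.zeros(filter_len)
    out = np.zeros_like(mic)
    for n in range(filter_len, len(mic)):
        x = reference[n - filter_len:n][::-1]  # recent reference samples
        y = w @ x                              # estimate of the device's own echo
        e = mic[n] - y                         # residual: ideally just the human
        w += (mu / (eps + x @ x)) * e * x      # NLMS weight update
        out[n] = e
    return out

# Hypothetical demo: the mic hears a delayed, attenuated copy of the output.
rng = np.random.default_rng(0)
played = rng.standard_normal(8000)             # audio the assistant is playing
echo = 0.6 * np.concatenate([np.zeros(10), played[:-10]])
mic = echo                                     # no human speech in this toy case
cleaned = nlms_cancel(played, mic)
print(np.mean(cleaned[4000:] ** 2) < np.mean(mic[4000:] ** 2))  # → True
```

After the filter converges, the residual energy drops well below the raw mic energy, which is exactly the "stop hearing yourself" behavior proposed above.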
Similar implementations of this idea:

While I wasn’t able to find any evidence of this technique being used in voice assistant devices, I did find the method in other applications, such as feedback suppression in pro audio sound systems and speakerphones, like this expired patent: https://patents.google.com/patent/US6212273B1/en

So if this is already part of an existing AI assistant’s design, I apologize for restating the obvious, but keep reading, because there are a few ways to build on this idea which I don’t believe have been developed yet.
To the best of my knowledge, this method or anything similar is not currently being used in any AI assistants.
Potential roadblocks:

One possible issue could be digitally introducing feedback when little acoustic feedback is actually being picked up by the device.
Many voice assistant devices have pretty sophisticated microphone arrays and speakers designed to minimize feedback, so a simplistic implementation of this method could theoretically be adding new audio to the input if that same audio is not being picked up by the microphone loudly enough.
This is due to how ANC works: simply playing back the same audio out of phase.
One way to avoid this problem would be to scale the gain/volume of the output audio sent to the ANC by the level of audio coming in from the microphone input.
This scaling could happen dynamically as a function of the ANC algorithm, and a baseline could be set every time that the device is powered on as a way to calibrate it for its specific acoustic environment.
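A minimal sketch of that gain-scaling safeguard, assuming a simple RMS-level comparison (the function names and the capped-ratio rule are my own illustration):

```python
import numpy as np

def rms(x):
    """Root-mean-square level of an audio chunk."""
    return np.sqrt(np.mean(x ** 2))

def scaled_reference(output_chunk, mic_chunk, eps=1e-8):
    """Scale the reference fed to the ANC filter by how loudly the
    device's own output actually shows up at the microphone, so the
    filter never subtracts audio that isn't acoustically present."""
    # Ratio of mic level to output level, capped at 1.0 so we never
    # remove more than could plausibly have been picked up.
    gain = min(1.0, rms(mic_chunk) / (rms(output_chunk) + eps))
    return gain * output_chunk

out = np.ones(256)
quiet_mic = 0.2 * np.ones(256)   # the device barely hears itself
ref = scaled_reference(out, quiet_mic)
print(round(rms(ref), 2))  # → 0.2
```

The power-on calibration mentioned above would just seed that gain with a measured baseline instead of estimating it from scratch on every chunk.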
Another potential downside might be the need to retrain trigger word detection algorithms to account for the filtering of the ANC algorithm.
However, from what I understand about the algorithms, they would likely be robust enough to function well under those different parameters.
Plus, if they were trained with very clean audio, this filtered audio would likely be much closer to the training data.
Further capabilities:

The holy grail of AI voice interfaces is Full Duplex Communication with the AI, where the AI’s speech and the human’s speech can overlap and both parties can keep up.
Current versions of voice assistants are like walkie-talkies, where if you interrupt it, it has to stop talking; this is referred to as Half Duplex.
While the steps to making a voice AI that is convincingly human are much more complex than just avoiding feedback, feedback could definitely be a major stumbling block for the AI.
This method should help get us there by preventing the AI from listening to its own words.
Additionally, while this method talks about the output from one device digitally being fed through an ANC filter, there’s no reason to limit our thinking to feedback from just the same device.
The assistant could potentially receive a feed of audio from any device within earshot.
Say, for example, you didn’t want audio from a Chromecast or a smart TV confusing your assistant: you could set those devices up to digitally send their audio feeds to the assistant, which would then filter out their signals as irrelevant or unwanted.
Similar to the issue mentioned above if there is a lack of acoustic feedback, it’d be crucial for the devices to all calibrate themselves in their environments.
This could be accomplished by having each device play a quick calibration tone whenever it boots up, which assistants within earshot then use to set the appropriate levels on their ANC filters.
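That boot-time handshake could look something like this sketch, where the assistant sets a per-device gain from the calibration tone it hears (the tone frequency, amplitude, and names are all hypothetical):

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

# A device plays a known calibration tone at boot; the assistant knows
# the reference waveform and measures the level it actually receives.
sample_rate = 16000
t = np.arange(sample_rate // 10) / sample_rate        # 100 ms tone
tone_played = np.sin(2 * np.pi * 1000 * t)            # known reference tone
tone_heard = 0.35 * tone_played                       # what the assistant's mic picks up

anc_gain = rms(tone_heard) / rms(tone_played)         # per-device ANC filter gain
print(round(anc_gain, 2))  # → 0.35
```

Each external device would get its own gain this way, so a quiet TV across the room is filtered more gently than a loud speaker right next to the assistant.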
This would solve the problem that Amazon attempts with their Alexa fingerprint database, by simply having the TV tell the assistant to ignore all audio from it, including any potential triggers.
Why Present This Proposal

With the level of detail that I’ve developed for this idea and my research into similar applications, I could probably successfully file for a patent.
But I believe in sharing good ideas and as far as I could tell, this idea has never been shared, even though it seemed incredibly obvious to me.
Plus, this should count as prior art to preempt anyone else from filing a patent for the application of internal ANC to voice assistants.
I don’t have the resources to execute on every idea I have, so why not put them out there and see if anyone else can benefit? I’d love to hear from you if you end up using this idea or if you’d like to work together on something.
If any experts in ANC, HCI (voice), Conversational AI, or any other related field are reading who would like to explore this deeper, please let me know.
I’d be open to researching this further or testing it out.
Perhaps I missed some key factor or work already proposing this method.
If so, I’d really like to find out about it.
Also, I have a tangential idea for preventing accidental AI triggers using ultrasonic frequencies.
Let me know if anyone would be interested in hearing more about it, and I’d be happy to develop the idea further.
I look forward to hearing any feedback or questions anyone has on this article or the topics discussed, either in the responses here or on social media.
Feel free to connect with me (just let me know that you saw this article) →twitter.