Whether it be multiple meetings occurring in a small space, children playing loudly nearby, or construction noise outside your home office, unwanted background noise can be a real distraction in Teams meetings. We are excited to announce that users will have the ability to remove unwelcome background noise during their calls and meetings with our new AI-based noise suppression option.
Users can enable this helpful new feature by adjusting their device settings before their call or meeting and selecting “High” in the “Noise suppression” drop-down (note that this feature is currently only supported in the Teams Windows desktop client). See this support article for details on how to turn it on: https://aka.ms/noisesuppression.
Our new noise suppression feature works by analyzing an individual’s audio feed and using specially trained deep neural networks to filter out noise while retaining only speech. While traditional noise suppression algorithms can only address simple stationary noise sources, such as a consistent fan, our AI-based approach learns the difference between speech and unwanted noise and is able to suppress various non-stationary noises, such as keyboard typing or food wrapper crunching. With the increase in work from home due to the COVID-19 pandemic, noises such as vacuuming, your child’s concurrent school lesson, or kitchen sounds have become more common, but they are effectively removed by our new AI-based noise suppression, as exemplified in the video below.
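To make the idea concrete, a common way to structure such a system is mask-based speech enhancement: transform the audio into the time-frequency domain, have a model estimate a per-bin “speech probability” mask, apply the mask, and transform back. The sketch below (NumPy only) shows that pipeline; the `toy_mask` function is a simple spectral-gating stand-in for the trained deep neural network, not the actual Teams model.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform with a Hann window."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, frame_len=512, hop=256):
    """Inverse STFT via windowed overlap-add."""
    win = np.hanning(frame_len)
    frames = np.fft.irfft(S, n=frame_len, axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)

def toy_mask(mag):
    """Placeholder for the trained network: a crude noise-floor
    estimate stands in for the DNN's per-bin speech mask (0..1)."""
    noise_floor = np.median(mag, axis=0, keepdims=True)
    return np.clip(mag / (noise_floor * 3.0), 0.0, 1.0)

def suppress(x):
    """Analyze -> mask -> resynthesize, the shape of a real system."""
    S = stft(x)
    mask = toy_mask(np.abs(S))
    return istft(S * mask)
```

In a production system the mask estimator is the learned component; the surrounding transform/overlap-add machinery stays essentially the same.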
The AI-based noise suppression relies on machine learning (ML) to learn the difference between clean speech and noise. The key is to train the ML model on a representative dataset to ensure it works in the full range of situations our Teams customers experience. There needs to be enough diversity in the dataset in terms of the clean speech, the noise types, and the environments from which our customers join online meetings.
To achieve this dataset diversity, we have created a large dataset with approximately 760 hours of clean speech data and 180 hours of noise data. To comply with Microsoft’s strict privacy standards, we ensured that no customer data was collected for this dataset. Instead, we used either publicly available data or crowdsourcing to collect specific scenarios. For clean speech we ensured a balance of female and male speech, and we collected data from more than 10 languages, including tonal languages, to ensure that our model will not change the meaning of a sentence by distorting the tone of the words. For the noise data we included 150 noise types to cover the diverse scenarios our customers may run into, from keyboard typing to toilet flushing or snoring. Another important aspect was to include emotions in our clean speech so that expressions like laughter or crying are not suppressed. The characteristics of the environment from which our customers join their online Teams meetings have a strong impact on the speech signal as well. To capture that diversity, we trained our model with data from more than 3,000 real room environments and more than 115,000 synthetically created rooms.
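Training pairs for this kind of model are typically synthesized by mixing a clean utterance with a noise clip at a chosen signal-to-noise ratio, optionally convolving the speech with a room impulse response first to simulate the meeting environment. The helper below is a minimal sketch of that recipe; the function name and signature are illustrative, not from the Teams training code.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rir=None):
    """Create a (noisy, clean) training pair at a target SNR in dB.

    clean : 1-D clean-speech signal
    noise : 1-D noise clip (looped/truncated to match length)
    rir   : optional room impulse response to simulate the room
    """
    if rir is not None:
        # Convolve speech with the room response, keep original length.
        clean = np.convolve(clean, rir)[:len(clean)]
    noise = np.resize(noise, clean.shape)  # loop noise to full length
    clean_pow = np.mean(clean ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean_pow / (gain^2 * noise_pow) hits snr_db.
    gain = np.sqrt(clean_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    noisy = clean + gain * noise
    return noisy, clean
```

Sampling speakers, languages, noise types, SNRs, and room responses at random from large pools is what produces the dataset diversity described above.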
Since we use deep learning, it is important to have a powerful model-training infrastructure. We use Microsoft Azure to allow our team to develop improved versions of our ML model. Another challenge is that extracting the original clean speech from the noise needs to be done in a way the human ear perceives as natural and pleasant. Since there are no objective metrics that correlate highly with human perception, we developed a framework that allowed us to send the processed audio samples to crowdsourcing vendors, where human listeners rated their audio quality on a one-to-five-star scale to produce mean opinion scores (MOS). With these human ratings we were able to develop a new perceptual metric which, together with the subjective human ratings, allowed us to make fast progress on improving the quality of our deep learning models.
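The MOS aggregation itself is simple: average the one-to-five-star ratings per audio clip, usually alongside a confidence interval so that model comparisons are statistically meaningful. The sketch below shows that calculation (the function name is illustrative; a normal approximation is assumed for the interval).

```python
import numpy as np

def mean_opinion_score(ratings):
    """Aggregate crowdsourced 1-5 star ratings into a MOS plus a
    95% confidence interval (normal approximation)."""
    r = np.asarray(ratings, dtype=float)
    assert np.all((r >= 1) & (r <= 5)), "ratings must be 1-5 stars"
    mos = r.mean()
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci95
```

Narrow confidence intervals require many ratings per clip, which is one reason a scalable crowdsourcing pipeline matters for iterating quickly on model quality.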
To advance research in this field, we have also open-sourced our dataset and the perceptual quality crowdsourcing framework. These have been the basis of two competitions we hosted as part of the Interspeech 2020 and ICASSP 2021 conferences, as outlined here: https://www.microsoft.com/en-us/research/dns-challenge/home/
Finally, we ensured that our deep learning model could run efficiently on the Teams client in real time. By optimizing for human perception, we were able to achieve a good trade-off between quality and complexity, which ensures that most Windows devices our customers are using can take advantage of our AI-based noise suppression. Our team is currently working on bringing this feature to our Mac and mobile platforms as well.
AI-based noise suppression is an example of how our deep learning technology has a profound impact on our customers’ quality of experience.