Speech recognition

Speech recognition technologies allow computers equipped with a source of sound input, such as a microphone, to interpret human speech, e.g. for transcription or as an alternative method of interacting with a computer.

Classification

Such systems can be classified according to

  • whether they require the user to "train" the system to recognise their own particular speech patterns or not,
  • whether the system is trained for one user only or is speaker independent,
  • whether the system can recognise continuous speech or requires users to break up their speech into discrete words,
  • whether the system is intended for clear speech material, or is designed to operate on distorted transfer channels (e.g. cellular phones), possibly with background noise or other speakers talking simultaneously, and
  • whether the vocabulary the system recognises is small (in the order of tens or at most hundreds of words), or large (thousands of words).

Speaker dependent systems requiring a short amount of training can (as of 2001) capture continuous speech with a large vocabulary at normal pace with an accuracy of about 98% (getting two words in one hundred wrong) if operated under optimal conditions, and different systems that require no training can recognize a small number of words (for instance, the ten digits of the decimal system) as spoken by most English speakers. Such systems are popular for routing incoming phone calls to their destinations in large organisations.

Use

Commercial systems for speech recognition have been available off-the-shelf since the 1990s. Despite the apparent success of the technology, few people use such speech recognition systems on their desktop computers. It appears that most computer users can create and edit documents and interact with their computer more quickly with conventional input devices, a keyboard and mouse, even though most people are able to speak considerably faster than they can type. Using keyboard and speech recognition together, however, can in some cases be more efficient than using either input alone. A typical office environment, with a high level of background speech, is one of the most adverse environments for current speech recognition technologies, and speaker-independent, large-vocabulary systems designed to operate in such environments have significantly lower recognition accuracy. The typical achievable recognition rate as of 2005 for large-vocabulary speaker-independent systems is about 80%-90% in a clear environment, but can be as low as 50% in scenarios such as cellular phones with background noise. Additionally, heavy use of the speech organs can result in vocal loading.

Nevertheless, speech recognition technology is used more and more for telephone applications like travel booking and information, financial account information, customer service call routing, and directory assistance. Using constrained grammar recognition (described below), such applications can achieve remarkably high accuracy. Research and development in speech recognition technology has continued to grow as the cost for implementing such voice-activated systems has dropped and the usefulness and efficacy of these systems has improved. For example, recognition systems optimized for telephone applications can often supply information about the confidence of a particular recognition, and if the confidence is low, it can trigger the application to prompt callers to confirm or repeat their request (for example "I heard you say 'billing', is that right?"). Furthermore, speech recognition has enabled the automation of certain applications that are not automatable using push-button interactive voice response (IVR) systems, like directory assistance and systems that allow callers to "dial" by speaking names listed in an electronic phone book. Nevertheless, speech recognition based systems remain the exception because push-button systems are still much cheaper to implement and operate.
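
The confidence-driven confirmation flow described above can be sketched as follows. This is a hypothetical illustration in Python: the recognize callable, the prompt helpers, and the threshold values are assumptions, not the API of any particular IVR platform.

```python
# Hypothetical sketch of confidence-based confirmation in a telephone application.
# `recognize` stands in for whatever ASR call the platform actually provides; it
# is assumed to return the best hypothesis plus a confidence score in [0, 1].

CONFIRM_THRESHOLD = 0.45   # below this, ask the caller to confirm
REJECT_THRESHOLD = 0.20    # below this, reprompt outright

def handle_caller_response(recognize, say, ask_yes_no):
    hypothesis, confidence = recognize()
    if confidence < REJECT_THRESHOLD:
        say("Sorry, I didn't catch that. Please say it again.")
        return None                                    # caller will be reprompted
    if confidence < CONFIRM_THRESHOLD:
        # Low-confidence result: confirm before acting on it.
        if not ask_yes_no(f"I heard you say '{hypothesis}', is that right?"):
            return None
    return hypothesis                                   # confident or confirmed result
```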

Approaches

The two most common approaches used to recognize a speaker’s response are often called grammar constrained recognition and natural language recognition. When ASR (Automatic Speech Recognition) is used to transcribe speech, it is commonly called dictation.

Grammar constrained recognition

Grammar constrained recognition works by constraining the possible recognized phrases to a small or medium-sized formal grammar of possible responses, which is typically defined using a grammar specification language. This type of recognition works best when the speaker is providing short responses to specific questions, like yes-no questions; picking an option from a menu; selecting an item from a well-defined list, such as financial securities like stocks and mutual funds or names of airports; or reading a sequence of numbers or letters, like an account number.

The grammar specifies the most likely words and phrases a person will say in response to a prompt and then maps those words and phrases to a token, or semantic concept. For example, a yes-no grammar might map "yes", "yeah", "uh-huh", "sure", and "okay" to the token "yes", and "no", "nope", "nuh-uh", and "no way dude!" to the token "no". A grammar for entering a 10-digit account number would have ten slots, each of which contains one digit from zero through nine, and the result from the grammar would be the 10-digit number that was spoken.
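
As an illustration only, here is roughly what the resulting phrase-to-token mapping looks like in Python. Real applications define the grammar in a grammar specification language rather than in application code; the word lists below are just the ones from the example above.

```python
# Toy version of the yes-no grammar above: map recognized phrases to semantic tokens.
YES_NO_GRAMMAR = {
    "yes": "yes", "yeah": "yes", "uh-huh": "yes", "sure": "yes", "okay": "yes",
    "no": "no", "nope": "no", "nuh-uh": "no", "no way dude!": "no",
}

def interpret(utterance, grammar):
    """Return the semantic token, or None if the utterance is out of grammar."""
    return grammar.get(utterance.strip().lower())

# The 10-digit account-number grammar reduces to ten slots, each holding one digit.
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def interpret_account_number(words):
    if len(words) != 10 or any(w not in DIGITS for w in words):
        return None                      # out of grammar: the application reprompts
    return "".join(DIGITS[w] for w in words)

print(interpret("Yeah", YES_NO_GRAMMAR))   # -> "yes"
print(interpret_account_number("one two three four five six seven eight nine zero".split()))
```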

If the speaker says something that doesn't match an entry in the grammar, recognition will fail. Typically, if recognition fails, the application will reprompt users to repeat what they said, and recognition will be tried again. A well-designed system that is repeatedly unable to understand the user (typically because the caller misunderstood the question, has a thick accent, is mumbling, or is speaking over a large amount of background noise or interference) should fall back to some other input method or transfer the call to an operator. Research shows that callers who are asked to repeat themselves over and over quickly become frustrated and agitated.

Natural language recognition

Natural language recognition allows the speaker to provide natural, sentence-length responses to specific questions. Natural language recognition uses statistical models. The general procedure is to create as large a corpus as possible of typical responses, with each response matched up to a token or concept. In most approaches, a technique called Wizard of Oz is used. A person (the wizard) listens in real time or via recordings to a very large number of speakers responding naturally to a prompt. The wizard then selects the concept that represents what the user meant. A software program then analyzes the corpus of spoken utterances and their corresponding semantics and it creates a statistical model which can be used to map similar sentences to the appropriate concepts for future speakers.
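
As a sketch of the final step, the fragment below trains a simple bag-of-words classifier on a tiny labelled corpus and uses it to map a new utterance transcript to a concept. The corpus, the concept labels, and the choice of scikit-learn are illustrative assumptions; production systems use far larger corpora and more sophisticated statistical models.

```python
# Minimal sketch: map utterance transcripts to semantic concepts with a
# statistical model trained on labelled examples (a stand-in for the
# Wizard-of-Oz corpus described above). Requires scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

corpus = [
    ("I have a problem with my bill", "billing"),
    ("I was charged incorrectly", "billing"),
    ("how much do I owe this month", "billing"),
    ("my internet connection is down", "tech_support"),
    ("the service keeps dropping out", "tech_support"),
]
texts, concepts = zip(*corpus)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, concepts)

print(model.predict(["there is a mistake on my bill"]))  # -> ['billing']
```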

For example, an application that routes phone calls for a customer helpdesk asks the caller to briefly describe their problem. For the concept "forward my call to the billing department", you would want to recognize sentences like "I have a problem with my bill", "I was charged incorrectly", "How much do I owe this month", etc. While you could construct a grammar with all the likely keywords (bill, charge, charged, owed, etc.), if the caller speaks in sentences you may pick up multiple conflicting matches, and you might miss sentences that fit the right pattern but happen not to contain the pre-ordained keywords. It is difficult to create large, rich grammars that consider the context in which the words are said. In addition, as a grammar gets very large, the chances of having similar-sounding words in the grammar greatly increase.

The obvious advantage of natural language recognition over the grammar constrained approach is that it is unnecessary to identify the exact words and phrases. A big disadvantage, though, is that for it to work well, the corpus must typically be very large. Creating a large corpus is time consuming and expensive. Furthermore, open-ended questions used by such systems encourage callers to speak quickly and be creative in their responses in a way that often makes it difficult for computers to understand what they mean. Also, in such systems it is difficult to devise a list of possible confirmation prompts to assure callers that their requests were correctly recognized. Instead, most successful voice response applications use prompts that encourage the caller to use short phrases, which are more likely to be correctly recognized using grammar-constrained recognition.

Some systems use a hybrid of constrained grammar and natural language recognition that permits sentence-length responses to specific questions, but ignores the irrelevant part of the sentence using a natural language "garbage model". Combining this approach with prompts that encourage short answers can be effective at maximizing recognition accuracy.

Dictation

Dictation is used to transcribe someone's speech word for word. Unlike grammar constrained and natural language recognition, dictation does not require semantic understanding. The goal isn't to understand what the speaker meant by their speech, just to identify the exact words. However, contextual understanding of what is being said can greatly improve the transcription. Many words, at least in English, sound alike. If the dictation system doesn't understand the context for one of these words, it will not be able to confidently identify the correct spelling.

Technical issues

Modern speech recognition systems are generally based on Hidden Markov Models (HMMs). This is a statistical model which outputs a sequence of symbols or quantities. Having a model which gives us the probability of an observed sequence of acoustic data given one or another word (or word sequence) will enable us to work out the most likely word sequence by the application of Bayes' rule:

<math>P(\text{word}|\text{acoustics}) = \frac{p(\text{acoustics}|\text{word})\,P(\text{word})}{p(\text{acoustics})}</math>

For a given sequence of acoustic data (think Wave file), p(acoustics) is a constant and can be ignored. P(word) is the prior likelihood of the word, obtained through language modeling (a science in itself; suffice it to say that P(mushroom soup) > P(much rooms hope)); p(acoustics|word) is the most involved term on the right hand side of the equation and is obtained from the aforementioned HMM.
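
To make the use of Bayes' rule concrete, the sketch below compares two competing hypotheses for the same audio by summing, in the log domain, an acoustic score p(acoustics|word) and a language-model prior P(word); p(acoustics) is dropped because it is the same for every hypothesis. The numbers are invented purely for illustration.

```python
# Illustration of the decision rule above: choose the word sequence that
# maximizes p(acoustics | words) * P(words), working in log space.
# The scores are made up; in practice the first comes from the HMM and
# the second from a language model. p(acoustics) is constant and ignored.
hypotheses = {
    #                  (log p(acoustics | words),  log P(words))
    "mushroom soup":   (-120.0,                     -9.2),
    "much rooms hope": (-119.5,                    -21.7),  # acoustically close, unlikely English
}

best = max(hypotheses, key=lambda words: sum(hypotheses[words]))
print(best)  # -> "mushroom soup": the language model outweighs the small acoustic edge
```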

HMMs are popular because they can be trained automatically and are simple and computationally feasible to use. For the purposes of this explanation, HMM will have to remain not much more than a TLA (three letter acronym). In speech recognition, to give the very simplest setup possible, the HMM would output a sequence of n-dimensional real-valued vectors, with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The HMM will tend to have, in each state, a statistical distribution called a mixture of diagonal Gaussians which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; an HMM for a sequence of words or phonemes is made by concatenating the individual trained HMMs for the separate words and phonemes.
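
The feature-extraction step just described (windowed Fourier transform, log spectrum, cosine transform, keep the first coefficients) can be sketched roughly as follows; numpy and scipy are assumed dependencies, and a real front end would add a mel filterbank, pre-emphasis, delta coefficients, and so on.

```python
# Rough sketch of the cepstral features described above: Fourier-transform a
# short window of speech, take the log spectrum, decorrelate it with a cosine
# transform, and keep the first (most significant) coefficients.
import numpy as np
from scipy.fft import dct

def cepstral_features(frame, n_coeffs=13):
    windowed = frame * np.hamming(len(frame))        # short-time window
    spectrum = np.abs(np.fft.rfft(windowed))         # magnitude spectrum
    log_spectrum = np.log(spectrum + 1e-10)          # avoid log(0)
    cepstrum = dct(log_spectrum, norm="ortho")       # decorrelating cosine transform
    return cepstrum[:n_coeffs]

# e.g. a 25 ms frame at 16 kHz is 400 samples; one such vector per frame
frame = np.random.randn(400)
print(cepstral_features(frame).shape)                # -> (13,)
```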

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time consuming to explain properly, but, just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so phones with different left and right context have different realizations); to handle unseen contexts it would need tree clustering of the contexts; it would of course use cepstral normalization to normalize for different recording conditions, and, depending on the length of time the system had to adapt to different speakers and conditions, it might use cepstral mean and variance normalization for channel differences, VTLN for male-female normalization, and MLLR for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use HLDA; or the system might skip the delta and delta-delta coefficients and use LDA followed perhaps by HLDA or a global semi-tied covariance transform (also known as MLLT). A serious company with a large amount of training data would probably want to consider discriminative training techniques like MMI, MPE, or (for short utterances) MCE, and if a large amount of speaker-specific enrollment data was available a more wholesale speaker adaptation could be done using MAP or, at least, tree-based MLLR.

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, but there is a choice between dynamically creating the combination HMM, which includes both the acoustic and language model information, or combining it statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.
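
As a very small sketch of the Viterbi step mentioned above, the function below finds the best state path through a toy HMM given log initial, transition, and emission scores. A real decoder operates on the combined acoustic and language-model HMM, uses likelihoods from the Gaussian mixtures rather than a lookup table, and prunes heavily; the array shapes here are illustrative assumptions.

```python
# Toy Viterbi decoder: find the most likely state sequence for a sequence of
# discrete observations, given log initial, transition, and emission scores.
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    n_states = len(log_init)
    T = len(observations)
    score = np.full((T, n_states), -np.inf)          # best log score ending in each state
    backptr = np.zeros((T, n_states), dtype=int)     # best predecessor state
    score[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(n_states):
            cand = score[t - 1] + log_trans[:, s]
            backptr[t, s] = np.argmax(cand)
            score[t, s] = cand[backptr[t, s]] + log_emit[s, observations[t]]
    # Trace the best path backwards from the final frame.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

In practice the composite state space is enormous, so real decoders add beam pruning on top of this dynamic-programming recursion.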

The more popular speech recognition conferences held each year or two include ICASSP, Eurospeech/Interspeech, and the smaller ASRU. Examining the proceedings of these conferences, or journals such as the IEEE Transactions on Speech and Audio Processing or Computer Speech and Language, may give the interested person a flavor of the current work, but be warned that much of the research is either not very good or is unnecessarily obfuscated. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date. Keep an eye on government-sponsored competitions such as those organised by DARPA (the telephone speech evaluation was most recently known as Rich Transcription). In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start both to learn about speech recognition and to start experimenting (if you are very brave). You could also search for Carnegie Mellon University's SPHINX toolkit.

Some other key technical problems in speech recognition are:

  • Speech recognition systems are based on simplified stochastic models, so any aspects of the speech that may be important to recognition but are not represented in the models cannot be used to aid recognition.
  • Co-articulation of phonemes and words, depending on the input language, can make the task of speech recognition considerably more difficult. In some languages, like English, co-articulatory effects are extensive and far-reaching, meaning that the expected phonetic signal of a whole utterance can be vastly different from a simple concatenation of the expected phonetic signal of each sound or word. Consider for example the sentence "what are you going to do?", which when spoken might sound like "whatchagonnado?", which has a phonetic signal that is very different from the expected phonetic signal of each word separately.
  • Intonation and sentence stress can play an important role in the interpretation of an utterance. As a simple example, utterances that might be transcribed as "go!", "go?" and "go." can clearly be distinguished by a human, but determining which intonation corresponds to which punctuation is difficult for a computer. Most speech recognition systems are unable to provide any information about an utterance beyond which words were pronounced, so information about stress and intonation cannot be used by the application using the recognizer. Researchers are currently investigating emotion recognition, which may have practical applications; for example, if a system detects anger or frustration, it can try asking different questions or forward the caller to a live operator.
  • In a system designed for dictation, an ordinary spoken signal doesn't provide sufficient information to create a written form that obeys the normal rules for written language, such as punctuation and capitalization. These systems typically require the speaker to say explicitly where punctuation is to appear.
  • In naturally spoken language, there are no pauses between words, so it is difficult for a computer to decide where word boundaries lie. Some sets of utterances can sound the same but can only be disambiguated by an appeal to context: one famous T-shirt worn by Apple Computer researchers made this point with the phrase "I helped Apple wreck a nice beach", which, when spoken, sounds like "I helped Apple recognize speech". Using common sense and context to disambiguate cases like this can be considered a separate field of inquiry: natural language understanding. A general solution to many of the above problems effectively requires human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer. In particular, statistical language models are often employed for disambiguation and to improve recognition accuracy (a toy example follows this list).
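
The following toy bigram language model illustrates the disambiguation point from the last item: given two acoustically similar hypotheses, the one whose word pairs are better attested in a training corpus scores higher. The tiny corpus and the add-alpha smoothing are illustrative assumptions; a real system combines this score with the acoustic score rather than using it alone.

```python
# Toy bigram language model used to compare acoustically similar hypotheses.
import math
from collections import Counter

training_text = (
    "it is easy to recognize speech . "
    "computers can recognize speech . "
    "we walked along a nice beach ."
).split()

bigrams = Counter(zip(training_text, training_text[1:]))
unigrams = Counter(training_text)

def log_prob(sentence, alpha=0.1, vocab_size=1000):
    """Add-alpha smoothed log-probability of the sentence's bigrams."""
    words = sentence.split()
    return sum(
        math.log((bigrams[(prev, word)] + alpha) /
                 (unigrams[prev] + alpha * vocab_size))
        for prev, word in zip(words, words[1:])
    )

# The language model prefers the hypothesis whose bigrams it has seen before.
print(log_prob("recognize speech") > log_prob("wreck a nice beach"))  # -> True
```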

Market players

The challenge for developers of ASR engines is that the end customer judges them on one criterion: did it understand what I said? That leaves little room for differentiation. Of course, there are areas like multi-language support, tuning tools, and integration APIs (the proposed standard MRCP, or proprietary interfaces), but recognition quality is the most visible. Because of the complex algorithms and language models required to implement a high-quality speech recognition engine, it is difficult both for new companies to enter this market and for existing vendors to maintain the level of investment necessary to keep up and move ahead.

Currently, Nuance and ScanSoft dominate the speech recognition market for server-based telephony and PC applications (the two are now in the process of merging, with ScanSoft acquiring Nuance). There are several small vendors, like Aculab, Fonix Speech, Loquendo, LumenVox, Verbio, etc., but they are essentially niche players. The speech recognition side of ScanSoft is actually composed of SpeechWorks and the products of several former niche players. IBM has also participated in the speech recognition engine market, but their ViaVoice product has gained traction primarily in the desktop command-and-control (grammar-constrained) and dictation markets. ScanSoft also makes Dragon NaturallySpeaking, a desktop dictation system with recognition rates of up to 99 percent.

Embedded, speaker-independent speech recognition for mobile phones is one of the fastest-growing market segments. Grammar-based command-and-control and even dictation systems can now be purchased in mobile handsets from operators such as Cingular Wireless, Sprint PCS, Verizon Wireless, and Vodafone. VoiceSignal is the dominant vendor in this rapidly growing segment. Microsoft, ScanSoft, and IBM have also announced intentions to enter this segment.

This is all changing. The big software heavyweights, Microsoft (with Speech Server) and IBM, are now making substantial investments in speech recognition. IBM claims to have put one hundred speech researchers on the problem of taking ASR beyond the level of human speech recognition by 2010. Bill Gates is also making very large investments in speech recognition research at Microsoft; at SpeechTEK, Gates predicted that by 2011 the quality of ASR would catch up to human speech recognition. IBM and Microsoft are still well behind Nuance and ScanSoft in market share.


Navigation

  • Art and Cultures
    • Art (https://academickids.com/encyclopedia/index.php/Art)
    • Architecture (https://academickids.com/encyclopedia/index.php/Architecture)
    • Cultures (https://www.academickids.com/encyclopedia/index.php/Cultures)
    • Music (https://www.academickids.com/encyclopedia/index.php/Music)
    • Musical Instruments (http://academickids.com/encyclopedia/index.php/List_of_musical_instruments)
  • Biographies (http://www.academickids.com/encyclopedia/index.php/Biographies)
  • Clipart (http://www.academickids.com/encyclopedia/index.php/Clipart)
  • Geography (http://www.academickids.com/encyclopedia/index.php/Geography)
    • Countries of the World (http://www.academickids.com/encyclopedia/index.php/Countries)
    • Maps (http://www.academickids.com/encyclopedia/index.php/Maps)
    • Flags (http://www.academickids.com/encyclopedia/index.php/Flags)
    • Continents (http://www.academickids.com/encyclopedia/index.php/Continents)
  • History (http://www.academickids.com/encyclopedia/index.php/History)
    • Ancient Civilizations (http://www.academickids.com/encyclopedia/index.php/Ancient_Civilizations)
    • Industrial Revolution (http://www.academickids.com/encyclopedia/index.php/Industrial_Revolution)
    • Middle Ages (http://www.academickids.com/encyclopedia/index.php/Middle_Ages)
    • Prehistory (http://www.academickids.com/encyclopedia/index.php/Prehistory)
    • Renaissance (http://www.academickids.com/encyclopedia/index.php/Renaissance)
    • Timelines (http://www.academickids.com/encyclopedia/index.php/Timelines)
    • United States (http://www.academickids.com/encyclopedia/index.php/United_States)
    • Wars (http://www.academickids.com/encyclopedia/index.php/Wars)
    • World History (http://www.academickids.com/encyclopedia/index.php/History_of_the_world)
  • Human Body (http://www.academickids.com/encyclopedia/index.php/Human_Body)
  • Mathematics (http://www.academickids.com/encyclopedia/index.php/Mathematics)
  • Reference (http://www.academickids.com/encyclopedia/index.php/Reference)
  • Science (http://www.academickids.com/encyclopedia/index.php/Science)
    • Animals (http://www.academickids.com/encyclopedia/index.php/Animals)
    • Aviation (http://www.academickids.com/encyclopedia/index.php/Aviation)
    • Dinosaurs (http://www.academickids.com/encyclopedia/index.php/Dinosaurs)
    • Earth (http://www.academickids.com/encyclopedia/index.php/Earth)
    • Inventions (http://www.academickids.com/encyclopedia/index.php/Inventions)
    • Physical Science (http://www.academickids.com/encyclopedia/index.php/Physical_Science)
    • Plants (http://www.academickids.com/encyclopedia/index.php/Plants)
    • Scientists (http://www.academickids.com/encyclopedia/index.php/Scientists)
  • Social Studies (http://www.academickids.com/encyclopedia/index.php/Social_Studies)
    • Anthropology (http://www.academickids.com/encyclopedia/index.php/Anthropology)
    • Economics (http://www.academickids.com/encyclopedia/index.php/Economics)
    • Government (http://www.academickids.com/encyclopedia/index.php/Government)
    • Religion (http://www.academickids.com/encyclopedia/index.php/Religion)
    • Holidays (http://www.academickids.com/encyclopedia/index.php/Holidays)
  • Space and Astronomy
    • Solar System (http://www.academickids.com/encyclopedia/index.php/Solar_System)
    • Planets (http://www.academickids.com/encyclopedia/index.php/Planets)
  • Sports (http://www.academickids.com/encyclopedia/index.php/Sports)
  • Timelines (http://www.academickids.com/encyclopedia/index.php/Timelines)
  • Weather (http://www.academickids.com/encyclopedia/index.php/Weather)
  • US States (http://www.academickids.com/encyclopedia/index.php/US_States)

Information

  • Home Page (http://academickids.com/encyclopedia/index.php)
  • Contact Us (http://www.academickids.com/encyclopedia/index.php/Contactus)

  • Clip Art (http://classroomclipart.com)
Toolbox
Personal tools