Microsoft has been researching and developing speech technologies for over a decade. The team has continued to grow and over the years has released a series of increasingly powerful speech platforms. In recent years, Microsoft has placed an increasing emphasis on bringing speech technologies into mainstream usage. The strategy of coupling powerful speech technology with a powerful API has continued right through to Windows Vista.
There is also a state-of-the-art general purpose speech recognition engine. Not only is this an extremely accurate engine, but it's also available in a variety of languages. Windows Vista also includes the first of the new generation of speech synthesizers to come out of Microsoft, completely rewritten to take advantage of the latest techniques. The speech API allows developers to easily speech-enable Windows Forms applications and apps based on the Windows Presentation Foundation.
The concept of speech technology really encompasses two technologies: synthesizers and recognizers (see Figure 1). A speech synthesizer takes text as input and produces an audio stream as output.
Speech synthesis is also referred to as text-to-speech (TTS). A speech recognizer, on the other hand, does the opposite. It takes an audio stream as input, and turns it into a text transcription.
A lot has to happen for a synthesizer to accurately convert a string of characters into an audio stream that sounds just as the words would be spoken. The easiest way to imagine how this works is to picture the front end and back end of a two-part system. The front end specializes in the analysis of text using natural language rules. It analyzes a string of characters to figure out where the words are (which is easy to do in English, but not as easy in languages such as Chinese and Japanese).
This front end also figures out details like functions and parts of speech—for instance, which words are proper nouns, numbers, and so forth; where sentences begin and end; whether a phrase is a question or a statement; and whether a statement is past, present, or future tense.
All of these elements are critical to the selection of appropriate pronunciations and intonations for words, phrases, and sentences. Consider that in English, a question usually ends with a rising pitch, or that the word "read" is pronounced very differently depending on its tense. Clearly, understanding how a word or phrase is being used is a critical aspect of interpreting text into sound. To further complicate matters, the rules are slightly different for each language. So, as you can imagine, the front end must do some very sophisticated analysis.
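As a rough illustration of the kind of analysis the front end performs, here is a minimal Python sketch. It is a toy under stated assumptions, not how the Windows Vista synthesizer works: the function names, the cue-word list, and the informal pronunciation labels are all hypothetical, and a real front end uses far richer language rules.

```python
import re

# Informal pronunciation labels for the word "read"; chosen only for
# this illustration, not taken from any real lexicon.
READ_PRONUNCIATIONS = {"present": "REED", "past": "RED"}

# Hypothetical cue words that hint at past tense in this toy example.
PAST_CUES = {"yesterday", "already", "had", "has", "have", "was", "last"}


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]


def analyze_sentence(sentence):
    """Return a small analysis record for one sentence."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    is_question = sentence.endswith("?")
    # Very crude tense guess: any past-tense cue word marks the sentence.
    tense = "past" if PAST_CUES & set(words) else "present"
    return {
        "words": words,
        "intonation": "rising" if is_question else "falling",
        "pronunciations": {
            w: READ_PRONUNCIATIONS[tense] for w in words if w == "read"
        },
    }
```

Calling split_sentences on "I read it yesterday. Will you read it?" and passing each sentence to analyze_sentence gives a falling intonation with the "RED" pronunciation for the first sentence, and a rising intonation with "REED" for the second.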
The back end has quite a different task. It takes the analysis done by the front end and, through some non-trivial analysis of its own, generates the appropriate sounds for the input text.
Older synthesizers and today's synthesizers with the smallest footprints generate the individual sounds algorithmically, resulting in a very robotic sound. Modern synthesizers, such as the one in Windows Vista, utilize a database of sound segments built from hours and hours of recorded speech. The effectiveness of the back end depends on how good it is at selecting the appropriate sound segments for any given input and smoothly splicing them together.
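The idea of selecting sound segments and splicing them smoothly can be sketched in a few lines of Python. This is only a simplified model under loud assumptions: each recorded segment is reduced to a single pitch number, the cost functions are invented for the example, and nothing here reflects the actual Windows Vista engine. The function picks one segment per target sound so that the total of "how far is this segment from what the front end asked for" and "how big is the jump at each splice" is as small as possible.

```python
def select_units(targets, database):
    """Pick one recorded segment per target sound, minimising total cost.

    targets  -- list of (sound, desired_pitch) pairs from the front end
    database -- dict mapping a sound to a list of recorded pitches
    Returns the list of chosen pitches, one per target.
    """
    if not targets:
        return []

    # best[i][j] = (cheapest cost of ending at candidate j of target i,
    #               index of the previous candidate on that path)
    best = []
    for i, (sound, desired) in enumerate(targets):
        row = []
        for pitch in database[sound]:
            target_cost = abs(pitch - desired)
            if i == 0:
                row.append((target_cost, None))
                continue
            prev_pitches = database[targets[i - 1][0]]
            # Join cost: penalise pitch jumps at the splice point.
            cost, back = min(
                (best[i - 1][k][0] + abs(prev_pitches[k] - pitch), k)
                for k in range(len(prev_pitches))
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chosen = []
    for i in range(len(targets) - 1, -1, -1):
        chosen.append(database[targets[i][0]][j])
        j = best[i][j][1]
    return chosen[::-1]
```

With a database of {"h": [100, 140], "e": [110, 150], "y": [120, 200]} and targets [("h", 100), ("e", 150), ("y", 120)], the function returns [100, 110, 120]: it prefers the "e" segment at pitch 110 over the exact match at 150 because the smoother splices outweigh the worse target match.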
If this all sounds vastly complicated, well, it is. Having these text-to-speech capabilities built into the operating system is very advantageous, as it allows applications to just use this technology. There's no need to create your own speech engines. As you'll see later in the article, you can invoke all of this processing with a single function call. Lucky you!
Speech recognition is even more complicated than speech synthesis. However, it too can be thought of as having a front end and a back end. The front end processes the audio stream, isolating segments of sound that are probably speech and converting them into a series of numeric values that characterize the vocal sounds in the signal.
The back end is a specialized search engine that takes the output produced by the front end and searches across three databases: an acoustic model, a lexicon, and a language model.
The acoustic model represents the acoustic sounds of a language, and can be trained to recognize the characteristics of a particular user's speech patterns and acoustic environments.
The lexicon lists a large number of the words in the language, along with information on how to pronounce each word. The language model represents the ways in which the words of a language are combined.
Neither of these models is trivial. It's impossible to specify exactly what speech sounds like. And human speech rarely follows strict and formal grammar rules that can be easily defined.
An indispensable factor in producing good models is the acquisition of very large volumes of representative data. An equally important factor is the sophistication of the techniques used to analyze that data to produce the actual models. Of course, no word has ever been said exactly the same way twice, so the recognizer is never going to find an exact match. And for any given segment of sound, there are very many things the speaker could potentially be saying.
The quality of a recognizer is determined by how good it is at refining its search, eliminating the poor matches, and selecting the more likely matches. A recognizer's accuracy relies on it having good language and acoustic models, and good algorithms both for processing sound and for searching across the models. The better the models and algorithms, the fewer the errors that are made, and the quicker the results are found. Needless to say, this is a difficult technology to get right.
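To make the search described above a little more concrete, here is a minimal Python sketch of how a back end might weigh acoustic evidence against the language model. It is an assumption-laden toy: the probabilities are made up, the function name is hypothetical, and a real recognizer searches a vastly larger space with pruning rather than scoring a handful of complete candidates.

```python
import math


def best_transcription(acoustic_scores, language_model):
    """Return the candidate with the highest combined log-probability.

    acoustic_scores -- dict: candidate text -> P(audio | text), as a
                       hypothetical acoustic model and lexicon might report
    language_model  -- dict: candidate text -> P(text)
    """
    def score(text):
        # Small floor so text the language model has never seen is
        # heavily penalised rather than causing log(0).
        prior = language_model.get(text, 1e-9)
        return math.log(acoustic_scores[text]) + math.log(prior)

    return max(acoustic_scores, key=score)
```

For example, if the acoustic scores slightly favor "wreck a nice beach" (0.45) over "recognize speech" (0.40), but the language model considers "recognize speech" a hundred times more likely, the combined score selects "recognize speech".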
While the built-in language model of a recognizer is intended to represent a comprehensive language domain (such as everyday spoken English), any given application will often have very specific language model requirements.
A particular application will generally only require certain utterances that have particular semantic meaning to that application. Hence, rather than using the general purpose language model, an application should use a grammar that constrains the recognizer to listen only for speech that the application cares about.
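A grammar of this kind can be pictured as a fixed set of phrases, each tied to a value the application understands. The Python sketch below is purely illustrative and does not use the Windows speech API: the GRAMMAR table, the recognize function, and the similarity cutoff are all hypothetical, and text similarity from the standard difflib module stands in for real acoustic matching.

```python
import difflib

# Hypothetical command grammar: spoken phrase -> semantic value that the
# application acts on.
GRAMMAR = {
    "start notepad": ("launch", "notepad"),
    "switch to calculator": ("switch", "calculator"),
    "close paint": ("close", "paint"),
}


def recognize(heard, grammar=GRAMMAR, cutoff=0.6):
    """Map a noisy transcription to the closest phrase in the grammar.

    Returns (phrase, semantic_value), or None when nothing in the grammar
    is close enough, so the application never sees meaningless results.
    """
    matches = difflib.get_close_matches(
        heard.lower(), list(grammar), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return matches[0], grammar[matches[0]]
```

Here recognize("Start note pad") resolves to the "start notepad" phrase and its ("launch", "notepad") value, while an out-of-grammar utterance such as "play some music" yields None.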
This has a number of benefits: it increases the accuracy of recognition, it guarantees that all recognition results are meaningful to the application, and it enables the recognition engine to specify the semantic values inherent in the recognized text. Figure 2 shows one example of how these benefits can be put to use in a real-world scenario. Accuracy is only part of the equation.
With the Windows Vista speech recognition technology, Microsoft has a goal of providing an end-to-end speech experience that addresses key features that users need in a built-in desktop speech recognition experience.
This includes an interactive tutorial that explains how to use speech recognition technology and helps the user train the system to understand the user's speech. The system includes built-in commands for controlling Windows, allowing you to start, switch between, and close applications using commands such as "Start Notepad" and "Switch to Calculator."
Windows Vista speech technology includes built-in dictation capabilities for converting the user's voice into text and edit controls for inserting, correcting, and manipulating text in documents. You can correct misrecognized words by redictating, choosing alternatives, or spelling. For example, "Correct Robot, Robert." The user interface is designed to be unobtrusive, yet to keep the user in control of the speech system at all times (see Figure 3). You have easy access to the microphone state, which includes a sleeping mode.
Text feedback tells the user what the system is doing and provides instructions. There's also a user interface for clarifying what the user has said: when the user utters a command that can be interpreted in multiple ways, the system uses this interface to clarify what was intended.
In the Control Panel's speech recognition settings, you can create and train speech recognition profiles. This is useful when more than one person shares the computer. You can also choose whether to run speech recognition at startup and whether to allow the computer to review your documents and mail to improve the accuracy of the speech recognition engine.
In addition, you can select the number of spaces to insert after punctuation marks and adjust the microphone level. I was impressed with the ease of use and accuracy of the Vista speech recognition engine after half an hour of training time.
I've tried dictation programs before and never found them at all usable; I could always type much faster than I could dictate and correct text. Now I finally feel that if I should ever lose the use of my hands, there would still be a way for me to continue to get my work done. For me, a combination of speech recognition primarily for commands and keyboard input works well. I can't vouch for how fast it works on a less powerful computer.
I'm also using a headset microphone. As I mentioned, my experience shows that a desktop microphone doesn't work nearly as well. Putting in some time training it to your own voice also makes a big difference.
For obvious reasons, speech recognition wouldn't work well in a noisy environment where you share an office with other people who are talking or on the phone while you work, nor would it work well if you like to listen to music or talk radio while you work. Before you decide to start talking to your computer all the time, be aware that there's a security issue involved with using speech recognition. George Ou went into detail about it in his blog. Here's the gist: An attacker could embed a sound file that plays automatically when you go to a Web page or send you a sound file in e-mail that plays when you double-click on it.
If the sound file that plays through your computer speakers is a command recognized by Vista's Speech engine, and the speech recognition feature is running, the computer will carry out the command.
This isn't quite as scary as it could be. To perform most administrative tasks in Vista, you have to respond to the User Account Control prompt, which can't be done by voice. However, it's possible for the attacker to delete a file on your computer using this method. When speech recognition is in Sleep mode, it responds only to the words "Start listening"--but the attacker could easily put that phrase at the beginning of the sound file to turn it on.
Thus, the best practice is to turn speech recognition off completely when you aren't using it, rather than leaving it in Sleep mode, and not to configure it to run when you start Windows.
Debra Littlejohn Shinder, MCSE, MVP is a technology consultant, trainer, and writer who has authored a number of books on computer operating systems, networking, and security. Deb is a tech editor, developmental editor, and contributor to over 20 add...
Figure A: Vista speech recognition is set up and configured through the Control Panel.
How it works
There are two ways to use speech recognition technology:
To control the software: start and close programs and switch between them, save and delete files, and so forth.
To dictate text to be typed verbatim into a document and edit the text.
Setting up and configuring speech recognition
Before you can start using speech recognition, you need to complete the following steps:
Turn on speech recognition.
Set up your microphone.
Complete a tutorial (not required, but recommended).
Train the recognition engine to understand your voice (not required, but recommended).
Figure C: The first step is to configure your speech recognition experience.
Figure D: The Speech control console appears at the top of the screen when speech recognition is turned on.
Select next 20 words; Select next 10 words.
For example, say "press alpha" to press A or "press bravo" to press B.
Click Recycle Bin; click Computer; click file name.
Double-click Computer; double-click Recycle Bin; double-click folder name.
Right-click Computer; right-click Recycle Bin; right-click folder name.
Show numbers: numbers will appear on the screen for every item in the active window. Say an item's corresponding number to select it.
Close that; Close Paint; Close Documents.
Minimize that; Minimize Paint; Minimize Documents.
Maximize that; Maximize Paint; Maximize Documents.
Restore that; Restore Paint; Restore Documents.
Scroll down 2 pages; Scroll up 10 pages.
Number of the square where the item appears, followed by "mark": 3 mark; 7 mark; 9 mark.
Number of the square where you want to drag the item, followed by "click": 4 click; 5 click; 6 click.
To use Speech Recognition, the first thing you need to do is set it up on your computer.
When you're ready to use Speech Recognition, you need to speak in simple, short commands. The tables below include some of the more commonly used commands. Say "start listening" or click the Microphone button to start the listening mode.
The following table shows some of the most commonly used commands in Speech Recognition. Words in italic font indicate that you can say many different things in place of the example word or phrase and get useful results.
Click File; Start; View.
Note that this command is only available if you're using the U.S. English Speech Recognizer. For more information, see Setting speech options.
The following table shows commands for using Speech Recognition to work with text.
Insert the literal word for the next command (for example, you can insert the word "comma" instead of the punctuation mark).
The following table shows commands for using Speech Recognition to press keyboard keys. For example, you can say "press alpha" to press "a" or "press bravo" to press "b."
The following table shows commands for using Speech Recognition to insert punctuation marks and special characters.
The following table shows commands for using Speech Recognition to perform tasks in Windows.
File; Edit; View; Save; Bold.