Integrating VRX: What's Involved?

 

Integrating an automatic speech recognition (ASR) engine with your application may seem like a daunting task, but conceptually it is very simple. Three key questions need to be answered:

  • What interface is needed?
  • What language will you be using?
  • What is the end application?

The Interface

There are many ways for your application to interface with an ASR, or for an ASR to interface with your application. In simple terms, the interface is how the audio gets to the ASR and how the recognition result gets back to your application. The chosen interface can be very simple or quite complex and is typically driven by the application. VRX has both a C++ API and a C API. MRCP can be supported on request.
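To make the two halves of an interface concrete, here is a minimal Python sketch of one possible shape: audio is pushed to the engine in chunks, and the result comes back through a callback. Every name in it (EchoRecognizer, feed_audio, end_of_utterance, RecognitionResult) is hypothetical and chosen for illustration only; none of it is the VRX API, and the stand-in engine does no recognition at all.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RecognitionResult:
    text: str
    confidence: float


class EchoRecognizer:
    """Stand-in engine (hypothetical, NOT the VRX API): it only counts audio bytes."""

    def __init__(self, on_result: Callable[[RecognitionResult], None]) -> None:
        self._on_result = on_result
        self._bytes_seen = 0

    def feed_audio(self, chunk: bytes) -> None:
        # Audio path: the application pushes raw audio chunks to the engine.
        self._bytes_seen += len(chunk)

    def end_of_utterance(self) -> None:
        # Result path: the engine hands a result back through the callback.
        text = f"<{self._bytes_seen} bytes of audio>"
        self._on_result(RecognitionResult(text=text, confidence=0.0))
        self._bytes_seen = 0


def recognize(chunks: Iterable[bytes]) -> List[RecognitionResult]:
    """Drive the stand-in engine: send all audio, then collect the results."""
    results: List[RecognitionResult] = []
    engine = EchoRecognizer(on_result=results.append)
    for chunk in chunks:
        engine.feed_audio(chunk)
    engine.end_of_utterance()
    return results
```

A real integration would replace the stand-in with calls into the engine's own API, but the application-side questions stay the same: where the audio chunks come from, and what happens to each result when it arrives.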

The Language

The chosen language may have a significant impact on the work involved. Although the components of the ASR are very complex, they do not change from one application to the next. The data used by those components, however, does. A new language may require development of an acoustic model, a phonetic dictionary, and a language model.

The Application

The end application further dictates the language model. Some applications demand the flexibility of a statistical language model; others work better with a constrained language model. Fortunately, VRX supports both. The application also drives the operating platform, including the desired OS. Do you need a desktop capability, an embedded ASR, or even a cloud-based solution?

Additional Considerations

Acoustic Models

When comparing ASR systems, it is not enough to ask "what languages do you support?" A vendor may offer an impressive list of supported languages, but unless the acoustic model matches the audio format of your target application, performance will likely be very poor. An acoustic model created for a traditional telephone application will likely be built for 8 kHz audio because of the limitations of telephone technology. Such a model will be inadequate if your application uses a microphone-equipped headset, where the audio is sampled at 16 kHz. Seek a vendor that is willing to discuss your specific application needs and provide the appropriate level of support. A final word of warning: the performance of an acoustic model is closely related to the quantity of training data used to create it. A vendor that provides an excellent US English model will not necessarily match that performance in every advertised language, particularly where training data was scarce. At Verbyx we have created new capabilities for training acoustic models on minimal training data that outclass any competing products.
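The sample-rate mismatch described above is easy to check before any audio reaches the recognizer. The following is a minimal sketch using only Python's standard wave module; the function names and the idea of passing in a model rate are assumptions made for illustration, not part of any VRX tooling.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def matches_model(wav_bytes: bytes, model_rate_hz: int) -> bool:
    """True when the audio was recorded at the rate the model expects."""
    return wav_sample_rate(wav_bytes) == model_rate_hz


def make_silent_wav(rate_hz: int, n_frames: int = 160) -> bytes:
    """Build a short mono, 16-bit, silent WAV in memory (for trying the check)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()
```

For example, matches_model(make_silent_wav(16000), 8000) is False: headset audio at 16 kHz does not match a telephone model built for 8 kHz.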

Processors

For many applications the choice of processor is of little concern; any modern processor is capable of running a single recognition instance. However, processor choice is critical when running multiple instances. Some processors are better suited to scaling and heavy-duty arithmetic. Before finalizing your deployment platform, feel free to ask for our recommendations.

Speed vs. Accuracy

Any successful deployment of speech recognition is a trade-off between speed and accuracy. We are often asked what hardware configuration is required. If money were no object, our answer would be that you can never have too much memory and processor speed. In speech recognition, speed can almost always be traded for accuracy, and accuracy for speed. The ASR contains many tuning parameters, and the most successful systems are those that balance these parameters to produce the best combination of speed and accuracy for your deployment.
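One simple way to reason about this trade-off is to measure a handful of tuning configurations on your own test audio and then pick the most accurate one that still meets your speed budget. The sketch below assumes two common measurements, real-time factor (processing time divided by audio duration) and word error rate; the TuningRun type and pick_setting function are hypothetical names for illustration and do not correspond to actual VRX parameters.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TuningRun:
    label: str               # name of the configuration that was measured
    real_time_factor: float  # processing time / audio duration; lower is faster
    word_error_rate: float   # fraction of words wrong; lower is more accurate


def pick_setting(runs: Iterable[TuningRun], max_rtf: float) -> Optional[TuningRun]:
    """Return the most accurate run that meets the speed budget, or None."""
    fast_enough = [r for r in runs if r.real_time_factor <= max_rtf]
    if not fast_enough:
        return None
    # Prefer the lowest error rate; break ties in favor of the faster run.
    return min(fast_enough, key=lambda r: (r.word_error_rate, r.real_time_factor))
```

Loosening the budget (a larger max_rtf) can only keep or improve the accuracy of the chosen run, which is the speed-for-accuracy exchange in miniature.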