Using The Microsoft Kinect Speech Recognition Features To Control SoapBox Add-Ins
Last weekend I presented the following three sessions at the Southern California Code Camp in San Diego:
1.) Managed Extensibility Framework (MEF)
2.) SoapBox Core
3.) XBox Kinect
I met some great people and got a lot of positive feedback. (The SoapBox session went well, if I do say so myself. I’m sorry I didn’t record it; I will for sure record it the next time I give that talk.)
After my final talk, one of the brave souls who had sat through all three one-hour sessions of my babblings asked a very interesting question: “So when are you going to create a Kinect add-in for SoapBox?” Oddly enough, I had never thought of putting the two together until I got that question. I don’t think it was meant as a challenge, but I took it as one. On my two-hour drive back to LA I designed the Kinect add-in in my head, then got home and built it last night.
THE BIG IDEA
The goal: I want to be able to open the PinBallTable add-in that comes with the SoapBox Core demo download via a simple verbal command received through an XBox Kinect sensor.
Here were my self-imposed design constraints:
1.) To make using the Kinect add-in as easy as possible, hooking into the Kinect Add-in should require ZERO source code changes to the existing PinBallTable add-in.
2.) The Kinect add-in should use as little memory as possible, so lazy loading is a must.
With these constraints in mind, the design became clear. I should create a custom metadata attribute that holds the text of the verbal command to which the exported class responds, via the well-supported command pattern in SoapBox. The Kinect add-in could then parse this metadata in its OnImportsSatisfied() method to build a grammar for the Microsoft SpeechRecognitionEngine hooked up to the Kinect audio stream. Once this was all set up, all I’d have to do is attach a SpeechRecognized event handler to the SpeechRecognitionEngine and fire the command associated with the recognized speech.
As it turns out, the Kinect SDK code that I needed was practically already made for me in the Audio Fundamentals Quickstart.
There are, of course, several prerequisites needed to make this Kinect add-in work. The obvious one is a Kinect sensor. In terms of software, the Kinect add-in requires all the same packages as the Audio Quickstart and nothing more.
Below is a description of how I created a SoapBox add-in that allows users to issue voice commands to any SoapBox Core add-in via an XBox Kinect. First we will look at the new code created for this add-in, then we will look at the extremely minimal changes we needed to make to the existing PinBallTable add-in, and finally we will discuss a few changes you can make on your own to make this Kinect add-in even better.
THE NEW CODE
- Custom Metadata
First, we need to add the following custom metadata attribute definition to a Kinect folder in the SoapBox.Core.Contracts project of SoapBox Core:
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.Windows.Input;

namespace SoapBox.Core
{
    // Exports the decorated class as an ICommand and attaches the
    // Action/Subject metadata used to build its verbal command.
    [MetadataAttribute]
    [AttributeUsage(AttributeTargets.Class, AllowMultiple = false)]
    public class AudioCommandMetadata : ExportAttribute
    {
        public AudioCommandMetadata()
            : base(typeof(ICommand))
        {
        }

        // Builds the metadata view from an export's metadata dictionary.
        public AudioCommandMetadata(IDictionary<string, object> dict)
            : this()
        {
            this.Action = dict["Action"] as string;
            this.Subject = dict["Subject"] as string;
        }

        public string Action { get; set; }
        public string Subject { get; set; }
    }
}
This metadata attribute class simply allows any class implementing the ICommand interface to specify a subject string and an action string that will eventually help us build the verbal commands to which the program responds. Note: we could have made this attribute a little simpler by giving it a single ‘VerbalCommand’ property instead of separate Action and Subject properties. I chose two properties so that it is easier to standardize the possible verbal commands. I hope this makes the application easier to operate when there are lots and lots of commands, because the user repeats the same basic verbal formula of “SoapBox [ACTION] [SUBJECT]” to trigger anything. I also hope that two properties make it harder for other developers on my team to invent their own verbal command patterns that don’t match those of the rest of the team.
Now that we have made the necessary addition to the core, we are ready to create our Kinect add-in.
- Add-in Class Itself
Here is the entire Kinect add-in; it’s only about 200 lines!
using System;
using System.Collections.Generic;
using System.ComponentModel.Composition;
using System.IO;
using System.Linq;
using System.Threading;
using System.Windows.Input;
using System.Windows.Threading;
using Microsoft.Research.Kinect.Audio;
using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;
using SoapBox.Core;

namespace SoapBox.KinectAddIn
{
    [Export(SoapBox.Core.ExtensionPoints.Host.Void, typeof(Object))]
    [Export(SoapBox.Core.ExtensionPoints.Host.ShutdownCommands, typeof(IExecutableCommand))]
    public class KinectAudioPlugIn : AbstractExtension, IExecutableCommand, IPartImportsSatisfiedNotification
    {
        private const string RecognizerId = "SR_MS_en-US_Kinect_10.0";
        private const string SoftwareName = "SoapBox";
        private const double ConfidenceCutoff = 0.95;

        protected SpeechRecognitionEngine _sre;
        protected KinectAudioSource _kinectAudioSource;
        protected Stream _kinectAudioStream;
        protected Dispatcher _uiThreadDispatcher;

        #region Protected Properties
        protected IDictionary<string, IList<Lazy<ICommand, AudioCommandMetadata>>> AudioToCommandDict { get; set; }

        [ImportMany(typeof(ICommand))]
        protected IEnumerable<Lazy<ICommand, AudioCommandMetadata>> Commands { get; set; }
        #endregion Protected Properties

        #region Constructors
        public KinectAudioPlugIn()
        {
        }
        #endregion Constructors

        #region Kinect Speech Recognition Event Handlers
        void SreSpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
        {
            logger.Info("\nSpeech Rejected");
        }

        void SreSpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            logger.InfoWithFormat("\rSpeech Hypothesized: \t{0}", e.Result.Text);
        }

        void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            var resultText = e.Result.Text;
            var confidence = e.Result.Confidence;
            if (confidence > ConfidenceCutoff && this.AudioToCommandDict.ContainsKey(resultText))
            {
                var cmdList = this.AudioToCommandDict[resultText];
                var a = new Action(() =>
                {
                    foreach (var item in cmdList)
                    {
                        // item.Value instantiates the lazily loaded command on first use.
                        if (item.Value.CanExecute(null))
                        {
                            item.Value.Execute(null);
                        }
                    }
                });
                // Commands touch the UI, so run them on the UI thread.
                this._uiThreadDispatcher.Invoke(a);
            }
            else
            {
                logger.InfoWithFormat("\nLow Confidence Speech Ignored:\n\tText = {0}\n\tConfidence = {1}", new object[] { resultText, confidence });
            }
        }
        #endregion Kinect Speech Recognition Event Handlers

        #region Helpers
        Choices GetAllRecognizedCommands()
        {
            //This implementation could probably be made into a nice LINQ statement if anyone knows/cares to do it
            var recognizedCommands = new Choices();
            this.AudioToCommandDict = new Dictionary<string, IList<Lazy<ICommand, AudioCommandMetadata>>>();
            foreach (var item in this.Commands)
            {
                // Guard against null metadata values before trimming.
                var action = (item.Metadata.Action ?? string.Empty).Trim();
                var subject = (item.Metadata.Subject ?? string.Empty).Trim();
                if (string.IsNullOrEmpty(action) || string.IsNullOrEmpty(subject))
                {
                    continue;
                }
                var phrase = string.Format("{0} {1} {2}", SoftwareName, action, subject);
                if (this.AudioToCommandDict.ContainsKey(phrase))
                {
                    this.AudioToCommandDict[phrase].Add(item);
                }
                else
                {
                    recognizedCommands.Add(phrase);
                    this.AudioToCommandDict.Add(phrase, new List<Lazy<ICommand, AudioCommandMetadata>>() { item });
                }
            }
            return recognizedCommands;
        }
        #endregion Helpers

        #region IPartImportsSatisfiedNotification Members
        public void OnImportsSatisfied()
        {
            // Capture the dispatcher of the calling thread before the worker thread
            // starts, so a recognition event can never see a null dispatcher.
            this._uiThreadDispatcher = Dispatcher.CurrentDispatcher;
            var t = new Thread(() =>
            {
                //I know this is a perfect example of bad exception handling. I just don't know all the things
                //that can go wrong with the Kinect yet, so I am just putting this entire initialization in
                //one big try-catch block. If someone knows how to make this better, let me know.
                try
                {
                    this._kinectAudioSource = new KinectAudioSource();
                    _kinectAudioSource.FeatureMode = true;
                    _kinectAudioSource.AutomaticGainControl = false; //Important to turn this off for speech recognition
                    _kinectAudioSource.SystemMode = SystemMode.OptibeamArrayOnly; //No AEC for this sample
                    RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers().FirstOrDefault(r => r.Id == RecognizerId);
                    if (ri == null)
                    {
                        logger.Info("\nKinect speech recognizer not found; voice commands are disabled.");
                        return;
                    }
                    this._sre = new SpeechRecognitionEngine(ri.Id);
                    var recCmnds = GetAllRecognizedCommands();
                    var gb = new GrammarBuilder();
                    //Specify the culture to match the recognizer in case we are running in a different culture.
                    gb.Culture = ri.Culture;
                    gb.Append(recCmnds);
                    // Create the actual Grammar instance, and then load it into the speech recognizer.
                    var g = new Grammar(gb);
                    _sre.LoadGrammar(g);
                    _sre.SpeechRecognized += SreSpeechRecognized;
                    _sre.SpeechHypothesized += SreSpeechHypothesized;
                    _sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
                    this._kinectAudioStream = _kinectAudioSource.Start();
                    _sre.SetInputToAudioStream(_kinectAudioStream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
                    _sre.RecognizeAsync(RecognizeMode.Multiple);
                }
                catch (Exception e)
                {
                    logger.Error("ERROR: Could not initialize Kinect audio and/or speech recognition engine.", e);
                }
            });
            t.Start();
        }
        #endregion

        #region IExecutableCommand Members
        /// <summary>
        /// This is the shutdown command that cleans up all the pieces we use here
        /// </summary>
        /// <param name="args"></param>
        public void Run(params object[] args)
        {
            // Any of these can still be null if initialization failed or bailed out early.
            if (this._kinectAudioStream != null)
            {
                this._kinectAudioStream.Dispose();
            }
            if (this._sre != null)
            {
                this._sre.Dispose();
            }
            if (this._kinectAudioSource != null)
            {
                this._kinectAudioSource.Dispose();
            }
        }
        #endregion
    }
}
Most of this code (90+%) comes directly from the Audio Fundamentals Quickstart mentioned above. For a detailed explanation of that code, please watch the video and/or read the article. That said, there are a few threading tricks I had to implement in order to get the RecognizeAsync() method to work properly, but nothing there is terribly difficult to understand.
The parts of the above code that are of interest to SoapBox developers are the following:
GetAllRecognizedCommands Method
The GetAllRecognizedCommands() method of the Kinect add-in is really where most of the magic happens. In this method we parse all of the imported AudioCommandMetadata to create a dictionary mapping between the (lazily loaded) ICommand objects and their verbal triggers. We then return a Choices object that will be used to create the speech recognition grammar given to the SpeechRecognitionEngine.
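The comment in GetAllRecognizedCommands() invites a LINQ version, so here is a minimal sketch of one. It is written generically so that it compiles without SoapBox, MEF or the Kinect SDK; the PhraseMapBuilder class and its Build method are hypothetical names of mine, not part of SoapBox Core, and the selectors stand in for item.Metadata.Action and item.Metadata.Subject.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of SoapBox Core).
public static class PhraseMapBuilder
{
    // Groups items by the spoken phrase "<softwareName> <action> <subject>".
    // Items whose action or subject is null or blank are skipped.
    public static Dictionary<string, IList<T>> Build<T>(
        IEnumerable<T> items,
        Func<T, string> getAction,
        Func<T, string> getSubject,
        string softwareName)
    {
        return items
            .Select(item => new
            {
                Item = item,
                Action = (getAction(item) ?? string.Empty).Trim(),
                Subject = (getSubject(item) ?? string.Empty).Trim()
            })
            .Where(x => x.Action.Length > 0 && x.Subject.Length > 0)
            .GroupBy(
                x => string.Format("{0} {1} {2}", softwareName, x.Action, x.Subject),
                x => x.Item)
            .ToDictionary(g => g.Key, g => (IList<T>)g.ToList());
    }
}
```

In the add-in this could be called roughly as PhraseMapBuilder.Build(this.Commands, c => c.Metadata.Action, c => c.Metadata.Subject, SoftwareName). The resulting dictionary can be assigned to AudioToCommandDict, and its keys can be passed as a string array to the Choices constructor. Note that nothing here touches item.Value, so the commands stay lazily loaded.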
SreSpeechRecognized Method
By the time we reach the SreSpeechRecognized event handler we are already on the home stretch. Instead of just writing the recognized text to the console, as is done in the Audio Fundamentals Quickstart, we use it to find the commands to be executed when that particular speech is recognized. Once we have the list of ICommand objects to be executed, we go through each of them and execute the ones that are executable.
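The decision logic in that handler can be isolated and tried out without a Kinect. The following is only a sketch under a few assumptions: RecognizedSpeechRouter, Outcome and DelegateCommand are hypothetical names, plain Lazy<ICommand> stands in for the MEF Lazy<ICommand, AudioCommandMetadata>, and the hop onto the UI thread through the dispatcher is left out. Unlike the handler above, it reports a low-confidence result and an unknown phrase as separate outcomes.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Input;

// Hypothetical helper for this sketch: wraps two delegates in an ICommand.
public sealed class DelegateCommand : ICommand
{
    private readonly Action _execute;
    private readonly Func<bool> _canExecute;

    public DelegateCommand(Action execute, Func<bool> canExecute)
    {
        _execute = execute;
        _canExecute = canExecute;
    }

    public event EventHandler CanExecuteChanged { add { } remove { } }
    public bool CanExecute(object parameter) { return _canExecute(); }
    public void Execute(object parameter) { _execute(); }
}

// Hypothetical router that mirrors the checks made in SreSpeechRecognized.
public static class RecognizedSpeechRouter
{
    public enum Outcome { LowConfidence, UnknownPhrase, Dispatched }

    public static Outcome Route(
        string phrase,
        double confidence,
        double cutoff,
        IDictionary<string, IList<Lazy<ICommand>>> phraseMap,
        out int executed)
    {
        executed = 0;
        if (confidence <= cutoff)
        {
            return Outcome.LowConfidence;
        }

        IList<Lazy<ICommand>> commands;
        if (!phraseMap.TryGetValue(phrase, out commands))
        {
            return Outcome.UnknownPhrase;
        }

        foreach (var lazy in commands)
        {
            // Reading Value is the first point at which the command is instantiated.
            var command = lazy.Value;
            if (command.CanExecute(null))
            {
                command.Execute(null);
                executed++;
            }
        }
        return Outcome.Dispatched;
    }
}
```

The point of the sketch is the lazy-loading constraint: a command object is only created once its phrase has been recognized with enough confidence.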
CHANGES TO EXISTING CODE
- Just Add Water Metadata
As you may already be able to tell, getting the existing SoapBox PinBall demo to hook into this add-in is just as easy as adding the following metadata attribute to its ViewMenuPinBallTable class:
[AudioCommandMetadata(Action = "Show", Subject = "PinBallTable")]
Since the ViewMenuPinBallTable class inherits from AbstractMenuItem, it already implements the ICommand interface. All we have to do now is run the application with a Kinect hooked up to our machine and say the words “SoapBox Show PinBallTable”. The Kinect add-in will then execute the Run() method of the ViewMenuPinBallTable class, and the pinball table magically appears on the screen. Pretty Cool, huh !?!
- Is The Custom Metadata Even Really Needed?
If you refer back to the top of this post you will see that my number one design constraint was to make, “ZERO source code changes to the existing PinBallTable add-in”. Now, some of you may be saying to yourselves, “Hey, you cheated, you added that metadata attribute and therefore you changed the PinBallTable demo source code!” Well, you are kind of right, if you really call metadata attributes source code. I mean are you really going to write new tests or expect old ones to break after adding a single line of metadata? If so you might want to take a second look at your tests and design because something is seriously wrong.
For all you hardcore sticklers out there, I want to make it clear that we could have achieved a very similar result without even that single metadata attribute (an important point if you don’t want to, or can’t, recompile your existing plug-ins). Instead of using the [ImportMany(typeof(ICommand))] contract on the Commands property in the Kinect add-in, we could have used the [ImportMany(SoapBox.Core.ExtensionPoints.Workbench.MainMenu.ViewMenu, typeof(IMenuItem))] contract. Then, instead of parsing metadata attributes, we could have simply read the ID property of each IMenuItem, and the verbal commands would have become something like “SoapBox PinBallTable”, because “PinBallTable” is the value given to the ID property of the ViewMenuPinBallTable class.
Though this approach doesn’t use lazy loading, it does something better: it only uses objects that have necessarily already been loaded. Since all the menu items must be loaded at start-up, there would be no new instantiations made with this approach.
I chose the metadata approach because I think it results in a much more robust and broadly usable solution. Specifically, it forces the designer to specify the Subject and Action of every command, and it provides a more general way to hook into the Kinect add-in.
CONCLUSION
In this article, we discussed a method for robustly enabling voice commands in a wide range of SoapBox add-ins. We saw that through minimal effort on new code and essentially ZERO changes to existing code we were able to make the pre-existing PinBallTable add-in respond to verbal commands from the user.
All you SoapBox veterans out there are probably not too surprised by how easy SoapBox Core makes all of this, and how little code was needed, because you already know how powerful the framework is. On the other hand, if you are new to MEF and/or SoapBox you are almost certainly amazed at how easy it was to add Kinect support to an existing SoapBox application. In that case, I hope you use this article as a motivator to check out SoapBox Core and give it a shot.
I hope you have enjoyed and understood this post. If you have questions, comments, concerns, suggestions, requests or jokes - or if you want my source code for this post - please e-mail beachfrontcoding@gmail.com and I will be happy to send it to you. Once you get the code up and running, here are a couple of things you can do to make it better:
1.) Add more commands. Change the AllowMultiple setting on the AudioCommandMetadata class to true, and change the corresponding constructor appropriately, so that ICommand classes can respond to more than one verbal command with the same ICommand object. For example, maybe the pinball table should open when the user says “SoapBox Open PinBallTable” AND when the user says “SoapBox Show PinBallTable”.
2.) Create a file similar to the Extensions.cs file in SoapBox Core that you can use to manage the subject and action string values for the many commands used across your application.
3.) Export the “SoftwareName” and “ConfidenceCutoff” values from a property on another class so that you have a strongly-typed configuration to use in the Kinect add-in.
4.) The SpeechRecognitionEngine is sometimes a little slow, and it makes the PinBallTable appear to load slowly. Add a StatusBarLabel to update the user on the happenings as soon as a piece of speech is recognized so that they don’t get impatient and issue the command over and over again.
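For the suggestion about a shared file of subject and action strings, here is one possible shape. This is only a sketch: the AudioCommands class and every name inside it are hypothetical, not something that ships with SoapBox Core. The values are declared const because attribute arguments must be compile-time constants.

```csharp
// Hypothetical central list of the words used in verbal commands.
public static class AudioCommands
{
    public static class Actions
    {
        public const string Show = "Show";
        public const string Open = "Open";
    }

    public static class Subjects
    {
        public const string PinBallTable = "PinBallTable";
    }
}
```

The attribute on ViewMenuPinBallTable could then be written as [AudioCommandMetadata(Action = AudioCommands.Actions.Show, Subject = AudioCommands.Subjects.PinBallTable)], so a misspelled action or subject becomes a compile error instead of a voice command that silently never matches.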
Now go, get to work on SoapBox!
— Karl B.