AIY Voice Kit Project: Story Listener

Here's the git repo for this project.

My wife and I love fairy tales and short stories. When we were first dating, one of the ways that we bonded was by telling each other silly bedtime stories. Every once in a while, she likes one of the stories enough to get out of bed and write it down. At some point, we might have enough of these stories to put out a collection or something.

The problem is that coming up with silly stories works a little better when you're very tired (they get sillier that way). That's also the time you least want to write them down. What we needed was some way to automatically record and transcribe any stories that we tell each other. When one of my friends gave me an AIY Voice Kit, my wife knew exactly what we should do with it.

The Story Listener


The AIY Voice Kit gives you all the power of Google Home, but completely programmable. You just need to add a Raspberry Pi to do the processing. Most of the voice commands and speech processing are done in the cloud, so once you get set up with Google's API you can make full use of their NLP models (including the CloudSpeech API).

As an aside, the Voice Kit only works with newer models of Raspberry Pi. When I pulled out my old Pi, the kit booted but wouldn't run any of the examples. Turns out you need a Raspberry Pi 2B or newer. A quick Amazon Prime order got us going again.

Our plan was to make an app that would listen for the start of a story. Once it heard a story start, it would record the story, transcribe it, and then email the transcription to us.

Getting Started with the API

Most of the Voice Kit projects rely on Google APIs that require access permissions to use. The API and permissions need to be enabled for the Google account you're using with a Voice Kit project. You'll need to set that up and then download the JSON credential file to do anything interesting.

Detecting When a Story Started

To make story detection easier, we decided to preface all of our stories with one of a few different sentences. We chose "Once upon a time" and "Tell me a story" as good options. Detecting these key phrases using the Google CloudSpeech API is pretty easy.

The Voice Kit library includes a nice wrapper for the CloudSpeech API. You can create a recognizer object that sends audio to the API, and you'll get back strings that contain the text from the audio. You can improve the recognition accuracy by telling the recognizer to expect certain phrases.

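import aiy.cloudspeech

recognizer = aiy.cloudspeech.get_recognizer()
# hint that this phrase is likely, to improve recognition accuracy
recognizer.expect_phrase("once upon a time")

# waits for audio, then transcribes it
text = recognizer.recognize()
# recognize() can come back empty if nothing was heard, and the
# transcription has no guarantees about case, so handle both here
text = (text or "").lower()
if "once upon a time" in text:
    # key phrase heard: time to start recording the story
    # (story_started is just a placeholder flag for this snippet)
    story_started = True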

The expect_phrase method improves the voice recognition accuracy of that particular phrase. Then you can search for that phrase in whatever text the CloudSpeech API finds. If you see your key-phrase, then it's time to move on to the next step.
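Since we wanted more than one opening sentence, the substring check can be generalized a little. This is a minimal sketch rather than the exact code we run: `find_key_phrase` and `START_PHRASES` are names made up for illustration, and it assumes the recognizer may hand back `None` or an empty string when it hears nothing.

```python
# Hypothetical helper: the names here are illustrative, not part of the AIY library.
START_PHRASES = ("once upon a time", "tell me a story")


def find_key_phrase(text, phrases=START_PHRASES):
    """Return the first phrase found in text (case-insensitive), or None."""
    if not text:
        # nothing was recognized, so there is nothing to search
        return None
    lowered = text.lower()
    for phrase in phrases:
        if phrase in lowered:
            return phrase
    return None
```

You would call this on whatever `recognizer.recognize()` returns and start recording whenever the result is not `None`.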

Recording Audio with the Voice Kit

The Voice Kit library allows various "processors" to be added to the audio stream coming from the microphone. A processor is just a class that operates on the audio data (the recognizer is one such processor). In order to record audio while still detecting keywords, we needed to add a second processor alongside the recognizer. It turns out that the AIY library already had a WaveDump class that would save audio to a file.

The WaveDump class was almost exactly what we were looking for, but had a couple of drawbacks. It was originally designed to record audio for a certain length of time, and we wanted to record audio until a story was over (which we would recognize by listening for "the end"). We created a sub-class of the WaveDump class to allow us to have more control over how long we recorded audio for.

# WaveDump is the AIY library's wave-file processor described above;
# import it from wherever your copy of the library defines it
class StoryDump(WaveDump):
    def __init__(self, filepath, max_duration):
        # just do the normal file setup
        super().__init__(filepath, max_duration)
        # keep track of whether we should end the recording early
        self.done = False

    def add_data(self, data):
        # keep track of the number of bytes recorded
        # to be sure that we don't write too much
        max_bytes = self._bytes_limit - self._bytes
        data = data[:max_bytes]
        # save the audio to the file
        if data and not self.done:
            self._bytes += len(data)
            # assumes the parent keeps its open wave file in self._wave
            self._wave.writeframes(data)

    def finish(self):
        # called when the story is over, to stop recording early
        self.done = True

    def is_done(self):
        return self.done or (self._bytes >= self._bytes_limit)

With this class now defined, it's easy to add an instance of it as a processor to the audio stream.

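# assume all stories are < 20min
story_wav = StoryDump("filename.wav", 20*60)
# assumes `import aiy.audio`; attaches the processor to the microphone stream
aiy.audio.get_recorder().add_processor(story_wav)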

And once you see that the story is over, you can finish the recording like so:

recognizer.expect_phrase("the end")
if "the end" in text:
    story_wav.finish()
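If you want to play with the early-stop recording idea without the kit attached, here's a standalone sketch that uses only Python's built-in `wave` module. It is not the AIY class: `StoppableWavWriter` is a made-up name, and the 16 kHz, 16-bit, mono format is an assumption about the audio you feed it.

```python
import wave

# Assumed audio format; adjust to match your microphone stream.
SAMPLE_RATE = 16000  # samples per second
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono


class StoppableWavWriter:
    """Write raw PCM bytes to a WAV file until finish() is called or a time cap is hit."""

    def __init__(self, filepath, max_duration):
        self._wave = wave.open(filepath, "wb")
        self._wave.setnchannels(CHANNELS)
        self._wave.setsampwidth(SAMPLE_WIDTH)
        self._wave.setframerate(SAMPLE_RATE)
        self._bytes = 0
        self._bytes_limit = int(max_duration * SAMPLE_RATE) * CHANNELS * SAMPLE_WIDTH
        self.done = False

    def add_data(self, data):
        if self.done:
            # recording was ended early; ignore anything that arrives later
            return
        # truncate so the file never grows past the time cap
        data = data[: self._bytes_limit - self._bytes]
        if data:
            self._wave.writeframes(data)
            self._bytes += len(data)

    def finish(self):
        # stop recording and close the file (safe to call more than once)
        if not self.done:
            self.done = True
            self._wave.close()

    def is_done(self):
        return self.done or self._bytes >= self._bytes_limit
```

Call `finish()` in every case, including when the time cap is reached, so the file gets closed and the WAV header is written out properly.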

Because we're already using the CloudSpeech API to transcribe audio and look for keywords, the story transcription happens almost for free. All we have to do is wait until a story starts (looking for one of the key phrases in the text), and then write all subsequent text to a file. Emailing the file once it's done is also a straightforward exercise in Python. Once you have the audio recognition, transcription, and saving done, making the project start when the Raspberry Pi boots is also just a Linux exercise.
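To make the "wait for a start phrase, then keep everything until the end phrase" logic concrete, here's a rough sketch that works on any sequence of recognized strings, so it can be tried without the kit. The function names, the subject line, and the addresses are all made up for illustration, and the email part only builds the message; actually sending it (for example with `smtplib`) depends on your mail setup and is left out.

```python
from email.message import EmailMessage

# Illustrative names and phrases; not part of the AIY library.
START_PHRASES = ("once upon a time", "tell me a story")
END_PHRASE = "the end"


def collect_story(utterances, start_phrases=START_PHRASES, end_phrase=END_PHRASE):
    """Return the lines from the first start phrase through the end phrase."""
    story = []
    recording = False
    for text in utterances:
        if not text:
            # the recognizer heard nothing this time around
            continue
        lowered = text.lower()
        if not recording:
            if not any(phrase in lowered for phrase in start_phrases):
                continue
            recording = True
        story.append(lowered)
        if end_phrase in lowered:
            break
    return story


def build_story_email(story_lines, sender, recipient):
    """Wrap the transcription in an email message (does not send it)."""
    msg = EmailMessage()
    msg["Subject"] = "New bedtime story"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(story_lines))
    return msg
```

In the real app the `utterances` would come from repeated calls to the recognizer rather than from a list.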

One slightly annoying aspect of the Voice Kit library is that it isn't a complete Python package. That means that you can't install it with setuptools or pip, so accessing the library takes some extra work. The examples for the Voice Kit all recommend putting your application code in the same directory as the Voice Kit library, which is awkward when you want to create a repo for your project that isn't a fork of the Voice Kit repo. We fixed this by creating an environment variable that pointed to the location of the AIY library.
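One way to use such an environment variable is to put its directory on `sys.path` before importing anything from the library. This is a sketch of that idea, not necessarily how our repo does it; the variable name `AIY_LIBRARY_PATH` and the helper name are placeholders.

```python
import os
import sys


def add_library_from_env(env_var="AIY_LIBRARY_PATH"):
    """Put the directory named by env_var on sys.path; return it, or None if unset."""
    path = os.environ.get(env_var)
    if not path:
        return None
    path = os.path.abspath(os.path.expanduser(path))
    if path not in sys.path:
        # insert at the front so this copy wins over any other on the path
        sys.path.insert(0, path)
    return path
```

Call it once at the top of your script, before the `import aiy...` lines.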


The CloudSpeech API works better than I expected it to, but it is definitely not yet good enough to use for transcription. It will often mess up tenses on verbs, skip definite and indefinite articles, and pick words that sound close to what was actually said. I suspect part of the reason is that the API weighs how much sense the text makes. If you're telling a silly absurdist story, you're likely to string words together in a way that isn't high probability in standard usage.

once upon a time there was a time girl made completely from clay
she was very energetic and like to run all over the place
and one day she ran so fast her clay arms and legs stretched out from the wind
and then she wasn't such a tiny girl anymore
was actually a very very tall and skinny girl
the end

Here's the wave file that transcription came from:

Another limitation of the CloudSpeech API for story transcription is the latency. The API seems to be intended mostly for interactive use: you say a thing, it says a thing, etc. Since we just want to transcribe a long series of utterances without pausing, this causes issues. It seems that the recognizer will wait until a pause in the voice, or until there's some number of words available, then it will try to recognize all of it. This has some delay, and any words said during that delay will be missed (they still get recorded, just not transcribed). We want to have on-line transcription so that we know when the story is over, but it may make sense to then re-transcribe the saved audio all at once.

Next Steps

We're pretty happy with how the story listener is working out. It would be nice to have better transcription, but I expect that will come in time.

For me, the biggest issue with the Voice Kit in general is the privacy concern. If we have it listening for stories all the time, then it's going to be continually sending anything we say (in bed, where we tell bedtime stories) to Google. That's not exactly what I want.

The Voice Kit manual advertises support for TensorFlow, but there aren't any good tutorials for integrating that yet. It looks like the best way to integrate an ML model with the Voice Kit would be to create a new audio processor to add to the recorder. That audio processor could tensor-ize the audio and feed it through a classification model.
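We haven't built this yet, but a processor along those lines might look roughly like the sketch below. It assumes the processor interface is just `add_data` and `is_done`, that the audio arrives as 16-bit little-endian PCM in even-length chunks, and that `classify` is whatever model you end up with; here it is only a hypothetical callable that takes a list of floats.

```python
import array
import sys


class KeyPhraseProcessor:
    """Buffer incoming PCM audio and run a classifier on fixed-size windows."""

    def __init__(self, classify, window_samples=16000):
        self._classify = classify        # hypothetical model: list of floats -> label
        self._window = window_samples    # samples per classification window
        self._samples = array.array("h")  # 16-bit signed samples on common platforms
        self.labels = []

    def add_data(self, data):
        chunk = array.array("h")
        chunk.frombytes(data)
        if sys.byteorder == "big":
            # the audio is assumed little-endian, so swap on big-endian hosts
            chunk.byteswap()
        self._samples.extend(chunk)
        # classify every full window that has accumulated
        while len(self._samples) >= self._window:
            window = self._samples[: self._window]
            del self._samples[: self._window]
            # scale to floats in [-1, 1), a common input range for audio models
            features = [sample / 32768.0 for sample in window]
            self.labels.append(self._classify(features))

    def is_done(self):
        # keep listening for as long as the recorder runs
        return False
```

A real version would swap the list of floats for a tensor and the callable for a trained model, but the buffering and windowing would stay about the same.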

Once we get that figured out, it might be worth trying to recognize a few key phrases. Running a model on the Raspberry Pi itself would make the device independent of an internet connection, and would solve a lot of the privacy concerns that we have. The transcription would probably go down in accuracy a lot, but if we're already manually transcribing stories that might be fine.