AIY Voice Kit Project: Story Listener

Here's the git repo for this project.

My wife and I love fairy tales and short stories. When we were first dating, one of the ways that we bonded was by telling each other silly bedtime stories. Every once in a while, she likes one of the stories enough to get out of bed and write it down. At some point, we might have enough of these stories to put out a collection or something.

The problem is that coming up with silly stories works a little better when you're very tired (they get sillier that way). That's also the time you least want to write them down. What we needed was some way to automatically record and transcribe any stories that we tell each other. When one of my friends gave me an AIY Voice Kit, my wife knew exactly what we should do with it.

The Story Listener


The AIY Voice Kit gives you all the power of Google Home, but completely programmable. You just need to add a Raspberry Pi to do the processing. Most of the voice commands and speech processing are done in the cloud, so once you set up get set up with Google's API you can make full use of their NLP models (including the CloudSpeech API).

As an aside, the Voice Kit only works with newer models of Raspberry Pi. When I pulled out my old pi, the kit booted but wouldn't run any of the examples. Turns out you need a Raspberry Pi 2B or newer. A quick Amazon Prime order got us going again.

Our plan was to make an app that would listen for the start of a story. Once it heard a story start, it would record the story, transcribe it, and then email the transcription to us.

Getting Started with the API

Most of the Voice Kit projects rely on Google APIs that require access permissions to use. The API and permissions need to be enabled for the Google account you're using with a Voice Kit project. You'll need to set that up and then download the json credential file to do anything interesting.

Detecting When a Story Started

To make story detection easier, we decided to preface all of our stories with one of a few different sentences. We chose "Once upon a time" and "Tell me a story" as good options. Detecting these key phrases using the Google CloudSpeech API is pretty easy.

The CloudSpeech API has a nice library associated with it in the Voice Kit library. You can create a recognizer object that sends audio to the API, and you'll get back strings that contain the text from the audio. You can improve the recognition accuracy by telling the recognizer to expect certain phrases.

import aiy.cloudspeech
recognizer = aiy.cloudspeech.get_recognizer()
recognizer.expect_phrase("once upon a time")

# waits for audio, then transliterates it
text = recognizer.recognize() 
# the transliteration doesn't have guarantees 
# about case, so take care of that here
text = text.lower() 
if ("once upon a time" in text):

The expect_phrase method improves the voice recognition accuracy of that particular phrase. Then you can search for that phrase in whatever text the CloudSpeech API finds. If you see your key-phrase, then it's time to move on to the next step.

Recording Audio with the Voice Kit

The Voice Kit library allows various "processors" to be added to the audio stream coming from the microphone. The processor is just a class that operates on the audio data (the recognizer is one such processor). In order to record audio while still detecting key-words. It turns out that the AIY library even had a WaveDump class that would save audio to a file.

The WaveDump class was almost exactly what we were looking for, but had a couple of drawbacks. It was originally designed to record audio for a certain length of time, and we wanted to record audio until a story was over (which we would recognize by listening for "the end"). We created a sub-class of the WaveDump class to allow us to have more control over how long we recorded audio for.

class StoryDump(
    def __init__(self, filepath, max_duration):
        # just do the normal file setup
        super().__init__(filepath, max_duration)
        # keep track of whether we should end the recording early
        self.done = False 
    def add_data(self, data):
        # keep track of the number of bytes recorded
        # to be sure that we don't write too much
        max_bytes = self._bytes_limit - self._bytes
        data = data[:max_bytes]
        # save the audio to the file
        if data and not self.done:
            self._bytes += len(data)
    def finish(self):
        self.done = True
    def is_done(self):
        return self.done or (self._bytes >= self._bytes_limit)

With this class now defined, it's easy to add an instance of it as a processor to the audio stream.

# assume all stories are < 20min
story_wav = StoryDump("filename.wav", 20*60)

And once you see that the story is over, you can finish the recording like so:

recognizer.expect_phrase("the end")
if "the end" in text:

Because we're already using the CloudSpeech API to transliterate audio and look for keywords, the story transcription happens almost for free. All we have to do is wait until a story starts (looking for one of the keyphrases in the text), and then write all subsequent text to a file. Emailing the file once it's done is also a straightforward exercise in python. Once you have the audio recognition, transcription, and saving done, making the project start when the Raspberry Pi boots is also just a linux exercise.

One slightly annoying aspect of the Voice Kit library is that isn't a complete Python Package. That means that you can't install it with setuptools or pip, so accessing the library is a bit annoying. The examples for the VoiceKit all recommend putting your application code in the same directory as the Voice Kit library. This is a bit annoying when you want to create a repo for your project that isn't a fork of the Voice Kit repo. We fixed this by creating an environment variable that pointed to the location of the AIY library.


The CloudSpeech API works better than I expected it to, but it is definitely not yet good enough to use for transcription. It will often mess up tenses on verbs, skip transcription of definite and indefinite articles, and select words that are close homonyms to what was actually said. I think that some of this is that the API is probably doing some analysis of how much the text makes sense. If you're telling a silly absurdist story, you're likely to string words together in a way that isn't high probability in standard usage.

once upon a time there was a time girl made completely from clay
she was very energetic and like to run all over the place
and one day she ran so fast her clay arms and legs stretched out from the wind
and then she wasn't such a tiny girl anymore
was actually a very very tall and skinny girl
the end

Here's the wave file that's from:

Another limitation of the CloudSpeech API for story transcription is the latency. The API seems to be intended mostly for interactive use: you say a thing, it says a thing, etc. Since we just want to transcribe a long series of utterances without pausing, this causes issues. It seems that the recognizer will wait until a pause in the voice, or until there's some number of words available, then it will try to recognize all of it. This has some delay, and any words said during that delay will be missed (they still get recorded, just not transcribed). We want to have on-line transcription so that we know when the story is over, but it may make sense to then re-transcribe the save audio all at once.

Next Steps

We're pretty happy with how the story listener is working out. It would be nice to have better transcription, but I expect that will come in time.

For me, the biggest issue with the Voice Kit in general is the privacy concern. If we have it listening for stories all the time, then it's going to be continually sending anything we say (in bed, where we tell bedtime stories) to Google. That's not exactly what I want.

The Voice Kit manual advertises support for TensorFlow, but there aren't any good tutorial for integrating that yet. It looks like the best way to integrate an ML model with the Voice Kit would be to create a new audio processor to add to the recorder. That audio processor could tensor-ize the audio and feed it through a classification model.

Once we get that figured out. it might be worth trying to recognize a few key phrases. Running a model on the Raspberry Pi itself would make the device independent of an internet connection, and would solve a lot of the privacy concerns that we have. The transcription would probably go down in accuracy a lot, but if we're already manually transcribing stories that might be fine.

The industrial revolution

From Makers: The New Industrial Revolution:

Although we think of factories as the “dark satanic mills” of William Blake’s phrase, poisoning their workers and the land, the main effect of industrialization was to improve health. As people moved from rural communities to industrial towns, they moved from mud- walled cottages to brick buildings, which protected them from the damp and disease. Mass- produced cheap cotton clothing and good- quality soap allowed even the poorest families to have clean clothing and practice better hygiene, since cotton was easier to wash and dry than wool. Add to that the increased income that allowed a richer and more varied diet and the improved access to doctors, schools, and other shared resources that came with the migration to cities, whatever ill effects resulted from working in the factories was more than compensated for by the positive effects of living around them. (To be clear, working in factories was tough, with long hours and poor conditions. But the statistics suggest that working on farms was even worse.)

It's interesting that the industrial revolution seems to have led to a net increase in health even at the beginning. I had always thought that people who worked in factories experienced much worse health overall. Given that the switch to agriculture did have a negative impact on health, it seems that social changes have pretty un-intuitive consequences.

Men in Black ethics

I remember really enjoying the Men in Black movies when I was younger. They've got explosions, aliens, flying saucers, Will Smith. They've got everything that makes a movie great. However, the movies are missing one very important thing: good ethics.

It slipped by me when I was watching the movies as a kid, but humans in the Men in Black universe are kind of the galaxy's village idiot. In the first movie it's explained how television, computers, and basically every other technology was given to us by aliens. We're apparently not capable of developing any of these things ourselves.

In a universe where its easy to travel from one planet to another and there are all kinds of interesting planets with interesting life to visit, we humans are stuck on earth. We get the alien technology that they don't want: television. They keep their faster than light travel for themselves.

The culture of craft and making that's seen a resurgence in the past decade or so has me very excited. It shows that people are creative, interested in learning, and willing to build things that make the world a better and more exciting place to live. It's humans making these inventions, and we celebrate those past humans who invent. People like Philo Farnsworth, the Wright Brothers, and Alan Turing.

When movies like MIB cast humans as incapable of inventing, they do a disservice to our culture and our history.