Say It Ain’t So: A simple Speech-To-Text experiment with serious implications

Photo taken during the 'Say It Ain't So' workshop at the Urgent Publishing conference. Photo by INC Amsterdam.

By Simon Browne.

The final day of the Urgent Publishing conference features “Say It Ain’t So”, a workshop organised by artist Amy Pickles and designer and researcher Cristina Cochior. The topic is speech-to-text processing: technical aspects of speech recognition software, such as the open source engine PocketSphinx, as well as issues of visibility and invisibility.

The workshop responds to an urgent need to raise awareness of digital discrimination arising from developments in voice technology. This is illustrated in a speech_recognition_interview between Amy and, as it turns out, all of us, collectively reading out lines from a script. It doesn’t go well for Amy: she is rejected on the basis of data drawn not just from what she said, but also from how she said it. Her fate is sealed by low percentages on the things that matter, such as confident delivery and use of predetermined key words.

In contrast with the perception that discrete parts of language are mostly stable, speech recordings contain more dynamic, complex elements than we imagine. Speech-to-text uses a ‘bag of words’ model: utterances are sliced into basic units of language and indexed by frequency. More frequent combinations are matched against corresponding entries in source dictionaries, converting speech to text and vice versa. This is illustrated in a quick demonstration of PocketSphinx transcription, with mixed results: it either renders speech (relatively) faithfully or produces comical phrases that barely resemble natural language, especially when confronted with accents.
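The bag-of-words idea described above can be sketched in a few lines: an utterance is reduced to unordered counts of its basic units, so everything about *how* something was said is discarded. The tokeniser and example sentences below are my own illustration, not taken from the workshop or from PocketSphinx:

```python
from collections import Counter
import re

def bag_of_words(utterance: str) -> Counter:
    """Reduce an utterance to unordered token frequencies.

    Order, stress, and accent -- everything about *how* it was said --
    is thrown away; only counts of *what* tokens occur survive.
    """
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return Counter(tokens)

# Two very different deliveries can collapse into the same bag:
a = bag_of_words("Say it ain't so, say it ain't so")
b = bag_of_words("so ain't it say -- SO ain't it say")
print(a == b)  # True: identical counts, despite opposite word order
```

The comparison printing `True` is the point: the model cannot tell these two readings apart, which is precisely the flattening the workshop's sabotage tactics play with.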

Writer Ursula K. Le Guin’s “carrier bag theory of fiction” suggests that the first tool was a bag (rather than a weapon), with contents that allowed us to form narratives through powerful relational qualities. In this workshop, spread out on a carpet is a collection of plastic bags filled with printed texts. We are invited to record ourselves reading from them in groups, either obscuring or emphasizing elements. Most adopt tactics of sabotage and subterfuge: breaking syllables, speaking continuously, using languages other than English, and so on. Some aim for clarity instead, exploiting acoustics or carefully pronouncing certain words.

The workshop wraps up with listening to recordings from the morning and reading printed transcriptions. Each transcription contains a list of phonemes next to eerily accurate but semantically unrelated matches. We record parts of the transcriptions and assign them as phone ringtones to play during the plenary session, to comedic effect. It’s easy to laugh at the mess made of what comes so naturally to us: language. But there are more serious implications, as we see in a screening of a video of academic Halcyon Lawrence, who maintains that homophony is ingrained, and that confronting accent bias is a crucial part of ensuring access to technology.
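The transcripts' behaviour, a phoneme sequence snapped to the nearest dictionary entry regardless of meaning, can be imitated with a toy pronunciation lexicon and a similarity measure. The lexicon and phoneme spellings below are invented for illustration; this is a sketch of the general technique, not PocketSphinx's actual dictionary or search:

```python
from difflib import SequenceMatcher

# Toy pronunciation lexicon: word -> phoneme sequence (ARPAbet-ish, invented).
LEXICON = {
    "say":   ["S", "EY"],
    "it":    ["IH", "T"],
    "ain't": ["EY", "N", "T"],
    "so":    ["S", "OW"],
    "sew":   ["S", "OW"],  # a homophone: identical phonemes, unrelated meaning
}

def closest_word(phonemes):
    """Snap a phoneme sequence to the most similar lexicon entry."""
    return max(
        LEXICON,
        key=lambda word: SequenceMatcher(None, LEXICON[word], phonemes).ratio(),
    )

print(closest_word(["EY", "N", "T"]))  # "ain't"
```

Because "so" and "sew" share a phoneme sequence, the matcher cannot discriminate between them: a miniature version of the homophony problem, where an engine committed to one pronunciation per entry hears accented speech as the "wrong" word.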

The hallmark of algorithmic natural language applications is invisibility, relying on a participant’s lack of awareness of the process. But invisibility is also a result of these applications: in discriminating between the contents of the bags of words they employ, they hide differences, discarding whatever is considered indistinct.