Hotword detection is a critical feature for speech recognition systems like Siri or Alexa. In a recent tutorial from AssemblyAI, developers are guided through how to implement this feature using AssemblyAI’s Streaming Speech-to-Text API with the Go programming language.
Introduction to hotword detection
Hotword detection allows AI systems to respond to specific trigger words or phrases. Popular AI systems like Alexa and Siri use predefined hotwords to activate their features. This tutorial from AssemblyAI shows how to use Go and AssemblyAI’s API to create a similar system called ‘Jarvis’, a tribute to Iron Man.
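At the text level, hotword detection can be pictured as scanning each transcript for the trigger word. The following is a minimal sketch, assuming the transcript is already available as a string. The helper containsHotword is hypothetical and not taken from the tutorial, and it handles single-word hotwords only.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// containsHotword reports whether hotword appears in transcript as a whole
// word, ignoring case and surrounding punctuation.
// Hypothetical helper for illustration; it is not part of the tutorial.
func containsHotword(transcript, hotword string) bool {
	target := strings.ToLower(strings.TrimSpace(hotword))
	if target == "" {
		return false
	}
	// Split on anything that is not a letter or digit, so "Jarvis," and
	// "jarvis!" both yield the word "jarvis".
	words := strings.FieldsFunc(strings.ToLower(transcript), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	for _, w := range words {
		if w == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsHotword("Hey Jarvis, turn on the lights.", "jarvis")) // true
	fmt.Println(containsHotword("The jarvises are visiting.", "Jarvis"))      // false
}
```

Matching whole words rather than substrings avoids triggering on longer words that merely contain the hotword, at the cost of not supporting multi-word phrases.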
Prerequisites
Before writing any code, developers need to set up their environment. This includes installing PortAudio’s Go bindings, which capture raw audio data from the microphone, and the AssemblyAI Go SDK, which interfaces with the API. The following commands set up the project:
mkdir jarvis
cd jarvis
go mod init jarvis
go get github.com/gordonklaus/portaudio
go get github.com/AssemblyAI/assemblyai-go-sdk
Next, you will need an AssemblyAI account to obtain an API key. Developers can sign up on the AssemblyAI website and configure their billing details to access the Streaming Speech-to-Text API.
Recorder implementation
The core functionality starts with recording raw audio data. The tutorial creates a recorder.go file that defines a recorder struct, which captures audio data using PortAudio. The struct provides methods for starting, stopping, and reading the audio stream.
package main

import (
	"bytes"
	"encoding/binary"

	"github.com/gordonklaus/portaudio"
)

type recorder struct {
	stream *portaudio.Stream
	in     []int16
}

func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
	in := make([]int16, framesPerBuffer)

	stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
	if err != nil {
		return nil, err
	}

	return &recorder{
		stream: stream,
		in:     in,
	}, nil
}

func (r *recorder) Read() ([]byte, error) {
	if err := r.stream.Read(); err != nil {
		return nil, err
	}

	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
		return nil, err
	}

	return buf.Bytes(), nil
}

func (r *recorder) Start() error {
	return r.stream.Start()
}

func (r *recorder) Stop() error {
	return r.stream.Stop()
}

func (r *recorder) Close() error {
	return r.stream.Close()
}
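The Read method turns the buffer of 16-bit samples into a byte slice before it is sent on. The conversion step can be sketched in isolation using only the standard library. The helper name encodePCM is hypothetical and introduced here for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodePCM serializes 16-bit samples as little-endian bytes, two bytes per
// sample. Hypothetical helper that isolates the conversion step.
func encodePCM(samples []int16) ([]byte, error) {
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.LittleEndian, samples); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encodePCM([]int16{1, -1, 256})
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", b) // 01 00 ff ff 00 01
}
```

Each sample occupies two bytes with the low byte first, so a buffer of N frames of mono audio yields 2N bytes.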
Creating a real-time transcriber
AssemblyAI’s real-time transcriber requires event handlers for various stages of the transcription process. These handlers are defined on a transcriber struct and cover events such as OnSessionBegins, OnSessionTerminated, and OnPartialTranscript.
package main

import (
	"fmt"

	"github.com/AssemblyAI/assemblyai-go-sdk"
)

var transcriber = &assemblyai.RealTimeTranscriber{
	OnSessionBegins: func(event assemblyai.SessionBegins) {
		fmt.Println("session begins")
	},
	OnSessionTerminated: func(event assemblyai.SessionTerminated) {
		fmt.Println("session terminated")
	},
	OnPartialTranscript: func(event assemblyai.PartialTranscript) {
		fmt.Printf("%s\r", event.Text)
	},
	OnFinalTranscript: func(event assemblyai.FinalTranscript) {
		fmt.Println(event.Text)
	},
	OnError: func(err error) {
		fmt.Println(err)
	},
}
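The handler struct above follows a common Go pattern: a struct of optional callback fields that a client invokes as events arrive. The sketch below illustrates that pattern in a self-contained form. Every type and function in it (handlers, message, dispatch) is hypothetical, and it does not reflect how the SDK is implemented internally.

```go
package main

import "fmt"

// handlers mimics the shape of a callback struct: each field is optional.
// All names here are hypothetical and are not the SDK's types.
type handlers struct {
	OnPartial func(text string)
	OnFinal   func(text string)
}

// message is a hypothetical transcript event.
type message struct {
	Final bool
	Text  string
}

// dispatch routes a message to the matching callback, skipping unset ones.
func dispatch(h *handlers, m message) {
	switch {
	case m.Final && h.OnFinal != nil:
		h.OnFinal(m.Text)
	case !m.Final && h.OnPartial != nil:
		h.OnPartial(m.Text)
	}
}

func main() {
	var finals []string
	h := &handlers{
		OnFinal: func(text string) { finals = append(finals, text) },
		// OnPartial is left nil, so partial messages are ignored.
	}
	dispatch(h, message{Text: "hey jar"})
	dispatch(h, message{Final: true, Text: "Hey Jarvis."})
	fmt.Println(finals) // [Hey Jarvis.]
}
```

Checking each callback for nil before calling it lets callers register only the events they care about.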
Stitching everything together
The final step integrates all components in the main.go file. This includes setting up the API client, initializing the recorder, and handling transcription events. The code also includes logic to detect the hotword and respond appropriately.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"strings"
	"syscall"

	"github.com/AssemblyAI/assemblyai-go-sdk"
	"github.com/gordonklaus/portaudio"
)

var hotword string

var transcriber = &assemblyai.RealTimeTranscriber{
	OnSessionBegins: func(event assemblyai.SessionBegins) {
		fmt.Println("session begins")
	},
	OnSessionTerminated: func(event assemblyai.SessionTerminated) {
		fmt.Println("session terminated")
	},
	OnPartialTranscript: func(event assemblyai.PartialTranscript) {
		fmt.Printf("%s\r", event.Text)
	},
	OnFinalTranscript: func(event assemblyai.FinalTranscript) {
		fmt.Println(event.Text)

		// Case-insensitive substring match against the final transcript.
		hotwordDetected := strings.Contains(
			strings.ToLower(event.Text),
			strings.ToLower(hotword),
		)
		if hotwordDetected {
			fmt.Println("I am here!")
		}
	},
	OnError: func(err error) {
		fmt.Println(err)
	},
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	logger := log.New(os.Stderr, "", log.Lshortfile)

	portaudio.Initialize()
	defer portaudio.Terminate()

	// The hotword is the first command-line argument.
	hotword = os.Args[1]

	device, err := portaudio.DefaultInputDevice()
	if err != nil {
		logger.Fatal(err)
	}

	var (
		apiKey          = os.Getenv("ASSEMBLYAI_API_KEY")
		sampleRate      = device.DefaultSampleRate
		framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio per buffer
	)

	client := assemblyai.NewRealTimeClientWithOptions(
		assemblyai.WithRealTimeAPIKey(apiKey),
		assemblyai.WithRealTimeSampleRate(int(sampleRate)),
		assemblyai.WithRealTimeTranscriber(transcriber),
	)

	ctx := context.Background()
	if err := client.Connect(ctx); err != nil {
		logger.Fatal(err)
	}

	rec, err := newRecorder(int(sampleRate), framesPerBuffer)
	if err != nil {
		logger.Fatal(err)
	}

	if err := rec.Start(); err != nil {
		logger.Fatal(err)
	}

	for {
		select {
		case
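A for/select loop of this shape typically alternates between checking for a shutdown signal and forwarding the next chunk of audio. The following is a sketch of that general pattern under stated assumptions, not the tutorial's code: reader and sender are hypothetical interfaces standing in for the recorder and the streaming client, and a plain stop channel replaces the OS signal channel so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// reader and sender are hypothetical interfaces standing in for the recorder
// and the streaming client; they are not the tutorial's or the SDK's types.
type reader interface{ Read() ([]byte, error) }
type sender interface{ Send(b []byte) error }

// pump forwards audio chunks from src to dst until stop is signalled or an
// error occurs. It returns nil on a clean stop.
func pump(stop <-chan struct{}, src reader, dst sender) error {
	for {
		select {
		case <-stop:
			return nil
		default:
			b, err := src.Read()
			if err != nil {
				return fmt.Errorf("read: %w", err)
			}
			if err := dst.Send(b); err != nil {
				return fmt.Errorf("send: %w", err)
			}
		}
	}
}

// fakeSource yields n chunks and closes stop after handing out the last one.
type fakeSource struct {
	n    int
	stop chan struct{}
}

func (f *fakeSource) Read() ([]byte, error) {
	if f.n == 0 {
		return nil, errors.New("no more audio")
	}
	f.n--
	if f.n == 0 {
		close(f.stop)
	}
	return []byte{0, 0}, nil
}

// countingSink records how many chunks it received.
type countingSink struct{ chunks int }

func (c *countingSink) Send(b []byte) error {
	c.chunks++
	return nil
}

func main() {
	stop := make(chan struct{})
	src := &fakeSource{n: 3, stop: stop}
	sink := &countingSink{}
	err := pump(stop, src, sink)
	fmt.Println(sink.chunks, err) // 3 <nil>
}
```

The default branch keeps the loop non-blocking with respect to the stop channel, so a shutdown request is noticed between chunks rather than only when audio stops arriving.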
Running the application
To run the application, developers need to set the AssemblyAI API key as an environment variable and run the Go program, passing the desired hotword as an argument.
export ASSEMBLYAI_API_KEY='***'
go run . Jarvis
This command sets ‘Jarvis’ as the hotword, and the program responds with ‘I am here!’ whenever the hotword is detected in the audio stream.
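Because the program depends on both a command-line argument and an environment variable, it can help to validate them before opening any audio or network resources. The sketch below shows one way to do that. The config type and loadConfig helper are hypothetical and not part of the tutorial, and the environment lookup is injected so the example runs without a real key.

```go
package main

import (
	"errors"
	"fmt"
)

// config holds the two inputs the program needs at startup.
type config struct {
	APIKey  string
	Hotword string
}

// loadConfig validates the command-line arguments and environment.
// getenv is injected (os.Getenv in real use) so the function is easy to test.
// Hypothetical helper; not part of the tutorial.
func loadConfig(args []string, getenv func(string) string) (config, error) {
	if len(args) < 2 || args[1] == "" {
		return config{}, errors.New("usage: jarvis <hotword>")
	}
	key := getenv("ASSEMBLYAI_API_KEY")
	if key == "" {
		return config{}, errors.New("ASSEMBLYAI_API_KEY is not set")
	}
	return config{APIKey: key, Hotword: args[1]}, nil
}

func main() {
	// A stand-in environment; a real program would pass os.Args and os.Getenv.
	env := map[string]string{"ASSEMBLYAI_API_KEY": "example-key"}
	getenv := func(k string) string { return env[k] }

	cfg, err := loadConfig([]string{"jarvis", "Jarvis"}, getenv)
	fmt.Println(cfg.Hotword, err) // Jarvis <nil>

	_, err = loadConfig([]string{"jarvis"}, getenv)
	fmt.Println(err) // usage: jarvis <hotword>
}
```

Checking the argument count up front replaces an index-out-of-range panic with a readable usage message when the hotword is omitted.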
Conclusion
This tutorial from AssemblyAI provides a comprehensive guide for developers to implement hotword detection using the Streaming Speech-to-Text API and Go. The combination of PortAudio for audio capture and AssemblyAI for transcription provides a powerful solution for creating voice-activated applications. For more information, see the original tutorial.