Dec 12, 2022

Automatic Transcription of WhatsApp Voice Messages

Using open source software and a free API to automatically transcribe WhatsApp voice messages.

The Problem

Voice messages are fun and easy, but sometimes you’re unable to listen to them. There could be a myriad of reasons for that: your data plan might be limited, your environment too loud, or your attention span too short.

Telling people to please not send you audio memos does not work, unfortunately. Their next message is just going to be even longer, because of the preface of explanations and/or excuses.
Even worse is when the voice message is really long and a crucial piece of information is hidden among all the filler!

The Solution

Legal Warning: I am a software developer, not a lawyer. This might violate WhatsApp’s ToS, and sending the audio messages to a third-party server might violate the privacy laws of your jurisdiction. Proceed at your own risk!

Code Style Warning: I whipped this up in 30 minutes for myself, so the code might not be the most beautiful. Treat this as a proof of concept, not as gospel on what beautiful code should look like.

This solution involves a Raspberry Pi running a small Node.js script that parses all incoming WhatsApp messages, sends the audio files to an AI-based audio transcription service, and sends the result as a text message.

For interfacing with WhatsApp, we’ll be using Baileys, and the transcription itself is done with Deepgram and their JavaScript SDK.
Deepgram offers a free allowance of a generous 200 hours of transcribed audio at the time of writing - more than enough to last you for many years (hopefully).

We include the two needed libraries and instantiate Deepgram:

const Baileys = require('@adiwajshing/baileys')
const { Deepgram } = require('@deepgram/sdk')
const deepgram = new Deepgram(DEEPGRAM_API_KEY)
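
The snippet above assumes that a `DEEPGRAM_API_KEY` constant is already defined somewhere. One way to supply it without hardcoding the key into the script is to read it from an environment variable. The following is a minimal sketch; the helper name `requireEnv` is hypothetical and not part of Baileys or the Deepgram SDK.

```javascript
// Hypothetical helper: read a required setting from the environment
// and fail early with a readable error if it is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name]
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`)
  }
  return value.trim()
}

// Usage (assumes the variable was exported before starting the script):
// const DEEPGRAM_API_KEY = requireEnv('DEEPGRAM_API_KEY')
```

Failing at startup is a deliberate choice here: a missing key would otherwise only surface when the first voice message arrives.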

Then we use Baileys to connect to WhatsApp.
The most recent auth state is always persisted in the subfolder auth_info_baileys; on first login or after the session expires, a QR code is printed in the terminal so you can authorize the connection with your phone.

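// note: in a CommonJS script (as the `require` calls imply), `await` only
// works inside an async function, so wrap the setup code in one
const { state, saveCreds } = await Baileys.useMultiFileAuthState('auth_info_baileys')
const conn = Baileys.default({
    auth: state,
    printQRInTerminal: true,
})
// persist updated credentials whenever they change
conn.ev.on('creds.update', saveCreds)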

Because the phone might otherwise stop receiving push notifications, and so as not to always show up as online, we set our Node client to present itself as unavailable:

conn.ev.on('connection.update', (update) => {
  const { connection } = update
  if (connection === 'open') {
    conn.sendPresenceUpdate('unavailable')
  }
})

Lastly, we subscribe to the event for received messages and do what we need to do:

// this event is fired whenever a message is received,
// it will fire for messages you sent yourself as well
conn.ev.on('messages.upsert', async ({ messages }) => {
  // iterate over all messages we received
  messages.forEach(async (message) => {
    // only parse audio messages
    if (!message?.message?.audioMessage) {
        return
    }

    // use this code to skip messages sent in chat groups
    // (`participant` is set on the message key for group messages)
    /*
    if (message?.key?.participant) {
      return
    }
    */

    // get a buffer of the audio attachment of the message
    const buffer = await Baileys.downloadMediaMessage(
        message,
        'buffer',
        {},
        { reuploadRequest: conn.updateMediaMessage },
    )

    // send the audio buffer to deepgram for transcription
    // see available features here: https://developers.deepgram.com/documentation/features/
    const response = await deepgram.transcription.preRecorded({
        buffer,
        mimetype: message.message.audioMessage.mimetype,
    }, {
        language: 'en', // remove this and use `detect_language: true` for multi-language use
        model: 'general', // which AI model to use https://developers.deepgram.com/documentation/features/model/
        numerals: true, // write `nine hundred` as `900`
        profanity_filter: false, // profanity filter leads to strange results
        punctuate: true, // makes it easier to read
        tier: 'enhanced', // better detection, cost is not an issue for this project
    })
    // detect ways in which response could be empty or invalid
    if (!response?.results?.channels?.length) {
        return
    }
    if (!response.results.channels[0]?.alternatives?.length) {
        return
    }
    if (!response.results.channels[0].alternatives[0].transcript) {
        return
    }
    // shows the duration of the audio message as something like `1:45`
    const durationString = `${Math.floor(message.message.audioMessage.seconds / 60)}:${('0' + (message.message.audioMessage.seconds % 60)).slice(-2)}`
    // generates a nice message, looking something like this:
    // 
    // 🤖 [From: John Doe] [Duration: 0:20] [Confidence: 98%]
    // Hey man, have you seen the game yesterday? It was amazing! [...]
    // 
    // `confidence` (shown as a percentage) indicates how certain the AI model is that it understood the message correctly.
    // If you're using `detect_language`, you could add `[Language: ${response.results.channels[0].detected_language}]`
    const transcriptionMessage = `🤖 [From: ${message.pushName}] [Duration: ${durationString}] [Confidence: ${Math.round(response.results.channels[0].alternatives[0].confidence * 100)}%]
${response.results.channels[0].alternatives[0].transcript}`

    // Create a group with only you as participant,
    // pin it to the top of your chat list,
    // and insert its ID here for a private mailbox!
    conn.sendMessage(`${INBOX_GROUP_ID}@g.us`, {
        text: transcriptionMessage,
    })

    // If you want the transcription to be a reply to
    // the original message instead, use this call.
    // Be aware that this might alienate your
    // conversation partner.
    /*
    conn.sendMessage(message.key.remoteJid, {
        text: transcriptionMessage,
    }, {
        quoted: message,
    })
    */
  })
})
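
The handler above mixes filtering, response validation and formatting in one callback. As a rough sketch of how those steps could be pulled out into small functions that can be tried without a WhatsApp connection, consider the following. All function names are hypothetical, and the sample objects at the end are hand-written to mimic the shapes used above; they are not real Baileys or Deepgram output.

```javascript
// Hypothetical helpers; the names are illustrative and not part of Baileys or Deepgram.

// Decide whether a message should be transcribed at all.
function isTranscribable(message, { skipGroups = false } = {}) {
  if (!message?.message?.audioMessage) return false
  // group messages carry the sender in `key.participant`
  if (skipGroups && message?.key?.participant) return false
  return true
}

// Return the first alternative of the first channel, or null if the
// response is empty or contains no transcript.
function firstAlternative(response) {
  const alternative = response?.results?.channels?.[0]?.alternatives?.[0]
  return alternative?.transcript ? alternative : null
}

// Format a duration in seconds as m:ss, e.g. 105 -> "1:45".
function formatDuration(seconds) {
  const total = Math.max(0, Math.floor(Number(seconds) || 0))
  return `${Math.floor(total / 60)}:${String(total % 60).padStart(2, '0')}`
}

// Build the text that is sent to the inbox chat.
function buildTranscriptionMessage(message, alternative) {
  const duration = formatDuration(message.message.audioMessage.seconds)
  const confidence = Math.round((alternative.confidence ?? 0) * 100)
  return `🤖 [From: ${message.pushName}] [Duration: ${duration}] [Confidence: ${confidence}%]\n${alternative.transcript}`
}

// Example with hand-written sample objects
const message = { pushName: 'John Doe', key: {}, message: { audioMessage: { seconds: 20 } } }
const response = { results: { channels: [{ alternatives: [{ transcript: 'Hey man!', confidence: 0.98 }] }] } }
const alternative = firstAlternative(response)
if (isTranscribable(message) && alternative) {
  console.log(buildTranscriptionMessage(message, alternative))
}
```

Inside the `messages.upsert` callback, these helpers would replace the inline checks and the template string; the download and transcription calls stay as they are.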

Especially for low-quality voice messages in non-English languages, the transcription is often hilariously off, but it is still good enough to get a rough idea of whether the message is urgent (or even worth listening to).
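
Since the confidence value is already part of the generated message, one optional refinement is to flag transcripts the model itself is unsure about. This is only a sketch: the function name `confidenceLabel` and the threshold of 0.8 are assumptions picked for illustration, not values recommended by Deepgram.

```javascript
// Hypothetical helper: render the confidence tag and mark low values.
// `confidence` is expected as a number between 0 and 1.
function confidenceLabel(confidence, threshold = 0.8) {
  const percent = Math.round(confidence * 100)
  const marker = confidence < threshold ? ' ⚠️ low' : ''
  return `[Confidence: ${percent}%${marker}]`
}

console.log(confidenceLabel(0.98)) // [Confidence: 98%]
console.log(confidenceLabel(0.55)) // [Confidence: 55% ⚠️ low]
```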

Keeping it running

On the Raspberry Pi (or DigitalOcean droplet, or your private server, or whichever platform you prefer), use a simple process management tool like pm2 to keep the script running and restart it after a reboot or connection error.