Using Azure’s Speech SDK to transcribe real-time audio

Pallav Raval · Published in Analytics Vidhya · 3 min read · Jul 19, 2024
Photo by Steve Johnson on Unsplash

Advancements in LLMs are arriving at a rapid rate, with all the big tech companies, from Microsoft and Google to Amazon and Meta, jumping straight into the field. In this article I will demonstrate how easy it is to transcribe real-time audio.

Photo by BoliviaInteligente on Unsplash

In this article, we’ll walk through the creation of a real-time transcription web app using Azure Cognitive Services. We’ll delve into the code, explaining each part to help you understand how to implement this functionality.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Real-Time Transcription</title>
<script src="https://aka.ms/csspeech/jsbrowserpackageraw"></script>
</head>
<body>
<h1>Real-Time Transcription</h1>
<button id="startButton">Start Listening</button>
<button id="stopButton" disabled>Stop Listening</button>
<p id="transcription"></p>
<a id="downloadLink" href="#" download="transcription.txt" style="display:none;">Download Transcription</a>
</body>
</html>

This is a basic HTML page with two buttons to start and stop listening to audio from the input device. Once the stop button is clicked, the page provides a link to download the transcript as a text file.

Now let’s move on to the JavaScript code:

<script>
    const subscriptionKey = "your_subscription_key";
    const serviceRegion = "your_service_region";

    let audioConfig;
    let speechConfig;
    let recognizer;
    let accumulatedTranscription = "";

    document.getElementById('startButton').addEventListener('click', function () {
        startListening();
    });

    document.getElementById('stopButton').addEventListener('click', function () {
        stopListening();
    });

    function startListening() {
        speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscriptionKey, serviceRegion);
        audioConfig = SpeechSDK.AudioConfig.fromDefaultMicrophoneInput();

        recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);

        // Fires repeatedly with interim (partial) results while the user is speaking.
        recognizer.recognizing = function (s, e) {
            document.getElementById('transcription').innerText = e.result.text;
        };

        // Fires once per utterance with the final recognition result.
        recognizer.recognized = function (s, e) {
            if (e.result.reason === SpeechSDK.ResultReason.RecognizedSpeech) {
                document.getElementById('transcription').innerText = e.result.text;
                accumulatedTranscription += e.result.text + "\n";
            } else if (e.result.reason === SpeechSDK.ResultReason.NoMatch) {
                document.getElementById('transcription').innerText = "No speech could be recognized.";
            }
        };

        recognizer.canceled = function (s, e) {
            console.error(`Canceled: ${e.reason}`);
            if (e.reason === SpeechSDK.CancellationReason.Error) {
                console.error(`Error details: ${e.errorDetails}`);
            }
            stopListening();
        };

        recognizer.sessionStopped = function (s, e) {
            console.log("Session stopped.");
            stopListening();
        };

        recognizer.startContinuousRecognitionAsync();

        document.getElementById('startButton').disabled = true;
        document.getElementById('stopButton').disabled = false;
    }

    function stopListening() {
        // Guard against double invocation: the stop button and the
        // sessionStopped/canceled callbacks can all call this function.
        if (!recognizer) {
            return;
        }

        recognizer.stopContinuousRecognitionAsync(() => {
            recognizer.close();
            recognizer = undefined;

            // Wrap the accumulated text in a Blob and expose it via the download link.
            const blob = new Blob([accumulatedTranscription], { type: 'text/plain' });
            const url = URL.createObjectURL(blob);
            const downloadLink = document.getElementById('downloadLink');
            downloadLink.href = url;
            downloadLink.style.display = 'block';
        });

        document.getElementById('startButton').disabled = false;
        document.getElementById('stopButton').disabled = true;
    }
</script>

Fill in the constants with the subscription key and region from your Azure account before running the code. The startListening function initializes the recognizer, starts continuous recognition, and updates the button states. Clicking the start button makes the page request audio access from the user’s available audio inputs. Clicking the stop button stops listening and enables a download link for the generated transcript: the stopListening function stops the recognition process, finalizes the transcription, creates a downloadable text file, and updates the UI to enable the download link.
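By default the recognizer uses the service’s default recognition language and the system’s default microphone. If you want to pin these down, SpeechConfig and AudioConfig expose a couple of settings. The snippet below is a configuration sketch, not part of the walkthrough above; "your-device-id" is a placeholder you would obtain from navigator.mediaDevices.enumerateDevices():

```javascript
// Sketch: optional recognizer configuration. "your-device-id" is a
// placeholder; real device IDs come from navigator.mediaDevices.enumerateDevices().
speechConfig = SpeechSDK.SpeechConfig.fromSubscription(subscriptionKey, serviceRegion);
speechConfig.speechRecognitionLanguage = "en-US"; // language to transcribe

// Use a specific microphone instead of the default input.
audioConfig = SpeechSDK.AudioConfig.fromMicrophoneInput("your-device-id");

recognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);
```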

Screenshot: the webpage requests access to an audio input
Screenshot: real-time transcription on the go
Screenshot: downloadable link for the transcript

With this code, you can create a real-time transcription web app using Azure Cognitive Services. This tool can be invaluable for various applications, making speech-to-text functionality easily accessible through a web browser. By understanding each part of the code, you can further customize and extend this functionality to suit your needs.
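One easy customization is to prefix each finalized phrase with an elapsed-time stamp before appending it to accumulatedTranscription. The helper below is my own sketch, not an SDK function; you would record a session start time in startListening and call it inside the recognized handler:

```javascript
// Sketch of a timestamping helper (not part of the Speech SDK). Given the
// session start time and the current time in milliseconds, it formats the
// recognized text as "[mm:ss] text" for the downloadable transcript.
function formatEntry(startMs, nowMs, text) {
    const totalSeconds = Math.floor((nowMs - startMs) / 1000);
    const mm = String(Math.floor(totalSeconds / 60)).padStart(2, "0");
    const ss = String(totalSeconds % 60).padStart(2, "0");
    return `[${mm}:${ss}] ${text}`;
}

// Inside the recognized handler you would then write:
//   accumulatedTranscription += formatEntry(sessionStart, Date.now(), e.result.text) + "\n";
```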
