Working with Audio to Text

Let’s dive into a simple step-by-step guide on how to use Audio to Text.

To keep things straightforward, we’ll break the process into three main parts:

  1. Transcribing your Audio/Video file
  2. Voice Changer Screen
  3. Fixing Block Errors

Here’s a concise overview of the Audio to Text feature. You can use this article in conjunction with the interactive tutorial for a more comprehensive learning experience.

 

 

Transcribing your Audio/Video file

Media Upload and Language Selection

The first step is to upload your media file from the Audio to Text tab in the sidebar and select the language of the speech in the uploaded media.

 

 

Selecting the appropriate language during media upload is critical to ensure the accuracy of the transcription.

 

 

Media Retention

Uploading audio or video directly to the Audio to Text menu will remove the original files. To keep the original video, upload it via Add Media, add it to the timeline, and then transcribe. Note that the original audio will still be removed in this process.

 

 

Voice Changer Screen

Once you’ve uploaded your media and selected the appropriate language for transcription, you will be shown a Voice Changer screen. 

 

If you only want to transcribe your audio and export the script, click Accept All & Save, then go to Export > Script.

 

Once the transcription is visible on the Voice Changer screen, your voiceover is generated using one of the default voices.

There are two possible outcomes:

1. The Studio may encounter Block Errors.

2. The Studio may not encounter any Block Errors.

 

Understanding Block Errors

To understand Block Errors, it is essential to understand the concept of Uploaded Duration and Generated Audio Duration.

Understanding the pace of delivery for different voices is key. This pace determines how long it takes for a particular voice to narrate a sentence.

Uploaded Duration

Uploaded Duration is the length of the original speech for a given sentence or section of your video or audio. Your speech is transcribed into text blocks based on the timestamps of the original speech, and those timestamps are frozen for syncing.

Generated Audio Duration

It is essential to remember that each AI voice has its own distinct pace of delivery, much like human voices. However, you have more control over this pace, as you can customise it to your preference using the Speed feature in the Studio.

A block error occurs in the Studio when the Uploaded Duration is shorter than the Generated Audio Duration. The erroneous block will be highlighted in the sidebar to alert you to the issue, as shown below.

 

 

As a result, the generated audio file may be longer than the original, which can cause synchronisation issues if the audio is transcribed for a video.

❗In cases where the Uploaded Duration is longer than the Generated Audio Duration, the Studio will not display an error message.
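The rule above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the Studio's own code: the Block class, its field names and the sample durations are all hypothetical, and treating equal durations as error-free is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one transcribed text block."""
    text: str
    uploaded_duration: float   # seconds of original speech for this block
    generated_duration: float  # seconds of AI-generated audio for this block


def has_block_error(block: Block) -> bool:
    # Flag the block only when the generated audio outlasts the original speech.
    # Equal durations are assumed to be fine (the article does not cover this case).
    return block.generated_duration > block.uploaded_duration


def overrun_seconds(block: Block) -> float:
    # How far the generated audio runs past the original slot (0 if it fits).
    return max(0.0, block.generated_duration - block.uploaded_duration)


# Made-up sample data for illustration only.
blocks = [
    Block("Welcome to the tutorial.", 2.0, 1.8),
    Block("Today we will cover three topics in detail.", 3.0, 3.6),
]
flagged = [b.text for b in blocks if has_block_error(b)]
```

Here only the second block would be flagged, because its generated audio is longer than the original speech it replaces.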

 

Fixing Block Errors

There are four ways of fixing a block error:

  1. Changing the Voice
  2. Altering the Speed
  3. Editing the Content
  4. Accepting the Errors

Changing the Voice

It’s worth noting that each AI voice has a unique delivery pace, typically measured in words per minute (WPM).

For reference, an average English speaker speaks at a rate of 120-180 WPM.

Consider selecting a different AI voice that matches the WPM of the original recording to achieve a more accurate match between the uploaded voice and the AI-generated voice.
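As a rough illustration of this matching step, the sketch below estimates the WPM of an original block and picks the nearest voice. The voice names and their WPM figures are invented for the example; the Studio's actual voices and paces are not listed in this article.

```python
from typing import Mapping


def words_per_minute(text: str, duration_seconds: float) -> float:
    # Count whitespace-separated words and scale to a one-minute rate.
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(text.split()) / duration_seconds * 60.0


def closest_voice(target_wpm: float, voice_wpm: Mapping[str, float]) -> str:
    # Pick the voice whose typical pace is nearest to the original speaker's.
    return min(voice_wpm, key=lambda name: abs(voice_wpm[name] - target_wpm))


# Hypothetical catalogue: names and paces are assumptions, not real Studio data.
voices = {"Voice A": 130.0, "Voice B": 155.0, "Voice C": 175.0}

original_text = "one two three four five six seven eight nine ten eleven twelve"
original_wpm = words_per_minute(original_text, 4.8)  # 12 words in 4.8 s, about 150 WPM
best_match = closest_voice(original_wpm, voices)
```

With these made-up numbers, the original speech runs at roughly 150 WPM, so the voice at 155 WPM is the closest fit.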

Altering the Speed

One of the benefits of using AI voices is the option to adjust the narration speed to match the pace of the original uploaded voice.

By increasing the speed, you can shorten the generated audio duration to achieve a closer match with your original recording. 

It is advisable to avoid increasing or decreasing the speed by more than 25-30%, as it may result in the generated audio sounding too fast or too slow.
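To get a feel for the numbers, the sketch below estimates the speed change needed to fit a block and checks it against the guideline above. It assumes that speed acts as a simple multiplier, so that audio played at 1.2x lasts 1/1.2 of its normal length; the Studio's Speed feature may behave differently, and the function names are hypothetical.

```python
MAX_SPEED_CHANGE = 0.30  # upper end of the 25-30% guideline


def required_speed(generated_duration: float, uploaded_duration: float) -> float:
    # Multiplier that would make the generated audio last as long as the original,
    # assuming duration scales as 1 / speed.
    if generated_duration <= 0 or uploaded_duration <= 0:
        raise ValueError("durations must be positive")
    return generated_duration / uploaded_duration


def within_guideline(speed: float, max_change: float = MAX_SPEED_CHANGE) -> bool:
    # True when the change from normal speed (1.0) stays inside the guideline.
    # The tiny tolerance guards against floating-point rounding at the boundary.
    return abs(speed - 1.0) <= max_change + 1e-9


modest = required_speed(3.6, 3.0)   # about 1.2, i.e. roughly 20% faster
extreme = required_speed(4.5, 3.0)  # 1.5, i.e. 50% faster
```

In this example the first block can be fixed with speed alone, while the second would need a 50% increase, which suggests changing the voice or editing the content instead.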

Editing the Content

If changing the voice or altering the speed does not bring the generated audio close enough to the uploaded audio, we recommend editing the text content. This is particularly advisable for longer text blocks.

This may involve rephrasing and summarising the blocks with a significant error margin. While this may result in changes to the original script, it is one of the most effective ways to ensure that the script matches the uploaded audio duration and maintains synchronisation.
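When deciding how much to cut, a rough word budget can help. The sketch below estimates how many words fit in a block's Uploaded Duration at a given voice pace and how many would need trimming. It assumes a constant pace in WPM, which real voices only approximate, and the function names are hypothetical.

```python
import math


def max_words(voice_wpm: float, uploaded_duration: float) -> int:
    # Largest whole number of words the voice can narrate within the original slot.
    if voice_wpm <= 0 or uploaded_duration < 0:
        raise ValueError("voice_wpm must be positive and duration non-negative")
    return math.floor(voice_wpm * uploaded_duration / 60.0)


def words_to_trim(text: str, voice_wpm: float, uploaded_duration: float) -> int:
    # Number of words to remove so the block fits (0 if it already fits).
    current = len(text.split())
    return max(0, current - max_words(voice_wpm, uploaded_duration))


# Made-up example: a 20-word block that must fit into 6 seconds at 150 WPM.
block_text = (
    "This sentence has been written so that it contains exactly twenty "
    "words which is too many for the slot available"
)
budget = max_words(150.0, 6.0)
excess = words_to_trim(block_text, 150.0, 6.0)
```

With these figures the budget is 15 words, so the 20-word block would need to lose about five words through rephrasing or summarising.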

Accepting the Errors

If you’re simply looking to generate a transcript of your audio/video file and the timestamp syncing isn’t crucial to your workflow, you can select “Accept All & Save”. The “ACCEPT AUDIO LENGTH” option is helpful if you wish to make further changes to a specific block.

Accepting the errors means acknowledging that the generated audio duration may be longer than the uploaded audio duration. Depending on the margin of error present, all blocks will move forward by a few seconds.
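The knock-on effect on later blocks can be estimated with a running total. The sketch below assumes that blocks play back to back and that a block which finishes early does not pull later blocks earlier; both are assumptions made for illustration, not documented Studio behaviour.

```python
from typing import List, Sequence, Tuple


def accumulated_shift(blocks: Sequence[Tuple[float, float]]) -> List[float]:
    # Each item is (uploaded_duration, generated_duration) in seconds.
    # Returns how many seconds later than planned each block would start.
    shifts = []
    total = 0.0
    for uploaded, generated in blocks:
        shifts.append(total)
        # Only overruns add delay; underruns are assumed to leave a gap instead.
        total += max(0.0, generated - uploaded)
    return shifts


# Made-up durations for four consecutive blocks.
example = [(2.0, 1.8), (3.0, 3.5), (4.0, 5.0), (2.0, 2.0)]
shifts = accumulated_shift(example)  # -> [0.0, 0.0, 0.5, 1.5]
```

In this example the overruns in the second and third blocks push the fourth block back by 1.5 seconds in total.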