Originally posted on the Akvelon Blog

While working on one of our apps, we faced the need to trim recorded audio files. We were working with 32-bit float WAV files, and we had the following requirements:

  • the output file should have the exact same format as the input file;
  • no processing should be applied to the audio data; samples should be copied as-is;
  • there should be an ability to add silence to the output file.

My first guess was to use AVAssetExportSession, but it offers limited options for exporting audio, and there’s no way to be sure what it does with the audio under the hood. A no-go.

Then I took another look at the requirements. “Audio samples should be copied as-is”. That was exactly what we needed - open the input file for reading and the output one for writing, calculate the range of audio samples to copy, and perform the actual copying. Fortunately, all of this is possible with AVAudioFile - it can be read into an AVAudioPCMBuffer and written from the buffer’s contents.

A few words regarding the output format

While implementing this functionality, I was sure that AVFoundation converts the data on the fly to the format set for the output file. However, as it turned out later, only the header of the output file matched the set format - the actual audio data stayed in AVFoundation’s internal processing format (32-bit float WAV).

It’s possible to use AVAudioConverter to convert the audio to the desired format, but that’s beyond the scope of this article.

Also, keep in mind that multiple conversions from one lossy format to another (like MP3 or AAC) will degrade the audio quality. Theoretically, it’s possible to copy the lossy audio data directly, as mp3DirectCut does, but that’s also something to figure out yourself.

Preparation

First, we have to open the input file for reading, get some information about it, and open the output file for writing.

Opening the file for reading is simple:

let inputFile: AVAudioFile = try! AVAudioFile(forReading: inputFileURL)

Note: All the force unwraps, force tries, and fatalError() calls were added to keep the code examples concise. In a real app, all these things should be handled properly.
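For reference, here’s a sketch of what a more careful version of the same call might look like (how exactly you react to the error depends on your app):

do {
    let inputFile = try AVAudioFile(forReading: inputFileURL)
    // ... work with inputFile ...
} catch {
    // React to the failure in whatever way fits your app:
    // show an alert, log the error, fall back, and so on.
    print("Failed to open the audio file for reading: \(error)")
}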

After doing so, we can take a look at two important properties of the opened file:

  • fileFormat: AVAudioFormat - the format of the audio file itself;
  • processingFormat: AVAudioFormat - the format that AVFoundation will use to process the audio.

As I mentioned earlier, the format of our files happened to match the internal processing format of AVFoundation. That’s why it was possible to use fileFormat for allocating audio buffers and exporting audio data. But when I tried an input file with even a slightly different format, everything went down in flames. Throughout this article, I’ll be using processingFormat as the correct one for manipulating audio data.
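If you’re curious whether the two formats differ for your particular files, a quick check like this (just a minimal sketch) will tell you:

print("File format:       \(inputFile.fileFormat)")
print("Processing format: \(inputFile.processingFormat)")

// In our case (32-bit float WAV), the two formats happened to match;
// for most other input formats they won't.
if inputFile.fileFormat == inputFile.processingFormat {
    print("No conversion happens on read")
}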

For our task, we’re going to use the following properties of the input file:

  • processingFormat: AVAudioFormat - see above;
  • processingFormat.sampleRate: Double - the sample rate of the audio file;
  • length: Int64 - the length of the audio file in samples.

Note: In case digital audio is Greek to you, and you have no clue what the sample rate is, here’s a great article describing the basics.

Later, we’re going to need the processing format, its sample rate, and the duration in seconds. Let’s define them as properties for convenience:

let processingFormat = inputFile.processingFormat
let sampleRate = Int(processingFormat.sampleRate)
let duration = Double(inputFile.length) / Double(sampleRate)

Note: In this article, we’re going to cast a lot of values between different types. At some point, it will become a total mishmash.

Then we can open the output file for writing, setting it up with the processing format:

let outputFile: AVAudioFile = try! AVAudioFile(
    forWriting: outputFileURL,
    settings: processingFormat.settings,
    commonFormat: processingFormat.commonFormat,
    interleaved: processingFormat.isInterleaved)

Also, looking a bit ahead, let’s define the default buffer size that we’re going to use. In my experience, a buffer with the size of the sample rate (i.e. storing one second of audio) works just fine:

let defaultBufferSize = sampleRate

Finally, we are prepared for the action. It’s time to copy some audio data!

Copying an arbitrary segment of the audio file

We’ll start with the case where we have to copy a segment of the audio file. Let’s define the start and end times and check that they are valid:

let startTime: Double = 1.0
let endTime: Double = 2.0

guard
    startTime < endTime,
    startTime >= 0,
    endTime <= duration
else {
    fatalError()
}

Then we have to figure out the offset of the segment and its duration in samples:

let offset = Int64(Double(sampleRate) * startTime)
var samplesToCopy = Int(Double(sampleRate) * (endTime - startTime))

Now the interesting part begins. AVAudioFile has a framePosition property - “the position in the file at which the next read or write operation will occur”, as the documentation says. It advances automatically on each read or write operation by the number of frames read or written. Fortunately for us, it can also be set manually to perform a seek before a read or write. As you might have guessed, this is what we computed the offset for:

inputFile.framePosition = offset

Finally, it’s time to copy the audio data. Let’s take a look at the complete code snippet and break it apart right after:

// 1
while samplesToCopy > 0 {
    // 2
    let bufferCapacity = min(samplesToCopy, defaultBufferSize)
    // 3
    let buffer = AVAudioPCMBuffer(
        pcmFormat: processingFormat, 
        frameCapacity: AVAudioFrameCount(bufferCapacity))!

    // 4
    try! inputFile.read(into: buffer)
    try! outputFile.write(from: buffer)
    // 5
    samplesToCopy -= Int(buffer.frameLength)
}
  1. Repeat the process until the required number of samples has been copied;
  2. Determine how many samples to copy during this iteration. At some point, the number of remaining samples will be less than the default buffer size, and that remainder should be used as the buffer size instead;
  3. Instantiate the buffer with the proper audio format and the determined size;
  4. Read from the input file into the buffer, and write the buffer into the output file. Important note: the documentation says the system will try to fill the buffer to its capacity, but this isn’t guaranteed. That’s why we repeat the loop until the desired number of samples is copied rather than copying a precalculated number of buffers;
  5. Decrease the sample counter by the number of samples actually copied.

And the last step - for the complete output file to be written to disk, the outputFile has to be deinitialized, which can be achieved by removing all its strong references (if you only refer to it from a constant declared inside a method, this will happen when the method returns). Not as scary as it could be, right?
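To put it all together, here’s what the whole flow might look like wrapped into a single function (a minimal sketch: trimAudioFile is a hypothetical name, and errors are propagated to the caller instead of being force-tried):

import AVFoundation

// Copies the segment between startTime and endTime (in seconds)
// of the input file into the output file.
func trimAudioFile(at inputFileURL: URL,
                   to outputFileURL: URL,
                   startTime: Double,
                   endTime: Double) throws {
    let inputFile = try AVAudioFile(forReading: inputFileURL)
    let processingFormat = inputFile.processingFormat
    let sampleRate = Int(processingFormat.sampleRate)
    let duration = Double(inputFile.length) / Double(sampleRate)

    guard startTime < endTime, startTime >= 0, endTime <= duration else {
        fatalError() // See the note about error handling above.
    }

    let outputFile = try AVAudioFile(
        forWriting: outputFileURL,
        settings: processingFormat.settings,
        commonFormat: processingFormat.commonFormat,
        interleaved: processingFormat.isInterleaved)

    inputFile.framePosition = Int64(Double(sampleRate) * startTime)
    var samplesToCopy = Int(Double(sampleRate) * (endTime - startTime))
    let defaultBufferSize = sampleRate

    while samplesToCopy > 0 {
        let bufferCapacity = min(samplesToCopy, defaultBufferSize)
        let buffer = AVAudioPCMBuffer(
            pcmFormat: processingFormat,
            frameCapacity: AVAudioFrameCount(bufferCapacity))!

        try inputFile.read(into: buffer)
        try outputFile.write(from: buffer)
        samplesToCopy -= Int(buffer.frameLength)
    }
    // Both files go out of scope when the function returns,
    // which finalizes the output file on disk.
}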

Copying the last part of the audio file

Now let’s look at how we can cut some corners if we only need to copy a “tail” of the desired length from the audio file.

This time we’ll define a segment duration and check its validity:

let segmentDuration: Double = 4.0
guard segmentDuration <= duration else { fatalError() }

Offset calculation is also pretty straightforward:

let offset = inputFile.length - Int64(Double(sampleRate) * segmentDuration)
inputFile.framePosition = offset

And here lies the main difference - instead of calculating how many samples should be copied, we just perform the copying until we reach the end of the input file:

while inputFile.framePosition < inputFile.length {
    let buffer = AVAudioPCMBuffer(
        pcmFormat: processingFormat, 
        frameCapacity: AVAudioFrameCount(defaultBufferSize))!

    try! inputFile.read(into: buffer)
    try! outputFile.write(from: buffer)
}

This is a practical example of a situation where the buffer may not be filled completely because there aren’t enough samples left in the input file.

The rest is identical - deinit the output file and you are free to go!

Adding silence to the output file

Now let’s add some silence to the output file, for example, to round up its duration. Obviously, we need a buffer that we’ll later write into the output file. Let’s begin by obtaining one:

let silenceDuration: Double = 0.7
let silenceLength = Int(silenceDuration * Double(sampleRate))

let buffer = AVAudioPCMBuffer(
    pcmFormat: processingFormat, 
    frameCapacity: AVAudioFrameCount(silenceLength))!

Since in our case we had to add no more than a second of silence, we used a single buffer; but if you’re going to add hours of silence, it’s better to work with small buffers in a loop similar to what we did earlier (see the sketch at the end of this section).

Now we have an empty buffer, and it has to be filled with silence somehow. The first thing that interests us is the buffer’s frameLength property, which represents the number of valid audio frames in the buffer. Its value can be set manually, so to make the buffer look like it’s filled up with data, we can do the following:

buffer.frameLength = buffer.frameCapacity

If you try to write this buffer into the file now, it may work, especially if you’re testing in a debug build. However, there’s no guarantee that the buffer’s memory will be zeroed out in a release build, so in the wild, you can get anything but silence.

Therefore, we need to fill the buffer with zeros manually. We’ll go the dangerous way, using the old infamous memset. Some smart people call it the most troublesome function in history, so it wouldn’t hurt to double-check that everything is implemented correctly. This approach was described by theanalogkid on the Apple Developer Forums, with the sample code written in Objective-C. Here’s how the implementation looks in Swift:

// 1
let bytesPerFrame = Int(processingFormat.streamDescription.pointee.mBytesPerFrame)
// 2
guard let channelDataPointer = buffer.floatChannelData else { fatalError() }

// 3
for channel in 0..<Int(processingFormat.channelCount) {
    // 4
    memset(channelDataPointer[channel], 0, silenceLength * bytesPerFrame)
}

Here’s what we do:

  1. Getting the size of a single audio frame in bytes;
  2. Getting a pointer to the audio data. Since we’re working with floating-point audio, the property we need is floatChannelData. All three channel-data properties (floatChannelData, int16ChannelData, and int32ChannelData) are optional, so if you’re unsure about the audio data format, you can check each of them in turn to find the one that isn’t nil;
  3. Iterating over the indexes of the audio channels;
  4. For each audio channel, writing the exact number of zeroed bytes to fill all the audio frames.

Now the buffer can be safely written to the file, and that’s it - we’ve just added some silence to the audio file. As always, don’t forget to close the output file when you’re done writing into it.
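And if you ever do need a long stretch of silence, here’s what the loop-based variant mentioned earlier might look like (a minimal sketch reusing bytesPerFrame and the zeroing approach from above):

var samplesToWrite = Int(silenceDuration * Double(sampleRate))

while samplesToWrite > 0 {
    // Use at most one second of audio per buffer.
    let bufferCapacity = min(samplesToWrite, defaultBufferSize)
    let buffer = AVAudioPCMBuffer(
        pcmFormat: processingFormat,
        frameCapacity: AVAudioFrameCount(bufferCapacity))!
    buffer.frameLength = buffer.frameCapacity

    // Zero out every channel, exactly as before.
    guard let channelDataPointer = buffer.floatChannelData else { fatalError() }
    for channel in 0..<Int(processingFormat.channelCount) {
        memset(channelDataPointer[channel], 0, bufferCapacity * bytesPerFrame)
    }

    try! outputFile.write(from: buffer)
    samplesToWrite -= Int(buffer.frameLength)
}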

Conclusion

As you can see, basic trimming of an audio file isn’t that hard to do without any third-party libraries. The described techniques can be taken further, for example, to compose pieces of an audio file in arbitrary order or to concatenate multiple audio files. The only drawback is the need to convert the audio to the desired output format manually.
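As an illustration of the concatenation case, here’s a minimal sketch (concatenateAudioFiles is a hypothetical helper, and it assumes that all the input files share the same processing format):

import AVFoundation

// Appends the contents of each input file to a single output file.
// Assumes all input files share the same processing format.
func concatenateAudioFiles(at inputURLs: [URL], to outputFileURL: URL) throws {
    guard let firstURL = inputURLs.first else { return }

    let format = try AVAudioFile(forReading: firstURL).processingFormat
    let outputFile = try AVAudioFile(
        forWriting: outputFileURL,
        settings: format.settings,
        commonFormat: format.commonFormat,
        interleaved: format.isInterleaved)

    for url in inputURLs {
        let inputFile = try AVAudioFile(forReading: url)
        // The same read/write loop as before: copy until the end of the file.
        while inputFile.framePosition < inputFile.length {
            let buffer = AVAudioPCMBuffer(
                pcmFormat: format,
                frameCapacity: AVAudioFrameCount(format.sampleRate))!

            try inputFile.read(into: buffer)
            try outputFile.write(from: buffer)
        }
    }
}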