Jitter Buffers!

Implementing a jitter buffer to improve real time audio quality


Dakshin Devanand

Back in my days of playing Rocket League, having 90 ping was the greatest of my concerns.

Lag, and its underlying cause, latency, was the devil, I thought. However, I’ve now personally experienced an even greater evil, which is…

Jitter.

What is it? #

Now latency is defined as the time taken for data to travel from source to destination. Jitter on the other hand can be thought of as the variation of latency across multiple packets.

Formally, the RFC 3550 standard defines it as “the mean deviation (smoothed absolute value) of the difference D in packet spacing at the receiver compared to the sender for a pair of packets.”

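That RFC 3550 definition turns into a small running estimator: D compares the packet spacing seen at the receiver against the spacing at the sender, and the estimate J is smoothed with a 1/16 gain. A minimal sketch (the struct and its names are mine, not from any library; timestamps are in milliseconds here for readability, where the RFC uses RTP timestamp units):

```cpp
#include <cmath>
#include <cstdint>

// Sketch of RFC 3550's interarrival jitter estimator.
// D(i-1, i) = (R_i - R_{i-1}) - (S_i - S_{i-1}): the change in spacing
// at the receiver (R) versus the sender (S), in the same time units.
// The running estimate is smoothed: J += (|D| - J) / 16.
struct JitterEstimator {
    double jitter = 0.0;
    int64_t prev_recv = 0;
    int64_t prev_sent = 0;
    bool have_prev = false;

    void onPacket(int64_t recv_ts, int64_t sent_ts) {
        if (have_prev) {
            double d = std::abs(static_cast<double>(
                (recv_ts - prev_recv) - (sent_ts - prev_sent)));
            jitter += (d - jitter) / 16.0;  // mean deviation, gain 1/16
        }
        have_prev = true;
        prev_recv = recv_ts;
        prev_sent = sent_ts;
    }
};
```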

Why it sucks #

Now imagine that I’m talking to you in real time over a VoIP (Voice over IP) application like Discord or Skype. You’re generating audio packets 20 milliseconds long and sending them over the network to me. While I play 20 ms of audio, you send over the next packet. And once I’m done with my current packet, another one is waiting for me, perfectly on time.

But what if a packet takes longer due to network congestion or a longer-than-normal I/O buffer? Well I’m left with no audio to play.

This is the problem with tightly coupling the packetization length of audio to playback in real-time applications.


And we can also face problems where we receive packets out-of-order.


We have to handle the possibility of receiving OOO packets because most real-time audio streaming applications use UDP, which makes no delivery-order guarantees.

Buffers to the rescue #

The jitter buffer is going to help us smooth out packet inconsistencies by caching a couple hundred ms of audio samples.

Yes, this involves a small amount of initial latency when playing audio, which is why we have to keep the buffer very small (100s of milliseconds) compared to the seconds that an application like live streaming could handle without users noticing.


If packet 5 came 120 ms after its intended arrival time, the jitter buffer would allow us to insert it into the corrected order.

Not so trivial? #

Now I wanted to implement an initial static-sized jitter buffer for my project. It sounded pretty easy at first: all we need is a way to know the order in which the packets were created. This is solved by adopting the lightweight RTP (Real-time Transport Protocol), which adds a sequence number (SN) to each packet’s header.
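For reference, the fixed RTP header that carries the SN looks roughly like this, flattened into a plain struct for illustration (on the wire, the first two octets are actually packed bit fields, and all multi-byte fields are big-endian):

```cpp
#include <cstdint>

// The 12-byte RTP fixed header (RFC 3550), flattened for illustration.
// vpxcc packs version/padding/extension/CSRC-count bits; mpt packs the
// marker bit with the 7-bit payload type.
struct RtpHeader {
    uint8_t  vpxcc;            // version (2), padding (1), extension (1), CC (4)
    uint8_t  mpt;              // marker (1), payload type (7)
    uint16_t sequence_number;  // +1 per packet sent; wraps past 65,535
    uint32_t timestamp;        // sampling instant of the first audio octet
    uint32_t ssrc;             // synchronization source identifier
};
```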

However, the SN, as stated by RFC 3550, is an unsigned 16-bit integer, which makes our definition of order a bit more difficult. The SN ranges from 0 ... 2^16-1 (65,535), so once it passes 65,535 it wraps around back to 0.


In the above situation, let’s say our < operator means “older” in the sense that it’s closer to being played back. Then if we simply check whether 0 < 65,534, we might mistakenly insert packet 0 at the end as the oldest packet and play the audio out of order. So we need to override our meaning of < and >.
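To see the failure concretely (a contrived check, not project code):

```cpp
#include <cstdint>

// Naive unsigned comparison misorders packets across the wrap.
// Packet 0 arrived *after* packet 65,534, yet plain < calls it older.
bool naiveOlder(uint16_t a, uint16_t b) {
    return a < b;  // wrong near the wrap-around point
}
```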

Do we have to deal with this? #

What are the chances we have packets 0 and 65,535 in the buffer at once?

Well, taking into account that:

  • RFC 6716 that defines the Opus codec I use says “20 ms frames are a good choice for most applications.”
  • RFC 3550 specifies that the sender should start the SN at a random value.

Cycling through all 65,536 sequence number values at 20 ms per packet takes:

$$65,536 \times 20 \text{ ms} = 1,310,720 \text{ ms} \approx 21.8 \text{ mins}$$

And since the starting SN is random, the first wrap-around lands, on average, halfway through that cycle:

$$\frac{21.8 \text{ mins}}{2} = 10.9 \text{ mins}$$

This is pretty reasonable, and warrants taking a closer look at this edge case.

Signed Overflow Wrapping saves us? #

In order to make a comparator for packets, we need to be able to say if a is ahead of b and vice versa. The solution1 is simply:

// Unsigned subtraction wraps around; reinterpreting the result
// as a signed 16-bit value tells us which packet is ahead
bool newer(uint16_t a, uint16_t b) {
    return static_cast<int16_t>(a - b) > 0;
}

newer(5, 1);     // --> true
newer(2, 65535); // --> true
newer(0, 218);   // --> false

We use the fact that unsigned arithmetic is modular. In modular arithmetic, there are 2 distances from a to b: the positive and the negative distance. Interpreting the result of unsigned subtraction as signed tells us which direction is closer, and subsequently, which number is “ahead” of the other on the modular circle.

Detailed Explanation

*(Circle diagram showing modular arithmetic)*

Let a = 5 and b = 7 with 16 bits:

$$5 - 7 = -2 \equiv 65,534 \pmod{65,536}$$

We can see that unsigned subtraction gives us the negative distance: travel 65,534 steps in the negative direction from 5 and you reach 7.


Key insight: If the negative distance from a to b is the shorter of the 2 distances (unsigned difference < 32,768), then b sits just behind a on the circle, so a is “newer”/“larger” than b, and vice versa.

To make it simpler, we can observe that when the unsigned difference is ≥ 32,768, the MSB is set, making the signed representation negative, and therefore indicating that a is “older”/“smaller” than b.

Popping packets… #

Now that inserting packets into our buffer is done, we need to create the API to pop() packets off the buffer for our consumer. This part is rather straightforward; the only caveat is that we need to track the last SN popped, so that we can detect a missed packet.

If we do detect a missed packet (say we pop 3, 4, then 6, 7), we should tell our audio player that packet 5 is missing and to play silence.
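A sketch of that pop() (the struct and names here are hypothetical, not my project’s actual code), reusing the wrap-aware comparison from earlier so the buffer stays correctly ordered across the 65,535 → 0 wrap:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// "Older" uses the wrap-aware signed reinterpretation, which is a valid
// ordering as long as the buffered window spans fewer than 32,768 SNs.
struct SeqOlder {
    bool operator()(uint16_t a, uint16_t b) const {
        return static_cast<int16_t>(a - b) < 0;
    }
};

struct JitterBuffer {
    std::map<uint16_t, std::vector<uint8_t>, SeqOlder> packets;
    uint16_t last_popped = 0;
    bool started = false;

    void push(uint16_t sn, std::vector<uint8_t> payload) {
        packets.emplace(sn, std::move(payload));
    }

    // Returns the next frame, or std::nullopt for a gap (caller plays
    // silence or runs PLC). The expected SN advances either way.
    std::optional<std::vector<uint8_t>> pop() {
        if (packets.empty()) return std::nullopt;
        uint16_t want = started ? static_cast<uint16_t>(last_popped + 1)
                                : packets.begin()->first;
        started = true;
        last_popped = want;
        auto it = packets.find(want);
        if (it == packets.end()) return std::nullopt;  // missed packet
        std::vector<uint8_t> payload = std::move(it->second);
        packets.erase(it);
        return payload;
    }
};
```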

But we can do even better with something called Packet Loss Concealment (PLC). PLC can be implemented through:

  1. Zero insertion: replace lost frames with silence.
  2. Waveform substitution: reconstruct the missing frame from previous frames.
  3. Generative methods: use an RNN or other model to extrapolate across audio gaps.
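The first two are easy to sketch. A toy version (assuming 16-bit PCM frames; `conceal` is a made-up helper, not a real API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy concealment for a lost frame: repeat the previous frame if we have
// one (a degenerate waveform substitution), otherwise insert silence.
std::vector<int16_t> conceal(const std::vector<int16_t>* prev_frame,
                             std::size_t samples_per_frame) {
    if (prev_frame != nullptr && prev_frame->size() == samples_per_frame) {
        return *prev_frame;  // waveform substitution (crude: verbatim repeat)
    }
    return std::vector<int16_t>(samples_per_frame, 0);  // zero insertion
}
```

Real waveform substitution does better than a verbatim repeat by pitch-matching and cross-fading, but the shape is the same.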

The Opus codec itself provides PLC features: simply pass a null pointer into the decoder in place of a regular buffer of data. It does this through 3 methods2:

  1. CELT - Opus maintains a history buffer of past frames, and then extracts windows that are commonly repeated to fill in for missing frames.
  2. SILK - Uses LPC (Linear Predictive Coding) to model a digital signal as a linear combination of its past values.
  3. DeepPLC - Introduced in Opus 1.5 (2024), uses an RNN to predict acoustic features.3

Conclusion #

Now that’s it for static jitter buffers. I ended up delving into the topic while implementing a VoIP application from scratch as a personal project, and it turned out to be quite interesting for such a simple-sounding thing. However, I’ve currently hit a roadblock with static jitter buffers whenever I receive network bursts. So I’m implementing an adaptive one now; look forward to another post soon 👀.