Low latency microphone audio capture through ADB

35 min read

A few months ago I released Audio Source, an application that sends Android microphone audio to a computer connected through USB. For a couple of minutes, the latency is excellent, almost imperceptible. Over time, especially as the computer gets busy, the latency builds up to a dozen seconds, which makes any audio conversation painful. Having embarrassed myself enough in remote meetings, I decided to get to the bottom of the issue. This is not a story of how I replaced everything with an RTSP server or bought a USB microphone. You will learn about audio encoding, and how to play with sockets, pipes, and buffers.

§
Introduction

This section recaps the basics of digital audio, presents an overview of the components that make Audio Source work, and states the problem we will try to solve next.

§
Digital audio

Audio sampling diagram

The process of sampling makes it possible to convert a continuous audio signal (mechanical sound waves captured as an electrical signal) into a discrete audio signal (a stream of bytes). The Pulse-Code Modulation (PCM) format encodes the amplitude of an audio signal as a sequence of bytes, sampled at a defined rate.

You may decide to encode the amplitude as a 16-bit integer (the number of bits is called the bit depth), so you can represent any input signal magnitude as an integer between -32,768 and 32,767. When you store these samples as a series of bytes, you have to choose the order of the low and high bytes. If the low byte appears first, it is little-endian; the opposite is called big-endian. s16le designates signed 16-bit little-endian samples.
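
To make the encoding concrete, here is a minimal Python sketch (mine, not part of Audio Source) that packs a few amplitudes as s16le bytes with the standard struct module:

Python
import struct

# Three arbitrary amplitudes within the 16-bit range.
samples = [0, 1000, -32768]

# '<h' means little-endian ('<') signed 16-bit integer ('h'), i.e. s16le.
pcm = b''.join(struct.pack('<h', s) for s in samples)

print(pcm.hex(' '))  # 00 00 e8 03 00 80: the low byte comes first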

The sample rate is the number of times per second you take a sample from the input signal. For example, Audio CDs are sampled at 44100 Hz, which means 44100 samples per second (1 Hz = 1 s⁻¹). The Nyquist–Shannon sampling theorem tells us that the sample rate needs to be at least twice the highest frequency you wish to record. If the input signal contains higher frequencies, your recording will have audible aliasing.

The same phenomenon happens when you film the blades of a helicopter: their rotation frequency (rotations per second) appears much lower than it really is. In the context of audio, you get spurious sounds in the audible range. Because human hearing is limited to around 20,000 Hz, 44100 Hz is enough to sample the entire hearing range, and it leaves enough room to remove higher frequencies with an analog low-pass filter.
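
As a rough illustration of where such a spurious tone ends up, here is a small sketch of mine (not from Audio Source) that computes the folded frequency of a tone sampled above the Nyquist limit:

Python
def alias_frequency(tone_hz, sample_rate_hz):
	# A sampled tone is indistinguishable from its distance to the
	# nearest multiple of the sample rate, folded below Nyquist.
	nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
	return abs(tone_hz - nearest_multiple)

# A 30 kHz tone sampled at 44100 Hz is heard as a 14.1 kHz tone.
print(alias_frequency(30_000, 44_100))  # 14100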

The component that performs the sampling is called an Analog-to-Digital Converter (ADC). These chips add a little bit of latency, usually a few microseconds, because they have to average the amplitude of the audio signal over the sample duration. At 44,100 Hz, each sample lasts ~23 µs. Transmitting it takes a few more cycles, but this is negligible compared to the latency we will encounter at upper audio processing stages.

Note that audio recordings may have multiple channels. Mono recordings only have one channel. Stereo recordings have two, for the left and right sides. The samples from all the channels are commonly grouped into audio fragments, interleaving the samples from each channel in the output PCM stream.
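
As a small sketch of interleaving (assumptions mine: a stereo s16le stream), each frame packs one sample per channel:

Python
import struct

left = [100, 200, 300]      # Successive left-channel amplitudes.
right = [-100, -200, -300]  # Successive right-channel amplitudes.

# Frames are laid out as L0 R0 L1 R1 L2 R2 ... in the PCM stream.
pcm = b''.join(struct.pack('<hh', l, r) for l, r in zip(left, right))

print(len(pcm))  # 12 bytes: 3 frames x 2 channels x 2 bytes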

Assuming a mono, s16le, 44100 Hz stream, the amount of information per second is:

16\ \text{bits/sample} \times 44100\ \text{samples/second} = 705{,}600\ \text{bits/second}.

From this result, you can easily compute how many seconds of audio a 1024-byte buffer contains:

\begin{align*} \frac{1024\ \text{bytes}}{705{,}600\ \text{bits/second}} &= \frac{8192\ \text{bits}}{705{,}600\ \text{bits/second}} \\ &\approx 11.6\ \text{ms}. \end{align*}

For comparison, a comfortable real-time audio communication requires a one-way latency of less than 300 ms, with noticeable degradation above 150 ms, so that leaves us some room.
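
The same arithmetic is easy to script. Here is a small helper of mine, assuming the mono, s16le, 44100 Hz stream described above:

Python
SAMPLE_RATE = 44100   # samples per second
SAMPLE_SIZE = 2       # bytes per sample (s16le, mono)

def buffer_duration_ms(n_bytes):
	# One second of audio is 44100 * 2 = 88,200 bytes.
	return 1000 * n_bytes / (SAMPLE_RATE * SAMPLE_SIZE)

print(round(buffer_duration_ms(1024), 1))  # 11.6 ms for a 1024-byte buffer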

§
Architecture

Audio Source architecture diagram

Audio Source is the first piece of the puzzle that allows recording audio from an Android device and using it as an input on a computer. It acquires audio fragments from an instance of AudioRecord and passes them to a LocalSocket. These classes are part of the Android framework.

The audio is exposed on the phone through a UNIX Domain Socket (UDS). This socket family supports efficient local byte-oriented streams, without relying on the TCP stack: the data is simply passed between the connected socket pairs within the Linux kernel. The downside is that it is only accessible locally on the same machine.
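
As a minimal standalone illustration of this socket family (not Audio Source code), a connected pair simply moves bytes through a kernel buffer:

Python
import socket

# A pair of connected UNIX domain stream sockets, entirely local.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

a.sendall(b'ping')
print(b.recv(4))  # b'ping': no IP, no TCP, just a kernel byte stream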

The second piece of the puzzle is the Android Debug Bridge (ADB). It creates a link between a local socket on an Android phone and another socket on a computer attached through USB. This article treats this part as a reasonably efficient black box, without diving into the intricacies of USB and ADB.

The third piece of the puzzle is the sound server. The most common option for Linux is Pulse Audio. Its purpose is to receive or send audio to the hardware and make it available to multiple applications with a low latency. Pulse Audio supports modules that can register virtual sources. In this article, we will rely on the pipe source module to ingest audio fragments from a local Linux pipe.

Since ADB only forwards data between two sockets, you have to transfer the audio from the socket managed by ADB to the Pulse Audio source pipe. A simple way to do that is to use Socat, a kind of Swiss Army knife for network communications. You will see that it is reasonably simple to replicate this functionality with Python, which will serve as a basis to solve the latency issue.

§
The problem

Everything seems to work fine with an amazing latency. It feels almost real time, at least for a few minutes. Then, the latency slowly builds up and settles at about 12 seconds. This is especially noticeable as the system gets busy.

That may indicate that a consumer downstream doesn't process audio fragments fast enough. You can simulate this issue by suspending socat for a few seconds, and then resuming it:

$ pkill -STOP socat
$ sleep 5
$ pkill -CONT socat

After that, you will hear a huge delay. The instinctive explanation is that suspending socat delays the source, which causes some buffering upstream. But that doesn't explain why the stream starts with a barely noticeable latency, why it cannot return to that state, and where exactly these audio chunks accumulate.

This article goes through each layer to investigate how it contributes to the overall latency. You will see that the capacity of the buffers along the transmission path is only tangentially related to the overall latency, and that they only become an issue because of the specifics of audio playback.

§
Audio Source

In this section, we will take a look at the architecture of Audio Source, how it interacts with Linux and the Android SDK, and how these layers exchange data. The following diagram provides an overview of the main buffers along the audio path that we will explore next.

Android audio buffers diagram

§
ALSA

The lowest layer in the audio stack is the Advanced Linux Sound Architecture (ALSA). Audio device drivers implement the hardware side to configure sound cards and set up the PCM data streams, while applications rely on the user API.

Exchanging data between the hardware and the kernel requires a hardware buffer. Through Direct Memory Access (DMA), sound cards can write to system memory, notify the kernel through a hardware interrupt that some data is available, and continue writing to the next chunk of the buffer.

The period is the amount of data the sound card writes before triggering an interrupt. Only after an interrupt does the system know it can access the entire period, so the period also corresponds to the absolute minimum latency you can expect from the sound card.

Each period contains a fixed number of fragments. The hardware configuration defines the format of the PCM data written to the buffer, which influences the sample size. Stereo audio fragments contain two samples for the left and right channels. As described in the introduction, the samples are the raw bytes that describe the magnitude of the original audio signal.

It is possible to query the kernel for the actual values of these hardware parameters. ALSA exposes the sound hardware as a directory tree of cards, devices, and subdevices. For example, the path card0/pcm0c/sub0 includes the capture device pcm0c, as the trailing c indicates (playback devices have a trailing p instead). When a device is active, you can get its configuration with the following command:

$ adb shell cat /proc/asound/card0/pcm0c/sub0/hw_params
access: RW_INTERLEAVED
format: S16_LE
subformat: STD
channels: 1
rate: 44100 (44100/1)
period_size: 896
buffer_size: 1792

The number of channels, the sample format, and the sample rate match the recording parameters requested by Audio Source. Note that period_size and buffer_size are given in fragment units. With such streams, a fragment is 16 bits, or 2 bytes, which means the size of a period is 1792 bytes.

As in the introduction, you can compute the length of a period: around 20 ms, which is a lower bound for the overall recording latency. It also tells us that the sound card triggers an interrupt every 20 ms, or about 50 times per second.
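
A quick back-of-the-envelope script (mine, plugging in the hw_params values above) gives the same numbers:

Python
RATE = 44100        # Hz, from hw_params
PERIOD_SIZE = 896   # frames per period, from hw_params
BUFFER_SIZE = 1792  # frames in the whole hardware buffer

print(round(1000 * PERIOD_SIZE / RATE, 1))  # 20.3 ms per period
print(round(RATE / PERIOD_SIZE, 1))         # 49.2 interrupts per second
print(round(1000 * BUFFER_SIZE / RATE, 1))  # 40.6 ms in the hardware buffer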

The second interesting fact is that the buffer is twice as large as the period, for it is the minimum size required to allow the sound card to write the current period while the system reads the last one.

Sound cards connected through other interfaces may require larger buffers. If the system is overloaded and cannot process the interrupts in time, the sound card will eventually overwrite an unread period: this is a buffer overrun, and that period is lost. As you will see, the same issue happens at higher audio processing levels.

§
AudioRecord

Android provides the AudioRecord class to record audio from sources like a microphone. First, you have to choose the recording parameters, as described previously:

AudioSource.java
int SAMPLE_RATE = 44100;
int CHANNEL_CONFIG = CHANNEL_IN_MONO;
int AUDIO_FORMAT = ENCODING_PCM_16BIT; // s16le

AudioRecord sits on top of a number of layers, including the vendor-specific Hardware Abstraction Layer (HAL), which doesn't necessarily rely on ALSA. The HAL can buffer and process the recorded audio to compensate for the poor recording quality of mobile phone microphones. That creates another level of buffering with additional latency.

The Android audio server (called Audio Flinger) sits above the HAL. Because it runs with a high priority, it is able to fetch the audio fragments from the hardware buffer in a timely manner and place them into its own buffer. This buffer can be used for resampling and post-processing, and it can be shared between multiple applications.

The following command prints some statistics about an Audio Flinger recording stream:

Console
$ adb shell dumpsys media.audio_flinger
Input thread 0x7aa974c0c0, name AudioIn_56, tid 1507, type 3 (RECORD):
  I/O handle: 86
  Standby: no
  Sample rate: 44100 Hz
  HAL frame count: 896
  HAL format: 0x1 (AUDIO_FORMAT_PCM_16_BIT)
  HAL buffer size: 1792 bytes
  Channel count: 1
  Channel mask: 0x00000010 (front)
  Processing format: 0x1 (AUDIO_FORMAT_PCM_16_BIT)
  Processing frame size: 2 bytes
  Pending config events: none
  Output device: 0x1 (AUDIO_DEVICE_OUT_EARPIECE)
  Input device: 0x80000004 (AUDIO_DEVICE_IN_BUILTIN_MIC)
  Audio source: 1 (AUDIO_SOURCE_MIC)
  Timestamp stats: n=0 disc=0 cold=0 nRdy=0 err=8094 jitterMs(unavail) localSR(nan, nan) correctedJitterMs(unavail)
  Timestamp corrected: no
  Last read occurred (msecs): 36
  Process time ms stats: ave=0.0919268 std=0.00985403 min=0.056198 max=2.16047
  Hal read jitter ms stats: ave=-0.0213719 std=1.69454 min=-9.57267 max=10.9895
  AudioStreamIn: 0x7b372f26c0 flags 0 (AUDIO_INPUT_FLAG_NONE)
  Frames read: 7252224

Notice that the HAL parameters match what we found by querying ALSA in the previous section.

AudioRecord has two parts linked by the Java Native Interface (JNI): the Java API, and the native C++ implementation. Communication with the audio server happens through the binder Inter-Process Communication (IPC) mechanism. The buffer from which AudioRecord reads its audio fragments is shared between the audio server and the application. This technique allows for the transmission of audio chunks between two processes without copying them to an additional buffer, something we won't be able to avoid next.

To successfully instantiate this class, you have to indicate a buffer size greater than the value returned by AudioRecord.getMinBufferSize.

AudioSource.java
int minBufSize = AudioRecord.getMinBufferSize(
	SAMPLE_RATE,
	CHANNEL_CONFIG,
	AUDIO_FORMAT
);

Log.i(App.TAG, "minBufferSize=" + minBufSize);

On my device, this value is 3584 bytes, which matches the size of the hardware buffer (1792 frames of 2 bytes), and makes sense because it has the same double-buffering constraints.

Console
$ adb logcat | grep AudioSource
I AudioSource: minBufferSize=3584

Calling AudioRecord.read copies a slice of audio data from the internal recording buffer to the buffer passed as an argument, and advances the read pointer of the internal buffer. If you do not call AudioRecord.read quickly enough, the write head may advance past the read head, discarding unread audio fragments. This is the same buffer overrun issue as with the hardware buffer, and it may occur if the scheduler doesn't give enough CPU time to the recording thread.

To prevent this situation, the documentation advises instantiating AudioRecord with a slightly larger buffer, twice the minimum in the following code snippet:

AudioSource.java
AudioRecord recorder = new AudioRecord(
	DEFAULT,        // Default source (mic).
	SAMPLE_RATE,
	CHANNEL_CONFIG,
	AUDIO_FORMAT,
	2 * minBufSize
);

To process the audio samples, you only need to call AudioRecord.read in a loop and make use of the data however you want. This operation blocks the calling thread until some audio samples are available and returns the number of bytes that were read.

AudioSource.java
byte[] buf = new byte[1024];

recorder.startRecording();

while (recording) {
	int r = recorder.read(buf, 0, buf.length);

	if (r < 0) {
		break;
	}

	// Do something with buf[0..r].
}

While the recorder is active, you can call AudioRecord.getBufferSizeInFrames, which tells you the actual size of the buffer.

AudioSource.java
int lastBufferSizeInFrames = 0;

while (recording) {
	int r = recorder.read(buf, 0, buf.length);

	if (r < 0) {
		break;
	}

	// Do something with buf[0..r].

	int bufferSizeInFrames = recorder.getBufferSizeInFrames();

	if (bufferSizeInFrames != lastBufferSizeInFrames) {
		Log.i(App.TAG, "bufferSizeInFrames=" + bufferSizeInFrames);
		lastBufferSizeInFrames = bufferSizeInFrames;
	}
}

Logging output:

Console
$ adb logcat | grep AudioSource
I AudioSource: minBufferSize=3584
I AudioSource: bufferSizeInFrames=3584

Remember that we passed 2 * minBufSize as the buffer size, and because each fragment (or frame) holds 2 bytes, 3584 frames are equivalent to 7168 bytes, twice the minimum buffer size.

Returning to the original problem, a possible source of latency is the AudioRecord buffer. But even after suspending and resuming socat, its size remains the same. From the calculation in the introduction, you know this buffer can hold at most about 81 ms of audio, which isn't enough to cause multiple seconds of latency.

Another possibility is scheduling: Audio Source may not be able to keep up with the flow of audio fragments. This hypothesis is quickly discarded as there are no audible skips, which means the samples are being processed fast enough. Additionally, the internal buffer of AudioRecord couldn't provide packets that are consistently multiple seconds late, which rules out slow calls to AudioRecord.read.

Now that you can capture a chunk of audio from the microphone, let's make it available to other applications.

§
LocalSocket

LocalServerSocket gives you the ability to listen for local connections on a UNIX Domain Socket (UDS). These sockets are similar to TCP sockets, except they are local to the machine, only transfer data inside the Linux kernel, and do not run any of the TCP flow control algorithms.

Indeed, there is no need to handle reliable communication over IP networks, so their implementation doesn't require Nagle's algorithm with its negative effect on latency.

§
Handling connections

These sockets are usually bound to a path on the local file system, but there is also a flavor of "abstract" UDS that is identified only by a name.

AudioSource.java
String SOCKET_NAME = "audiosource";

LocalServerSocket serverSocket = new LocalServerSocket(SOCKET_NAME);

To wait for a client connection, you need to call the blocking method LocalServerSocket.accept. When a client connects, it returns a LocalSocket that contains two streams, input and output, for bidirectional communication.

AudioSource.java
LocalSocket socket = serverSocket.accept();

OutputStream ostream = socket.getOutputStream();
ostream.write("Hello, World!".getBytes());

serverSocket.close();

Audio Source takes the audio samples from AudioRecord and passes them to the client connected through the LocalSocket:

AudioSource.java
while (recording) {
	int r = recorder.read(buf, 0, buf.length);

	if (r < 0) {
		break;
	}

	ostream.write(buf, 0, r);
}

In practice, it is a little more involved to allow closing the connection from the Android side and to record in a separate thread. Nonetheless, the general idea is there.

§
Buffering

Each socket has two associated buffers: send and receive. You can get the size of the send buffer with LocalSocket.getSendBufferSize, and set it with LocalSocket.setSendBufferSize. (Under the hood, these methods call Linux's getsockopt and setsockopt with the option SO_SNDBUF; the receive-side equivalents use SO_RCVBUF.)

First, let's try to log the default size:

AudioSource.java
int lastSendBufferSize = 0;

while (recording) {
	int r = recorder.read(buf, 0, buf.length);

	if (r < 0) {
		break;
	}

	ostream.write(buf, 0, r);

	int sendBufferSize = socket.getSendBufferSize();

	if (sendBufferSize != lastSendBufferSize) {
		Log.i(App.TAG, "sendBufferSize=" + sendBufferSize);

		lastSendBufferSize = sendBufferSize;
	}
}

Output:

Console
$ adb logcat | grep AudioSource
I AudioSource: sendBufferSize=212992

That means the send buffer can fill up to 212992 bytes, which is about 2.4 seconds of audio. The fix is easy:

AudioSource.java
socket.setSendBufferSize(1024);

Output:

Console
$ adb logcat | grep AudioSource
I AudioSource: sendBufferSize=2048

The value you pass to setSendBufferSize gets doubled by the kernel, and the result cannot be less than 2048 bytes. This is the smallest buffer size you can get, and it represents roughly 23.2 ms of audio with our sampling parameters.
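
You can observe the same kernel behavior from Python on the computer side. This is a standalone sketch of what setSendBufferSize does under the hood; the exact minimum depends on the kernel version, so the printed value may differ from the 2048 seen on Android:

Python
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

# Linux stores double the requested value, clamped to a kernel minimum.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1024)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))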

§
ADB

When you connect your phone to a computer, you can enable the Android Debug Bridge (ADB), which gives you access to development features. To forward the abstract UDS created in the previous section to a socket of the same name on the attached computer, you can run:

Console
$ adb forward localabstract:audiosource localabstract:audiosource

ADB transparently links these two sockets through the USB connection, so you can access Audio Source through a local abstract UDS on the computer. The USB protocol used to connect them and its internal buffers are beyond our control. But we can check whether the socket on the computer side does the same buffering as the socket on the Android side.

Although you do not have access to it directly, you can use the command ss to list all the sockets on the system. You can pass the option -p to get the list of connected processes, and filter the output for audiosource with grep:

Console
$ ss -p | grep audiosource
Netid  State  Recv-Q  Send-Q  Local Address:Port      Peer Address:Port      Process
u_str  ESTAB       0  212992   @audiosource 54719224             * 54713821  users:(("adb",pid=877449,fd=9))

The send queue contains about 213 kB of data. This value is no coincidence: it corresponds to /proc/sys/net/core/wmem_default, the default capacity of UDS send buffers, and represents about 2.4 seconds of audio. Unfortunately, there isn't much you can do about this buffer aside from changing the global limit.

§
Pulse Audio

Linux audio recording buffers diagram

§
Direct playback

Let's not forget about our goal: getting the audio to the sound server. Like everything else on Linux, there are multiple competing sound server implementations. Pulse Audio is the most common on the Linux desktop, so it will be the target of this article. This sound server relies on ALSA to exchange audio data with the hardware. Using its own buffer, it can resample the audio and mix it with other sources.

Socat can be used to connect to the ADB abstract UDS. As an initial proof of concept, you can simply pipe the PCM data to pacat to play it directly:

Console
$ socat ABSTRACT-CONNECT:audiosource STDOUT \
	| pacat --channels=1 --format=s16le --rate=44100

pacat has a --volume option, a linear value where 65536 corresponds to 100%, that you may want to raise if you can barely hear anything. This setup allows you to listen to the audio PCM stream directly without going through a Pulse Audio recording source.

§
Pipe source

pacat sends the audio to a sink, but it doesn't allow us to use this stream as input for recording purposes, or as a source in Pulse Audio terminology. Fortunately, Pulse Audio provides a module to register a virtual source that can ingest raw audio from a pipe.

§
Configuration

Pipes are another kind of IPC within Linux, with an API similar to files, except they operate on a shared kernel buffer. Note that in the previous section, socat and pacat are connected through an anonymous pipe linking socat STDOUT and pacat STDIN. A pipe bound to the file system is called a FIFO, but there is no conceptual difference between them.

To create a virtual pipe source, the first step is to load the appropriate module:

Console
$ pactl load-module module-pipe-source \
	source_name=android                \
	channels=1                         \
	format=s16                         \
	rate=44100                         \
	file=/tmp/audiosource

Then you can run socat again, this time connecting the other end to the pipe:

Console
$ socat -u ABSTRACT-CONNECT:audiosource PIPE:/tmp/audiosource

Note the option -u, which forces a unidirectional connection from left to right; otherwise socat would also open the pipe for reading, consuming what it just wrote and forwarding it back to the abstract socket.

After starting socat, you can verify that the source is available:

Console
$ pactl list sources short
4160129	android	PipeWire	s16le 1ch 44100Hz	SUSPENDED

Note that it is marked as suspended. Indeed, Pulse Audio disables the source when no application reads from it. Unsurprisingly, this has the same effect as suspending socat.

Beyond the VU-meter in pavucontrol, the following section provides a way to listen to this source.

§
Playback

Linux audio playback buffers diagram

Similar to our initial setup, you can use parec to capture raw audio from the source and send it to pacat:

Console
$ parec -d android | pacat

These commands add a noticeable latency on top of the intrinsic source latency:

  • parec latency to capture the source from the sound server into its own buffer and copy it to STDOUT.
  • Pipe buffering between parec and pacat (which can be configured with stdbuf).
  • pacat latency to capture the source from STDIN into its own buffer and send it back to the sound server.
  • Sound server and driver latency to copy the audio fragments from the playback buffer to the hardware buffer.
  • Sound card latency to play the audio fragments from the hardware buffer.

There is an alternative way to achieve the same thing using the loopback module within the audio server, which cuts some of these sources of latency. See Pipe a source directly into a sink on the Arch Linux Wiki.

§
socat buffer

socat reads a chunk of data from the ADB socket, and writes it to the output pipe. The option -b controls the size of the buffer used for the copy:

Console
$ socat -u -b1024 ABSTRACT-CONNECT:audiosource PIPE:/tmp/audiosource

This internal buffer is only used to send the data from the input socket to the output pipe (passing the data from kernel space to user space and back). You can easily replace this command with the following Python script to make this mechanism explicit:

socat.py
import socket

inp = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
inp.connect('\0audiosource')

out = open('/tmp/audiosource', 'wb')

while True:
	buf = inp.recv(1024)

	if not buf:
		break

	out.write(buf)

The code is very similar to what we've seen previously with Audio Source.

§
Internal buffering

Pulse Audio also performs internal buffering. If you pipe the audio into pacat -v, you can view some statistics:

Console
$ socat -u ABSTRACT-CONNECT:audiosource STDOUT | pacat -v --channels=1
Opening a playback stream with sample specification 's16le 1ch 44100Hz' and channel map 'mono'.
Connection established.
Stream successfully created.
Buffer metrics: maxlength=4194304, tlength=176400, prebuf=174638, minreq=1764
Using sample spec 's16le 1ch 44100Hz', channel map 'mono'.
Connected to device output (index: 44, suspended: no).
Stream started.
Stream underrun.
Stream started.
Stream underrun.
Time: 15.238 sec; Latency: 2046313 usec.

Suspending socat causes buffer underruns because there are no incoming audio packets during that time. After resuming, the latency is pretty high, around 2 seconds, because Pulse Audio grew its buffer accordingly (tlength=176400 bytes is exactly 2 seconds of audio at 88,200 bytes per second):

  • tlength: desired length of the audio buffer according to the target latency.
  • prebuf: amount of data necessary before starting the audio stream.
  • minreq: minimum audio chunk size requested to the client.

pacat has the flag --latency-msec=<msec> to set the maximum latency for this source, for instance, 20 ms:

Console
$ socat -u ABSTRACT-CONNECT:audiosource STDOUT | pacat -v --channels=1 --latency-msec=20
Opening a playback stream with sample specification 's16le 1ch 44100Hz' and channel map 'mono'.
Connection established.
Stream successfully created.
Buffer metrics: maxlength=4194304, tlength=1416, prebuf=946, minreq=472
Using sample spec 's16le 1ch 44100Hz', channel map 'mono'.
Connected to device alsa_output.pci-0000_00_1f.3.analog-stereo (index: 44, suspended: no).
Stream started.
Stream underrun.
Stream started.
Time: 9.437 sec; Latency: 18891 usec.

Pulse Audio keeps a small buffer to maintain the latency under 20 ms. Note that by default it tries to pick a reasonably low latency, as long as the source keeps up (which is clearly not the case while socat is suspended).

Setting the latency on the command line limits how much the buffer can grow to absorb source jitter, but it says nothing about the buffering upstream, so it doesn't solve the original problem.

§
Pipe buffering

The last main source of delay is the pipe. You can inspect how many bytes are queued up and what the pipe capacity is through Linux's ioctl and fcntl interfaces:

socat.py
#!/usr/bin/env python3

import array
import fcntl
import termios
import time

def get_pipe_queue_bytes(fd):
	buf = array.array('i', [0])
	fcntl.ioctl(fd, termios.FIONREAD, buf)
	return buf[0]

def get_pipe_queue_size(fd):
	return fcntl.fcntl(fd, fcntl.F_GETPIPE_SZ)

with open('/tmp/audiosource', 'w') as f:
	while True:
		print('\rpipe usage/capacity: {}/{}'.format(
			get_pipe_queue_bytes(f),
			get_pipe_queue_size(f)
		), end='', flush=True)

		time.sleep(1)

Initially:

Output
pipe usage/capacity: 2048/65536

After pkill -STOP socat:

Output
pipe usage/capacity: 0/65536

After pkill -CONT socat:

Output
pipe usage/capacity: 65536/65536

The capacity we observe, 65536 bytes, is the Linux default, and already holds about 0.74 seconds of audio. A process can enlarge a pipe up to /proc/sys/fs/pipe-max-size, around 1 MB by default, which would correspond to approximately 12 seconds of audio! fcntl provides a command to set the pipe capacity:

socat.py
fcntl.fcntl(out, fcntl.F_SETPIPE_SZ, 4096)

§
Reducing the latency

Adjusting the buffer capacity is useful to prevent the accumulation of old fragments when the flow is interrupted downstream. But this tweak alone doesn't help when a consumer doesn't read fragments fast enough from a reliable communication channel.

§
Head-of-line blocking

When Pulse Audio doesn't read from the stream fast enough, which happens when the system is overloaded, the packets accumulate in the buffers upstream. This is what slowly increases the delay. Controlling some of these buffers might help maintain a reasonable latency, but the ADB buffer is large and beyond your control.

While socat is suspended, Pulse Audio replaces the missing fragments with silence during the buffer underruns. Because the transmission is reliable, once socat resumes, all the accumulated packets are sent in order, but they are now several seconds late.

Even worse, they delay any upcoming packets, which now need to pass through all the filled buffers. When all the intermediate buffers are full, the audio latency is at its maximum, and AudioRecord experiences a buffer overrun. This is a head-of-line blocking situation.

Initially the latency is almost perfect, so there is no real need to control the various buffers along the path: the packets are produced and forwarded fast enough. The delay only appears when Pulse Audio cannot read some fragments in due time and inserts silence in their place.

Instead of controlling all the intermediate buffers, you want to make the consumer fast enough. Because Pulse Audio can't play the audio faster to catch up to the source, you have to discard late fragments.

§
Discarding late fragments

Going back to our Python implementation of Socat, let's try to make it discard late audio chunks just before they are forwarded to Pulse Audio through the pipe.

PCM data only contains the raw audio samples, so you do not have access to any timing information. Because the two calls to recv and write block, you don't know whether you can send more data or not without blocking. When this problem arises, it is usually time to turn to non-blocking sockets.

The idea is to make the input non-blocking, so you can take as many chunks as available, put them into a queue with a limited length, and write these selected fragments to the output. If the chunks arrive faster than they can be written, they will queue up and get discarded when the queue gets full.

In its collections module, Python has a deque that supports a maximum length. Calling the method append when it is full discards the oldest element:

socat.py
from collections import deque
import socket

inp = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
inp.connect('\0audiosource')
inp.setblocking(False)

out = open('/tmp/audiosource', 'wb')

while True:
	q = deque(maxlen=4)

	while True:
		try:
			buf = inp.recv(1024)
		except BlockingIOError:
			break

		if not buf:
			raise SystemExit  # The peer closed the connection.

		q.append(buf)

	while q:
		out.write(q.popleft())

There is a slight issue with this code. If the packets arrive too slowly, then the buffer stays empty and you have a busy wait. The simple way to solve this problem is to call sleep:

socat.py
from collections import deque
from time import sleep
import socket

inp = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
inp.connect('\0audiosource')
inp.setblocking(False)

out = open('/tmp/audiosource', 'wb')

q = deque(maxlen=4)

while True:
	while True:
		try:
			buf = inp.recv(1024)
		except BlockingIOError:
			break

		if not buf:
			raise SystemExit  # The peer closed the connection.

		q.append(buf)

	while q:
		out.write(q.popleft())

	sleep(0.05)

A better way to solve this problem is to use poll. With register(inp, POLLIN), calls to poll block until you can read something from inp:

socat.py
from collections import deque
import select
import socket

inp = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
inp.connect('\0audiosource')
inp.setblocking(False)

out = open('/tmp/audiosource', 'wb')

p = select.poll()
p.register(inp, select.POLLIN)

q = deque(maxlen=4)

while True:
	p.poll()

	while True:
		try:
			buf = inp.recv(1024)
		except BlockingIOError:
			break

		if not buf:
			raise SystemExit  # The peer closed the connection.

		q.append(buf)

	while q:
		out.write(q.popleft())

In principle, this program discards any chunk that cannot be written to the output in time. For this to work, you have to set a low pipe capacity, otherwise you can keep writing as long as the pipe isn't full. The actual back pressure is applied through this pipe, and a 4096-byte pipe only holds about 46 ms of audio.

§
Pipe as a queue

There are still some issues with the previous solution. recv returns at most 1024 bytes, but it may return less. Processing smaller chunks increases the number of iterations and makes the program less efficient. Additionally, the deque adds yet another level of buffering before the chunks reach the pipe, so it works like an extension of it. Finally, recv allocates a new buffer for each chunk; it would be more efficient to reuse the same one. Let's try to solve these issues.

First, you can revert to a blocking read, but this time using the flag MSG_WAITALL. This option tells the syscall to return only once it has written the requested number of bytes into the buffer, so the program processes chunks of exactly 1024 bytes per iteration. (The syscall may still be interrupted by a signal and return fewer bytes, but that doesn't affect the logic.)

Second, you can replace recv by recv_into. The first argument is the destination buffer, so you can use a bytearray, the mutable version of a byte buffer.

Finally, you can use the pipe itself as a discarding queue. A blocking write won't return until all the input data has been written, so you would be back to the initial blocking implementation. Instead, you have to use a non-blocking write. What happens if the call to write returns having written fewer bytes than the size of the chunk? You could discard the remaining bytes, but you would have to make sure that the last fragment wasn't half written. In that case, you would have to write the other half at the next iteration, and adjust recv_into to fill only the remaining buffer space.

Fortunately, you do not have to do any of this. Any write of at most PIPE_BUF bytes, equal to 4096 on Linux, is atomic. That means the pipe really works like a queue: the call to write either writes the whole 1024-byte chunk and returns 1024, or fails with BlockingIOError without writing anything. A 4096-byte pipe filled with atomically written 1024-byte chunks will therefore contain at most 4 chunks.
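
You can verify both the atomicity and the capacity with a small standalone experiment (mine, assuming a Python whose fcntl module exposes F_SETPIPE_SZ):

Python
import fcntl
import os

r, w = os.pipe()
fcntl.fcntl(w, fcntl.F_SETPIPE_SZ, 4096)
os.set_blocking(w, False)

chunks = 0

try:
	while True:
		os.write(w, bytes(1024))  # Atomic: writes all 1024 bytes or fails.
		chunks += 1
except BlockingIOError:
	pass

print(chunks)  # 4: the 4096-byte pipe holds exactly four 1024-byte chunks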

There is a slight deviation from the behavior of the deque: instead of skipping the oldest fragment when the pipe is full, it discards the most recent ones that can't be written without blocking. As a proof, consider the same number of buffered fragments for each situation:

  1. The pipe contains 2 chunks, the queue contains 2 chunks, delimited by |.

    • [f1 f2|f3 f4] (initially).
    • [f1 f2|f4 f5] (after f5, discard f3).
    • [f2 f4|f5   ] (after 1 consumed chunk).
    • [f2 f4|f5 f6] (after f6).
    • [f4 f5|f6 f7] (after 1 consumed chunk + f7).
  2. The pipe contains 4 chunks.

    • [f1 f2 f3 f4] (initially).
    • [f1 f2 f3 f4] (after f5, discard f5).
    • [f2 f3 f4   ] (after 1 consumed chunk).
    • [f2 f3 f4 f6] (after f6).
    • [f3 f4 f6 f7] (after 1 consumed chunk + f7).

When chunks are discarded, the deque forwards newer chunks sooner than the pipe does. In practice, it doesn't really matter, because the pipe is quite small. Given enough time, both situations end up synchronized, so using the pipe alone doesn't add any further delay.

socat.py
import fcntl
import os
import socket
import sys

BUF_SIZE = 1024
PIPE_SIZE = 4096

# At most PIPE_BUF (4096 bytes on Linux) to keep writes atomic.
assert BUF_SIZE <= 4096

def socat(inp, out):
	fcntl.fcntl(out, fcntl.F_SETPIPE_SZ, PIPE_SIZE)

	flags = fcntl.fcntl(out, fcntl.F_GETFL)
	fcntl.fcntl(out, fcntl.F_SETFL, flags | os.O_NONBLOCK)

	buf = bytearray(BUF_SIZE)

	while True:
		n = inp.recv_into(buf, BUF_SIZE, socket.MSG_WAITALL)

		if n == 0:
			break

		try:
			out.write(buf[:n])
		except BlockingIOError:
			pass

def main():
	sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
	sock.connect('\0audiosource')

	with open('/tmp/audiosource', 'wb') as fifo:
		socat(sock, fifo)

if __name__ == '__main__':
	main()

§
Conclusion

The audio delay is the consequence of excessive buffering. Changing the size of the intermediate buffers improves the latency, but doesn't fix the underlying issue, which is the inevitable desynchronization between the producer and the consumer.

The issue with audio playback is that each chunk corresponds to an incompressible timespan. If Pulse Audio doesn't receive the next audio chunk in time, it replaces it with silence. When it finally arrives, Pulse Audio plays this audio chunk late, which delays all the subsequent data in the stream.

For a short span of time, you could play back the audio faster to catch up with the source, assuming you can adjust the pitch. For a longer duration, the only solution to restore a low latency is to discard excess audio chunks from the reliable communication channel. Discarding excess data downstream keeps the upstream buffering low, which makes the capacity of the upstream buffers unimportant. The downside of this process is that it produces audible skips.

The last piece of the puzzle is to properly exploit the properties of the Linux API to limit the size of the pipe between the discarding and the consuming processes. It is important to control the size of this buffer, because it is responsible for applying the consumer back pressure, and so it directly contributes to the overall audio latency.