A few months ago I released Audio Source, an application that sends Android microphone audio to a computer connected through USB. For a couple of minutes, the latency is excellent, almost imperceptible. Over time, especially as the computer gets busy, the latency builds up to a dozen seconds, which makes any audio conversation painful. Having embarrassed myself enough in remote meetings, I decided to get to the bottom of the issue. This is not a story of how I swapped everything for an RTSP server or bought a USB microphone. You will learn about audio encoding, and how to play with sockets, pipes, and buffers.
§Introduction
This section recaps the basics of digital audio, presents an overview of the components that make Audio Source work, and states the problem we will try to solve next.
§Digital audio
The process of sampling makes it possible to convert a continuous audio signal (mechanical sound waves captured as an electrical signal) into a discrete audio signal (a stream of bytes). The Pulse-Code Modulation (PCM) format encodes the amplitude of an audio signal as a sequence of bytes, sampled at a defined rate.
You may decide to encode the amplitude as a 16-bit integer, called the bit depth, so you can represent any input signal magnitude by an integer between -32,768 and 32,767. When you store these samples as a series of bytes, you have to choose the order of the low and high bytes. If the low bytes appear first, it is little-endian. The opposite is called big-endian. s16le designates signed 16-bit little-endian samples.
The sample rate is the number of times per second you take a sample from the input signal. For example, Audio CDs are sampled at 44100 Hz, which means 44100 samples per second (44.1 kHz). The Nyquist–Shannon sampling theorem tells us that the sample rate needs to be at least twice the highest frequency you wish to record. If the input signal contains higher frequencies, your recording will have audible aliasing.
The same phenomenon happens when you film the blades of a helicopter: their rotation frequency appears to be much lower than it really is. In the context of audio, you get spurious sounds in the audible range. Because human hearing is limited to around 20,000 Hz, 44100 Hz is enough to sample the entire audible range, and it leaves enough room to remove higher frequencies with an analog low-pass filter.
The component that performs the sampling is called an Analog-to-Digital Converter (ADC). These chips add a little bit of latency, usually a few microseconds, because they have to average the amplitude of the audio signal over the sample duration. At 44,100 Hz, each sample lasts ~23 µs. Transmitting it takes a few more cycles, but this is negligible compared to the latency we will encounter at upper audio processing stages.
Note that audio recordings may have multiple channels. Mono recordings only have one channel. Stereo recordings have two, for the left and right sides. The samples from all the channels are commonly grouped into audio fragments, interleaving the samples from each channel in the output PCM stream.
Assuming a mono, s16le, 44100 Hz stream, the amount of information per second is 44,100 samples × 2 bytes = 88,200 bytes per second.
From this result, you can easily compute how many seconds of audio a 1024-byte buffer contains: 1024 / 88,200 ≈ 11.6 ms.
For comparison, a comfortable real-time audio communication requires a one-way latency of less than 300 ms, with noticeable degradation above 150 ms, so that leaves us some room.
§Architecture
Audio Source is the first piece of the puzzle that allows recording audio from an Android device and using it as an input on a computer. It acquires audio fragments from an instance of AudioRecord and passes them to a LocalSocket. These classes are part of the Android framework.
The audio is exposed on the phone through a UNIX Domain Socket (UDS). This socket family supports efficient local byte-oriented streams, without relying on the TCP stack: the data is simply passed between the connected socket pairs within the Linux kernel. The downside is that it is only accessible locally on the same machine.
The second piece of the puzzle is the Android Debug Bridge (ADB). It creates a link between a local socket on an Android phone and another socket on a computer attached through USB. This article treats this part as a reasonably efficient black box, without diving into the intricacies of USB and ADB.
The third piece of the puzzle is the sound server. The most common option for Linux is Pulse Audio. Its purpose is to receive or send audio to the hardware and make it available to multiple applications with a low latency. Pulse Audio supports modules that can register virtual sources. In this article, we will rely on the pipe source module to ingest audio fragments from a local Linux pipe.
Since ADB only forwards data between two sockets, you have to transfer the audio from the socket managed by ADB to the Pulse Audio source pipe. A simple way to do that is to use Socat, a kind of Swiss Army knife for network communications. You will see that it is reasonably simple to replicate this functionality with Python, which will serve as a basis to solve the latency issue.
§The problem
Everything seems to work fine with an amazing latency. It feels almost real time, at least for a few minutes. Then, the latency slowly builds up and settles at about 12 seconds. This is especially noticeable as the system gets busy.
That may indicate that a consumer downstream doesn't process audio fragments fast enough. You can simulate this issue by suspending socat for a few seconds, and then resuming it:
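One way to do this, assuming a single socat process is running:

    kill -STOP "$(pidof socat)"
    sleep 5
    kill -CONT "$(pidof socat)"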
After that, you will hear a huge delay. The instinctive explanation is that suspending socat delays the source, which causes some buffering upstream. But that doesn't explain why the stream starts with a barely noticeable latency, why it cannot return to that state, and where exactly these audio chunks are accumulating.
This article goes through each layer to investigate how it contributes to the overall latency. You will see that the capacity of the buffers along the transmission path is only tangentially related to the overall latency, and that they only become an issue under the specific constraints of audio playback.
§Audio Source
In this section, we will take a look at the architecture of Audio Source, how it interacts with Linux and the Android SDK, and how these layers exchange data. The following diagram provides an overview of the main buffers along the audio path that we will explore next.
§ALSA
The lowest layer in the audio stack is the Advanced Linux Sound Architecture (ALSA). Audio device drivers implement the hardware side, configuring sound cards and setting up the PCM data streams, while applications rely on the user API.
Exchanging data between the hardware and the kernel requires a hardware buffer. Through Direct Memory Access (DMA), sound cards can write to system memory, notify the kernel through a hardware interrupt that some data is available, and continue writing to the next chunk of the buffer.
The period is the amount of data sound cards write before triggering an interrupt. Only after an interrupt does the system know it can access the entire period, so the period also corresponds to the absolute minimum latency you can expect from the sound card.
Each period contains a fixed number of fragments. The hardware configuration defines the format of the PCM data written to the buffer, which influences the sample size. Stereo audio fragments contain two samples for the left and right channels. As described in the introduction, the samples are the raw bytes that describe the magnitude of the original audio signal.
It is possible to query the kernel for the actual values of these hardware parameters. ALSA exposes the sound hardware as a directory tree of cards, devices, and subdevices. For example, the path card0/pcm0c/sub0 includes the capture device pcm0c, as the trailing c indicates (playback devices have a trailing p instead). When a device is active, you can get its configuration with the following command:
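For instance, for the capture device above (the exact path depends on your hardware):

    cat /proc/asound/card0/pcm0c/sub0/hw_params

The output lists, among other things, the format, the number of channels, the rate, the period_size, and the buffer_size.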
The number of channels, the sample format, and the sample rate match the recording parameters requested by Audio Source. Note that period_size and buffer_size are given in fragment units. With such streams, a fragment is 16 bits, or 2 bytes, which means the size of a period is 1792 bytes.
As you've done in the introduction, you can compute the length of a period: 1792 / 88,200 ≈ 20 ms, which is a lower bound for the overall recording latency. That also tells us that the sound card triggers interrupts every 20 ms, or 50 times per second.
The second interesting fact is that the buffer is twice as large as the period, for it is the minimum size required to allow the sound card to write the current period while the system reads the last one.
Sound cards connected through other interfaces may require larger buffers. If the system is overloaded and cannot process the interrupts in time, the sound card will eventually overwrite an unread period: this is a buffer overrun, and that period is lost. As you will see, the same issue happens at higher audio processing levels.
§AudioRecord
Android provides the AudioRecord class to record audio from sources like a microphone. First, you have to choose the recording parameters, as described previously:
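A sketch of these parameters, matching the mono, s16le, 44100 Hz stream from the introduction (the constant names and the MIC source are illustrative choices, not necessarily the ones used by Audio Source):

    import android.media.AudioFormat
    import android.media.MediaRecorder

    val SAMPLE_RATE = 44100
    val CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO
    val AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT
    val AUDIO_SOURCE = MediaRecorder.AudioSource.MIC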
AudioRecord sits on a number of layers, including the Hardware Abstraction Layer (HAL), which is vendor specific and doesn't necessarily rely on ALSA. This HAL can buffer and process the recorded audio to enhance the poor recording quality of mobile phones, which creates another level of buffering with additional latency.
The Android audio server (called Audio Flinger) sits above the HAL. Because it runs with a high priority, it is able to fetch the audio fragments from the hardware buffer in a timely manner and place them into its own buffer. This buffer can be used for resampling and post-processing, and it can be shared between multiple applications.
The following command prints some statistics about an Audio Flinger recording stream:
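One way to get them, while the recording is active, is through dumpsys:

    adb shell dumpsys media.audio_flinger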
Notice that the HAL parameters match what we found by querying ALSA in the previous section.
AudioRecord has two parts linked by the Java Native Interface (JNI): the Java API, and the native C++ implementation. Communication with the audio server happens through the binder Inter-Process Communication (IPC) mechanism. The buffer from which AudioRecord reads its audio fragments is shared between the audio server and the application. This technique allows for the transmission of audio chunks between two processes without copying them to an additional buffer, something we won't be able to avoid next.
To successfully instantiate this class, you have to indicate a buffer size greater than the value returned by AudioRecord.getMinBufferSize.
On my device, this value is 3584 bytes, which matches the size of the hardware buffer, and makes sense because it has the same double buffering constraints.
Calling AudioRecord.read copies a slice of audio data from the internal recording buffer to the buffer passed as an argument, and advances the read pointer of the internal buffer. If you do not call AudioRecord.read quickly enough, the write head may advance past the read head, discarding unread audio fragments. This is the same buffer overrun issue as with the hardware buffer, and it may occur if the scheduler doesn't give enough CPU time to the recording thread.
To prevent this situation, the documentation advises instantiating AudioRecord with a slightly larger buffer, twice the minimum in the following code snippet:
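A sketch of the instantiation, reusing the hypothetical constants above (recording also requires the RECORD_AUDIO permission):

    import android.media.AudioRecord

    val minBufSize = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL_CONFIG, AUDIO_FORMAT)
    val recorder = AudioRecord(AUDIO_SOURCE, SAMPLE_RATE, CHANNEL_CONFIG, AUDIO_FORMAT,
                               2 * minBufSize)
    recorder.startRecording()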
To process the audio samples, you only need to call AudioRecord.read in a loop and make use of the data however you want. This operation blocks the calling thread until some audio samples are available and returns the number of bytes that were read.
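A minimal loop could look like this (the 1024-byte chunk size is an arbitrary choice):

    val buf = ByteArray(1024)
    while (true) {
        // Blocks until audio is available; returns the number of bytes copied,
        // or a negative error code.
        val read = recorder.read(buf, 0, buf.size)
        if (read <= 0) break
        // ... use buf[0 until read] however you want
    }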
While the recorder is active, you can call AudioRecord.getBufferSizeInFrames, which tells you the actual size of the buffer; on my device, it reports 3584 frames.
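For instance, continuing the previous snippet (bufferSizeInFrames is the Kotlin accessor for getBufferSizeInFrames, available since API 23):

    import android.util.Log

    Log.i("AudioSource", "bufferSizeInFrames=${recorder.bufferSizeInFrames}")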
Remember that we passed 2 * minBufSize as the buffer size, and because each fragment (or frame) holds 2 bytes, 3584 frames are equivalent to 7168 bytes, twice the minimum buffer size.
Returning to the original problem, a possible source of latency is the AudioRecord buffer. But even after suspending and resuming socat, its size remains the same. From the calculation in the introduction, you know this buffer can hold at most 81.3 ms of audio (7168 / 88,200), which isn't enough to cause multiple seconds of latency.
Another possibility is scheduling: Audio Source may not be able to keep up with the flow of audio fragments. This hypothesis is quickly discarded as there are no audible skips, which means the samples are being processed fast enough. Additionally, the internal buffer in AudioRecord wouldn't be able to provide packets that are consistently multiple seconds late, thereby invalidating the hypothesis of slow calls to AudioRecord.read.
Now that you can capture a chunk of audio from the microphone, let's make it available to other applications.
§LocalSocket
LocalServerSocket gives you the ability to listen for local connections on a UNIX Domain Socket (UDS). These sockets are similar to TCP sockets, except they are local to the machine, only transfer data inside the Linux kernel, and do not run any of the TCP flow control algorithms.
Indeed, there is no need to handle reliable communication over IP networks, so their implementation doesn't require Nagle's algorithm with its negative effect on latency.
§Handling connections
These sockets are usually bound to a path on the local file system, but there exists a flavor of "abstract" UDS that is identified only by a name.
To wait for a client connection, you need to call the blocking method LocalServerSocket.accept. When a client connects, it returns a LocalSocket that contains two streams, input and output, for bidirectional communication.
Audio Source takes the audio samples from AudioRecord and passes them to the client connected through the LocalSocket:
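A sketch of that forwarding loop, assuming the abstract socket is named audiosource (the name that shows up with ss later in this article) and recorder is the instance created above:

    import android.net.LocalServerSocket

    val server = LocalServerSocket("audiosource")   // abstract UDS
    val client = server.accept()                    // blocks until a client connects
    val output = client.outputStream
    val buf = ByteArray(1024)
    while (true) {
        val read = recorder.read(buf, 0, buf.size)
        if (read <= 0) break
        output.write(buf, 0, read)                  // forward the audio fragments
    }
    client.close()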
In practice, it is a little bit more involved, to allow for closing the connection from Android and recording in a separate thread. Nonetheless, the general idea is there.
§Buffering
Each socket has two associated buffers: send and receive. You can get the size of the send buffer with LocalSocket.getSendBufferSize, and set it with LocalSocket.setSendBufferSize. (Under the hood, these methods call Linux's getsockopt and setsockopt with the option SO_RCVBUF or SO_SNDBUF.)
First, let's try to log the default size: on my device, it reports 212992.
That means the send buffer can fill up to 212992 bytes, which is about two seconds of audio. The fix is easy:
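A sketch of that fix, requesting 1024 bytes (any value this small gets clamped to the minimum described below):

    client.sendBufferSize = 1024   // setsockopt(SO_SNDBUF)
    Log.i("AudioSource", "sendBufferSize=${client.sendBufferSize}")   // now reports 2048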
The value you pass to setSendBufferSize gets doubled, and the result cannot be less than 2048 bytes. This is the smallest buffer size you can get, and it represents roughly 23.2 ms of audio with our sampling parameters.
§ADB
When you connect your phone to a computer, you can enable the Android Debug Bridge (ADB), which gives you access to development features. To forward the abstract UDS created in the previous section to a socket of the same name on the attached computer, you can run:
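Assuming the abstract socket is named audiosource as before (on a Linux host, adb also accepts localabstract: for the local end):

    adb forward localabstract:audiosource localabstract:audiosource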
ADB transparently links these two sockets through the USB connection, so you can access Audio Source through a local abstract UDS on the computer. The USB protocol used to connect them and its internal buffers are beyond our control. But we can try to check whether the socket on the computer side does the same buffering as the socket on the Android side.
Although you do not have access to it directly, you can use the command ss to list all the sockets on the system. You can pass the option -p to get the list of connected processes, and filter the output for audiosource with grep:
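For example (-x restricts the listing to UNIX domain sockets):

    ss -x -p | grep audiosource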
The send queue contains about 213 kB of data. This value is not fortuitous, as it corresponds to /proc/sys/net/core/wmem_default, the default capacity of UDS sockets. This is about 2 seconds of audio. Unfortunately, there isn't much you can do about this buffer aside from changing the global limit.
§Pulse Audio
§Direct playback
Let's not forget about our goal: getting the audio to the sound server. Like everything else, Linux has multiple competing sound server implementations. Pulse Audio is the most common on the Linux desktop, so it will be the target of this article. This sound server relies on ALSA to exchange audio data with the hardware. Using its own buffer, it can resample the audio and mix it with other sources.
Socat can be used to connect to the ADB abstract UDS. As an initial proof of concept, you can just send the raw PCM data to pacat to play it:
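One possible invocation, matching the socket name and the stream parameters used so far:

    socat -u ABSTRACT-CONNECT:audiosource STDOUT \
        | pacat --format=s16le --rate=44100 --channels=1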
pacat has the option --volume that you may want to set higher than 100%, like --volume=150, if you want to hear anything. This setup allows you to listen to the audio PCM stream directly without going through a Pulse Audio recording source.
§Pipe source
pacat sends the audio to a sink, but it doesn't allow us to use this stream as input for recording purposes, or as a source in Pulse Audio terminology. Fortunately, Pulse Audio provides a module to register a virtual source that can ingest raw audio from a pipe.
§Configuration
Pipes are another kind of IPC within Linux, with an API similar to files, except they operate on a shared kernel buffer. Note that in the previous section, socat and pacat are connected through an anonymous pipe linking socat STDOUT and pacat STDIN. A pipe bound to the file system is called a FIFO, but there is no conceptual difference between them.
To create a virtual pipe source, the first step is to load the appropriate module:
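A possible invocation; the source name, the FIFO path, and the format arguments are choices, not requirements:

    pactl load-module module-pipe-source source_name=audiosource \
        file=/tmp/audiosource format=s16le rate=44100 channels=1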
Then you can run socat again, this time connecting its output to the pipe:
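For instance (GOPEN opens the existing FIFO created by the module):

    socat -u ABSTRACT-CONNECT:audiosource GOPEN:/tmp/audiosource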
Note the option -u to force a unidirectional connection from left to right; otherwise, socat will try to open the pipe for reading, consuming what it just wrote and forwarding it back to the abstract socket.
After starting socat, you can verify that the source is available:
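For example, with pactl (the source name is the one passed to the module):

    pactl list short sources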
Note that it is marked as suspended. Indeed, Pulse Audio disables the source when no application reads from it. Unsurprisingly, this has the same effect as suspending socat.
Beyond the VU-meter in pavucontrol, the following section provides a way to listen to this source.
§Playback
Similar to our initial setup, you can use parec to capture raw audio from the source and send it to pacat:
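One possible pipeline, assuming the source is named audiosource as above:

    parec -d audiosource | pacat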
These commands add a noticeable latency on top of the intrinsic source latency:
- parec latency to capture the source from the sound server into its own buffer and copy it to STDOUT.
- Pipe buffering between parec and pacat (which can be configured with stdbuf).
- pacat latency to capture the source from STDIN into its own buffer and send it back to the sound server.
- Sound server and driver latency to copy the audio fragments from the playback buffer to the hardware buffer.
- Sound card latency to play the audio fragments from the hardware buffer.
There is an alternative way to achieve the same thing using the loopback module within the audio server, which cuts some of these sources of latency. See Pipe a source directly into a sink on the Arch Linux Wiki.
§socat buffer
socat reads a chunk of data from the ADB socket, and writes it to the output pipe. The option -b controls the size of the buffer used for the copy:
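For instance, with 1024-byte copies (the addresses are the same as before):

    socat -u -b 1024 ABSTRACT-CONNECT:audiosource GOPEN:/tmp/audiosource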
This internal buffer is only used to send the data from the input socket to the output pipe (passing the data from kernel space to user space and back). You can easily replace this command with the following Python script to make this mechanism explicit:
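A minimal sketch, assuming the abstract socket name and the FIFO path used earlier:

    import socket

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect("\0audiosource")  # leading NUL byte: abstract namespace

    # The pipe-source FIFO, opened unbuffered so each chunk goes straight out.
    with open("/tmp/audiosource", "wb", buffering=0) as out:
        while True:
            chunk = sock.recv(1024)   # read up to 1024 bytes from the ADB socket
            if not chunk:
                break                 # the Android side closed the connection
            out.write(chunk)          # copy the chunk to the Pulse Audio pipe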
The code is very similar to what we've seen previously with Audio Source.
§Internal buffering
Pulse Audio also performs internal buffering. If you pipe the audio into pacat -v, you can see some statistics:
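For example, reusing the direct playback pipeline with the verbose flag:

    socat -u ABSTRACT-CONNECT:audiosource STDOUT \
        | pacat -v --format=s16le --rate=44100 --channels=1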
Suspending socat causes buffer underruns because there are no incoming audio packets during that time. After resuming, the latency is pretty high, around 2 seconds, as Pulse Audio adjusted the buffer accordingly:
- tlength: desired length of the audio buffer according to the target latency.
- prebuf: amount of data necessary before starting the audio stream.
- minreq: minimum audio chunk size requested from the client.
pacat has the flag --latency-msec=<msec> to set the maximum latency for this source, for instance, 20 ms:
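For example, on the same pipeline:

    socat -u ABSTRACT-CONNECT:audiosource STDOUT \
        | pacat -v --latency-msec=20 --format=s16le --rate=44100 --channels=1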
Pulse Audio keeps a small buffer to maintain the latency under 20 ms. Note that by default it tries to pick a reasonably low latency, as long as the source keeps up (which is clearly not the case while socat is suspended).
Setting the latency on the command line limits the buffer size increase to fight the source jitter, but it doesn't say anything about buffering upstream, so that doesn't solve the original problem.
§Pipe buffering
The last main source of delay is the pipe. You can inspect how many bytes are queued up and what the pipe capacity is through Linux's ioctl and fcntl interfaces:
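A sketch of such a probe in Python; the FIFO path is the one given to the module, and F_GETPIPE_SZ may be missing from older Python versions, hence the numeric fallback:

    import array
    import fcntl
    import os
    import termios

    F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)  # Linux-specific fcntl command

    fd = os.open("/tmp/audiosource", os.O_RDONLY | os.O_NONBLOCK)

    queued = array.array("i", [0])
    fcntl.ioctl(fd, termios.FIONREAD, queued)   # bytes currently sitting in the pipe
    capacity = fcntl.fcntl(fd, F_GETPIPE_SZ)    # total pipe capacity

    print(f"queued={queued[0]} capacity={capacity}")
    os.close(fd)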
Running it initially, after kill -STOP socat, and again after kill -CONT socat shows the queued bytes climbing once socat resumes and flushes its backlog, while the reported capacity stays the same.
The default pipe capacity is defined by /proc/sys/fs/pipe-max-size at around 1 MB, which corresponds to approximately 12 seconds of audio! fcntl provides a command to set the pipe capacity:
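A sketch with fcntl in Python, shrinking the pipe to a single page (the kernel rounds the requested size up to at least one page, and F_SETPIPE_SZ may again be missing from older Python versions):

    import fcntl
    import os

    F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)  # Linux-specific fcntl command

    fd = os.open("/tmp/audiosource", os.O_RDONLY | os.O_NONBLOCK)
    size = fcntl.fcntl(fd, F_SETPIPE_SZ, 4096)   # returns the actual new capacity
    print(f"pipe capacity is now {size} bytes")
    os.close(fd)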
§Reducing the latency
Adjusting the buffer capacity is useful to prevent the accumulation of old fragments when the flow is interrupted downstream. But this tweak alone doesn't help when a consumer doesn't read fragments fast enough from a reliable communication channel.
§Head-of-line blocking
When Pulse Audio doesn't read from the stream fast enough, which happens when the system is overloaded, the packets accumulate in the buffers upstream. This is what slowly increases the delay. Controlling some of these buffers might help maintain a reasonable latency, but the ADB buffer is large and beyond your control.
While socat is suspended, Pulse Audio replaces the missing fragments with silence during the buffer underruns. Because the transmission is reliable, once socat resumes, all the accumulated packets are sent in order, but they are now several seconds late.
Even worse, they delay any upcoming packets, which now need to pass through all the filled buffers. When all the intermediate buffers are full, the audio latency is at its maximum, and AudioRecord experiences a buffer overrun. This is a head-of-line blocking situation.
Because initially the latency is almost perfect, there is no real need to control the various buffers along the path. The packets are sent fast enough, but when Pulse Audio cannot read some fragments in due time, it inserts silence, which causes a delay.
Instead of controlling all the intermediate buffers, you want to make the consumer fast enough. Because Pulse Audio can't play the audio faster to catch up to the source, you have to discard late fragments.
§Discarding late fragments
Going back to our Python implementation of Socat, let's try to make it discard late audio chunks just before they are forwarded to Pulse Audio through the pipe.
PCM data only contains the raw audio samples, so you do not have access to any timing information. Because the two calls to recv and write block, you don't know whether you can send more data or not without blocking. When this problem arises, it is usually time to turn to non-blocking sockets.
The idea is to make the input non-blocking, so you can take as many chunks as available, put them into a queue with a limited length, and write these selected fragments to the output. If the chunks arrive faster than they can be written, they will queue up and get discarded when the queue gets full.
In its collections module, Python has a deque that supports a maximum length. Calling the method append when it is full discards the oldest element:
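A sketch of this variant; the maximum length of 4 chunks and the other names are illustrative:

    import socket
    from collections import deque

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect("\0audiosource")
    sock.setblocking(False)

    queue = deque(maxlen=4)   # append() evicts the oldest chunk when full

    with open("/tmp/audiosource", "wb", buffering=0) as out:
        while True:
            # Drain everything currently available on the socket.
            while True:
                try:
                    chunk = sock.recv(1024)
                except BlockingIOError:
                    break
                if not chunk:
                    raise SystemExit   # connection closed
                queue.append(chunk)
            if queue:
                out.write(queue.popleft())   # blocking write to the pipe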
There is a slight issue with this code: if the packets arrive too slowly, the buffer stays empty and you have a busy wait. The simple way to solve this problem is to call sleep when the queue is empty.
A better way to solve this problem is to use poll. With register(inp, POLLIN), calls to poll block until you can read something from inp:
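A sketch of the poll-based loop; here the input socket sock plays the role of inp:

    import socket
    from collections import deque
    from select import POLLIN, poll

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect("\0audiosource")
    sock.setblocking(False)

    watcher = poll()
    watcher.register(sock, POLLIN)

    queue = deque(maxlen=4)

    with open("/tmp/audiosource", "wb", buffering=0) as out:
        while True:
            watcher.poll()   # block until at least one chunk is readable
            while True:
                try:
                    chunk = sock.recv(1024)
                except BlockingIOError:
                    break
                if not chunk:
                    raise SystemExit   # connection closed
                queue.append(chunk)
            while queue:
                out.write(queue.popleft())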
In principle, this program discards any chunk that cannot be written to the output in time. For this to work, you have to set a low pipe capacity, otherwise you can write to it as quickly as you want while it isn't full. The actual back pressure is applied through this pipe.
§Pipe as a queue
There are still some issues with the previous solution. sock.recv returns at most 1024 bytes, but it may return less. Processing smaller chunks increases the number of iterations and makes the program less efficient. Additionally, the deque adds yet another level of buffering before the chunks are finally written to the pipe, so it works like an extension of it. Finally, sock.recv creates a new buffer each time; it would be more efficient to reuse the same buffer. Let's try to solve these issues.
First, you can try to revert to a blocking read, but this time using the flag MSG_WAITALL. This option tells the syscall to return only when it has written the specified number of bytes into the buffer, so the program processes chunks of exactly 1024 bytes per iteration. (The syscall may still be interrupted by a signal and return fewer bytes, but that doesn't affect the logic.)
Second, you can replace recv by recv_into. The first argument is the destination buffer, so you can use a bytearray, the mutable version of a byte buffer.
Finally, you can use the pipe as a discarding queue. If you rely on blocking writes, they won't return until all the input data has been written, so you are back to the initial blocking implementation. Instead, you have to use a non-blocking write. What happens if the call to write returns with fewer bytes than the size of the chunk? You could discard the remaining bytes, but you would have to make sure that the last fragment wasn't half written. In that case, you would have to write the other half at the next iteration, and adjust recv_into to fill the remaining buffer space.
Fortunately, you do not have to do any of this. Any write of at most PIPE_BUF bytes, equal to 4096 on modern Linux systems, is atomic: either the whole chunk fits and write returns 1024, or the non-blocking call fails without writing anything. That means the pipe really works like a queue. A 4096-byte pipe filled with atomically written 1024-byte chunks will contain at most 4 chunks.
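Putting these pieces together, a sketch of the final forwarder could look like this (the pipe-source module must already be loaded so that a reader exists on the FIFO; names and sizes are illustrative):

    import fcntl
    import os
    import socket

    CHUNK = 1024
    F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)  # Linux-specific fcntl command

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect("\0audiosource")

    # Requires a reader on the FIFO (the pipe-source module), otherwise open() fails.
    out = os.open("/tmp/audiosource", os.O_WRONLY | os.O_NONBLOCK)
    fcntl.fcntl(out, F_SETPIPE_SZ, 4 * CHUNK)   # the pipe holds at most 4 chunks

    buf = bytearray(CHUNK)
    while True:
        # Block until a full chunk is available (a signal may still shorten it).
        n = sock.recv_into(buf, CHUNK, socket.MSG_WAITALL)
        if n == 0:
            break   # the Android side closed the connection
        try:
            os.write(out, buf[:n])   # atomic: either the whole chunk or nothing
        except BlockingIOError:
            pass                     # pipe full: drop the chunk to keep latency low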
There is a slight deviation from the behavior of the deque: instead of skipping the oldest fragment when the pipe is full, this approach discards the most recent ones, which can't be written without blocking. As a proof, consider the same number of buffered fragments in each situation:
- The pipe contains 2 chunks, the queue contains 2 chunks, delimited by |.
  - [f1 f2|f3 f4] (initially).
  - [f1 f2|f4 f5] (after f5, discard f3).
  - [f2 f4|f5 ] (after 1 consumed chunk).
  - [f2 f4|f5 f6] (after f6).
  - [f4 f5|f6 f7] (after 1 consumed chunk + f7).
- The pipe contains 4 chunks.
  - [f1 f2 f3 f4] (initially).
  - [f1 f2 f3 f4] (after f5, discard f5).
  - [f2 f3 f4 ] (after 1 consumed chunk).
  - [f2 f3 f4 f6] (after f6).
  - [f3 f4 f6 f7] (after 1 consumed chunk + f7).
When chunks are discarded, the queue sends newer chunks sooner. In practice, it doesn't really matter, because the pipe is quite small. Given enough time, both situations will end up synchronized, hence using the pipe alone doesn't cause further delay.
§Conclusion
The audio delay is the consequence of excessive buffering. Changing the size of the intermediate buffers improves the latency, but doesn't fix the underlying issue, which is the inevitable desynchronization between the producer and the consumer.
The issue with audio playback is that each chunk corresponds to an incompressible timespan. If Pulse Audio doesn't receive the next audio chunk in time, it replaces it with silence. When it finally arrives, Pulse Audio plays this audio chunk late, which delays all the subsequent data in the stream.
For a short span of time, you could play back the audio faster to catch up with the source, assuming you can adjust the pitch. For a longer duration, the only solution to restore a low latency is to discard excess audio chunks from the reliable communication channel. Discarding excess data downstream helps keep the upstream buffering low, which makes the capacity of the upstream buffers unimportant. The downside of this process is that it produces audible skips.
The last piece of the puzzle is to properly exploit the properties of the Linux API to limit the size of the pipe between the discarding and the consuming processes. It is important to control the size of this buffer, because it is responsible for applying the consumer back pressure, and so it directly contributes to the overall audio latency.