Voice over IP in GNOME Calls Part 2: The Implementation

Evangelos Ribeiro Tzaras

Voice over IP in GNOME Calls Part 2: The Implementation - July 13, 2022

In the previous blog post we looked at the relevant protocols used for VoIP (SIP, SDP, RTP) and some libraries you might use to develop a SIP softphone.
Equipped with these basics let us dive into some details of our implementation after the following disclaimer:

Before we continue

To the uninitiated in the dark arts of Voice over IP we want to apologize in advance for the acronym soup you are (unless you stop reading here) inevitably about to enter. You can find explanations of all the jargon used here in the first part of the blog post.

Getting started

The necessary work for an initial SIP plugin consists of signalling code and media pipeline code, which are largely independent of each other, plus some glue code (responsible e.g. for plugging negotiated codecs into the media pipeline).

Some notable limitations at the time (including, but not limited to):

Account configuration (including the password!) in a key file: ~/.config/calls/sip-account.cfg
Could only place a call from the command line: gnome-calls sip:some_user@some_host.tld
Audio codec hardcoded to PCMA
Could only use ModemManager or SIP plugin, but not both

One of the first improvements was supporting multiple audio codecs.

Implementing support for multiple provider plugins was another important milestone: it allows VoIP calls over SIP and traditional cellular calls using the modem to be available at the same time.

The feature of managing accounts in the UI made sure that users would not have to manually tinker with credentials in a key file any more.

A closer look

Registering to a server

The first thing a SIP client typically does is register with a server called the registrar.

The registrar is responsible for accounts in its domain (e.g. domain A in the SIP trapezoid), challenges you to authenticate when you want to log in and maintains a list of currently connected clients per account (after all you can be connected to the same account from multiple devices). The registrar can often be reached on the same address as your SIP server, e.g. sip:example.org.

A simple invocation of

nua_register (registration_handle,
              NUTAG_M_USERNAME ("alice"),
              NUTAG_REGISTRAR ("sip:example.org"),
              TAG_END ());

will cause a message to be sent that looks something like:

REGISTER sip:example.org SIP/2.0
Via: SIP/2.0/UDP 192.168.1.23:47888;rport;branch=z9hG4bKBB6516vD98Z7F
From: <sip:alice@example.org>;tag=BByp2m992N4gc
To: <sip:alice@example.org>
Call-ID: 0df10634-7364-123b-bdb1-18c04d8376ad
CSeq: 966403932 REGISTER
Contact: <sip:alice@192.168.1.23:47888>
Expires: 180
User-Agent: calls sofia-sip/1.12.11devel
Content-Length: 0

We now receive a message challenging us for authentication:

SIP/2.0 401 Unauthorized
Via: SIP/2.0/UDP 192.168.1.23:47888;rport=47888;received=192.168.1.23;branch=z9hG4bKBB6516vD98Z7F
Call-ID: 0df10634-7364-123b-bdb1-18c04d8376ad
From: <sip:alice@example.org>;tag=BByp2m992N4gc
To: <sip:alice@example.org>;tag=z9hG4bKBB6516vD98Z7F
CSeq: 966403932 REGISTER
WWW-Authenticate: Digest realm="asterisk",nonce="1656626926/25ce7bee572bdfa3eba02f41552405a3",opaque="71c8fa0e05df2db4",algorithm=md5,qop="auth"
Server: Asterisk PBX 16.16.1~dfsg-1+deb11u1
Content-Length:  0

This invokes our callback (see the Sofia SIP section in Part 1) for the nua_r_register (response to register) event, which ends up being processed in the following code path:

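if (status == 401 || status == 407) {
  /* Credentials in "scheme:realm:username:password" form;
   * g_autofree releases the string when it goes out of scope. */
  g_autofree char *auth = g_strdup_printf ("%s:%s:%s:%s",
                                           scheme, realm, username, password);
  nua_authenticate (nh, NUTAG_AUTH (auth), TAG_END ());
}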

causing us to answer the authentication challenge with our credentials:

REGISTER sip:example.org SIP/2.0
Via: SIP/2.0/UDP 192.168.1.23:47888;rport;branch=z9hG4bKcmZy31DH6HptB
From: <sip:alice@example.org>;tag=BByp2m992N4gc
To: <sip:alice@example.org>
Call-ID: 0df10634-7364-123b-bdb1-18c04d8376ad
CSeq: 966403933 REGISTER
Contact: <sip:alice@192.168.1.23:47888>
Authorization: Digest username="alice", realm="asterisk", nonce="1656626926/25ce7bee572bdfa3eba02f41552405a3", cnonce="DfLWRHNkEjuxvRjATYN2rQ", opaque="71c8fa0e05df2db4", algorithm=MD5, uri="sip:example.org", response="3bf175376f577552babf42c4e6382a07", qop=auth, nc=00000001
Content-Length: 0

And this time we get:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 192.168.1.23:47888;rport=47888;received=192.168.1.23;branch=z9hG4bKcmZy31DH6HptB
Call-ID: 0df10634-7364-123b-bdb1-18c04d8376ad
From: <sip:alice@example.org>;tag=BByp2m992N4gc
To: <sip:alice@example.org>;tag=z9hG4bKcmZy31DH6HptB
CSeq: 966403933 REGISTER
Date: Thu, 30 Jun 2022 22:08:47 GMT
Contact: <sip:alice@192.168.1.23:47888;user=phone>;expires=179
Expires: 180
Server: Asterisk PBX 16.16.1~dfsg-1+deb11u1
Content-Length:  0

The server accepted our registration and we are now online. Hooray!
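To make the challenge/response step a bit more tangible, here is a minimal sketch in plain C (standard library only, no GLib or Sofia SIP) that pulls individual parameters such as the realm out of a WWW-Authenticate header and joins credentials into the "scheme:realm:username:password" shape used above. The helper names challenge_get_param and join_credentials are hypothetical and not part of Calls or Sofia SIP; a real client would rely on the header structures its SIP stack has already parsed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy the value of parameter `name` (e.g. "realm") from a challenge header
 * into `out`, stripping surrounding double quotes. Returns false if the
 * parameter is missing or does not fit. Escaped quotes are not handled. */
static bool
challenge_get_param (const char *header, const char *name,
                     char *out, size_t out_len)
{
  size_t name_len = strlen (name);
  const char *p = header;

  while ((p = strstr (p, name)) != NULL) {
    /* Require a boundary before the name and '=' right after it, so that
     * "nonce" does not match inside "cnonce". */
    bool at_boundary = (p == header) || p[-1] == ' ' || p[-1] == ',';

    if (at_boundary && p[name_len] == '=') {
      const char *value = p + name_len + 1;
      bool quoted = (*value == '"');
      size_t len;

      if (quoted)
        value++;
      len = strcspn (value, quoted ? "\"" : ", \r\n");
      if (len + 1 > out_len)
        return false;
      memcpy (out, value, len);
      out[len] = '\0';
      return true;
    }
    p += name_len;
  }
  return false;
}

/* Join the four fields with ':' exactly as given. Any quoting the SIP stack
 * expects around the realm is left to the caller (assumption). */
static bool
join_credentials (char *out, size_t out_len, const char *scheme,
                  const char *realm, const char *username,
                  const char *password)
{
  int n = snprintf (out, out_len, "%s:%s:%s:%s",
                    scheme, realm, username, password);

  return n >= 0 && (size_t) n < out_len;
}
```

Fed with the challenge from the 401 response above, challenge_get_param yields asterisk for realm and md5 for algorithm. The realm is what lets a client decide which stored credentials to answer with.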

Placing a call

Now that we are online, sending an INVITE (to actually call bob on the other end) is as simple as

char *local_sdp = construct_sdp ();
nua_invite (call_handle,
            SOATAG_USER_SDP_STR (local_sdp),
            SIPTAG_TO_STR ("sip:bob@example.org"),
            TAG_END ());

resulting in the following message:

INVITE sip:bob@example.org SIP/2.0
Via: SIP/2.0/UDP 192.168.1.23:56164;rport;branch=z9hG4bKX02p8U7yy6QHr
From: <sip:alice@example.org>;tag=t39mvBvF17gSa
To: <sip:bob@example.org>
Call-ID: 4ad1af21-7363-123b-4484-18c04d8376ad
CSeq: 966403849 INVITE
Contact: <sip:6001@192.168.1.23:56164>
Content-Type: application/sdp
Content-Length: 237

v=0
o=- 5090247275169452773 8778832019085077134 IN IP4 192.168.1.23
s=-
c=IN IP4 192.168.1.23
t=0 0
m=audio 49270 RTP/AVP 9 8 0 3
a=rtpmap:9 G722/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
a=rtcp:46118

SDP

Let us now take another look at the SDP containing the codecs we want to negotiate. For now we only support a subset of static payload definitions, but extending them is on our radar.

The codecs are encoded in the following struct:

typedef struct {
  guint payload_id;
  char *codec_name;  /* e.g. PCMA */
  gint  clock_rate;  /* sampling rate */
  gint  channels;    /* mono, stereo */
  char *gstreamer_payloader_element_name;
  char *gstreamer_depayloader_element_name;
  char *gstreamer_encoder_element_name;
  char *gstreamer_decoder_element_name;
  char *gstreamer_filename_plugin;
} MediaCodecInfo;

which includes the information necessary for constructing the SDP as well as the names of the GStreamer elements we plug into our pipeline.

The currently supported codecs are hardcoded to:

static MediaCodecInfo gst_codecs[] = {
  {8, "PCMA", 8000, 1, "rtppcmapay", "rtppcmadepay", "alawenc", "alawdec", "libgstalaw.so"},
  {0, "PCMU", 8000, 1, "rtppcmupay", "rtppcmudepay", "mulawenc", "mulawdec", "libgstmulaw.so"},
  {3, "GSM", 8000, 1, "rtpgsmpay", "rtpgsmdepay", "gsmenc", "gsmdec", "libgstgsm.so"},
  {9, "G722", 8000, 1, "rtpg722pay", "rtpg722depay", "avenc_g722", "avdec_g722", "libgstlibav.so"},
};
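A table like this is also what the glue code needs in order to map a negotiated payload type back to a codec. The following is a minimal sketch of that lookup in plain C, assuming a trimmed-down struct and that we simply pick the first payload type of the remote media line that we know. CodecEntry, codec_by_payload_id and first_supported_codec are hypothetical names for illustration, not the ones used in Calls.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Reduced version of the codec table: only what the lookup needs. */
typedef struct {
  unsigned    payload_id;
  const char *codec_name;
  int         clock_rate;
} CodecEntry;

static const CodecEntry known_codecs[] = {
  { 8, "PCMA", 8000 },
  { 0, "PCMU", 8000 },
  { 3, "GSM",  8000 },
  { 9, "G722", 8000 },
};

/* Look up a codec by its static RTP payload type; NULL if unknown. */
static const CodecEntry *
codec_by_payload_id (unsigned payload_id)
{
  for (size_t i = 0; i < sizeof known_codecs / sizeof known_codecs[0]; i++)
    if (known_codecs[i].payload_id == payload_id)
      return &known_codecs[i];
  return NULL;
}

/* Return the first payload type listed in a media line such as
 * "m=audio 49270 RTP/AVP 9 8 0 3" that is in our table, or NULL. */
static const CodecEntry *
first_supported_codec (const char *media_line)
{
  unsigned port;
  int consumed = 0;
  const char *p;

  /* %n stays 0 unless the literal "RTP/AVP" matched as well. */
  if (sscanf (media_line, "m=audio %u RTP/AVP%n", &port, &consumed) != 1 ||
      consumed == 0)
    return NULL;

  p = media_line + consumed;
  for (;;) {
    char *end;
    unsigned long payload = strtoul (p, &end, 10);
    const CodecEntry *codec;

    if (end == p)          /* no more numbers on the line */
      break;
    codec = codec_by_payload_id ((unsigned) payload);
    if (codec != NULL)
      return codec;
    p = end;
  }
  return NULL;
}
```

For the media line of the INVITE above, this picks G722 (payload type 9), since it is listed first. Lines using another profile, such as RTP/SAVP, are rejected by this sketch.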

The following code iterates over the supported codecs and builds the SDP string from them:

for (GList *node = supported_codecs; node != NULL; node = node->next) {
  MediaCodecInfo *codec = node->data;

  g_string_append_printf (media_line, " %u", codec->payload_id);
  g_string_append_printf (attribute_lines,
                          "a=rtpmap:%u %s/%d%s",
                          codec->payload_id,
                          codec->codec_name,
                          codec->clock_rate,
                          "\r\n");
}
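For readers who want to try this outside of Calls, here is a self-contained sketch of the same idea in plain C. It is an assumption-laden simplification: it uses a fixed buffer and an array instead of GString and GList, and writes the media line and the rtpmap attributes in two passes. SdpCodec, sdp_append and build_audio_media are hypothetical names.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
  unsigned    payload_id;
  const char *codec_name;
  int         clock_rate;
} SdpCodec;

/* printf-style append into a fixed buffer; returns false on truncation. */
static bool
sdp_append (char *out, size_t out_len, size_t *used, const char *fmt, ...)
{
  va_list args;
  int n;

  if (*used >= out_len)
    return false;
  va_start (args, fmt);
  n = vsnprintf (out + *used, out_len - *used, fmt, args);
  va_end (args);
  if (n < 0 || (size_t) n >= out_len - *used)
    return false;
  *used += (size_t) n;
  return true;
}

/* Write the "m=audio" line followed by one "a=rtpmap" line per codec. */
static bool
build_audio_media (char *out, size_t out_len, unsigned rtp_port,
                   const SdpCodec *codecs, size_t n_codecs)
{
  size_t used = 0;

  if (!sdp_append (out, out_len, &used, "m=audio %u RTP/AVP", rtp_port))
    return false;
  /* First pass: payload types on the media line, in order of preference. */
  for (size_t i = 0; i < n_codecs; i++)
    if (!sdp_append (out, out_len, &used, " %u", codecs[i].payload_id))
      return false;
  if (!sdp_append (out, out_len, &used, "\r\n"))
    return false;
  /* Second pass: one rtpmap attribute per payload type. */
  for (size_t i = 0; i < n_codecs; i++)
    if (!sdp_append (out, out_len, &used, "a=rtpmap:%u %s/%d\r\n",
                     codecs[i].payload_id, codecs[i].codec_name,
                     codecs[i].clock_rate))
      return false;
  return true;
}
```

Called with port 49270 and the codecs G722 and PCMA, this produces the m=audio line and the first two a=rtpmap lines in the same form as the INVITE shown earlier.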

Audio

When we started experimenting with RTP in GStreamer we were using two separate gst-launch-1.0 invocations, one per direction:

For sending:

gst-launch-1.0 rtpbin name=rtpbin \
    pulsesrc ! avenc_g722 ! rtpg722pay ! rtpbin.send_rtp_sink_0 \
    rtpbin.send_rtp_src_0 ! udpsink host=${REMOTE} port=5002 \
    rtpbin.send_rtcp_src_0 ! udpsink host=${REMOTE} port=5003 sync=false async=false \
    udpsrc port=5007 ! rtpbin.recv_rtcp_sink_0

And for receiving:

gst-launch-1.0 -v rtpbin name=rtpbin \
    udpsrc caps="application/x-rtp,media=(string)audio,clock-rate=(int)8000,encoding-name=(string)G722" \
        port=5002 ! rtpbin.recv_rtp_sink_1 \
    rtpbin. ! rtpg722depay ! avdec_g722 ! pulsesink \
    udpsrc port=5003 ! rtpbin.recv_rtcp_sink_1 \
    rtpbin.send_rtcp_src_1 ! udpsink host=${REMOTE} port=5007 sync=false async=false

rtpbin is the main actor here: it can manage multiple RTP streams and their corresponding RTCP streams, and creates source and sink pads as needed (e.g. recv_rtp_sink_0 and send_rtp_src_0). These are appropriately linked to udpsink and udpsrc to send and receive data over the network via UDP, as well as to pulsesink for playing the audio back to you and pulsesrc for recording from your microphone.

The media pipeline code was closely modelled after these kinds of commands.
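To illustrate how the element names from the codec table end up in such a pipeline, here is a small sketch that fills a gst-launch style description for the sending direction (RTCP left out for brevity). This is only an illustration under our own assumptions: build_send_pipeline is a hypothetical helper, and the actual media pipeline code in Calls may well create and link its elements differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Fill `out` with a pipeline description for sending audio, plugging in the
 * encoder and payloader element names of the negotiated codec.
 * Returns false if the description does not fit into the buffer. */
static bool
build_send_pipeline (char *out, size_t out_len,
                     const char *encoder, const char *payloader,
                     const char *remote_host, unsigned rtp_port)
{
  int n = snprintf (out, out_len,
                    "rtpbin name=rtpbin "
                    "pulsesrc ! %s ! %s ! rtpbin.send_rtp_sink_0 "
                    "rtpbin.send_rtp_src_0 ! udpsink host=%s port=%u",
                    encoder, payloader, remote_host, rtp_port);

  return n >= 0 && (size_t) n < out_len;
}
```

With "alawenc" and "rtppcmapay" this yields the PCMA variant of the sending command shown above. A string like this could then be handed to gst-launch-1.0 or to gst_parse_launch().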

Note that while these gst-launch-1.0 commands will work within the same network, they likely won’t work (as presented) over the internet, but more on that in the next section.

Network Address Translation

In the IPv4 world users typically sit behind a router which performs Network address translation (NAT) between the private network and the internet.

The router keeps track of connections and knows to which client in the private network responses from the internet should be forwarded.

This does of course mean that our media pipeline needs to reuse the socket from outgoing RTP traffic for the incoming connections as well.
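The idea can be shown with plain POSIX sockets, independent of GStreamer: one UDP socket is bound once and then used for both sending and receiving, so replies addressed to the source port the router has seen arrive on the very same socket. The sketch below makes no claim about how Calls wires this into its pipeline; open_symmetric_udp_socket and bound_port are hypothetical helper names.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one UDP socket bound to local_ip:local_port (port 0 lets the kernel
 * choose). The same descriptor is meant to be used for sendto() as well as
 * recvfrom(). Returns -1 on error. */
static int
open_symmetric_udp_socket (const char *local_ip, unsigned short local_port)
{
  struct sockaddr_in local;
  int fd = socket (AF_INET, SOCK_DGRAM, 0);

  if (fd < 0)
    return -1;
  memset (&local, 0, sizeof local);
  local.sin_family = AF_INET;
  local.sin_port = htons (local_port);
  if (inet_pton (AF_INET, local_ip, &local.sin_addr) != 1 ||
      bind (fd, (struct sockaddr *) &local, sizeof local) < 0) {
    close (fd);
    return -1;
  }
  return fd;
}

/* Port the socket is bound to, i.e. the source port of everything we send
 * through it. Returns -1 on error. */
static int
bound_port (int fd)
{
  struct sockaddr_in addr;
  socklen_t len = sizeof addr;

  if (getsockname (fd, (struct sockaddr *) &addr, &len) < 0)
    return -1;
  return ntohs (addr.sin_port);
}
```

A peer that answers to the source address it observed via recvfrom() will reach this socket again, which is exactly what the connection tracking in the router relies on.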

It should be noted that this is a rather simple approach, but it works for most cases. In the future however we plan to support NAT traversal using the Interactive Connectivity Establishment technique.

You can find the relevant issue here.

Doing the encrypted thing

With Secure RTP (SRTP) we can secure our media streams.

SRTP can be used to provide confidentiality, message authentication and replay protection to both RTP and RTCP traffic.

Encrypting the RTP data using a block cipher (e.g. AES in Counter Mode or Galois/Counter Mode) provides confidentiality, while e.g. HMAC-SHA1 is used for message authentication.

For the security minded we must draw attention to the fact that we currently only support SDP Security Descriptions for Media Streams (SDES), which embeds the keys used for SRTP directly in the signalling (as part of the exchanged SDP) like this:

  m=audio 49170 RTP/SAVP 0
  a=crypto:1 AES_CM_128_HMAC_SHA1_32 inline:NzB4d1BINUAvLEw6UzF3WSJ+PSdFcGdUJShpX1Zj

The string after inline: is the base64 encoded representation of the key material (master key and salt). This is why you can only enable SRTP when using TLS for the signalling: without it, anyone able to observe the SIP messages could read the keys. Even with TLS enabled you should be aware that your server (and potentially any proxies your messages pass through) can still see the keys with this particular method for key exchange.
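As a rough illustration of what parsing such an attribute involves, the following plain C sketch splits an a=crypto line into its tag, crypto suite and base64 key material, and computes how many bytes the key material decodes to. CryptoAttr, parse_crypto_attr and base64_decoded_len are hypothetical names, and optional parts such as lifetime or MKI (separated by '|') are simply cut off here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct {
  unsigned tag;           /* identifies the attribute in offer and answer */
  char     suite[64];     /* e.g. AES_CM_128_HMAC_SHA1_32 */
  char     key_b64[128];  /* base64 encoded master key and salt */
} CryptoAttr;

/* Parse "a=crypto:<tag> <suite> inline:<base64>[|...]". */
static bool
parse_crypto_attr (const char *line, CryptoAttr *attr)
{
  return sscanf (line, "a=crypto:%u %63s inline:%127[^| \r\n]",
                 &attr->tag, attr->suite, attr->key_b64) == 3;
}

/* Number of bytes a padded base64 string decodes to (0 if malformed). */
static size_t
base64_decoded_len (const char *b64)
{
  size_t len = strlen (b64);
  size_t padding = 0;

  if (len == 0 || len % 4 != 0)
    return 0;
  if (b64[len - 1] == '=')
    padding++;
  if (b64[len - 2] == '=')
    padding++;
  return len / 4 * 3 - padding;
}
```

For the example above the 40 base64 characters decode to 30 bytes, which for the AES_CM_128 suites is a 16 byte master key followed by a 14 byte master salt.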

Supporting DTLS-SRTP and ZRTP is on our radar.

With this disclaimer out of the way, let us return to some technical details:

The bulk of changes took place in these two merge requests.

We decomposed the work needed into the following tasks:

Using GstSrtpEnc and GstSrtpDec in the media pipeline (key change)
Parsing and generating a=crypto SDP attributes (key change)
Using a=crypto attributes in the SDP offer/answer model (key changes)
Provide and use plumbing

The fruits of our labour can be seen in this last pipeline graph we want to show you: Note the presence (and base64 encoded “key” property) of GstSrtpEnc and GstSrtpDec inside the GstRtpBin!

Summary

I think it is fair to say that we covered quite a lot of ground: From the main protocols (SIP, SDP, RTP) and some code examples to a simple implementation, iterative improvements and finally supporting encrypted calls.

Additionally we want to point out that the knowledge (and code) we’ve developed will surely come in handy for implementing support for audio calls in Chatty using Matrix or video calls using the cameras of the Librem 5 (libcamera will provide easy to use GStreamer elements).

Finally, we hope you enjoyed reading this write-up as much as we enjoyed writing it (never mind the code).

Final words

The human behind this blog post is very happy to be able to make a living developing free software and being part of the community.

My thanks go to Purism, for the work that I am able to do here, the team that I have the absolute pleasure of working with and the vision that it pursues:

I can honestly say that the Librem 5 is the smartphone I always wanted to have (but more on that in the future)!

Lastly, special thanks also to everyone who contributes to the ecosystem – by writing code, providing translations, filing issues or being a user:

You are awesome!
