Adaptive loss concealment for Internet Telephony applications

H. Sanneck
GMD Fokus
Kaiserin-Augusta-Allee 31
D-10589 Berlin, Germany
sanneck@fokus.gmd.de

Abstract:

Today's Internet is increasingly used not only for e-mail, ftp and WWW, but also for interactive audio and video services (MBone [1]). However, the Internet as a datagram network offers only a ``best effort'' service, which can lead to excessive packet losses under congestion. Internet measurements have shown that the overall probability of losing a single packet is high, but that it drops significantly for the loss of several consecutive packets ([2], [3]).

In this paper we consider this Internet loss characteristic and the property of long-term correlation within a speech signal together, to mitigate the impact of packet losses. This is accomplished by an adaptive choice of the packetization interval of the voice stream at the sender. When a packet is lost, the receiver can use adjacent signal segments to conceal the loss to the user, because a high similarity can be assumed due to the adaptive packetization at the sender. The subjective quality of the proposed scheme as well as its applicability within the current Internet environment (high loss rates, common audio tools, standard speech codecs) are discussed.


This work was funded in part by the BMBF (German Ministry of Education and Research) and the DFN (German Research Network)

Background

To tackle the problem of lost packets (which for speech transmission means annoying signal drop-outs and a possibly disrupted conversation), different techniques have been proposed, which can be divided as follows:

Considering a scenario with the presence of numerous, low-bandwidth speech flows in the Internet, most of the approaches have scalability limitations, because they either introduce high per-flow state overhead in Internet routers (reservation), data overhead (redundancy mechanisms) or delay overhead (interleaving, receiver-based concealment).

Further compression of the speech signal leads to a reduction in the overall amount of data to be sent over the network, i.e. the payload per packet is reduced. Yet the number of packets stays the same (when maintaining the packetization time interval), thus inducing the same per-packet processing cost as before in the network.

Due to the nature of current speech codecs, neither adaptation of the coder output bitrate to the network conditions nor a layered distribution of the coder output signal is feasible. We therefore propose a scheme called Adaptive Packetization and Concealment (AP/C, [15]) and discuss its applicability in the Internet/MBone environment.
The paper is structured as follows:...

Adaptive Packetization

The part of the sender algorithm that interfaces to the audio device copies PCM samples from the audio input device to its input buffer. It then returns p(c), the position of the maximum of the auto-correlation function of an input segment of at least 2 pmax samples (pmax being the correlation window size and c the ``chunk'' number). Evaluation of the auto-correlation function starts at pmin, which constitutes a lower bound on possible chunk/packet sizes. The input buffer pointer is then moved by p(c) samples (thus constituting a ``chunk''), c is incremented, and if necessary new audio samples are fetched from the audio device.
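As an illustration, the chunking step described above can be sketched as follows. This is a minimal sketch, not the implementation used for the measurements: the function names autocorr_peak and chunk_sizes are hypothetical, a normalised auto-correlation over a window of pmax samples is assumed, and returning pmax for a silent window is an assumption made here for definiteness.

```python
import math


def autocorr_peak(x, start, p_min=30, p_max=160):
    """Lag in [p_min, p_max] with the largest normalised auto-correlation.

    Assumption: the window holds p_max samples, so the samples
    x[start : start + 2*p_max] are read. If no lag scores above zero,
    p_max is returned.
    """
    ref = x[start:start + p_max]
    e_ref = sum(s * s for s in ref)
    best_lag, best_score = p_max, 0.0
    for lag in range(p_min, p_max + 1):
        seg = x[start + lag:start + lag + p_max]
        e_seg = sum(s * s for s in seg)
        if e_ref == 0 or e_seg == 0:
            continue  # silent window: no periodicity can be measured
        score = sum(a * b for a, b in zip(ref, seg)) / math.sqrt(e_ref * e_seg)
        if score > best_score:  # strict '>' keeps the shortest of equal peaks
            best_lag, best_score = lag, score
    return best_lag


def chunk_sizes(x, p_min=30, p_max=160):
    """Split x into chunks and return the list of chunk sizes p(c)."""
    sizes, pos = [], 0
    while pos + 2 * p_max <= len(x):  # at least 2*p_max samples are needed
        p = autocorr_peak(x, pos, p_min, p_max)
        sizes.append(p)
        pos += p  # move the input buffer pointer by p(c) samples
    return sizes
```

For a strictly periodic test signal (for example a sawtooth with a period of 40 samples), every chunk returned by this sketch has the length of one period, whereas an all-zero input yields chunks of pmax samples.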

If no periodicity was found in the signal (i.e. the content of the "chunk" is unvoiced speech or noise), p(c) is close to pmax (Fig. 1). Thus, by applying a fixed bound pu (minimal length of a chunk classified as ``unvoiced'') to p(c) and p(c-1), as well as applying another bound $\Delta p$ to the first derivative of p(c), it is possible to detect speech transitions. The detection routine may run in parallel and can be combined with silence detection.
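The transition test can be sketched as below. The text does not spell out how the two bounds are combined, so the following is only one plausible reading: a chunk is taken as unvoiced when p(c) >= pu, and a transition is reported when the classification changes and the change in p exceeds the bound on the first derivative. The names classify and detect_transition and the value delta_p=20 are hypothetical.

```python
def classify(p, p_u=120):
    """Return 'u' (unvoiced) if the chunk size reaches the bound p_u, else 'v'."""
    return "u" if p >= p_u else "v"


def detect_transition(p_prev, p_cur, p_u=120, delta_p=20):
    """Return 'vu', 'uv' or None for the chunk pair (c-1, c).

    Assumption: delta_p bounds |p(c) - p(c-1)|, used here as the
    first derivative of p(c); its value is illustrative only.
    """
    if abs(p_cur - p_prev) <= delta_p:
        return None  # change too small to count as a transition
    kinds = classify(p_prev, p_u) + classify(p_cur, p_u)
    return kinds if kinds in ("vu", "uv") else None
```

With these illustrative settings, a jump from a chunk of 40 samples to one of 155 samples is reported as a vu transition, while a drift from 115 to 125 samples is ignored even though it crosses pu.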

To reduce the header overhead, which would be prohibitive for IP if every chunk were sent in its own packet, two consecutive chunks are combined into one packet (see Figs. 1 and 2; s(n): time-domain signal, n: sample number).


 
Figure 1:  AP/C sender operation: transition voiced / unvoiced




 
Figure 2:  AP/C sender operation: transition unvoiced / voiced


Adaptive packetization of speech transitions

If a voiced-to-unvoiced (vu) transition has been detected, the ``transition chunk'' is partitioned into two parts ca and cb (8a/b in Fig. 1) with p(ca) set to p(c-1) and p(cb)=p(c)-p(ca) (p(c) being the original chunk size). Note that if c mod 2 = 0, the chunk c-1 (no. 7 in Fig. 1) is sent as a packet containing just one chunk.

When an unvoiced-to-voiced (uv) transition has taken place, backward correlation of the current chunk with the previous one (no. 3 in Fig. 2) is tested, as the previous chunk may already contain voiced data (due to the forward auto-correlation calculation). If so, the previous chunk is again partitioned with p(cb-1)=pbackward(c-1) and p(ca-1)=p(c-1)-p(cb-1) (pbackward is the result of the backward correlation). Note that the above procedure can only be performed if c mod 2 = 0; otherwise the previous chunk has already been sent in a packet. A solution to this problem would be to always retain two unvoiced chunks and check whether the third contains a transition. However, the gain in speech quality when concealing would not justify the additional delay incurred.
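The two partitioning rules amount to simple arithmetic on the chunk sizes. The following sketch only restates them in code; the function names split_vu and split_uv are hypothetical, and the backward correlation result is passed in as an argument rather than computed.

```python
def split_vu(p_prev, p_cur):
    """vu transition: split chunk c into (ca, cb).

    p(ca) = p(c-1) and p(cb) = p(c) - p(ca).
    """
    p_a = p_prev
    p_b = p_cur - p_a
    if p_b <= 0:
        raise ValueError("transition chunk is not longer than chunk c-1")
    return p_a, p_b


def split_uv(p_prev, p_backward):
    """uv transition: split the previous chunk c-1 into (ca-1, cb-1).

    p(cb-1) = pbackward(c-1) and p(ca-1) = p(c-1) - p(cb-1).
    """
    p_b = p_backward
    p_a = p_prev - p_b
    if p_a <= 0:
        raise ValueError("backward correlation result exceeds chunk c-1")
    return p_a, p_b
```

For example, a voiced chunk of 40 samples followed by a transition chunk of 155 samples gives the parts (40, 115); an unvoiced chunk of 150 samples with a backward correlation result of 45 samples gives the parts (105, 45).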

Packet size distribution

With the above algorithm ``more important'' (voiced) speech is sent in smaller packets and thus the resulting loss impact/distortion is less significant than using fixed size packets of the same average length, even without concealment (assuming that the network's loss probability is independent of the packet size and the mean number of packets sent remains the same). To enable concealment at the receiver, it is necessary to transmit the intra-packet boundary between two chunks (i.e. p(c) of the first chunk in the packet) as additional information in the packet itself and the following packet.

 
Figure 3:  Normalized packet size distributions for four different speakers; n(l): number of packets of size l; N: overall number of packets


With our scheme, the packet size is now adaptive to the measured pitch period. Distributions of the packet size for four different speakers in Fig. 3 show that the parameter settings can accommodate a range of pitches, as their overall shapes are similar to each other (parameters were: pmin=30 samples (start offset point of the auto-correlation); pu=120; pmax=160; note that pmin <= l <= 2 pmax). The most common packets contain two voiced chunks (vv packets), as distributions are centered around a value that is twice the mean pitch period (i.e. the mean of voiced chunks).

Support for frame-based codecs

 
A major deployment problem (for Internet speech transmission in general) is that modern speech coders are designed to operate on (small) fixed size audio frames (e.g. F=10ms for G.729 [16], F=20ms for GSM [17] and F=30ms for G.723.1 [18]).

Fig. 4 shows the packetization, when speech boundaries found by the AP algorithm are used to associate frames of length F to the actual packets sent over the network. As AP packets overlap the frame boundaries, a significant amount of redundant data, as well as additional alignment information (si) needs to be transmitted (yet redundant data can be used in a possible concealment operation e.g. by overlap-adding it to the replacement signal). To allow analysis, we assume a constant AP packet size of l=kF+n, k being a positive integer.


 
Figure 4:  Packetization of a framed signal


The fragmentation data ``overhead'' associated with packet i can then be written as follows:

Per packet fragmentation data

For a sequence of N packets, this results in:

Overall fragmentation data

With F mod n = 0, we have Of = N(F-n). Fig. 5 shows the relative fragmentation data overhead Of' = Of / (l N) as a function of the AP packet size l and the frame size F for N=400. The surface of the graph gives an indication of the relative fragmentation overhead which can be expected for different speakers and ranges of packet sizes (Fig. 3).
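The fragmentation overhead can also be evaluated numerically. The sketch below does not reproduce the per-packet and overall expressions above; it rests on assumptions made here: packets of constant size l are laid back to back on the sample stream, frames start at multiples of F, and every packet carries in full each frame it overlaps. Under these assumptions it reproduces the special case Of = N(F-n) for F mod n = 0. The function names are hypothetical.

```python
def fragmentation_overhead(l, F, N):
    """Total fragmentation data O_f (in samples) for N packets of size l.

    Assumption: packet i covers samples [i*l, (i+1)*l) and must carry
    every frame of length F that it overlaps, in full.
    """
    total = 0
    for i in range(N):
        first = (i * l) // F           # index of the first frame overlapped
        last = ((i + 1) * l - 1) // F  # index of the last frame overlapped
        total += (last - first + 1) * F - l
    return total


def relative_overhead(l, F, N=400):
    """Relative fragmentation overhead O_f' = O_f / (l * N)."""
    return fragmentation_overhead(l, F, N) / (l * N)
```

As a check, l=180 and F=80 (k=2, n=20, so F mod n = 0) give an overhead of F-n = 60 samples per packet, i.e. one third of the packet size, and the overhead vanishes when l is a multiple of F.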


 
Figure 5:  Relative fragmentation overhead Of' for AP/C and frame-based codecs

Concealment

When detecting a lost packet (by keeping track of RTP [19] sequence numbers), the receiver can assume that the chunks of a lost packet resemble the adjacent chunks, because of the pre-processing at the sender. To avoid discontinuities in the concealed signal, the adjacent chunks are copied and resampled (using an up/down sampler with linear interpolator) to exactly fit the lost chunk sizes, which are given by the packet length and the transmitted intra-packet boundaries. No time-scale adjustment ([14]) is necessary as the chunk sizes are very small. As the sizes of the lost and the adjacent chunks (and thus their respective spectra) most probably differ only slightly, no significant audible impact of the operation can be observed. Fig. 6 shows the concealment operation in the time domain.
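The copy-and-resample step can be sketched as follows. This is an illustrative sketch rather than the implementation that was evaluated: the names resample_linear and conceal are hypothetical, and it is assumed here that the first lost chunk is replaced from the last chunk of the preceding packet and the second lost chunk from the first chunk of the following packet.

```python
def resample_linear(chunk, target_len):
    """Stretch or shrink chunk to target_len samples by linear interpolation."""
    if target_len <= 0 or not chunk:
        return []
    if len(chunk) == 1 or target_len == 1:
        return [float(chunk[0])] * target_len
    step = (len(chunk) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * step
        i = min(int(pos), len(chunk) - 2)  # left neighbour sample
        frac = pos - i                     # distance from the left neighbour
        out.append((1 - frac) * chunk[i] + frac * chunk[i + 1])
    return out


def conceal(left_chunk, right_chunk, lost_sizes):
    """Build replacements for the two chunks of a lost packet.

    Assumption: lost_sizes = (size of first lost chunk, size of second
    lost chunk), as derived from the packet length and the transmitted
    intra-packet boundary.
    """
    first = resample_linear(left_chunk, lost_sizes[0])
    second = resample_linear(right_chunk, lost_sizes[1])
    return first + second
```

Because the first and last samples of the source chunk are mapped onto the first and last samples of the replacement, the waveform shape within the chunk is preserved while its length is adjusted by a few samples.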

 
Figure 6:  Concealment of a distorted signal (50% loss)

Concealment of speech transitions


Transitions in the signal might lead to extreme expansion/compression operations, because the length of an unvoiced chunk of a transition packet (denoted v|u or u|v) will usually be significantly smaller than in u|u packets (two unvoiced chunks). This is due to the chunk partitioning described in section "Adaptive packetization of speech transitions".


 
Table 1:  Concealment of/with packets containing speech transitions
Left packet   Lost packet   Right packet   Expansion/compression
v | ua        uL u          -              ua << uL: expansion
-             u uL          ua | v         ua << uL: expansion
u ua          uL | v        -              ua >> uL: compression
-             v | uL        ua u           ua >> uL: compression
-             u (u|v)L      va v           va << (u|v)L: expansion
u (u|v)a      vL v          -              (u|v)a >> vL: compression

Table 1 lists the possible cases (va, ua are the relevant voiced/unvoiced available chunks; vL, uL are the relevant voiced/unvoiced lost chunks). A u (u|v) packet is a packet where the second chunk contains an unvoiced/voiced transition that was not recognized by the sender algorithm. To avoid high compression, adjacent samples of the relevant length are taken and inserted in the gap. An audible discontinuity which might occur can be avoided by overlap-adding the concealment chunk with the adjacent ones. High expansions can be avoided by repeating a chunk until the necessary length is achieved and then again overlap-adding it.
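Chunk repetition and overlap-adding can be sketched as below. The text does not specify the overlap length or the window shape, so a linear cross-fade over a caller-chosen number of samples is assumed here; the function names are hypothetical.

```python
def repeat_to_length(chunk, target_len):
    """Repeat chunk until target_len samples are available, then truncate."""
    reps = -(-target_len // len(chunk))  # ceiling division
    return (list(chunk) * reps)[:target_len]


def overlap_add(left, right, overlap):
    """Join left and right, cross-fading over `overlap` samples.

    Assumption: a linear ramp is used as the window. The result is
    `overlap` samples shorter than len(left) + len(right), so the
    concealment chunk has to be made correspondingly longer.
    """
    if overlap == 0:
        return list(left) + list(right)
    out = list(left[:-overlap])
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of the right-hand signal
        out.append((1 - w) * left[len(left) - overlap + k] + w * right[k])
    out.extend(right[overlap:])
    return out
```

A short available chunk ua can thus be extended to the length of a long lost chunk uL without an extreme resampling factor, and the cross-fade smooths the joins with the neighbouring chunks.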

Results

Subjective test

To evaluate the properties and performance of AP/C a subjective test was carried out. Test signals were the four signals (with different speakers) of approximately 10 seconds each, also used for the objective analysis (PCM 16 bit linear, sampled at 8 kHz). The new technique was compared against silence substitution (i.e. an adaptive packetization without concealment) and the simple receiver-based concealment algorithm ``Pitch Waveform Replication'' (PWR), which is the only one able to operate under very high loss rates (isolated losses). For PWR we used the same algorithm and fixed packet size (160 samples) as in [14].

Seven non-expert listeners [*] evaluated the overall quality of 40 test conditions (4 speakers x [3 algorithms x 3 loss rates + original]) on a five-category scale (Mean Opinion Score). Tests took place in a quiet room with the subjects using headphones.

The same packet loss pattern was applied to all input signals for one speaker (note that the sample loss pattern is different due to PWR working on fixed packet sizes only). To allow complete concealment and thus a relative evaluation of the algorithms, only isolated losses were introduced. Therefore we used a probability density function which satisfies the condition Pi(i|i-1) = 0 (Pi(i|i-1) is the conditional probability of packet i being lost when packet i-1 has been lost) and approximates at the same time an equally distributed loss behaviour with a given sample loss rate ([15]).
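A loss pattern containing only isolated losses can be generated, for instance, with a two-state Markov chain. This is a stand-in chosen here for illustration and not the density function of [15]: a packet is never dropped directly after a lost packet, and it is dropped with probability p = r/(1-r) after a received one, which yields a stationary packet loss rate r (for r <= 0.5). The function name isolated_loss_pattern is hypothetical.

```python
import random


def isolated_loss_pattern(n_packets, loss_rate, seed=0):
    """Return a list of booleans (True = packet lost) without consecutive losses.

    Assumption: two-state Markov chain with P(loss | previous lost) = 0 and
    P(loss | previous received) = loss_rate / (1 - loss_rate).
    """
    if not 0 <= loss_rate <= 0.5:
        raise ValueError("isolated losses allow a loss rate of at most 0.5")
    p = loss_rate / (1 - loss_rate)
    rng = random.Random(seed)
    lost, prev_lost = [], False
    for _ in range(n_packets):
        cur = (not prev_lost) and rng.random() < p
        lost.append(cur)
        prev_lost = cur
    return lost
```

Note that this sketch controls the packet loss rate; with variable-size packets the resulting sample loss rate has to be measured on the actual packet sizes.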

Before testing started, an ``Anchoring'' procedure took place, where the quality range (Original = 5, ``Worst Case'' signal = 1) was introduced. For this test we used the unconcealed 50% loss signal (with AP) as the ``Worst Case'' signal.

Figures 7-9 show the mean MOS values for the three algorithms (Silence Substitution, Pitch Waveform Replication and AP/C). Figure 10 gives the respective standard deviations of the MOS. As loss values we give the actual sample loss rate instead of the packet loss rate, as we deal with variable size packets. The pitch frequency axis refers to the measured mean of voiced chunks.

It can be seen that for all speakers AP/C leads to a significant enhancement in speech quality compared to the ``silence substitution'' case, and that the relative gain tends to increase with higher loss rates. However, for speakers with a very high pitch frequency, the relative performance decreases. A reason for this is the chosen start offset point pmin (= 30 samples) of the auto-correlation computation, which constitutes a lower bound on the chunk/packet size to avoid excessive packet header overhead, but also limits the accuracy of the periodicity measurement (note the small distance between the peak of the packet size distribution and the lower bound in Fig. 3 for ``female high''). Additionally, ``female high'' has the highest MOS for the worst case signal (2.0). This is due to the adaptive packetization (a high number of small-length ``gaps'' are introduced into the signal which are less intelligible).

The PWR algorithm performs well for loss rates of about 20% (cf. [14]); however, speech quality drops significantly for higher loss rates, as the specific distortions introduced by that algorithm become increasingly audible.


 
Figure 7:  MOS for ``Silence Substitution''

Figure 8:  MOS for ``Pitch Waveform Replication''

Figure 9:  MOS for ``Adaptive Packetization/Concealment''

Figure 10:  Standard deviations of MOS values

Subjective tests have been performed with PCM samples; this carries the implicit assumption that the speech immediately after the gap is decoded properly. However, modern speech coders rely on synchronization of coder and decoder, which is lost during a packet loss gap ([20]). Decoding is therefore degraded after the gap because of the preceding loss of coder state.

Objective measurements are clearly inappropriate for PWR (it does not aim at a mathematical approximation of the missing signal segments). AP/C is not a reconstruction scheme either; however, the adaptive packetization and subsequent resampling should perform better concerning mathematical correctness. Calculated overall SNR values for PWR (for the examples which are presented in this paper) are always below those for the distorted signal. SNR values for AP/C are always above those for the distorted signal and at least 4dB higher than for PWR. This confirms our conjecture, yet conclusions about speech quality should only be based on the subjective test results.

Delay overhead

The maximum additional delay introduced in the current implementation consists of

The computational complexity is low at sender and very low at the receiver as only simple operations (auto-correlation, sample rate conversion) have to be performed (thus dC << dS + dR). This makes the scheme well suited for multicast environments with low-end receivers.

Data overhead

Table 2 gives the packet header overhead for different speakers, based on the sum of actually measured packet sizes. For a low average pitch frequency (i.e. a long mean pitch period), we see that the overhead is comparable to that of a typical parameter setting in IP networks (160 bytes (=20ms) of G.711 PCM audio in an IP/UDP/RTP packet [20+8+12 bytes header], resulting in 20% packet header overhead). However, it increases as the mean pitch period decreases. But even for higher-pitched voices the additional packet header overhead stays below 10%, which is comparable to adding a very low bitrate additional source coding to reconstruct isolated losses ([9]). Table 2 also shows that the mean value pv of the chunks classified as voiced can be used as an estimate for an adaptive packetization "equivalent" packet size.
 


Table 2:  Relative cumulated header overhead O for AP assuming o=40 bytes per-packet overhead for four different speakers (mean pitch period: pv)
Speaker       pv [samples]   o/(o+2pv) [%]   O [%]
male low      79.20          20.16           20.14
male high     67.05          22.97           22.83
female low    57.74          25.72           24.84
female high   49.88          28.62           27.98
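The estimate in the third column of Table 2 follows directly from the per-packet overhead o and the mean voiced chunk size pv. A minimal sketch, assuming one byte per sample (as in the G.711 comparison case) and two chunks per packet; the function name is hypothetical:

```python
def header_overhead_percent(p_v, o=40):
    """Estimated header overhead o / (o + 2*p_v) in percent.

    Assumption: a packet carries two chunks of p_v samples each,
    one byte per sample, plus o bytes of IP/UDP/RTP header.
    """
    return 100.0 * o / (o + 2 * p_v)


if __name__ == "__main__":
    for name, p_v in [("male low", 79.20), ("male high", 67.05),
                      ("female low", 57.74), ("female high", 49.88)]:
        print(f"{name:12s} {header_overhead_percent(p_v):6.2f} %")
```

For pv = 80 samples, i.e. a packet of 160 bytes, this gives the 20% overhead of the fixed-size comparison case.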


To support a possible concealment operation it is necessary to transmit the intra-packet boundary between two chunks as additional information in the packet itself and the following packet. That amounts to two octets of ``redundancy'' for every packet, that could e.g. be transmitted by the proposed redundant encoding scheme ([21]).

When the frame length F is significantly smaller than the mean packet size (section "Support for frame-based codecs", Fig. 5), support for frame-based codecs can be assured with a reasonable amount of additional data.

Conclusions

A technique for the concealment of lost speech packets has been presented. The core idea of preprocessing a speech signal at the sender to support possible concealment operations at the receiver has proven to be successful. It results in an inherent adaptation of the network to the speech signal, as predefined portions of the signal (``chunks'' assembled to packets) are dropped under congestion.

Backwards compatibility with existing audio tools is ensured, as most tools can properly receive variable-length PCM packets (and then mix them into their output buffer); however, delay adaptation algorithms might need to be modified.

The subjective quality when using AP/C in conjunction with existing frame-based codecs needs to be evaluated in further subjective tests. However, a more efficient scheme integrating the coder and appropriate packetization should be devised. We also plan to test more sophisticated speech classification / processing algorithms, while always taking into account the compromise between quality and computational complexity.

From the perspective of the network, the presented application level scheme could be complemented by influencing loss patterns at congested routers (queue management), thus also supporting more fairness between flows by avoiding bursty losses within one flow.

References

1
V. Kumar,
``The MBONE FAQ'',
http://www.mbone.com/mbone/mbone.faq.html, January 1997.

2
J.-C. Bolot, H. Crépin, and A.V. Garcia,
``Analysis of audio packet loss in the Internet'',
in Proceedings of the 5th International Workshop on Network and Operating System Support for Digital Audio and Video, Durham, NH, April 1995, pp. 163-174.

3
M. Yajnik, J. Kurose, and D. Towsley,
``Packet loss correlation in the MBone multicast network'',
in Proceedings IEEE Global Internet 1996 (Jon Crowcroft and Henning Schulzrinne, eds.), London, England, November 1996, pp. 94-99.

4
J.-C. Bolot and A.V. Garcia,
``Control mechanisms for packet audio in the Internet'',
in Proceedings IEEE Infocom '96, San Francisco, CA, April 1996, pp. 232-239.

5
T. Turletti, S. Fosse Parisis, and J.-C. Bolot,
``Experiments with a layered transmission scheme over the Internet'',
Research report 3296, INRIA, November 1997.

6
S. McCanne, V. Jacobson, and M. Vetterli,
``Receiver-driven layered multicast'',
in Proceedings ACM SIGCOMM '96, Stanford, CA, September 1996, pp. 117-130.

7
R. Braden, D. Clark, and S. Shenker,
``Integrated services in the Internet architecture: an overview'',
RFC 1633, IETF, 1994,
ftp://ds.internic.net/rfc/rfc1633.txt.

8
N. Shacham and P. McKenney,
``Packet recovery in high-speed networks using coding and buffer management'',
in Proceedings ACM SIGCOMM '90, San Francisco, CA, June 1990, pp. 124-131.

9
V. Hardman, M. Sasse, M. Handley, and A. Watson,
``Reliable audio for use over the Internet'',
in Proceedings Inet 95, http://info.isoc.org/HMP/PAPER/070/abst.html, 1995.

10
M. Podolsky, C. Romer, and S. McCanne,
``Simulation of FEC-based error control for packet audio on the Internet'',
in Proceedings IEEE Infocom, San Francisco, CA, March 1998, pp. 48-52.

11
J. Rosenberg and H. Schulzrinne,
``An RTP Payload Format for Generic Forward Error Correction'',
Internet Draft, IETF Audio-Video Transport Group, November 1997,
ftp://ds.internic.net/internet-drafts/draft-ietf-avt-fec-01.txt.

12
C. Perkins,
``Options for repair of streaming media'',
Internet Draft, IETF Audio-Video Transport Group, January 1998,
ftp://ds.internic.net/internet-drafts/draft-ietf-avt-info-repair-02.txt.

13
N.S. Jayant and S.W. Christensen,
``Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure'',
IEEE Transactions on Communications, vol. COM-29, no. 2, pp. 101-109, February 1981.

14
H. Sanneck, A. Stenger, K. Ben Younes, and B. Girod,
``A new technique for audio packet loss concealment'',
in Proceedings IEEE Global Internet 1996 (Jon Crowcroft and Henning Schulzrinne, eds.), London, England, November 1996, pp. 48-52.

15
H. Sanneck,
``Concealment of lost speech packets using adaptive packetization'',
To appear in Proceedings IEEE Multimedia Systems 1998, Austin, TX, June 1998.

16
International Telecommunications Union,
``Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP)'',
ITU-T Recommendation G.729, March 1996.

17
J. Degener,
``GSM 06.10 lossy speech compression'',
Documentation, TU Berlin, KBS, October 1996,
http://kbs.cs.tu-berlin.de/~jutta/toast.html.

18
International Telecommunications Union,
``Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s'',
ITU-T Recommendation G.723.1, March 1996.

19
H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson,
``RTP: A transport protocol for real-time applications'',
RFC 1889, IETF Audio-Video Transport Group, January 1996,
ftp://ds.internic.net/rfc/rfc1889.txt.

20
J. Rosenberg,
``G.729 error recovery for Internet Telephony'',
Project report, Columbia University, 1997.

21
C. Perkins et al,
``RTP payload for redundant audio data'',
Internet Draft, IETF Audio-Video Transport Group, September 1997,
ftp://ds.internic.net/rfc/rfc2198.txt.

[*] Note to reviewers: additional tests with seven other listeners are currently being carried out.