Tuesday, February 9, 2010

VoIP - Speech Transmission

3. Speech Transmission 

 3.1 Analogue Speech Transmission 

Analogue telephony uses the handset's microphone to convert speech, i.e., sound waves, into fluctuations of current that are transported over the wires of the PSTN. The speaker in the distant party's handset reconverts this modulated current into audible sound. In a normal phone call, one party talks most of the time while the other listens. Nevertheless, signals are transmitted constantly in both directions, as each party is able to interrupt the other at any time. This transmission type is also called full duplex transmission. Analogue transmission meets the requirements of human speech transmission well, but it is neither noise resistant nor efficient in its use of transmission resources. In the early days of the legacy analogue PSTN, line amplifiers were used to refresh the signal on its way from source to destination. Unfortunately, system noise was amplified too, which often led to very poor voice quality. 

3.2 Digital Speech Transmission 

In digital networks, such as ISDN, system noise no longer plays a significant role, because digital repeaters are used to perfectly restore the transmitted signal. This is one of the key benefits of digital transmission, which uses only logical "ones" and "zeroes", i.e., voltage or no voltage, to code the original signal. These benefits soon encouraged network infrastructure providers and operators to switch over to Pulse Code Modulation systems, or PCM. 

Analogue speech signals are sampled at a frequency of 8 kHz, converted into a sequence of "ones" and "zeroes" and transmitted to the distant party. It is not necessary to process the entire human voice spectrum between roughly 120 Hz and 10 kHz, as this would require a lot of bandwidth. A limited spectrum between 300 Hz and 3.4 kHz is enough to recognize the speaker and to reproduce the voice comfortably and intelligibly. 

Speech samples are coded with 8 bits, resulting in a data rate of 64 kbps. The Pulse Code Modulation used in ISDN does not apply linear quantisation: small signals, i.e., quiet speech, are converted with more accuracy than large signals representing loud speech. 

The G.711 speech codec family defines two major codec types: a µ-Law codec used in the Americas and an A-Law type used in Europe. Both codecs apply a logarithmic companding method to convert an analogue sample into a digital word of 8 bits, but use different scales for amplitude quantisation. In both cases 2^8 = 256 different signal values are retrieved per 125 µs time frame. This is sufficient for good speech quality.
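This non-linear quantisation can be illustrated with a short Python sketch. It uses the continuous µ-law companding curve with µ = 255; a real G.711 encoder uses a segmented 8-bit approximation of this curve, so the sketch shows the principle, not the exact bit format:

```python
import math

MU = 255  # µ-law companding parameter

def mu_law_compress(x: float) -> float:
    """Map a normalized sample x in [-1, 1] onto the companded range [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def quantize_8bit(y: float) -> int:
    """Quantize a companded value to one of 2^8 = 256 levels."""
    return max(-128, min(127, round(y * 127)))

# The same 0.01-wide amplitude interval covers many more quantisation steps
# near zero (quiet speech) than near full scale (loud speech):
quiet = quantize_8bit(mu_law_compress(0.02)) - quantize_8bit(mu_law_compress(0.01))
loud = quantize_8bit(mu_law_compress(0.99)) - quantize_8bit(mu_law_compress(0.98))
print(quiet, loud)  # quiet speech gets noticeably finer resolution
```

The logarithmic curve is what gives quiet speech its extra accuracy: equal output steps map to much smaller input steps near zero.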

3.3 Basic Standards for Audio Coding (1/3) 

IP telephony also starts with the analogue human voice signal. This signal is digitized by an analogue-to-digital converter. Together with a compression algorithm for bandwidth-efficient data transport, this forms the encoder part of the digital signal transmitter. In the next step, dedicated portions of the bit stream are formed into Internet Protocol packets. This is called "packetizing". These packets are then sent through the IP network to their final destination. The packets can be routed to the receiver in different ways, which might lead to varying delays in packet reception. At the receiving end, a Jitter Buffer handles the packets initially and sends on a constant stream of packets in the original sequence for further processing. 

Delayed or missing packets can severely decrease speech quality, although the Jitter Buffer is able to compensate to a certain extent. Finally, the speech packets are reconverted into an audible analogue signal. 
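The reordering job of a jitter buffer can be sketched in a few lines of Python. The structure and names below are illustrative, not taken from any particular VoIP stack:

```python
import heapq

class JitterBuffer:
    """Minimal jitter buffer sketch: collects packets that may arrive out of
    order and releases them strictly in sequence-number order."""

    def __init__(self):
        self.heap = []      # min-heap keyed on sequence number
        self.next_seq = 0   # next sequence number due for playout

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        # Discard duplicates or packets that arrived too late to play.
        while self.heap and self.heap[0][0] < self.next_seq:
            heapq.heappop(self.heap)
        if self.heap and self.heap[0][0] == self.next_seq:
            self.next_seq += 1
            return heapq.heappop(self.heap)[1]
        self.next_seq += 1  # packet missing: the decoder must conceal it
        return None

buf = JitterBuffer()
for seq in (2, 0, 1, 3):                   # packets arrive out of order
    buf.push(seq, f"frame-{seq}".encode())
print([buf.pop() for _ in range(4)])       # restored to the original sequence
```

A `pop()` that returns `None` corresponds to the "missing packet" case mentioned above, which the decoder has to conceal.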

3.3 Basic Standards for Audio Coding (2/3) 

Real-time speech transmission supported by IP resources requires a dedicated coding system. The International Telecommunication Union (ITU) has developed a set of coding schemes suitable for different network bandwidths. These coding schemes - named codecs (coder-decoders) - are part of the H.323 protocol family and use different converting and reconverting algorithms to transform signals from one form into the other. Different degrees of compression are available to provide different levels of speech quality. Redundancy information can be added to the original signal to facilitate decoding and provide a certain amount of protection against transmission errors.

Coding techniques reduce the original signal's data rate for efficient transmission without an audible decrease in speech quality. This is achieved by the use of highly sophisticated algorithms - including predictive techniques - and by adapting these techniques to the specific characteristics of the human voice. 

3.3 Basic Standards for Audio Coding (3/3) 

The need for bandwidth-saving speech transmission becomes obvious when we realize that many branch offices, for example, are connected to their main company by a WAN connection with a data rate of only 128 kbps. Uncompressed speech transmission would consume large parts of the available bandwidth, leaving nothing for parallel application data transfer or other speech connections. A variety of codecs can be applied to ensure a good compromise between speech quality and efficient bandwidth use. 

Packet-oriented data transmission suffers from many problems: delays, echoes, wrong packet sequencing and, finally, packet loss. State-of-the-art codecs cope with all these problems and achieve speech quality in packet data networks that is comparable to legacy circuit-switched networks. The best-known codecs process signals according to the ITU G.711, G.723.1, G.729 and G.729A standards. 

3.3.1 Mean Opinion Score 

In order to evaluate the performance of speech transmission under real conditions, a measurement process for speech quality had to be defined. This would make it possible to allocate a rating and thus compare the different codecs. In a first approach, only a coarse assessment of the speech codecs' quality and efficiency was defined. The human ear and brain can compensate for poor speech quality to a certain extent - facets of human physiology that cannot easily be reproduced by technical means. Thus, only subjective measurements of speech quality can be used. 

ITU procedures define a test scenario in which a mixed audience is exposed to various speech samples under defined conditions. Speech quality is rated taking different aspects into account, such as the speaker's understandability, a dialogue test, subjective loudness perception and the effort needed to understand a minimum level of information. Together, the results give an index that represents the test candidates' opinion of the codec type being tested. This index is called the Mean Opinion Score, or MOS. The MOS can range from 1.0 (very bad) to 5.0 (excellent). Legacy circuit-switched ISDN telephony has a MOS of 4.4. 

3.3.2 G.711 

The G.711 codec is the type used most often for telephone networks. Analogue signals are sampled 8,000 times per second and each sample is coded with 8 bits. One sample represents a block of 125 µs of speech information. 

G.711 supports uncompressed speech transmission of high quality, with a MOS value of 4.4. However, its high 64 kbps bit rate makes it unsuitable for IP-based voice transmission over WANs unless it is further compressed. 

Nevertheless, G.711 support is mandatory for all IP phones. Each IP packet contains 10 ms of speech information, hence 100 packets are transmitted per second. Each 10 ms packet contains 640 bits or 80 bytes of speech payload, and 100 packets per second correspond to a bit rate of 64 kbps. 
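These figures follow directly from the sampling parameters and can be checked with a few lines of Python (payload only; packet headers are ignored here):

```python
SAMPLE_RATE = 8000     # G.711 samples per second
BITS_PER_SAMPLE = 8
PACKET_MS = 10         # speech carried per IP packet

bits_per_packet = SAMPLE_RATE * BITS_PER_SAMPLE * PACKET_MS // 1000
packets_per_second = 1000 // PACKET_MS
payload_bit_rate = bits_per_packet * packets_per_second

print(bits_per_packet, bits_per_packet // 8)  # 640 bits = 80 bytes per packet
print(packets_per_second, payload_bit_rate)   # 100 packets/s, 64000 bps
```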

3.3.3 G.723.1 

G.723.1 overcomes the G.711 efficiency problem as it defines a coding technique with a much lower bit rate. 30 ms frames of speech payload result in a data rate of either 5.3 or 6.3 kbps. The MP-MLQ type (Multi-Pulse Maximum Likelihood Quantization) delivers better quality at 6.3 kbps and a MOS of 3.9. The ACELP type (Algebraic Code-Excited Linear Prediction) at 5.3 kbps has a MOS of 3.65 - still an acceptable speech quality. G.723.1 is dedicated especially to transmission systems that only provide low bandwidth, such as Wide Area Networks and the Internet. Its more sophisticated compression algorithm means that an initial delay of 37.5 ms must be accepted. Additionally, G.723.1 provides silence compression, a technique which transmits comfort noise at a much reduced bit rate when no speech information is being transferred. This reduces the bandwidth used. Speech is transported in 30 ms IP data packets - hence about 33 packets must be transmitted each second. G.723.1 MP-MLQ coding at 6.3 kbps carries a speech payload of 24 bytes per packet. 
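The 24 byte payload quoted above can be verified with a short Python calculation; the same arithmetic gives 20 bytes for the 5.3 kbps mode:

```python
import math

FRAME_MS = 30  # G.723.1 frame length

# Payload per 30 ms frame at the two G.723.1 bit rates, rounded up to bytes:
for bit_rate in (6300, 5300):
    payload_bits = bit_rate * FRAME_MS // 1000
    print(bit_rate, payload_bits, math.ceil(payload_bits / 8))

print(1000 // FRAME_MS)  # about 33 packets per second
```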

3.3.4 G.729 

G.729 achieves a bit rate of 8 kbps without any significant quality loss. A 10 ms packet carries 80 bits or 10 bytes of speech payload. G.729A uses a less complex algorithm that leads to a slightly lower MOS of 3.9. Both versions generate frames of 10 ms length and a delay of only 15 ms due to the compression type used. 

G.729A is the recommended speech codec for VoIP systems. It has two variants - G.729B and G.729AB - which also provide additional silence compression for bandwidth reduction during speech pauses. 

3.4 Codec and Bandwidth (1/2) 

Bandwidth dimensioning in IP networks should support a calculated standard bit rate for the expected speech and application data flow. Overload situations can be handled with additional quality assurance procedures. 

Let's look at an example: a 10 ms G.729 codec frame is encapsulated in an IP packet and sent through a standard Ethernet network. A 66 byte overhead is used for 10 bytes of speech payload, which corresponds to a ratio of 87 % to 13 % for overhead and payload per Ethernet frame. 66 bytes plus 10 bytes result in 76 bytes or 608 bits per 10 ms - that is a gross bit rate of 60.8 kbps. For a full duplex link, twice this value, i.e., around 120 kbps of bandwidth, must be assigned. 

3.4 Codec and Bandwidth (2/2) 

This seems quite bandwidth-inefficient, as each subsequent 10 byte speech payload comes with a dedicated overhead. To solve the problem, more than one speech frame needs to be transmitted per IP packet. 

How can we increase the payload per Ethernet frame? 

Let's try to pack 2 frames of speech information of 10 ms each coming out of a G.729 codec and calculate the resulting bandwidth. 66 bytes of overhead now carry 2 x 10 bytes of payload, hence 86 bytes per frame or 688 bits. As two 10 ms speech frames are carried, the number of packets per second is halved. This gives us 50 packets. 50 packets per second at 688 bits correspond to a bit rate of 34.4 kbps, or 68.8 kbps for the full duplex line capacity. Thus, a significant bandwidth reduction of more than 40 % has been achieved. 
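The calculation generalizes to any number of G.729 frames per packet. A small Python sketch, using the 66 byte per-packet overhead from the example:

```python
OVERHEAD_BYTES = 66  # Ethernet + IP + UDP + RTP overhead per packet
FRAME_BYTES = 10     # one 10 ms G.729 frame
FRAME_MS = 10

def full_duplex_kbps(frames_per_packet: int) -> float:
    """Gross full-duplex bandwidth when packing several frames per packet."""
    packet_bits = (OVERHEAD_BYTES + frames_per_packet * FRAME_BYTES) * 8
    packets_per_second = 1000 / (frames_per_packet * FRAME_MS)
    return 2 * packet_bits * packets_per_second / 1000

print(full_duplex_kbps(1))  # one frame per packet: ~121.6 kbps
print(full_duplex_kbps(2))  # two frames per packet: ~68.8 kbps
```

Packing more frames keeps shrinking the bandwidth, at the cost of the extra packetizing delay discussed in the next section.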

3.5 Codec Bandwidth and Delay 

When we are looking for the most efficient use of bandwidth, the aim should be to pack the maximum number of speech samples into each single IP packet. However, this approach is limited by the available network connection speed, which might require smaller data blocks. This is shown clearly in the following example: on a 28.8 kbps modem-type trunk, an 86 byte data packet needs about 23.9 ms for transmission, whereas on a 100 Mbps 100Base-T Ethernet link only about 6.9 µs are needed. 
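These transmission (serialization) times follow directly from packet size and link speed:

```python
def serialization_delay_s(packet_bytes: int, link_bps: float) -> float:
    """Time in seconds needed to clock one packet onto the wire."""
    return packet_bytes * 8 / link_bps

print(serialization_delay_s(86, 28_800) * 1000)      # ~23.9 ms on a modem trunk
print(serialization_delay_s(86, 100_000_000) * 1e6)  # ~6.9 µs on 100Base-T
```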

In both scenarios, additional time is needed for packetizing, as the IP packet has to wait until the last speech sample has been encapsulated before network transmission can start. For example, including 3 additional speech frames in the same IP packet produces a waiting time of 30 ms, with 10 ms more waiting time for every further speech frame. Real-time IP applications, such as VoIP, are very delay sensitive. All delays - that is, speech coding and compression, packetizing and transmission - have to be controlled to keep the overall delay at a level where speech quality is acceptable. 

3.6 Voice Activity Detection 

In a phone call, normally one party speaks while the other one listens to the speaker. As a standard PSTN link provides 64 kbps full duplex channels, half of the capacity is usually wasted during this kind of communication. 

Voice over IP makes these resources available for other purposes if Voice Activity Detection is activated. 

How does this work? VAD measures the loudness of the speaker's voice and decides when to stop speech frame packetizing. Before doing so, VAD waits for a fixed period of time, typically 200 ms. In very noisy environments, VAD might have problems distinguishing between speech and background noise. At the start of the call, a defined signal-to-noise ratio, also called the signal threshold, is used to decide whether to automatically activate or de-activate VAD operation. 

If the VAD procedure detects the absence of voice and simply switches off speech information transmission to the distant party, they might think that the call has been interrupted. In order to avoid this, the system inserts an artificial noise, called Comfort Noise, to make the listening party believe that the link is still present. 
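A toy version of such a detector can be sketched in Python. The energy threshold and the 200 ms hangover value are illustrative choices, not taken from any standard:

```python
THRESHOLD = 0.01      # mean-square energy above which a frame counts as speech
HANGOVER_FRAMES = 20  # keep transmitting ~200 ms (20 x 10 ms frames) after speech

def detect(frames):
    """Yield a transmit/suppress decision for each frame of audio samples."""
    hangover = 0
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy > THRESHOLD:
            hangover = HANGOVER_FRAMES  # active speech resets the hold timer
            transmit = True
        else:
            transmit = hangover > 0     # still inside the hangover period?
            hangover = max(0, hangover - 1)
        yield transmit

speech = [0.5, -0.4] * 40    # one loud 80-sample frame
silence = [0.001] * 80       # near-silent frames
decisions = list(detect([speech] + [silence] * 25))
print(decisions.count(True))  # 21: the speech frame plus 20 hangover frames
```

During the suppressed frames a real system would send comfort-noise parameters instead of simply going silent.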

3.7 Echo Cancellation 

An echo might be amusing enough when we are hiking in the mountains, but it is certainly a nuisance during a phone call and might even make communication impossible. Hearing one's own voice in the handset is a normal effect that reassures the speaker. But a delay between the direct sound and the echo of more than 25 ms might cause interruptions and severely disturb the rhythm of communication. 

Legacy PSTNs have to cope with echoes produced by impedance mismatches where the signal leaves the 4-wire system of the switch to enter the local 2-wire trunk to the end user. Very careful impedance supervision at the reflection points and the use of echo cancellers help to solve the problem. 

In VoIP systems, echo cancellation is included with the codecs in the DSPs, or digital signal processors. Hence, a software-based solution is used for IP soft-phones and the gateways of the network. For a certain period of time, the DSP stores an inverse pattern of the sent speech sample. It then listens to the sound reflected by the distant party and subtracts the stored signal, achieving a near-zero amplitude for the disturbing sound. The majority of echo cancellers cope with echoes of up to 32 ms; that is the maximum storage time for the speech sample. 
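The stored-replica idea can be sketched as a toy Python example. Here the echo path is modelled as a fixed, known delay and attenuation; a real canceller has to estimate the path adaptively (for example with an LMS filter):

```python
DELAY = 3   # echo delay in samples (within the canceller's storage window)
GAIN = 0.5  # attenuation of the reflected signal

def echo_path(far_end):
    """What the distant end feeds back: a delayed, attenuated copy."""
    return [0.0] * DELAY + [GAIN * s for s in far_end]

def cancel(mic, far_end):
    """Subtract the stored replica of the sent signal from the returned sound."""
    replica = echo_path(far_end)
    return [m - r for m, r in zip(mic, replica)]

far_end = [0.9, -0.3, 0.6, 0.2, -0.8]  # speech sent to the distant party
mic = echo_path(far_end)               # the pure echo that comes back
residual = cancel(mic, far_end)
print(max(abs(s) for s in residual))   # near-zero residual amplitude
```

Because the modelled path and the stored replica match exactly here, the residual is zero; in practice the adaptive estimate only approximates the true echo path.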
