Spend time with those involved in designing and operating videoteleconference (VTC) systems and you will quickly discover that most people in that arena consider video quality to be MUCH more important than audio quality. They can often recite from memory all the specs about how many pixels of resolution the video has, what the refresh rate is, etc. But ask them about the audio and you frequently get a response like, "Uh, it's pretty good." As musicians, we make many more critical decisions with our ears than we do with our eyes. Watch a talented musician teach sometime, and you will notice they often make comments about the student's technique without even seeing the technique. They can HEAR the difference between techniques. For musical purposes, then, the audio quality is paramount and the video quality is often secondary in importance. To understand the type of VTC system required to deliver the high-quality audio that music demands, it is first helpful to understand a bit about "how" we hear what we hear.
It is generally accepted that the frequency range of human hearing is approximately 20 Hertz (20Hz) at the bottom end up to around 20,000Hz (20kHz) at the top end. However, the type of information our brain perceives tends to change depending on where the sounds occur in that frequency spectrum. We can generally divide our hearing frequency spectrum into thirds. There is substantial overlap between these three frequency bands, but we tend to use the sonic information we receive in different ways depending on the frequency range within which the sound falls (figure 1).
Approximately the bottom third of our hearing range is where we detect what we commonly refer to as pitch. In this frequency range, we can easily ascertain whether one sound is higher or lower in frequency (pitch) than another sound. The middle third of our hearing spectrum is where we typically detect timbre or tone color. If two sounds of different frequencies are played in this range, we often have a hard time telling which one is higher or lower in frequency, but one sound will typically seem brighter or darker in tone than the other. We may not even realize that we are using the upper third of our hearing range, but it is where we sense the ambience or presence of a sound in a space.
What We Hear
At the collegiate and professional level of music performance and training, the fact that a student is playing the right notes is usually a given. What we are more interested in at this level is the quality of the sound a performer is making. The frequencies that give us that information are found in the middle and upper ranges of our hearing frequency spectrum. Thus, a VTC system that does not convey that frequency information will not allow musicians to make the type of critical, artistic, and aesthetic decisions that are needed for high-level teaching and performing.
Today's commercial VTC systems are designed primarily to convey speech from one location to another. Because electronics, microphones, and loudspeakers tend to exaggerate some of the sibilant and consonant sounds in speech, a VTC system that restricts these higher frequencies actually helps increase the intelligibility of our speech. Unfortunately, that restriction also removes the type of information needed to accurately judge musical quality. In order to conduct high-quality musical interactions, a system that conveys the entire frequency spectrum is required.
When examining potential VTC systems, users are sometimes confused by the term "sampling rate" when used to describe the audio quality of a system. Many of the commercial VTC systems on the market advertise a sampling rate up to 22.05kHz, leading those unfamiliar with audio digitization to believe that these systems are capturing the entire audible frequency range. In audio digitization, however, there is a concept known as the Nyquist Theorem that basically states your sampling rate must be at least twice as high as the highest frequency you want to digitize. Thus, a sampling rate of 22.05kHz is only capable of capturing frequencies up to about 11kHz. In order to capture the entire 20Hz-20kHz frequency range, you need a sampling rate in excess of 40kHz. The high-bandwidth systems that offer this range typically begin with a 44.1kHz (CD quality) sampling rate and go up from there.
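The sampling-rate arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the 20kHz ceiling is the approximate upper limit of hearing quoted earlier.

```python
def nyquist_limit_hz(sampling_rate_hz: float) -> float:
    """Highest frequency (Hz) a given sampling rate can capture: half the rate."""
    return sampling_rate_hz / 2.0


def covers_full_hearing_range(sampling_rate_hz: float,
                              top_of_hearing_hz: float = 20_000.0) -> bool:
    """True if the Nyquist limit reaches the assumed top of human hearing."""
    return nyquist_limit_hz(sampling_rate_hz) >= top_of_hearing_hz


# Compare the rates mentioned in the text (48kHz added as a common higher rate).
for rate in (22_050, 44_100, 48_000):
    limit = nyquist_limit_hz(rate)
    verdict = "full range" if covers_full_hearing_range(rate) else "restricted"
    print(f"{rate} Hz sampling -> captures up to {limit:.0f} Hz ({verdict})")
```

A 22.05kHz system comes out at 11,025Hz, which is the "about 11kHz" figure given above, while 44.1kHz reaches 22,050Hz and therefore clears the 20kHz mark.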
Another issue that frequently arises with VTC systems is that they put the audio into a monaural, instead of stereophonic, format. This is convenient for distributing voices in a conference room, but it eliminates much of the spatial quality of the sound that we associate with the making of music. For best audio quality, you should look for a system that offers at least stereo audio with a 44.1kHz (or higher) sampling rate and a 16-bit (or greater) sample width. Currently, the only VTC systems that do so require the massive bandwidth available only on high-performance networks like Internet2.
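To get a feel for what those audio specifications mean in terms of data, the raw bit rate of uncompressed audio is simply sampling rate times sample width times number of channels. The sketch below is a rough illustration only: it assumes uncompressed (PCM) audio, and it ignores the video stream and network packet overhead, which account for much of the total bandwidth a real system needs. The function name is invented for this example.

```python
def pcm_bit_rate_bps(sampling_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw bit rate of uncompressed audio, in bits per second."""
    return sampling_rate_hz * bits_per_sample * channels


# Stereo, 44.1kHz, 16-bit: the minimum recommended in the text.
stereo_cd = pcm_bit_rate_bps(44_100, 16, 2)

# Mono, 22.05kHz, 16-bit: an assumed speech-oriented configuration for comparison.
mono_speech = pcm_bit_rate_bps(22_050, 16, 1)

print(f"Stereo 44.1kHz/16-bit: {stereo_cd / 1_000_000:.4f} Mbit/s")
print(f"Mono 22.05kHz/16-bit:  {mono_speech / 1_000_000:.4f} Mbit/s")
print(f"Ratio: {stereo_cd // mono_speech}x")
```

Doubling the sampling rate and doubling the channel count each double the data, so the music-quality configuration carries four times the raw audio data of the speech-oriented one.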
Every piece of equipment creates a bit of latency, or delay, in the data stream. As sound and light energy are captured by the microphones and the camera, that analog information is first converted to digital information and then converted into a format that can be transmitted on the Internet. Each of those conversion steps creates a bit of delay called encoding latency. The data stream then travels from one codec to another over the Internet, adding network latency. Once the data reach the destination codec, they are converted from Internet packets to digital audio and video, and then back to analog audio and video to be presented through the audio and video monitors. The delay added by that final stage is referred to as decoding latency (figure 2). In an interactive VTC session, that latency is essentially doubled: the person at the receiving end gets the audio and video about a half second or so after it was created, then responds to it, and the response gets back to the original person another half second or so after that. A number of projects are attempting to reduce the latency, but at present, VTC latency makes any attempt to tightly coordinate or play music together difficult if not outright impossible.
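The way these delays add up can be sketched as a simple latency budget. The millisecond figures below are illustrative assumptions, not measurements of any particular system; they were chosen only so that the one-way total lands near the "half second or so" described above. The tempo of 120 beats per minute is likewise just an example.

```python
# Assumed, illustrative stage delays in milliseconds (not measured values).
ENCODE_MS = 200   # analog -> digital -> Internet packets
NETWORK_MS = 50   # travel between the two codecs
DECODE_MS = 200   # packets -> digital -> analog


def one_way_latency_ms(encode_ms: float, network_ms: float, decode_ms: float) -> float:
    """Total delay from the performer to the listener at the far end."""
    return encode_ms + network_ms + decode_ms


def round_trip_latency_ms(one_way_ms: float) -> float:
    """Delay before the performer hears a response (same path assumed both ways)."""
    return 2 * one_way_ms


def beat_length_ms(tempo_bpm: float) -> float:
    """Duration of one beat at the given tempo."""
    return 60_000 / tempo_bpm


one_way = one_way_latency_ms(ENCODE_MS, NETWORK_MS, DECODE_MS)
round_trip = round_trip_latency_ms(one_way)
beats_late = round_trip / beat_length_ms(120)

print(f"One-way latency:    {one_way} ms")
print(f"Round-trip latency: {round_trip} ms")
print(f"At 120 bpm the response arrives {beats_late:.1f} beats late")
```

With delays of this size, a response arrives more than a full beat behind at a moderate tempo, which is why playing tightly together over such a link is so difficult.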
The amount of latency from one end to the other is determined by the speed and efficiency of the codecs, as well as the travel distance for the data. What surprises most people is that the greatest amount of latency occurs in the encode/decode stages and not in the travel time. Since most Internet traffic moves via fiber-optic cables as light pulses, the speed of light is a limiting factor in the travel time. Although there is room for a small amount of improvement due to network equipment efficiency, it is not expected that travel times will greatly improve in the near future. Codec efficiency, on the other hand, has much greater room for improvement as faster computer processors and more efficient encoding/decoding algorithms are created.
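A back-of-the-envelope calculation shows why travel time is the smaller part of the delay. The sketch below assumes light travels through fiber at the vacuum speed of light divided by a refractive index of about 1.47, a typical figure for optical fiber that is assumed here and not taken from the text. The 4,000km route is a hypothetical example, and real cable routes are longer than the straight-line distance and add router delays.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458   # speed of light in a vacuum
ASSUMED_FIBER_REFRACTIVE_INDEX = 1.47   # assumption: typical value for optical fiber


def fiber_one_way_delay_ms(route_km: float,
                           refractive_index: float = ASSUMED_FIBER_REFRACTIVE_INDEX) -> float:
    """Propagation delay in milliseconds over a fiber route of the given length."""
    speed_in_fiber_km_per_s = SPEED_OF_LIGHT_KM_PER_S / refractive_index
    return route_km / speed_in_fiber_km_per_s * 1000.0


# Hypothetical 4,000km route, roughly a cross-continental distance.
delay = fiber_one_way_delay_ms(4_000)
print(f"One-way propagation delay over 4,000 km of fiber: {delay:.1f} ms")
```

Under these assumptions the propagation delay is on the order of tens of milliseconds, a small fraction of a half-second end-to-end latency, which is consistent with the point that most of the delay comes from the encode and decode stages.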
One rather nasty byproduct of latency in the VTC environment is audio echo. Since both ends of the conference have microphones AND loudspeakers, it is quite easy for the audio that was created at one end to come out of the loudspeakers at the other end and get picked up by the microphones and sent right back to the originating source, only at a delay equal to the round-trip latency (figure 3). Most of today’s commercial VTC systems include an “echo-cancellation” function. However, those echo-cancellers only work with speech. Since speech tends to consist of short, choppy, non-repeated bursts of sound, it is relatively easy for a device to compare the outgoing audio with the incoming audio and look for matches. Then with a bit of electronic wizardry and frequency filtering, the echo-canceller attempts to remove the returning audio, or echo. Music, on the other hand, tends to consist of much longer, sustained sounds that need the benefit of the full frequency spectrum. If you attempt to use one of the commercial echo-cancellers on musical information, you will find the audio levels tend to jump up and down and exhibit a variety of frequency anomalies. In order to control echo in the musical VTC, a completely different approach is used that is similar in concept to the way a live-sound engineer prevents feedback in an amplified stage concert. Those techniques form the core of ECHODamp.
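The echo path itself is easy to model: what returns to the originating end is a delayed and usually quieter copy of what was sent. The sketch below illustrates only that path, not an echo-canceller and not the ECHODamp technique. The function name, the delay in samples, and the coupling gain (how much of the loudspeaker sound the far-end microphones pick up) are all assumptions made up for this example.

```python
def echo_return(sent, delay_samples, coupling_gain):
    """Return the signal that comes back to the sender over the echo path.

    sent:           list of audio sample values created at the originating end
    delay_samples:  round-trip latency expressed in samples (assumed value)
    coupling_gain:  fraction of loudspeaker level picked up by the far-end
                    microphones (assumed value, 0.0 means no echo)
    """
    returned = [0.0] * (len(sent) + delay_samples)
    for i, sample in enumerate(sent):
        returned[i + delay_samples] = coupling_gain * sample
    return returned


# A short burst of sound, echoed back three samples later at half level.
burst = [1.0, 0.5]
print(echo_return(burst, delay_samples=3, coupling_gain=0.5))
```

In a real session the delay is far longer than three samples: at a 44.1kHz sampling rate, a one-second round trip corresponds to 44,100 samples.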