The Levelator® was created to solve a number of problems associated with audio post-production.
Among its other activities, The Conversations Network published interviews and recordings of conferences and lectures on the Internet. We often received original audio files containing wide loudness variations, both overall and within a recording, such as between the participants in panel discussions, Q&A sessions and telephone interviews. These variations can be corrected by a skilled post-production audio engineer, but it's a time-consuming and rather laborious process. Our engineers preferred to spend their time on the more creative aspects of their work.
We also used an automated show-assembly system that rebuilt the audio of all of our programs from components every night. A typical program might include as many as 15-20 components, some of which were as short as parts of sentences. Even our credits were automated. For example, in the sentence "The series producer is Ralph Barbera," the name and the earlier portion were likely recorded months apart. Seamless assembly only works if the components are all of the same loudness.
Finally, we've all experienced the wide variations in loudness from one podcast to another. There's nothing quite like cranking up your iPod so you can hear that low-level show only to have the next one be so loud you have to yank out your earbuds to escape the pain. Before The Levelator® there were no loudness standards for podcasts.
Take a look at the two clips below. Which do you think is louder?

The one on the right has higher peaks, so you might think it's louder. In fact, the one on the left is much louder. The sample on the right is human voice. The one on the left is heavy-metal music.
The outline of a waveform indicates its voltage. It's what you'd see on an oscilloscope, for example. But the perceived loudness more closely corresponds with the power of the signal, and that is represented by the area under the curve -- the area between the waveform outline and the horizontal axis. At the zoom level of this example, loudness roughly correlates with the density of the blue color. The greater density of the signal on the left suggests that it will be louder.
To determine the power of a segment of audio, we calculate its root mean square or RMS: the square root of the average (mean) of the squares of all the individual amplitude values in the segment.
A square wave with peaks at +/-0dB is the loudest possible signal and has an RMS of 0dB. A sine wave with peaks at +/-0dB has an RMS of 0.707*peak or -3.0dB.
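The two reference values above can be checked numerically. The following is a minimal sketch, not The Levelator's code: the function names, the 48 kHz sample rate and the 100 Hz test tone are all assumptions chosen for illustration, with 1.0 standing for full scale (0dB).

```python
import math

def rms(samples):
    """Square root of the mean of the squared amplitude values."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def to_db(amplitude):
    """Convert a linear amplitude (1.0 = full scale) to decibels."""
    return 20.0 * math.log10(amplitude)

n = 48000      # one second at an assumed 48 kHz sample rate
cycles = 100   # assumed 100 Hz test tone, a whole number of cycles

# Full-scale sine wave and full-scale square wave of the same frequency.
sine = [math.sin(2 * math.pi * cycles * i / n) for i in range(n)]
square = [1.0 if (i * cycles * 2 // n) % 2 == 0 else -1.0 for i in range(n)]

sine_db = to_db(rms(sine))
square_db = to_db(rms(square))
print(f"square: {square_db:.1f} dB, sine: {sine_db:.1f} dB")
```

Both signals have the same peaks, yet the square wave measures about 3dB higher, which is the peak-versus-power distinction the waveform comparison illustrates.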
If you're only dealing with simple continuous waveforms, calculating and measuring RMS values is easy and standardized. But once you enter the world of spoken-word recordings, it gets a lot more complex. For example, if you use different audio-editing programs to measure the RMS value of the same spoken-word audio file, you'll likely get very different answers. (Try it!)
Instead of a continuous tone or music track, for which every calculation should yield the same RMS value, consider a spoken-word track in which a narrator counts from "one" to "ten" over a period of ten seconds. Between those utterances are periods of silence which are actually longer than the non-silent sections. For the sake of our example, let's assume that there's speech during 20% of the recording and that 80% is in fact silence. If we calculate the RMS value of the entire track (including the silent sections), the mean power is only 20% of the speech-only figure, so we'll get a value about 7.0dB lower than if we only measure the sections during which the narrator is actually speaking.
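The effect of silence on a whole-file measurement can be reproduced with a synthetic track. This is only a sketch under stated assumptions: the "speech" is a constant-level burst at a hypothetical amplitude of 0.25, the silence is pure digital zero, and the 8 kHz sample rate is arbitrary. With speech present 20% of the time, the mean power falls to 20% of the speech-only value, which is 10*log10(0.2), or roughly 7dB.

```python
import math

def rms_db(samples):
    """RMS level in dB relative to full scale (1.0)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)

rate = 8000          # assumed sample rate
seconds = 10         # the narrator counts for ten seconds
speech_level = 0.25  # hypothetical constant amplitude standing in for speech

track = []
for _ in range(seconds):
    active = int(rate * 0.2)  # 20% of each second is "speech"
    track += [speech_level if i % 2 == 0 else -speech_level for i in range(active)]
    track += [0.0] * (rate - active)  # the remaining 80% is silence

speech_only = [x for x in track if x != 0.0]
whole_db = rms_db(track)
speech_db = rms_db(speech_only)
drop = speech_db - whole_db
print(f"speech only: {speech_db:.2f} dB, whole track: {whole_db:.2f} dB, drop: {drop:.2f} dB")
```

Real recordings have room noise rather than digital zero between words, so the gap between the two measurements depends on where a given tool draws the line between speech and silence.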
For someone counting to ten we might all agree on what portions of the audio to include and exclude from the calculation. But when dealing with real-world speech, it's not so obvious. This is why different programs, utilities and meters will generally display different RMS values for spoken-word audio tracks. Each of them has a different way of excluding silence. (In fact, many of the programs -- even some of the most expensive -- don't consider silence at all when making their RMS calculations, therefore yielding entirely useless results for our purposes.)
When we began work on The Levelator® we assumed we would be able to find an existing standard for loudness. In fact we did identify a number of published standards and measurement techniques, but none of them addressed our specific requirements. Many of the standards related to the analog world, most notably in reference to the trusty old mechanical VU meter. But while all of the standards were valuable for tones and music, none were appropriate for the RMS levels of spoken-word material. Specifically, none addressed the calculation and exclusion of silence, which, as explained above, is absolutely critical for automated processing of spoken-word content.
We had no choice, therefore, but to develop our own standard. Our approach was (a) to develop our own method of calculating RMS levels for speech accounting for silence, then (b) to determine a target RMS level by applying our measurement technique to a large number of podcasts which we felt followed best practices. Most notably, we solicited samples from NPR, CBC, BBC and other talk-oriented sources, but found that while many of these organizations followed standards within their signal chains from studios through to their transmitters, they didn't have such standards for the loudness of digital files. Finally, we refined the calculations and algorithms by testing with a wide variety of source files, some with rather pathological problems.
So how do we calculate levels and process audio for The Levelator®?
We first isolate segments that are silent and remove them from the calculations. We define silence as audio segments which have no subsegments of 50 ms or more where the RMS is greater than -44.0dB. We then compute the RMS value of the remaining segments and normalize them to our target RMS level of -18.0dB.
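The basic measure-and-normalize step can be sketched as follows. This is a simplified illustration, not The Levelator's implementation: it assumes fixed, non-overlapping 50 ms windows as the unit of silence detection and applies a single uniform gain, and every function and constant name here is hypothetical. Only the -44.0dB threshold, the 50 ms duration and the -18.0dB target come from the text.

```python
import math

SILENCE_DB = -44.0  # windows at or below this RMS are treated as silence
TARGET_DB = -18.0   # target RMS level for the non-silent audio
WINDOW_MS = 50      # window length used for silence detection

def level_db(samples):
    """RMS level in dB relative to full scale; -inf for empty or all-zero input."""
    if not samples:
        return float("-inf")
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")

def gated_level_db(samples, rate):
    """RMS level of only those windows whose own RMS exceeds the silence threshold."""
    win = max(1, rate * WINDOW_MS // 1000)
    kept = []
    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        if level_db(chunk) > SILENCE_DB:
            kept.extend(chunk)
    return level_db(kept)

def normalize(samples, rate):
    """Apply one uniform gain so the gated RMS lands on the target level."""
    measured = gated_level_db(samples, rate)
    if measured == float("-inf"):
        return list(samples)  # nothing but silence: leave untouched
    gain = 10.0 ** ((TARGET_DB - measured) / 20.0)
    return [x * gain for x in samples]

# One second of a quiet 100 Hz tone followed by four seconds of silence.
rate = 8000
tone = [0.05 * math.sin(2 * math.pi * 100 * i / rate) for i in range(rate)]
track = tone + [0.0] * (4 * rate)
out = normalize(track, rate)
print(f"gated: {gated_level_db(out, rate):.1f} dB, ungated: {level_db(out):.1f} dB")
```

After normalization the gated measurement sits at the target, while an ungated measurement of the same file reads about 7dB lower because of the four seconds of silence.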
The above is actually a drastic simplification of The Levelator's processing, which takes into account a number of subtleties when dealing with certain real-world sources. For example, the silence threshold of -44.0dB is not reasonable if the audio before normalization is already very quiet. The -44.0dB value is therefore used only after the overall RMS is first normalized to near that target. This requires an iterative calculation. The Levelator® processes an entire audio file, not a continuous stream, so we have the advantage of infinite lookahead and the ability to make multiple passes over the data in large and small chunks.
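One way such an iteration might look is sketched below. This is an assumption about the general shape of the idea, not The Levelator's algorithm: the gain is first estimated from the ungated level, the -44.0dB threshold is then applied to the levels the windows would have after that gain, and the gain is re-estimated until it stops changing. The function names, the 0.1dB convergence tolerance and the pass limit are all hypothetical.

```python
import math

SILENCE_DB, TARGET_DB, WINDOW_MS = -44.0, -18.0, 50

def level_db(samples):
    """RMS level in dB relative to full scale; -inf for empty or all-zero input."""
    ms = sum(x * x for x in samples) / len(samples) if samples else 0.0
    return 10.0 * math.log10(ms) if ms > 0 else float("-inf")

def iterative_gain_db(samples, rate, tolerance_db=0.1, max_passes=20):
    """Estimate a normalizing gain, applying the silence gate after the current gain."""
    win = max(1, rate * WINDOW_MS // 1000)
    chunks = [samples[i:i + win] for i in range(0, len(samples), win)]
    levels = [level_db(c) for c in chunks]
    overall = level_db(samples)
    if overall == float("-inf"):
        return 0.0
    gain_db = TARGET_DB - overall  # first pass: no silence gating at all
    for _ in range(max_passes):
        # A window is non-silent if it would exceed -44 dB once the gain is applied.
        kept = [x for c, lv in zip(chunks, levels) if lv + gain_db > SILENCE_DB for x in c]
        if not kept:
            break
        new_gain_db = TARGET_DB - level_db(kept)
        if abs(new_gain_db - gain_db) < tolerance_db:
            return new_gain_db
        gain_db = new_gain_db
    return gain_db

# A very quiet recording: 1 s of tone near -63 dB RMS, then 4 s of noise near -100 dB.
rate = 8000
tone = [0.001 * math.sin(2 * math.pi * 100 * i / rate) for i in range(rate)]
noise = [1e-5 if i % 2 == 0 else -1e-5 for i in range(4 * rate)]
quiet_track = tone + noise
gain = iterative_gain_db(quiet_track, rate)
print(f"gain: {gain:.1f} dB")
```

In this example every window is below -44.0dB before any gain is applied, so a fixed threshold on the raw file would classify the entire recording as silence; gating after the provisional gain still separates the tone from the noise floor.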
Another issue is that increasing the loudness (normalizing) without introducing peak clipping requires nonlinear processing, which must also be done iteratively. For these reasons, we actually allow +/-1 dB variance in the final RMS. It's a tradeoff that decreases execution time and is not noticeable to listeners.
Our peak output level is -1.0dB to allow some headroom for downstream processing and problems we've observed with some playback devices, particularly after the audio has been MP3 encoded and decoded.
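To make the ceiling concrete, the short sketch below converts -1.0dB to a linear sample value and computes how much uniform gain a signal can take before its peak would cross that ceiling. The helper names and the sample data are hypothetical, and this shows only the arithmetic, not how The Levelator handles peaks that would exceed the limit.

```python
import math

PEAK_CEILING_DB = -1.0  # peak output level from the text

def peak_db(samples):
    """Peak level in dB relative to full scale (1.0)."""
    peak = max(abs(x) for x in samples)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")

def max_clean_gain_db(samples):
    """Largest uniform gain (dB) that keeps the peak at or below the ceiling."""
    return PEAK_CEILING_DB - peak_db(samples)

ceiling_linear = 10.0 ** (PEAK_CEILING_DB / 20.0)  # about 0.891 of full scale
samples = [0.0, 0.5, -0.25, 0.1]                   # hypothetical sample values
headroom = max_clean_gain_db(samples)
print(f"ceiling: {ceiling_linear:.3f}, clean gain available: {headroom:.2f} dB")
```

When the gain needed to reach the target RMS exceeds this figure, a plain multiplication would push peaks past the ceiling, which is why simple linear scaling is not enough for such material.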
The Conversations Network standard for loudness is based on the RMS measurements described above with no frequency weighting. We (and others) have experimented with several weighting schemes, but we found them to be no more effective. The relationship between perceived loudness and RMS is actually quite subtle and its science and engineering are not entirely understood. Lab studies of people listening to pure sine waves just don't seem to tell us anything useful about the real world. For a given RMS, music often sounds louder than spoken word. Similarly, a recording of someone on a phone can seem quieter than someone on a studio mic. Attempts to quantify this in terms of frequency content have not been very successful because perceived loudness has a lot to do with intelligibility of the material. Phone recordings are a little less intelligible because of their reduced frequency content and become more intelligible with a 2-3 dB boost.
Here are some online resources regarding loudness and RMS calculations: