For most people, virtual reality is mainly a visual idea. It’s a way to beam realistic pictures to your eyes so that you feel like you’re really in the virtual environment where you have been placed. However, as I’ve outlined in various other articles on this site, VR is actually an intricate dance of various factors that work together to craft the illusion of virtual presence.
One of the key factors in creating a truly convincing and immersive VR experience is having audio that syncs up with what your brain expects to hear. It may not seem like a big deal, but you’ve surely noticed that the real world does not sound like a movie soundtrack played over a pair of stereo headphones. Your brain uses the slight time delay between a sound reaching one ear and then the other, along with other subtle differences in what each ear hears, to work out accurately and almost instantly where the sound is relative to you. That’s how you can tell whether a sound is coming from your left or right and, with the help of those other cues, whether it is above, behind, or below you.
Of course, your brain has access to a bunch of other sources of information too, such as how the sound changes as you move your head, and visual cues as well. Most of this auditory processing happens on a subconscious level so we aren’t aware of it, but just about anyone can tell that the sound we typically get from a recording played back on a pair of headphones sounds “fake” compared to real-world audio.
Simulating Position and Direction in Audio Playback
One way to simulate where a sound is coming from is to use a multi-speaker surround-sound setup. If you are surrounded by speakers then it’s relatively easy to make it sound like, for example, there’s a bird chirping above and behind your right-hand side. Unfortunately, multi-speaker audio systems are quite expensive and complex, and most home users have a system with only five, or perhaps seven, speakers. Such a setup is also not terribly practical for VR use, since you’ll also hear all the real-world sounds around you that are not part of the VR simulation.
Still, when you think about it, you only have two “microphones” – in the form of your ears. So surely there must be a way to simulate the positional audio signal that your brain can interpret and perceive as a real sound coming from a real location in the space around you. Audio engineers have been working on this problem for a long time and it has given rise to a special recording technique known as “binaural” recording.
Binaural recording is a recording meant for two ears. That may seem like a sort of dumb way to put it, since we use our two ears to listen to everything, after all. What I mean, however, is that a binaural recording is meant to simulate what two ears would have heard had they been present at the time of the recording.
In other words, this is not simply two channels of audio, which is all stereo is. When a stereo track is mixed, all the recording engineer really has control over is how loud and prominent each element of the multitrack recording is in each channel. If a sound is perfectly balanced then you should hear it sort of in the “middle” of your head. The more the sound is pushed to the left or right speaker, in terms of volume balance, the more you’ll hear it on your left or right side. Traditional stereo mixing and recording can’t really create the right sound to convince you of a sound’s origin in a spatial sense.
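To make the volume-balance idea concrete, here is a minimal Python sketch of how a stereo pan control might work. It assumes a constant-power (sine/cosine) pan law, which is one common choice rather than anything this article’s sources prescribe, and the function names are made up for illustration.

```python
import math

def pan_gains(pan):
    """Return (left_gain, right_gain) for a pan position in [-1.0, 1.0].

    -1.0 is hard left, 0.0 is centre, +1.0 is hard right.
    Assumption: a constant-power pan law, so perceived loudness stays
    roughly even as the sound moves between the speakers.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 and 1.0")
    angle = (pan + 1.0) * math.pi / 4.0  # map [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)

def pan_mono(samples, pan):
    """Turn a mono signal into a list of (left, right) sample pairs."""
    left_gain, right_gain = pan_gains(pan)
    return [(s * left_gain, s * right_gain) for s in samples]
```

Notice that the only thing that changes between the two channels is loudness. There is no timing difference and no ear-shaped filtering, which is exactly why a panned sound seems to sit somewhere on a line between your ears instead of out in the room.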
It’s the Head, Dummy
Stereo falls short mainly because what you hear coming into your ears is actually a complex acoustic product. For example, you don’t hear sound just as it enters your ear holes. You also hear sound conducted through the bones of your skull, where it all mixes together. This is one of the reasons why we often don’t recognize ourselves on a recording. We hear our own voices as a mix of internal and external sound that’s not reproduced on a recording at all.
The shape of our ears also plays an important role. The ear shell is a complex acoustic funnel. If you’ve ever been in an auditorium or an opera house you may have noticed the complex acoustic panels they use to reflect and redirect sound. All the curves and angles of your outer ear do a similar job. This means that your brain expects to hear sounds that have been put through this part of the process when figuring out where a sound is coming from. The problem is that when you are wearing headphones these natural acoustics of your hearing system are partly bypassed. The sound is pumped directly into your ear canal, where you hear it as an undiluted stream of sound.
It’s a Dummy Head
Traditional binaural recording aims to recreate the acoustics of the head and ears that your inner ear and brain expect, by literally using a dummy head with microphones where your ears would be. This way the distance between the “ears” is correct and the model ear shells do similar things to the sound as your own ears would.
So if we want to make it sound as if you are really standing in the middle of a live band, or a nature scene, we just have to set up the real-world situation and then plonk the recording dummy head in the spot where we want the listener to feel they are when they play back the recording.
There’s just one catch: the positions of the audio sources are all fixed relative to the dummy head recording system. So that’s not much use for us in VR, because we want to move our heads around and have each sound stay where it’s supposed to be in the virtual world, which means its direction relative to our heads has to keep changing.
The VR Positional Audio Conundrum
The virtual reality world often portrays an environment that is dynamic in nature. When it comes to the audio that you hear, this means you’re not listening to a single recording but to many small recordings, and sometimes even completely generated (i.e. “made up”) sounds.
Let’s say there’s a virtual bee flying around your head. Unless the bee takes a pre-recorded path around your head, you’ll need some way of making the positional audio work, without having the luxury of perfectly controlling the position of your head as you listen. In other words, you need to generate or modify the sound in such a way that things in the VR world not only sound as if they are where they seem to be, but that their position in audio space changes correctly as the position of your virtual ears changes.
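As a rough illustration of the bee problem, here is a small Python sketch that works out where a sound source sits relative to the direction your head is facing. It only handles the horizontal plane, and the coordinate convention (x to the right, y forward, positive yaw turning to the right) and the function name are my own assumptions, not taken from any particular VR toolkit.

```python
import math

def head_relative_azimuth(source_xy, head_xy, head_yaw_deg):
    """Angle of a source relative to where the head is facing, in degrees.

    Assumed convention: x points right, y points forward, a yaw of 0
    faces +y, and positive yaw turns the head to the right.
    The result lies in (-180, 180]; positive means the source is on
    the listener's right.
    """
    dx = source_xy[0] - head_xy[0]
    dy = source_xy[1] - head_xy[1]
    world_bearing = math.degrees(math.atan2(dx, dy))  # 0 = +y, 90 = +x
    relative = (world_bearing - head_yaw_deg) % 360.0
    if relative > 180.0:
        relative -= 360.0
    return relative
```

A bee hovering directly to your right comes out at roughly 90 degrees. Turn your head 90 degrees to the right and the same, unmoving bee is now at roughly 0 degrees, dead ahead. A spatial audio system has to redo this kind of calculation every time the head tracker reports a new orientation.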
The Secret Sauce
Different VR experience providers each have their own “secret sauce” algorithm and software development kit aimed at convincing us that VR sounds really do come from where they appear to be. Regardless of how they achieve the end result, the basics are bound to be the same. These systems have to take three things into account when generating true spatial sound.
First, they have to simulate the difference in the time the sound takes to reach one ear versus the other. So if a virtual pin drops in the room, the computer has to work out exactly when the sound waves reach which parts of the room. If your ear is occupying that space, then it should hear the correct sound at the right time.
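To get a feel for the size of this time difference, here is a minimal Python sketch. It assumes the classic spherical-head approximation (often credited to Woodworth), a head radius of 8.75 cm, and a speed of sound of 343 m/s. Those figures and the function names are illustrative assumptions, not what any particular VR audio engine actually uses.

```python
import math

HEAD_RADIUS_M = 0.0875        # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0    # assumed, air at roughly room temperature

def interaural_time_difference(azimuth_deg):
    """Approximate arrival-time gap between the ears, in seconds.

    azimuth_deg: 0 is straight ahead, +90 is directly to the right.
    A positive result means the right ear hears the sound first.
    Sources behind the head are folded onto the front, because a
    plain sphere cannot tell front from back.
    """
    az = math.radians(azimuth_deg)
    lateral = math.asin(math.sin(az))  # fold into [-pi/2, pi/2]
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (lateral + math.sin(lateral))

def itd_in_samples(azimuth_deg, sample_rate_hz=48000):
    """The same gap expressed as a whole number of audio samples."""
    return round(interaural_time_difference(azimuth_deg) * sample_rate_hz)
```

Under these assumptions, a sound coming from directly to one side reaches the far ear about two thirds of a millisecond late. That is a tiny delay, but it is one your brain is very good at picking up.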
Second, they have to take into account that each ear should hear the sound at a slightly different volume. After all, the ear facing the sound gets more sound energy than the other, which sits in the acoustic shadow of your head.
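Here is a deliberately crude Python sketch of that volume difference. It is a toy model of my own: the far ear is turned down by up to `max_shadow_db` decibels depending on how far to the side the source is. The 10 dB default is a made-up illustrative figure, and in reality the difference depends heavily on the frequency of the sound.

```python
import math

def ear_gains(azimuth_deg, max_shadow_db=10.0):
    """Return (left_gain, right_gain) as linear volume multipliers.

    azimuth_deg: 0 is straight ahead, +90 is directly to the right.
    Toy assumption: the near ear is left untouched and the far ear is
    attenuated by max_shadow_db * |sin(azimuth)| decibels.
    """
    lateral = math.sin(math.radians(azimuth_deg))  # -1 (left) to +1 (right)
    shadow_db = max_shadow_db * abs(lateral)
    far_gain = 10.0 ** (-shadow_db / 20.0)  # decibels to linear amplitude
    if lateral >= 0.0:
        return far_gain, 1.0  # source on the right: left ear is shadowed
    return 1.0, far_gain      # source on the left: right ear is shadowed
```

A sound straight ahead gives both ears the same gain, while a sound hard to one side leaves the near ear at full volume and turns the far ear down by the full amount.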
The last part of the simulation has to reproduce the “spectral filtering” performed by the outer ear. All this refers to is how the shape of the ear highlights or eliminates parts of the sound. That way the audio sounds the way the brain expects it to after passing into the ear canal.
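One common way to apply this kind of filtering is to run the sound through a short filter for each ear, usually called a head-related impulse response. The Python sketch below shows the general idea using plain convolution. The filter values in the example are placeholders I invented purely for illustration. Real ones are measured, for instance with a dummy head, and there is a different pair for every direction.

```python
def convolve(signal, impulse_response):
    """Filter a signal by direct (slow but simple) convolution."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out

def render_binaural(mono, hrir_left, hrir_right):
    """Return (left, right) signals, each shaped by its own ear filter."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Placeholder filters, NOT measured data: the right ear hears the click
# immediately and clearly, the left ear hears it later and duller.
HRIR_LEFT = [0.0, 0.0, 0.3, 0.2, 0.1]
HRIR_RIGHT = [0.9, 0.3, -0.1, 0.0, 0.0]

click = [1.0, 0.0, 0.0, 0.0]
left, right = render_binaural(click, HRIR_LEFT, HRIR_RIGHT)
```

A nice side effect of this approach is that measured filters like these capture the time delay and the volume difference between the ears as well as the shaping done by the outer ear.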
Bringing it All Together
So in the end a spatial audio system is a simulator that takes recorded sound and then simulates how that audio would travel through a room, interact with various surfaces, and then enter your virtual ears. It’s not hard to imagine what such a system has to do, but pulling it off is a technical act of genius. We pay a lot of attention to the hard work that has gone on in the graphics department, and deservedly so, but I think the revolution in true spatial audio deserves just as much praise. All this research and these interlocking technological solutions come together in an intricate dance just so you can enjoy a convincing aural experience. Hats off to them, I say.