From 360-degree video to Oculus, Facebook 360 audio encoding and rendering delivers an immersive experience with fewer channels and less than 0.5 milliseconds of rendering latency.
• New 360-degree spatial audio encoding and rendering technology keeps spatialized audio at high quality throughout the entire pipeline, from the editor to the end user, and is expected to reach large-scale commercial use for the first time.
• We support a technique called “hybrid higher-order ambisonics” that keeps spatialized sound at high quality throughout the process. It is an 8-channel audio system whose rendering and optimization deliver higher spatial quality with fewer channels, ultimately saving bandwidth.
• Our audio system supports both spatialized and head-locked audio. With spatialized audio, the system repositions each sound according to where the user is looking within the 360-degree video. With head-locked audio, sounds such as character dialogue and background music keep a fixed orientation relative to the head. This is the first time the industry has achieved simultaneous rendering of high-quality spatialized and head-locked audio.
• The spatial audio rendering system gives developers real-time rendering with less than half a millisecond of latency.
• The FB360 Encoder tool can deliver processed audio to multiple platforms. The audio rendering SDK can be integrated into Facebook and Oculus Video, ensuring a unified experience from production to release. This saves time and ensures that what developers hear in production matches what is ultimately published.
The 360-degree video experience on Facebook is striking and immersive on its own, but a truly complete audiovisual experience also needs 360-degree spatial audio. With 360-degree spatial audio, each sound appears to come from its corresponding direction in space, just as we perceive sound in real life. A helicopter roaring above the camera sounds like it is above the user, and actors in front of the camera sound like they are in front of the user. As the user looks around the video, the system must react to changes in head orientation and reposition every sound to match the picture. Whether on a phone, in a browser, or through a VR headset, each time a user views a 360-degree video the audio must be recomputed and updated continuously to faithfully reproduce a true spatial experience.
In short, to achieve this effect we had to build an audio processing system that gives the user the sense of being immersed in a theater without relying on a large room full of speakers. The first of the problems to solve was to construct an audio environment that reflects the real-world environment and present it through headphones at high resolution, while constantly tracking the direction the user is looking. Ordinary stereo heard through headphones may tell the user whether a sound is playing in the left or right ear, but it does not convey the depth or height of a sound in the environment, nor whether the sound is in front of or behind the listener.
Creating this spatialized listening experience and commercializing it at scale requires many new technologies. Although spatial audio research is thriving in academia, until now there has not been a reliable end-to-end pipeline capable of bringing the technology to the consumer market at scale. We recently introduced new creator tools and rendering methods, which for the first time let us deliver high-quality spatial audio to a mass consumer market. These rendering techniques power a new tool called the Spatial Workstation, which lets creators add spatialized audio to 360-degree video. The same rendering system is built into the Facebook app, so users hear the same vivid panoramic audio the creator uploaded.
Both of these improvements help video producers recreate reality across many devices and platforms. In this article we will dig into some of the technical details, but first, some history and background on spatial audio.
An introduction to spatial audio
Head-related transfer functions (HRTFs) make it possible for a listener wearing headphones to hear a sound as if it were located somewhere in the space around them, rather than inside the head. HRTFs let developers build audio filters that can be applied to audio streams so that sounds appear to come from their correct positions: in front of, behind, or beside the listener. HRTFs are typically measured in an anechoic chamber using a dummy head or a head-and-torso simulator, though they can also be obtained by other methods.
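As a rough illustration of the idea (not Facebook's implementation), binaural rendering with HRTFs amounts to filtering a mono source with the left-ear and right-ear impulse responses measured for the source's direction. The type and function names below are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Minimal sketch: render a mono source binaurally by convolving it with the
// head-related impulse responses (HRIRs) measured for the source's direction.
// Real renderers use partitioned FFT convolution and interpolate between
// measured directions; the names here are illustrative only.
struct Hrir {
    std::vector<float> left;   // left-ear impulse response
    std::vector<float> right;  // right-ear impulse response
};

static std::vector<float> convolve(const std::vector<float>& x,
                                   const std::vector<float>& h) {
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size(); ++k)
            y[n + k] += x[n] * h[k];
    return y;
}

// Produces a (left, right) pair that, over headphones, is perceived as
// coming from the direction the HRIR was measured at.
std::pair<std::vector<float>, std::vector<float>>
renderBinaural(const std::vector<float>& mono, const Hrir& hrir) {
    return { convolve(mono, hrir.left), convolve(mono, hrir.right) };
}
```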
If you want users to hear panoramic sound while watching a panoramic video, you must put each sound in the right place; in other words, you must author and deliver spatial audio. There are several ways to do this. One is object-based spatial audio, in which the sound emitted by each object in the scene (for example, a helicopter or an actor) is stored as a discrete stream with positional metadata. Most game audio engines use this object-based approach, because the position of each audio stream can change at any moment depending on where the player moves.
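A minimal way to picture this representation, using hypothetical names rather than any real engine's API:

```cpp
#include <string>
#include <vector>

// Sketch of object-based spatial audio: each sound-emitting object carries
// its own audio stream plus positional metadata that the renderer reads as
// the scene changes. Names and fields are illustrative assumptions.
struct Vec3 { float x, y, z; };

struct AudioObject {
    std::string name;            // e.g. "helicopter", "actor_01"
    std::vector<float> samples;  // discrete mono audio stream for this object
    Vec3 position;               // positional metadata, updated over time
    float gain = 1.0f;
};

// A scene is then just a collection of such objects; the renderer
// spatializes each stream at its current position for every audio block.
using Scene = std::vector<AudioObject>;
```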
Ambisonics is another spatial audio approach; it represents the entire sound field, and you can think of it as a panoramic photo of the audio. Its multichannel stream is straightforward to work with, making it easier to transcode and stream than object-based spatial audio. An ambisonic stream can be represented at different orders, and the order of the sound field is the main difference between the variants: a first-order field uses four channels, while a third-order field uses 16. In general, a higher order means better sound quality and more accurate spatial positioning; a low-order ambisonic field can be thought of as a blurry panorama.
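For intuition, a mono source can be encoded into a first-order sound field with a handful of trigonometric gains. The sketch below follows the common ambiX convention (ACN order, SN3D normalization) and illustrates ambisonics in general, not Facebook's hybrid format:

```cpp
#include <array>
#include <cmath>

// An order-N ambisonic field has (N + 1)^2 channels, so first order = 4
// and third order = 16.
constexpr int ambisonicChannels(int order) { return (order + 1) * (order + 1); }

// Encode one mono sample into first-order ambisonics (ambiX: W, Y, Z, X).
std::array<float, 4> encodeFirstOrder(float sample,
                                      float azimuthRad,
                                      float elevationRad) {
    const float cosEl = std::cos(elevationRad);
    return {
        sample,                                 // W: omnidirectional component
        sample * std::sin(azimuthRad) * cosEl,  // Y: left/right
        sample * std::sin(elevationRad),        // Z: up/down
        sample * std::cos(azimuthRad) * cosEl   // X: front/back
    };
}
```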
Workflows and tools
The Spatial Workstation is an audio processing tool we developed to help professional sound designers create spatial audio for 360-degree video and linear VR experiences. It offers more powerful spatial audio processing than existing workstations, letting designers position sound in 3D space against a 360-degree video while previewing the spatialized audio through headphones on a VR headset. This opens up a high-quality “end-to-end” workflow that runs from content creation all the way to publication.
Traditional stereo audio consists of only two channels. The system we built for the Spatial Workstation outputs eight audio channels; because the sound field has been optimized for VR and 360-degree video, we call it hybrid higher-order ambisonics. The format is tuned to our spatial audio renderer to maximize sound quality and positional accuracy while minimizing performance requirements and latency. In addition, the Spatial Workstation can output two head-locked audio channels: a stereo stream that does not respond to head rotation, so its sounds remain "locked" to a fixed position around the head. Most 360-degree experiences mix spatialized and head-locked audio, using spatialized audio for the action within the 360-degree scene and head-locked audio for narration or background music.
[Figure: spatialized vs. head-locked audio]
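A minimal sketch of the final mix stage, assuming the eight hybrid-ambisonic channels have already been binaurally rendered with head tracking applied: the head-locked stereo simply bypasses rotation and is summed into the output. Buffer and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct StereoBuffer {
    std::vector<float> left;
    std::vector<float> right;
};

// Combine the head-tracked binaural render of the spatialized field with the
// head-locked stereo (narration, music), which is mixed in unchanged.
StereoBuffer mixOutput(const StereoBuffer& spatializedBinaural,
                       const StereoBuffer& headLocked) {
    assert(spatializedBinaural.left.size() == headLocked.left.size());
    StereoBuffer out = spatializedBinaural;
    for (std::size_t i = 0; i < out.left.size(); ++i) {
        out.left[i]  += headLocked.left[i];
        out.right[i] += headLocked.right[i];
    }
    return out;
}
```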
Flexible, high-performance rendering
Our spatial audio renderer distills technology we have developed over several years, so spatial audio can scale across many different content types and configurations while maintaining the best possible quality. The renderer uses a parameterized representation of HRTFs that lets us weight the different components of the HRTF, so rendering can be biased toward speed or toward quality, or balanced between the two. In addition, the renderer's audio latency is less than half a millisecond, an order of magnitude lower than most renderers, which makes it ideal for real-time experiences such as 360-degree video with head tracking.
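The internals of this parameterization are not public, but one way to picture the speed/quality trade-off is a setting that controls how much of the HRTF is actually applied. The sketch below is purely hypothetical:

```cpp
#include <cstddef>

// Hypothetical illustration of a speed/quality trade-off: cheaper modes apply
// only coarse cues, higher-quality modes apply more of the measured filter.
enum class RenderQuality {
    Fast,     // cheap cues only: interaural time and level differences
    Balanced, // truncated HRIR convolution
    Full      // full-length HRIR convolution
};

// Decide how many HRIR taps to apply for a given quality setting.
std::size_t hrirTapsFor(RenderQuality q, std::size_t fullLength) {
    switch (q) {
        case RenderQuality::Fast:     return 1;               // delay + gain only
        case RenderQuality::Balanced: return fullLength / 4;  // truncated filter
        case RenderQuality::Full:     return fullLength;
    }
    return fullLength;
}
```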
This flexibility helps panoramic audio and video reach a wide range of desktop computers, mobile devices, and browsers. Our renderers are tuned to behave consistently across platforms and to perform well on each of them. That consistency matters for high-quality spatial audio: if, because of differing performance constraints, the audio sounded different on different platforms or devices, the whole ecosystem would suffer. We want panoramic audio and video to maintain excellent quality on every common device when used at the scale of a platform like Facebook.
Working efficiently across platforms
The renderer is part of the Audio360 audio engine, which spatializes hybrid higher-order ambisonic and head-locked audio streams. The engine is written in C++ with optimized vector instructions for each platform. It is very lightweight: audio is queued, spatialized, and mixed by a multithreaded, lock-free system. It also talks directly to the audio system on each platform (OpenSL ES on Android, CoreAudio on iOS/macOS, WASAPI on Windows) to minimize output latency and maximize processing efficiency. This lightweight design keeps developers productive and avoids application bloat by keeping the binary small: the audio engine compiles to roughly 100 KB.
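The lock-free hand-off can be pictured as a single-producer, single-consumer ring buffer that passes decoded audio blocks to the real-time audio callback without ever taking a lock. This is an illustration of the pattern, not Audio360's actual code:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <vector>

// Single-producer/single-consumer lock-free ring buffer: the decode thread
// pushes blocks, the platform audio callback pops them, and neither ever blocks.
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    bool push(T value) {  // called from the decode thread
        const auto head = head_.load(std::memory_order_relaxed);
        const auto next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        slots_[head] = std::move(value);
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {    // called from the real-time audio callback
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = std::move(slots_[tail]);
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Example: hand 10-channel interleaved audio blocks to the audio callback.
using AudioBlock = std::vector<float>;
SpscQueue<AudioBlock, 16> g_blockQueue;
```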
For the web, the audio engine is compiled to asm.js with Emscripten. This lets us maintain, optimize, and ship the same code base on all platforms; the code runs in the browser with very few modifications. The flexibility and speed of the renderer let us use the same technology in a variety of browsers while guaranteeing audio quality. In this case the audio engine runs as a custom processor node in Web Audio: the audio stream is routed from the Facebook video player into the engine, and the spatialized output is passed back to Web Audio and played through the browser. The JavaScript version runs only about two to four times slower than the native C++ implementation, which is still fast enough for real-time processing.
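As a sketch of how a C++ processing routine might be exposed to JavaScript with Emscripten's embind (the function name and signature are hypothetical, not the engine's real interface):

```cpp
#include <emscripten/bind.h>
#include <vector>

// Hypothetical entry point: take one block of interleaved samples from
// JavaScript, spatialize it, and return the processed block.
std::vector<float> processBlock(std::vector<float> interleavedInput) {
    // ... spatialization would happen here; pass-through shown for brevity ...
    return interleavedInput;
}

EMSCRIPTEN_BINDINGS(audio_engine) {
    emscripten::register_vector<float>("FloatVector");
    emscripten::function("processBlock", &processBlock);
}
```

Compiled with em++ and the --bind flag, an exported function like this could then be called from the Web Audio processor node's audio callback in the browser.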
As devices and browsers get faster every year, the flexibility and cross-platform nature of the renderer and audio engine will let us keep improving sound quality.
From encoding to client
The world of spatial audio coding and file formats is evolving rapidly. We want it to be as easy as possible to encode content created with the Spatial Workstation and upload it to Facebook so that people can watch and listen on whatever devices they use. The Spatial Workstation encoder takes the 8-channel spatial audio and the stereo head-locked audio, packages them with the 360-degree video into a single file, and uploads it to Facebook.
Choosing an encoding format
Finding a viable file format presented some challenges. There were several constraints; some could be set aside for later, but the most urgent need was to ship a workable encoder as early as possible. The main limiting factor is that uploads are transcoded into Facebook's native video format, H.264-encoded MP4, and we wanted to minimize the loss of audio quality throughout that process. In practice, this meant the following limitations:
• AAC in MP4 supports 8 channels, but does not support 10 channels.
• The AAC encoder treats 8-channel audio as the 7.1 surround format and applies an aggressive low-pass filter and other compression to the LFE channel. That is incompatible with the spatial audio we are committed to delivering.
• MP4 metadata is extensible, but working with it through tools such as ffmpeg or MP4Box is tedious.
We chose to lay out the channels as three tracks within the MP4 file. The first two are four-channel tracks, giving us eight channels with no LFE; the third track is the stereo head-locked audio. We encode at a high bit rate to minimize quality loss in the WAV-to-AAC conversion, since these tracks will be transcoded again on the server in preparation for delivery to clients.
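To make the track and channel mapping explicit, the upload layout can be written out as follows (hypothetical names, for illustration only):

```cpp
#include <string>
#include <vector>

// Sketch of the "4 + 4 + 2" upload layout described above.
struct UploadTrack {
    int channelCount;
    std::string role;  // what the channels carry
};

const std::vector<UploadTrack> kSpatialWorkstationUpload = {
    {4, "hybrid ambisonic channels 1-4 (no LFE)"},
    {4, "hybrid ambisonic channels 5-8 (no LFE)"},
    {2, "head-locked stereo"},
};
```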
At Facebook, rapid iteration is a core engineering value. We knew that as the tools and features used to author this audio kept evolving, we would not be able to anticipate in advance all the information that might need to be communicated. For that reason, we needed a metadata solution that was forward-extensible and easy to work with. Defining our own MP4 box type felt fragile, so we decided to put the metadata in an xml box inside a meta box. The XML follows a schema that can evolve quickly as developers' needs change, and the MP4Box tool can read and write this metadata from MP4 files. We store per-track metadata (under the trak box) that defines the channel layout for that track, and we also write global metadata at the file level (under the moov box).
The Spatial Workstation encoder also takes the video as part of its input. The video is copied into the generated file without transcoding, and the appropriate spatial video metadata is written so that the server processes the upload as a 360-degree video.
YouTube's format currently uses four channels of first-order ambisonics (ambiX); we support video in that format as well.
Efficient, convenient transcoding
Once a developer uploads a video with 360-degree visuals and 360-degree sound, it is prepared for delivery to clients on many kinds of devices. The audio, like the video, is processed into multiple formats: we read the audio metadata (whether YouTube-style ambiX or Facebook 360 format) to determine the mapping between tracks and channels, then convert it into each format we need. As with all other videos, we sometimes transcode with several encoder settings and compare them to get the best overall experience; we also produce a binaurally rendered stereo track that is compatible with all legacy clients and serves as a fallback if any problems arise.
Audio and video are processed separately and delivered to the client using an adaptive streaming protocol.
Delivering to clients
Different clients have different capabilities and support different video container and codec formats. We cannot force every device to support a single format, so we prepare different formats for iOS, Android, and web browsers. Because we control the video player on these platforms, we could implement special behavior wherever it is needed, but we prefer to rely on existing code paths that are already well tested and require no extra engineering effort. For that reason we use MP4 files as the video carrier on iOS, and WebM on Android and in web browsers. On iOS and Android, unlike mono or stereo tracks, 10-channel AAC is not always supported by the built-in decoders or by hardware acceleration. These problems with 8- and 10-channel AAC led us to the Opus codec, which others are already using for spatial audio and which compresses better. It is a modern open-source codec whose software decoding is faster than AAC's, which makes Opus a natural choice for us, especially with WebM. Most encoders and decoders do not yet support Opus in MP4; however, there is a draft specification for Opus in MP4, and we are working on adding support for it to ffmpeg.
Converting the three-track format in the uploaded file ("4 + 4 + 2") into a single 10-channel Opus track presented some challenges. As with AAC, the allowed channel mappings and the LFE channel are an issue. However, Opus allows an undefined channel mapping family (family 255), which indicates that the channels do not follow any predefined layout. This works well for us because we control both encoding and decoding, so we can make sure both ends interpret the layout the same way; we transmit the channel layout information in the streaming manifest. In the future, as spatial audio support in Opus matures, specific channel mappings and improved coding techniques may significantly improve audio quality while significantly reducing file size. Our clients will then be able to take advantage of these improvements with little or no change.
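As a rough sketch of what encoding such a 10-channel track with libopus's multistream API might look like (the stream grouping and settings below are assumptions for illustration, not Facebook's actual configuration):

```cpp
#include <opus/opus_multistream.h>
#include <cstdio>
#include <vector>

int main() {
    const int sampleRate = 48000;
    const int channels = 10;
    const int streams = 5;          // assumed: five coupled (stereo) streams
    const int coupledStreams = 5;
    // Identity mapping: output channel i comes from decoded channel i.
    // Channel mapping family 255 itself is signaled in the container header,
    // telling decoders the layout is application-defined.
    const unsigned char mapping[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    int err = 0;
    OpusMSEncoder* enc = opus_multistream_encoder_create(
        sampleRate, channels, streams, coupledStreams, mapping,
        OPUS_APPLICATION_AUDIO, &err);
    if (err != OPUS_OK || enc == nullptr) return 1;

    const int frameSize = 960;  // 20 ms at 48 kHz
    std::vector<float> pcm(frameSize * channels, 0.0f);   // interleaved input
    std::vector<unsigned char> packet(4000);              // output buffer

    const int bytes = opus_multistream_encode_float(
        enc, pcm.data(), frameSize, packet.data(),
        static_cast<opus_int32>(packet.size()));
    std::printf("encoded %d bytes\n", bytes);

    opus_multistream_encoder_destroy(enc);
    return 0;
}
```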
Future development direction
Spatial audio is a rapidly evolving field, and we are still adapting formats that were designed for non-spatial video and audio. Everything we do is aimed at creating new audiovisual experiences and making them real. But there is still much to do, with improvements up and down the stack, from the workstation to the video file format. We are currently developing a file format that stores all the audio in a single track for upload, possibly with lossless encoding. We are also very interested in work to improve spatial audio compression in Opus. And we look forward to exploring adaptive bitrate and adaptive channel-layout techniques to improve the experience both for users with limited bandwidth and for users who have plenty of bandwidth and expect higher quality. This is an exciting area, and we look forward to contributing even more to the audiovisual ecosystem.