In this article we describe H.264 video processing on Intel GPUs under Linux and the experience our company, Inventos, gained while enhancing StreamBuilder, our streaming media server.
Introduction
When the Intel Server SDK beta was released for Linux, we were very keen to bring Intel Quick Sync Video technology into StreamBuilder, our versatile media server software that works as the backend for Webcaster.pro. At that time StreamBuilder was able to:
capture input streams from SDI, IP multicast and RTMP
transcode and resample virtually any audio/video stream to H.264 HLS or RTMP
support a distributed, fault-tolerant deployment scheme in which ingesting, encoding and streaming run on independent, redundant nodes
apply filters (audio normalization/amplification, video deinterlacing, crop, resize, etc.)
offer flexible configuration (with its own DSL) that allows building pipelines (and even trees) for sequential media processing with the filters mentioned above
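To make the idea of a processing tree concrete, here is a minimal sketch in Python. It is not StreamBuilder's DSL, which is not shown in this article: the `Stage` class, the `branches` helper and all stage names are hypothetical and serve only to illustrate how one input can fan out into several output chains.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Stage:
    """One processing step in the tree (hypothetical, for illustration only)."""
    name: str
    children: List["Stage"] = field(default_factory=list)

    def add(self, child: "Stage") -> "Stage":
        """Attach a downstream stage and return it, so calls can be chained."""
        self.children.append(child)
        return child


def branches(stage: Stage, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf chain of stage names."""
    path = prefix + (stage.name,)
    if not stage.children:
        yield path
    for child in stage.children:
        yield from branches(child, path)


# One captured input, deinterlaced once, then split into two renditions.
source = Stage("capture_rtmp")
deinterlaced = source.add(Stage("deinterlace"))
deinterlaced.add(Stage("resize_1080p")).add(Stage("encode_h264_hls"))
deinterlaced.add(Stage("resize_720p")).add(Stage("encode_h264_hls"))

for chain in branches(source):
    print(" -> ".join(chain))
```

Sharing the capture and deinterlace stages between both renditions is the point of a tree over a flat pipeline: the common work is done once per input rather than once per output.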
StreamBuilder is based on libavcodec, and although it was already well optimized, it was designed to run on x86 CPUs. Adding CPU cores speeds up encoding almost linearly, but that is expensive, and the CPU always has other tasks to do besides video encoding. Encoding on the GPU could make processing faster and cheaper, with a higher channel density per rack unit.
Intel solution
So the decision was made: rewrite a major part of the StreamBuilder core and integrate Intel SDK for Servers to get a significant performance boost. Our goal was to encode at least 4 Full HD streams on budget-priced hardware. To anticipate the results a little: the goal was exceeded.
Our colleagues at Intel were also interested in a real-life use case of Media SDK for Linux Servers. They did a great job helping us during development and deployment: answering our questions, providing code samples and giving valuable advice.
Media SDK for Server comes with documentation and examples that cover almost all possible use cases. This helped us a lot and greatly simplified the implementation. In fact, in our case the implementation came down to replacing the decoding, encoding and resampling modules with Intel Quick Sync-enabled modules that use the capabilities of Intel HD Graphics.
Staging hardware and software
We used a 1RU (rack unit) server with the following specs:
Motherboard | Supermicro X10SLH-F |
CPUs | #1 Intel® Xeon® CPU E3-1225 v3, Intel® HD Graphics P4600; #2 Intel® Core™ i7-3770, Intel® HD Graphics 4000; #3 Intel® Xeon® CPU E3-1285 v3, Intel® HD Graphics P4700 |
RAM | 16 GB |
OS | Ubuntu 12.04.4 LTS 3.8.0-23-generic |
The motherboard chipset must be the C226 PCH, because at the time of writing it is the only server chipset that supports hardware encoding. It is also highly recommended to use a motherboard without a built-in GPU; otherwise there can be issues with GPU identification and operation.
The motherboard we used had a built-in GPU, and getting things to work caused us a lot of headaches. At first Intel Media SDK did not recognize the device ID, so we could not enable Quick Sync Video. After a BIOS update the required setting appeared in the BIOS, but we still had to disable the motherboard’s GPU manually with an on-board jumper. That configuration blocks IPMI and video output, but we access the server via SSH, so it was not a big issue.
Note that there are some limitations on the Linux kernel version: 3.2.0-41 or 3.8.0-23 for Ubuntu 12.04, and 3.0.76-11 for SUSE Linux Enterprise Server SP3.
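A deployment script could check the running kernel against these releases before the SDK is installed. The sketch below is an assumption rather than part of Media SDK: the helper name `is_supported_kernel` is hypothetical, the version strings are copied from the list above, and the exact release string reported by a given distribution may be formatted differently.

```python
import platform

# Kernel versions named in the article as supported (copied verbatim).
SUPPORTED_KERNELS = (
    "3.2.0-41",   # Ubuntu 12.04
    "3.8.0-23",   # Ubuntu 12.04
    "3.0.76-11",  # SUSE Linux Enterprise Server SP3
)


def is_supported_kernel(release: str) -> bool:
    """Return True if the release equals a supported version or extends it
    with a flavour suffix such as '-generic'."""
    return any(
        release == version or release.startswith(version + "-")
        for version in SUPPORTED_KERNELS
    )


if __name__ == "__main__":
    release = platform.release()  # for example "3.8.0-23-generic"
    status = "supported" if is_supported_kernel(release) else "not in the supported list"
    print(f"kernel {release}: {status}")
```

Matching on the version followed by a hyphen, rather than on a bare prefix, avoids accepting a release such as 3.2.0-410 by accident.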
Results
CPU: E3-1225 v3, 16 GB RAM, Intel® HD Graphics P4600
| ffmpeg | sample_full_transcode | streambuilder (no optimization) | streambuilder (optimization) |
time | 8 min 42 s | 1 min 19 s | 2 min 19 s | 1 min 40 s |
cpu (max) | 750% | 55% | 125% | 50% |
mem (max) | 3.3% | 4.6% | 0.5% | 0.4% |
PSNR | 48.107 | 46.68 | | |
Average PSNR | 51.204 | 49.52 | | |
SSIM | 0.99934 | 0.9956 | | |
MSE | 1.623 | 2.969 | | |
CPU: i7-3770, 3 GB RAM, Intel® HD Graphics 4000
| ffmpeg | sample_full_transcode | streambuilder (no optimization) | streambuilder (optimization) |
time | 8 min 48 s | 1 min 24 s | 2 min 31 s | 1 min 23 s |
cpu (max) | 750% | 19% | 150% | 45% |
mem (max) | 18% | 20% | 2.8% | 2.3% |
PSNR | 48.107 | 46.495 | | |
Average PSNR | 51.204 | 49.27 | | |
SSIM | 0.99934 | 0.991 | | |
MSE | 1.623 | 3.036 | | |
CPU: E3-1285 v3, 16 GB RAM, Intel® HD Graphics P4700
| ffmpeg | sample_full_transcode | streambuilder (no optimization) | streambuilder (optimization) |
time | 8 min 1 s | 1 min 11 s | 2 min 11 s | 1 min 34 s |
cpu (max) | 750% | 55% | 130% | 55% |
mem (max) | 3.3% | 4.6% | 0.5% | 0.4% |
PSNR | 48.107 | 46.68 | | |
Average PSNR | 51.204 | 49.52 | | |
SSIM | 0.99934 | 0.9956 | | |
MSE | 1.623 | 2.969 | | |
StreamBuilder’s signal quality metrics (PSNR, SSIM, MSE) are equal to those of sample_full_transcode, so we do not show them in the tables.
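For readers unfamiliar with these metrics, the sketch below shows the textbook definitions of MSE and PSNR for 8-bit samples. It is an illustration only: the article does not state which tool produced the figures in the tables or how per-frame values were averaged, and the sample values in the example are made up.

```python
import math
from typing import Sequence


def mse(reference: Sequence[int], distorted: Sequence[int]) -> float:
    """Mean squared error between two equally sized sequences of samples."""
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


def psnr(reference: Sequence[int], distorted: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


# Toy example: four luma samples, not data from the benchmark.
original = [16, 128, 200, 235]
encoded = [17, 126, 200, 236]

print(f"MSE:  {mse(original, encoded):.3f}")
print(f"PSNR: {psnr(original, encoded):.2f} dB")
```

A lower MSE and a higher PSNR both mean the encoded picture is closer to the source, which is how the ffmpeg and sample_full_transcode columns in the tables should be read.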
As the tables above show, the server CPUs with Intel HD Graphics P4700/P4600 performed better in our test and gave better output video quality than the i7-3770 with Intel HD Graphics 4000. That is not always the case, however. Intel keeps improving video encoding with each chip generation and SDK version. Encoding can be slightly slower on the latest chips, but CPU load is lower too. We do not know why that is.
The encoding quality of Intel HD Graphics P4700 was comparable to that of P4600, but encoding was 14% faster on the E3-1285 v3 with the same resource consumption. It is also notable that the E3-1285 v3 is 10% faster than the E3-1225 v3 when encoding with ffmpeg.
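The timings in the tables are easier to compare as ratios. The sketch below is one way to do that: it converts the "min s" values from the E3-1225 v3 table into seconds and computes each tool's speed-up relative to the ffmpeg software encode. The helper name `to_seconds` is ours, not part of any tool used in the test.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '8 min 42 s' into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s*(\d+)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(group) for group in match.groups())
    return minutes * 60 + seconds


# Transcoding times from the E3-1225 v3 table.
timings = {
    "ffmpeg": "8 min 42 s",
    "sample_full_transcode": "1 min 19 s",
    "streambuilder (no optimization)": "2 min 19 s",
    "streambuilder (optimization)": "1 min 40 s",
}

baseline = to_seconds(timings["ffmpeg"])
speedup = {name: baseline / to_seconds(value) for name, value in timings.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

With these inputs the optimized StreamBuilder run finishes in 100 s against 522 s for ffmpeg, a factor of about 5.2.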
A server with StreamBuilder installed and Quick Sync Video enabled can encode one input stream into 12 Full HD (1080p) HLS streams, or 24 HD (720p) HLS streams, or 46 SD (480p) HLS streams.
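One rough way to compare these three configurations is the total number of pixels encoded per frame interval. The sketch below does that arithmetic under two assumptions that the article does not confirm: all streams run at the same frame rate, and "480p" means 854x480 (16:9).

```python
# (width, height, number of parallel output streams)
# 854x480 for 480p is an assumption; the exact SD frame size is not given.
PROFILES = {
    "1080p": (1920, 1080, 12),
    "720p": (1280, 720, 24),
    "480p": (854, 480, 46),
}


def total_pixels(width: int, height: int, streams: int) -> int:
    """Pixels produced across all streams for one frame interval."""
    return width * height * streams


totals = {name: total_pixels(*spec) for name, spec in PROFILES.items()}

for name, total in totals.items():
    print(f"{name}: {total / 1e6:.1f} megapixels per frame interval")
```

Under these assumptions the totals are about 24.9, 22.1 and 18.9 megapixels, so the stream count does not scale strictly with pixel count. That would be consistent with some fixed per-stream cost, though we did not measure this.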
In addition, optimized memory operations reduced RAM consumption by half.
We exceeded our initial goal threefold! We can now encode several times more streams on hardware much cheaper than what we used before.
You can try out StreamBuilder too: just email us at ask@streambuilder.pro and we’ll send you a demo distribution.
Conclusion
Intel Media SDK for Servers makes it possible to build cost-effective, high-performance encoding/transcoding servers with a high stream density per rack unit. The implementation was no walk in the park: we ran into some difficulties related to the motherboard’s GPU, but they were eventually solved. As a reminder, the main hardware requirements are the C226 chipset and a motherboard without a built-in GPU.
The benefits of this solution: besides a significant performance boost, you get much lower CPU usage and low memory consumption, which frees up resources that you can use for other tasks (even extra CPU encoding).