VideoHelp Forum

  1. Banned (Join Date: Nov 2005, Location: United States)
    as the title says, in this post i will be examining the facts behind gpu acceleration as they pertain to video encoding/transcoding. if you ask most "knowledgeable" people about gpu accelerated encoding, almost all will say "the quality sucks" (or some variation on that theme) and will point you to a review on some site where a few tests were run and, while the speed was good, the quality was below what software based solutions offer.

    but why should this be? is that the end of the story?

    over the past couple of weeks i have read every white paper i could get my hands on and every faq and help file that comes with any gpu accelerated app currently available, and i have run numerous tests using my xfx 9600 gso 768mb and 2 other cards i borrowed:

    an HIS 5550 1gb and a Palit 460 gtx 768mb.

    here are the straight facts:

    most of the criticism regarding encoding quality comes from developers being lazy and using the reference h264 encoder nvidia supplied with their cuda sdk. if you download the sdk, read through the documentation and look at the sample code included, you will see that nvidia supplied a basic gpu accelerated h264 decoder and a basic h264 encoder as a way of showing programmers who had never written anything meant to run on a gpu how to structure their code and what techniques are required to get an app running on a gpu. the reality is that nvidia never meant for that decoder and encoder to be used as a final product, only as a starting point. it's just like years ago, during the windows 2000 days, when microsoft released a bare bones ram disk driver, with a max size limit of 32mb, as a way of showing developers how to code such a driver for windows. a number of commercial apps simply took that bare bones driver, wrapped a pretty package around it and tried to sell it as their own software.
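
    to make the "structure their code" point concrete, here is a minimal sketch of my own (not nvidia's sample code, and not an encoder) of the programming model those sdk samples teach: copy a frame into the card's memory, run one thread per pixel in a kernel, copy the result back. the brighten() kernel and the fixed 1440x1080 frame size are purely illustrative.

        // minimal cuda sketch of the model the sdk samples demonstrate:
        // host -> device copy, a kernel running one thread per pixel,
        // then device -> host copy. illustrative only, not a real encoder.
        #include <cuda_runtime.h>
        #include <cstdio>
        #include <cstdlib>
        #include <cstring>

        __global__ void brighten(unsigned char* luma, int n, int offset)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel
            if (i < n) {
                int v = luma[i] + offset;
                luma[i] = (unsigned char)(v > 255 ? 255 : v);
            }
        }

        int main()
        {
            const int n = 1440 * 1080;                       // one 1440x1080 luma plane
            unsigned char* host = (unsigned char*)malloc(n);
            memset(host, 64, n);                             // dummy frame data

            unsigned char* dev = 0;
            cudaMalloc((void**)&dev, n);
            cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice);

            brighten<<<(n + 255) / 256, 256>>>(dev, n, 16);  // 256 threads per block
            cudaDeviceSynchronize();

            cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost);
            printf("first pixel after kernel: %d\n", host[0]);

            cudaFree(dev);
            free(host);
            return 0;
        }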

    using the latest version of cyberlink's espresso (a piece of software that uses main concept's sdk, the latest version of which features full gpu acceleration), i test encoded a file with the following characteristics:

    VC-1 in WMV container, video at 8 mbit/s, audio wma 9.2 at 192 kbit/s / 48 khz, 18:44 min long, 1440x1080

    with the following target outputs:

    h264 at 10 mbit/s, aac audio at 128 kbit/s / 44 khz, 1440x1080, 16:9

    mpeg-2 at 10 mbit/s, ac3 audio at 128 kbit/s / 44 khz, 1440x1080, 16:9


    my 9600 GSO did the two encodes in 12:48 minutes and 13:03 minutes respectively.

    the 460 gtx did the same encodes in 15:28 minutes and 15:00 minutes respectively.

    here's where it gets even more interesting: espresso doesn't make full use of gpu acceleration when coupled with an ati card, it only supports gpu acceleration of filters and decoding, while encoding is done by the cpu. be that as it may, the same mpeg-2 encode as above finished in 11:44 minutes. the h264 encode took much longer, as one would expect from software based 1080p h264 encoding.

    the 9600 gso has 96 cuda cores, but only 48 are available to current gpu accelerated apps because this is a dual gpu card and only the latest cuda sdk supports multi-gpu acceleration; unfortunately it seems that main concept's sdk is based on the older cuda sdk. the 460 gtx has 336 cuda cores, but the new architecture doesn't allow for an exact cuda core comparison, primarily because of the way the cores are arranged, grouped and communicate with one another. the ati card has 320 stream processors, but you can't compare gpu "core" counts across competing gpu architectures.
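
    if you want to see what the cards themselves report, the cuda runtime exposes the multiprocessor count and the amount of on board ram directly (it does not report "cuda cores"; cores per multiprocessor depend on the architecture, roughly 8 per multiprocessor for the 9600 gso's generation and 48 for the gtx 460's, which is exactly why raw core counts don't compare). a quick device query sketch:

        // list every cuda device with the numbers discussed above.
        #include <cuda_runtime.h>
        #include <cstdio>

        int main()
        {
            int count = 0;
            cudaGetDeviceCount(&count);
            for (int i = 0; i < count; ++i) {
                cudaDeviceProp prop;
                cudaGetDeviceProperties(&prop, i);
                // cuda reports multiprocessors, not "cores"; the cores per
                // multiprocessor depend on the compute capability.
                printf("%s: compute %d.%d, %d multiprocessors, %lu MB on board ram\n",
                       prop.name, prop.major, prop.minor, prop.multiProcessorCount,
                       (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
            }
            return 0;
        }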

    so what exactly is going on? to understand, we need to keep in mind that when we transcode from one format to another the process isn't compressed_source ---> compressed_target; the process is compressed_source ---> decompressed_source ---> compressed_target.
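
    in code terms the path looks something like the toy loop below. decode_frame() and encode_frame() are hypothetical stand-ins for whatever sdk an app happens to use, not a real api; the only point is that a fully decompressed frame always sits in the middle of the pipeline.

        // toy transcode loop: compressed_source -> decompressed_source -> compressed_target.
        // decode_frame()/encode_frame() are dummy stand-ins, not any real sdk's api.
        #include <cstdio>
        #include <vector>

        struct RawFrame {
            int width, height;
            std::vector<unsigned char> pixels;               // the uncompressed intermediate
        };

        static bool decode_frame(int& frames_left, RawFrame& f)
        {
            if (frames_left-- <= 0) return false;            // pretend the stream ended
            f.width = 1440; f.height = 1080;
            f.pixels.assign(f.width * f.height * 3 / 2, 0);  // 4:2:0 sized raw buffer
            return true;                                     // a real decoder would fill it
        }

        static void encode_frame(const RawFrame& f)
        {
            (void)f;                                         // a real encoder would compress here
        }

        int main()
        {
            int frames_left = 3;                             // pretend source with 3 frames
            RawFrame frame;
            while (decode_frame(frames_left, frame))         // decompress the source...
                encode_frame(frame);                         // ...then re-compress to the target
            printf("transcode loop finished\n");
            return 0;
        }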

    after numerous tests with the above cards, using espresso, ulead's latest studio (which is based on the same main concept sdk as espresso and has similar options), tmpg express (which supports gpu acceleration for filters and mpeg-2 decode only) and the latest version of ImToo's encoder (which uses the reference nvidia h264 encoder), and after rerunning the above tests with full gpu acceleration, gpu accelerated decode only and gpu accelerated encode only, it became obvious that gpu acceleration is heavily dependent on the amount of on board ram the video card has, more than on the speed of that ram. in fact, the more compressed the source, the bigger the frame buffer needed to be; likewise, the more compressed the target format, the more gpu acceleration helped, and again the bigger the frame buffer the better.
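
    to put a rough number on the frame buffer point: a decoded 1440x1080 frame in a 4:2:0 layout is about 2.2 mb, so the number of raw frames a card can keep resident scales directly with its on board ram. this is back-of-the-envelope math that ignores whatever else the driver, the decoder and the encoder keep in device memory, so treat it as an upper bound, not a measurement.

        // rough frame buffer arithmetic for 1440x1080 4:2:0 (yv12/nv12) frames.
        #include <cstdio>

        int main()
        {
            const double frame_mb = 1440.0 * 1080.0 * 1.5 / (1024.0 * 1024.0);  // ~2.2 MB per frame
            const int cards_mb[] = { 768, 1024, 2048 };  // 9600 gso / hd 5550 / a 2 gb card
            for (int i = 0; i < 3; ++i)
                printf("%4d MB card: room for roughly %.0f raw frames\n",
                       cards_mb[i], cards_mb[i] / frame_mb);
            return 0;
        }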

    depending on the speed of the cpu mated to the gpu, it may be better from a performance standpoint to perform the decode on the cpu and let the gpu do the heavy lifting of encoding, rather than letting the gpu do everything. likewise, a video card with a huge frame buffer (i have seen cards with 2gb of on board ram) would probably benefit most from letting the video card handle both the decode and the encode. ram speed, from a video encoding standpoint, doesn't seem to have any bearing on performance.

    lastly, many of the complaints about encode quality, in addition to the reason i outlined above, seem to have a lot to do with the way gpu accelerated decoders uncompress video. the tmpg express team tacitly admits this, as they warn that enabling gpu accelerated decoding may result in unexpected output, and experimentation with the various cards coupled with gpu accelerated media players showed that some files weren't being decoded properly. as anyone who has ever done any programming knows, GIGO (garbage in, garbage out) applies: if the gpu accelerated decoder isn't uncompressing the video stream properly, you're going to end up with a crappy encode.

    i guess it comes down to this: if you have a high quality source, a card with lots of on board ram and a properly coded app, then you will get results you will be happy with. on the other hand, if you have a source of just average quality, a card with an average amount of ram and an app coded by lazy and/or incompetent programmers, then you will be very disappointed with gpu acceleration.

    of course this may all prove to be a moot point, as both amd and intel are readying cpus with fully integrated gpus. depending on how well this fusion of processors is pulled off, it may come to pass that gpu accelerated video encoding will be a stillborn technology, destined to be nothing more than an oddity, a footnote, in the annals of computing history.
  2. poisondeathray: Thanks for sharing your observations.

    Why not test higher quality sources? WMV 1440x1080 at 8 Mbit/s is quite low for "high quality".


    of course this may all prove to be a moot point, as both amd and intel are readying cpus with fully integrated gpus. depending on how well this fusion of processors is pulled off, it may come to pass that gpu accelerated video encoding will be a stillborn technology, destined to be nothing more than an oddity, a footnote, in the annals of computing history.
    Yes, this is interesting, but we still need programming to take advantage of the hardware. Apparently Sandy Bridge has dedicated transcoding silicon SEPARATE from the CPU and integrated GPU.


    For the first time on any Intel chip, Sandy Bridge will include silicon dedicated to handling the transcoding, or converting, of data from one format to another. The transcoding circuits will be separate from the main processor and the on-chip graphics function, according to sources at system makers.


    http://news.cnet.com/8301-13924_3-20013897-64.html
  3. Banned (Join Date: Nov 2005, Location: United States)
    Originally Posted by poisondeathray:
    Why not test higher quality sources? WMV 1440x1080 at 8 Mbit/s is quite low for "high quality".
    that's not necessarily true. the sources i used were from adult web sites where the footage is either shot directly to the format it's uploaded in, or shot using high quality DSLRs, and the final product on the site has only been transcoded once, from the format the camera outputs to vc-1.

    also, i have found that vc-1 sources are the most difficult to transcode with good results; it seems to be a highly compressed format that doesn't lend itself to proper decoding by the vast majority of decoders. as a general rule of thumb, given sources of equal quality, transcoding from vc-1 was the most difficult in terms of maintaining quality, h264 was right behind it, and divx (the asp variant) and xvid were almost as easy as mpeg-2 to transcode from, both in maintaining quality and in achieving maximum speed.

    in fact, if i were given the choice to shoot the same scene using 4 different cameras, each shooting directly to vc-1, h264, divx or mpeg-2, i would choose the mpeg-2 camera.

    be that as it may, if you have some really high quality footage feel free to upload it and i'll be more than happy to run a few more tests.

    Originally Posted by poisondeathray:
    Yes, this is interesting, but we still need programming to take advantage of the hardware.
    based on the white papers and some press releases from amd that i was able to find, it seems that all floating point math will be handled by the integrated gpu; basically it will require no special programming at all. this is further supported by amd's recent announcement that future amd cpus will not support the 3dnow! extensions, just sse.

    Originally Posted by poisondeathray:
    Apparently Sandy Bridge has dedicated transcoding silicon SEPARATE from the CPU and integrated GPU.

    For the first time on any Intel chip, Sandy Bridge will include silicon dedicated to handling the transcoding, or converting, of data from one format to another. The transcoding circuits will be separate from the main processor and the on-chip graphics function, according to sources at system makers.


    http://news.cnet.com/8301-13924_3-20013897-64.html
    if that is in fact the case, then sandy bridge may effectively put the kibosh on gpu acceleration. the article makes it sound like transcoding will be handled by dedicated hardware within the cpu, and the natural implication is that it will be transparent to the software being run. if so, it certainly makes the massive investment nvidia made in gpu acceleration look like a waste of money and resources.

    of course, this is intel; there is no doubt in my mind that they will artificially limit performance in order to preserve their forced upgrade/planned obsolescence business model.

    on the other hand, gpu acceleration may thrive if more apps start supporting opencl, directcompute or cuda and enthusiasts conclude that it's cheaper to buy a mid range video card with lots of on board ram than to spend a butt load of cash on a new motherboard, new cpu and new ram.

    either way, this can only be good for the consumer.