Intel SAndy Bridge ... manna from Heaven or monster from Hell?

4th Feb 2011 21:50 #31
jgreer

View Profile

View Forum Posts

Private Message
Member

Join Date
Feb 2011
If you got my point, I am having trouble understanding why you chose the example of matrix multiplication to illustrate a task that is faster on the GPU than the CPU. Either you missed my point or you did not realize how easily parallelizable matrix multiplication is (...?) Regardless, it doesn't look good.

I don't have to try to compile the code you posted to tell you it wouldn't work. You have not defined a,b,c, or N. The performance discrepancy between the CPU and the GPU would be a function of N. My suspicion would be that for small N, the CPU would be faster than the GPU due to the overhead of the kernel launch and the memcpy's to and from the gpu. Which brings me to my next point: Just because there are parts of the encoding task that are suited for GPUs doesn't mean that you will find a net benefit from utilizing the GPU for encoding since any performance gains you realize from the portions that execute on the GPU will be eaten by the overhead of kernel launches, data transfer, etc.

Quote
5th Feb 2011 06:45 #32
mh2360

View Profile

View Forum Posts

Private Message
Member

Join Date
Aug 2001

Location
UK
One only has to look at TMPGEnc Video Mastering Works to see the problems associated with GPU based encoding. Compared to x264, both CUDA and Intel SDK allow the user to change only the very basic settings (Bframes, Reference frames, GOP and motion search range), everything else seems to be "set in stone".

You just don't get the ability to tweak the codec to the requirements of your playback hardware as you do with x264, many of the tweaks and flags which make H.264 worth using just aren't there. This could also mean that we may see compatibility problems with certain players.

In the end it will depend on whether the user is happy with the results. Personally having tried both GPU encoders, i'll be sticking with x264, speed just isn't an issue for me.

Last edited by mh2360; 5th Feb 2011 at 06:52.

Quote
5th Feb 2011 07:52 #33
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Originally Posted by deadrats

i just finished numerous encoding tests for a review i was putting together

Would like to see it. Link?

Quote
6th Feb 2011 21:38 #34
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
https://forum.videohelp.com/threads/331545-an-honest-look-at-TMPGEnc-Video-Mastering-Works-5

sample encodes will be posted on monday.

Quote
7th Feb 2011 03:57 #35
oldfart13

View Profile

View Forum Posts

Private Message
Member

Join Date
Nov 2000

Location
Canada
Might want to hold off buying anymore SBs until they get the bugs out of the chips and mobos:

http://it.slashdot.org/story/11/01/31/1629232/Sandy-Bridge-Chipset-Shipments-Halted-Du...o-Bug?from=rss

Quote
7th Feb 2011 08:19 #36
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by jagabo

Originally Posted by deadrats

split the encoding task up by gop sequences, assign a thread to process each

You can't do that. The working set will become too large. You'll start cache thrashing and all gains from parallelism will go down the drain.

i never said how many segments i would process in parallel, i didn't mean process all of them simultaneously, i meant something on the order of 10-20 gop segments, write the results to the output file, then move on to the next batch.

Quote
7th Feb 2011 08:49 #37
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by jgreer

If you got my point, I am having trouble understanding why you chose the example of matrix multiplication to illustrate a task that is faster on the GPU than the CPU. Either you missed my point or you did not realize how easily parallelizable matrix multiplication is (...?) Regardless, it doesn't look good.

i chose this example because that is what i am working with at the moment, so it was the first thing that came to mind.

however consider a task such as a SAD calculation, which is performed thousands of times during an encode; i recently ran across the source code for an H263 codec that was available on the apple and the author had included the full un-optimized code as well as the AltiVec optimized code, so i thought it would be a great way to learn more about those now defunct SIMD instructions.

in the code it shows how a SAD was implemented by the author and to my surprise the exact same procedure i had always said should be used by a gpu powered encoder was the same one being used by the author, namely performing all the SAD's for a gop and saving them to an array, a simpler version of what i had been advocating for incorporating a gpu into the encoding process.

Just because there are parts of the encoding task that are suited for GPUs doesn't mean that you will find a net benefit from utilizing the GPU for encoding since any performance gains you realize from the portions that execute on the GPU will be eaten by the overhead of kernel launches, data transfer, etc.

variations of this are one of the most commonly used objections against using gpu's for encoding yet with a bit of analytical thought it is easily debunked: a 3d game, in it's most basic form, is nothing more than the gpu rendering polygons and layering textures on top of them. the textures are always pre-rendered on work station class cards and reside on the hard disk, they have to be uploaded to the gpu during execution of the game.

if a gpu, with a large frame buffer, using the PCI-E lanes, can render millions of polygons per second (remember those 3d images are being rendered at 100+ fps), and consequently deal with having massive amounts of data uploaded to it's frame buffer in the form of textures, why do you, or anyone else believe that the multiple kernel launches and data transfers associated with video encoding on a gpu would somehow bring a system to it's knees?

video cards, since dx9, have been designed to be programmed by shaders, do you know what a shader is and how it's used? a shader is typically coded in HLSL and resides in a separate ,shd file file that is then called repeatedly by the main game engine as needed, sound familiar?

since you mentioned kernel launches i'm assuming you are at least somewhat familiar with the "mechanics" of a cuda program, so if a gpu, designed to deal with thousands shader launches per second can handle that what makes you think that it can't handle multiple kernel launches?

don't you think nvidia, with it's hundreds of millions invested in cuda and it's well payed engineers thought of possible road blocks? or do you really believe that some open source developer that has to give his software away for free knows more about engineering than they do?

but perhaps the best proof i can offer is the fact that intel invested 5 years of research and development in quick sync and in the process effectively eliminated one of the two "killer apps" that drives cpu upgrades. intel has already stated that all intel cpu's from now on will feature QS and for those that don't know motion estimation is carried out by the integrated gpu (that's why in order to use QS the integrated gpu must be active).

if gpu encoding was pie in the sky as it's detractor would have you believe why did a multi-billion dollar company invest 5 years and billions developing it's own version to combat nvidia's offering?

clearly anyone that says gpu's are not well suited for video encoding is wrong and needs to lay of the x264 kool aid.

Quote
7th Feb 2011 08:52 #38
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by mh2360

Compared to x264, both CUDA and Intel SDK allow the user to change only the very basic settings (Bframes, Reference frames, GOP and motion search range), everything else seems to be "set in stone".

You just don't get the ability to tweak the codec to the requirements of your playback hardware as you do with x264, many of the tweaks and flags which make H.264 worth using just aren't there. This could also mean that we may see compatibility problems with certain players.

In the end it will depend on whether the user is happy with the results. Personally having tried both GPU encoders, i'll be sticking with x264, speed just isn't an issue for me.

one of my criticisms of tmpg's implementation of the cuda encoder and sdk was the lack of adjustable settings, that however does not mean that they suck just that the way the were implemented was substandard.

if you download the intel developer's media sdk guide you will see extensive features supported by QS, programmers just need to exploit them.

Quote
7th Feb 2011 10:24 #39
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
Tom's put up an article on quicksync , cuda etc....
http://www.tomshardware.com/reviews/video-transcoding-amd-app-nvidia-cuda-intel-quicksync,2839.html

Quote
7th Feb 2011 10:52 #40
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Originally Posted by deadrats

https://forum.videohelp.com/threads/331545-an-honest-look-at-TMPGEnc-Video-Mastering-Works-5

sample encodes will be posted on monday.

Thanks for the link.

Quote
7th Feb 2011 21:37 #41
RabidDog

View Profile

View Forum Posts

Private Message
Member

Join Date
Oct 2002

Location
UK
Its strange tho how many encoders (programs) still cant make use of any more than two cores? so if it takes them this long to become truly multicore how much longer must it take for them to become GPU trained?
At least if programmers code for SB CPUGPU they know they will have potential audience of 80%+ of new computers, while with CUDA and others, the market is far more limited.
I mean the limited supply of GPU programmers are all working for Brokerage firms or Render farms aren't they?

I understood the two bits of programming without being a programmer...

Corned beef is now made to a higher standard than at any time in history.
The electronic components of the power part adopted a lot of Rubycons.

Quote
7th Feb 2011 22:44 #42
rumplestiltskin

View Profile

View Forum Posts

Private Message
Member

Join Date
Jun 2003

Location
United States
You'll know if the GPU is doing its job if Apple's new MacBook Pros (which will be using SB CPUs) rely upon motherboard video or include a separate video chip (which they did in previous models and permitted the user to switch manually or automatically). If the SB GPU is all it's cracked up to be, Apple will rely upon it totally. The new machines are due in a month or so. Be patient.

Quote
8th Feb 2011 07:11 #43
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Originally Posted by rumplestiltskin

new MacBook Pros... due in a month or so

Surely they'll be delayed because of the recently discovered chipset problem.

Quote
8th Feb 2011 14:28 #44
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by RabidDog

Its strange tho how many encoders (programs) still cant make use of any more than two cores? so if it takes them this long to become truly multicore how much longer must it take for them to become GPU trained?
At least if programmers code for SB CPUGPU they know they will have potential audience of 80%+ of new computers, while with CUDA and others, the market is far more limited.
I mean the limited supply of GPU programmers are all working for Brokerage firms or Render farms aren't they?

I understood the two bits of programming without being a programmer...

i can tell you this, i went to college later on in life, in 1998 i enrolled in a community college where i initially majored in physics, after taking only physics, chemistry and calculus classes i realized it was going to take me a long time to earn a physics degree and quite frankly i wasn't optimistic about my employment chances so i switched to comp sci and started working towards a comp sci degree. i took every single comp sci class they offered, if i had finished my electives i would have earned my A.S. in comp sci with a minor in physics.

in all those programming classes, i can't recall ever, ever having done any multithreaded coding or SIMD coding, everything i know now (and i'm not an expert by any means) i had to learn on my own.

the biggest road block to multithreaded programming is the tools that are normally used, namely C/C++. C dates back about 40 years, developed by bell, if i remember correctly, and was an evolution of A and B programming languages. C does not natively support multithreading or SIMD, but the language is extensible so you can add support via libraries, such as pthreads.

be that as it may, it's still a pain in the ass to write multithreaded code, if the language supported constructs such as:

thread 1 {

}

thread 2 {

}

and so on you would see all apps being multithreaded, maybe even too threaded. likewise if someone released a C compiler that allowed a programmer to do something like:

sse int c = a + b

sse float g = e/pi

or something similar, you would see much more simd enabled apps.

the intel compiler does have the capability to take straight code and multithread it and/or simd optimize it, but it can only be using from within visual c and both those apps are expensive.

i do think nvidia screwed the pooch with cuda, they should have made a dummy proof compiler that you could chimp your way through and simply done something like:

gpu execute {

}

and had the compiler take care of all the details. had they invested the time and money to do that (and no, it's not easy) today the conversation about gpu programming would be very different.

Last edited by deadrats; 8th Feb 2011 at 14:36.

Quote
8th Feb 2011 14:48 #45
oldfart13

View Profile

View Forum Posts

Private Message
Member

Join Date
Nov 2000

Location
Canada
Also from that Tom's site:

"You have chosen a CPU that uses an 1155 socket, otherwise known as "Sandybridge". There is a problem with the SATA ports on motherboards that support that socket, and they have been removed from the market until INTEL can fix the problem. If you want to build an INTEL based system now, you need either an 1156 or a 1366 CPU and compatable motherboard."

Can't people just WAIT?

Quote
8th Feb 2011 16:36 #46
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by oldfart13

Also from that Tom's site:

"You have chosen a CPU that uses an 1155 socket, otherwise known as "Sandybridge". There is a problem with the SATA ports on motherboards that support that socket, and they have been removed from the market until INTEL can fix the problem. If you want to build an INTEL based system now, you need either an 1156 or a 1366 CPU and compatable motherboard."

Can't people just WAIT?

i would and am. i see no point in spending good money buying a previous generation cpu when in a couple of months things will be iron out.

there is one caveat to that however and that's if you find a ridiculously good deal on a 1156 or 1366 cpu and an obscenely good deal on a supporting motherboard, then i might consider it.

but it would have to be a really good deal.

Quote
9th Feb 2011 11:50 #47
hank315

View Profile

View Forum Posts

Private Message
HCenc author

Join Date
Dec 2006

Location
Netherlands
and so on you would see all apps being multithreaded, maybe even too threaded. likewise if someone released a C compiler that allowed a programmer to do something like:

sse int c = a + b

sse float g = e/pi

or something similar, you would see much more simd enabled apps.

the intel compiler does have the capability to take straight code and multithread it and/or simd optimize it, but it can only be using from within visual c and both those apps are expensive.

Just for the record, there's a big difference between multithreading and SIMD instructions.
The Intel compilers will insert SIMD instructions automatically using a dispatcher which will recognize the available MMX/SSEx CPU capabilities.
AFAIK a compiler will never insert multithreading code, you have to program that yourself using a pthread lib or the Windows API.

And IMO Visual Studio combined with a decent (Intel) compiler isn't that expensive if you use it professional.

HCenc at: http://hank315.nl

Quote
9th Feb 2011 16:35 #48
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by hank315

AFAIK a compiler will never insert multithreading code, you have to program that yourself using a pthread lib or the Windows API.

i'm pretty sure intel released a compiler version that could take single threaded code and multi-thread it back in the P4 days, as a way of showing off the capabilities of hyper-threading and i'm almost positive that there is an option within the intel compiler to this day to auto-parallelize you code and i'm also fairly certain that they advertise this capability for their latest fortran compiler (along with support for avx).

i'm going to download the demos for both intel compilers and microsoft's visual studio and confirm my memory.

Quote
9th Feb 2011 17:13 #49
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Automatic Parallelization with Intel® Compilers

http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/

Three requirements must be met for the compiler to parallelize a loop. First, the number of iterations must be known before entry into a loop so that the work can be divided in advance. A while-loop, for example, usually cannot be made parallel. Second, there can be no jumps into or out of the loop. Third, and most important, the loop iterations must be independent.

Quote
9th Feb 2011 18:15 #50
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
good, so i'm not losing my mind, intel's compiler is able to automatically multi-thread certain types of code, obviously an experienced programmer will get better results but if one wants to save on development time and doesn't give a rip about wringing every last drop of speed out of his code, intel's compilers may be the way to go.

Quote
9th Feb 2011 20:11 #51
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
I suspect that multithreading optimization works in such a small number of cases it's essentially useless.

Quote
9th Feb 2011 20:24 #52
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by jagabo

I suspect that multithreading optimization works in such a small number of cases it's essentially useless.

i wouldn't be too sure about that, when the very first hyperthreaded P4's came out there was site that did a comparison test between those P4's and whichever was the fastest amd's at the time (i think they still used the "xp" nomenclature, anyway back then mp3 encoding was a part of every review and so these guys did a test with lame and the amd was faster. they also got their hands on a copy of the intel compiler and they compiled lame with the /MT switch (i think that's the switch) and reran the tests.

this time, though both processors saw an increase in performance, the P4 saw a bigger gain and was able to beat the amd easily (i'm suspecting that the compiler also probably used SSE2 optimizations for the P4).

now you've got me thinking, i think i'm going to write a quick math benchmarks with loops that should be easily parallelized and see what kind of boost the intel compiler is capable of providing.

Quote
9th Feb 2011 20:38 #53
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Intel's runtime library disables all MMX and SSE optimizations if "Genuine Intel" doesn't appear in the CPUID.

Quote

Intel SAndy Bridge ... manna from Heaven or monster from Hell?

Thread Tools

Search Thread

Similar Threads

Ivy Bridge to be up to 199% faster than Sandy Bridge

new intel sandy bridge chips to run up to 180watts - 23amps! toast anyone?

new Intel Sandy Bridge cpus

New Intel Sandy Bridge cpu

intel demos sandy bridge video transcoding engine...