If you got my point, then I'm having trouble understanding why you chose matrix multiplication as your example of a task that is faster on the GPU than on the CPU. Either you missed my point or you did not realize how trivially parallelizable matrix multiplication is. Either way, it doesn't look good.
I don't have to try to compile the code you posted to tell you it wouldn't work: you have not defined a, b, c, or N. The performance gap between the CPU and the GPU would be a function of N. My suspicion is that for small N the CPU would be faster than the GPU due to the overhead of the kernel launch and the memcpys to and from the GPU. Which brings me to my next point: Just because there are parts of the encoding task that are suited for GPUs doesn't mean that you will find a net benefit from utilizing the GPU for encoding since any performance gains you realize from the portions that execute on the GPU will be eaten by the overhead of kernel launches, data transfer, etc.
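To put that in concrete terms, here is roughly what a complete naive CUDA matrix multiply has to look like once a, b, c, and N are actually defined (this is my own illustrative sketch with placeholder values, not the code that was posted); notice how much of it is allocation and data transfer rather than arithmetic:

#include <stdio.h>
#include <cuda_runtime.h>

// naive NxN matrix multiply, one thread per output element
__global__ void matMul(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    }
}

int main(void)
{
    const int N = 512;                      // placeholder size; the CPU/GPU crossover depends on this
    size_t bytes = (size_t)N * N * sizeof(float);

    // host buffers with dummy data
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // device buffers
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    // the overhead under discussion: two uploads, a kernel launch, one download
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matMul<<<grid, block>>>(da, db, dc, N);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", c[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}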
-
One only has to look at TMPGEnc Video Mastering Works to see the problems associated with GPU-based encoding. Compared to x264, both the CUDA and Intel SDK encoders allow the user to change only the most basic settings (B-frames, reference frames, GOP and motion search range); everything else seems to be "set in stone".
You just don't get the ability to tweak the codec to the requirements of your playback hardware as you do with x264; many of the tweaks and flags which make H.264 worth using just aren't there. This could also mean that we may see compatibility problems with certain players.
In the end it will depend on whether the user is happy with the results. Personally, having tried both GPU encoders, I'll be sticking with x264; speed just isn't an issue for me. Last edited by mh2360; 5th Feb 2011 at 06:52.
-
https://forum.videohelp.com/threads/331545-an-honest-look-at-TMPGEnc-Video-Mastering-Works-5
sample encodes will be posted on monday. -
Might want to hold off buying any more SBs until they get the bugs out of the chips and mobos:
http://it.slashdot.org/story/11/01/31/1629232/Sandy-Bridge-Chipset-Shipments-Halted-Du...o-Bug?from=rss -
-
i chose this example because that is what i am working with at the moment, so it was the first thing that came to mind.
however, consider a task such as a SAD calculation, which is performed thousands of times during an encode. i recently ran across the source code for an H263 codec that was available on the apple site; the author had included the full un-optimized code as well as the AltiVec optimized code, so i thought it would be a great way to learn more about those now defunct SIMD instructions.
in the code the author shows how he implemented the SAD, and to my surprise it was the exact same procedure i had always said a gpu powered encoder should use, namely performing all the SAD's for a gop and saving them to an array; in other words, a simpler version of what i had been advocating for incorporating a gpu into the encoding process.
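just to make it concrete, here's a bare-bones sketch of my own (not the H263 author's code) of a kernel that computes the SAD of every 16x16 block of a luma plane against the co-located block in a reference frame and dumps the results into an array:

// my sketch: one thread block per 16x16 macroblock, SAD against the
// co-located block in the reference frame, one result per macroblock
__global__ void blockSAD(const unsigned char *cur, const unsigned char *ref,
                         int width, int height, unsigned int *sad_out)
{
    int mbx = blockIdx.x;                  // macroblock column
    int mby = blockIdx.y;                  // macroblock row
    int x = mbx * 16 + threadIdx.x;        // pixel coordinates
    int y = mby * 16 + threadIdx.y;

    __shared__ unsigned int partial[16 * 16];
    int tid = threadIdx.y * 16 + threadIdx.x;

    unsigned int d = 0;
    if (x < width && y < height) {
        int idx = y * width + x;
        int diff = (int)cur[idx] - (int)ref[idx];
        d = (unsigned int)(diff < 0 ? -diff : diff);
    }
    partial[tid] = d;
    __syncthreads();

    // tree reduction of the 256 per-pixel absolute differences
    for (int s = 128; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        sad_out[mby * gridDim.x + mbx] = partial[0];
}

// host side, roughly, for each frame of the gop:
//   dim3 block(16, 16);
//   dim3 grid(width / 16, height / 16);
//   blockSAD<<<grid, block>>>(d_cur, d_ref, width, height, d_sads);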
"Just because there are parts of the encoding task that are suited for GPUs doesn't mean that you will find a net benefit from utilizing the GPU for encoding since any performance gains you realize from the portions that execute on the GPU will be eaten by the overhead of kernel launches, data transfer, etc."
if a gpu with a large frame buffer, using the PCI-E lanes, can render millions of polygons per second (remember, those 3d images are being rendered at 100+ fps) and consequently deal with massive amounts of data being uploaded to its frame buffer in the form of textures, why do you, or anyone else, believe that the multiple kernel launches and data transfers associated with video encoding on a gpu would somehow bring a system to its knees?
video cards, since dx9, have been designed to be programmed via shaders. do you know what a shader is and how it's used? a shader is typically coded in HLSL and resides in a separate .shd file that is then called repeatedly by the main game engine as needed; sound familiar?
since you mentioned kernel launches i'm assuming you are at least somewhat familiar with the "mechanics" of a cuda program, so if a gpu designed to deal with thousands of shader launches per second can handle that, what makes you think it can't handle multiple kernel launches?
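if anyone doubts how cheap a kernel launch actually is, here's a quick sketch of mine (the launch count and kernel are made up, it just times a few thousand back-to-back launches of an empty kernel with cuda events) that you can run and judge for yourself:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void emptyKernel(void) { }      // does nothing, we only care about launch cost

int main(void)
{
    const int LAUNCHES = 10000;            // arbitrary number of launches to time

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 32>>>();              // warm up / create the context
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < LAUNCHES; i++)
        emptyKernel<<<1, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches took %.3f ms (%.3f us per launch)\n",
           LAUNCHES, ms, ms * 1000.0f / LAUNCHES);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}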
don't you think nvidia, with its hundreds of millions invested in cuda and its well paid engineers, thought of possible road blocks? or do you really believe that some open source developer that has to give his software away for free knows more about engineering than they do?
but perhaps the best proof i can offer is the fact that intel invested 5 years of research and development in quick sync and in the process effectively eliminated one of the two "killer apps" that drive cpu upgrades. intel has already stated that all intel cpu's from now on will feature QS, and for those that don't know, motion estimation is carried out by the integrated gpu (that's why the integrated gpu must be active in order to use QS).
if gpu encoding was the pie in the sky its detractors would have you believe, why did a multi-billion dollar company invest 5 years and billions developing its own version to combat nvidia's offering?
clearly anyone that says gpu's are not well suited for video encoding is wrong and needs to lay off the x264 kool aid. -
one of my criticisms of tmpg's implementation of the cuda encoder and sdk was the lack of adjustable settings; that however does not mean that they suck, just that the way they were implemented was substandard.
if you download the intel media sdk developer's guide you will see the extensive features supported by QS; programmers just need to exploit them. -
Tom's put up an article on quicksync, cuda, etc.:
http://www.tomshardware.com/reviews/video-transcoding-amd-app-nvidia-cuda-intel-quicksync,2839.html -
It's strange, though, how many encoders (programs) still can't make use of more than two cores. If it takes them this long to become truly multicore, how much longer will it take for them to become GPU trained?
At least if programmers code for the SB CPU/GPU they know they will have a potential audience of 80%+ of new computers, while with CUDA and the others the market is far more limited.
I mean, the limited supply of GPU programmers are all working for brokerage firms or render farms, aren't they?
I understood the two bits of programming without being a programmer...
Corned beef is now made to a higher standard than at any time in history. The electronic components of the power part adopted a lot of Rubycons. -
You'll know if the GPU is doing its job if Apple's new MacBook Pros (which will be using SB CPUs) rely upon motherboard video or include a separate video chip (which they did in previous models and permitted the user to switch manually or automatically). If the SB GPU is all it's cracked up to be, Apple will rely upon it totally. The new machines are due in a month or so. Be patient.
-
i can tell you this: i went to college later on in life. in 1998 i enrolled in a community college where i initially majored in physics. after taking only physics, chemistry and calculus classes i realized it was going to take me a long time to earn a physics degree, and quite frankly i wasn't optimistic about my employment chances, so i switched to comp sci and started working towards a comp sci degree. i took every single comp sci class they offered; if i had finished my electives i would have earned my A.S. in comp sci with a minor in physics.
in all those programming classes i can't recall ever, ever having done any multithreaded coding or SIMD coding; everything i know now (and i'm not an expert by any means) i had to learn on my own.
the biggest road block to multithreaded programming is the tools that are normally used, namely C/C++. C dates back about 40 years; it was developed at bell labs, if i remember correctly, as an evolution of the B programming language (which itself came from BCPL). C does not natively support multithreading or SIMD, but the language is extensible so you can add support via libraries, such as pthreads.
be that as it may, it's still a pain in the ass to write multithreaded code. if the language supported constructs such as:
thread 1 {
}
thread 2 {
}
and so on, you would see all apps being multithreaded, maybe even over-threaded. likewise, if someone released a C compiler that allowed a programmer to do something like:
sse int c = a + b
sse float g = e/pi
or something similar, you would see many more simd enabled apps.
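as it stands today you have to drop down to libraries and intrinsics to get the same effect; here's a rough sketch of mine showing what the equivalent of those two ideas actually looks like with pthreads and SSE2 intrinsics (the array names and sizes are just placeholders):

#include <stdio.h>
#include <pthread.h>
#include <emmintrin.h>                     // SSE2 intrinsics

#define N 1024
float a[N], b[N], c[N];

// what "thread 1 { ... }" has to look like in practice: a function with
// the pthread calling convention, launched and joined by hand
static void *add_worker(void *arg)
{
    int half = *(int *)arg;                // 0 = first half, 1 = second half
    int begin = half * (N / 2);
    int end = begin + (N / 2);

    // what the "sse c = a + b" idea has to look like: explicit 4-wide intrinsics
    for (int i = begin; i < end; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f; }

    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    pthread_create(&t0, NULL, add_worker, &id0);
    pthread_create(&t1, NULL, add_worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    printf("c[100] = %f\n", c[100]);
    return 0;
}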
the intel compiler does have the capability to take straight code and multithread it and/or simd optimize it, but it can only be used from within visual c++ and both of those apps are expensive.
i do think nvidia screwed the pooch with cuda; they should have made a dummy-proof compiler that you could chimp your way through and simply do something like:
gpu execute {
}
and had the compiler take care of all the details. had they invested the time and money to do that (and no, it's not easy), today the conversation about gpu programming would be very different. Last edited by deadrats; 8th Feb 2011 at 14:36.
-
Also from that Tom's site:
"You have chosen a CPU that uses an 1155 socket, otherwise known as "Sandybridge". There is a problem with the SATA ports on motherboards that support that socket, and they have been removed from the market until INTEL can fix the problem. If you want to build an INTEL based system now, you need either an 1156 or a 1366 CPU and compatable motherboard."
Can't people just WAIT? -
i would and am. i see no point in spending good money buying a previous generation cpu when in a couple of months things will be ironed out.
there is one caveat to that, however: if you find a ridiculously good deal on a 1156 or 1366 cpu and an obscenely good deal on a supporting motherboard, then i might consider it.
but it would have to be a really good deal. -
and so on, you would see all apps being multithreaded, maybe even over-threaded. likewise, if someone released a C compiler that allowed a programmer to do something like:
sse int c = a + b
sse float g = e/pi
or something similar, you would see many more simd enabled apps.
the intel compiler does have the capability to take straight code and multithread it and/or simd optimize it, but it can only be used from within visual c++ and both of those apps are expensive.
The Intel compilers will insert SIMD instructions automatically using a dispatcher which will recognize the available MMX/SSEx CPU capabilities.
AFAIK a compiler will never insert multithreading code; you have to program that yourself using a pthread lib or the Windows API.
And IMO Visual Studio combined with a decent (Intel) compiler isn't that expensive if you use it professionally. HCenc at: http://hank315.nl -
i'm pretty sure intel released a compiler version that could take single threaded code and multi-thread it back in the P4 days, as a way of showing off the capabilities of hyper-threading. i'm almost positive that there is an option within the intel compiler to this day to auto-parallelize your code, and i'm also fairly certain that they advertise this capability for their latest fortran compiler (along with support for avx).
i'm going to download the demos for both intel compilers and microsoft's visual studio and confirm my memory. -
Automatic Parallelization with Intel® Compilers
Three requirements must be met for the compiler to parallelize a loop. First, the number of iterations must be known before entry into a loop so that the work can be divided in advance. A while-loop, for example, usually cannot be made parallel. Second, there can be no jumps into or out of the loop. Third, and most important, the loop iterations must be independent. -
good, so i'm not losing my mind: intel's compiler is able to automatically multi-thread certain types of code. obviously an experienced programmer will get better results, but if one wants to save on development time and doesn't give a rip about wringing every last drop of speed out of his code, intel's compilers may be the way to go.
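for reference, this is the kind of loop i have in mind; it meets all three requirements from the quote above (trip count known before entry, no jumps in or out, independent iterations), so the compiler's auto-parallelization switch (/Qparallel on windows, if memory serves) should be able to split it across cores. this is just my own toy example, no guarantees:

#include <stdio.h>

#define N 2000000

static float x[N], y[N], z[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 0.5f; }

    // trip count known before entry, no jumps in or out,
    // and every iteration is independent of every other one
    for (int i = 0; i < N; i++)
        z[i] = x[i] * y[i] + 1.0f;

    printf("z[123] = %f\n", z[123]);
    return 0;
}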
-
I suspect that multithreading optimization works in such a small number of cases it's essentially useless.
-
i wouldn't be too sure about that. when the very first hyperthreaded P4's came out there was a site that did a comparison test between those P4's and whichever was the fastest amd at the time (i think they still used the "xp" nomenclature). anyway, back then mp3 encoding was a part of every review, so these guys did a test with lame and the amd was faster. they also got their hands on a copy of the intel compiler, compiled lame with the /MT switch (i think that's the switch) and reran the tests.
this time, though both processors saw an increase in performance, the P4 saw a bigger gain and was able to beat the amd easily (i suspect the compiler also used SSE2 optimizations for the P4).
now you've got me thinking; i think i'm going to write a quick math benchmark with loops that should be easily parallelized and see what kind of boost the intel compiler is capable of providing. -
Intel's runtime library disables all MMX and SSE optimizations if "Genuine Intel" doesn't appear in the CPUID.