VideoHelp Forum




  1. If you got my point, I am having trouble understanding why you chose the example of matrix multiplication to illustrate a task that is faster on the GPU than the CPU. Either you missed my point or you did not realize how easily parallelizable matrix multiplication is (...?) Regardless, it doesn't look good.

I don't have to try to compile the code you posted to tell you it wouldn't work. You have not defined a, b, c, or N. The performance discrepancy between the CPU and the GPU would be a function of N. My suspicion would be that for small N, the CPU would be faster than the GPU due to the overhead of the kernel launch and the memcpys to and from the GPU. Which brings me to my next point: Just because there are parts of the encoding task that are suited for GPUs doesn't mean that you will find a net benefit from utilizing the GPU for encoding, since any performance gains you realize from the portions that execute on the GPU will be eaten by the overhead of kernel launches, data transfer, etc.
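For reference, here is a minimal sketch of what such a CUDA matrix multiply looks like; the two host-to-device copies, the kernel launch, and the copy back are exactly the overhead I'm talking about. The names a, b, c and N follow the earlier post, everything else is my own illustration, not the code that was posted:

// Hypothetical sketch (not the code from the earlier post): a naive CUDA
// multiply of two N x N matrices a and b into c, showing the host-to-device
// copies, the kernel launch, and the copy back.
#include <cuda_runtime.h>

__global__ void matmul(const float *a, const float *b, float *c, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += a[row * N + k] * b[k * N + col];
        c[row * N + col] = sum;
    }
}

void gpu_matmul(const float *a, const float *b, float *c, int N)
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);   /* transfer in      */
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    matmul<<<grid, block>>>(da, db, dc, N);              /* kernel launch    */

    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);    /* transfer out     */

    cudaFree(da); cudaFree(db); cudaFree(dc);
}

For small N those fixed costs dominate and the CPU wins; only for large N does the arithmetic amortize them.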
  2. One only has to look at TMPGEnc Video Mastering Works to see the problems associated with GPU-based encoding. Compared to x264, both CUDA and the Intel SDK allow the user to change only the very basic settings (B-frames, reference frames, GOP and motion search range); everything else seems to be "set in stone".

    You just don't get the ability to tweak the codec to the requirements of your playback hardware as you do with x264; many of the tweaks and flags that make H.264 worth using just aren't there. This could also mean that we may see compatibility problems with certain players.

    In the end it will depend on whether the user is happy with the results. Personally, having tried both GPU encoders, I'll be sticking with x264; speed just isn't an issue for me.
    Last edited by mh2360; 5th Feb 2011 at 06:52.
  3. Originally Posted by deadrats
    i just finished numerous encoding tests for a review i was putting together
    Would like to see it. Link?
  4.
  5.
Might want to hold off buying any more SBs until they get the bugs out of the chips and mobos:

    http://it.slashdot.org/story/11/01/31/1629232/Sandy-Bridge-Chipset-Shipments-Halted-Du...o-Bug?from=rss
  6.
Originally Posted by jagabo
Originally Posted by deadrats
    split the encoding task up by gop sequences, assign a thread to process each
    You can't do that. The working set will become too large. You'll start cache thrashing and all gains from parallelism will go down the drain.
I never said how many segments I would process in parallel, and I didn't mean processing all of them simultaneously. I meant something on the order of 10-20 GOP segments at a time: encode a batch, write the results to the output file, then move on to the next batch.
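Roughly, the batching scheme I have in mind looks like this. It's a bare sketch; encode_gop() and write_gop() are made-up placeholders, not calls from any real encoder:

#include <pthread.h>
#include <stdio.h>

#define BATCH 16            /* GOPs encoded in parallel per batch */

typedef struct {
    int gop_index;          /* which GOP this job covers          */
    int done;               /* stand-in for the encoded bitstream */
} gop_job;

/* Placeholder for the real per-GOP encode; made up for this sketch. */
static void encode_gop(gop_job *job)
{
    job->done = 1;
}

/* Placeholder: append the encoded GOP to the output file in order. */
static void write_gop(FILE *out, gop_job *job)
{
    fprintf(out, "GOP %d\n", job->gop_index);
}

static void *worker(void *arg)
{
    encode_gop((gop_job *)arg);
    return NULL;
}

void encode_in_batches(gop_job *jobs, int total_gops, FILE *out)
{
    for (int base = 0; base < total_gops; base += BATCH) {
        int n = total_gops - base < BATCH ? total_gops - base : BATCH;
        pthread_t tid[BATCH];

        for (int i = 0; i < n; i++)          /* one thread per GOP in the batch */
            pthread_create(&tid[i], NULL, worker, &jobs[base + i]);
        for (int i = 0; i < n; i++)          /* wait for the whole batch        */
            pthread_join(tid[i], NULL);

        for (int i = 0; i < n; i++)          /* write the results in order      */
            write_gop(out, &jobs[base + i]);
    }
}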
  7.
Originally Posted by jgreer
    If you got my point, I am having trouble understanding why you chose the example of matrix multiplication to illustrate a task that is faster on the GPU than the CPU. Either you missed my point or you did not realize how easily parallelizable matrix multiplication is (...?) Regardless, it doesn't look good.
I chose this example because that is what I am working with at the moment, so it was the first thing that came to mind.

However, consider a task such as a SAD calculation, which is performed thousands of times during an encode. I recently ran across the source code for an H.263 codec written for the Apple platform, and the author had included the full un-optimized code as well as the AltiVec-optimized code, so I thought it would be a great way to learn more about those now-defunct SIMD instructions.

The code shows how the author implemented the SAD, and to my surprise it was the exact same procedure I had always said a GPU-powered encoder should use: perform all the SADs for a GOP and save them to an array. It's a simpler version of what I had been advocating for incorporating a GPU into the encoding process.
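For anyone who hasn't seen one, this is roughly what a SAD over a 16x16 block looks like in plain C, with the per-block results for a GOP collected into an array. It's a generic illustration of mine, not the code from that H.263 source:

#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a current and a reference 16x16 block. */
static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(cur[x] - ref[x]);
        cur += stride;   /* next row of the current block   */
        ref += stride;   /* next row of the reference block */
    }
    return sad;
}

/* Fill sads[] with one SAD per block position. Every iteration is independent
 * of the others, which is what makes this kind of work easy to farm out to
 * SIMD units or a GPU. */
void sad_for_gop(const uint8_t *cur, const uint8_t *ref, int stride,
                 int num_blocks, const int *block_offsets, unsigned *sads)
{
    for (int i = 0; i < num_blocks; i++)
        sads[i] = sad_16x16(cur + block_offsets[i], ref + block_offsets[i], stride);
}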

Just because there are parts of the encoding task that are suited for GPUs doesn't mean that you will find a net benefit from utilizing the GPU for encoding, since any performance gains you realize from the portions that execute on the GPU will be eaten by the overhead of kernel launches, data transfer, etc.
Variations of this are among the most commonly used objections against using GPUs for encoding, yet with a bit of analytical thought it is easily debunked. A 3D game, in its most basic form, is nothing more than the GPU rendering polygons and layering textures on top of them. The textures are pre-rendered on workstation-class cards and reside on the hard disk; they have to be uploaded to the GPU during execution of the game.

If a GPU with a large frame buffer, using the PCI-E lanes, can render millions of polygons per second (remember, those 3D images are being rendered at 100+ fps) and consequently deal with massive amounts of data being uploaded to its frame buffer in the form of textures, why do you, or anyone else, believe that the multiple kernel launches and data transfers associated with video encoding on a GPU would somehow bring a system to its knees?

Video cards, since DX9, have been designed to be programmed with shaders. Do you know what a shader is and how it's used? A shader is typically coded in HLSL and resides in a separate .shd file that is then called repeatedly by the main game engine as needed. Sound familiar?

Since you mentioned kernel launches, I'm assuming you are at least somewhat familiar with the "mechanics" of a CUDA program. If a GPU is designed to deal with thousands of shader invocations per second, what makes you think it can't handle multiple kernel launches?

Don't you think NVIDIA, with its hundreds of millions invested in CUDA and its well-paid engineers, thought of possible road blocks? Or do you really believe that some open source developer who has to give his software away for free knows more about engineering than they do?

But perhaps the best proof I can offer is the fact that Intel invested 5 years of research and development in Quick Sync, and in the process effectively eliminated one of the two "killer apps" that drive CPU upgrades. Intel has already stated that all Intel CPUs from now on will feature QS, and for those that don't know, motion estimation is carried out by the integrated GPU (that's why the integrated GPU must be active in order to use QS).

If GPU encoding were pie in the sky, as its detractors would have you believe, why did a multi-billion dollar company invest 5 years and billions developing its own version to combat NVIDIA's offering?

Clearly anyone who says GPUs are not well suited for video encoding is wrong and needs to lay off the x264 Kool-Aid.
  8.
Originally Posted by mh2360
Compared to x264, both CUDA and the Intel SDK allow the user to change only the very basic settings (B-frames, reference frames, GOP and motion search range); everything else seems to be "set in stone".

You just don't get the ability to tweak the codec to the requirements of your playback hardware as you do with x264; many of the tweaks and flags that make H.264 worth using just aren't there. This could also mean that we may see compatibility problems with certain players.

In the end it will depend on whether the user is happy with the results. Personally, having tried both GPU encoders, I'll be sticking with x264; speed just isn't an issue for me.
One of my criticisms of TMPGEnc's implementation of the CUDA encoder and the SDK was the lack of adjustable settings. That, however, does not mean that they suck, just that the way they were implemented was substandard.

If you download Intel's Media SDK developer's guide you will see the extensive features supported by QS; programmers just need to exploit them.
  9. Originally Posted by deadrats
    Thanks for the link.
  10. It's strange, though, how many encoders (programs) still can't make use of more than two cores. If it takes them this long to become truly multi-core, how much longer will it take for them to become GPU-trained?
    At least if programmers code for the SB CPU/GPU they know they will have a potential audience of 80%+ of new computers, while with CUDA and others the market is far more limited.
    I mean, the limited supply of GPU programmers are all working for brokerage firms or render farms, aren't they?

    I understood the two bits of programming without being a programmer...
  11.
You'll know whether the SB GPU is doing its job by whether Apple's new MacBook Pros (which will use SB CPUs) rely upon the integrated video alone or include a separate video chip (as previous models did, permitting the user to switch manually or automatically). If the SB GPU is all it's cracked up to be, Apple will rely upon it totally. The new machines are due in a month or so. Be patient.
Originally Posted by rumplestiltskin
    new MacBook Pros... due in a month or so
    Surely they'll be delayed because of the recently discovered chipset problem.
  13.
Originally Posted by RabidDog
It's strange, though, how many encoders (programs) still can't make use of more than two cores. If it takes them this long to become truly multi-core, how much longer will it take for them to become GPU-trained?
At least if programmers code for the SB CPU/GPU they know they will have a potential audience of 80%+ of new computers, while with CUDA and others the market is far more limited.
I mean, the limited supply of GPU programmers are all working for brokerage firms or render farms, aren't they?

I understood the two bits of programming without being a programmer...
I can tell you this: I went to college later in life. In 1998 I enrolled in a community college where I initially majored in physics; after taking only physics, chemistry and calculus classes I realized it was going to take me a long time to earn a physics degree, and quite frankly I wasn't optimistic about my employment chances, so I switched to comp sci and started working towards a comp sci degree. I took every single comp sci class they offered; if I had finished my electives I would have earned my A.S. in comp sci with a minor in physics.

In all those programming classes, I can't recall ever having done any multithreaded coding or SIMD coding. Everything I know now (and I'm not an expert by any means) I had to learn on my own.

The biggest road block to multithreaded programming is the tools that are normally used, namely C/C++. C dates back about 40 years; it was developed at Bell Labs, if I remember correctly, as an evolution of the B programming language. C does not natively support multithreading or SIMD, but the language is extensible, so you can add support via libraries such as pthreads.
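For example, the pthreads route looks roughly like this (a bare-bones sketch, nothing encoder-specific):

// Minimal pthreads example of the library route mentioned above:
// two threads each fill half of an array, then the main thread joins them.
#include <pthread.h>
#include <stdio.h>

#define N 1000
static int data[N];

typedef struct { int start, end; } range;

static void *fill(void *arg)
{
    range *r = (range *)arg;
    for (int i = r->start; i < r->end; i++)
        data[i] = i * i;                 /* independent work per thread */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    range a = { 0, N / 2 }, b = { N / 2, N };

    pthread_create(&t1, NULL, fill, &a);
    pthread_create(&t2, NULL, fill, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%d %d\n", data[1], data[N - 1]);
    return 0;
}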

Be that as it may, it's still a pain in the ass to write multithreaded code. If the language supported constructs such as:

    thread 1 {

    }

    thread 2 {

    }

and so on, you would see all apps being multithreaded, maybe even too threaded. Likewise, if someone released a C compiler that allowed a programmer to do something like:

    sse int c = a + b

    sse float g = e/pi

or something similar, you would see many more SIMD-enabled apps.
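The closest real-world equivalent today is compiler intrinsics, which are a lot more verbose than the hypothetical syntax above. A sketch using SSE2 intrinsics:

// What the hypothetical "sse int c = a + b" looks like with SSE2
// intrinsics today: adding four packed 32-bit integers at a time.
#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    int a[4] = { 1, 2, 3, 4 };
    int b[4] = { 10, 20, 30, 40 };
    int c[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi32(va, vb);      /* c = a + b, four lanes at once */
    _mm_storeu_si128((__m128i *)c, vc);

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}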

The Intel compiler does have the capability to take straight code and multithread it and/or SIMD-optimize it, but it can only be used from within Visual C++, and both of those apps are expensive.

I do think NVIDIA screwed the pooch with CUDA. They should have made a dummy-proof compiler that you could chimp your way through, where you simply do something like:

    gpu execute {

    }

and had the compiler take care of all the details. Had they invested the time and money to do that (and no, it's not easy), the conversation about GPU programming today would be very different.
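Directive-based compilers do take roughly this approach: with OpenACC, for example, a pragma marks a loop for offload and the compiler generates the kernel and the data transfers. A minimal sketch, assuming an OpenACC-capable compiler (e.g. built with -acc); this is my illustration of the idea, not anything NVIDIA shipped with CUDA itself:

// Directive-based offload in the spirit of the "gpu execute { }" idea:
// the pragma marks the loop and the compiler handles kernel generation
// and the host/device copies.
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];              /* runs on the GPU if one is present */

    printf("%f\n", c[N - 1]);
    return 0;
}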
    Last edited by deadrats; 8th Feb 2011 at 14:36.
  14.
    Also from that Tom's site:

    "You have chosen a CPU that uses an 1155 socket, otherwise known as "Sandybridge". There is a problem with the SATA ports on motherboards that support that socket, and they have been removed from the market until INTEL can fix the problem. If you want to build an INTEL based system now, you need either an 1156 or a 1366 CPU and compatable motherboard."

    Can't people just WAIT?
  15.
Originally Posted by oldfart13
    Also from that Tom's site:

    "You have chosen a CPU that uses an 1155 socket, otherwise known as "Sandybridge". There is a problem with the SATA ports on motherboards that support that socket, and they have been removed from the market until INTEL can fix the problem. If you want to build an INTEL based system now, you need either an 1156 or a 1366 CPU and compatable motherboard."

    Can't people just WAIT?
I would and am. I see no point in spending good money on a previous-generation CPU when in a couple of months things will be ironed out.

There is one caveat to that, however: if you find a ridiculously good deal on a 1156 or 1366 CPU and an obscenely good deal on a supporting motherboard, then I might consider it.

But it would have to be a really good deal.
  16. HCenc author
and so on, you would see all apps being multithreaded, maybe even too threaded. Likewise, if someone released a C compiler that allowed a programmer to do something like:

sse int c = a + b

sse float g = e/pi

or something similar, you would see many more SIMD-enabled apps.

The Intel compiler does have the capability to take straight code and multithread it and/or SIMD-optimize it, but it can only be used from within Visual C++, and both of those apps are expensive.
    Just for the record, there's a big difference between multithreading and SIMD instructions.
The Intel compilers will insert SIMD instructions automatically, using a dispatcher that recognizes the available MMX/SSEx CPU capabilities.
AFAIK a compiler will never insert multithreading code; you have to program that yourself using a pthread lib or the Windows API.

And IMO Visual Studio combined with a decent (Intel) compiler isn't that expensive if you use it professionally.
  17.
Originally Posted by hank315
AFAIK a compiler will never insert multithreading code; you have to program that yourself using a pthread lib or the Windows API.
I'm pretty sure Intel released a compiler version that could take single-threaded code and multi-thread it back in the P4 days, as a way of showing off the capabilities of hyper-threading. I'm almost positive that there is an option within the Intel compiler to this day to auto-parallelize your code, and I'm also fairly certain that they advertise this capability for their latest Fortran compiler (along with support for AVX).

I'm going to download the demos for both Intel compilers and Microsoft's Visual Studio and confirm my memory.
  18. Automatic Parallelization with Intel® Compilers
    http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/

    Three requirements must be met for the compiler to parallelize a loop. First, the number of iterations must be known before entry into a loop so that the work can be divided in advance. A while-loop, for example, usually cannot be made parallel. Second, there can be no jumps into or out of the loop. Third, and most important, the loop iterations must be independent.
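As a concrete illustration (my own example, not from the Intel article), here is a loop that meets all three requirements next to one that does not:

// The first loop has a known trip count, no jumps, and independent
// iterations, so an auto-parallelizer can split it across cores.
// The second carries a dependency between iterations (a[i] needs a[i-1])
// and cannot be parallelized as written.
#define N 1000000

void scale(float *dst, const float *src, float k)
{
    for (int i = 0; i < N; i++)        /* independent iterations: parallelizable */
        dst[i] = k * src[i];
}

float prefix_sum_last(float *a)
{
    for (int i = 1; i < N; i++)        /* a[i] depends on a[i-1]: not parallelizable */
        a[i] += a[i - 1];
    return a[N - 1];
}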
  19.
Good, so I'm not losing my mind: Intel's compiler is able to automatically multi-thread certain types of code. Obviously an experienced programmer will get better results, but if one wants to save on development time and doesn't give a rip about wringing every last drop of speed out of his code, Intel's compilers may be the way to go.
  20. I suspect that multithreading optimization works in such a small number of cases it's essentially useless.
  21.
Originally Posted by jagabo
    I suspect that multithreading optimization works in such a small number of cases it's essentially useless.
I wouldn't be too sure about that. When the very first hyper-threaded P4s came out, there was a site that did a comparison test between those P4s and whatever the fastest AMD was at the time (I think they still used the "XP" nomenclature). Back then MP3 encoding was a part of every review, so these guys did a test with LAME and the AMD was faster. They also got their hands on a copy of the Intel compiler, compiled LAME with the /MT switch (I think that's the switch), and reran the tests.

This time, though both processors saw an increase in performance, the P4 saw a bigger gain and was able to beat the AMD easily (I suspect the compiler also used SSE2 optimizations for the P4).

Now you've got me thinking: I think I'm going to write a quick math benchmark with loops that should be easily parallelized and see what kind of boost the Intel compiler is capable of providing.
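A minimal sketch of the kind of benchmark I have in mind: a trivially parallelizable loop timed with clock(). Build it once without and once with the compiler's auto-parallelization switch and compare (on Windows clock() measures wall time; elsewhere a wall-clock timer is the better choice):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define N 10000000

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    for (int i = 0; i < N; i++)
        a[i] = (float)i;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)            /* independent iterations */
        b[i] = sqrtf(a[i]) * 1.5f + 2.0f;
    clock_t t1 = clock();

    printf("%.3f s  (check: %f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, b[N - 1]);

    free(a); free(b);
    return 0;
}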
  22. Intel's runtime library disables all MMX and SSE optimizations if "Genuine Intel" doesn't appear in the CPUID.
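The vendor string it keys off comes from CPUID leaf 0. A quick way to see what your own CPU reports (this sketch uses GCC/Clang's <cpuid.h>; MSVC has __cpuid() in <intrin.h> instead):

// Prints the CPUID vendor string that dispatchers check
// ("GenuineIntel" on Intel CPUs, "AuthenticAMD" on AMD).
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13];

    __get_cpuid(0, &eax, &ebx, &ecx, &edx);   /* leaf 0: vendor string          */
    memcpy(vendor + 0, &ebx, 4);              /* string is stored EBX, EDX, ECX */
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';

    printf("%s\n", vendor);
    return 0;
}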


