VideoHelp Forum
  1. deadrats (Banned) | Join Date: Nov 2005 | Location: United States
    mr poisondeathray, that was a lucid, intelligent, well thought-out objection. OVERRULED!!!

i had originally written out a long rebuttal of everything you said, but instead decided to supply some code to prove my point. note: i stole all this code from nvidia themselves, it's out there if anyone wants to go looking for it:

    code to decode a video stream on a gpu:

    *******************************

    #include "VideoDecoder.h"

    #include "FrameQueue.h"
    #include <cstring>
    #include <cassert>
    #include <string>

    VideoDecoder::VideoDecoder(const CUVIDEOFORMAT & rVideoFormat,
    CUcontext &rContext,
    cudaVideoCreateFlags eCreateFlags,
    CUvideoctxlock &ctx)
    : m_CtxLock(ctx)
    {
    // get a copy of the CUDA context
    m_Context = rContext;
    m_VideoCreateFlags = eCreateFlags;

    printf("> VideoDecoder::cudaVideoCreateFlags = <%d>", (int)eCreateFlags);
    switch (eCreateFlags) {
    case cudaVideoCreate_Default: printf("Default (VP)\n"); break;
    case cudaVideoCreate_PreferCUDA: printf("Use CUDA decoder\n"); break;
    case cudaVideoCreate_PreferDXVA: printf("Use DXVA decoder\n"); break;
    default: printf("Unknown value\n"); break;
    }

    // Validate video format. Currently only a subset is
    // supported via the cuvid API.
    cudaVideoCodec eCodec = rVideoFormat.codec;
    assert(cudaVideoCodec_MPEG1 == eCodec || cudaVideoCodec_MPEG2 == eCodec || cudaVideoCodec_VC1 == eCodec || cudaVideoCodec_H264 == eCodec);
    assert(cudaVideoChromaFormat_420 == rVideoFormat.chroma_format);
    // Fill the decoder-create-info struct from the given video-format struct.
    memset(&oVideoDecodeCreateInfo_, 0, sizeof(CUVIDDECODECREATEINFO));
    // Create video decoder
    oVideoDecodeCreateInfo_.CodecType = rVideoFormat.codec;
    oVideoDecodeCreateInfo_.ulWidth = rVideoFormat.coded_width;
    oVideoDecodeCreateInfo_.ulHeight = rVideoFormat.coded_height;
    oVideoDecodeCreateInfo_.ulNumDecodeSurfaces = FrameQueue::cnMaximumSize;
    // Limit decode memory to 24MB (16M pixels at 4:2:0 = 24M bytes)
    while (oVideoDecodeCreateInfo_.ulNumDecodeSurfaces * rVideoFormat.coded_width * rVideoFormat.coded_height > 16*1024*1024)
    {
    oVideoDecodeCreateInfo_.ulNumDecodeSurfaces--;
    }
    oVideoDecodeCreateInfo_.ChromaFormat = rVideoFormat.chroma_format;
    oVideoDecodeCreateInfo_.OutputFormat = cudaVideoSurfaceFormat_NV12;
    oVideoDecodeCreateInfo_.DeinterlaceMode = cudaVideoDeinterlaceMode_Adaptive;

    // No scaling
    oVideoDecodeCreateInfo_.ulTargetWidth = oVideoDecodeCreateInfo_.ulWidth;
    oVideoDecodeCreateInfo_.ulTargetHeight = oVideoDecodeCreateInfo_.ulHeight;
    oVideoDecodeCreateInfo_.ulNumOutputSurfaces = 2; // We won't simultaneously map more than 2 surfaces
    oVideoDecodeCreateInfo_.ulCreationFlags = m_VideoCreateFlags;
    oVideoDecodeCreateInfo_.vidLock = ctx;
    // create the decoder
    CUresult oResult = cuvidCreateDecoder(&oDecoder_, &oVideoDecodeCreateInfo_);
    assert(CUDA_SUCCESS == oResult);
    }

    VideoDecoder::~VideoDecoder()
    {
    cuvidDestroyDecoder(oDecoder_);
    }

    cudaVideoCodec
    VideoDecoder::codec()
    const
    {
    return oVideoDecodeCreateInfo_.CodecType;
    }

    cudaVideoChromaFormat
    VideoDecoder::chromaFormat()
    const
    {
    return oVideoDecodeCreateInfo_.ChromaFormat;
    }

    unsigned long
    VideoDecoder::maxDecodeSurfaces()
    const
    {
    return oVideoDecodeCreateInfo_.ulNumDecodeSurfaces;
    }

    unsigned long
    VideoDecoder::frameWidth()
    const
    {
    return oVideoDecodeCreateInfo_.ulWidth;
    }

    unsigned long
    VideoDecoder::frameHeight()
    const
    {
    return oVideoDecodeCreateInfo_.ulHeight;
    }

    unsigned long
    VideoDecoder::targetWidth()
    const
    {
    return oVideoDecodeCreateInfo_.ulTargetWidth;
    }

    unsigned long
    VideoDecoder::targetHeight()
    const
    {
    return oVideoDecodeCreateInfo_.ulTargetHeight;
    }

    void
    VideoDecoder::decodePicture(CUVIDPICPARAMS * pPictureParameters)
    {
    CUresult oResult = cuvidDecodePicture(oDecoder_, pPictureParameters);
    assert(CUDA_SUCCESS == oResult);
    }

    void
    VideoDecoder::mapFrame(int iPictureIndex, CUdeviceptr * ppDevice, unsigned int * pPitch, CUVIDPROCPARAMS * pVideoProcessingParameters)
    {
    CUresult oResult = cuvidMapVideoFrame(oDecoder_,
    iPictureIndex,
    ppDevice,
    pPitch, pVideoProcessingParameters);
    assert(CUDA_SUCCESS == oResult);
    assert(0 != *ppDevice);
    assert(0 != *pPitch);
    }

    void
    VideoDecoder::unmapFrame(CUdeviceptr pDevice)
    {
    CUresult oResult = cuvidUnmapVideoFrame(oDecoder_, pDevice);
    //assert(CUDA_SUCCESS == oResult);
    }

    ***********************************
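
so it's clear how this class actually gets used, here's a bare-bones sketch of the decode/map/unmap cycle. this bit is my own illustration, not nvidia's code, and it assumes the FrameQueue/VideoParser plumbing from the same sample is what hands you the CUVIDPICPARAMS:

****************************************

#include "VideoDecoder.h"
#include <cstring>

// illustration only: the per-picture cycle that the parser callbacks drive.
// pDecoder is a VideoDecoder built from the CUVIDEOFORMAT the parser reported,
// and rPicParams is the CUVIDPICPARAMS handed to the picture-decode callback.
void decodeAndFetch(VideoDecoder * pDecoder, CUVIDPICPARAMS & rPicParams)
{
// kick off hardware decode of this picture
pDecoder->decodePicture(&rPicParams);

// later, when the parser reports the frame ready for display,
// map it to get a device pointer to the NV12 surface
CUVIDPROCPARAMS oProcParams;
memset(&oProcParams, 0, sizeof(CUVIDPROCPARAMS));
oProcParams.progressive_frame = 1; // assume progressive content for this sketch

CUdeviceptr pDecodedFrame = 0;
unsigned int nPitch = 0;
pDecoder->mapFrame(rPicParams.CurrPicIdx, &pDecodedFrame, &nPitch, &oProcParams);

// ... copy or post-process the NV12 data at pDecodedFrame (row pitch = nPitch) ...

pDecoder->unmapFrame(pDecodedFrame);
}

****************************************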

here's how you integrate a cuda function into an existing c++ app:

    **********************************************

    /* Example of integrating CUDA functions into an existing
    * application / framework.
    * CPP code representing the existing application / framework.
    * Compiled with default CPP compiler.
    */

    // includes, system
    #include <iostream>
    #include <stdlib.h>

    // Required to include CUDA vector types
    #include <vector_types.h>
    #include "cutil_inline.h"

    ////////////////////////////////////////////////////////////////////////////////
    // declaration, forward
    extern "C" void runTest(const int argc, const char** argv,
    char* data, int2* data_int2, unsigned int len);

    ////////////////////////////////////////////////////////////////////////////////
    // Program main
    ////////////////////////////////////////////////////////////////////////////////
    int
    main(int argc, char** argv)
    {

    // input data
    int len = 16;
    // the data has some zero padding at the end so that the size is a multiple of
    // four, this simplifies the processing as each thread can process four
    // elements (which is necessary to avoid bank conflicts) but no branching is
    // necessary to avoid out of bounds reads
    char str[] = { 82, 111, 118, 118, 121, 42, 97, 121, 124, 118, 110, 56,
    10, 10, 10, 10};

    // Use int2 showing that CUDA vector types can be used in cpp code
    int2 i2[16];
    for( int i = 0; i < len; i++ )
    {
    i2[i].x = str[i];
    i2[i].y = 10;
    }

    // run the device part of the program
    runTest(argc, (const char**)argv, str, i2, len);

    std::cout << str << std::endl;
    for( int i = 0; i < len; i++ )
    {
    std::cout << (char)(i2[i].x);
    }
    std::cout << std::endl;

    cutilExit(argc, argv);
    }

    ****************************

    here's another sample:

    ******************************

    /* Example of integrating CUDA functions into an existing
    * application / framework.
    * Reference solution computation.
    */

    // Required header to support CUDA vector types
    #include <vector_types.h>

    ////////////////////////////////////////////////////////////////////////////////
    // export C interface
    extern "C"
    void computeGold(char* reference, char* idata, const unsigned int len);
    extern "C"
    void computeGold2(int2* reference, int2* idata, const unsigned int len);

    ////////////////////////////////////////////////////////////////////////////////
    //! Compute reference data set
    //! Each element is multiplied with the number of threads / array length
    //! @param reference reference data, computed but preallocated
    //! @param idata input data as provided to device
    //! @param len number of elements in reference / idata
    ////////////////////////////////////////////////////////////////////////////////
    void
    computeGold(char* reference, char* idata, const unsigned int len)
    {
    for(unsigned int i = 0; i < len; ++i)
    reference[i] = idata[i] - 10;
    }

    ////////////////////////////////////////////////////////////////////////////////
    //! Compute reference data set for int2 version
    //! Each element is multiplied with the number of threads / array length
    //! @param reference reference data, computed but preallocated
    //! @param idata input data as provided to device
    //! @param len number of elements in reference / idata
    ////////////////////////////////////////////////////////////////////////////////
    void
    computeGold2(int2* reference, int2* idata, const unsigned int len)
    {
    for(unsigned int i = 0; i < len; ++i)
    {
    reference[i].x = idata[i].x - idata[i].y;
    reference[i].y = idata[i].y;
    }
    }

**************************************************
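
for the curious, here's roughly what the device side (the .cu file that defines runTest and the kernels) could look like. i'm sketching it with the plain cuda runtime api instead of pasting nvidia's cutil based version, and i'm processing the chars one per thread instead of four at a time like the real sample, so treat it purely as an illustration; the subtract-by-10 mirrors the computeGold reference above (and should turn the padded string in main into "Hello World."):

**************************************************

#include <cuda_runtime.h>
#include <vector_types.h>

// subtract 10 from each character (same thing computeGold does on the cpu)
__global__ void kernelChar(char * data)
{
data[threadIdx.x] = data[threadIdx.x] - 10;
}

// subtract the y component (always 10 here) from the x component, mirroring computeGold2
__global__ void kernelInt2(int2 * data)
{
data[threadIdx.x].x = data[threadIdx.x].x - data[threadIdx.x].y;
}

extern "C" void runTest(const int argc, const char ** argv,
char * data, int2 * data_int2, unsigned int len)
{
(void)argc; (void)argv; // device selection / cutil bookkeeping omitted in this sketch

char * d_data = 0;
int2 * d_data_int2 = 0;
cudaMalloc((void **)&d_data, len * sizeof(char));
cudaMalloc((void **)&d_data_int2, len * sizeof(int2));

cudaMemcpy(d_data, data, len * sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(d_data_int2, data_int2, len * sizeof(int2), cudaMemcpyHostToDevice);

// one thread per element; len is only 16 in the host code above
kernelChar<<<1, len>>>(d_data);
kernelInt2<<<1, len>>>(d_data_int2);

cudaMemcpy(data, d_data, len * sizeof(char), cudaMemcpyDeviceToHost);
cudaMemcpy(data_int2, d_data_int2, len * sizeof(int2), cudaMemcpyDeviceToHost);

cudaFree(d_data);
cudaFree(d_data_int2);
}

**************************************************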

note that these code samples show the developers how to decode vc-1, h264, mpeg-1 and mpeg-2 on an nvidia gpu, how to integrate this gpu accelerated decoder into x264 (they would have to remove their own decoder, obviously), and how to integrate cuda functions into their own app.

note: x264 is a codec, and by definition a codec is a COmpressor and a DECompressor. i have just gpu accelerated the DECompressor portion of x264 and shown how to integrate it with the COmpressor portion of x264, with all of 5 minutes of searching through nvidia's website, using standard c++ code.

tell me again what great programmers the x264 developers are.

stay tuned, i'm about to show how to gpu accelerate significant portions of their COmpressor as well; give me a few days to get my hands dirty.

    edit:

    here's how to parse video on a gpu:

    *********************************

    #ifndef NV_VIDEO_PARSER
    #define NV_VIDEO_PARSER

    #include <cuvid/nvcuvid.h>

    #include <iostream>

    class FrameQueue;
    class VideoDecoder;

    // Wrapper class around the CUDA video-parser API.
    // The video parser consumes a video-data stream and parses it into
    // a) Sequences: Whenever a new sequence or initial sequence header is found
    // in the video stream, the parser calls its sequence-handling callback
    // function.
// b) Decode segments: Whenever a completed frame or half-frame is found
    // the parser calls its picture decode callback.
    // c) Display: Whenever a complete frame was decoded, the parser calls the
    // display picture callback.
    //
    class VideoParser
    {
    public:
    // Constructor.
    //
    // Parameters:
    // pVideoDecoder - pointer to valid VideoDecoder object. This VideoDecoder
    // is used in the parser-callbacks to decode video-frames.
    // pFrameQueue - pointer to a valid FrameQueue object. The FrameQueue is used
    // by the parser-callbacks to store decoded frames in it.
    VideoParser(VideoDecoder * pVideoDecoder, FrameQueue * pFrameQueue);

    private:
    // Struct containing user-data to be passed by parser-callbacks.
    struct VideoParserData
    {
    VideoDecoder * pVideoDecoder;
    FrameQueue * pFrameQueue;
    };

    // Default constructor. Don't implement.
    explicit
    VideoParser();

    // Copy constructor. Don't implement.
    VideoParser(const VideoParser & );

// Assignment operator. Don't implement.
    void
    operator= (const VideoParser & );

    // Called when the decoder encounters a video format change (or initial sequence header)
    // This particular implementation of the callback returns 0 in case the video format changes
    // to something different than the original format. Returning 0 causes a stop of the app.
    static
    int
    CUDAAPI
    HandleVideoSequence(void * pUserData, CUVIDEOFORMAT * pFormat);

    // Called by the video parser to decode a single picture
    // Since the parser will deliver data as fast as it can, we need to make sure that the picture
    // index we're attempting to use for decode is no longer used for display
    static
    int
    CUDAAPI
    HandlePictureDecode(void * pUserData, CUVIDPICPARAMS * pPicParams);

    // Called by the video parser to display a video frame (in the case of field pictures, there may be
    // 2 decode calls per 1 display call, since two fields make up one frame)
    static
    int
    CUDAAPI
    HandlePictureDisplay(void * pUserData, CUVIDPARSERDISPINFO * pPicParams);


    VideoParserData oParserData_; // instance of the user-data we have passed into the parser-callbacks.
    CUvideoparser hParser_; // handle to the CUDA video-parser

    friend class VideoSource;
    };

std::ostream &
operator << (std::ostream & rOutputStream, const CUVIDPARSERDISPINFO & rParserDisplayInfo);

    #endif // NV_VIDEO_PARSER

    **********************************************
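
and so the callback plumbing isn't a mystery, here's roughly what the picture-decode callback in the matching VideoParser.cpp does. i'm sketching it from the sample's layout, so treat the exact FrameQueue call (waitUntilFrameAvailable) as an assumption:

**********************************************

// sketch of the parser's picture-decode callback: recover the user data and
// hand the picture parameters to the VideoDecoder wrapper shown earlier
int CUDAAPI
VideoParser::HandlePictureDecode(void * pUserData, CUVIDPICPARAMS * pPicParams)
{
VideoParserData * pParserData = reinterpret_cast<VideoParserData *>(pUserData);

// don't reuse a decode surface that is still queued for display
// (waitUntilFrameAvailable is assumed to exist on FrameQueue, as in nvidia's sample)
bool bFrameAvailable = pParserData->pFrameQueue->waitUntilFrameAvailable(pPicParams->CurrPicIdx);
if (!bFrameAvailable)
return 0; // returning 0 tells the parser to stop

// feed the picture to the hardware decoder
pParserData->pVideoDecoder->decodePicture(pPicParams);
return 1; // keep parsing
}

**********************************************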

    here's some code for multithreading:

    **************************************

    #include <multithreading.h>

    #if _WIN32
    //Create thread
    CUTThread cutStartThread(CUT_THREADROUTINE func, void *data){
    return CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)func, data, 0, NULL);
    }

    //Wait for thread to finish
    void cutEndThread(CUTThread thread){
    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    }

    //Destroy thread
    void cutDestroyThread(CUTThread thread){
    TerminateThread(thread, 0);
    CloseHandle(thread);
    }

    //Wait for multiple threads
    void cutWaitForThreads(const CUTThread * threads, int num){
    WaitForMultipleObjects(num, threads, true, INFINITE);

    for(int i = 0; i < num; i++)
    CloseHandle(threads[i]);
    }

    #else
    //Create thread
    CUTThread cutStartThread(CUT_THREADROUTINE func, void * data){
    pthread_t thread;
    pthread_create(&thread, NULL, func, data);
    return thread;
    }

    //Wait for thread to finish
    void cutEndThread(CUTThread thread){
    pthread_join(thread, NULL);
    }

    //Destroy thread
    void cutDestroyThread(CUTThread thread){
    pthread_cancel(thread);
    }

    //Wait for multiple threads
    void cutWaitForThreads(const CUTThread * threads, int num){
    for(int i = 0; i < num; i++)
    cutEndThread(threads[i]);
    }

    #endif

    ********************************************
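
and here's how those helpers get used. this little driver is my own, and it assumes the CUT_THREADPROC / CUT_THREADEND portability macros that ship in the same multithreading.h header:

********************************************

#include <cstdio>
#include <multithreading.h>

// each worker just reports which slot it was given
static CUT_THREADPROC worker(void * data)
{
std::printf("worker %d running\n", *(int *)data);
CUT_THREADEND;
}

int main()
{
const int N = 4;
CUTThread threads[N];
int ids[N];

for (int i = 0; i < N; i++) {
ids[i] = i;
threads[i] = cutStartThread((CUT_THREADROUTINE)worker, (void *)&ids[i]);
}

// wait for (and clean up) all of them
cutWaitForThreads(threads, N);
return 0;
}

********************************************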

who's your daddy?
  2. Awesome. I think? It looks like alien to me.

    Any early testing yet, or still very beta?

    Decoding can already be done with DGNVIndex , but nothing on the encoding side for x264 and cuda, that I am aware of

    You can find the developers at IRC: irc://irc.freenode.net/x264 , Doom9, or Doom10 if you want them to help debug

    They welcome patches and development from everyone (of course they only commit after close scrutiny and testing)

    If this pans out, you will be "that guy" who succeeded where many have failed
  3. deadrats (Banned) | Join Date: Nov 2005 | Location: United States
    Originally Posted by poisondeathray
    Awesome. I think? It looks like alien to me.

    Any early testing yet, or still very beta?

    Decoding can already be done with DGNVIndex , but nothing on the encoding side for x264 and cuda, that I am aware of

    You can find the developers in IRC, Doom9, or Doom10 if you want them to help debug

    They welcome patches and development from everyone (of course they only commit after close scrutiny and testing)

    If this pans out, you will be "that guy" who succeeded where many have failed
here's the thing i didn't realize until just a few moments ago: many of the objections the x264 developers raise are silly. they don't hold up to scrutiny when looked at from a programming angle, and some are downright laughable, especially in light of the fact that nvidia has thoroughly documented cuda, supplied ample code samples, and cuda is taught as graduate level course work at quite a few prestigious universities. and then it hit me: x264 is open source software, specifically gpl'd software. gpl developers as a rule are very idealistic; they despise proprietary software, and in fact many despise any software not released under the gpl, open source or not.

cuda represents everything they hate: a proprietary framework, proprietary hardware, proprietary software, closed source drivers. they would never port their codec to run on a limited range of hardware; by porting x264 to cuda they would effectively be limiting the number of people that could use it to just owners of nvidia based geforce 8 and above video cards.

i should have seen it before: it's not technical obstacles that are preventing them from porting it to cuda, it's philosophical obstacles. my guess is that they will wait until open cl is fully mature and then modify their code to run within that open source framework.
  4. Originally Posted by deadrats

here's the thing i didn't realize until just a few moments ago: many of the objections the x264 developers raise are silly. they don't hold up to scrutiny when looked at from a programming angle, and some are downright laughable, especially in light of the fact that nvidia has thoroughly documented cuda, supplied ample code samples, and cuda is taught as graduate level course work at quite a few prestigious universities. and then it hit me: x264 is open source software, specifically gpl'd software. gpl developers as a rule are very idealistic; they despise proprietary software, and in fact many despise any software not released under the gpl, open source or not.

cuda represents everything they hate: a proprietary framework, proprietary hardware, proprietary software, closed source drivers. they would never port their codec to run on a limited range of hardware; by porting x264 to cuda they would effectively be limiting the number of people that could use it to just owners of nvidia based geforce 8 and above video cards.

i should have seen it before: it's not technical obstacles that are preventing them from porting it to cuda, it's philosophical obstacles. my guess is that they will wait until open cl is fully mature and then modify their code to run within that open source framework.
    Translation = you can't do it and you give up?

    From a layman's functional viewpoint, how do you get around the issues in my post above?

    Shall I ask one of the developers to review your code? or even Donald Graft, who wrote DGNVIndex and worked with Nvidia's team?

Some early work has been put into OpenCl; there were a couple of grad students thinking of it, but they pretty much ran into the same issues.
  5. deadrats (Banned) | Join Date: Nov 2005 | Location: United States
    Originally Posted by poisondeathray
    Translation = you can't do it and you give up?
    read until the end, i discovered something as i was writing this response that i think you will find quite interesting.

    in layman's terms:

    so there's no way you can explain how to integrate x264 with cuda

i have already done so; in fact i even provided code samples on the "how to". you just have a bizarre infatuation with x264 that borders on the unhealthy. you wouldn't happen to be a closet developer, would you? if you are, i urge you to come out of the closet, it will make you feel better, i promise not to mock you...too much.

btw, it's not "integrate x264 with cuda", it's port x264 to cuda, big difference.

    Name something that gives you better results - free or paid.

CC-HDe and blu-code; with interlaced content, main concept; and at blu-ray level bit rates, main concept, procoder and apple's h264, as well as vc-1 and mpeg-2. (i'm sure that should light a fire under you).

    Video consists of I-frames, P-frames, and B-frames , and long GOP formats use both Intra-frame and Inter-frame predictive compression techniques. They code differences between frames. Video uses "frames", but you don't process individual frames, modern encoders look at macroblocks. And these macroblocks, when using h.264 can be 16x16, 8x8, 4x4. You use all these words, but I don't think you understand what they really mean. If you understood these basic concepts, you would understand why this x264 or any decent modern encoder won't work with cuda.

when you say things like this you just leave me in stitches, you really do. yes, encoders look at macroblocks, and said macroblocks can be chained together to form slices, but said macroblocks and slices reside within, wait for it, frames. furthermore, I-frames are coded without reference to any other frame and are always intra-coded, thus it's very valid to talk about processing chunks of a gop on separate threads.

your "objections" are meaningless within the context of porting code to run on a gpu. a gpu was designed to work with individual pixels, and to render thousands of pixels at the same time; why would you believe that it would be incapable of handling a group of pixels that we could refer to as a macroblock, a bigger group of pixels that we can refer to as a slice, and a bigger collection of pixels that we can refer to as a frame?

when i "frame" the problem in the context of the ability to manipulate pixels, don't the supposed obstacles to porting x264 to cuda you have put forth seem downright foolish and misinformed? mind you, i in no way blame you; you have clearly drunk the x264 kool aid, you took what those idiot developers said as being accurate and based in reality, and now you simply parrot their party line.

but once you look at it from the frame of reference of a gpu being able to work on individual pixels, the thought that a gpu can't work on macroblocks, slices or individual frames because "the encoder is too advanced and complicated" gets exposed as the malformed collection of chemically produced electrical signals it is.

    You would have to "dumb down" x264 in order for it to work with cuda. With efficient encoders such as x264, there is dynamic frametype placement, and variable GOP sizes. e.g. a "whip pan" might place 10 I-frames in a row, but a slow pan or static shot might be 300 frames long between keyframes. So you cannot know ahead of time how to divide up your video and spawn "x" number of threads according to how many "units" your GPU has available - you don't know where or what the GOP's look like ahead of time. This becomes an allocation issue as you have idle units, and extra resources have to be wasted on allocation and optimization.

    no, you would have to smarten up the developers coding it. have you ever heard of 2 pass encoding? how about variable bit rate encoding? do you know what an analysis pass is? did you get the above from the developers as one of their reasons why they can't port x264 to cuda or did you make that up yourself?

here's how you get around this objection: either code your gpu accelerated encoder to perform a quick analysis pass so that it knows at what points it can segment the file, or launch an analysis thread that runs 10-20 seconds ahead of the fastest worker thread. as the analysis thread reaches the end of each gop it launches a worker thread and assigns that last analyzed segment to it for encoding, and you keep doing that until all gop's are being processed on separate threads. you will also need another thread for housekeeping, to kill each worker thread as it finishes its job and concatenate the results to the output file. any programmer that knows how to manipulate strings knows how to read ahead through a stream and analyze it; something like the sketch below.
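
to make that concrete, here's a rough sketch of the launch pattern i'm describing; pure illustration, not x264 or cuda code, and analyzeNextGop / encodeGop are made-up stand-ins for the real analysis and per-gop encode work:

*********************************

#include <cstdio>
#include <thread>
#include <vector>

struct GopSegment { int firstFrame; int lastFrame; };

// hypothetical stand-in: pretend every gop is 250 frames long and the clip has
// 10 of them; the real analysis pass would find the boundaries itself
static bool analyzeNextGop(int index, GopSegment & seg)
{
if (index >= 10) return false;
seg.firstFrame = index * 250;
seg.lastFrame = seg.firstFrame + 249;
return true;
}

// hypothetical stand-in for the per-gop encode that would be handed to the gpu
static void encodeGop(GopSegment seg)
{
std::printf("encoding frames %d..%d\n", seg.firstFrame, seg.lastFrame);
}

int main()
{
std::vector<std::thread> workers;

// analysis loop (here on the main thread): as each gop boundary is found,
// hand that segment to a worker immediately instead of waiting for the whole file
GopSegment seg;
for (int i = 0; analyzeNextGop(i, seg); ++i)
workers.emplace_back(encodeGop, seg);

// "housekeeping": wait for each worker in launch order, which is where the
// real thing would concatenate each finished segment onto the output file
for (std::thread & w : workers)
w.join();

return 0;
}

*********************************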

every single one of your objections with regard to this or that process being serial in nature is easily countered by the above technique. it works, it exists and it's easy to implement; any programmer with any formal training knows how to do it. the objections are silly on their face and absurd from a programming standpoint, nothing more than hollow excuses as to why "it can't be done".

i pored over the code for the x264 encoder and the most striking thing is how similar it is in structure to the xvid encoder code: they both have an overabundance of pointers. in fact, the only obstacle i can see from a programming standpoint to porting x264, or xvid for that matter, to cuda is the extensive use of function pointers, and it's an obstacle that can't be overcome: nvidia gpu's do not support function pointers, they support pointers, but not function pointers. i have no idea why those 2 mpeg-4 codecs both make such extensive use of function pointers, but the use is so widespread throughout the code that i believe it is insurmountable; you would need to recode all of x264 and xvid from the ground up sans function pointers.

interestingly enough, i took a look through the ffmpeg source and it doesn't appear to use them to any great extent. in fact, as far as structure is concerned it's reasonably similar to the nvidia h264 encoder, which leads me to believe that it would be the best candidate for porting to cuda, which jibes with the fact that of the three, ffmpeg is the only one to have been made to work with open cl.

    as a side note, based on what i see in the xvid and x264 code, i don't think it's possible to modify them to work with open cl either, the same hardware limitation applies, there's just no way around it.

    perhaps fermi will bring hardware support for function pointers, but if it doesn't we're right back to square one.

it would seem that you are 100% right about it being impossible to port the encoder portion of x264 to cuda, but not for any of the reasons you outlined, rather because of the programming techniques the developers used to code the encoder.

    bummer...
  6. HAHA ok..... I thought so.

Nope, everything I wrote was NOT from the developers' mouths at all, they are only my views. I explained it from a functional viewpoint as an end user of an encoder. I have about zero code writing experience, but I know how x264 works at a basic functional level, and I know a little bit about video. There's probably a lot of other technical or coding issues that one of the developers could add. I'm sure some of the suggestions you made could help with some of the issues - at least theoretically - but not all of them.

The linear lookahead method you mentioned is what x264 does right now (rc threaded lookahead). But how slow would it be on a GPU? Obviously there is no "thread equivalency" between CPU and GPU. I thought we were talking in the order of 1000's to 10000's of threads for Cuda? Those were the #'s thrown around by the developers. Isn't that why GPU's were "faster" for massively parallelizable tasks in the first place? You chop up the task into tiny little bits? Wouldn't your GPU be 99% idle?

    Yes, you could do 2-passes, with a CPU 1st pass, that would get around the GOP frametype placement issues, but you still would have many other issues. Quality is paramount for the developers, and they won't sacrifice it, and as an end user, I wouldn't want them to. I don't care if you have a geeky programming workaround, if the quality sucks I'm not interested.

1) While doing 2 passes (with a 1st pass CPU), you could calculate where the optimal frametype and GOP boundaries lie. BUT, you still wouldn't want to use the GPU because of lower quality and efficiency losses from intra-frame slice "boundaries." If 16 threads and 4 slices/frame results in a minor PSNR 0.0001dB to 0.0003dB quality loss for a certain source (just making these #'s up for example), imagine what 10000, or 20000 would do to quality. You would never get as good quality.

    Many sources would be slower to encode, and much worse quality. Consider a short 1 minute clip (maybe a video trailer , or something for youtube). It might only have 2 or 3 GOPs. How do you allocate the parallelizable work units? A few thousand per GOP ? How much of a quality hit are you willing to take? How many idle units would there be?

    2) One of the key benefits that 99% of x264 users rave about is the true 1 pass quality (variable quantizer) mode ie. CRF , as opposed to CQ or constant quantizer , like xvid and such use. You can't do CRF encodes properly with GPU for the reasons mentioned earlier.

3) non-parallelizable algorithms e.g. CABAC, brute force motion analysis, SATD - these would still be bottlenecks, and you would never get close to 100% efficient use of the GPU. These are "facts of life" that you can't get around; and there are several published papers on the subject if you're interested.



    So you're basically agreeing with me (even if for different reasons), that massive parts have to be re-written in order to force it to work. = lot of work = not going to happen.

That's too bad. Even if you could accelerate *parts* of the 2nd pass, or some 1-pass or crf calculations, without a significant quality hit, I'd be happy.

OK, so maybe x264 interlaced encoding can be improved, and some parts can be multithreaded a little better.... I'll drink the "kool aid" of any coders that make an encoder that works better than x264. If you suggest something does better (on progressive content), I will need proof. I'm a "proof" type of guy. If it sounds like I worship the developers, it's because they've produced a product that earns my respect. You've produced nothing to earn my respect. Hey, if you code one that does better, I'll "worship" the "deadrats encoder" too and become your biggest fan. I'm a fan of quality, and nothing is even close in quality/speed. But I'll jump ship in a heartbeat if something out there is better.

    The limitations mentioned earlier affect how GPU encoders work now. They take shortcuts. That's why the quality sucks. Maybe you can work from improving it from the existing cuda angle? Since it was written from the ground up to work on a GPU.

    Can you explain the bit on "function pointers" a bit more in plain English?
  7. deadrats (Banned) | Join Date: Nov 2005 | Location: United States
    Originally Posted by poisondeathray
Nope, everything I wrote was NOT from the developers' mouths at all, they are only my views. I explained it from a functional viewpoint as an end user of an encoder. I have about zero code writing experience, but I know how x264 works at a basic functional level, and I know a little bit about video. There's probably a lot of other technical or coding issues that one of the developers could add. I'm sure some of the suggestions you made could help with some of the issues - at least theoretically - but not all of them.
here's the thing: none of the "issues" you raised are hurdles to gpu acceleration in any way, shape or form. if you can decode a video stream on the gpu then you can encode a video stream on the gpu; the two are inverse operations of one another. similarly, if the gpu decode doesn't choke on variable length gop's, macroblocks or slices (hell, individual pixels don't faze it), then there is no reason why it should choke while trying to encode them.

    Originally Posted by poisondeathray
The linear lookahead method you mentioned is what x264 does right now (rc threaded lookahead). But how slow would it be on a GPU? Obviously there is no "thread equivalency" between CPU and GPU. I thought we were talking in the order of 1000's to 10000's of threads for Cuda? Those were the #'s thrown around by the developers. Isn't that why GPU's were "faster" for massively parallelizable tasks in the first place? You chop up the task into tiny little bits? Wouldn't your GPU be 99% idle?
    no, no, no!!!

    first things first, read this on threads:

    http://en.wikipedia.org/wiki/Thread_(computer_science)

when we talk about cpu threads and gpu threads we're fundamentally talking about the same construct, so there is cpu/gpu "thread equivalency". second, cuda has nothing to do with how many threads can be used; cuda is the framework for using C to code applications that run on nvidia gpu's, and the limiting factor is the hardware, not the development environment. in so far as how many threads can be kept in flight at any one time, the top of the line gpu's can keep slightly over 30 thousand threads in flight, while fermi will actually be able to keep fewer threads in flight, about 24 thousand being the number that has been thrown around. yes, that's part of the reason why they are faster for tasks that are massively parallel in nature, but the fact remains that gpu's are significantly faster for linear tasks as well. they are computation monsters; in terms of floating point performance they just can't be touched. i know you hate the flop metric but it is a very valid performance comparison benchmark: the more floating point operations per second a processor can perform, the faster it will be.

with gpu's, not only can they keep many more threads in flight than a cpu, but they can also complete each individual task significantly faster. think of it as having a phenom 9500: sure, it can keep twice the number of threads in flight at a time than an e8600 (4 vs 2), but even under the most multi-threaded app the e8600 is still way faster, because it can complete work on its 2 threads, be issued 2 more, complete them, and start on 2 more before the 9500 finishes its 4.
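
just to make the term "gpu thread" concrete, here's a toy cuda kernel launch that puts tens of thousands of them in flight at once; this is an illustration only, it has nothing to do with encoding:

*********************************

#include <cstdio>
#include <cuda_runtime.h>

// toy kernel: each gpu thread handles exactly one array element
__global__ void addTen(int * data, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
data[i] += 10;
}

int main()
{
const int n = 30000; // tens of thousands of threads, one per element
int * d = 0;
cudaMalloc((void **)&d, n * sizeof(int));
cudaMemset(d, 0, n * sizeof(int));

// 256 threads per block, enough blocks to cover all n elements
addTen<<<(n + 255) / 256, 256>>>(d, n);
cudaDeviceSynchronize();

int first = -1;
cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);
std::printf("first element = %d\n", first); // prints 10

cudaFree(d);
return 0;
}

*********************************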

    Yes, you could do 2-passes, with a CPU 1st pass, that would get around the GOP frametype placement issues, but you still would have many other issues. Quality is paramount for the developers, and they won't sacrifice it, and as an end user, I wouldn't want them to. I don't care if you have a geeky programming workaround, if the quality sucks I'm not interested.
no one says that quality has to suck; where is it written that variable gop's are required for maximum quality? x264 places way too much emphasis on low bit rate quality. seriously, just who is the target user? certainly not the pros who have no problem letting the bit rate go past 25 mb/s. incidentally, if quality was of the utmost importance to the developers, and to you, they would not use b or p frames, everything would be an i frame.

1) While doing 2 passes (with a 1st pass CPU), you could calculate where the optimal frametype and GOP boundaries lie. BUT, you still wouldn't want to use the GPU because of lower quality and efficiency losses from intra-frame slice "boundaries." If 16 threads and 4 slices/frame results in a minor PSNR 0.0001dB to 0.0003dB quality loss for a certain source (just making these #'s up for example), imagine what 10000, or 20000 would do to quality. You would never get as good quality.
    first things first, i never said do the analysis pass on the cpu, there is no reason why you can't perform the first pass on the gpu.

    your problem is that you suffer from the same thought patterns as those "reporters" at Fox, you have a preconceived notion based on zero facts, in some cases made up facts, and you use these "facts" as a way of supporting your perceived reality.

    by your own admission you just made those numbers up, you have your mind dead set on the notion that for some reason the exact same calculations performed on a different processor would somehow result in different results. where did you ever get the idea that performing the exact same operation would result in lower quality?

let's assume for the sake of argument that the above numbers are accurate: we have 4 slices per frame, 16 threads, obviously 4 frames at a time, and we end up with a PSNR quality loss of .0002dB. why would you extrapolate that to 10-20 thousand threads? no one is saying cut up each frame into 20 thousand slices, that would be insane. what you would do is work on more frames at the same time: still use 4 slices per frame, but instead of working on 4 frames at any one time, you would work on 2500-5000 frames at the same time. you would keep the quality loss, which you deemed acceptable, the same; you would just work on larger chunks of video at a time.

    Many sources would be slower to encode, and much worse quality. Consider a short 1 minute clip (maybe a video trailer , or something for youtube). It might only have 2 or 3 GOPs. How do you allocate the parallelizable work units? A few thousand per GOP ? How much of a quality hit are you willing to take? How many idle units would there be?
here you go making assumptions again. 2 problems with the above reasoning: no one is going to use gpu accelerated encoding for a 1 minute clip, and the performance advantages gpu's enjoy, for the 43 millionth time, are not tied solely to their ability to handle multiple threads at the same time, they can also finish working on said threads much faster.

    consider an inherently serial task like folding@home, check out their faq (they are currently the foremost experts on gpgpu programming):

    http://folding.stanford.edu/English/FAQ-SMP

    None of our engines are written to be thread-safe or multi-threaded. The only parallelizable codes (Gromacs and AMBER) both use MPI. Making Gromacs use only threads for parallelization isn't possible right now (we talk with the Gromacs developers frequently on this issue), so MPI is the only solution.

    http://folding.stanford.edu/English/FAQ-NVIDIA

One of the really exciting aspects about GPU's is that not only can they accelerate existing algorithms significantly, they get really interesting in that they can open doors to new algorithms that we would never think to do on CPUs at all (due to their very slow speed on CPUs, but not GPUs).

    Much like the Gromacs core greatly enhanced Folding@home by a 20x to 30x speed increase via a new utilization of hardware (SSE) in PCs, in 2006, Folding@home has developed a new streaming processor core to utilize another new generation of hardware: GPUs with programmable floating-point capability. By writing highly optimized, hand tuned code to run on ATI X1900 class GPUs, the science of Folding@home will see another 20x to 30x speed increase over its previous software (Gromacs) for certain applications. This great speed increase is achieved by running essentially the complete molecular dynamics calculation on the GPU; while this is a challenging software development task, it appears to be the way to achieve the highest speed improvement on GPU's


as you can see, much of the speed increase comes from the ability of gpu's to perform floating point operations 20-30 times faster than a cpu. yes, this faq was written circa 2006, and yes, cpu's have gotten faster since then, but so have gpu's. try it for yourself: download the cpu and gpu folding@home clients and see for yourself how much faster the gpu is, and as you can read for yourself it has nothing to do with the number of threads a gpu can handle, as most of the code isn't multi-threaded.

    2) One of the key benefits that 99% of x264 users rave about is the true 1 pass quality (variable quantizer) mode ie. CRF , as opposed to CQ or constant quantizer , like xvid and such use. You can't do CRF encodes properly with GPU for the reasons mentioned earlier.
    complete and utter bull, nothing you have posted would prevent the same operations from being run on a gpu, only much faster.

non-parallelizable algorithms e.g. CABAC, brute force motion analysis, SATD - these would still be bottlenecks, and you would never get close to 100% efficient use of the GPU. These are "facts of life" that you can't get around; and there are several published papers on the subject if you're interested.
    no, the "facts of life" was a tv show that's now off the air, but here's a fact for you to chew on, you know nothing about writing code, zip about cpu and gpu architectures and you're trying to apply your knowledge of video to explain topics you know little about.

    gpu's are awesome for brute force work:

    http://securityandthe.net/2008/10/12/russian-researchers-achieve-100-fold-increase-in-...racking-speed/

    The 100-fold increase in speed is achieved with two GeForce GTX280’s per workstation; for €599 you can build a network of 20 workstations dedicated to “recovering” your “lost” WPA keys. This means that a WPA or WPA2 key could be cracked in days or weeks instead of years.

    So you're basically agreeing with me (even if for different reasons), that massive parts have to be re-written in order to force it to work. = lot of work = not going to happen.
not re-written, completely restructured. if we were talking about rewriting the code, then that could be done, even if it was a tedious task; we're talking about re-thinking the implementation from the ground up.

    The limitations mentioned earlier affect how GPU encoders work now. They take shortcuts. That's why the quality sucks. Maybe you can work from improving it from the existing cuda angle? Since it was written from the ground up to work on a GPU.
ROTFLMAO!!! gpu's don't "take shortcuts", that's why they're used to accelerate many mission critical tasks, such as protein folding, seti, weather prediction, nuclear reaction analysis, stock market analysis; the list of scientific and professional applications is endless. if gpu's took shortcuts they wouldn't be used for squat; the x264 developers are just fond of making up bull sh*t excuses for the coding decisions they made.

    Can you explain the bit on "function pointers" a bit more in plain English?
    this one is going to be long. in C and C++ the code is structured within a construct called a function, you may be familiar with the general form:

    main ( )

in C/C++ a function hands its result back to its caller with the "return" statement, like this:

main ( )
{

// this is where the function definition, or the "body", is placed

return (value);

} // these brackets encapsulate all the elements of the function

in C/C++ the main portion of the program resides within the "main" function, and all functions need to be declared with a data type that corresponds to the type of data that is returned (or with void if nothing is returned). in the case of main ( ) it is usually a 0 (success) or a non-zero error code that is returned, and thus it is written like this:

int main ( )
{

return 0;
}

when the parentheses following the function name (in this case "main") are empty, no values are passed to the function, it is merely called from another function; when the parentheses list parameters, those are the values that get passed in to be operated on, and the function's declared type determines what, if anything, is handed back. thus:

void add ( ); // declared ahead of main so main can call it

int main ( )
{
add ( );
return 0;
}

void add ( )
{
int a = 1;
int b = 2;
int c = a + b;
cout << c;
return; // returns nothing, since add is declared void
}

calls a function that adds 1 + 2, assigns the value to c and then prints out the value of c to the screen, whereas:

int add (int a, int b); // declared ahead of main so main can call it

int main ( )
{
int a = 1;
int b = 2;
int c = add (a, b);
cout << c;
return 0;
}

int add (int a, int b)
{
int c = a + b;
return (c);
}

sends the values of 1 and 2 to the add ( ) function, which adds the two numbers and returns the resulting sum to main, which in turn prints the value to the screen. there are some things i'm leaving out, such as the preprocessor directives, for the sake of simplicity. now, as you may have noticed, all variables must be assigned a data type, and since functions also return data, they too must be declared with a data type; the data types can be int, float, double, char, long, as well as some other more complex data types such as dword.

now, because passing a large number of values, as well as strings, causes increased program overhead, pointers were included in C/C++ to keep resource usage to a minimum. a pointer is a variable that "points" to the location of another variable; in the example above, instead of passing the actual values of a, b or c, we could pass their locations in memory. while this doesn't have any performance advantage in the above examples, it certainly does when we're dealing with large data streams. when you want to read the value a pointer refers to, you dereference the pointer and it gives you whatever is in memory at the location it's referencing or pointing to.
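
in code, the dereference looks like this (trivial example, with the includes written out this time):

#include <iostream>

int main ( )
{
int c = 3;
int * p = &c; // p holds the memory location ("address") of c
std::cout << *p << std::endl; // dereferencing p reads what lives there: 3
return 0;
}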

now a function pointer is a pointer that, when you dereference it, actually invokes a function, which you can pass values to just like an ordinary function; it is this behavior that nvidia gpu's do not currently support in hardware.

based on what you said, the encoding of the video stream is highly dynamic, which in and of itself jibes nicely with the use of function pointers, since they are generally used to simplify code when you need to invoke functions based on a run-time value. the dynamic length of gop's would certainly lead to variable run-time values, and the simplest way to implement that in code is either via function pointers or via function objects (functors) in c++.
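
to make that concrete, here's a tiny example of the construct in question; it's not x264 code, just an illustration of a function pointer being called through and retargeted at run time:

#include <iostream>

// two interchangeable implementations with the same signature
int add (int a, int b) { return a + b; }
int sub (int a, int b) { return a - b; }

int main ( )
{
// pf is a function pointer: it holds the address of a function
// that takes two ints and returns an int
int (*pf)(int, int) = add;

std::cout << pf(1, 2) << std::endl; // calls add through the pointer: prints 3

// the target can be swapped at run time, which is the kind of dispatch
// the x264 source uses its function pointer tables for
pf = sub;
std::cout << pf(5, 2) << std::endl; // now calls sub: prints 3

return 0;
}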

looking over the x264 code i see no easy way to recode this so as to eliminate the need for function pointers; they made other design decisions that effectively painted them into a corner and the only way out was via function pointers. you would basically need to rethink x264 from the ground up.

as i mentioned, ffmpeg's avc codec is nothing like x264. i looked through libavcodec and, while it would need a bit of restructuring, i think it would be reasonably straightforward to port it to cuda.
  8. Originally Posted by deadrats
here's the thing: none of the "issues" you raised are hurdles to gpu acceleration in any way, shape or form. if you can decode a video stream on the gpu then you can encode a video stream on the gpu; the two are inverse operations of one another. similarly, if the gpu decode doesn't choke on variable length gop's, macroblocks or slices (hell, individual pixels don't faze it), then there is no reason why it should choke while trying to encode them.
Wrong. They are not equal inverse operations. Encoding is significantly more intensive and uses more calculations.

when we talk about cpu threads and gpu threads we're fundamentally talking about the same construct, so there is cpu/gpu "thread equivalency". second, cuda has nothing to do with how many threads can be used; cuda is the framework for using C to code applications that run on nvidia gpu's, and the limiting factor is the hardware, not the development environment. in so far as how many threads can be kept in flight at any one time, the top of the line gpu's can keep slightly over 30 thousand threads in flight, while fermi will actually be able to keep fewer threads in flight, about 24 thousand being the number that has been thrown around. yes, that's part of the reason why they are faster for tasks that are massively parallel in nature, but the fact remains that gpu's are significantly faster for linear tasks as well. they are computation monsters; in terms of floating point performance they just can't be touched. i know you hate the flop metric but it is a very valid performance comparison benchmark: the more floating point operations per second a processor can perform, the faster it will be.

When I talk about "thread equivalency" in layman's terms, I mean the same task doing the same thing. So your 20,000 threads should be 1666x faster than an i7 using 12 threads. Is this what you are suggesting? Are you suggesting that running the various motion analysis algorithms with 1 thread on a GPU, on a single GOP section, is equivalent in speed to using 1 thread on a CPU?

All this sounds great in theory, and some parallelizable tasks do work great with GPU's, like small workunit F@H. But things like time travel and teleportation sound great too. Why are the current GPU based encoders slower than CPU encoders at the same quality level, and why don't they even come close in top quality?

    I want you to prove it. When you examine the current GPU encoder output streams, they suck for the very reasons I suggested earlier. Is it because none of the programmers have the GPU know-how? Are they all "chimps"? Are they all lazy? Even if they were, you should be able to write some code for a better GPU encoder right?


    Yes, you could do 2-passes, with a CPU 1st pass, that would get around the GOP frametype placement issues, but you still would have many other issues. Quality is paramount for the developers, and they won't sacrifice it, and as an end user, I wouldn't want them to. I don't care if you have a geeky programming workaround, if the quality sucks I'm not interested.
no one says that quality has to suck; where is it written that variable gop's are required for maximum quality? x264 places way too much emphasis on low bit rate quality. seriously, just who is the target user? certainly not the pros who have no problem letting the bit rate go past 25 mb/s. incidentally, if quality was of the utmost importance to the developers, and to you, they would not use b or p frames, everything would be an i frame.
Come on, you know this answer. x264 is very configurable, and has an 8-bit 4:2:0 lossless mode. You can use all I-frames in lossy or lossless configuration - remember you did this for one of the test encodes in the other thread - do you remember how crappy it looked? When you have bitrate limitations and fixed capacity, all-Intra is very low performing. This is why long GOP formats were created in the first place!

There is a distinction and lots of valid uses for lossy compression. Blu-ray is already highly compressed, around 40-60x from the 10-bit 4:4:4 master; your average 100min movie master wouldn't even fit on a 2TB HDD. Most users are looking for better compression (i.e. better quality at the same bitrate), or faster encoding at a certain quality level. Are you happy with your TMPGEnc and Badaboom? If you can get the same quality, faster and at a lower bitrate, isn't that appealing? A 25Mb/s encode using x264 might require 35-40Mb/s for the same quality using other encoders.


1) While doing 2 passes (with a 1st pass CPU), you could calculate where the optimal frametype and GOP boundaries lie. BUT, you still wouldn't want to use the GPU because of lower quality and efficiency losses from intra-frame slice "boundaries." If 16 threads and 4 slices/frame results in a minor PSNR 0.0001dB to 0.0003dB quality loss for a certain source (just making these #'s up for example), imagine what 10000, or 20000 would do to quality. You would never get as good quality.
    first things first, i never said do the analysis pass on the cpu, there is no reason why you can't perform the first pass on the gpu.

    your problem is that you suffer from the same thought patterns as those "reporters" at Fox, you have a preconceived notion based on zero facts, in some cases made up facts, and you use these "facts" as a way of supporting your perceived reality.

    by your own admission you just made those numbers up, you have your mind dead set on the notion that for some reason the exact same calculations performed on a different processor would somehow result in different results. where did you ever get the idea that performing the exact same operation would result in lower quality?
The reason why you can't do the 1st pass on the GPU is that it would be too slow. In order to get the frametype placement, GOP size, etc. correct, it has to be done sequentially. Unless there is thread equivalency in terms of speed, I don't see how this can be done on a GPU.

    Where are your FACTS? All you have spewed are unproven theories. Who says you can even do the same operation on the GPU at all? How do you get around memory limitations?

    Where are your GPU encoded stream examples that PROVE you can do the same operations? The FACTS I have are the current stream samples that PROVE what I say is true for current encoders. I can show you features at the stream level why the quality sucks. I can emulate the low quality from GPU encodes by using similar settings with x264.

    Does this prove 5 years from now someone might have finally programmed a decent GPU encoder? Of course not, but what I'm saying is a lot closer to current reality than your unproven theories.

    Come on, make that encoder. Prove me wrong. I dare you to. In the scientific world, the onus is on those making the bold claims and theories to prove it, not on those who have established facts.

let's assume for the sake of argument that the above numbers are accurate: we have 4 slices per frame, 16 threads, obviously 4 frames at a time, and we end up with a PSNR quality loss of .0002dB. why would you extrapolate that to 10-20 thousand threads? no one is saying cut up each frame into 20 thousand slices, that would be insane. what you would do is work on more frames at the same time: still use 4 slices per frame, but instead of working on 4 frames at any one time, you would work on 2500-5000 frames at the same time. you would keep the quality loss, which you deemed acceptable, the same; you would just work on larger chunks of video at a time.
    Sounds great in theory. But not panning out in reality. I've given the reasons why I think this is the case, but you have yet to prove me wrong.

    Many sources would be slower to encode, and much worse quality. Consider a short 1 minute clip (maybe a video trailer , or something for youtube). It might only have 2 or 3 GOPs. How do you allocate the parallelizable work units? A few thousand per GOP ? How much of a quality hit are you willing to take? How many idle units would there be?
here you go making assumptions again. 2 problems with the above reasoning: no one is going to use gpu accelerated encoding for a 1 minute clip, and the performance advantages gpu's enjoy, for the 43 millionth time, are not tied solely to their ability to handle multiple threads at the same time, they can also finish working on said threads much faster.

    consider an inherently serial task like folding@home, check out their faq (they are currently the foremost experts on gpgpu programming):

    http://folding.stanford.edu/English/FAQ-SMP

    None of our engines are written to be thread-safe or multi-threaded. The only parallelizable codes (Gromacs and AMBER) both use MPI. Making Gromacs use only threads for parallelization isn't possible right now (we talk with the Gromacs developers frequently on this issue), so MPI is the only solution.

    http://folding.stanford.edu/English/FAQ-NVIDIA

One of the really exciting aspects about GPU's is that not only can they accelerate existing algorithms significantly, they get really interesting in that they can open doors to new algorithms that we would never think to do on CPUs at all (due to their very slow speed on CPUs, but not GPUs).

    Much like the Gromacs core greatly enhanced Folding@home by a 20x to 30x speed increase via a new utilization of hardware (SSE) in PCs, in 2006, Folding@home has developed a new streaming processor core to utilize another new generation of hardware: GPUs with programmable floating-point capability. By writing highly optimized, hand tuned code to run on ATI X1900 class GPUs, the science of Folding@home will see another 20x to 30x speed increase over its previous software (Gromacs) for certain applications. This great speed increase is achieved by running essentially the complete molecular dynamics calculation on the GPU; while this is a challenging software development task, it appears to be the way to achieve the highest speed improvement on GPU's


as you can see, much of the speed increase comes from the ability of gpu's to perform floating point operations 20-30 times faster than a cpu. yes, this faq was written circa 2006, and yes, cpu's have gotten faster since then, but so have gpu's. try it for yourself: download the cpu and gpu folding@home clients and see for yourself how much faster the gpu is, and as you can read for yourself it has nothing to do with the number of threads a gpu can handle, as most of the code isn't multi-threaded.
I'm quite familiar with F@H and DC projects. I trust Dr. Pande and Scott LeGrande from Nvidia more than you. The workunits processed are very different between CPU and GPU. GPU workunits are limited to small calculations and smaller numbers of atoms. They can't do the calculations with the complexity that a CPU can, and parts of the protein simulation model cannot be done by the GPU. Part of the issue is physical - ie. memory related - and part of the issue is programming related. They have said that while the sheer power of GPU folding is great (I know you don't need convincing, but if you look at the stats, GPU's do most of the work for the project now in terms of %), Dr. Pande still prefers people use the CPU client because it can do calculations required for the science that the GPU cannot.

    2) One of the key benefits that 99% of x264 users rave about is the true 1 pass quality (variable quantizer) mode ie. CRF , as opposed to CQ or constant quantizer , like xvid and such use. You can't do CRF encodes properly with GPU for the reasons mentioned earlier.
    complete and utter bull, nothing you have posted would prevent the same operations from being run on a gpu, only much faster.
    Prove it. The only reason a GPU would be faster is if the tasks were modified to allow for massive parallelization. You would still have idle units, load balance issues, and probably run out of memory. In many cases, it would be slower to run on a GPU for the reasons I mentioned earlier.


    Non-parallelizable algorithms, e.g. CABAC, brute force motion analysis, SATD - these would still be bottlenecks, and you would never get close to 100% efficient use of the GPU. These are "facts of life" that you can't get around, and there are several published papers on the subject if you're interested.
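    For the CABAC point specifically, the serial nature is easy to see in a toy arithmetic-coder loop (illustrative only, with made-up names, not x264's or the reference encoder's actual CABAC code): every bin updates the coder's range and the context model it used, and the next bin cannot be coded until that update has happened, so the loop cannot be spread across thousands of GPU threads.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t low;
        uint32_t range;
        uint8_t  state[460];    /* adaptive context models */
    } toy_cabac_t;

    static void encode_bin(toy_cabac_t *c, int ctx, int bin)
    {
        /* the split of 'range' depends on the current context state... */
        uint32_t r_lps = (c->range >> 6) * (c->state[ctx] + 1);
        if (bin) {
            c->low  += c->range - r_lps;
            c->range = r_lps;
            if (c->state[ctx] < 62) c->state[ctx]++;   /* ...and the state adapts */
        } else {
            c->range -= r_lps;
            if (c->state[ctx] > 0) c->state[ctx]--;
        }
        /* renormalization of low/range omitted for brevity */
    }

    static void encode_bins(toy_cabac_t *c, const int *ctx, const int *bins, int n)
    {
        int i;
        for (i = 0; i < n; i++)   /* strictly sequential: bin i+1 needs the state left by bin i */
            encode_bin(c, ctx[i], bins[i]);
    }

    int main(void)
    {
        toy_cabac_t c = { 0, 510, {0} };
        int ctx[4]  = { 3, 3, 7, 7 };
        int bins[4] = { 1, 0, 1, 1 };
        encode_bins(&c, ctx, bins, 4);
        printf("low=%u range=%u\n", c.low, c.range);
        return 0;
    }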
    no, the "facts of life" was a tv show that's now off the air, but here's a fact for you to chew on, you know nothing about writing code, zip about cpu and gpu architectures and you're trying to apply your knowledge of video to explain topics you know little about.
    Ditto for you and video! You're trying to apply your knowledge of cpu and gpu to video which you know little about.


    gpu's are awesome for brute force work:

    http://securityandthe.net/2008/10/12/russian-researchers-achieve-100-fold-increase-in-...racking-speed/

    The 100-fold increase in speed is achieved with two GeForce GTX280’s per workstation; for €599 you can build a network of 20 workstations dedicated to “recovering” your “lost” WPA keys. This means that a WPA or WPA2 key could be cracked in days or weeks instead of years.
    I'm not disputing that. They are way faster for small-workunit, parallelizable tasks. I'm suggesting that the workloads are different, and there are other issues when applying that to video encoding. The higher quality motion prediction and search algorithms haven't been written in a way that can take advantage of the GPU. In theory you should be able to get it to work, but it hasn't quite panned out, has it?

    The limitations mentioned earlier affect how GPU encoders work now. They take shortcuts. That's why the quality sucks. Maybe you can work on improving it from the existing cuda angle, since it was written from the ground up to work on a GPU.
    ROTFLMAO!!! gpus don't "take shortcuts", that's why they're used to accelerate many mission-critical tasks, such as protein folding, seti, weather prediction, nuclear reaction analysis, stock market analysis - the list of scientific and professional applications is endless. if gpus took shortcuts they wouldn't be used for squat; the x264 developers are just fond of making up bull sh*t excuses for the coding decisions they made.
    I think we're saying the same thing. Current GPU encoders take shortcuts because the GPU programmers have taken shortcuts. A GPU can only do what it's told, right?

    Current GPU encoders have lower quality prediction and analysis. They skip out on using some features like CABAC and b-frames, and residuals are a lot worse. We just disagree on the "why". All the things you said should in theory make it possible. But where is that great GPU encoder?

    as i mentioned, ffmpeg's avc codec is nothing like x264. i looked through libavc, and while it would need a bit of restructuring, i think it would be reasonably straightforward to port it to cuda.
    Modern ffmpeg builds use x264. Unless you're referring to some old build that gives worse quality than xvid.
  9. Member
    Join Date
    Feb 2009
    Location
    United States
    Search Comp PM
    ooowowwwwww my head hurts

    ocgw

    peace
    i7 2700K @ 4.4Ghz 16GB DDR3 1600 Samsung Pro 840 128GB Seagate 2TB HDD EVGA GTX 650
    https://forum.videohelp.com/topic368691.html
  10. Banned
    Join Date
    Nov 2005
    Location
    United States
    Search Comp PM
    Originally Posted by ocgw
    ooowowwwwww my head hurts
    i was under the impression that poison and i had scared everyone away, guess i was wrong, stay tuned, we're still not done.
  11. Banned
    Join Date
    Nov 2005
    Location
    United States
    Search Comp PM
    Originally Posted by poisondeathray
    When I talk about "thread equivalency" in layman's terms, I mean the same task doing the same thing. So your 20,000 threads should be 1666x faster than an i7 using 12 threads. Is this what you are suggesting? Are you suggesting that using 1 thread on the the various algorithms motion analysis with a GPU on a single GOP section is equivalent in speed when using 1 thread on a GPU?
    a thread is a thread is a thread, if you had any formal study in comp sci you would know this. 1 thread processed on a gpu is not equivalent speed-wise to it being processed on a cpu, it is much, much faster on the gpu. as for whether 20 thousand threads done on a gpu are 1666x faster than an i7 using 12 threads, i'll let you prove that to yourself: download the cinebench benchmark and run the software render and the hardware render benchmarks, where the same scene is rendered on the cpu and the gpu, and compare the results for yourself.

    Why are the current GPU-based encoders slower than CPU encoders at the same quality level, and don't even come close in top quality?
    what are you smoking and why don't you bring enough for all of us? the current gpu encoders are slower than cpu encoders?!? really? where did you buy your reality distortion field, and did you get a good deal on it?

    i just ran this test using the badaboom encoder: i took a 1080p wmv at 5 mb/s and encoded it to 1080p h264, 25 mb/s, main profile, level 4.1, cabac on, vbr with 128 kb/s ac3, with all processing, decode and encode, handled by the gpu, and i averaged 13-14 frames per second. i defy you to encode a file at 1080p at 25 mb/s, using any settings within x264, and achieve anywhere near that frame rate.

    I want you to prove it. When you examine the current GPU encoder output streams, they suck for the very reasons I suggested earlier. Is it because none of the programmers have the GPU know-how? Are they all "chimps"? Are they all lazy? Even if they were, you should be able to write some code for a better GPU encoder right?
    first things first, why is it that i have to prove it? you're the one making all kinds of nonsensical claims in regards to gpu capabilities. second of all, yes, no current mainstream programmer at the moment has the experience writing general purpose code on the gpu. gpgpu is still in its infancy; as i have already pointed out, most universities don't even offer courses in gpu programming, and those that do only offer it as graduate level course work, it's not like you can have a guy go to devry and learn enough to code for a gpu.

    If you can get the same quality, faster and at a lower bitrate isn't that appealing? A 25Mb/s encode using x264 might require 35-40Mb/s for the same quality using other encoders.
    not even close to true, x264 rapidly starts losing any advantage it has as the bit rate increases, and at 25 mb/s even mpeg-2 offers similar quality. as for low bit rate quality, quite frankly flix is much, much better; the best encodes i have ever seen at low bit rates were done with flix.

    The reason why you can't do the 1st pass on the GPU is that it would be too slow. In order to get the frametype placement, GOP size, etc. correct, it has to be done sequentially. Unless there is thread equivalency in terms of speed, I don't see how this can be done on a GPU.
    well, if you can't see how it can be done, i guess we should all pack it in and call it a day.

    you keep saying that it would be too slow on a gpu; i have offered you tons of third-party proof to the contrary. you're starting to make yourself look foolish, and i strongly suggest you stop.

    Where are your FACTS? All you have spewed are unproven theories. Who says you can even do the same operation on the GPU at all? How do you get around memory limitations?
    have you been paying attention at all or are the x264 blinders on too tight?

    as i have already pointed out, cuda is basically C for nvidia gpus: if the compiler supports a feature, then the hardware supports the feature. cuda is well documented, and all the proof you want is in the tutorials and the cuda developer documentation. looking through the documentation, geforce 8 and later gpus support ALL features of ANSI C, with the exception of function pointers and object-oriented programming features.

    every other procedural programming feature is supported: structs, unions, pointers, all data types, preprocessor directives, shared libraries, integer math, floating point math, pushing and popping the stack, it's all supported.

    you can get around the lack of support for classes by using function definitions within a struct (which is basically what a class is); basically you just need to use a slightly different programming technique than you're used to.
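    here's a quick toy to show what i mean; the struct name, the block layout and the numbers are all made up for illustration, this isn't code from any real encoder, it's just meant to compile under nvcc and show a plain struct with __device__ member functions standing in for a stripped-down class:

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    struct SadBlock {                     // "class" replacement: plain struct + device functions
        const unsigned char *cur;
        const unsigned char *ref;
        int stride;

        __device__ int sad() const {      // member function compiled straight into the kernel
            int sum = 0;
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++) {
                    int d = cur[y * stride + x] - ref[y * stride + x];
                    sum += d < 0 ? -d : d;
                }
            return sum;
        }
    };

    __global__ void block_sad(const unsigned char *cur, const unsigned char *ref,
                              int stride, int nblocks, int *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nblocks) return;
        SadBlock b = { cur + i * 8, ref + i * 8, stride };   // 8x8 blocks laid out side by side
        out[i] = b.sad();                                    // one SAD per gpu thread
    }

    int main()
    {
        const int w = 64, nblocks = w / 8;                   // one row of 8x8 blocks
        unsigned char h_cur[64 * 8], h_ref[64 * 8];
        memset(h_cur, 100, sizeof(h_cur));
        memset(h_ref, 90, sizeof(h_ref));                    // every pixel differs by 10

        unsigned char *d_cur, *d_ref; int *d_out;
        cudaMalloc(&d_cur, sizeof(h_cur));
        cudaMalloc(&d_ref, sizeof(h_ref));
        cudaMalloc(&d_out, nblocks * sizeof(int));
        cudaMemcpy(d_cur, h_cur, sizeof(h_cur), cudaMemcpyHostToDevice);
        cudaMemcpy(d_ref, h_ref, sizeof(h_ref), cudaMemcpyHostToDevice);

        block_sad<<<1, 64>>>(d_cur, d_ref, w, nblocks, d_out);

        int h_out[8];
        cudaMemcpy(h_out, d_out, nblocks * sizeof(int), cudaMemcpyDeviceToHost);
        printf("sad of first 8x8 block: %d\n", h_out[0]);    // 8*8*10 = 640
        cudaFree(d_cur); cudaFree(d_ref); cudaFree(d_out);
        return 0;
    }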

    Where are your GPU encoded stream examples that PROVE you can do the same operations? The FACTS I have are the current stream samples that PROVE what I say is true for current encoders. I can point to features at the stream level that show why the quality sucks. I can emulate the low quality from GPU encodes by using similar settings with x264.
    you are confusing 2 very different issues. i am not disputing that gpu based h264 encoders produce lower quality; the thing that you can't seem to understand is that the limiting factor is not the hardware, it's that x86 programmers don't have the experience writing code for the gpu, big difference. if you take a risc programmer or a sparc programmer, someone that's been coding for those platforms exclusively for years, and ask him to write code for the x86 platform, the code is likewise going to be poor. same thing having an x86 programmer code for the ia64 architecture: it's programmer inexperience, not inferior hardware, that's at fault.

    Come on, make that encoder. Prove me wrong. I dare you to. In the scientific world, the onus is on those making the bold claims and theories to prove it, not on those who have established facts.
    all you have proven is that the less you know about a subject the more you like to argue a fallacious point.

    Ditto for you and video! You're trying to apply your knowledge of cpu and gpu to video which you know little about.
    my knowledge is in general programming, cpu and gpu architectures and data stream manipulation, i know that the arguments you have put forth as to why "it can't be done" don't hold water from a programming and architecture standpoint.

    Current GPU encoders have lower quality prediction and analysis. They skip out on using some features like CABAC and b-frames, and residuals are a lot worse. We just disagree on the "why". All the things you said should in theory make it possible. But where is that great GPU encoder?
    on the contrary, badaboom and media coder both use cabac, and media coder also uses b-frames. as to disagreeing on the why, you seem hell-bent on believing that it's because of an inherent fault within current gpu architectures; i know that it's because programmers don't quite have a handle on gpu computing just yet.

    as for when the great gpu encoder will finally be here, most likely never. as i said way earlier in this thread, IF intel actually ends up releasing that video transcoding driver, and it is in fact a driver in every sense of the word, then gpu acceleration via opencl or cuda will go the way of the dodo. and even if that driver is never released, and/or it's not a driver in the traditional sense but more like a plug-in or a standalone encoder, it's still a moot point: sandy bridge is on track to hit retail by this time next year, and once that happens i give nvidia less than 2 years to close up shop, and i think opencl will end up going nowhere.

    now, if the education environment were to change in this country and gpgpu programming classes started being offered within the associate's degree curriculum, i.e. in addition to needing to take c++ I&II, the various data structures and algorithm classes, the comp organization and assembler classes, etc., they also made the student take gpu programming I&II, then we would see a massive shift toward high quality gpu accelerated apps, but as i said...

    Modern ffmpeg builds use x264. Unless you're referring to some old build that gives worse quality than xvid.
    they most certainly do not, here's the copyright notice for the latest build of x264:

    /*****************************************************************************
    * x264: h264 encoder
    *****************************************************************************
    * Copyright (C) 2003 Laurent Aimar
    * $Id: encoder.c,v 1.1 2004/06/03 19:27:08 fenrir Exp $
    *
    * Authors: Laurent Aimar <fenrir@via.ecp.fr>
    *
    * This program is free software; you can redistribute it and/or modify
    * it under the terms of the GNU General Public License as published by
    * the Free Software Foundation; either version 2 of the License, or
    * (at your option) any later version.
    *
    * This program is distributed in the hope that it will be useful,
    * but WITHOUT ANY WARRANTY; without even the implied warranty of
    * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    * GNU General Public License for more details.
    *
    * You should have received a copy of the GNU General Public License
    * along with this program; if not, write to the Free Software
    * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111, USA.
    *****************************************************************************/

    and here's the copyright notice for libavc:

    /*
    * H.26L/H.264/AVC/JVT/14496-10/... encoder/decoder
    * Copyright (c) 2003 Michael Niedermayer <michaelni@gmx.at>
    *
    * This file is part of FFmpeg.
    *
    * FFmpeg is free software; you can redistribute it and/or
    * modify it under the terms of the GNU Lesser General Public
    * License as published by the Free Software Foundation; either
    * version 2.1 of the License, or (at your option) any later version.
    *
    * FFmpeg is distributed in the hope that it will be useful,
    * but WITHOUT ANY WARRANTY; without even the implied warranty of
    * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
    * Lesser General Public License for more details.
    *
    * You should have received a copy of the GNU Lesser General Public
    * License along with FFmpeg; if not, write to the Free Software
    * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
    */

    here are the preprocessor directives for x264:

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    #ifdef __WIN32__
    #include <windows.h>
    #define pthread_t HANDLE
    #define pthread_create(t,u,f,d) *(t)=CreateThread(NULL,0,f,d,0,NULL)
    #define pthread_join(t,s) { WaitForSingleObject(t,INFINITE); \
    CloseHandle(t); }
    #define HAVE_PTHREAD 1

    #elif defined(SYS_BEOS)
    #include <kernel/OS.h>
    #define pthread_t thread_id
    #define pthread_create(t,u,f,d) { *(t)=spawn_thread(f,"",10,d); \
    resume_thread(*(t)); }
    #define pthread_join(t,s) wait_for_thread(t,(long*)s)
    #define HAVE_PTHREAD 1

    #elif HAVE_PTHREAD
    #include <pthread.h>
    #endif

    #include "common/common.h"
    #include "common/cpu.h"

    #include "set.h"
    #include "analyse.h"
    #include "ratecontrol.h"
    #include "macroblock.h"

    #if VISUALIZE
    #include "common/visualize.h"
    #endif

    //#define DEBUG_MB_TYPE
    //#define DEBUG_DUMP_FRAME
    //#define DEBUG_BENCHMARK

    #ifdef DEBUG_BENCHMARK
    static int64_t i_mtime_encode_frame = 0;
    static int64_t i_mtime_analyse = 0;
    static int64_t i_mtime_encode = 0;
    static int64_t i_mtime_write = 0;
    static int64_t i_mtime_filter = 0;
    #define TIMER_START( d ) \
    { \
    int64_t d##start = x264_mdate();

    #define TIMER_STOP( d ) \
    d += x264_mdate() - d##start;\
    }
    #else
    #define TIMER_START( d )
    #define TIMER_STOP( d )
    #endif

    #define NALU_OVERHEAD 5 // startcode + NAL type costs 5 bytes per frame

    and here are the preprocessor directives for the h264 portion of libavc:

    /**
    * @file libavcodec/h264.c
    * H.264 / AVC / MPEG4 part10 codec.
    * @author Michael Niedermayer <michaelni@gmx.at>
    */

    #include "internal.h"
    #include "dsputil.h"
    #include "avcodec.h"
    #include "mpegvideo.h"
    #include "h264.h"
    #include "h264data.h"
    #include "h264_parser.h"
    #include "golomb.h"
    #include "mathops.h"
    #include "rectangle.h"
    #include "vdpau_internal.h"

    #include "cabac.h"
    #if ARCH_X86
    #include "x86/h264_i386.h"
    #endif

    //#undef NDEBUG
    #include <assert.h>

    /**
    * Value of Picture.reference when Picture is not a reference picture, but
    * is held for delayed output.
    */
    #define DELAYED_PIC_REF 4

    you don't need to be a programmer to see that the libavc code is much simpler, much cleaner and way more streamlined and completely different.

    note, this code comparison is from the latest snapshot of each encoder, 2 very different animals.
  12. Originally Posted by deadrats
    a thread is a thread is a thread, if you had any formal study in comp sci you would know this. 1 thread processed on a gpu is not equivalent speed-wise to it being processed on a cpu, it is much, much faster on the gpu. as for whether 20 thousand threads done on a gpu are 1666x faster than an i7 using 12 threads, i'll let you prove that to yourself: download the cinebench benchmark and run the software render and the hardware render benchmarks, where the same scene is rendered on the cpu and the gpu, and compare the results for yourself.
    My point is that you aren't using the *same* threads for the gpu encoder, i.e. you're not using the exact same code for the gpu as you are for the cpu.

    So why isn't badaboom 1666x faster than x264 encoding (not accounting for quality)? Is it possible there are bottlenecks and memory limitations (at least with the current software implementation)?

    Why are the current GPU-based encoders slower than CPU encoders at the same quality level, and don't even come close in top quality?
    what are you smoking and why don't you bring enough for all of us? the current gpu encoders are slower than cpu encoders?!? really? where did you buy your reality distortion field, and did you get a good deal on it?

    i just ran this test using the badaboom encoder: i took a 1080p wmv at 5 mb/s and encoded it to 1080p h264, 25 mb/s, main profile, level 4.1, cabac on, vbr with 128 kb/s ac3, with all processing, decode and encode, handled by the gpu, and i averaged 13-14 frames per second. i defy you to encode a file at 1080p at 25 mb/s, using any settings within x264, and achieve anywhere near that frame rate.
    Read again what I said carefully.
    Why are the current GPU-based encoders slower than CPU encoders at the same quality level, and don't even come close in top quality?
    Your results are pretty slow, but you confused the issue by including audio. What settings did you use, and what is your configuration?

    Use --preset "fast" or "veryfast". I get about 2x realtime on 1080p24 bluray source on an i7. You could even use "ultrafast", but that would drop below badaboom quality on most sources. These presets adjust the settings (like lower quality search algorithms, fewer reference frames, b-adapt 1, etc.) to match badaboom's quality.

    Quote:
    I want you to prove it. When you examine the current GPU encoder output streams, they suck for the very reasons I suggested earlier. Is it because none of the programmers have the GPU know-how? Are they all "chimps"? Are they all lazy? Even if they were, you should be able to write some code for a better GPU encoder right?
    first things first, why is it that i have to prove it? you're the one making all kinds of nonsensical claims in regards to gpu capabilities. second of all, yes, no current mainstream programmer at the moment has the experience writing general purpose code on the gpu. gpgpu is still in its infancy; as i have already pointed out, most universities don't even offer courses in gpu programming, and those that do only offer it as graduate level course work, it's not like you can have a guy go to devry and learn enough to code for a gpu.
    I completely agree with you that GPUs have great potential. I've been saying that from day 1 even before they were released. My concern is top quality and it's not there from the current crop of GPU encoders. If you or someone can leverage that potential, then that's what I'm looking for.

    If you can get the same quality, faster and at a lower bitrate isn't that appealing? A 25Mb/s encode using x264 might require 35-40Mb/s for the same quality using other encoders.
    not even close to true, x264 rapidly starts losing any advantage it has as the bit rate increases, and at 25 mb/s even mpeg-2 offers similar quality. as for low bit rate quality, quite frankly flix is much, much better; the best encodes i have ever seen at low bit rates were done with flix.
    x264 doesn't lose its advantage. To the average human eye, everything looks similar at high bitrates for most types of content, but the advantage is still maintained. If you were to do generational encoding for example, the MPEG2 encode would look worse each time. Plot the PSNR/bitrate graphs even up to 180Mb/s and the advantage is still there. MPEG2 never crosses the line, and never comes close. If you want the sources to test for yourself, let me know.

    OK now for Flix. Explain your observations. Do you mean flix pro as in vp6 or something else? What kind of testing have you done, and can you post sources/encodes etc., i.e. provide some evidence? If you don't want to do it, I'll do the testing process. I just need the sources and information. I simply don't believe these claims.

    Quote:
    Where are your FACTS? All you have spewed are unproven theories. Who says you can even do the same operation on the GPU at all? How do you get around memory limitations?
    have you been paying attention at all or are the x264 blinders on too tight?

    as i have already pointed out, cuda is basically C for nvidia gpus: if the compiler supports a feature, then the hardware supports the feature. cuda is well documented, and all the proof you want is in the tutorials and the cuda developer documentation. looking through the documentation, geforce 8 and later gpus support ALL features of ANSI C, with the exception of function pointers and object-oriented programming features.

    every other procedural programming feature is supported: structs, unions, pointers, all data types, preprocessor directives, shared libraries, integer math, floating point math, pushing and popping the stack, it's all supported.

    you can get around the lack of support for classes by using function definitions within a struct (which is basically what a class is); basically you just need to use a slightly different programming technique than you're used to.
    Well most programmers I've heard talk about it (not necessarily from x264) all moan how bad it is to program for cuda. The bottom line is nobody has put together a good GPU encoder yet. That's all I'm interested in at the end of the day. If it's due to lack of programming knowledge, that's entirely plausible. But if you're saying there are no limitations to what a GPU can do in terms of video encoding, I find that hard to believe.

    Where are your GPU encoded stream examples that PROVE you can do the same operations? The FACTS I have are the current stream samples that PROVE what I say is true for current encoders. I can point to features at the stream level that show why the quality sucks. I can emulate the low quality from GPU encodes by using similar settings with x264.
    you are confusing 2 very different issues. i am not disputing that gpu based h264 encoders produce lower quality; the thing that you can't seem to understand is that the limiting factor is not the hardware, it's that x86 programmers don't have the experience writing code for the gpu, big difference. if you take a risc programmer or a sparc programmer, someone that's been coding for those platforms exclusively for years, and ask him to write code for the x86 platform, the code is likewise going to be poor. same thing having an x86 programmer code for the ia64 architecture: it's programmer inexperience, not inferior hardware, that's at fault.
    Ok. All I'm saying is the current generation of GPU encoders has shown limitations and quality issues. If it was because of the writers' lack of experience and not (even partly) a hardware/cuda API limitation, then I can accept that. I posted some of the reasons why I thought it would be problematic for a GPU, but you said they were related to programming 100%. Is it such a big stretch of the imagination that there are hardware/architectural limitations? Other fields in scientific computing have similar limitations, e.g. F@H. I accept what you have to say, but if there were some great GPU encoder coming along that had all the features, configurability, and quality of x264 and yet encoded faster, I would find your comments even more convincing.

    Quote:
    Current GPU encoders have lower quality prediction and analysis. They skip out on using some features like CABAC and b-frames, and residuals are a lot worse. We just disagree on the "why". All the things you said should in theory make it possible. But where is that great GPU encoder?
    on the contrary, badaboom and media coder both use cabac, and media coder also uses b-frames. as to disagreeing on the why, you seem hell-bent on believing that it's because of an inherent fault within current gpu architectures; i know that it's because programmers don't quite have a handle on gpu computing just yet.
    I stand corrected. Badaboom did improve on their initial implementation and do offer "Main" and CABAC now, but they are still limited to 1 pass and don't offer "High". The PSNR graphs and encoding times and charts I posted in the other thread were done with the most recent version that added those features.

    I do believe I said earlier above that a GPU can only do what the programming tells it to do. I posted reasons why I thought there were issues with the architecture. Assuming there are no architectural limitations, why did they program a POS then? If it's only because programmers don't have a handle on it, when will they? Badaboom has been released for over a year, and been in development for 3 or 4. How long does it take to "learn"?

    as for when the great gpu encoder will finally be here, most likely never. as i said way earlier in this thread, IF intel actually ends up releasing that video transcoding driver, and it is in fact a driver in every sense of the word, then gpu acceleration via opencl or cuda will go the way of the dodo. and even if that driver is never released, and/or it's not a driver in the traditional sense but more like a plug-in or a standalone encoder, it's still a moot point: sandy bridge is on track to hit retail by this time next year, and once that happens i give nvidia less than 2 years to close up shop, and i think opencl will end up going nowhere.

    now, if the education environment were to change in this country and gpgpu programming classes started being offered within the associate's degree curriculum, i.e. in addition to needing to take c++ I&II, the various data structures and algorithm classes, the comp organization and assembler classes, etc., they also made the student take gpu programming I&II, then we would see a massive shift toward high quality gpu accelerated apps, but as i said...
    Too bad. I hope it picks up. The potential is there.

    Modern ffmpeg builds use x264. Unless you're referring to some old build that gives worse quality than xvid.
    they most certainly do not, here's the copyright notice for the latest build of x264:

    you don't need to be a programmer to see that the libavc code is much simpler, much cleaner and way more streamlined and completely different.

    note, this code comparison is from the latest snapshot of each encoder, 2 very different animals.
    It depends on what you compiled your ffmpeg build with. You can compile x264 with it, or download precompiled ones; most precompiled binaries have x264. If yours has different avc encoder code, it's probably the crappy one.

    Do some quick tests, because it might not be worth your time screwing around with it.
  13. Banned
    Join Date
    Nov 2005
    Location
    United States
    Search Comp PM
    Originally Posted by poisondeathray
    My point is that you aren't using the *same* threads for the gpu encoder, i.e. you're not using the exact same code for the gpu as you are for the cpu.

    So why isn't badaboom 1666x faster than x264 encoding (not accounting for quality)? Is it possible there are bottlenecks and memory limitations (at least with the current software implementation)?
    because the encoders are coded not to rely on the superior thread capabilities of the gpu but rather on the "brute force" capabilities of the gpu. as i said earlier, it would be relatively easy to code an encoder that breaks up a video stream into numerous little pieces, assigns each piece to a thread and processes all the pieces simultaneously (you could do this with a simple do/while loop, where each pass of the loop reads a segment of the stream and assigns a worker thread to it; you would still need to do thread maintenance to prevent locks and stalls and to allocate and deallocate memory, though on this last point i may be wrong, as the cuda documentation seems to suggest that the hardware is capable of doing this on its own). but it is much easier to simply write your code as you would on an x86 cpu and allow the vastly faster floating point unit found on the gpu to do the work.
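    to make it concrete, here's a rough sketch of the kind of loop i'm describing; the chunk size, the names and the throwaway row-sum "analysis" are all made up for illustration, this is not badaboom's or anybody else's actual code, just the shape of the idea: the host loop hands out one piece of the frame per pass, each piece gets its own stream, and all the pieces are in flight on the gpu at the same time.

    #include <cstdio>
    #include <cuda_runtime.h>

    // toy per-chunk kernel: each thread sums luma over one row of its chunk
    __global__ void analyse_chunk(const unsigned char *luma, int width,
                                  int rows_per_chunk, unsigned int *row_sums)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows_per_chunk) return;
        unsigned int sum = 0;
        for (int x = 0; x < width; x++)
            sum += luma[row * width + x];
        row_sums[row] = sum;
    }

    int main()
    {
        const int width = 1920, height = 1088, rows_per_chunk = 64;
        const int nchunks = height / rows_per_chunk;            // 17 chunks

        unsigned char *d_luma; unsigned int *d_sums;
        cudaMalloc(&d_luma, (size_t)width * height);
        cudaMalloc(&d_sums, (size_t)height * sizeof(unsigned int));
        cudaMemset(d_luma, 128, (size_t)width * height);        // stand-in for a decoded frame

        // the do/while from the post: hand out one chunk per pass, each on its
        // own stream, so the pieces are processed simultaneously
        cudaStream_t streams[32];
        int c = 0;
        do {
            cudaStreamCreate(&streams[c]);
            const unsigned char *chunk = d_luma + (size_t)c * rows_per_chunk * width;
            unsigned int *sums = d_sums + c * rows_per_chunk;
            analyse_chunk<<<(rows_per_chunk + 63) / 64, 64, 0, streams[c]>>>(
                chunk, width, rows_per_chunk, sums);
            c++;
        } while (c < nchunks);

        cudaDeviceSynchronize();
        for (int i = 0; i < nchunks; i++) cudaStreamDestroy(streams[i]);
        printf("launched %d chunk kernels\n", nchunks);
        cudaFree(d_luma); cudaFree(d_sums);
        return 0;
    }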

    when you compare the floating point capabilities of x86 cpus to the floating point capabilities of gpus, you find that the speed discrepancy between badaboom and software-based h264 encoders is a match for the differences in fp performance.

    basically, hardware encoders are using a brute-force tactic to gain their speed advantage instead of a more finessed approach, which is the category highly threaded code would fall into.

    Your results are pretty slow, but you confused the issue by including audio. What settings did you use, and what is your configuration?
    i already told you the settings, and i use an X4 620 coupled to a 9600 gso. it should be noted that this video card has 2 gpu cores, but the latest version of badaboom only makes use of one gpu core; my understanding is that the latest version of cuda allows for spreading the work across multiple gpus, but i'm almost 100% sure that badaboom hasn't been updated to support this yet.

    OK now for Flix. Explain your observations. Do you mean flix pro as in vp6 or something else? What kind of testing have you done, and can you post sources/encodes etc., i.e. provide some evidence? If you don't want to do it, I'll do the testing process. I just need the sources and information. I simply don't believe these claims.
    i can't post a source or sample, as the files i am referring to are adult in nature. as for what i mean by the quality: the encodes are clean, 100% free of noise and compression artifacts, extremely detailed and clear. when i check to see the writing application it says linux flixengine. the encodes are damn impressive; you can download a demo for windows, you just have to register first.

    Well most programmers I've heard talk about it (not necessarily from x264) all moan how bad it is to program for cuda. The bottom line is nobody has put together a good GPU encoder yet. That's all I'm interested in at the end of the day. If it's due to lack of programming knowledge, that's entirely plausible. But if you're saying there are no limitations to what a GPU can do in terms of video encoding, I find that hard to believe.
    welcome to dealing with programmers. i have met programmers that absolutely hate C, C++, FORTRAN, Pascal, Java, Visual BASIC, any BASIC, Perl, Python, ADA, COBOL...

    and that same programmer will absolutely love one of the above languages. i personally like Pascal more than C, and i know programmers that have said they would rather throw their computers out the window than use Pascal. programmers are people, and more importantly they are the biggest bitches you will ever meet, they will complain about everything.

    Assuming there are no architectural limitations, why did they program a POS then? If it's only because programmers don't have a handle on it, when will they? Badaboom has been released for over a year, and been in development for 3 or 4. How long does it take to "learn?"
    it's not that they programmed a piece of sh*t. as i already said, nvcuvenc, the nvidia h264 gpu encoder, was created by nvidia as a template and distributed with the cuda sdk. it wasn't meant to be a full-featured h264 encoder; programmers were supposed to look at it and use it as a guide, or even a starting point, for their own h264 encoder, not take it, barely modify it and include it in a final product. it's kind of like the ram disk driver that used to be offered for win 2k: it was meant only as a template, and it was limited to just 32 mb ram disks, but there were "professional" products that used it as a back end for their gui.

    the elemental developers, the people behind badaboom, also have an ulterior motive for keeping badaboom from being all it can be: they also make the rapidhd plug-in for adobe premiere, and that plug-in is expensive (it only comes bundled with a $2000 quadro fx card). they're never, ever going to sell a $30 app that runs on gaming gpus that is anywhere near the quality of their premium product, they would be crazy to. if you look at their rapidhd plug-in you will note that it offers more features:

    http://elementaltechnologies.com/products/accelerator/specs

    most reviews i have read on this plug-in seem to indicate that it's a pretty good product.

    so badaboom will always be sub-par. as for how long it takes to learn, it really depends on the programmer and how badly he/she wants to learn; there is a lot of inertia within the programming community against anything that's new. hell, COBOL is now what, 40 years old, and it's still the most widely used business programming language; operating systems are still coded in C, and that's at least 35 years old. it's just the way people are.

    It depends on what you compiled your ffmpeg build with. You can compile x264 with it, or download precompiled ones; most precompiled binaries have x264. If yours has different avc encoder code, it's probably the crappy one.

    Do some quick tests, because it might not be worth your time screwing around with it.
    i'm not exactly sure what you are talking about, ffmpeg uses libavc:

    http://ffmpeg.org/

    FFmpeg is a complete, cross-platform solution to record, convert and stream audio and video. It includes libavcodec - the leading audio/video codec library.
    if you download the source:

    http://ffmpeg.org/download.html

    you get the source to the various parts of libavc, including, and i didn't know this, the source for an open source vc-1 codec. but that's beside the point: looking through the libavc folder we see a C file and a header file named h264, but no mention of x264, so there is no way you are compiling ffmpeg with x264 support just from what is offered for download on the ffmpeg website.

    my guess is that people are jury-rigging ffmpeg to work with x264, but that's definitely not an official build.
  14. Originally Posted by deadrats
    when you compare the floating point capabilities of x86 cpus to the floating point capabilities of gpus, you find that the speed discrepancy between badaboom and software-based h264 encoders is a match for the differences in fp performance.

    basically, hardware encoders are using a brute-force tactic to gain their speed advantage instead of a more finessed approach, which is the category highly threaded code would fall into.
    Aren't video calculations primarily integer as opposed to FP? Could that contribute to why the GPU encoders aren't as fast as they could be?

    i can't post a source or sample as the files i am referring to are adult in nature,
    Do you have anything that is not adult?

    as for what i mean by the quality: the encodes are clean, 100% free of noise and compression artifacts, extremely detailed and clear. when i check to see the writing application it says linux flixengine. the encodes are damn impressive; you can download a demo for windows, you just have to register first.
    I still need more info. Linux flixengine is just the writing application. E.g., what does mediainfo say about it? I have serious doubts, but I'm willing to check anything that has potential.



    i'm not exactly sure what you are talking about, ffmpeg uses libavc:

    http://ffmpeg.org/
    I'm not sure what I'm talking about either. But lots of people use it on linux to access x264, and lots of people configure it for batch encoding for their own programs. I guess you need an svn build.

    http://rob.opendot.cl/index.php/useful-stuff/ffmpeg-x264-encoding-guide/
    http://sites.google.com/site/linuxencoding/x264-ffmpeg-mapping
    http://ubuntuforums.org/showthread.php?t=786095
  15. Member
    Join Date
    Nov 2002
    Location
    United States
    Search Comp PM
    Originally Posted by deadrats
    Originally Posted by ocgw
    ooowowwwwww my head hurts
    i was under the impression that poison and i had scared everyone away, guess i was wrong, stay tuned, we're still not done.
    You should start a new thread or just IM each other. It's not very nice to hijack someone else's thread and talk about something that has nothing to do with the thread topic. I'm surprised that the mods have let this go on for this long.
  16. contrarian rallynavvie's Avatar
    Join Date
    Sep 2002
    Location
    Minnesotan in Texas
    Search Comp PM
    Originally Posted by DarrellS
    You should start a new thread or just IM each other. It's not very nice to hijack someone else's thread and talk about something that has nothing to do with the thread topic. I'm surprised that the mods have let this go on for this long.
    Actually it would have been nice to have this as a separate topic just so that it'd be easier to find for future reference to this subject matter. In between the ad hominem there is some great information.
    FB-DIMM are the real cause of global warming
  17. Originally Posted by DarrellS
    You should start a new thread or just IM each other. It's not very nice to hijack someone else's thread and talk about something that has nothing to do with the thread topic. I'm surprised that the mods have let this go on for this long.
    Well I apologized to the op, Engineering, early on in the thread. He actually started a new one, and got all his questions answered either in that thread or through PM
    https://forum.videohelp.com/topic376999.html

    Come on, which thread ever stays on topic? This excursion began very related to the original topic, as it discussed what types of video cards were suitable for NLE editing like the upcoming CS5, acceleration of video encoding, and the future outlook for purchasing decisions, but it got derailed. Maybe a mod can split it somewhere or append the 1st bit to the other hardware thread that Engineering had.
  18. Banned
    Join Date
    Nov 2005
    Location
    United States
    Search Comp PM
    Originally Posted by poisondeathray
    Aren't video calculations primarily integer as opposed to FP? Could that contribute to why the GPU encoders aren't as fast as they could be?
    it's funny you should mention that, as i remember reading a review of the penryns when they first came out, and the reviewer said something along the lines of: the reason the quad core penryn was so much faster than the amd cpus it was compared to was because of the penryn's superior integer performance, which struck me as extremely odd.

    many calculations, like dct and idct, are pure floating point calculations, but the proof is in the code: look through the x264 code to see what kind of data types are declared. if the computations were primarily integer based you would expect to see "int" declared; if they are floating point calculations you would see "float" or "double". looking through the x264 source there seem to be more int declarations in general, but that is a bit misleading, as analyzing the code shows that many of those int declarations are for functions that return a 1 or 0. there are about a dozen or so float and double declarations, and overall the code seems to indicate that it's a mix of integer and floating point calculations.

    this however is not necessarily a fatal blow to gpu acceleration: if you were structuring the code to run on a gpu, knowing that a gpu is a brute force floating point machine, you could simply declare the "int" data as "float", which would cause the compiler to treat all those integer calculations as floating point calculations and thus have them run on the gpu's floating point unit. this is a very minor modification to the code that could be implemented in less than an hour (even accounting for "breaking" something as you re-declare the data types).
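    to show the kind of type swap i'm talking about, here's a toy 4-point transform written against a single typedef (coef_t and dct4_1d are names made up for this example, it's not x264 code); flip the one typedef line and all of the math moves from integer to floating point with no other changes:

    #include <stdio.h>

    /* typedef int coef_t; */      /* integer version                        */
    typedef float coef_t;          /* one-line change: run the math in fp    */

    static void dct4_1d(coef_t d[4])
    {
        /* simple 4-point butterfly, identical source either way */
        coef_t s0 = d[0] + d[3], s1 = d[1] + d[2];
        coef_t t0 = d[0] - d[3], t1 = d[1] - d[2];
        d[0] = s0 + s1;
        d[2] = s0 - s1;
        d[1] = 2 * t0 + t1;
        d[3] = t0 - 2 * t1;
    }

    int main(void)
    {
        coef_t row[4] = { 10, 20, 30, 40 };
        dct4_1d(row);
        printf("%.1f %.1f %.1f %.1f\n",
               (double)row[0], (double)row[1], (double)row[2], (double)row[3]);
        return 0;
    }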

    Do you have anything that is not adult?
    let's just say that if porn was ever declared illegal and someone were to search my hard drives, they would find enough content to put me away for the next 500+ years. i think i may have a problem, but it's a problem i like.

    there is one more thing i wanted to address with regards to the memory objections you had raised. i know you based that on what the x264 developers said, because i was reading through that "diary of an x264 developer" and they said almost exactly the same thing, but the more i thought about it the less sense it made, and here's why:

    when you run a software based encoder you only have access to system ram; while i may have 4 gigs of ram, i would think that about 2 gigs is common for most users. a gpu has its own frame buffer, and modern gpus, even low end ones, have as much as 512 mb of ram; my 9600 gso has 768 mb, and the really high end ones have as much as 1.5 to 2 gigs of ram. and since the advent of agp cards, it is possible to flush just parts of the frame buffer to main memory (in the pci days it was all or nothing: first you filled up the card's memory and then you could make use of system ram; now it can be any combination), so a gpu encoder has access to system ram + video card ram.

    i also ran a quick test, using avidemux as a front end and encoding with x264 using 4 threads, 8 threads and 12 threads, just to see how much ram is used. with 4 threads, encoding to 720x480 at 3 mb/s, avidemux used about 170 mb of ram; with 8 threads it used 185 mb of ram; with 12 threads we hit 200 mb of ram usage. so extrapolating the expected ram usage, every additional 4 threads requires roughly 15 extra mb of ram; thus starting from 12 threads and 200 mb of ram and adding another 40 threads, we would need roughly 350 mb of ram to run 52 threads under x264, which is well within the capabilities of any mid range card. we also need to remember that graphics cards, with the exception of the low end cheap variants, use much faster ram than system ram: top of the line cards use gddr5 while top of the line desktops use ddr3.
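    spelling the extrapolation out (the measured points are from my avidemux test above; the straight-line fit is just my back of the envelope math):

    #include <stdio.h>

    int main(void)
    {
        /* measured: 4 threads -> ~170 mb, 8 -> ~185 mb, 12 -> ~200 mb */
        int base_threads = 12, base_mb = 200;
        int mb_per_4_threads = 15;

        int threads = 52;
        int est_mb = base_mb + (threads - base_threads) / 4 * mb_per_4_threads;
        printf("estimated ram for %d x264 threads: ~%d mb\n", threads, est_mb);
        /* prints ~350 mb -- comfortably inside a mid range card's 512-768 mb */
        return 0;
    }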

    i also must congratulate you, you proved to be an excellent master debater ( :P ) i hope this thread was as enjoyable for those reading it as it was for me.
  19. Banned
    Join Date
    Nov 2005
    Location
    United States
    Search Comp PM
    Originally Posted by rallynavvie
    Actually it would have been nice to have this as a separate topic just so that it'd be easier to find for future reference to this subject matter. In between the ad hominem there is some great information.
    ad hominem, really? i thought we were quite civil to one another. it was a debate, an exchange of information and ideas that, while it may have gotten a bit warm, never got heated.

    i have no problem with poison, he strikes me as a decent guy. he has certainly helped me, and many others, in the past, and he certainly is knowledgeable as far as video is concerned. it's just that he seems to have believed the excuses the x264 developers have made with regards to gpu acceleration as God's Own Truth, and the sad truth of the matter is that from a coding standpoint they just don't hold water. the only valid obstacle is that they chose to use a programming technique that is currently unsupported by nvidia gpu architectures (i can't seem to find any documentation to see whether ati hardware also doesn't support function pointers), but that reality is a far cry from "gpus suck for video encoding".
  20. I'm a Super Moderator johns0's Avatar
    Join Date
    Jun 2002
    Location
    canada
    Search Comp PM
    My eyes have started to bleed.
    I think,therefore i am a hamster.


