mr poisondeathray, that was a lucid, intelligent, well thought-out objection. OVERRULED!!!
i had originally written out a long rebuttal of everything you said, but instead decided to supply some code to prove my point, note i stole all this code from nvidia themselves, it's out there if anyone wants to go looking for it:
code to decode a video stream on a gpu:
*******************************
#include "VideoDecoder.h"
#include "FrameQueue.h"
#include <cstring>
#include <cassert>
#include <string>
VideoDecoder::VideoDecoder(const CUVIDEOFORMAT & rVideoFormat,
                           CUcontext & rContext,
                           cudaVideoCreateFlags eCreateFlags,
                           CUvideoctxlock & ctx)
    : m_CtxLock(ctx)
{
    // get a copy of the CUDA context
    m_Context = rContext;
    m_VideoCreateFlags = eCreateFlags;

    printf("> VideoDecoder::cudaVideoCreateFlags = <%d>", (int)eCreateFlags);
    switch (eCreateFlags) {
        case cudaVideoCreate_Default:    printf("Default (VP)\n"); break;
        case cudaVideoCreate_PreferCUDA: printf("Use CUDA decoder\n"); break;
        case cudaVideoCreate_PreferDXVA: printf("Use DXVA decoder\n"); break;
        default:                         printf("Unknown value\n"); break;
    }

    // Validate video format. Currently only a subset is
    // supported via the cuvid API.
    cudaVideoCodec eCodec = rVideoFormat.codec;
    assert(cudaVideoCodec_MPEG1 == eCodec || cudaVideoCodec_MPEG2 == eCodec ||
           cudaVideoCodec_VC1 == eCodec || cudaVideoCodec_H264 == eCodec);
    assert(cudaVideoChromaFormat_420 == rVideoFormat.chroma_format);

    // Fill the decoder-create-info struct from the given video-format struct.
    memset(&oVideoDecodeCreateInfo_, 0, sizeof(CUVIDDECODECREATEINFO));

    // Create video decoder
    oVideoDecodeCreateInfo_.CodecType = rVideoFormat.codec;
    oVideoDecodeCreateInfo_.ulWidth = rVideoFormat.coded_width;
    oVideoDecodeCreateInfo_.ulHeight = rVideoFormat.coded_height;
    oVideoDecodeCreateInfo_.ulNumDecodeSurfaces = FrameQueue::cnMaximumSize;

    // Limit decode memory to 24MB (16M pixels at 4:2:0 = 24M bytes)
    while (oVideoDecodeCreateInfo_.ulNumDecodeSurfaces * rVideoFormat.coded_width * rVideoFormat.coded_height > 16*1024*1024)
    {
        oVideoDecodeCreateInfo_.ulNumDecodeSurfaces--;
    }

    oVideoDecodeCreateInfo_.ChromaFormat = rVideoFormat.chroma_format;
    oVideoDecodeCreateInfo_.OutputFormat = cudaVideoSurfaceFormat_NV12;
    oVideoDecodeCreateInfo_.DeinterlaceMode = cudaVideoDeinterlaceMode_Adaptive;

    // No scaling
    oVideoDecodeCreateInfo_.ulTargetWidth = oVideoDecodeCreateInfo_.ulWidth;
    oVideoDecodeCreateInfo_.ulTargetHeight = oVideoDecodeCreateInfo_.ulHeight;
    oVideoDecodeCreateInfo_.ulNumOutputSurfaces = 2;  // We won't simultaneously map more than 2 surfaces
    oVideoDecodeCreateInfo_.ulCreationFlags = m_VideoCreateFlags;
    oVideoDecodeCreateInfo_.vidLock = ctx;

    // create the decoder
    CUresult oResult = cuvidCreateDecoder(&oDecoder_, &oVideoDecodeCreateInfo_);
    assert(CUDA_SUCCESS == oResult);
}

VideoDecoder::~VideoDecoder()
{
    cuvidDestroyDecoder(oDecoder_);
}

cudaVideoCodec
VideoDecoder::codec() const
{
    return oVideoDecodeCreateInfo_.CodecType;
}

cudaVideoChromaFormat
VideoDecoder::chromaFormat() const
{
    return oVideoDecodeCreateInfo_.ChromaFormat;
}

unsigned long
VideoDecoder::maxDecodeSurfaces() const
{
    return oVideoDecodeCreateInfo_.ulNumDecodeSurfaces;
}

unsigned long
VideoDecoder::frameWidth() const
{
    return oVideoDecodeCreateInfo_.ulWidth;
}

unsigned long
VideoDecoder::frameHeight() const
{
    return oVideoDecodeCreateInfo_.ulHeight;
}

unsigned long
VideoDecoder::targetWidth() const
{
    return oVideoDecodeCreateInfo_.ulTargetWidth;
}

unsigned long
VideoDecoder::targetHeight() const
{
    return oVideoDecodeCreateInfo_.ulTargetHeight;
}

void
VideoDecoder::decodePicture(CUVIDPICPARAMS * pPictureParameters)
{
    CUresult oResult = cuvidDecodePicture(oDecoder_, pPictureParameters);
    assert(CUDA_SUCCESS == oResult);
}

void
VideoDecoder::mapFrame(int iPictureIndex, CUdeviceptr * ppDevice, unsigned int * pPitch,
                       CUVIDPROCPARAMS * pVideoProcessingParameters)
{
    CUresult oResult = cuvidMapVideoFrame(oDecoder_,
                                          iPictureIndex,
                                          ppDevice,
                                          pPitch,
                                          pVideoProcessingParameters);
    assert(CUDA_SUCCESS == oResult);
    assert(0 != *ppDevice);
    assert(0 != *pPitch);
}

void
VideoDecoder::unmapFrame(CUdeviceptr pDevice)
{
    CUresult oResult = cuvidUnmapVideoFrame(oDecoder_, pDevice);
    //assert(CUDA_SUCCESS == oResult);
}
***********************************
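to give you an idea of how that wrapper gets used once the parser hands you a decoded picture, here's a little driver function - this one is mine, not nvidia's, just a sketch of the map / process / unmap flow using the methods above:
*******************************
void processDecodedFrame(VideoDecoder & rDecoder, int iPictureIndex)
{
    CUVIDPROCPARAMS oProcParams;
    memset(&oProcParams, 0, sizeof(CUVIDPROCPARAMS));
    oProcParams.progressive_frame = 1;

    CUdeviceptr pDecodedFrame = 0;
    unsigned int nPitch = 0;

    // map the decoded NV12 surface into CUDA address space...
    rDecoder.mapFrame(iPictureIndex, &pDecodedFrame, &nPitch, &oProcParams);

    // ...this is where you would run kernels on pDecodedFrame or copy it out...

    // ...then unmap it so the decoder can recycle the surface
    rDecoder.unmapFrame(pDecodedFrame);
}
*******************************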
here's how you integrate a cuda function into an existing c++ app:
**********************************************
/* Example of integrating CUDA functions into an existing
* application / framework.
* CPP code representing the existing application / framework.
* Compiled with default CPP compiler.
*/
// includes, system
#include <iostream>
#include <stdlib.h>
// Required to include CUDA vector types
#include <vector_types.h>
#include "cutil_inline.h"
////////////////////////////////////////////////////////////////////////////////
// declaration, forward
extern "C" void runTest(const int argc, const char** argv,
char* data, int2* data_int2, unsigned int len);
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char** argv)
{
// input data
int len = 16;
// the data has some zero padding at the end so that the size is a multiple of
// four, this simplifies the processing as each thread can process four
// elements (which is necessary to avoid bank conflicts) but no branching is
// necessary to avoid out of bounds reads
char str[] = { 82, 111, 118, 118, 121, 42, 97, 121, 124, 118, 110, 56,
10, 10, 10, 10};
// Use int2 showing that CUDA vector types can be used in cpp code
int2 i2[16];
for( int i = 0; i < len; i++ )
{
i2[i].x = str[i];
i2[i].y = 10;
}
// run the device part of the program
runTest(argc, (const char**)argv, str, i2, len);
std::cout << str << std::endl;
for( int i = 0; i < len; i++ )
{
std::cout << (char)(i2[i].x);
}
std::cout << std::endl;
cutilExit(argc, argv);
}
****************************
here's another sample:
******************************
/* Example of integrating CUDA functions into an existing
* application / framework.
* Reference solution computation.
*/
// Required header to support CUDA vector types
#include <vector_types.h>
////////////////////////////////////////////////////////////////////////////////
// export C interface
extern "C"
void computeGold(char* reference, char* idata, const unsigned int len);
extern "C"
void computeGold2(int2* reference, int2* idata, const unsigned int len);
////////////////////////////////////////////////////////////////////////////////
//! Compute reference data set
//! Each element is multiplied with the number of threads / array length
//! @param reference reference data, computed but preallocated
//! @param idata input data as provided to device
//! @param len number of elements in reference / idata
////////////////////////////////////////////////////////////////////////////////
void
computeGold(char* reference, char* idata, const unsigned int len)
{
for(unsigned int i = 0; i < len; ++i)
reference[i] = idata[i] - 10;
}
////////////////////////////////////////////////////////////////////////////////
//! Compute reference data set for int2 version
//! Each element is multiplied with the number of threads / array length
//! @param reference reference data, computed but preallocated
//! @param idata input data as provided to device
//! @param len number of elements in reference / idata
////////////////////////////////////////////////////////////////////////////////
void
computeGold2(int2* reference, int2* idata, const unsigned int len)
{
for(unsigned int i = 0; i < len; ++i)
{
reference[i].x = idata[i].x - idata[i].y;
reference[i].y = idata[i].y;
}
}
**************************************************
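the piece nvidia compiles with nvcc (the .cu file that actually defines runTest) isn't pasted above, so here's a rough sketch of what it does - the kernel names and launch configuration are mine, but the math mirrors the computeGold reference functions above, each element gets its "+10 encoding" undone:
******************************
#include <cuda_runtime.h>
#include <vector_types.h>

__global__ void decodeKernel(char * data)
{
    int idx = threadIdx.x;
    data[idx] -= 10;                // same operation computeGold checks
}

__global__ void decodeKernel2(int2 * data)
{
    int idx = threadIdx.x;
    data[idx].x -= data[idx].y;     // same operation computeGold2 checks
}

extern "C" void runTest(const int argc, const char ** argv,
                        char * data, int2 * data_int2, unsigned int len)
{
    // argc/argv are unused in this sketch
    char * d_data = 0;
    int2 * d_data_int2 = 0;

    // copy the host arrays to the device, run one thread per element, copy back
    cudaMalloc((void **)&d_data, len);
    cudaMalloc((void **)&d_data_int2, len * sizeof(int2));
    cudaMemcpy(d_data, data, len, cudaMemcpyHostToDevice);
    cudaMemcpy(d_data_int2, data_int2, len * sizeof(int2), cudaMemcpyHostToDevice);

    decodeKernel<<<1, len>>>(d_data);
    decodeKernel2<<<1, len>>>(d_data_int2);

    cudaMemcpy(data, d_data, len, cudaMemcpyDeviceToHost);
    cudaMemcpy(data_int2, d_data_int2, len * sizeof(int2), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data_int2);
}
******************************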
note that these code samples show the developers how to decode vc-1, h264, mpeg-1 and mpeg-2 on an nvidia gpu, how to integrate this gpu accelerated decoder into x264 (they would have to remove their own decoder, obviously), and how to integrate cuda functions into their own app.
note, x264 is a codec, and by definition a codec is a COmpressor and a DECompressor. i have just gpu accelerated the DECompressor portion of x264 and shown how to integrate it with the COmpressor portion of x264, with all of 5 minutes of searching through nvidia's website, using standard c++ code.
tell me again what great programmers the x264 developers are.
stay tuned, i'm about to show how to gpu accelerate significant portions of their COmpressor as well, give me a few days to get my hands dirty.
edit:
here's how to parse video on a gpu:
*********************************
#ifndef NV_VIDEO_PARSER
#define NV_VIDEO_PARSER
#include <cuvid/nvcuvid.h>
#include <iostream>
class FrameQueue;
class VideoDecoder;
// Wrapper class around the CUDA video-parser API.
// The video parser consumes a video-data stream and parses it into
// a) Sequences: Whenever a new sequence or initial sequence header is found
// in the video stream, the parser calls its sequence-handling callback
// function.
// b) Decode segments: Whenever a completed frame or half-frame is found
// the parser calls its picture decode callback.
// c) Display: Whenever a complete frame was decoded, the parser calls the
// display picture callback.
//
class VideoParser
{
public:
// Constructor.
//
// Parameters:
// pVideoDecoder - pointer to valid VideoDecoder object. This VideoDecoder
// is used in the parser-callbacks to decode video-frames.
// pFrameQueue - pointer to a valid FrameQueue object. The FrameQueue is used
// by the parser-callbacks to store decoded frames in it.
VideoParser(VideoDecoder * pVideoDecoder, FrameQueue * pFrameQueue);
private:
// Struct containing user-data to be passed by parser-callbacks.
struct VideoParserData
{
VideoDecoder * pVideoDecoder;
FrameQueue * pFrameQueue;
};
// Default constructor. Don't implement.
explicit
VideoParser();
// Copy constructor. Don't implement.
VideoParser(const VideoParser & );
// Assignment operator. Don't implement.
void
operator= (const VideoParser & );
// Called when the decoder encounters a video format change (or initial sequence header)
// This particular implementation of the callback returns 0 in case the video format changes
// to something different than the original format. Returning 0 causes a stop of the app.
static
int
CUDAAPI
HandleVideoSequence(void * pUserData, CUVIDEOFORMAT * pFormat);
// Called by the video parser to decode a single picture
// Since the parser will deliver data as fast as it can, we need to make sure that the picture
// index we're attempting to use for decode is no longer used for display
static
int
CUDAAPI
HandlePictureDecode(void * pUserData, CUVIDPICPARAMS * pPicParams);
// Called by the video parser to display a video frame (in the case of field pictures, there may be
// 2 decode calls per 1 display call, since two fields make up one frame)
static
int
CUDAAPI
HandlePictureDisplay(void * pUserData, CUVIDPARSERDISPINFO * pPicParams);
VideoParserData oParserData_; // instance of the user-data we have passed into the parser-callbacks.
CUvideoparser hParser_; // handle to the CUDA video-parser
friend class VideoSource;
};
std::ostream &
operator << (std::ostream & rOutputStream, const CUVIDPARSERDISPINFO & rParserDisplayInfo);
#endif // NV_VIDEO_PARSER
**********************************************
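the matching VideoParser.cpp isn't pasted here, but the interesting part is tiny: the constructor fills in a CUVIDPARSERPARAMS struct and registers the three static callbacks declared above. this is a from-memory sketch of that constructor, not a verbatim copy:
**********************************************
VideoParser::VideoParser(VideoDecoder * pVideoDecoder, FrameQueue * pFrameQueue)
{
    // stash the pointers the callbacks will need
    oParserData_.pVideoDecoder = pVideoDecoder;
    oParserData_.pFrameQueue   = pFrameQueue;

    CUVIDPARSERPARAMS oVideoParserParameters;
    memset(&oVideoParserParameters, 0, sizeof(CUVIDPARSERPARAMS));
    oVideoParserParameters.CodecType              = pVideoDecoder->codec();
    oVideoParserParameters.ulMaxNumDecodeSurfaces = pVideoDecoder->maxDecodeSurfaces();
    oVideoParserParameters.pUserData              = &oParserData_;        // handed back to every callback
    oVideoParserParameters.pfnSequenceCallback    = HandleVideoSequence;  // new sequence / format change
    oVideoParserParameters.pfnDecodePicture       = HandlePictureDecode;  // one picture ready to decode
    oVideoParserParameters.pfnDisplayPicture      = HandlePictureDisplay; // one frame ready for display

    CUresult oResult = cuvidCreateVideoParser(&hParser_, &oVideoParserParameters);
    assert(CUDA_SUCCESS == oResult);
}
**********************************************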
here's some code for multithreading:
**************************************
#include <multithreading.h>
#if _WIN32
//Create thread
CUTThread cutStartThread(CUT_THREADROUTINE func, void *data){
return CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)func, data, 0, NULL);
}
//Wait for thread to finish
void cutEndThread(CUTThread thread){
WaitForSingleObject(thread, INFINITE);
CloseHandle(thread);
}
//Destroy thread
void cutDestroyThread(CUTThread thread){
TerminateThread(thread, 0);
CloseHandle(thread);
}
//Wait for multiple threads
void cutWaitForThreads(const CUTThread * threads, int num){
WaitForMultipleObjects(num, threads, true, INFINITE);
for(int i = 0; i < num; i++)
CloseHandle(threads[i]);
}
#else
//Create thread
CUTThread cutStartThread(CUT_THREADROUTINE func, void * data){
pthread_t thread;
pthread_create(&thread, NULL, func, data);
return thread;
}
//Wait for thread to finish
void cutEndThread(CUTThread thread){
pthread_join(thread, NULL);
}
//Destroy thread
void cutDestroyThread(CUTThread thread){
pthread_cancel(thread);
}
//Wait for multiple threads
void cutWaitForThreads(const CUTThread * threads, int num){
for(int i = 0; i < num; i++)
cutEndThread(threads[i]);
}
#endif
********************************************
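and here's a trivial example of mine showing how you'd drive those helpers - the encodeGop worker is a stand-in, but the start/wait pattern is exactly what you'd use to chew through gop's in parallel:
**************************************
#include <multithreading.h>
#include <stdio.h>

// stand-in worker: a real one would encode the GOP it was handed
void * encodeGop(void * data)
{
    int iGop = *(int *)data;
    printf("worker encoding GOP %d\n", iGop);
    return 0;
}

int main()
{
    const int nGops = 4;
    CUTThread threads[nGops];
    int gopIds[nGops];

    for (int i = 0; i < nGops; i++) {
        gopIds[i] = i;
        threads[i] = cutStartThread((CUT_THREADROUTINE)encodeGop, &gopIds[i]);
    }

    cutWaitForThreads(threads, nGops);  // join and clean up every worker
    return 0;
}
**************************************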
who's your daddy?
-
Awesome. I think? It looks like alien to me.
Any early testing yet, or still very beta?
Decoding can already be done with DGNVIndex , but nothing on the encoding side for x264 and cuda, that I am aware of
You can find the developers at IRC: irc://irc.freenode.net/x264 , Doom9, or Doom10 if you want them to help debug
They welcome patches and development from everyone (of course they only commit after close scrutiny and testing)
If this pans out, you will be "that guy" who succeeded where many have failed -
Originally Posted by poisondeathray
cuda represents everything they hate: a proprietary framework, proprietary hardware, proprietary software, closed source drivers. they would never port their codec to run on a limited range of hardware; by porting x264 to cuda they would effectively be limiting the number of people that could use it to just owners of nvidia based geforce 8 and above video cards.
i should have seen it before, it's not technical obstacles that are preventing them from porting it to cuda, it's philosophical obstacles. my guess is that they will wait until open cl is fully mature and then modify their code to run within that open source framework. -
Originally Posted by deadrats
From a layman's functional viewpoint, how do you get around the issues in my post above?
Shall I ask one of the developers to review your code? or even Donald Graft, who wrote DGNVIndex and worked with Nvidia's team?
Some early work has been put into OpenCl; there were a couple grad students thinking of it, but they pretty much ran into the same issues. -
Originally Posted by poisondeathray
in layman's terms:
so there's no way you can explain how to integrate x264 with cuda
i have already done so, in fact even provided code samples on the "how to", you just have a bizarre infatuation with x264 that borders on the unhealthy, you wouldn't happen to be a closet developer, would you? if you are, i urge you to come out of the closet, it will make you feel better, i promise not to mock you...too much.
btw, it's not "integrate x264 with cuda", it's "port x264 to cuda", big difference.
Name something that gives you better results - free or paid.
CC-HDe and blu-code; with interlaced content, main concept; and at blu-ray level bit rates main concept, procoder, and apple's h264, as well as vc-1 and mpeg-2. (i'm sure that should light a fire under you).
Video consists of I-frames, P-frames, and B-frames , and long GOP formats use both Intra-frame and Inter-frame predictive compression techniques. They code differences between frames. Video uses "frames", but you don't process individual frames, modern encoders look at macroblocks. And these macroblocks, when using h.264 can be 16x16, 8x8, 4x4. You use all these words, but I don't think you understand what they really mean. If you understood these basic concepts, you would understand why this x264 or any decent modern encoder won't work with cuda.
when you say things like this you just leave me in stitches, you really do. yes, encoders look at macroblocks, and said macroblocks can be chained together to form slices but said macroblocks and slices reside within, wait for it, frames. furthermore I-frames are coded without reference to any other frame and are always intra-coded, thus it's very valid to talk about processing chunks of gop on separate threads.
your "objections" are meaningless within the context of porting code to run a gpu, a gpu was designed to work with individual pixels, and render thousands of pixels at the same time, why would you believe that it would be incapable of handling a group of pixels that we could refer to as a macroblocks and bigger groups of pixels that we can refer to as a slice and a bigger collection of pixels that we can refer to as a frame?
when i "frame" the problem in the context of the ability to manipulate pixels, don't the supposed obstacles to porting x264 to cuda you have put forth seem downright foolish and misinformed? mind you, i in no way blame you, you clearly have drank the x264 kool aid, you took what those idiot developers said as being accurate and based in reality and now you simply parrot their party line.
but once you look at it from the frame of reference of a gpu being able to work on individual pixels the thought that a gpu can't work on macroblocks, slices or individual frames because "the encoder is too advanced and complicated" gets exposed as the malformed collection of chemically produced electrical signals it is.
You would have to "dumb down" x264 in order for it to work with cuda. With efficient encoders such as x264, there is dynamic frametype placement, and variable GOP sizes. e.g. a "whip pan" might place 10 I-frames in a row, but a slow pan or static shot might be 300 frames long between keyframes. So you cannot know ahead of time how to divide up your video and spawn "x" number of threads according to how many "units" your GPU has available - you don't know where or what the GOP's look like ahead of time. This becomes an allocation issue as you have idle units, and extra resources have to be wasted on allocation and optimization.
no, you would have to smarten up the developers coding it. have you ever heard of 2 pass encoding? how about variable bit rate encoding? do you know what an analysis pass is? did you get the above from the developers as one of their reasons why they can't port x264 to cuda or did you make that up yourself?
here's how you get around this objection: either code your gpu accelerated encoder to perform a quick analysis pass so that it knows at what points it can segment the file, or launch an analysis thread that runs 10-20 seconds ahead of the fastest worker thread. as the analysis thread reaches the end of each gop it launches a worker thread and assigns that last analyzed segment to that thread for encoding, and you keep doing that until all gop's are being processed on separate threads. you will also need another thread for housekeeping, to kill each worker thread as it finishes its job and concatenate the results to the output file. any programmer that knows how to manipulate streams knows how to read ahead and analyze one.
every single one of your objections with regard to this process or that process being serial in nature is easily countered by the above technique. it works, it exists and it's easy to implement; any programmer with any formal training knows how to do it. the objections are silly on their face and absurd from a programming standpoint, nothing more than hollow excuses as to why "it can't be done".
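to make that concrete, here's a bare-bones sketch of the scheme. all the names are mine and the "analysis" is faked with canned gop lengths, it's only meant to show the shape of it: an analysis loop runs ahead finding gop boundaries, hands each finished segment to a worker thread, and the housekeeping at the end joins the workers in order so the results can be concatenated to the output file:
*********************************
#include <pthread.h>
#include <stdio.h>

#define MAX_GOPS 8

typedef struct { int start; int end; } GopSegment;

static GopSegment segments[MAX_GOPS];

/* worker: encodes one already-analyzed segment */
static void * encodeSegment(void * arg)
{
    GopSegment * s = (GopSegment *)arg;
    printf("encoding frames %d-%d\n", s->start, s->end);  /* real encode goes here */
    return NULL;
}

int main(void)
{
    pthread_t workers[MAX_GOPS];
    int frame = 0;
    int n;

    /* "analysis pass": walk the stream ahead of the workers; every time a
       gop boundary is found, hand that segment to a fresh worker thread */
    for (n = 0; n < MAX_GOPS; n++) {
        int gopLen = 24 + (n % 3) * 12;      /* stand-in for real analysis */
        segments[n].start = frame;
        segments[n].end   = frame + gopLen - 1;
        pthread_create(&workers[n], NULL, encodeSegment, &segments[n]);
        frame += gopLen;
    }

    /* the housekeeping thread's job in the description above: reap each
       worker in order, then concatenate the segments to the output file */
    for (n = 0; n < MAX_GOPS; n++)
        pthread_join(workers[n], NULL);
    return 0;
}
*********************************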
i pored over the code for the x264 encoder and the most striking thing is how similar it is in structure to the xvid encoder code. they both have an overabundance of pointers, and in fact that is the only obstacle i can see from a programming standpoint to porting x264, or xvid for that matter, to cuda: the extensive use of function pointers. nvidia gpu's do not support function pointers; they support pointers, but not function pointers. i have no idea why those 2 mpeg-4 codecs both make such extensive use of function pointers, but the use is so widespread throughout the code that i believe it is insurmountable, you would need to recode all of x264 and xvid from the ground up sans function pointers.
interestingly enough, i took a look through the ffmpeg source and it doesn't appear to use them to any great extent; in fact, as far as structure is concerned it's reasonably similar to the nvidia h264 encoder, which leads me to believe that it would be the best candidate for porting to cuda, which jibes with the fact that of the three, ffmpeg is the only one to be made to work with open cl.
as a side note, based on what i see in the xvid and x264 code, i don't think it's possible to modify them to work with open cl either, the same hardware limitation applies, there's just no way around it.
perhaps fermi will bring hardware support for function pointers, but if it doesn't we're right back to square one.
it would seem that you are 100% right about it being impossible to port the encoder portion of x264 to cuda, but not for any of the reasons you outlined, rather because of the programming techniques the developers used to code the encoder.
bummer... -
HAHA ok..... I thought so.
Nope, everything I wrote was NOT from the developers' mouths at all, they are only my views. I explained it from a functional viewpoint as an end user of an encoder. I have about zero code writing experience, but I know how x264 works at a basic functional level, and I know a little bit about video. There's probably a lot of other technical or coding issues that one of the developers could add. I'm sure some of the suggestions you made could help improve some of the issues - at least theoretically - but not all of them.
The linear lookahead method you mentioned is what x264 does right now (rc threaded lookahead). But how slow would it be on a GPU? Obviously there is no "thread equivalency" between CPU and GPU. I thought we were talking in the order of 1000's to 10000's of threads for Cuda? Those were the #'s thrown around by the developers. Isn't that why GPU's were "faster" for massively parallelizable tasks in the first place? You chop up the task into tiny little bits? Wouldn't your GPU be 99% idle?
Yes, you could do 2-passes, with a CPU 1st pass, that would get around the GOP frametype placement issues, but you still would have many other issues. Quality is paramount for the developers, and they won't sacrifice it, and as an end user, I wouldn't want them to. I don't care if you have a geeky programming workaround, if the quality sucks I'm not interested.
1) While doing 2 passes (with a 1st pass CPU), you could calculate where the optimal frametype and GOP boundaries lie. BUT, you still wouldn't want to use the GPU because of lower quality and efficiency losses from intra-frame slice "boundaries." If 16 threads and 4 slices/frame results in a minor PSNR 0.0001dB to 0.0003dB quality loss for a certain source (just making these #'s up for example), imagine what 10000, or 20000 would do to quality. You would never get as good quality.
Many sources would be slower to encode, and much worse quality. Consider a short 1 minute clip (maybe a video trailer , or something for youtube). It might only have 2 or 3 GOPs. How do you allocate the parallelizable work units? A few thousand per GOP ? How much of a quality hit are you willing to take? How many idle units would there be?
2) One of the key benefits that 99% of x264 users rave about is the true 1 pass quality (variable quantizer) mode ie. CRF , as opposed to CQ or constant quantizer , like xvid and such use. You can't do CRF encodes properly with GPU for the reasons mentioned earlier.
3) non parallelizable algorithms e.g CABAC , brute force motion analysis, SATD - these would still be bottlenecks , and you would never get close to 100% efficient use of GPU. These are "facts of life" that you can't get around; and there are several published papers on the subject if you're interested.
So you're basically agreeing with me (even if for different reasons), that massive parts have to be re-written in order to force it to work. = lot of work = not going to happen.
That's too bad. Even if you could accelerate *parts* of the 2nd pass or some 1pass or crf calculations without a significant quality hit I'd be happy.
OK, so maybe x264 interlaced encoding can be improved, some parts can be multithreaded a little better.... I'll drink the "kool aid" of any coders that make an encoder that works better than x264. If you suggest something does better (on progressive content), I will need proof. I'm a "proof" type of guy. If it sounds like I worship the developers, it's because they've produced a product that earns my respect. You've produced nothing to earn my respect. Hey, if you code one that does better, I'll "worship" the "deadrats encoder" too and become your biggest fan. I'm a fan of quality and nothing is even close in quality/speed. But I'll jump ship in a heartbeat if something out there is better.
The limitations mentioned earlier affect how GPU encoders work now. They take shortcuts. That's why the quality sucks. Maybe you can work from improving it from the existing cuda angle? Since it was written from the ground up to work on a GPU.
Can you explain the bit on "function pointers" a bit more in plain English? -
Originally Posted by poisondeathray
first things first, read this on threads:
http://en.wikipedia.org/wiki/Thread_(computer_science)
when we talk about cpu threads and gpu threads we're fundamentally talking about the same construct, so there is cpu/gpu "thread equivalency". second, cuda has nothing to do with how many threads can be used, cuda is the framework for using C to code applications that run on nvidia gpu's; the limiting factor is the hardware, not the development environment. in so far as how many threads can be kept in flight at any one time, the top of the line gpu's can keep slightly over 30 thousand threads in flight, while fermi will actually be able to keep fewer threads in flight, about 24 thousand is the number that has been thrown around. yes, that's part of the reason why they are faster for tasks that are massively parallel in nature, but the fact remains that gpu's are significantly faster for linear tasks as well, they are computation monsters; in terms of floating point performance they just can't be touched. i know you hate the flop metric but it is a very valid performance comparison benchmark, the more floating point operations per second a processor can perform the faster it will be.
with gpu's, not only can they keep orders of magnitude more threads in flight than a cpu, they can also complete each individual task significantly faster. think of it as having a phenom 9500: sure, it can keep twice the number of threads in flight at a time than an e8600 (4 vs 2), but even under the most multi-threaded app the e8600 is still way faster, because it can complete work on its 2 threads, be issued 2 more, complete them and start on 2 more before the 9500 finishes its 4.
Quote:
Yes, you could do 2-passes, with a CPU 1st pass, that would get around the GOP frametype placement issues, but you still would have many other issues. Quality is paramount for the developers, and they won't sacrifice it, and as an end user, I wouldn't want them to. I don't care if you have a geeky programming workaround, if the quality sucks I'm not interested.
1) While doing 2 passes (with a 1st pass CPU), you could calculate where the optimal frametype and GOP boundaries lie. BUT, you still wouldn't want to use the GPU because of lower quality and efficiency losses from intra-frame slice "boundaries." If 16 threads and 4 slices/frame results in a minor PSNR 0.0001dB to 0.0003dB quality loss for a certain source (just making these #'s up for example), imagine what 10000, or 20000 would do to quality. You would never get as good quality.
your problem is that you suffer from the same thought patterns as those "reporters" at Fox, you have a preconceived notion based on zero facts, in some cases made up facts, and you use these "facts" as a way of supporting your perceived reality.
by your own admission you just made those numbers up, you have your mind dead set on the notion that for some reason the exact same calculations performed on a different processor would somehow result in different results. where did you ever get the idea that performing the exact same operation would result in lower quality?
let's assume for the sake of argument that the above numbers are accurate: we have 4 slices per frame, 16 threads, obviously 4 frames at a time, and we end up with a PSNR quality loss of .0002dB. why would you extrapolate that to 10-20 thousand threads? no one is saying cut up each frame into 20 thousand slices, that would be insane. what you would do is work on more frames at the same time: still use 4 slices per frame, but instead of working on 4 frames at any one time you would work on 2500-5000 frames at the same time. you would keep the quality loss, which you deemed acceptable, the same, you would just work on larger chunks of video at a time.
Quote:
Many sources would be slower to encode, and much worse quality. Consider a short 1 minute clip (maybe a video trailer , or something for youtube). It might only have 2 or 3 GOPs. How do you allocate the parallelizable work units? A few thousand per GOP ? How much of a quality hit are you willing to take? How many idle units would there be?
consider an inherently serial task like folding@home, check out their faq (they are currently the foremost experts on gpgpu programming):
http://folding.stanford.edu/English/FAQ-SMP
None of our engines are written to be thread-safe or multi-threaded. The only parallelizable codes (Gromacs and AMBER) both use MPI. Making Gromacs use only threads for parallelization isn't possible right now (we talk with the Gromacs developers frequently on this issue), so MPI is the only solution.
http://folding.stanford.edu/English/FAQ-NVIDIA
One of the really exciting aspects about GPU's is that not only can they accelerate existing algorithms significantly, they get really interesting in that they can open doors to new algorithms that we would never think to do on CPUs at all (due to their very slow speed on CPUs, but not GPUs).
Much like the Gromacs core greatly enhanced Folding@home by a 20x to 30x speed increase via a new utilization of hardware (SSE) in PCs, in 2006, Folding@home has developed a new streaming processor core to utilize another new generation of hardware: GPUs with programmable floating-point capability. By writing highly optimized, hand tuned code to run on ATI X1900 class GPUs, the science of Folding@home will see another 20x to 30x speed increase over its previous software (Gromacs) for certain applications. This great speed increase is achieved by running essentially the complete molecular dynamics calculation on the GPU; while this is a challenging software development task, it appears to be the way to achieve the highest speed improvement on GPU's
as you can see, much of the speed increase comes from the ability of gpu's to perform floating point operations 20-30 times faster than a cpu, yes this faq was written circa 2006 and yes cpu's have gotten faster since then but so have the gpu's, try it for yourself, download the cpu and gpu folding@home clients and see for yourself how much faster the gpu is and as you can read for yourself it has nothing to do with the number of threads a gpu can handle as most of the code isn't multi-threaded.
Quote:
2) One of the key benefits that 99% of x264 users rave about is the true 1 pass quality (variable quantizer) mode ie. CRF , as opposed to CQ or constant quantizer , like xvid and such use. You can't do CRF encodes properly with GPU for the reasons mentioned earlier.
3) non parallelizable algorithms e.g CABAC , brute force motion analysis, SATD - these would still be bottlenecks , and you would never get close to 100% efficient use of GPU. These are "facts of life" that you can't get around; and there are several published papers on the subject if you're interested.
gpu's are awesome for brute force work:
http://securityandthe.net/2008/10/12/russian-researchers-achieve-100-fold-increase-in-...racking-speed/
The 100-fold increase in speed is achieved with two GeForce GTX280’s per workstation; for €599 you can build a network of 20 workstations dedicated to “recovering” your “lost” WPA keys. This means that a WPA or WPA2 key could be cracked in days or weeks instead of years.
Quote:
So you're basically agreeing with me (even if for different reasons), that massive parts have to be re-written in order to force it to work. = lot of work = not going to happen.
The limitations mentioned earlier affect how GPU encoders work now. They take shortcuts. That's why the quality sucks. Maybe you can work from improving it from the existing cuda angle? Since it was written from the ground up to work on a GPU.
Can you explain the bit on "function pointers" a bit more in plain English?
in C/C++ a function hands a result back to its caller with the "return" statement, and the skeleton of a function looks like this:
main ( )
{
    // this is where the function definition, or the "body", is placed
    return ( );
} // these brackets encapsulate all the elements of the function
in C/C++ the main portion of the program resides within the "main" function, and every function is declared with a data type that corresponds to the type of data it returns when its operations are completed. in the case of main ( ) it is usually a 0 (success) or a non-zero error code that is returned, thus it is written like this:
int main ( )
{
    return 0;
}
when the parentheses following the function name are empty, no values are passed to the function when it's called; when the parentheses list parameters, the caller passes values in, which the function can operate on and send an answer back with "return". thus:
void add ( );   // declare the function before main uses it

int main ( )
{
    add ( );    // call the function; nothing is passed and nothing comes back
    return 0;
}

void add ( )
{
    int a = 1;
    int b = 2;
    int c = a + b;
    cout << c;
}
calls a function that adds 1 + 2, assigns the value to c and then prints out the value of c to the screen, whereas:
int add ( int a, int b )
{
    int c = a + b;
    return c;   // the sum goes back to the caller
}

int main ( )
{
    int a = 1;
    int b = 2;
    int c = add ( a, b );   // pass the values in, catch the sum coming back
    cout << c;
    return 0;
}
sends the values of 1 and 2 to the add ( ) function, which adds the two numbers and returns the resulting sum to main, which in turn prints the value to the screen. there are some things i'm leaving out, such as the preprocessor directives, for the sake of simplicity. now, as you may have noticed, all variables must be assigned a data type, and since functions return values to their callers, they too must be declared with a data type. the data types can be int, float, double, char, long, as well as some other more complex data types such as dword.
now, because passing a large number of values, as well as strings, causes increased program overhead, pointers were included with C/C++ to keep resource usage to a minimum. a pointer is a variable that "points" to the location of another variable; in the example above, instead of passing the actual values of a, b or c, we could pass their locations in memory. while this doesn't have any performance advantage in the above examples, it certainly does when we're dealing with large data streams. when you want to read the value a pointer refers to, you dereference the pointer, and it gives you whatever is in memory at the location it's pointing to.
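for instance (a trivial example of mine, preprocessor directives again left out):
*********************************
int main ( )
{
    int a = 5;
    int * p = &a;    // p stores the address of a, not the value 5
    cout << *p;      // dereferencing p reads what's stored at that address: 5
    return 0;
}
*********************************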
now a function pointer is a pointer that, when you dereference it, actually invokes a function, which you can pass values to just like an ordinary function. it is this behavior that nvidia gpu's do not currently support in hardware.
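here's what that looks like in practice, again a trivial example of mine:
*********************************
int add ( int a, int b ) { return a + b; }
int sub ( int a, int b ) { return a - b; }

int main ( )
{
    // pf is a function pointer: it holds the address of a function
    // that takes (int, int) and returns an int
    int (*pf)(int, int) = add;
    cout << pf(1, 2);   // invokes add, prints 3

    pf = sub;           // retargeted at run time
    cout << pf(1, 2);   // the same call site now invokes sub, prints -1
    return 0;
}
*********************************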
based on what you said the encoding of the video streams is highly dynamic, which in and of itself jives nicely with the use of function pointers since they are generally used to simplify code when you need to invoke functions based on a run-time value, the dynamic length of gop's would certainly lead to variable run-time values and the simplest way to implement that in code is either via function pointers or functors in objective C or c++.
looking over the x264 code i see no easy way to recode this so as to eliminate the need for function pointers, they made other design decisions that effectively painted them into a corner and the only way out was via function pointers, you would basically need to rethink x264 from the ground up.
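for what it's worth, the standard workaround on these gpu's is to replace every call through a function pointer with an explicit switch on an enum, so the compiler can see every possible target at compile time. a sketch of mine (the names are not x264's):
*********************************
enum MotionSearch { SEARCH_DIA, SEARCH_HEX };

__device__ int searchDia(int iMb) { return iMb;     }  // stand-ins for real
__device__ int searchHex(int iMb) { return iMb * 2; }  // search routines

// dispatch without a function pointer: every callee is known at compile time
__device__ int runSearch(MotionSearch eKind, int iMb)
{
    switch (eKind) {
        case SEARCH_DIA: return searchDia(iMb);
        case SEARCH_HEX: return searchHex(iMb);
    }
    return 0;
}
*********************************
but doing that across a code base the size of x264 is exactly the ground-up rethink i'm talking about.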
as i mentioned, ffmpeg's avc codec is nothing like x264, i looked through libavc and while it would need a bit of restructuring, i think it would be reasonably straight forward to port it to cuda. -
Originally Posted by deadrats
when we talk about cpu threads and gpu threads we're fundamentally talking about the same construct, so there is cpu/gpu "thread equivalency". second, cuda has nothing to do with how many threads can be used, cuda is the framework for using C to code applications that run on nvidia gpu's; the limiting factor is the hardware, not the development environment. in so far as how many threads can be kept in flight at any one time, the top of the line gpu's can keep slightly over 30 thousand threads in flight, while fermi will actually be able to keep fewer threads in flight, about 24 thousand is the number that has been thrown around. yes, that's part of the reason why they are faster for tasks that are massively parallel in nature, but the fact remains that gpu's are significantly faster for linear tasks as well, they are computation monsters; in terms of floating point performance they just can't be touched. i know you hate the flop metric but it is a very valid performance comparison benchmark, the more floating point operations per second a processor can perform the faster it will be.
When I talk about "thread equivalency" in layman's terms, I mean the same task doing the same thing. So your 20,000 threads should be 1666x faster than an i7 using 12 threads. Is this what you are suggesting? Are you suggesting that using 1 thread for the various motion analysis algorithms on a single GOP section with a GPU is equivalent in speed to using 1 thread on a CPU?
All this sounds great in theory, and some parallelizable tasks do work great with GPU's, like small workunit F@H. But things like time travel and teleportation sound great too. Why are the current GPU based encoders slower than CPU encoders at the same quality level, and don't even come close in top quality?
I want you to prove it. When you examine the current GPU encoder output streams, they suck for the very reasons I suggested earlier. Is it because none of the programmers have the GPU know-how? Are they all "chimps"? Are they all lazy? Even if they were, you should be able to write some code for a better GPU encoder right?
Quote:
Yes, you could do 2-passes, with a CPU 1st pass, that would get around the GOP frametype placement issues, but you still would have many other issues. Quality is paramount for the developers, and they won't sacrifice it, and as an end user, I wouldn't want them to. I don't care if you have a geeky programming workaround, if the quality sucks I'm not interested.
There is a distinction and lots of valid uses for lossy compression. Blu-ray is already highly compressed, around 40-60x from the 10-bit 4:4:4 master. Your average 100min movie wouldn't even fit on a 2TB HDD. Most users are looking for better compression (i.e better quality at the same bitrate), faster encoding at a certain quality level. Are you happy with your TMPGEnc and Badaboom? If you can get the same quality, faster and at a lower bitrate isn't that appealing? A 25Mb/s encode using x264 might require 35-40Mb/s for the same quality using other encoders.
Quote:
1) While doing 2 passes (with a 1st pass CPU), you could calculate where the optimal frametype and GOP boundaries lie. BUT, you still wouldn't want to use the GPU because of lower quality and efficiency losses from intra-frame slice "boundaries." If 16 threads and 4 slices/frame results in a minor PSNR 0.0001dB to 0.0003dB quality loss for a certain source (just making these #'s up for example), imagine what 10000, or 20000 would do to quality. You would never get as good quality.
Quote:
your problem is that you suffer from the same thought patterns as those "reporters" at Fox, you have a preconceived notion based on zero facts, in some cases made up facts, and you use these "facts" as a way of supporting your perceived reality.
by your own admission you just made those numbers up, you have your mind dead set on the notion that for some reason the exact same calculations performed on a different processor would somehow result in different results. where did you ever get the idea that performing the exact same operation would result in lower quality?
Where are your FACTS? All you have spewed are unproven theories. Who says you can even do the same operation on the GPU at all? How do you get around memory limitations?
Where are your GPU encoded stream examples that PROVE you can do the same operations? The FACTS I have are the current stream samples that PROVE what I say is true for current encoders. I can show you features at the stream level why the quality sucks. I can emulate the low quality from GPU encodes by using similar settings with x264.
Does this prove 5 years from now someone might have finally programmed a decent GPU encoder? Of course not, but what I'm saying is a lot closer to current reality than your unproven theories.
Come on, make that encoder. Prove me wrong. I dare you to. In the scientific world, the onus is on those making the bold claims and theories to prove it, not on those who have established facts.
Quote:
let's assume for the sake of argument that the above numbers are accurate: we have 4 slices per frame, 16 threads, obviously 4 frames at a time, and we end up with a PSNR quality loss of .0002dB. why would you extrapolate that to 10-20 thousand threads? no one is saying cut up each frame into 20 thousand slices, that would be insane. what you would do is work on more frames at the same time: still use 4 slices per frame, but instead of working on 4 frames at any one time you would work on 2500-5000 frames at the same time. you would keep the quality loss, which you deemed acceptable, the same, you would just work on larger chunks of video at a time.
Quote:
Many sources would be slower to encode, and much worse quality. Consider a short 1 minute clip (maybe a video trailer , or something for youtube). It might only have 2 or 3 GOPs. How do you allocate the parallelizable work units? A few thousand per GOP ? How much of a quality hit are you willing to take? How many idle units would there be?
Quote:
consider an inherently serial task like folding@home, check out their faq (they are currently the foremost experts on gpgpu programming):
http://folding.stanford.edu/English/FAQ-SMP
None of our engines are written to be thread-safe or multi-threaded. The only parallelizable codes (Gromacs and AMBER) both use MPI. Making Gromacs use only threads for parallelization isn't possible right now (we talk with the Gromacs developers frequently on this issue), so MPI is the only solution.
http://folding.stanford.edu/English/FAQ-NVIDIA
One of the really exciting aspects about GPU's is that not only can they accelerate existing algorithms significantly, they get really interesting in that they can open doors to new algorithms that we would never think to do on CPUs at all (due to their very slow speed on CPUs, but not GPUs).
Much like the Gromacs core greatly enhanced Folding@home by a 20x to 30x speed increase via a new utilization of hardware (SSE) in PCs, in 2006, Folding@home has developed a new streaming processor core to utilize another new generation of hardware: GPUs with programmable floating-point capability. By writing highly optimized, hand tuned code to run on ATI X1900 class GPUs, the science of Folding@home will see another 20x to 30x speed increase over its previous software (Gromacs) for certain applications. This great speed increase is achieved by running essentially the complete molecular dynamics calculation on the GPU; while this is a challenging software development task, it appears to be the way to achieve the highest speed improvement on GPU's
as you can see, much of the speed increase comes from the ability of gpu's to perform floating point operations 20-30 times faster than a cpu, yes this faq was written circa 2006 and yes cpu's have gotten faster since then but so have the gpu's, try it for yourself, download the cpu and gpu folding@home clients and see for yourself how much faster the gpu is and as you can read for yourself it has nothing to do with the number of threads a gpu can handle as most of the code isn't multi-threaded.
Quote:
2) One of the key benefits that 99% of x264 users rave about is the true 1 pass quality (variable quantizer) mode ie. CRF , as opposed to CQ or constant quantizer , like xvid and such use. You can't do CRF encodes properly with GPU for the reasons mentioned earlier.
3) non parallelizable algorithms e.g CABAC , brute force motion analysis, SATD - these would still be bottlenecks , and you would never get close to 100% efficient use of GPU. These are "facts of life" that you can't get around; and there are several published papers on the subject if you're interested.
Quote:
gpu's are awesome for brute force work:
http://securityandthe.net/2008/10/12/russian-researchers-achieve-100-fold-increase-in-...racking-speed/
The 100-fold increase in speed is achieved with two GeForce GTX280’s per workstation; for €599 you can build a network of 20 workstations dedicated to “recovering” your “lost” WPA keys. This means that a WPA or WPA2 key could be cracked in days or weeks instead of years.
Quote:
The limitations mentioned earlier affect how GPU encoders work now. They take shortcuts. That's why the quality sucks. Maybe you can work from improving it from the existing cuda angle? Since it was written from the ground up to work on a GPU.
Current GPU encoders have lower quality prediction and analysis. They skip out on using some features like CABAC and b-frames, residuals are a lot worse. We just disagree on the "why". All the things you said , should in theory make it possible. But where is that great GPU encoder?
Quote:
as i mentioned, ffmpeg's avc codec is nothing like x264, i looked through libavc and while it would need a bit of restructuring, i think it would be reasonably straight forward to port it to cuda. -
ooowowwwwww my head hurts
-
Originally Posted by poisondeathray
Why are the current GPU based encoders slower than CPU encoders at the same quality level, and don't even come close in top quality?
i just ran this test using the badaboom encoder: i took a 1080p wmv at 5mb/s and encoded it to 1080p h264, 25mb/s, main profile, 4.1 level, cabac on, vbr with 128kb/s ac3, all processing, decode and encode, was handled by the gpu, and i averaged 13-14 frames per second. i defy you to encode a file at 1080p at 25mb/s, using any settings within x264, and achieve anywhere near that frame rate.
Quote:
I want you to prove it. When you examine the current GPU encoder output streams, they suck for the very reasons I suggested earlier. Is it because none of the programmers have the GPU know-how? Are they all "chimps"? Are they all lazy? Even if they were, you should be able to write some code for a better GPU encoder right?
Quote:
If you can get the same quality, faster and at a lower bitrate isn't that appealing? A 25Mb/s encode using x264 might require 35-40Mb/s for the same quality using other encoders.
The reason why you can't do the 1st pass on the GPU is that it would be too slow. In order to get the frametype placement, GOP size etc. correct, it has to be done sequentially. Unless there is thread equivalency in terms of speed, I don't see how this can be done on a GPU
you keep saying that it would be too slow on a gpu, i have offered you tons of third party proof to the contrary, you're starting to make yourself look foolish, i strongly suggest you stop.
Quote:
Where are your FACTS? All you have spewed are unproven theories. Who says you can even do the same operation on the GPU at all? How do you get around memory limitations?
as i have already pointed out, cuda is basically C for nvidia gpu's, if the compiler supports a feature then the hardware supports the feature. cuda is well documented, all the proof you want is in the tutorials and cuda developer documentation. looking through the documentation, geforce 8 and later gpu's support ALL features of ANSI C, with the exception of function pointers and object oriented programming features.
every other procedural programming feature is supported, such as structs, unions, pointers, all data types, preprocessor directives, shared libraries, integer math, floating point math, pushing the stack, popping the stack, it's all supported.
you can get around the limitations of no support for classes by using function definitions within a struct (which is basically what a class is), basically you just need to use a slightly different programming technique than you're used to.
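in plain C terms that means something like this (my example): the data lives in a struct and plain functions that take the struct as their first argument stand in for the member functions:
*********************************
typedef struct
{
    int width;
    int height;
} Frame;

/* the "member function" equivalent: operates on a Frame passed in explicitly */
int frameArea(const Frame * pFrame)
{
    return pFrame->width * pFrame->height;
}
*********************************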
Quote:
Where are your GPU encoded stream examples that PROVE you can do the same operations? The FACTS I have are the current stream samples that PROVE what I say is true for current encoders. I can show you features at the stream level why the quality sucks. I can emulate the low quality from GPU encodes by using similar settings with x264.
Come on, make that encoder. Prove me wrong. I dare you to. In the scientific world, the onus is on those making the bold claims and theories to prove it, not on those who have established facts.
Ditto for you and video! You're trying to apply your knowledge of cpu and gpu to video which you know little about.
Current GPU encoders have lower quality prediction and analysis. They skip out on using some features like CABAC and b-frames, residuals are a lot worse. We just disagree on the "why". All the things you said , should in theory make it possible. But where is that great GPU encoder?
as for when the great gpu encoder will finally be here, most likely never. as i said way earlier in this thread, IF intel actually ends up releasing that video transcoding driver and it is in fact a driver in every sense of the word, then gpu acceleration via open cl or cuda will go the way of the dodo. and even if that driver is never released and/or it's not a driver in the traditional sense but more like a plug in or a stand alone encoder, it's still a moot point: sandy bridge is on track to hit retail by this time next year, and once that happens i give nvidia less than 2 years to close up shop, and i think open cl will end up going nowhere.
now, if the education environment was to change in this country and gpgpu programming classes started being offered within the associates degree curriculum, i.e. in addition to needing to take c++ I&II, the various data structures and algorithm classes, the comp organization and assembler classes, etc, they also made the student take gpu programming I&II, then we would see a massive shift toward high quality gpu accelerated apps, but as i said...
Quote:
Modern ffmpeg builds use x264. Unless you're referring to some old build that gives worse results than xvid
/*****************************************************************************
 * x264: h264 encoder
 *****************************************************************************
* Copyright (C) 2003 Laurent Aimar
* $Id: encoder.c,v 1.1 2004/06/03 19:27:08 fenrir Exp $
*
* Authors: Laurent Aimar <fenrir@via.ecp.fr>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111, USA.
 *****************************************************************************/
and here's the copyright notice for libavc:
/*
* H.26L/H.264/AVC/JVT/14496-10/... encoder/decoder
* Copyright (c) 2003 Michael Niedermayer <michaelni@gmx.at>
*
* This file is part of FFmpeg.
*
* FFmpeg is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
*
* FFmpeg is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with FFmpeg; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
*/
here's the preprocessor directives for x264:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#ifdef __WIN32__
#include <windows.h>
#define pthread_t HANDLE
#define pthread_create(t,u,f,d) *(t)=CreateThread(NULL,0,f,d,0,NULL)
#define pthread_join(t,s) { WaitForSingleObject(t,INFINITE); \
CloseHandle(t); }
#define HAVE_PTHREAD 1
#elif defined(SYS_BEOS)
#include <kernel/OS.h>
#define pthread_t thread_id
#define pthread_create(t,u,f,d) { *(t)=spawn_thread(f,"",10,d); \
resume_thread(*(t)); }
#define pthread_join(t,s) wait_for_thread(t,(long*)s)
#define HAVE_PTHREAD 1
#elif HAVE_PTHREAD
#include <pthread.h>
#endif
#include "common/common.h"
#include "common/cpu.h"
#include "set.h"
#include "analyse.h"
#include "ratecontrol.h"
#include "macroblock.h"
#if VISUALIZE
#include "common/visualize.h"
#endif
//#define DEBUG_MB_TYPE
//#define DEBUG_DUMP_FRAME
//#define DEBUG_BENCHMARK
#ifdef DEBUG_BENCHMARK
static int64_t i_mtime_encode_frame = 0;
static int64_t i_mtime_analyse = 0;
static int64_t i_mtime_encode = 0;
static int64_t i_mtime_write = 0;
static int64_t i_mtime_filter = 0;
#define TIMER_START( d ) \
{ \
int64_t d##start = x264_mdate();
#define TIMER_STOP( d ) \
d += x264_mdate() - d##start;\
}
#else
#define TIMER_START( d )
#define TIMER_STOP( d )
#endif
#define NALU_OVERHEAD 5 // startcode + NAL type costs 5 bytes per frame
and here's the preprocessor directives for the h264 portion of libavc:
/**
* @file libavcodec/h264.c
* H.264 / AVC / MPEG4 part10 codec.
* @author Michael Niedermayer <michaelni@gmx.at>
*/
#include "internal.h"
#include "dsputil.h"
#include "avcodec.h"
#include "mpegvideo.h"
#include "h264.h"
#include "h264data.h"
#include "h264_parser.h"
#include "golomb.h"
#include "mathops.h"
#include "rectangle.h"
#include "vdpau_internal.h"
#include "cabac.h"
#if ARCH_X86
#include "x86/h264_i386.h"
#endif
//#undef NDEBUG
#include <assert.h>
/**
* Value of Picture.reference when Picture is not a reference picture, but
* is held for delayed output.
*/
#define DELAYED_PIC_REF 4
you don't need to be a programmer to see that the libavc code is much simpler, much cleaner and way more streamlined and completely different.
note, this code comparison is from the latest snapshot of each encoder, 2 very different animals. -
a thread is a thread is a thread, if you had any formal study on comp sci you would know this. 1 thread processed on a gpu is not the equivalent speed wise to it being processed on a cpu, it is much much faster on the gpu. in so far as whether the 20 thousand threads done on a gpu are 1666x faster than an i7 using 12 threads, i'll let you prove that to yourself: download the cinebench benchmark and run the software render and the hardware render benchmarks, where the same scene is rendered on the cpu and on the gpu, and compare the results for yourself.
[...] same code for the gpu as you are for the cpu. So why isn't badaboom 1666x faster than x264 encoding (not accounting for quality)? Is it possible there are bottlenecks and memory limitations (at least with the current software implementation)?

Quote:
Why are the current GPU based encoders slower than CPU encoders at the same quality level, and don't even come close in top quality?
what are you smoking and why don't you bring enough for all of us? the current gpu encoders are slower than cpu encoders?!? really? where did you buy your reality distortion field and did you get a good deal on it?

i just ran this test using the badaboom encoder: i took a 1080p wmv at 5mb/s and encoded it to 1080p h264, 25mb/s, main profile, 4.1 level, cabac on, vbr with 128kb/s ac3, all processing, decode and encode, was handled by the gpu, and i averaged 13-14 frames per second. i defy you to encode a file at 1080p at 25mb/s, using any settings within x264, and achieve anywhere near that frame rate.
Quote:
Why are the current GPU based encoders slower than CPU encoders at the same quality level, and
don't even come close in top quality?
What settings did you use, and what is your configuration?
Use --preset "fast" or "veryfast". I get about 2x realtime on 1080p24 bluray source on an i7. You could
even use "ultrafast", but that would drop below badaboom quality on most sources. These presets
adjust the settings (like lower quality search algorithms, fewer reference frames, b-adapt 1, etc...) to
match badaboom's quality.
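On the command line that's simply something like this (the bitrate and file names are made up for illustration):
x264 --preset fast --bitrate 25000 --level 4.1 -o out.264 input.y4m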
Quote:
I want you to prove it. When you examine the current GPU encoder output streams, they suck for the
very reasons I suggested earlier. Is it because none of the programmers have the GPU know-how? Are
they all "chimps"? Are they all lazy? Even if they were, you should be able to write some code for a
better GPU encoder, right?
first things first: why is it i have to prove it? you're the one making all kinds of nonsensical claims in
regards to gpu capabilities. second of all, yes, no current mainstream programmer at the moment has
the experience writing general purpose code on the gpu. gpgpu is still in its infancy, and as i have already
pointed out most universities don't even offer courses in gpu programming, and those that do only offer it
as graduate level course work; it's not like you can have a guy go to devry and learn enough to code for
a gpu.
they were released. My concern is top quality and it's not there from the current crop of GPU encoders.
If you or someone can leverage that potential, then that's what I'm looking for.
If you can get the same quality, faster and at a lower bitrate isn't that appealing? A 25Mb/s encode using
x264 might require 35-40Mb/s for the same quality using other encoders.
mb/s even mpeg-2 offers similar quality. as for low bit rate quality, quite frankly flix is much, much better;
the best encodes i have ever seen at low bit rate were done with flix.
most types of content, but the advantage is still maintained. If you were to do generational encoding for
example, the MPEG2 encode would look worse each time. Plot the PSNR/Bitrate graphs even up to
180Mb/s and the advantage is still there. MPEG2 never crosses the line, and never comes close. If you
want the sources to test for yourself, let me know.
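For anyone following along, PSNR is just a log measure of the mean squared error between source and encode. A minimal sketch of the calculation (the function name is my own):
*******************************
#include <math.h>

/* psnr in dB for 8-bit samples: 10*log10(255^2 / mse).
   higher is better; the psnr/bitrate graphs mentioned above plot
   this value against the bitrate of each encode. */
static double psnr_8bit(const unsigned char *ref, const unsigned char *enc, int n)
{
    double mse = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)enc[i];
        mse += d * d;
    }
    mse /= n;
    if (mse == 0.0)
        return 99.0;  /* frames are identical: psnr is infinite, cap it */
    return 10.0 * log10(255.0 * 255.0 / mse);
}
*******************************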
OK now for Flix. Explain your observations. Do you mean flix pro as in vp6 or something else? What kind
of testing have you done, and can you post sources /encodes etc... i.e. provide some evidence. If you
don't want to do it, I'll do the testing process. I just need the sources and information. I simply don't
believe these claims.
Quote:
Where are your FACTS? All you have spewed are unproven theories. Who says you can even do the
same operation on the GPU at all? How do you get around memory limitations?
have you been paying attention at all or are the x264 blinders on too tight?
as i have already pointed out, cuda is basically C for nvidia gpus: if the compiler supports a feature then
the hardware supports the feature. cuda is well documented, and all the proof you want is in the tutorials
and the cuda developer documentation. looking through the documentation, geforce 8 and later gpus
support ALL features of ANSI C, with the exception of function pointers and object oriented
programming features.
every other procedural programming feature is supported: structs, unions, pointers, all data
types, preprocessor directives, shared libraries, integer math, floating point math, pushing the stack,
popping the stack, it's all supported.
you can get around the lack of support for classes by using function definitions within a struct
(which is basically what a class is); basically you just need to use a slightly different programming
technique than you're used to, for example:
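here is a toy illustration of that technique (my own example, not production code); the "methods" live inside the struct and there isn't a function pointer in sight:
*******************************
#include <cstdio>
#include <cuda_runtime.h>

// a struct standing in for a class: data plus "methods", no virtual
// functions and no function pointers, so it compiles for the device
struct Counter {
    int value;

    __device__ void reset()     { value = 0; }
    __device__ void add(int n)  { value += n; }
    __device__ int  get() const { return value; }
};

__global__ void count_kernel(int *out)
{
    Counter c;               // per-thread instance in local storage
    c.reset();
    c.add(threadIdx.x);
    out[threadIdx.x] = c.get();
}

int main()
{
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    count_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 31 counted to %d\n", h_out[31]);
    cudaFree(d_out);
    return 0;
}
*******************************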
Well most programmers I've heard talk about it (not necessarily from x264) all moan how bad it is to
program for cuda. The bottom line is nobody has put together a good GPU encoder yet. That's all I'm
interested in at the end of the day. If it's due to lack of programming knowledge, that's entirely plausible.
But if you're saying there are no limitations to what a GPU can do in terms of video encoding, I find that
hard to believe.
Where are your GPU encoded stream examples that PROVE you can do the same operations? The
FACTS I have are the current stream samples that PROVE what I say is true for current encoders. I can
show you features at the stream level why the quality sucks. I can emulate the low quality from GPU
encodes by using similar settings with x264.
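For instance, something like this gets you into that territory (the exact values are my rough guess at a match, not a measured equivalence):
x264 --ref 1 --bframes 0 --no-cabac --me dia --subme 2 -o out.264 input.y4m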
lower quality, the thing that you can't seem to understand is that the limiting factor is not the hardware,
it's that x86 programmers don't have the experience writing code for the gpu, big difference. if you take
a risc programmer or a sparc programmer, someone who's been coding for those platforms exclusively
for years, and ask him to write code for the x86 platform, the code is likewise going to be poor; same
thing having an x86 programmer code for the ia64 architecture. it's programmer inexperience, not inferior
hardware, that's at fault.
it was because of the writers' lack of experience and (not even partly) a hardware/cuda API limitation,
then I can accept that. I posted some of the reasons why I thought it would be problematic for a GPU, but
you said they were 100% related to programming. Is it such a big stretch of the imagination that there are
hardware/architectural limitations? Other fields in scientific computing have similar limitations, e.g.
F@H. I accept what you have to say, but if there was some great GPU encoder coming along that had all
the features, configurability, and quality of x264 and yet encoded faster, I would find your comments
even more convincing.
Quote:
Current GPU encoders have lower quality prediction and analysis. They skip out on using some features
like CABAC and b-frames, and residuals are a lot worse. We just disagree on the "why". All the things you
said should, in theory, make it possible. But where is that great GPU encoder?
disagreeing on the why: you seem hell-bent on believing that it's because of an inherent fault within
current gpu architectures; i know that it's because programmers don't quite have a handle on gpu
computing just yet.
now, but still limited to 1 pass, and don't offer "High". The PSNR graphs and encoding times and charts I
posted in the other thread were done with the most recent version that added those features.
I do believe I said earlier above that a GPU can only do what the programming tells it to do. I posted
reasons why I thought there were issues with the architecture. Assuming there are no architectural
limitations, why did they program a POS then? If it's only because programmers don't have a handle on it,
when will they? Badaboom has been released for over a year, and been in development for 3 or 4. How long does it take to
"learn"?
as for when the great gpu encoder will finally be here: most likely never. as i said way earlier in this
thread, IF intel actually ends up releasing that video transcoding driver, and it is in fact a driver in every
sense of the word, then gpu acceleration via opencl or cuda will go the way of the dodo. and even if
that driver is never released, and/or it's not a driver in the traditional sense but more like a plug-in or a
stand-alone encoder, it's still a moot point: sandy bridge is on track to hit retail by this time next year,
and once that happens i give nvidia less than 2 years to close up shop, and i think opencl will end up going
nowhere.
now, if the education environment in this country were to change and gpgpu programming classes
started being offered within the associates degree curriculum, i.e. in addition to needing to take c++ I&II,
the various data structures and algorithm classes, the comp organization and assembler classes, etc,
they also made the student take gpu programming I&II, then we would see a massive shift toward high
quality gpu accelerated apps, but as i said...
Modern ffmpeg builds use x264. Unless you're referring to some old build that gives worse than xvid.
Quote:
you don't need to be a programmer to see that the libavc code is much simpler, much cleaner and way
more streamlined and completely different.
note, this code comparison is from the latest snapshot of each encoder, 2 very different animals.
It depends what you compiled your ffmpeg build with. You can compile x264 with it, or download
precompiled ones; most precompiled binaries have x264. If yours has different avc encoder code it's probably the crappy one.
Do some quick tests, because it might not be worth your time screwing around with it. -
Originally Posted by poisondeathray
when you compare the floating point capabilities of x86 cpus to the floating point capabilities of gpus, you find that the speed discrepancy between badaboom and software based h264 encoders is a match for the differences in fp performance.
basically hardware encoders are using a brute force tactic to gain their speed advantage instead of a more finesse approach, which is the category highly threaded code would fall into.
Your results are pretty slow, but you confused the issue by including audio. What settings did you
use, and what is your configuration?
Well most programmers I've heard talk about it (not necessarily from x264) all moan how bad it is to
program for cuda. The bottom line is nobody has put together a good GPU encoder yet. That's all I'm
interested in at the end of the day. If it's due to lack of programming knowledge, that's entirely plausible.
But if you're saying there are no limitations to what a GPU can do in terms of video encoding, I find that
hard to believe
and that same programmer will absolutely love one of the above languages. i personally like Pascal more than C, and i know programmers that have said they would rather throw their computers out the window than use Pascal. programmers are people, and more importantly they are the biggest bitches you will ever meet, they will complain about everything.
Assuming there are no architectural limitations, why did they program a POS then? If it's only because programmers don't have a handle on it, when will they? Badaboom has been released for over a year, and been in development for 3 or 4. How long does it take to "learn?"
the elemental developers, the people behind badaboom, also have an ulterior motive for keeping badaboom from being all it can be: they also make the rapidhd plug-in for adobe premiere, and that plug-in is expensive (it only comes bundled with a $2000 quadro fx card). they're never, ever going to sell a $30 app that runs on gaming gpus that is anywhere near the quality of their premium product, they would be crazy to. if you look at their rapidhd plug-in you will note that it offers more features:
http://elementaltechnologies.com/products/accelerator/specs
most reviews i have read on this plug-in seem to indicate that it's a pretty good product.
so badaboom will always be sub-par. as for how long it takes to learn, it really depends on the programmer and how badly he/she wants to learn; there is a lot of inertia within the programming community toward anything that's new. hell, COBOL is now what, 40 years old, and it's still the most widely used business programming language; operating systems are still coded in C and that's at least 35 years old. it's just the way people are.
It depends what you compiled your ffmpeg build with. You can compile x264 with it, or download
precompiled ones; most precompiled binaries have x264. If yours has different avc encoder code it's probably the crappy one.
Do some quick tests, because it might not be worth your time screwing around with it.
http://ffmpeg.org/
FFmpeg is a complete, cross-platform solution to record, convert and stream audio and video. It includes libavcodec - the leading audio/video codec library.
http://ffmpeg.org/download.html
you get the source to the various parts of libavc, including, and i didn't know this, the source for an open source vc-1 codec. but that's beside the point: looking through the libavc folder we see a C file and a header file named h264, but no mention of x264, as such there is no way you are compiling ffmpeg with x264 support just from what is offered for download on the ffmpeg website.
my guess is people are jury-rigging ffmpeg to work with x264, but that's definitely not an official build.
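the jury rigging in question boils down to building x264 separately and then pointing ffmpeg's configure script at it, something along these lines (a sketch from memory, check an encoding guide for the exact steps):
*******************************
# build and install x264 first, then configure ffmpeg against it:
./configure --enable-gpl --enable-libx264
make
*******************************
-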
Originally Posted by deadrats
i can't post a source or sample as the files i am referring to are adult in nature.
as far as what i mean by the quality: the encodes are clean, 100% free of noise and compression artifacts, extremely detailed and clear, and when i check the writing application it says linux flixengine. the encodes are damn impressive. you can download a demo for windows, you just have to register first.
http://rob.opendot.cl/index.php/useful-stuff/ffmpeg-x264-encoding-guide/
http://sites.google.com/site/linuxencoding/x264-ffmpeg-mapping
http://ubuntuforums.org/showthread.php?t=786095 -
Originally Posted by DarrellS
FB-DIMM are the real cause of global warming
-
Originally Posted by DarrellS
https://forum.videohelp.com/topic376999.html
Come on, which thread ever stays on topic? This excursion began very much related to the original topic, as we discussed what types of video cards were suitable for NLE editing with the upcoming CS5, acceleration of video encoding, and the future outlook for purchasing decisions, but it got derailed. Maybe a mod can split it somewhere or append the first bit to the other hardware thread that Engineering had. -
Originally Posted by poisondeathray
many calculations, like dct and idct, are pure floating point calculations, but the proof is in the code. look through the x264 code to see what kind of data types are declared: if the computations were primarily integer based you would expect to see "int" declared, and if they are floating point calculations you would see "float" or "double". looking through the x264 source there seem to be more int declarations in general, but that is a bit misleading, as analyzing the code shows that many of those int declarations are for functions that return a 1 or 0. there are about a dozen or so float and double declarations; overall the code seems to indicate that it's a mix of integer and floating point calculations.
this however is not necessarily a fatal blow to gpu acceleration. if you were structuring the code to run on a gpu, knowing that a gpu is a brute force floating point machine, you could simply declare "int" as "float", which would cause the compiler to treat all integer calculations as floating point calculations and thus have them run on the gpu's floating point unit. this is a very minor modification to the code that could be implemented in less than an hour (even accounting for "breaking" something as you re-declare the data types).
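in practice you wouldn't literally redefine the keyword, you would funnel the calculations through a typedef and flip it in one place. here's a toy sketch of the idea (all the names are mine):
*******************************
/* swap the arithmetic type in one place: int for the cpu build,
   float for a gpu-oriented build where the fp units do the work */
#ifdef GPU_BUILD
typedef float coeff_t;
#else
typedef int   coeff_t;
#endif

/* a toy sum-of-absolute-differences written against coeff_t;
   recompiling with GPU_BUILD flips every calculation to floating point */
static coeff_t sad_4(const coeff_t *a, const coeff_t *b)
{
    coeff_t sum = 0;
    for (int i = 0; i < 4; i++) {
        coeff_t d = a[i] - b[i];
        sum += d < 0 ? -d : d;
    }
    return sum;
}
*******************************
whether the results stay bit-exact after a switch like that is a separate question, but the mechanics really are that simple.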
Do you have anything that is not adult?
there is one more thing i wanted to address with regards to the memory issue objections you had raised. i know you based that on what the x264 developers said, because i was reading through that "diary of an x264 developer" and they said almost exactly the same thing, but the more i thought about it the less sense it made, and here's why:
when you run a software based encoder you only have access to the system ram; while i may have 4 gigs of ram, i would think that about 2 gigs is common for most users. a gpu has its own frame buffer, and modern gpus, even low end ones, have as much as 512 mb of ram; my 9600 gso has 768 mb, and the really high end ones have as much as 1.5 to 2 gigs of ram. and since the advent of agp cards it is possible to flush just parts of the frame buffer to main memory (in the pci days it was all or nothing: first you filled up the card's memory and then you could make use of system ram, now it can be any combination), so a gpu encoder has access to system ram + video card ram.
i also ran a quick test, using avidemux as a front end and encoding using x264 with 4 threads, 8 threads and 12 threads, just to see how much ram is used. with 4 threads, encoding to 720x480 at 3mb/s, avidemux used about 170 mb of ram; with 8 threads it used 185 mb; with 12 threads we hit 200 mb. so, extrapolating some expected ram usage: every additional 4 threads requires roughly 15 extra mb of ram, thus starting from 12 threads and 200 mb and adding another 40 threads (10 more blocks of 4, so roughly 150 mb more), we would need a total of about 350 mb of ram to run 52 threads under x264, which is well within the capabilities of any mid range card. we also need to remember that graphics cards, with the exception of the low end cheap variants, use much faster ram than system ram; top of the line cards use gddr5 while top of the line desktops use ddr3.
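and checking how much card memory is actually available is a one-liner against the cuda runtime:
*******************************
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;

    // ask the runtime how much of the card's frame buffer is free
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("gpu memory: %zu MB free of %zu MB total\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
*******************************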
i also must congratulate you, you've proved to be an excellent master debater ( :P ). i hope this thread was as enjoyable for those reading it as it was for me. -
Originally Posted by rallynavvie
i have no problem with poison, he strikes me as a decent guy, he has certainly helped me, and many others, in the past, and he certainly is knowledgeable as far as video is concerned. it's just that he seemed to have believed the excuses the x264 developers have made with regards to gpu acceleration as God's Own Truth, and the sad truth of the matter is that from a coding standpoint they just don't hold water. the only valid obstacle is that they chose to use a programming technique that is currently unsupported by nvidia gpu architectures (i can't seem to find any documentation to see if ati hardware also doesn't support function pointers), but that reality is a far cry from "gpus suck for video encoding". -
My eyes have started to bleed.
I think, therefore I am a hamster.