VideoHelp Forum
  1. Banned
    Join Date: Nov 2005
    Location: United States
    i know, the title sounds like trolling but it's not, really; it's more like "i told you so". in the past, when people have asked for cpu recommendations and most of the replies have gone toward either an intel or an amd cpu, someone else invariably comes along and links to an x264 benchmark that supposedly shows the other is also a viable choice.

    i have always said that x264-based benchmarks are poor indicators of true performance, primarily because the benchmarks that employ it use poorly written scripts and poor sources, and of course because x264 is coded by knuckle-dragging chimps and, despite what anyone tells you, does not really scale well with higher thread counts*.

    with this in mind, check out this test that anandtech did of a beast of a system with four 8-core/16-thread cpus:

    http://anandtech.com/show/7121/trials-of-an-intel-quad-processor-system-4x-e54650l-from-supermicro

    a quad E5-4650L setup actually gets beaten by a single 3930k. mind you, the quad cpu setup has 32 physical cores and can handle 64 threads simultaneously, while the 3930k has only 6 cores and can handle only 12 threads simultaneously.

    also keep in mind that x264 by default launches 1.5 times the number of threads a system can handle, and supposedly x264 is coded to be able to launch up to about 120 threads.
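    as a back-of-the-envelope model (the 1.5x multiplier and the ~120-thread cap are just the figures quoted above, not taken from x264's actual source):

```python
def default_x264_threads(logical_threads: int, cap: int = 120) -> int:
    # rough model of x264's automatic thread count: 1.5x the number of
    # hardware threads, capped; both numbers come from the post above,
    # not from x264's code, so treat this as an illustration only
    return min(int(logical_threads * 1.5), cap)

print(default_x264_threads(64))  # quad E5-4650L, 64 hardware threads -> 96
print(default_x264_threads(12))  # 3930k, 12 hardware threads -> 18
```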

    *in all fairness, there is much more at play here than the numbers reveal.

    first, the dagnabbit chicken pluckers that developed sux264 have repeatedly said that there are parts of their suck-ass encoder that they can't parallelize, primarily because they don't quite have the balls, so if any clown pounder ever tells you that x264 scales well with thread count, just remind them that the developers of the lamest show on earth known as x264 have said that there are parts they aren't smart enough to multi-thread.

    but more importantly, as big a bombastic simpleton as any sux264 developer may be, the butt-fingering numb-nuts that put out that ridiculous x264 HD benchmark are even more pathetic. they chose faster settings for their assmark that don't really stress a system to its max.

    the biggest raspberry has to go to Ian Cutress himself for coming out of the idiot closet with such reckless abandon and to Anand for hiring his worthless ass to write for his site.

    i want to know what kind of crack one needs to be smoking to get their hands on a 4-way hyperthreaded octo-core setup and decide to run a benchmark that uses 720p mpeg-2 as its source and 4mb/s 720p x264 with the very fast setting (i think that's the one they use) as the target?!?

    if you really wanted to stress all the logical cores, a custom benchmark should have been used with all of the x264 settings maxed out and a much higher resolution, maybe even a 4k source, so that we could see some separation between multi-cpu and single-cpu setups.

    seriously, who in their right mind would build this kind of system and then encode 4mb/s 720p avc?

    thank you, scum again.
  2. x264 is the perfect benchmark -- for x264 encoding.

    It's not clear if they manually specified the number of threads to use, since earlier in the article they mentioned that the OS was only reporting half the number of processing threads:

    In both Windows Server 2008 R2 Standard and 2012 Standard, the system would detect all 64 threads in task manager, but only report 32 threads to software.
    So x264 may have been using only half the number of threads it should have on the quad 4650L.

    Of course, the reason for converting 720p to 720p was given:
    a standardized result which can be compared across other reviews
    Last edited by jagabo; 12th Jul 2013 at 09:17.
  3. Originally Posted by deadrats View Post
    i know, the title sounds like trolling but it's not really, more like "i told you so".
    The title doesn't sound like trolling, but your derogatory references to the x264 encoder certainly do.

    Based on your post my question would be, if you want h264 encoding which scales well with higher thread counts and can produce the same or better quality than the x264 encoder, what are the alternatives? Which encoder should we be using?
  4.
    Originally Posted by jagabo View Post
    So x264 may have been using only half the number of threads it should have on the quad 4650L.
    even if the benchmark was using 32 instead of 64 threads, there's no reason a quad HT octo-core setup should lose out to a single hexa-core setup.

    Of course, the reason for converting 720p to 720p was given:
    i read this, but it's such a cop-out; by the author's argument, all reviews should still be doing 720x480 divx tests so that they can compare results from modern systems with P3 results.

    as you must be aware, 720p to 720p @4mb/s + very fast doesn't come anywhere near pushing even an i5 to the breaking point, much less a 64-thread monster.
  5.
    Originally Posted by hello_hello View Post
    Originally Posted by deadrats View Post
    i know, the title sounds like trolling but it's not really, more like "i told you so".
    The title doesn't sound like trolling, but your derogatory references to the x264 encoder certainly do.

    Based on your post my question would be, if you want h264 encoding which scales well with higher thread counts and can produce the same or better quality than the x264 encoder, what are the alternatives? Which encoder should we be using?
    in all fairness, i made derogatory statements about the article's author as well as the creator of the benchmark, and most of what i posted was obviously meant to elicit a few chuckles as well as let people know that x264-based benchmarks are not the be-all and end-all of performance assessment.

    with regards to consumer-grade avc encoders, the software mainconcept encoder maxes out at 16 threads, and it's really the only major competitor.

    personally i think x264, mainconcept, and just about every other multi-threaded encoder developer went about threading their software the wrong way. what they have done is focus on parallelizing large portions of the various algorithms employed, but basically the entire stream is processed linearly, frame by frame. this was the wrong approach from the get-go, because you need to worry about locks, race conditions, memory synchronization, etc., which adds code complexity, increases development and debugging time, and more importantly limits how threaded you can make your software. basically they took the approach of maximizing the usage of all the processing units within a core, but the problem is that some processes depend on the completion of other tasks, so it may look like you are getting the most out of the cpu when all you're really doing is keeping it under load without doing any useful work, sort of like a car spinning its tires, producing a big cloud of smoke but not getting any movement for the wasted fuel. in a nutshell, the software developers made the classic mistake of assuming that every part of a cpu is meant to be fully utilized at all times, i.e. load up the alu's, simd units, fp units, and caches all at the same time. when you only have one core, this is the approach you need to take; when you have a large number of cores, this approach quickly reveals its shortcomings.

    what they should have done is employ an SVE approach, sort of like the way distributed computing works, where the source file is simply cut up along some boundary (with video, closed GOPs seem like the most logical choice), each segment is processed in a single thread, and then the results are concatenated into a single output file. if they had taken this approach from the get-go, gpu-accelerated encoders would be the norm by now, as such a threading approach is easy to implement, easy to maintain, and scales nearly linearly all the way up to the limit of available segments. it also makes simd optimization much easier. basically all encoder developers lacked vision; it's almost like they never expected consumers to have quad-core hyper-threaded processors in their home computers.
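    a minimal sketch of that segment-and-concatenate flow (the stub below stands in for a real single-threaded encode, e.g. an x264 call with --threads 1, so this illustrates the plumbing only, not a working encoder):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_segment(frames):
    # stub: stands in for one single-threaded encode of a closed-GOP
    # segment (in practice, a separate encoder instance per segment)
    return ["enc(%s)" % f for f in frames]

def split_on_gops(frames, gop_size):
    # cut the frame list at fixed closed-GOP boundaries
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]

def sve_encode(frames, gop_size, workers=4):
    segments = split_on_gops(frames, gop_size)
    # each segment is encoded independently; a real implementation would
    # use separate processes or encoder instances rather than threads
    with ThreadPoolExecutor(max_workers=workers) as pool:
        encoded = pool.map(encode_segment, segments)  # order is preserved
    # concatenate the independently encoded segments in source order
    return [f for seg in encoded for f in seg]

print(sve_encode(list(range(6)), gop_size=2, workers=3))
# -> ['enc(0)', 'enc(1)', 'enc(2)', 'enc(3)', 'enc(4)', 'enc(5)']
```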
  6. Originally Posted by deadrats View Post
    what they should have done is employed an SVE approach, sort of like the way distributed computing works, where the source file is simply cut up along some boundary, with video closed GOP seems like the most logical choice, each segment is processed in a single thread and then the results are concatenated into a single output file.
    You can somewhat simulate that. Split a video file into 8 sections (or use 8 AviSynth scripts to access different sections of a single file), and feed them to 8 instances of x264 running single threaded. Then watch your system come to a standstill.
  7.
    Originally Posted by jagabo View Post
    You can somewhat simulate that. Split a video file into 8 sections (or use 8 AviSynth scripts to access different sections of a single file), and feed them to 8 instances of x264 running singled threaded. Then watch your system come to a standstill.
    you know, i would like to try that, how do i set x264 to run in single threaded mode?
  8. Originally Posted by deadrats View Post
    how do i set x264 to run in single threaded mode?
    --threads=1

    A short test on my i5 2500K showed 4 single-threaded encodings (Task Manager showed 100 percent CPU usage) achieving about the same total fps as a single multithreaded encoding (also 100 percent CPU usage). With 8 sections you may still be OK. But when the number of threads gets large enough you will run into cache thrashing and throughput will drop 100-fold.
    Last edited by jagabo; 12th Jul 2013 at 22:52.
  9.
    i just realized that i could run the test using media coder, as the developer has added SVE to his app. i used a 12mb/s 1920x1080p 29.97fps vc-1 source file and i targeted x264 @ 12mb/s, same resolution, slow preset, with audio.

    in the first test i set threads to 4 and set the affinity within media coder so that all the threads only ran on physical cores (i.e. not an HT core); i saw about 11 fps. when i set the threads to 8 and the affinity to all logical cores, i saw 13 fps.

    i then went in and set threads to 1 and enabled sve with 4 and 8 concurrent segments; with 4 segments i averaged about 14 fps and with 8 segments about 15 fps. i decided to also try 16 segments and saw about 17 fps.

    the big difference is the cpu load: using x264's default threading model, cpu load with 8 threads was nearly 100%; with 16 segments, cpu load hovered right around 50%.

    with regards to cache thrashing, this would hold true of any threading model. the difference is that with sve-style threading the code is cleaner and easier to maintain, it's practically self-scaling (all you need to do is keep adding segments), and you don't have to worry about locks and memory management, as the OS takes care of that for you.
  10. Originally Posted by deadrats View Post
    what they should have done is employed an SVE approach, sort of like the way distributed computing works, where the source file is simply cut up along some boundary, with video closed GOP seems like the most logical choice, each segment is processed in a single thread and then the results are concatenated into a single output file. if they had taken this approach from the get go, gpu accelerated encoders would be the norm by now, as such a threading approach is easy to impliment, easy to maintain and scales nearly linearly all the way up to the limit of available segments. it also makes simd optimization much easier. basically all encoder developers lacked vision, it's almost like they never expected consumers to have quad core hyper-threaded processors in their home computers.
    Beyond the other issues with this approach, the quality and compression efficiency is also lower.

    The more "cuts" you make, the lower the quality. The more arbitrary the cuts (e.g. maybe you cut up evenly and distribute evenly), the lower the quality. There are fewer similarities and you cannot use references from outside that section. You end up with GPU-encoder-type quality.

    It's a similar (but less severe) issue with threading now: the more threads, the lower the quality.
  11. Originally Posted by poisondeathray View Post
    The more "cuts" you make, the lower the quality.
    Not necessarily. Some threads can perform a look-ahead and determine optimal cut locations (eg, look for scene changes) then apportion sections to other threads for encoding.
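    A toy version of that look-ahead, assuming some per-frame difference metric is already available (the metric, threshold, and spacing here are placeholders, not what any real encoder computes):

```python
def find_cut_points(frame_diffs, threshold, min_gap):
    # frame_diffs[i] = some difference metric between frame i and i-1;
    # cut where the metric spikes (a likely scene change), keeping cuts
    # at least min_gap frames apart so segments don't get too tiny
    cuts = []
    last_cut = 0
    for i, diff in enumerate(frame_diffs):
        if diff > threshold and i - last_cut >= min_gap:
            cuts.append(i)
            last_cut = i
    return cuts

# two spikes, at frames 2 and 6 -> two segment boundaries
print(find_cut_points([1, 1, 9, 1, 1, 1, 9, 1], threshold=5, min_gap=2))  # -> [2, 6]
```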
  12. Originally Posted by deadrats View Post
    with regards to cache thrashing, this would hold true of any threading model
    No. With many threads working on a few frames at once, all of those frames will fit within the shared L3 cache. With many threads working on different parts of the video, each with their own few frames' worth of video memory, the shared L3 cache will be overwhelmed. Any random access to DRAM will require hundreds of CPU clock cycles. Random access to the hard drive can take thousands of CPU clock cycles.
  13. Banned
    Join Date: Oct 2004
    Location: Freedonia
    Originally Posted by deadrats View Post
    in all fairness, i made derogatory statements about the article's author as well as the creator of the benchmark and most of what i posted was obviously meant to elicit a few chuckles as well as let people know that x264 based benchmarks are not the be all and end all of performance assessment.
    Hmm... my personal takeaway from your way too many posts on X.264 is that you think it blows chunks and may in fact be worse than MPEG-1 at below-VCD bit rates.

    You really need to get a life. Getting your panties in a wad and getting all worked up 100% of the time here about X.264 is just pointless.

    The number of people in the world who give a crap about how fast their damn X.264 encodes go is pretty low. Personally, I just need it to work. I really could not possibly care less if it takes 2 hours or 4 hours or 6 hours. I do not do encodes every day. When I do them, they just need to work and not take all day. Then again, maybe you're like those ADHD 20 year olds we get who come here and cry like babies because OMG they did something on the computer and it took more than 1 minute which is just an insanely unacceptable amount of time to spend on anything.

    You're as bad as those crazy people on Doom9 who adopt some minor part of the whole video field as their personal issue and spend their time ranting and raving about any posts that disagree with them. It's just sad that this is so important and all consuming to you.
  14. Originally Posted by jagabo View Post
    Originally Posted by poisondeathray View Post
    The more "cuts" you make, the lower the quality.
    Not necessarily. Some threads can perform a look-ahead and determine optimal cut locations (eg, look for scene changes) then apportion sections to other threads for encoding.

    Yes it is worse, and that's the way it works now with threaded look ahead.

    But deadrats is suggesting an SVE approach. The cuts are made simultaneously, beforehand. Currently, the threaded look-ahead looks linearly "x" frames within a single section, not multiple sections ahead simultaneously. The threaded look-ahead looks ahead, not behind, in linear display sequence. The more arbitrary the cuts, and the greater the number of cuts, the more sections will potentially miss MBs that could have been used as references. There is no reconciliation of overlapping sequences (although theoretically some threads could be allocated to fix that).

    He's suggesting that the developers are "lacking vision", but those are the reasons why HEVC and VP9 use a non-SVE approach: the driving force is compression, not scalability (and both are better than x264 in early tests), and the memory, cache, and performance issues make it not feasible to do. And they most certainly had access to multicore computers even before they started development, so it's not due to some "lack of vision".
  15. How relevant is a multi-cpu benchmark for most home consumers? Not very, as most home consumers do not have multi-cpu systems. H264 benchmarks are very important for anyone who encodes in h264, and is looking to choose a cpu.

    Still, a very interesting article. I recently wondered how fast a system with a couple of server cpus would encode, so I looked at some prices of the server motherboards and cpus, and then I wasn't very interested in how fast they could potentially encode! haha.
  16. Originally Posted by poisondeathray View Post
    But deadrats is suggesting a SVE approach. The cuts are made simultaneously, beforehand.
    I didn't take him to mean the segments would necessarily be equally spaced. The decisions on GOP boundaries can remain exactly as they are now. But each GOP would be worked on by a single core (or maybe a few), with several GOPs being processed in parallel. And even with fixed-size segments, an extra I frame here and there isn't going to kill compression -- unless you're using really tiny segments.
  17.
    Originally Posted by poisondeathray View Post
    The more "cuts" you make, the lower the quality. The more arbitrary the cuts (e.g. maybe you cut up evenly and distribute evenly), the lower the quality. There are fewer similarities and you cannot use references from outside that section. You end up with GPU-encoder-type quality.

    It's a similar (but less severe) issue with threading now: the more threads, the lower the quality.
    can you explain to me why the quality would be lower? with closed gops, frames do not reference any frame outside the gop boundary; in a sense each gop is a small self-contained microcosm, oblivious to any other gop's existence.

    so why would cutting up a file along gop boundaries lower the quality, and at what point are you claiming the quality is lowered -- I frames?
  18. Originally Posted by deadrats View Post
    Originally Posted by poisondeathray View Post
    The more "cuts" you make, the lower the quality. The more arbitrary the cuts (e.g. maybe you cut up evenly and distribute evenly), the lower the quality. There are fewer similarities and you cannot use references from outside that section. You end up with GPU-encoder-type quality.

    It's a similar (but less severe) issue with threading now: the more threads, the lower the quality.
    can you explain to me why the quality would be lower? with closed gops, frames do not reference any frame outside the gop boundary; in a sense each gop is a small self-contained microcosm, oblivious to any other gop's existence.

    so why would cutting up a file along gop boundaries lower the quality, and at what point are you claiming the quality is lowered -- I frames?

    1) closed is less efficient;

    2) Different GOP and frame-type allocation: I'm claiming the more arbitrary the cuts, the lower the quality, and it's a fact. I was under the assumption that you were doing this on a big parallel scale, like DC projects, equally spaced.

    If you look at any DC project, render farm, CGI, etc., that's how it's done. Work is evenly divided between nodes, simultaneously, beforehand. Some farms distribute it according to a benchmark, e.g. if you have some faster computers and some slower ones, the faster ones get proportionally more work -- so in that respect it's not "evenly spaced", but the point is the divisions don't occur where they would be optimal for video encoding.

    If you propose doing it the way jagabo interpreted it, then the GOP, frame-type, and MB allocation will be the same as if you didn't do it your way. But then you run into the memory and cache problems when doing HD video. You might only be able to run a few GOPs simultaneously.
  19.
    @jman: yeah, you tell him, he's been driving me crazy with that unhealthy obsession with x264 and the suck ass developers that coded it. it's really annoying, i have to live with this guy and all day running around in his head is how he can improve the quality of his porn encodes and that x264 kills way too many of the details he wants to keep. i just don't know what's wrong with this guy!!!

    @pdr: your claim of lowered quality due to a high number of segment cuts is easy to put to the test: download media coder and encode the same file twice, once using the traditional threading model with a decent number of threads (say 24), and once setting threads=1 with media coder set to use 24 segments (you will have to manually enter the number of segments in the advanced SVE section), then compare the output to see if you notice any difference in quality.

    the proof, as they say, is in the pudding: if i am correct you will not be able to see any difference in encoded quality, and if you're correct the difference should be obvious.

    feel free to post some screenshots after your tests.
  20. Originally Posted by deadrats View Post

    @pdr: your claim of lowered quality due to a high number of segment cuts is easy to put to the test: download media coder and encode the same file twice, once using the traditional threading model with a decent number of threads (say 24), and once setting threads=1 with media coder set to use 24 segments (you will have to manually enter the number of segments in the advanced SVE section), then compare the output to see if you notice any difference in quality.
    You're mixing up concepts. Threading and GOP are 2 different things.

    WRT threading, it's a proven fact: SSIM decreases with increased threads (and it will under either model). But whether or not you can "see" it depends on many factors, such as bitrate range relative to content complexity. I did mention earlier it's not as "severe" as the GOP issue, and in most scenarios with non-absurd thread counts it's probably negligible unless you are in a very low bitrate range.

    WRT GOP, it's a proven fact too: if you arbitrarily set some low max GOP size to truncate GOPs, you will get lower quality at lower-to-mid bitrate ranges due to the loss in compression efficiency (and it will under either model). At very high bitrate ranges you might even increase the quality. Now this is visible. (Do you remember when you did some encodes claiming xvid was better, and we had that long-ass thread? Go revisit your I-frame encode.)
  21. Originally Posted by deadrats View Post
    the proof, as they say, is in the pudding: if i am correct you will not be able to see any difference in encoded quality, and if you're correct the difference should be obvious.
    You won't be able to see the difference. But if you use CRF encoding you'll be able to measure it as bitrate or file size. I suspect we're talking about less than 1 percent differences with moderate numbers of segments. In the future when we all have 16,000 core CPUs it won't make sense to break a 2 hour movie into 5 frame GOPs. On the other hand, it won't make sense to break a 1920x1080 frame down into 16,000 12x12 pixel blocks either. Some middle ground will have to be found.
    Last edited by jagabo; 13th Jul 2013 at 19:19.
  22.
    Originally Posted by poisondeathray View Post
    1) closed is less efficient;

    2) Different GOP and frame-type allocation: I'm claiming the more arbitrary the cuts, the lower the quality, and it's a fact. I was under the assumption that you were doing this on a big parallel scale, like DC projects, equally spaced.

    If you look at any DC project, render farm, CGI, etc., that's how it's done. Work is evenly divided between nodes, simultaneously, beforehand. Some farms distribute it according to a benchmark, e.g. if you have some faster computers and some slower ones, the faster ones get proportionally more work -- so in that respect it's not "evenly spaced", but the point is the divisions don't occur where they would be optimal for video encoding.

    If you propose doing it the way jagabo interpreted it, then the GOP, frame-type, and MB allocation will be the same as if you didn't do it your way. But then you run into the memory and cache problems when doing HD video. You might only be able to run a few GOPs simultaneously.
    look, you're a good guy, so i'm not going to be an ass and point out that what you said above is logically inconsistent; instead i'm going to give you the opportunity to reread what you posted and see if you catch where you seem to be playing both sides of the ball (hint: paragraphs 2 and 3 contradict each other and don't jibe with what i have said previously, nor with what you claimed in previous posts to have understood me to mean).

    take some time, run a few tests, have a beer or two, then we can continue this discussion once we have some solid proof of your claims.

    again, not to come off like a dick but somewhere along the line you seem to have gone slightly askew with your beliefs in regards to SVE.

    with regards to closed gops being less efficient, i beg to differ. with an open gop you end up with scene changes that lack a key frame, and with ridiculously long sections of video comprised of P and B frames, both of which are encoded with higher quantizers than I frames. i hate to break this to you, but the more B frames a movie is composed of, the lower the overall quality; it's just the way it is. there's a reason, other than seeking, that the blu-ray and hddvd specs call for closed gops equal to the frame rate.
  23. Originally Posted by deadrats View Post

    look, you're a good guy, so i'm not going to be an ass and point out that what you said above is logically inconsistent; instead i'm going to give you the opportunity to reread what you posted and see if you catch where you seem to be playing both sides of the ball (hint: paragraphs 2 and 3 contradict each other and don't jibe with what i have said previously, nor with what you claimed in previous posts to have understood me to mean).

    take some time, run a few tests, have a beer or two, then we can continue this discussion once we have some solid proof of your claims.

    again, not to come off like a dick but somewhere along the line you seem to have gone slightly askew with your beliefs in regards to SVE.

    with regards to closed gops being less efficient, i beg to differ. with an open gop you end up with scene changes that lack a key frame, and with ridiculously long sections of video comprised of P and B frames, both of which are encoded with higher quantizers than I frames. i hate to break this to you, but the more B frames a movie is composed of, the lower the overall quality; it's just the way it is. there's a reason, other than seeking, that the blu-ray and hddvd specs call for closed gops equal to the frame rate.
    Maybe I've misinterpreted what you said, but then please clarify what you mean.

    Did you mean something along the lines of what jagabo was saying? (because in reality it won't work for more than a few GOPs simultaneously for HD video, unless you use tiny GOPs. A threaded look-ahead of about 175-200 frames at 1920x1080 with 1 section encoding will take about 4.5GB of memory. How many GOPs can fit in 200 frames? 400 frames? Of course it depends on many things, such as the content and the encoding goals, but you're not going to have very many GOPs in parallel with today's computers and memory)
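    For scale, the raw-frame portion of that figure is simple arithmetic (assuming uncompressed YV12 / 4:2:0 frames at 1.5 bytes per pixel; attributing the remainder of the quoted ~4.5GB to encoder state such as motion vectors and analysis buffers is an assumption here):

```python
# raw 4:2:0 (YV12) video uses 1.5 bytes per pixel
width, height, frames = 1920, 1080, 200
raw_bytes = width * height * 1.5 * frames
print(round(raw_bytes / 2**30, 2), "GB of raw frames alone")  # -> 0.58 GB
```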

    You run the tests and try to disprove these are known facts, not claims. I've run them before and so have many others and posted many results. Search if you want.

    1) The more truncated the GOP , the less efficient the encoding => fact (except in very high bitrate ranges)

    2) more threads lead to lower SSIM => fact

    3) open is more efficient than closed => fact


    i hate to break this to you, but the more B frames a movie is composed of, the lower the overall quality; it's just the way it is. there's a reason, other than seeking, that the blu-ray and hddvd specs call for closed gops equal to the frame rate.
    If you have a lot of bitrate, yes. In fact, if you have "infinite" bitrate, you might as well use all I frames. In normal to low bitrate ranges, no, you're absolutely wrong (did you not learn from that long-ass thread, or is xvid still "better" for you? Do you remember what happened when you encoded with no b-frames?)
    Last edited by poisondeathray; 13th Jul 2013 at 19:31.
  24.
    @pdr:

    re: xvid, you're going to hate what i'm about to say but i firmly believe that xvid gives superior visual quality to x264 when xvid's settings are all maxed out and x264's are not.

    re: the SVE/distributed computing approach to threading; let me make this as perfectly clear as i possibly can. instead of coding any codec, not just x264, so that threading comes from multiple slices per frame (though the blu-ray spec does demand it) or from frame-based threading (where one thread carries out cabac, one carries out motion search, one carries out sub-pixel refinement, one carries out dct and trellis, one carries out macroblock detection and deblocking, one carries out psycho-visual enhancements, one works on b-frame placement, etc., and all these threads are spread across all the available cores, just to be retired and then respawned for the next frame), a better approach would have been to just have 1 thread look at the entire source file, find the gop boundaries, cut the file into segments, and process each segment linearly on a separate core, so that with a 4-core cpu 4 segments are being processed simultaneously.

    my contention is that this approach would have led to much easier threading, less chance of locks or race conditions, easier scaling (as you simply add segments to match the available core count), simpler, cleaner, more maintainable code, and lower cpu usage, as a single thread does not load up a single core to 100%; if anything, a quad core would be able to handle, according to my tests, 32 segments before you loaded up all 4 cores, and i firmly believe that due to the simpler, cleaner code the quality would be better, not worse.

    this approach certainly doesn't hurt the visual quality of the 3d projects like cgi that employ it, and i don't see it making any difference whether the stream had a variable or fixed-length gop; you can cut a data file into as many pieces as you want and concatenate the results back, and the quality of the file is not impacted in any way. in fact it borders on absurd to expect it to be; after all, the stream is only composed of electrical states represented as 0's and 1's, and if you cut 100011110000111 into individual 1's and 0's and stitch it back together it doesn't change in any way.
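    that bit-identity claim is trivially checkable; note it only shows that splitting and rejoining a stream is lossless at the byte level, which is separate from the compression-efficiency question of where the cuts fall:

```python
def split_stream(data, chunk_size):
    # cut a stream into fixed-size pieces
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

bits = "100011110000111"
assert "".join(list(bits)) == bits  # individual 1s and 0s, stitched back

stream = bytes(range(32))  # stand-in for an encoded bitstream
for size in (1, 4, 7):
    assert b"".join(split_stream(stream, size)) == stream
print("all splits round-trip losslessly")
```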

    i think x264 is a classic example of an over-engineered piece of software, where the developers wanted to show how clever they are and so added all sorts of shit that they themselves say shouldn't be used by most people. they coded it using handcrafted assembler, because somehow the assembler generated by optimizing compilers wasn't good enough; they added code bloat and complexity by coding multiple code paths for various processors, because, you know, using properly structured C with compiler intrinsics and a good optimizing compiler would somehow slow their little baby by a few cycles; and they threaded it like the 17-year-old kid that buys his first car and chromes the shit out of it to the point that my balls are shiny.

    in fact, at times it seemed to me that x264 was coded the way it was not because the developers actually believed it was the best way to write their code, but rather as a functioning resume to show potential employers that they knew how to implement a given programming technique, kind of like cooking a meal with every protein and starch known to man just to prove that you know how to cook everything.
  25. Originally Posted by deadrats View Post

    re: xvid, you're going to hate what i'm about to say but i firmly believe that xvid gives superior visual quality to x264 when xvid's settings are all maxed out and x264's are not.
    Yeah,... there's some logic for you. Just think about that statement for a second...

    If I break Tiger Woods' arms (and 1 leg), I'll get a better golf score than him.




    so that threading comes from multiple slices per frame (though the blu-ray spec does demand it)
    Only for certain blu-ray formats and profiles, namely L4.1. L4.0 doesn't require it.


    a better approach would have been to just have 1 thread look at the entire source file, find the gop boundaries, cut the file into segments, and process each segment

    linearly on a separate core, so that with a 4 core 4 segments are being processed simultaneously.
    In terms of scaling mechanics and easier threading, I agree. But there are still issues with memory, cache, and scheduling.

    But the problem in terms of quality is how to "find the gop boundaries" and frame types optimally. Let's say you have 8 logical cores. Do you randomly pick x frames and divide them up? If you have an 8000 frame video, do you divide them up evenly?

    If you do it with the lookahead thread - how far does that lookahead thread look? I mentioned in an earlier post ~4.5GB for 1920x1080 with a 200 frame lookahead and 1 section encoding... and it gets larger the higher the lookahead distance. That's concurrent lookahead + encoding. Or did you mean a 2 pass encode (1 or more threads look ahead through the whole file, then divide it up accordingly before encoding)?
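    To put a floor under the memory figures being discussed, here is the raw-frame arithmetic. This is an illustration, not x264's actual allocation: x264 also keeps per-frame analysis and lowres buffers, so real usage is considerably higher than this lower bound.

```python
# Lower bound on lookahead memory: each decoded 4:2:0 8-bit frame costs
# width * height * 1.5 bytes, and N independent segment encoders each
# keeping their own lookahead multiply that cost by N.
def lookahead_bytes(width, height, frames, segments=1):
    frame_bytes = width * height * 3 // 2   # YV12: luma + two quarter-size chroma planes
    return frame_bytes * frames * segments

one = lookahead_bytes(1920, 1080, 200)                  # single encoder instance
print(f"{one / 2**20:.0f} MiB of raw frames for one 200-frame lookahead")
sixteen = lookahead_bytes(1920, 1080, 200, segments=16)
print(f"{sixteen / 2**30:.1f} GiB if 16 segments each keep their own lookahead")
```

    So even counting only the decoded frames, a deep lookahead multiplied across many concurrent segments eats memory fast, which is the scaling concern here.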

    Earlier I jumped to a conclusion because you mentioned DC and GPU encoding being the norm - I assumed you meant massive parallelization. If you only meant something on the order of 8-16 cores and therefore segments, the difference in quality is going to be negligible, except at very low bitrates.



    But are you saying, for example, a movie divided up into 8-16 parts, something in that order? Then I misinterpreted what you said, and the GPU has no place except for some accessory calculations.

    That's basically the way ripbot's distributed encoding mode works now, except for the single thread part (it's still cores*1.5 by default, but you can set it to whatever), and that it's over a network with multiple computers, not a single computer, and there is no "smart analysis" before divvying up the work. But the raw avc segments are "stitched" together at the end.


    this approach certainly doesn't hurt the visual quality of the 3d projects like cgi that employ it and i don't see it making any difference if the stream had a variable or fixed length gop, you can cut a data file into as many pieces as you want and concatenate the results back and the quality if the file is not impacted in any way, in fact it borders on absurd to expect it to, after all the stream is only composed of electrical states represented as 0's and 1's, if you cut 100011110000111 into individual 1's and 0's and stitch it back together it doesn't change it in any way.
    Network rendered CG is always rendered as still frames, with no temporal compression.

    Video with temporal compression is different, so it does matter.

    So was that the "contradiction" or "logical inconsistency"?

    And that's a big reason why CG is ALWAYS done in that manner - often certain frames have to be re-rendered for whatever reason: maybe the client wants a change, maybe one computer crashes, etc. In CG, single frames can take hours to render (sometimes days, LOL, it makes the reference HEVC encoder look fast!), so redoing the minimum required frames is a huge benefit.



    in fact, at times it seemed to me that x264 was really coded the way it was coded not because they actually believe that this was the best way to write their code but rather as a functioning resume that they could show potential employers that they knew how to implement a given programming technique, kind of like cooking a meal and including every protein and starch known to man just to prove that you know how to cook everything.
    You're probably right, and there were a lot of "cooks in the kitchen", and probably a lot of inefficiencies. But in its time (the last few years), it still clearly produced the best results under most situations.
    Last edited by poisondeathray; 13th Jul 2013 at 22:34.
  26. Originally Posted by deadrats View Post
    have 1 thread look at the entire source file, find the gop boundaries, cut the file into segments, and process each segment linearly on a separate core, so that with a 4 core 4 segments are being processed simultaneously... according to my tests, 32 segments before you loaded up all 4 cores
    On my quad core i5 2500K I was getting nearly 100 percent CPU usage with just 4 segments of 1 thread each.

    This was from splitting a 19 GB Blu-ray rip into 4 equal (number of frames) sections via AviSynth scripts and feeding each script to x264 with the slow preset, CRF 18, --threads=1. The sum of the frame rates of the individual processes (26.65 fps) was approximately the same as when running a single instance of x264 with automatic threading (6 threads, 25.34 fps, also near 100 percent CPU usage). Don't take those numbers as precise, as I was using the computer a bit here and there while the encodings were running. But there was not a huge change, either way, in overall encoding rate. The sum of the four encoded segments (MKV) was 0.15 percent SMALLER than the single 6 thread encode.

    Obviously, this doesn't include the overhead of managing separate threads in one process and coordinating the output of different threads finishing at different times, etc. So it's not a perfect simulation of what we've been discussing.
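    One caveat worth making explicit when summing per-process frame rates like this: the honest combined figure is total frames over the wall-clock time of the slowest segment, and it only equals the sum of the individual readouts while all segments are still busy. A small helper (hypothetical, just to pin the arithmetic down):

```python
# Effective throughput of concurrent segment encodes: total frames divided
# by the wall-clock time of the longest-running segment. Summing the
# per-process fps readouts overstates it whenever segments finish unevenly.
def effective_fps(segment_frames, segment_fps):
    times = [frames / rate for frames, rate in zip(segment_frames, segment_fps)]
    return sum(segment_frames) / max(times)

# Four near-equal segments: close to the simple sum of the rates.
print(effective_fps([1000, 1000, 1000, 1000], [6.6, 6.7, 6.7, 6.65]))
# Two uneven segments: 10.0, not the 15.0 the summed readouts suggest.
print(effective_fps([1000, 1000], [10.0, 5.0]))
```

    With equal-length segments and similar content, the two figures agree, which is why summing the rates was a reasonable shortcut here.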

    Maybe tomorrow I'll see how many segments I can use before cache thrashing kills performance.
    Last edited by jagabo; 13th Jul 2013 at 22:58.
  27. Banned
    Originally Posted by jagabo View Post
    On my quad core i5 2500K I was getting nearly 100 percent CPU usage with just 4 segments of 1 thread each.
    how exactly did you test it? i tested it with 16 segments using media coder's SVE capabilities and cpu usage was barely over 50% on an i7 3770k.

    did you calculate an effective fps encoding rate?
  28. Banned
    But the problem in terms of quality is how to "find the gop boundaries" and frame types optimally. Let's say you have 8 logical cores. Do you randomly pick x frames and divide them up? If you have an 8000 frame video, do you divide them up evenly?
    you can't think of any way to find gop boundaries? how about you search the file looking for an I frame and cut it at that point. correct me if i'm mistaken, but doesn't an I frame signal the start of a new gop in closed gop streams?

    you obviously can't randomly pick x frames and divide them up, because some frames will be referencing frames in the other segment, and then you end up screwing everything up. and yes, i'm sure the results would look like shit without a ton of code for garbage collection.

    Earlier I jumped to a conclusion because you mentioned DC and GPU encoding being the norm - I assumed you meant massive parallelization. If you only meant something on the order of 8-16 cores and therefore segments, the difference in quality is going to be negligible, except at very low bitrates.

    But are you saying, for example, a movie divided up into 8-16 parts, something in that order? Then I misinterpreted what you said, and the GPU has no place except for some accessory calculations.
    i mentioned distributed computing as a way for people to visualize what i was talking about, in case some aren't familiar with segmented video encoding. i mentioned gpu encoding because with an SVE approach, if you were to cut a file into segments of say 250 frames apiece, you could assign a segment to each gpu "core" and yes, you would end up with massive parallelism. the 8-16 segments was just a proof of concept test i did using media coder; for gpu encoding you would obviously push the segment count well into the hundreds.

    of course, as i have already pointed out in other posts, you would need to use a non-primary gpu for such a task, as loading up all the gpu cores and onboard ram would slow your desktop to a standstill.

    the basic point is to end up with encoders that have clean code, are easy to maintain, and scale easily as core count goes up. i have spent quite a bit of time looking at the x264 code and my God, i have to wonder what the **** these guys were thinking. it's one giant kludge; they just kept adding code on top of code, like tenants adding layer after layer of paint on a wall without ever scraping the old paint off.

    we, as end users, are lucky that hevc and/or vp9 is poised to take over as the go-to choice for encoding, because if the x264 developers ever said "**** this, i'm retiring" there wouldn't be anyone who could maintain that code base, outside of maybe some deep pocketed company like main concept or microsoft that can afford to hire dedicated software engineers to work on the code.

    just trying to build the damn encoder from source, without the automated tools and scripts that some users have created, is like pulling teeth. i decided to try to build it manually, the old fashioned way; yeah well, fat chance, i ain't building shit anytime soon, that's for sure.
  29. Originally Posted by deadrats View Post
    Originally Posted by jagabo View Post
    On my quad core i5 2500K I was getting nearly 100 percent CPU usage with just 4 segments of 1 thread each.
    how exactly did you test it?... did you calculate an effective fps encoding rate?
    I added more details to my last post. In short, 4 segments at 1 thread each encoded at the same overall frame rate and delivered the same file size as one segment at x264's default 6 threads (1.5 x cores).
  30. Originally Posted by deadrats View Post
    But the problem in terms of quality is how to "find the gop boundaries" and frame types optimally. Let's say you have 8 logical cores. Do you randomly pick x frames and divide them up? If you have an 8000 frame video, do you divide them up evenly?
    you can't think of any way to find gop boundaries? how about you search the file looking for an I frame and cut it at that point. correct me if i'm mistaken, but doesn't an I frame signal the start of a new gop in closed gop streams?
    It's an "IDR" frame that starts a new GOP (you can have non-delimiting "i" frames with open GOPs).

    I was asking about distributing the frame types and GOP composition for the new encode, not the source. (You're trying to plan the optimal way to encode the file, for the most efficient compression.)

    When you encode, the source video is decoded to uncompressed frames before encoding (so the source's frame types and GOP structure don't really carry over).
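    The IDR-versus-"i" distinction is easy to operationalize. A toy illustration (the frame types here are made-up metadata; in practice you would pull them out of the bitstream with a parser): only IDR frames are safe places to cut, because a non-IDR "i" frame in an open GOP can still be referenced across the GOP edge.

```python
# Toy cut-point finder: only IDR frames delimit safely splittable runs;
# open-GOP "i" frames do not, since later frames may reference across them.
def safe_cut_points(frame_types):
    return [i for i, t in enumerate(frame_types) if t == "IDR" and i > 0]

stream = ["IDR", "P", "B", "i", "P", "IDR", "B", "P", "i", "B", "IDR", "P"]
print(safe_cut_points(stream))  # -> [5, 10]; the "i" frames at 3 and 8 are skipped
```

    Cutting the source this way sidesteps broken references, but it says nothing about how the new encode should lay out its own GOPs, which was the actual question.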




    we, as end users are lucky that hevc and/or vp9 is poised to take over as the go-to choice for encoding, because if these guys, the x264 developers ever said "**** this, i'm retiring" there wouldn't be anyone that could maintain that code base, outside of maybe some deep pocket company like main concept or microsoft that can afford to hire dedicate software engineers to screw around with the code.
    I only look at end results - mostly quality, but also speed, customizability, and maybe ease of use. I don't care if the code sucks as long as the encoder delivers. Unlike you, I don't care if the authors are jackasses or saints.

    And it looks promising for HEVC and VP9. Even at this early stage HEVC > VP9 > x264, and the delta is only going to get bigger.


