Follow-up: x264 CPU usage--threads vs. instances

8th Feb 2010 21:38 #1
Calidore

View Profile

View Forum Posts

Private Message
Member

Join Date
Jun 2007

Location
United States
Remember when "thread count" just referred to the quality of sheets?

In a forum thread a couple of days ago, there was some discussion of x264 multithreading, CPU usage, effect on quality, etc. As promised, I've done some testing on my new i7 860 (4 cores plus hyperthreading), 4 gig RAM system.

Please be aware that I'm aware that this isn't remotely scientific. Since no two systems are the same hardware, software, or configuration-wise, I'm not going to be anal about precision. (So no software stopwatches; I just used my eyes.) It's about the trends, not the nanoseconds. Also, I'll leave it to better eyes that mine to discern effects on visual quality. I'm including the resulting file sizes to illustrate effects on compression quality.

X264 says to set the threads to 1.5 times the number of cores, which in my case makes 12. So for the experiment I ripped the first 12 cartoons from my Tom & Jerry: Chuck Jones Collection DVD set (these got somewhat better cleanup treatment than the Hanna-Barbera cartoons on the Spotlight DVDs, so less noise for x264 to have to deal with). These were compressed with Virtualdub 1.9.8 using x264vfw ver. 1376. The only filter I used was vdub's built-in deinterlacing filter to eliminate the combing (which is hell on compression).

First I ran a single instance of Virtualdub and compressed one cartoon, changing the number of threads used by x264 as follows:

1 thread (control + curiosity): CPU usage 13-14%; time 22:30; final file size 106,800,090

4 threads (# of physical cores): CPU high 20s - low 30s; time 11:16; size 107,615,208

6 threads (physical cores * 1.5): CPU mid 30s - mid 40s; time 9:25; size 107,661,144

8 threads (# of virtual cores): CPU high 40s - high 50s; time 7:38; size 107,817,362

12 threads (virtual cores * 1.5): CPU mid 50s - low 60s; time 7:11; size 107,808,378

Interestingly, while 12 cores does work a bit faster than 8, we seem to hit the point of diminishing returns when we've directly maxed out the cores. I'ts also clear that hyperthreading is fully utilized, or the slope would have started between 4 and 6.

Meanwhile, the processor usage was never close to 100%, but that makes sense; x264 is simply dividing part of the work. When I briefly peeked at indidivual core usage, one was around 66% and the others were at about 24%.

File size rose with the thread count as well (except for a small drop between 8 and 12), which goes with one poster's theory that the multithreading could interfere with motion vectors.

Anyway, that's the thread test. Now the instance test, in which I put all 12 cartoons in a shared job queue, set x264 to a single thread, and ran multiple copies of Virtualdub.

I skipped doing 1 instance / 1 thread, because that projects to around 4 hours.

4 Virtualdubs: CPU 52-56%; time 1 hour 30 on the button

6 Virtualdubs: CPU 78-84%; time 1:10:30

12 Virtualdubs: CPU 100%; time 1:07:35 (This is an excellent way to bring your fire-breathing new system to its knees)

CPU usage here is a lot more straightforward, as you're running the complete application in parallel. Also, what a black hole does to nearby space, this will do to your available memory.

Tomorrow I'll try running the thread test on the entire queue instead of a single cartoon, and then I'll look for a sweet spot that most efficiently combines threads and instances, while hopefully still leaving a usable system.

Hope this offers something of interest.

Best,

Calidore

Quote
8th Feb 2010 21:49 #2
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
I suspect you are using low quality settings, or settings that were not multithreaded (like b-adapt 2), or that the deinterlacer , source filter, or vdub is the bottleneck

If you want some suggestions for improvements, I would recommend elminating confounding variables, so you actually test something like x264 scaling with threads, instead of something else like deinterlacing speed. You might do this by deinterlacing and or denoising to a lossless intermediate first, which would then serve as the "input". A good choice would be UT video codec , because it's decode speed is very fast, and less likely to be a bottleneck than something like lagarith

What version of x264vfw ? you mention r1376 but I haven't seen this build anywhere, can you provide a link?

If you wanted to implement "objective" metrics, you can use the built in psnr or ssim from x264 printed from each run , and the log file should give you fps, and total time etc.. thus you could very quickly make charts in excel or similar software. e.g. quality vs. speed at some setting etc... PSNR drop per thread at given encode settings etc..

On HD content, using higher quality settings, you should get very close to linear scaling and 100% cpu usage with the cli version

Cheers

Last edited by poisondeathray; 8th Feb 2010 at 22:06.

Quote
8th Feb 2010 22:48 #3
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
@Calidore

you testing methodology is slightly off, with a hyperthreaded cpu, cpu 0 is the first physical core, cpu 1 is that first cores 2nd logical core and so on, if you ran tests with 1, 4, 6, 8 and 12 threads, it very likely that the test with 4 threads was in fact only using cores 0, 1, 2, 3 (the first 2 cores plus 2 logical cores).

also i don't understand why running virtual dub + x264 with one thread should finish in 22:30 and then running a "1 instance/1 thread" test (which is the exact same test) should "project" to around 4 hours.

@poisondeathray

On HD content, using higher quality settings, you should get very close to linear scaling and 100% cpu usage with the cli version

i have never personally seen linear scaling with x264 other than going from 1 to 2 cores, despite what they claim, tests just don't back up linear scaling with an increase in simultaneous threads.

on my x4 620, using tmpg express + x264vfw, exact same settings for all tests with exception of thread count, i get the following encode times:

(source is 720x480, 4:3, interlaced, 32:37 mpeg-2/ac3 and output is 720x480 4:3, deinterlaced using "interpolation - animation II option, filters and decoding gpu accelerated, cpu only has to handle encoding and file I/O, 128kb/s audio, 3500 kb/s x264):

1 thread - a touch over 1 hour and 30 minutes

2 threads - just over 49 minutes

3 threads - dropped the encoding time to just over 40 minutes

4 threads - dropped it further still to a bit over 37 minutes.

now keep in mind this is with the above number of threads but the "thread queue" option set to 0, matching the number of threads to the number of queued threads, we see the following:

2 threads, 2 in queue - just over 50 minutes

3 threads, 3 in queue - a little over 39 minutes

4 threads, 4 in queue - we see just over 38 minutes

Quote
8th Feb 2010 23:00 #4
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
Originally Posted by deadrats

i have never personally seen linear scaling with x264 other than going from 1 to 2 cores, despite what they claim, tests just don't back up linear scaling with an increase in simultaneous threads.

Not a claim , a fact.

I see it all the time, as do many others. Hell, just look at the old Greysky HD benchmark results spreadsheet (collected from 100's of users). Not quite linear, like ~99% scaling. The version that test came bundled with is actually about 10% slower - it doesn't have recent optimizations e.g. for i7 etc...

You have some bottlenecks , like deinterlacing, filters, wrong x264 version, wrong settings. etc..

The tests do what they claim to; ie. test encoding speed of x264 encoder , not something else like bake a cake, run some other software, filters, deinterlace etc.. You should eliminate those confounding variables....Science 101...

In my tests, threads=hyper threaded cores is the fastest for i7 (i.e. 8 threads under most scenarios. I used to use 1.5x cores for the Q6600 , or 6 threads, as many other tests have shown to be optimal for speed)

Also some updated info in that old thread Calidore was talking about; I realize that originally referred to a quad core, but x264 is actually faster than xvid now on an i7 at similar or better quality. It used to be xvid was always faster under any condition; not anymore. Just use the --preset faster or veryfast , and it's about the same speed, but slightly higher PSNR. At --preset fastest , it's faster, but lower PSNR/SSIM

Last edited by poisondeathray; 8th Feb 2010 at 23:22.

Quote
9th Feb 2010 08:09 #5
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
The problem is the single threaded deinterlacing and the YUV/RGB conversions that VirtualDub performs when filtering. Those become the bottlenecks as the compression gets faster. Say for example the filtering takes 1 minute and the the h.264 compression takes 4 minutes with a single thread -- a total of 5 minutes for the full conversion. If x264 scaled linearly with a slope of 1, the full process with 4 threads would take 2 minutes -- 1 minute for the filtering and 1 minute for the x264 compression. Doubling the number of threads to 8 would then take 1.5 minutes -- 1 minute for the filtering and half a minute for the compression. Increasing the number of threads to infinity would still take 1 minute -- one minute for the filtering and 0 minutes for the compression.

Quote
10th Feb 2010 13:56 #6
Calidore

View Profile

View Forum Posts

Private Message
Member

Join Date
Jun 2007

Location
United States
Poisondeathray: I just referred to a previous thread when I probably should have spelled out what I was doing--testing x264 against itself to look for the best combination of threads/separate instances. I appreciate the improvement suggestions, but they don't really apply for my purposes.

X264vfw ver. 1376 can be found here:

http://www.free-codecs.com/download/DTS_x264_VfW.htm

Ignore the old "Changes" section. The config screen does show version 1376, built 12/28/2009

Deadrats: Thanks for the info re: hyperthreading. I'd assumed the physical cores would be used first, then logical if necessary. That would have saved some time had I known.

I'm curious: Aren't actual physical cores still faster than the virtual ones? Wouldn't it actually be more efficient to use the "real" ones first, then spill over into the "simulated" ones if necessary?

The 4-hour projection was for the entire set of 12 cartoons. I started with one because I didn't know how much difference the threading would actually make (hence the test ).

Interesting that in your test, using the thread queue actually slightly increased your times. Queuing sounds like a good thing, so why would that be?

For that matter, here's another one. In x264vfw 1376, the thread queue is gone; we just have a checkbox labeled "deterministic." This is checked by default and sounds like a good option, but when I tried the 8-thread test without it, compression finished two seconds faster (7:36) and the file was somewhat smaller (107,138,296 vs. 107,817,362). Should this stay unchecked?

Thanks for the responses, folks. I'm learning more than I expected.

Best,

Calidore

Quote
10th Feb 2010 16:33 #7
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by Calidore

Deadrats: Thanks for the info re: hyperthreading. I'd assumed the physical cores would be used first, then logical if necessary. That would have saved some time had I known.

I'm curious: Aren't actual physical cores still faster than the virtual ones? Wouldn't it actually be more efficient to use the "real" ones first, then spill over into the "simulated" ones if necessary?

The 4-hour projection was for the entire set of 12 cartoons. I started with one because I didn't know how much difference the threading would actually make (hence the test ).

Interesting that in your test, using the thread queue actually slightly increased your times. Queuing sounds like a good thing, so why would that be?

let's tackle your questions 1 at a time:

1) re: physical cores being faster than virtual ones - on the surface that would appear to be the logical expectation and if each thread was carrying out a completely different workload, for instance one thread handled audio encoding and the other thread handled video encoding, that is in fact what would happen, 2 physical cores would be faster than 1 physical and 1 logical core.

it starts to get quite a bit more complicated as the cpu architecture gets more complicated, the OSes thread schedulers get more intelligent and programmers threading techniques get more sophisticated.

under normal circumstances a thread doesn't get anywhere near to using all a cpu's resources, for instance if a thread is only performing 32 bit integer add, the ALU (which handles integer arithmetic among other things) is the only part of the cpu performing any work, the 64 bit registers are idle, the fp/sse hybrid is idle, some of the L2 will be free and so on, in a case like that, if you need to perform an ssefp operation, launching a second thread that runs on the same cpu will be faster (and more efficient) than having the thread run on a second core, primarily because the results are probably dependent on one another and it's easier to reconcile the results if the same cpu is handling all the operations (less overhead, less ping-ponging data between cores, less cache thrashing and so on).

2) re: thread queue - i kind of expected the results i got, if you set the encoder to use 4 threads and keep 4 threads in queue, you just increased the cpu's workload, because now the cpu needs to keep track of 8 threads rather than 4 and also manage how they are lined up for processing, it's usually better to launch threads on a as needed basis, reduces the instantaneous amount of ram used, reduces complexity of code, and thread creation is a faster process than keeping track of a thread in queue.

Quote
10th Feb 2010 17:05 #8
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by poisondeathray

Not a claim , a fact.

I see it all the time, as do many others. Hell, just look at the old Greysky HD benchmark results spreadsheet (collected from 100's of users). Not quite linear, like ~99% scaling. The version that test came bundled with is actually about 10% slower - it doesn't have recent optimizations e.g. for i7 etc...

You have some bottlenecks , like deinterlacing, filters, wrong x264 version, wrong settings. etc..

just for you i reran my tests, no de-interlacing, latest vfwx264 version, tmpg express with the hardware decode turned off, here are the results:

1 thread - just over 1 hour and 41 minutes

2 threads - a bit over 53 minutes

3 threads - just over 38 minutes

4 threads - a bit over 34 minutes

i don't know what to tell you, it clearly doesn't scale linearly, i did do another test where i checked the "disable all cpu optimizations" and reran the 4 thread test, it took just over 1 hour and 46 minutes.

so what does that tell us? basically that instead of wasting money on higher and higher core count and more and more threads (which becomes increasingly difficult to program anyway) we should be looking at optimizing our code and specialized instructions, perhaps like extending SIMD so that they are 256 bit operations and beefing up the SIMD units so that they are half or even quarter cycle.

i think you should run your own tests on your i7 (that is what you have, no?) see if in fact you do in fact see linear scaling with thread count, i'm willing to bet that you quickly start to see diminishing returns past the 2 thread point.

Quote
10th Feb 2010 17:20 #9
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
Not sure why, it might be the x264vfw version, the settings you used , or that your source is SD, or something b0rked with your setup. It might be the software YV12=>RGB=>YV12 conversion that TMPG enc GUI does (you bypass this and don't lose quality if you stay in YV12)

I'm telling you I get close to 98-100% on the CLI version, and everyone else does as well.

Try downloading the benchmark. It's standardized in terms of settings, x264 version, etc... So the only possible difference is your system configuration, and everyone gets close to 100% scaling. AMD, Intel, dual core, quad core - everybody. If you don't , there must be somehing wrong with your CPU or setup

Parallel encodes will always be more efficient, especially with bottlenecks

Quote
10th Feb 2010 19:26 #10
disturbed1

View Profile

View Forum Posts

Private Message
Get Slack

Join Date
Apr 2001

Location
init 4
so what does that tell us? basically that instead of wasting money on higher and higher core count and more and more threads

Threads != cores. The i7 is only a 4 core CPU with hyperthreading. With an application like x264 that does scale quite well and a hyperthreading cpu you'll get horrible scaling performance because of the hyperthreading. It's been described as cache-thrashing. The software (x264) believes there are more threads available then there actually is. These fake threads end up in a stale wait process that slows down throughput.

Turn off hyperthreading (perhaps in the bios), and re-run the test with the actual number of cores present. Not number of cores + fake threads. IMO I can't believe Intel is using this gimmick again, and people are actually falling for it ... again. I thought everyone tested, confirmed, and agree on the wastefulness of hyperthreading 8 years ago.

Linux _is_ user-friendly. It is not ignorant-friendly and idiot-friendly.

Quote
10th Feb 2010 19:32 #11
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
HT is useful in some applications. It depends if the software can use it.

In the case of i7 and x264 CLI, HT on increases FPS by ~30-40% over HT off (assuming no other bottlenecks), and is definitely advantageous. These results have been reproduced by dozens of review sites (not just me). In some applications, such as games, you actually get lower performance by a few percent!

You usually see about 1.5x performance with applications like BOINC and HPC computing where an entire process is spawned per thread

Quote
10th Feb 2010 19:40 #12
disturbed1

View Profile

View Forum Posts

Private Message
Get Slack

Join Date
Apr 2001

Location
init 4
Not so, hyperthreading destroys x264 results
While benching x264 we noticed enabling Hyperthreading does hurt the results considerably
Google turns up 1,000's of other discussions on the same subject.

Linux _is_ user-friendly. It is not ignorant-friendly and idiot-friendly.

Quote
10th Feb 2010 19:43 #13
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
Wrong.

No offense, but you trust that site fudzilla?

They probably don't even know how to use x264

I feel like this is a time machine. This HT scaling testing has been beat to death already.

Quote
10th Feb 2010 20:08 #14
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by poisondeathray

Not sure why, it might be the x264vfw version, the settings you used , or that your source is SD, or something b0rked with your setup. It might be the software YV12=>RGB=>YV12 conversion that TMPG enc GUI does (you bypass this and don't lose quality if you stay in YV12)

I'm telling you I get close to 98-100% on the CLI version, and everyone else does as well.

Try downloading the benchmark. It's standardized in terms of settings, x264 version, etc... So the only possible difference is your system configuration, and everyone gets close to 100% scaling. AMD, Intel, dual core, quad core - everybody. If you don't , there must be somehing wrong with your CPU or setup

Parallel encodes will always be more efficient, especially with bottlenecks

i downloaded both the SD and HD versions of the benchmark and i have to say there is no way for anyone to know what kind of scaling they get, i looked through the batch files and scripts that control each benchmark and there is nothing that allow the end user to specify how many threads to use, having said that i did see, across all the tests within each benchmark, i saw between 29 and 30 fps for the SD version and between 13.5 and 14.5 for the HD version and i noticed that the HD version "cheats" a bit by setting the priority to "HIGH" for x264 during the test runs.

now you know i'm a stickler for proof, how about showing me some proof of 98% to 100% scaling with thread count on your setup, a couple of screenshots will do, hell i'll even take your word for it, run from the cli tests with 1, 2, 3, 4 (on up to as many threads as your cpu can handle) and report the fps or encode time you see with any sample you like, i just want an honest report.

when you say close to 100% scaling that means that if your single thread performance is 5fps, with 2 threads it's 10fps, with 3 it's 15, with 4 it's 20 and so on, i would love to see any proof of x264 scaling in such a manner.

Quote
10th Feb 2010 20:19 #15
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
Seriously , I can't believe you're even questioning this... This is one of those well known facts in the video world, x264 scales very well in the absence of bottlenecks (bottlenecks could be anything from source filter, avisynth, encoding settings etc....) . That HD test should scale well on any computer to at least 8 cores (I've tested it on my dual quad workstation to 8 )

Yes, 99-100% scaling means 5fps at 1thread, 10fps at 2threads etc...(EDIT: sorry, I mean scales linearly with physical *cores* , by default x264 uses threads*1.5)

Deadrats - You're not going to believe me until you do it yourself. In the bios or msconfig bootup you can disable your cores. Do the tests at 1, 2, 3, 4 etc. I've done this exact same thing a year or two ago, like you I needed proof for myself. And as you know, I 'm probably even more of a sticker for proof than you.

Have a look at the spreadsheets, they haven't even been updated to include 100's of new reports from dozens of different forums, but I think there is enough sample size to see the very very clear trend. There is a big ass thread here and dozens of other forums posting their results for encoding, including here.

Last edited by poisondeathray; 10th Feb 2010 at 22:34.

Quote
10th Feb 2010 20:20 #16
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by disturbed1

Threads != cores. The i7 is only a 4 core CPU with hyperthreading. With an application like x264 that does scale quite well and a hyperthreading cpu you'll get horrible scaling performance because of the hyperthreading. It's been described as cache-thrashing. The software (x264) believes there are more threads available then there actually is. These fake threads end up in a stale wait process that slows down throughput.

Turn off hyperthreading (perhaps in the bios), and re-run the test with the actual number of cores present. Not number of cores + fake threads. IMO I can't believe Intel is using this gimmick again, and people are actually falling for it ... again. I thought everyone tested, confirmed, and agree on the wastefulness of hyperthreading 8 years ago.

first things first, they are not "fake threads", the threads are very real, the cores are 1 physical core plus 1 logical core, which is how they get each core to handle 2 threads simultaneously, no "fakeness" there.

since the core 2 intel cpu's have had more than enough L2 cache so that "cache thrashing" isn't such a big deal on desktop applications. furthermore, since the core 2, intel cpu's have been able to do instruction fusing (core 2 under 32 bit, i7 extended that to 64 bit) so that under the right circumstances, it can combine 2 instructions and treat them as 1, thereby allowing a dual core to execute up to 5 instructions per cycle and each dual core cpu has a single cycle SSE engine, meaning that they can fetch, execute and retire a 128 bit SIMD instruction in one cycle.

these architectural improvements are a perfect compliment for hyperthreading, there's a ton of idle resources within the average dual core, hyperthreading allows an application (and the OS) to put them to good use.

if the software isn't scaling it's not because of hyperthreading, it's because the software is improperly coded, the OSes thread scheduler isn't properly tuned and most importantly, the compiler used is written by chimps.

intel's compiler produces great code for hyperthreaded cpu's and has since the P4 days, and even without the intel compiler i remember i used to own a prescott 630 (3ghz, hyperthreaded) and a pentium d 2.8 (dual core pentium 4, 2.8ghz) and the 830 was able to beat it under most encoding benchmarks i ran, with the exception of main concept's encoder.

Quote
10th Feb 2010 20:54 #17
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by poisondeathray

In the bios or msconfig bootup you can disable your cores.

you can't disable the cores in msconfig, all you can do is tell the OS how many cores to use duing the boot up process, it will still use all your cores while the OS is running regardless of how you configure "number of preocessors"

as for the BIOS route, i'm a bit apprehensive about doing that because i have seen windows have a fit if the number of cores suddenly changes, not always, but i don't feel like having to do a clean install.

as far as the proof of 100% scaling is concerned, i scoured the web for various x264 benchmarks that would support such a contention and this site was indicative of what i found:

http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/11390-intel-core-i7-neha...review-14.html

as you can see, going from a QX9770 to an i7 920 we see a "massive" 3.5 fps increase and keep in mind that while the i7 920 is lower clocked it can not only handle twice the number of threads but according to the x264 developers:

http://x264dev.multimedia.cx/?p=51

First of all, the Nehalem has a much faster SSE unit than the Penryn. A huge number of SSE operations have had their throughput doubled:

so despite the i7's having a "much faster" sse unit than the penryn, despite being able to handle twice the number of cores, we don't see a doubling of performance, not even close.

i will tell you this, the more i read in x264 developer's "diary" the more i'm amazed that the codec even works at all, they do some really odd things, for instance they were converting integer ops into floating point ops, they do float ops on integer registers, reading through their "diary" one gets the picture of a programmer that coded himself into a corner and instead of going back and rethinking his approach he started using programming band aids, kind of like how we used to end up with "spaghetti" code because programmers would start using goto's like they were going out of style.

regardless, you claim that on a dual quad you have seen a near 100% scaling as thread number increases, so how many fps do you achieve on the hd x264 test?

Quote
10th Feb 2010 21:03 #18
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
You can set the core number in BOOT.INI (in XP anyway). Windows will only use the number of cores you list there. I'm not sure what happens with hyperthreading though. I don't know if you get 4 cores with no hyperthreading or 2 cores with hyperthreading.

Quote
10th Feb 2010 21:07 #19
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
Originally Posted by deadrats

http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/11390-intel-core-i7-neha...review-14.html

Those are 1st pass results. Try searching for 2nd pass results, or CRF results

so despite the i7's having a "much faster" sse unit than the penryn, despite being able to handle twice the number of cores, we don't see a doubling of performance, not even close.

You do see (near) doubling for physical cores, which all the results support.

iregardless, you claim that on a dual quad you have seen a near 100% scaling as thread number increases, so how many fps do you achieve on the hd x264 test?

I've sold it a long time ago for an i7. I think I posted results in a thread at videohelp. Near linear scaling.

Just do the fricken test deadrats. There's no way you will believe it until you see it with your own eyes (I'm the same way like you).

Or use a CLI build and enter the threads.

OK 2 birds , 1 stone, from the TechReport review of the i7: HT only added about 26%, but it will vary depending on many factors. Personally I've seen as high as 40% on/off (as have others at Doom9) and as low as 10%. Never slower in x264 (but as mentioned earlier, some apps do get slower with HT a few %). Most these review sites use non i7 patched builds (they don't implement the recent speed increases which offer ~8-10%)

This has all been discussed a year ago when i7 came out, not sure why we are digging all this old stuff up.

Quote
11th Feb 2010 15:20 #20
Calidore

View Profile

View Forum Posts

Private Message
Member

Join Date
Jun 2007

Location
United States
Deadrats: Thanks for the clear explanations re: threads and queueing. I looked up "deterministic" in the x264 docs to see what that means.

non-deterministic Default: Not Set
Slightly improve quality of SMP, at the cost of repeatability. It will enable multi-threaded mv and uses the entire lookahead buffer in slicetype decisions when slicetype is threaded.
Not for general use.
Recommendation: Default

Can someone translate "at the cost of repeatability" in this context?

Re: Hyperthreading better or worse: Easy enough to turn off hyperthreading in the BIOS and run my test again.

Recap--with hyperthreading on (8 effective cores), I got the following times:
4 threads: 11:16
6 threads: 9:25
8 threads: 7:38

Hyperthreading off, 4 physical cores only:
4 threads: 10:29
6 threads: 8:28
7 threads: 7:16
8 threads: 7:10

Well, that's interesting.

Best,

Calidore

Quote
11th Feb 2010 15:34 #21
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
deterministic means each time you encode under the same settings/conditions and x264 version, you will get the same bit for bit encode

non-deterministic means you might not (it will be slightly different)

Your results are interesting. With HT off (4 physical cores), the i7 generally acts like Core2quad behaviour, ie. the "fastest" threads setting should be 6 in the absense of bottlenecks, I'm confused as to how you got improvements going higher; it might be a number of things, I can only guess

If you still have other bottlenecks, hyperthreading might worsen performance. (e.g. a thread has to wait for the deinterlacer to finish before it can start, when the x264 algorithm has been tuned (and expecting) for a straight encode). HT will *always* improve a straight encode for i7 for 2nd pass and CRF mode in the absence of bottlenecks.

HT usually lowers 1st pass performace, because 1st pass is usually set to low quality (the settings are not high enough to benefit from threading, essentially they are maxed out speedwise for those quality settings at that clockspeed). If you use a slow 1st pass, it will behave with similar characteristics a to CRF or 2nd pass encode (i.e always improves in absence of bottlenecks) - so even though it will scale almost linearly with physical cores, the overall encoding speed will be slower because of the increased quality settings.

It's also very possible the vfw version has issues. The version you linked to is missing some components that it should have if it was correctly derived from the CLI version, like mb-tree, threaded lookahead. It might be missing other things as well.

For SD encodes, it's possible the settings you used were lower in quality. In another thread, one guy had 16 physical cores (a 4x4 AMD system), and was having difficulty scaling properly. It turned out, there were multiple bottlenecks, avisynth , some filters, some lower quality encoding settings. Even when these were removed/fixed (he fed a raw yuv input to get rid of bottlenecks), he still couldn't get >90% scaling until the source file used was changed from 720p to 1080p, and some quality settings were increased (better search algorithms, longer motion vectors). Too "insane" settings can destroy speed and scaling as well, eg. --me tesa is super slow and single threaded , --b-adapt 2 is not mulithreaded, so they become bottlenecks

There are too many variables to generalize settings used, conditions, etc... , hence the use of that standardized benchmark. It's so "replicable" and consistent across different system, that if your results vary from similar systems in the database, this suggests something wrong with your system (and as such can be used as a diagnostic tool)

Last edited by poisondeathray; 11th Feb 2010 at 16:11.

Quote
11th Feb 2010 20:17 #22
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by jagabo

You can set the core number in BOOT.INI (in XP anyway). Windows will only use the number of cores you list there. I'm not sure what happens with hyperthreading though. I don't know if you get 4 cores with no hyperthreading or 2 cores with hyperthreading.

not in boot.ini you can't:

http://support.microsoft.com/kb/314081

i know the option you are talking about, in xp it's listed as /numproc, it's present on vista and i'm guessing win 7, but all it does is specify how many cores the OS can use during boot up, try it yourself, set the number of processors to 1, restart and then enable the "show kernel times" in task manager, you will see that the kernel still uses all available cores, it's the HAL type used that determines how many cores the OS uses, not boot.ini.

Quote
11th Feb 2010 20:51 #23
deadrats

View Profile

View Forum Posts
Banned

Join Date
Nov 2005

Location
United States
Originally Posted by Calidore

Recap--with hyperthreading on (8 effective cores), I got the following times:
4 threads: 11:16
6 threads: 9:25
8 threads: 7:38

Hyperthreading off, 4 physical cores only:
4 threads: 10:29
6 threads: 8:28
7 threads: 7:16
8 threads: 7:10

that's a bit odd, makes me think that in this particular workload all the threads launched tended not to be "complimentary", in other words each thread seemed to be performing the same type of workload, stressing the same execution units at the same time, in the same way and more than likely the thread scheduler was doing a poor job of assigning threads.

out of curiosity which OS were you using?

regardless, for any type of serious video work, where any type of filtering will take place and where audio encoding will take place, i would leave hyper threading enabled, i would even leave it enabled if any benchmark indicated that it slowed down performance a bit under that benchmark, if only because it will provide better multitasking performance.

Quote
11th Feb 2010 20:54 #24
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Originally Posted by deadrats

i know the option you are talking about, in xp it's listed as /numproc, it's present on vista and i'm guessing win 7, but all it does is specify how many cores the OS can use during boot up, try it yourself

It works for me. I've been using it for years. The difference in performance is obvious.

NumProc=1, x264 --threads 6

NumProc=4, x264 --threads 6

Last edited by jagabo; 11th Feb 2010 at 22:18.

Quote
12th Feb 2010 17:23 #25
Calidore

View Profile

View Forum Posts

Private Message
Member

Join Date
Jun 2007

Location
United States
Poisondeathray: Thanks for the clarification re: deterministic. I now understand what repeatability means, but am still curious about the why. With the same input and the same settings, what's the random factor that could lead to different output? Also, since it allows the motion vector info to be threaded (which the smaller file size bears out), I wonder why it's turned off by default and labeled "not for general use" in the docs. Is non-repeatability that big a deal, or is there another major negative?

I do know that the vfw version of x264 is missing some functions, but it's awesomely convenient. As far as benchmarking goes, I'm not too concerned about matching them, because benchmark settings have nothing to do with real-world usage. When I'm compressing with x264, I'm using filters as necessary for deinterlacing, deblocking, denoising (especially on old PD movies), and sometimes fixing brightness/contrast (ditto). My only concern, and the reason for the comparisons I'm doing now, is finding the most efficient way to do it.

Deadrats: I'm using XP SP3. Agree re: use hyperthreading anyway. Thinking about it, I realized these results go back to what we were talking about before re: purely physical vs. physical/logical cores. Looking at the results from the standpoint of percentages of available power:

4 threads (8 cores available, 50% usage): 11:16 -- This would be 2 physical + 2 logical cores
4 threads (4 cores available, 100% usage): 10:29 -- Four physical cores wins
8 threads (8 cores available, 100% usage): 7:38 -- But four physical cores + four logical cores wins bigger

6 threads (4 cores available, 150% usage): 8:28 -- Still not even close
7 threads (4 cores available, 175% usage): 7:16 -- But this takes the lead back. Howcum? Dunno. Diminishing returns starts bigtime past here, though.
12 threads (8 cores available, 150% usage): 7:11 -- Diminishing returns starts somewhere between 8-12 threads with HT, but here we're still using <2/3 CPU. I guess this means that x264 is hitting its wall, but with the extra processing power still available, the system has more room to work on other things without performance penalty.

Best,

Calidore

Quote
12th Feb 2010 17:51 #26
poisondeathray

View Profile

View Forum Posts

Private Message
Member

Join Date
Sep 2007

Location
Canada
... am still curious about the why. With the same input and the same settings, what's the random factor that could lead to different output? Also, since it allows the motion vector info to be threaded (which the smaller file size bears out), I wonder why it's turned off by default and labeled "not for general use" in the docs. Is non-repeatability that big a deal, or is there another major negative?

It's not a big deal. Non-determinism is important for debugging when you are doing PSNR testing and tweaking the encoder - it's important to know you get the exact encode bit for bit if you have a particular set of settings for a single run under a specific set of conditions. The benefit of turning it off allows you to have slightly more efficient results, but in my experience it is negligible

The "official" explanation is : "It allows the motion search to go up to the area that the previous threads have actually completed rather than what they are required to have completed. In practice, this can slightly increase motion search range."

Calidore - I understand you want to find the most efficient specific workflow for this specfic case, but be careful as others may interpret this as broad based generalizations or conclusions about x264 in general - because what you're testing has nothing to do with x264 performance or threading. For the benefit of other folk who stumble on this thread, I think it would be wise to change or clarify the thread title, lest generic users start changing their settings such as turning HT off , expecting faster results

Cheers

Quote
12th Feb 2010 19:24 #27
jagabo

View Profile

View Forum Posts

Private Message
Member

Join Date
Dec 2005
Originally Posted by Calidore

When I'm compressing with x264, I'm using filters as necessary for deinterlacing, deblocking, denoising (especially on old PD movies), and sometimes fixing brightness/contrast (ditto)... Diminishing returns starts somewhere between 8-12 threads with HT, but here we're still using <2/3 CPU. I guess this means that x264 is hitting its wall

If you are including those filters in your benchmarks that's why you're not getting better scaling. The filtering is taking a greater percentage of the total conversion time x264 gets faster. Also try using a multithreaded build of AviSynth if you're not already.

Quote

Follow-up: x264 CPU usage--threads vs. instances

Thread Tools

Search Thread

Similar Threads

ffmpeg CPU usage

looking for low cpu usage video player.

how to limit cpu usage for program

CPU usage goes to 100%

ffmpeg only 50% CPU usage in Windows