BMS works faster with HT on and no affinity set
-
It was known that BMS works best with CPU affinity set to 3 cores. However, after disabling HT I got a measly ~43 frames in the pit with the default view.
Turning HT back on and removing the forced affinity, I got 60-ish frames. What the heck? Now BMS runs on up to 4*2 logical cores, and rather than ~43 I get ~61 frames.
I tested by starting up the LGB training TE. Note that there are no processes invoking “timeBeginPeriod” at all.
Did something change with the 4.33 release? Is it different with lots of ATM/GTM stuff going on? I can imagine that this TE is not representative of campaign gameplay.
Note: the Windows version is 8.1, but the Windows scheduler has always been bad and probably always will be, so that shouldn’t matter much.
-
Not sure about all that, but I do know that empty TEs are usually GPU-bound, while campaign missions in crowded areas are CPU-bound, so CPU tests are best done only in such an environment.
-
It was known that BMS works best with CPU affinity set to 3 cores. However, after disabling HT I got a measly ~43 frames in the pit with the default view.
Turning HT back on and removing the forced affinity, I got 60-ish frames.
Can you elaborate on how to enable/disable HT and how to set/remove affinity?
-
Can you elaborate on how to enable/disable HT and how to set/remove affinity?
HT is normally disabled/enabled via the BIOS.
Affinity is set by launching BMS and, at the main in-game screen, ALT-TABbing out of BMS, then pressing CTRL-ALT-DELETE to start Task Manager.
Next select the ‘Processes’ tab, find Falcon BMS.exe in the processes list and right-click on it.
The drop-down menu lists ‘Set Affinity…’, and here you can choose which cores, and how many of them, that program may run on.
Assigning only 3 cores to a program then restricts it to those 3 cores, leaving the ‘free’ core available to the other processes that are running.
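For what it’s worth, the same thing Task Manager does can also be requested through the Win32 API. A minimal sketch, assuming you have looked up the BMS process ID beforehand (the 1234 below is a placeholder, not a real PID):

    #include <windows.h>

    int main()
    {
        // Open the already-running process; 1234 is a placeholder PID.
        HANDLE h = OpenProcess(PROCESS_SET_INFORMATION, FALSE, 1234);
        if (h) {
            // 0x7 == binary 111: restrict the process to logical cores 0, 1 and 2.
            SetProcessAffinityMask(h, 0x7);
            CloseHandle(h);
        }
        return 0;
    }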
Note that Hyperthreading is the ability to run several sub-processes or ‘threads’ on one core, emulating multi-core processing. So, for example, 4 cores each able to hyperthread could run, say, 2 threads per core and emulate a PC with 8 cores. The theory is that the more cores (even virtual ones) you have, then, provided a program is able to take advantage of multiple cores, it can run its sub-processes in parallel and so run faster than if it had to run them sequentially on, say, a single core.
I was not sure that BMS supports HT though? Or even takes advantage of multiple cores?
-
Thanks for the explanation Jetlag!
-
You are welcome, Ice.
I forgot to mention, however, that this all assumes your CPU supports Hyperthreading and has multiple cores.
If the CPU does not have HT then of course you will not see it as a BIOS option.
And naturally CPUs come in single, dual, quad or even now octa core - so the affinity list will only show the number of cores that the CPU physically has.
Also, a dual core will almost always outperform a single core with HT - and again, the program has to support HT and/or multi-core for its sub-processes.
-
Mine is a 6600K on a Z170X motherboard…
-
Mine is a 6600K on a Z170X motherboard…
No hyperthreading on the i5, AFAIK:
https://ark.intel.com/products/88191/Intel-Core-i5-6600K-Processor-6M-Cache-up-to-3_90-GHz
-
Yeah, I thought as much. I can still do the affinity part though, correct?
-
I was not sure that BMS supports HT though? Or even takes advantage of multiple cores?
BMS runs multiple threads. You can compare the CPU-affinity framerate between two and three cores.
Applications don’t contain explicit support for HT; HT either helps the workload or pessimizes it. Which one depends on the workload, but note that the core’s cache is shared between the two HT threads of execution.
-
Can you create a synthetic, easy-to-repeat test? In fact there’s barely any CPU usage even in a campaign mission at 20k feet.
Thanks to the .pdb being supplied, I can do an actual, if shallow, analysis. Had the .pdb not been available, little understanding could have been gained from any of this.
One performance problem I already noticed: I accidentally had the export of MFDs/etc. to an in-memory texture enabled. The issue here is that it runs in lock-step with the renderer. Ideally we’d export asynchronously, without slowing down the main loop. As it stands, downloading a texture from a render target plus the “memcpy” takes 14% of the time, and everything else, including the actual world rendering, can only run once that is complete. Hence my suggestion of doing the mapping asynchronously and not performing blocking waits on it in the main loop.
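Roughly the kind of non-blocking readback I have in mind, in D3D9 terms. A sketch only, with placeholder names (device, rt, sysmem) rather than whatever the BMS code actually calls these; in practice the query would also be created once and reused:

    #include <d3d9.h>

    // Frame N: queue the GPU-to-sysmem copy and put a fence (event query) behind it.
    // Neither call blocks the CPU by itself.
    device->GetRenderTargetData(rt, sysmem);
    IDirect3DQuery9* fence = nullptr;
    device->CreateQuery(D3DQUERYTYPE_EVENT, &fence);
    fence->Issue(D3DISSUE_END);

    // Frame N+1 or later: only lock the surface once the GPU has finished the copy.
    // GetData returns S_FALSE while the copy is still in flight, so simply retry next frame.
    if (fence->GetData(nullptr, 0, D3DGETDATA_FLUSH) == S_OK) {
        D3DLOCKED_RECT lr;
        if (SUCCEEDED(sysmem->LockRect(&lr, nullptr, D3DLOCK_READONLY))) {
            // memcpy lr.pBits into the shared-memory MFD export here.
            sysmem->UnlockRect();
        }
        fence->Release();
    }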
I also noticed the amd64 version having a lower CPI rate than the x86 one. I’m happy to see that Visual C++ can do something properly at least sometimes.
Also, on my AMD GPU there’s a separate thread constantly doing the driver’s work, and it is itself a heavy CPU user.
Finally, I wonder whether “timeBeginPeriod” with a reasonable value can speed things up. I saw a few dozen threads during gameplay, most of which don’t do anything CPU-bound. I haven’t tested “timeBeginPeriod” yet. Can one of you provide me with a synthetic CPU-bound TE or campaign state?
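For reference, the call itself is trivial; a minimal sketch of the experiment I have in mind, assuming the usual 1 ms value (which I have not validated for BMS):

    #include <windows.h>
    #include <mmsystem.h>              // timeBeginPeriod / timeEndPeriod
    #pragma comment(lib, "winmm.lib")

    int main()
    {
        // Request 1 ms timer granularity for the lifetime of the process.
        // On Windows 8.1 this is system-wide, and every timeBeginPeriod call
        // must be paired with a timeEndPeriod call using the same value.
        timeBeginPeriod(1);

        // ... main loop would run here ...

        timeEndPeriod(1);
        return 0;
    }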
sh
-
I’ve always suspected that the i7 CPU should not be overlooked when it comes to BMS.
-
There’s another matter on Windows: “timer granularity”. It can be checked using Sysinternals’ clockres - just google for it, first hit, from Microsoft.
Windows likes to switch threads between CPU cores, which wastes L1/2/3/4 cache. There’s also a built-in Windows tool that reports which software lowers the current “clockres” to 1 ms:
powercfg -energy -duration 1 && start energy-report.html
In the “warnings” with the yellow background, look for Platform Timer Resolution.
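If you’d rather not download anything, the same numbers can be read via ntdll’s NtQueryTimerResolution - undocumented, but well-known and long-stable. A quick sketch:

    #include <windows.h>
    #include <cstdio>

    // Undocumented ntdll export; all values are in 100 ns units.
    // Confusingly, "minimum" is the coarsest setting and "maximum" the finest.
    typedef LONG (NTAPI* PFN_NtQueryTimerResolution)(PULONG min, PULONG max, PULONG cur);

    int main()
    {
        auto fn = (PFN_NtQueryTimerResolution)GetProcAddress(
            GetModuleHandleW(L"ntdll.dll"), "NtQueryTimerResolution");
        ULONG min = 0, max = 0, cur = 0;
        if (fn && fn(&min, &max, &cur) == 0) {
            printf("timer resolution: current %.3f ms (coarsest %.3f, finest %.3f)\n",
                   cur / 10000.0, min / 10000.0, max / 10000.0);
        }
        return 0;
    }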
It’s pretty evil that Chrome and Spotify do it, especially since they could adjust the timer resolution for their process only, using a well-known, working, undocumented Windows function[1].
Now the question is whether changing “clockres” from 16 ms to 1 ms reduces lock contention. If it does, there might be a timer resolution that balances CPU cache wastage against lock contention.
I’d expect contention to decrease with lower values, but there’s stuff like priority inversion etc. that’s hard to estimate on paper one way or the other.
If someone made me a synthetic test, a TE or similar, I’d be very grateful.
I’ve done some preliminary profiling with VTune (thanks to the team for the .pdb file), and there’s nothing too shady going on. For one thing, disable ALL MFD exports[2] to shared memory: BMS queries the GPU synchronously, waiting for the download while the GPU is busy with other things, and that takes a while. Another thing is the sim running in lock-step with the rendering, but that’s very common in games and similar software.
[1] We /might/ be able to inject the call, decreasing the timer resolution for BMS only, using a d3d9 wrapper or similar. I don’t know whether it works while timeBeginPeriod is active…
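To make that concrete: roughly what such a wrapper could look like, as a sketch only, using plain timeBeginPeriod for simplicity. It assumes a proxy d3d9.dll placed next to the BMS executable gets loaded first; a real wrapper would also have to forward every other d3d9 export:

    // d3d9_proxy.cpp - hypothetical proxy d3d9.dll (sketch only).
    // A .def file would export this function under the name "Direct3DCreate9".
    #include <windows.h>
    #include <d3d9.h>
    #include <mmsystem.h>
    #pragma comment(lib, "winmm.lib")

    typedef IDirect3D9* (WINAPI* PFN_Direct3DCreate9)(UINT);

    extern "C" IDirect3D9* WINAPI Direct3DCreate9_Proxy(UINT sdkVersion)
    {
        // The whole point: raise the timer resolution from inside the BMS process.
        // Done here rather than in DllMain to stay clear of loader-lock issues.
        timeBeginPeriod(1);

        // Load the real d3d9.dll from the system directory and forward the call.
        wchar_t path[MAX_PATH];
        GetSystemDirectoryW(path, MAX_PATH);
        wcscat_s(path, L"\\d3d9.dll");
        HMODULE real = LoadLibraryW(path);
        PFN_Direct3DCreate9 create =
            real ? (PFN_Direct3DCreate9)GetProcAddress(real, "Direct3DCreate9") : nullptr;
        return create ? create(sdkVersion) : nullptr;
    }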
[2] It could be done asynchronously; alas, it’s not done that way. VTune has shown 20% of wall-clock time in the download.
A question to @__I-Hawk__: do you use a thread pool?
Apologies for the thread necromancy. I’m in and out between projects and it shows.
Also, a few of the paragraphs are rather technical; just skip the ones you don’t understand. You’ll still get a clear picture.
sh
-
I’ve always suspected that the i7 CPU should not be overlooked when it comes to BMS.
Turbo and multiplier OC will help a lot in busy missions.
-
I’m new to this simulator, but I’ll try to check how it (simultaneous cores) works for me. Thank you.
-
You can set affinity in a shortcut so it’s applied automatically at startup. Here is one link of many that shows how to do it: https://www.eightforums.com/threads/cpu-affinity-shortcut-for-a-program-create-in-windows.40339/… But then I just found this, which makes me reconsider using affinity at all.
Quote: “Assigning CPU affinities to specific executables is a bad idea. Setting affinity on a process doesn’t reserve a CPU for the process you specify, locking out all other processes from that CPU. It just says that that process can only use the designated CPU.”
It might be better to leave BMS alone and use affinity to ‘limit’ other programs instead.
Well, back to the drawing board.
-
This is an old thread… the BMS threading model has changed a lot (with the move to DX11 and a whole new graphics engine in 4.35)… and the Windows threading model has also changed a lot since ca. 2017. (E.g. starting with 20H1, timeBeginPeriod apparently no longer has crappy system-wide side effects… hooray for that.)
BUT if you have a 4-core CPU (or a 2-core with HT)… IMO it may still make sense to run BMS on 3 logical cores and leave the 4th core open for the OS, the graphics driver etc. to do their work without preempting BMS threads.
(The OS runs a couple of high-priority usermode threads… for things like input dispatching, and for DWM sending frames to the graphics device… probably audio and networking stuff too. It seems worthwhile to let the OS have a free core to deal with all that…)
Anyway… if you have just 2 cores, it’s definitely not worth constraining further down to 1. And if you have 6 or more, it’s probably not worth messing with process affinities or priorities.
But with 4 cores, I think this is still worth trying:
@start "BMS" /high /affinity 0x0E "C:\Falcon BMS 4.35\Bin\x64\Falcon BMS.exe" ;; 0x0E == 0000_1110 in binary
If it’s 4 cores with HT… then I recommend either turning HT off in the BIOS… or constraining BMS to 3 cores (but! remember to use just the odd- or even-numbered logical cores… you definitely don’t want to squeeze 2 BMS threads onto a single physical CPU core… hyperthreading is mostly a sham, especially for threads that are heavy on memory I/O):
@start "BMS" /high /affinity 0x2A "C:\Falcon BMS 4.35\Bin\x64\Falcon BMS.exe" ;; 0x2A == 0010_1010 in binary
Every system is different, so you just have to try this and see if it impacts framerate. E.g. I expect recent-generation Ryzen chips, with their ton of L3 cache, probably do better with HT left on than comparable Intel chips do. But that’s just wild speculation; I haven’t run any side-by-side comparisons…