BMS works faster with HT on and no affinity set
-
Mine is a 6600K on a Z170X motherboard…
-
No hyperthreading on the i5, AFAIK.
https://ark.intel.com/products/88191/Intel-Core-i5-6600K-Processor-6M-Cache-up-to-3_90-GHz
-
Yeah, I thought as much. I can still do the affinity part though, correct?
-
I was not sure that BMS supports HT though? Or even takes advantage of multiple cores?
BMS runs multiple threads. You can compare CPU affinity framerate between two and three cores.
Applications don’t need explicit support for HT. It either helps a workload or pessimizes it, depending on the workload; note that a core’s cache is shared between its two HT threads of execution.
-
Can you create a synthetic, easy to repeat test? In fact there’s barely any CPU usage even in a campaign mission at 20k feet.
Thanks to the .pdb being supplied, I can do an actual, if shallow, analysis. Without the .pdb, little could have been learned from any of this.
One performance problem I already noticed: I accidentally had export of MFDs etc. to an in-memory texture enabled. The issue is that it runs in lock-step with the renderer. Ideally we’d export asynchronously, without slowing down the main loop. As it stands, downloading a texture from a render target plus the “memcpy” takes 14% of the time, and everything else, including actual world rendering, can only run once that is complete. Hence my suggestion: do the mapping asynchronously and don’t perform blocking waits on it in the main loop.
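To illustrate the "don't block in the main loop" idea, here is a toy model in Python. It is only a sketch: it does not touch Direct3D or BMS code, the class and function names are made up, and it simply assumes a copy issued on frame N has finished by frame N + 2. The point is that the loop polls for finished copies and never waits for one.

```python
from collections import deque


class DelayedReadback:
    """Toy model of a ring of staging buffers (hypothetical, not BMS code).

    A copy issued on frame N is assumed to be finished by frame
    N + latency_frames, so the main loop only ever polls for it.
    """

    def __init__(self, latency_frames=2):
        self.latency_frames = latency_frames
        self.pending = deque()  # (frame_issued, payload), oldest first

    def issue(self, frame, payload):
        """Queue a copy; returns immediately."""
        self.pending.append((frame, payload))

    def poll(self, frame):
        """Return every payload old enough to be ready; never waits."""
        ready = []
        while self.pending and frame - self.pending[0][0] >= self.latency_frames:
            ready.append(self.pending.popleft()[1])
        return ready


def run_main_loop(frames, latency_frames=2):
    """Simulate the main loop: export what is ready, then issue a new copy."""
    readback = DelayedReadback(latency_frames)
    exported = []
    for frame in range(frames):
        exported.extend(readback.poll(frame))  # non-blocking
        readback.issue(frame, f"mfd-{frame}")  # start this frame's copy
        # ... world rendering would go here, not held up by the export ...
    return exported


print(run_main_loop(5))  # ['mfd-0', 'mfd-1', 'mfd-2']
```

The cost is that exported MFD images lag a couple of frames behind the screen, which is probably acceptable for an external display.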
I also noticed the amd64 version has a lower CPI rate than x86. I’m happy to see that Visual C++ can do something properly, at least sometimes.
Also, on my AMD GPU there’s a separate thread constantly doing the driver’s work. It uses the CPU itself heavily.
Finally, I wonder whether “timeBeginPeriod” with a reasonable value can speed things up. I saw a few dozen threads during gameplay, most of which don’t do anything CPU-bound. I haven’t tested “timeBeginPeriod” yet. Can one of you provide me with a synthetic CPU-bound TE or campaign state?
sh
-
I’ve always suspected that the i7 CPU should not be overlooked when it comes to BMS.
-
There’s another matter on Windows: “timer granularity”. It can be checked using Sysinternals ClockRes; just google for it, the first hit is from Microsoft.
Windows likes to switch threads between CPU cores, which wastes L1/2/3/4 cache. There’s a built-in Windows tool that reports which software lowers the current “clockres” to 1 ms:
powercfg -energy -duration 1 && start energy-report.html
In “warnings” with yellow background, look for Platform Timer Resolution.
It’s pretty evil that Chrome and Spotify do it, especially since they could adjust timer resolution for their own process only, using a well-known, working, undocumented Windows function[1].
Now the question is whether changing “clockres” from 16 ms to 1 ms reduces lock contention. If it does, there might be a timer resolution that balances CPU cache wastage and lock contention.
I’d expect contention to decrease with lower values, but there’s stuff like priority inversion etc. that’s hard to estimate on paper toward one way or another.
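For anyone who wants to experiment, here is a minimal Python sketch that requests a timer period through winmm's timeBeginPeriod/timeEndPeriod and measures how long a short sleep really takes. The helper names are made up for illustration, and on non-Windows systems the request is a no-op. Two caveats I'm assuming rather than have verified here: whether the request affects other processes depends on the Windows version, and newer Python versions may use a high-resolution timer for time.sleep on Windows, which could hide the difference.

```python
import ctypes
import sys
import time
from contextlib import contextmanager


@contextmanager
def timer_period(ms=1):
    """Request a timer period of `ms` milliseconds while the block runs.

    Yields True if the request was accepted, False otherwise
    (including on platforms other than Windows, where nothing is done).
    """
    if sys.platform != "win32":
        yield False
        return
    winmm = ctypes.WinDLL("winmm")
    granted = winmm.timeBeginPeriod(ms) == 0  # 0 is TIMERR_NOERROR
    try:
        yield granted
    finally:
        if granted:
            winmm.timeEndPeriod(ms)  # every begin needs a matching end


def median_sleep_ms(requested_ms=1.0, samples=20):
    """Median wall-clock duration, in ms, of sleeping for `requested_ms`."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(requested_ms / 1000.0)
        durations.append((time.perf_counter() - start) * 1000.0)
    durations.sort()
    return durations[len(durations) // 2]


if __name__ == "__main__":
    print("default period, median 1 ms sleep:", median_sleep_ms())
    with timer_period(1) as granted:
        print("request granted:", granted)
        print("1 ms period, median 1 ms sleep:", median_sleep_ms())
```

Comparing the two medians gives a rough idea of the effective timer granularity; it says nothing yet about lock contention inside BMS.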
If someone made me a synthetic test, a TE or similar, I’d be very grateful.
I’ve done some preliminary profiling with VTune (thanks to the team for the .pdb file), and there’s nothing too shady going on. For one, disable ALL MFD exports[2] to shared memory. The export queries the GPU synchronously and waits for the download while the GPU is busy with other stuff, so it takes a while. Another thing is the sim running in lock-step with rendering, but this is very common in games and similar software.
[1] We /might/ be able to inject the call, decreasing timer resolution for BMS, using a d3d9 wrapper or similar. I dunno if it works while timeBeginPeriod is active…
[2] It could be done asynchronously; alas, it’s not done that way. VTune has shown 20% of wall clock time in the download.
A question to @__I-Hawk__: do you use a thread pool?
Apologies for thread necromancy. I’m in-and-out between projects and it shows.
Also, a few of the paragraphs are too technical; just skip the ones you don’t understand. It’s enough to get a clear picture as it is.
sh
-
Turbo and multiplier OC will help a lot in busy missions.
-
I’m new to this simulator, but I’ll try to check how it (simultaneous cores) works for me. Thank you.
-
You can set affinity in a shortcut so it is applied automatically at start. Here is one link of many that shows how to do it: https://www.eightforums.com/threads/cpu-affinity-shortcut-for-a-program-create-in-windows.40339/… But then I just found this, which makes me reconsider using affinity.
Quote: Assigning CPU affinities to specific executables is a bad idea. Setting affinity on a process doesn’t reserve a CPU for that process you specify, locking out all other processes from that CPU. It just says that that process can only use the designated CPU.
It might be better to leave BMS alone, and use affinity to ‘limit’ other programs.
Well, back to the drawing board.
-
This is an old thread … BMS threading model has changed a lot (with move to DX11 and a whole new graphics engine in 4.35)… and also Windows threading model has changed a lot since ca. 2017. (Eg. starting with 20H1 apparently timeBeginPeriod no longer has crappy system-wide side effects… hooray for that.)
BUT if you have a 4-core cpu (or 2-core w/ HT) … imo it may still make sense to run BMS on 3 logical cores, and leave the 4th core open for the OS and graphics driver etc to do work without preempting BMS threads.
(The OS drives a couple of high-pri usermode threads … for things like input-dispatching, and for DWM sending frames to the graphics device… probably audio and networking stuff too. Seems maybe worthwhile to let the OS have a free core, to deal with all that…)
Anyway… if you have just 2 cores, it’s def not worth further constraining down to 1. And if you have 6 or more, probably not worth messing with process affinities or priorities.
But with 4-cores, I think this is still worth trying.
rem 0x0E == 0000_1110 in binary (logical CPUs 1, 2 and 3)
@start "BMS" /high /affinity 0x0E "C:\Falcon BMS 4.35\Bin\x64\Falcon BMS.exe"
If 4-cores-with-HT… then I recommend either turning HT off in the BIOS… or constraining BMS to 3 cores (but! remember to use just the odd or just the even numbered logical cores… you definitely don’t want to squeeze 2 BMS threads into a single physical cpu core… hyperthreading is mostly a sham, especially for threads that are heavy on memory I/O).
rem 0x2A == 0010_1010 in binary (logical CPUs 1, 3 and 5)
@start "BMS" /high /affinity 0x2A "C:\Falcon BMS 4.35\Bin\x64\Falcon BMS.exe"
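If you'd rather compute the mask than count bits by hand, here is a small Python sketch. The function names are made up, and the HT helper assumes that logical CPUs 2k and 2k+1 are the two siblings of physical core k, which is the usual numbering but worth checking on your own machine.

```python
def affinity_mask(logical_cpus):
    """Return the bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical CPU index must be non-negative")
        mask |= 1 << cpu
    return mask


def one_per_physical_core(physical_cores, sibling=1):
    """Pick one logical CPU per physical core.

    Assumes logical CPUs 2k and 2k+1 are the two HT siblings of physical
    core k; `sibling` selects the even (0) or the odd (1) one.
    """
    return [2 * core + sibling for core in physical_cores]


def as_start_argument(mask):
    """Format the mask as a hex string for start's /affinity switch."""
    return f"0x{mask:02X}"


# 4 cores, no HT: leave CPU 0 free, use CPUs 1, 2 and 3.
print(as_start_argument(affinity_mask([1, 2, 3])))  # 0x0E

# 4 cores with HT: one logical CPU on each of physical cores 0, 1 and 2.
print(as_start_argument(affinity_mask(one_per_physical_core([0, 1, 2]))))  # 0x2A
```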
Every system is different, so you just have to try this and see if it impacts framerate. Eg. I expect recent generation Ryzen chips, with their ton of L3 cache, probably do better with HT left on than comparable Intel chips do. But that’s just wild speculation, I haven’t run any side by side comparisons…