Lock granularity, bubbles, shaders, texture filtering

sthalik

Yeah. Let’s bikeshed some whether to use Perlin or Voronoi or Gauss or what for tree spacing

I-Hawk

Now I remember more about the post I’ve read - it said that more than 3 cores cause bottlenecks because of inefficient multithreading. That hyperthreading causes inefficiency, too, because 2-3 cores is optimal, else performance is actually lost.

As for a multi-core renderer, as far as I understand it, the actual primitive-drawing is done only on one core. The matter is putting other expensive tasks on pinned threads and balancing them, such that they’re busy with actual useful work, not spinning on a synchronization primitive, performing unnecessary calculations, etc… At least I wouldn’t create multiple concurrent command streams for graphics. I have no experience in the area, but learning about it would be a breath of fresh air from the usual stuff

What I’ve meant is that if parallelization and other optimizations are done, code made more efficient because of this, bubble can be made less restrictive. And yes, I’m staring at a ‘black box’ here, only knowing what you told me

-sh

Actually, regardless of rendering, from what I know, the main load in Falcon is in campaign missions with a high number of units within the bubble. I can only guess that while each unit’s processing code isn’t a heavy load by itself (10 ACs and a column of tanks in a TE will not cause serious load on an average system), adding more and more units eventually causes a heavy load, and the FPS drop becomes very noticeable, even on very strong systems. If we could get those processes to run on multiple cores, that would be the optimal solution for all Falcon performance problems. In a relatively light environment (empty TEs and campaign flying with few units in the bubble), Falcon FPS is already pretty good, usually bound by the GPU.

While parallel processing for GPU will sure help, the real bottleneck is CPU load in camp missions because of high number of units in the bubble.

MrIch

Theaters could mix autogen and non-autogen zones. X-Plane uses exclusion zones. A theater designer could create military targets as he wishes, and these areas would not be mixed with autogen. Furthermore, the landclass and street system of X-Plane, based on OSM data, is a big deal. Imagine having a converter tool which reads all military objects from existing theaters and combines them with the OSM and mesh data in order to create a second-gen theater. …

sthalik

@I-Hawk:

While parallel processing for GPU will sure help, the real bottleneck is CPU load in camp missions because of high number of units in the bubble.

AI can be optimized, including preprocessing path data for A* or whatever is used… So only ‘forks’ in the road data are treated as ‘open set’ data…

Don’t know what’s causing the load though, doubt could use the public PDB for profiling. But it’s worth at least a try optimizing it, definitely.

As for parallel GPU, renderer thread needs no actual resources, except for synchronous flushes which sleep anyway (unless configured otherwise). Pushing into the CS is extremely cheap. So unless it computes/allocates memory (it shouldn’t!), no biggie.

I’m also curious about heat exhaust and HDR. Don’t know (again) how well they’re optimized for SIMD GPU processing, but getting rid of ‘if’ statements in shaders (if any) would speed up GPU processing immensely

fingon

sthalik, what do you mean by “clipping”? Just so we don’t talk past each other.

Either way, I personally would have no problem if future procedurally generated geometry is a visual effect only. It should be MP-safe by being deterministic, but only to make sure people see the same things (e.g. earlier on clouds were not shared, which was crappy), not for targeting/objective purposes. It would be far too inefficient to treat autogen objects as full Falcon campaign objects. Existing campaign objectives should receive an X-Plane-like exclusion zone so that they “pop out” of the autogen stuff.

Lotsa wishful thinking here…
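
A minimal sketch of the "MP-safe by being deterministic" idea above, assuming autogen positions are derived purely from a shared theater seed and integer tile coordinates. All names here (`TreePos`, `mix64`, `trees_for_tile`) are hypothetical and not taken from the BMS code; the mixing step is a splitmix64-style integer hash, chosen only because it gives identical results on every machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: a tree position inside one terrain tile, in [0, 1) x [0, 1).
struct TreePos { float x; float y; };

// splitmix64-style integer mixing step. Pure integer math, so every client
// computes exactly the same value for the same input.
inline std::uint64_t mix64(std::uint64_t z) {
    z += 0x9E3779B97F4A7C15ull;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

// Generate `count` tree positions for tile (tx, ty). Nothing is sent over the
// network: each client derives the same list from the seed and the tile index.
std::vector<TreePos> trees_for_tile(std::uint64_t theater_seed,
                                    std::int32_t tx, std::int32_t ty,
                                    int count) {
    assert(count >= 0);
    std::uint64_t state = mix64(
        theater_seed
        ^ (static_cast<std::uint64_t>(static_cast<std::uint32_t>(tx)) << 32)
        ^ static_cast<std::uint64_t>(static_cast<std::uint32_t>(ty)));
    std::vector<TreePos> out;
    out.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        state = mix64(state);
        // Keep the top 24 bits: they convert to float exactly, giving [0, 1).
        const float x = static_cast<float>(state >> 40) / 16777216.0f;
        state = mix64(state);
        const float y = static_cast<float>(state >> 40) / 16777216.0f;
        out.push_back(TreePos{x, y});
    }
    return out;
}
```

Because the objects are never campaign entities, an exclusion zone would simply be a filter applied to this list on each client.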

sthalik

By clipping, I meant you can fly through skyscrapers with no damage.

Autogen is tons of work, ask Ben Supnik from the X-Plane team. Nice fellah btw, helped me a lot with the wined3d fork.

fingon

In terms of CPU-bound campaigns I’d be interested to know what kind of schedule is applied to things like simple distance checks between units (agg 2D or de-agg 3D) as they move. These kinds of things can kill performance unnecessarily if done in aggressive timings. Who knows, maybe FPS could be gained by simple things like switching to distance calculations without the square root operation in places where it can work, and using tables to adjust (if they don’t already do things like that).

sthalik

FWIW original falcon 4.0 uses squared distance, no sqrt applied
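
For reference, the squared-distance trick looks roughly like this. It is a generic sketch, not the actual Falcon 4.0 code; `Vec2`, `dist_sq` and `within_range` are made-up names.

```cpp
#include <cassert>

// Hypothetical 2D position type (think aggregated units on the campaign map).
struct Vec2 { double x; double y; };

// Squared Euclidean distance: two subtractions, two multiplies, one add.
inline double dist_sq(Vec2 a, Vec2 b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// True if b lies within `range` of a. Comparing squared values gives the same
// answer as comparing distances because both sides are non-negative, so the
// square root can be skipped entirely.
inline bool within_range(Vec2 a, Vec2 b, double range) {
    assert(range >= 0.0);
    return dist_sq(a, b) <= range * range;
}
```

This only helps where the code needs a comparison against a range; wherever the real distance value is needed, the sqrt still has to be paid.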

fingon

Hehe, well, re skyscrapers: those should be true Falcon objectives (at least those over a certain height maybe), while things like road infrastructure (except bridges), residential housing and trees - I really don’t mind if they are not fully physical - Falcon is first and foremost a simulation of the in-cockpit experience.

fingon

sqrt: Yeah? good to hear, low-hanging fruit…

sthalik

@fingon:

Hehe, well, re skyscrapers: those should be true Falcon objectives (at least those over a certain height maybe), while things like road infrastructure (except bridges), residential housing and trees - I really don’t mind if they are not fully physical - Falcon is first and foremost a simulation of the in-cockpit experience.

Why not make objectives fast too?

fingon

That would be the platonic ideal

Jasajas

I hope you gentlemen continued your discussions elsewhere, since there is nothing new here. IMO, it’s the most progressive read on the forum and I salute you for pushing the frontier in debating what is doable with this amazing sim.

Bms Forever!

/Jas

Ps. I would love to see a more user-friendly Lodeditor, but that’s probably just because I’m stupid. Ds.

jhook

@sthalik:

Now I remember more about the post I’ve read - it said that more than 3 cores cause bottlenecks because of inefficient multithreading. That hyperthreading causes inefficiency, too, because 2-3 cores is optimal, else performance is actually lost.

As for a multi-core renderer, as far as I understand it, the actual primitive-drawing is done only on one core. The matter is putting other expensive tasks on pinned threads and balancing them, such that they’re busy with actual useful work, not spinning on a synchronization primitive, performing unnecessary calculations, etc… At least I wouldn’t create multiple concurrent command streams for graphics. I have no experience in the area, but learning about it would be a breath of fresh air from the usual stuff

What I’ve meant is that if parallelization and other optimizations are done, code made more efficient because of this, bubble can be made less restrictive. And yes, I’m staring at a ‘black box’ here, only knowing what you told me

-sh

First of all, this discussion flew under my radar last year, so I am just now reading it, and I am grateful for this thread as it has shed a little light on what goes on behind the BMS scenes. Just some observations. Hyper-threading or multi-threading the code would improve performance in FBMS 10 fold AFAIK. I think what Sthalik is discussing is correct. It comes down to how the code would utilize multiple cores. This would open up a lot of headroom in FBMS to allow greater additions (such as better GFX and added features) without reducing (if not improving) performance. I think that is where you guys are heading considering where you left off. Also, since Dunc has asked the question (the OS poll), it would require a more powerful operating system (such as 7 or 8) to manage the multi-threading. Also, 7 or 8 would allow for more RAM to be recognized and used. Another big plus!

What I have been reading about lately is the fact that multi GPU’s are a problem. I experience problems running games with my 2 HD 7970’s. Micro stuttering, latency issues and overall performance issues. I will get better FPS, but I will get micro stuttering with some games. Sthalik hit it on the head with this post. Everything is done through 1 GPU. The other GPU renders off-board processes like FSAA. Then it transmits the result back to the main GPU board for post-processing. This is where you actually get issues like micro stuttering. SLI or Xfire does not matter. It is the same issue. Best bet is to get a fast single GPU board. As for FBMS, I would focus on the multi-threading (multi CPU’s) and the OS utilizing more RAM to improve FBMS overhead. Everything else will depend on this kind of improvement. I think with multi-core support, you could have normal maps for the terrain and still get around 50 FPS with a decent chip and RAM.

sthalik

jhook,

Hyper-threading or multi-threading the code would[…]

The code’s heavily parallel starting with 4.0. It’s -just- lock granularity, threading overhead etc. that’s causing it to perform less well at higher parallelization levels.

That’s a major thing to do, you just don’t take some-orders-of-magnitude-of-SLOC-codebase and refactor it like that…

Also, 7 or 8 would allow for more RAM to be recognized and used. Another big plus!

As far as 32-bit executables are concerned, more address space (what is that “RAM” thing?) doesn’t magically become available with newer OS versions.

I cry in terror at the mere thought of someone making it 64-bit clean without making “database” incompatible with the 32-bit version.

running games with my 2 HD 7970’s. Micro stuttering, latency […]

Nonsense to use crossfire when falcon’s CPU-bound.

-sh

jhook

@sthalik:

jhook,

Hyper-threading or multi-threading the code would[…]

The code’s heavily parallel starting with 4.0. It’s -just- lock granularity, threading overhead etc. that’s causing it to perform less well at higher parallelization levels.

That’s a major thing to do, you just don’t take some-orders-of-magnitude-of-SLOC-codebase and refactor it like that…

Also, 7 or 8 would allow for more RAM to be recognized and used. Another big plus!

As far as 32-bit executables are concerned, more address space (what is that “RAM” thing?) doesn’t magically become available with newer OS versions.

I cry in terror at the mere thought of someone making it 64-bit clean without making “database” incompatible with the 32-bit version.

running games with my 2 HD 7970’s. Micro stuttering, latency […]

Nonsense to use crossfire when falcon’s CPU-bound.

-sh

Thanks for the reply. I agree with the dual GPU approach. No need for that. But do you think multi-CPU threading would be possible for FBMS? More address space for more RAM usage? Seems like that would go a LONG way really. As for the OS thing, I was referencing that a better OS would help with the processing better. Don’t know if that is even a factor really. Anyway, would be great to see FBMS using multi-CPU’s and greater RAM.

sthalik

But do you think multi-CPU threading would be possible for FBMS?
Already is since 4.0.
More address space for more RAM usage?
BMS doesn’t use more than 3 GB anyway with sane data assets.
better OS would help with the processing better.
Elaborate, that’s vague to the point of irrelevance.
would be great to see FBMS[…]
Wrong tense.

-sh

jhook

@sthalik:

But do you think multi-CPU threading would be possible for FBMS?
Already is since 4.0.
More address space for more RAM usage?
BMS doesn’t use more than 3 GB anyway with sane data assets.
better OS would help with the processing better.
Elaborate, that’s vague to the point of irrelevance.
would be great to see FBMS[…]
Wrong tense.

-sh

I didn’t think FBMS was multi-CPU capable. Wow. Ok.

Since Dunc made a small and simple statement about the Windows version and a 64-bit version on the horizon, those were actually my thoughts on running FBMS on a better OS. Simply because moving FBMS to a 64-bit OS might allow for better processing and utilizing CPU/RAM more efficiently (i.e. more headroom). These were just some ideas for improving the performance of FBMS. Since it does use multiple CPUs (I thought that it did not), how many cores does FBMS utilize? Is there a maximum core limit?

sthalik

The thread actually went on, very interesting at that. Some more fuel to the fire.

jhook: there are no hard limits on the CPU count as we understand it. It just spawns threads - like any other process that spawns threads - and the OS schedules them.

“Utilizing X more efficiently” is so vague that it equates to “make the damn thing run faster!!!”. So let’s!

@I-Hawk:

[…]I’ve done some MPI and OpenMP coding in a parallel processing university course (although not as part of a CS or SW degree) and that’s my main knowledge of parallel processing. Never tried to check if something like that can fit the Falcon code. I guess that if someone with expertise in multi-threading and multi-core coding methods tries to get something implemented, it could work, even locally in some high processing load areas of the code (if that makes any sense…)

a breakthrough in that direction will be revolutionary for a sim like Falcon which can starve sometimes for CPU cycles.

The issue with parallelization: if 2/3 of the code runs serialized AND waits for the threaded part, then even with infinite cores at most 1/3 of the run time can be saved, i.e. a 1.5x speedup at best. There was a name for that law, forgot.
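
The limit described here is usually known as Amdahl's law: with a serial fraction s and n cores, the speedup is 1 / (s + (1 - s) / n), which approaches 1 / s as n grows. A small sketch to put numbers on it; the function names are made up for illustration and the 2/3 figure is just the example from this post, not a measurement of BMS.

```cpp
#include <cassert>

// Amdahl's law: speedup on `cores` cores when `serial_fraction` of the run
// time cannot be parallelized.
double amdahl_speedup(double serial_fraction, double cores) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(cores >= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

// Upper bound on the speedup as the core count goes to infinity.
double amdahl_limit(double serial_fraction) {
    assert(serial_fraction > 0.0 && serial_fraction <= 1.0);
    return 1.0 / serial_fraction;
}
```

With a serial fraction of 2/3, four cores give about 1.33x and the ceiling is 1.5x, so shrinking the serial part matters far more than adding cores.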

OpenMP is an abstraction with its own tradeoffs. From what I’ve seen on it, it’s for dudes doing scientific processing of datasets. It won’t solve the issue of interlocking either.

What would, IMO, solve the interlocking? Look at the control flow graph and see why things happen the way they do. In particular, render path and simulation aren’t running in unison.

As for lock granularity, that’s a separate thing in itself. Overreliance on locking is, however, caused by the need for mutable state. There are ways to change it, for instance:

a) don’t mutate, copy the relevant parts into new instance as needed, say, every frame/dt. The memory’s managed manually so affordable.
b) isolate relevant state into values on its own and torture the graphviz’d control flow graph until spotting something that can be done less bottleneckably.
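
Option a) could look something like the sketch below: the simulation builds a fresh copy of the relevant state every frame/dt and publishes it, and readers such as the render path only ever see immutable snapshots. This is an illustration under assumptions, not how BMS is structured; `UnitState`, `Snapshot` and `SnapshotBoard` are invented names.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical per-unit state; stands in for whatever the renderer needs.
struct UnitState { int id; double x; double y; };
using Snapshot = std::vector<UnitState>;

class SnapshotBoard {
public:
    SnapshotBoard() : current_(std::make_shared<Snapshot>()) {}

    // Simulation side: hand over a freshly built copy once per frame/dt.
    // The lock is held only while the pointer is swapped.
    void publish(Snapshot next) {
        std::shared_ptr<const Snapshot> fresh =
            std::make_shared<Snapshot>(std::move(next));
        std::lock_guard<std::mutex> guard(mutex_);
        current_ = std::move(fresh);
    }

    // Reader side: take a reference to the current snapshot. The data behind
    // it is const and never mutated, so it can be read without further locks.
    std::shared_ptr<const Snapshot> read() const {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(current_ != nullptr);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const Snapshot> current_;
};
```

A reader that is still holding an older snapshot keeps it alive until done, so the simulation never has to wait for the renderer to finish with the data. The cost is one copy per frame, which is the tradeoff a) already accepts.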

OpenMP has some autoparallelization, but Turing-machines leave little to the imagination of the compiler. Think – sequence points, side effects, aliasing. It can work well with parallelizing+vectorizing self-contained loops, but what is it for Falcon?

That the execution structure is more-or-less derived from F4 puts a frame around what can be done. And the orders-of-magnitude of F4 BMS SLOC: holy jumping jethros…

-sh
