Solved 4.37 CTD after several minutes
-
@airtex2019 said in 4.37 CTD after several minutes:
@Icer is it specifically this?
Falcon BMS.exe caused an EXCEPTION_BREAKPOINT in: 00000000045972A3 Falcon BMS.exe, bp::DrawList<unsigned __int64>::addCommand()+419 byte(s), D:\Dev\BMS\Code\Graphics\Bluebox\Engine\Common\DrawList.h, line 508
If it’s a different crash, let’s open a new topic. If the same … can you confirm repro in a simple TE or Instant Action?
@airtex2019 - Just tested IA with labels, CTD and this is the line I have - looks familiar…
-
@Seifer
after disabling Labels in Setup and deleting the parallel_draw entry in cfg, no CTD so far.
-
@Eagle10 said in 4.37 CTD after several minutes:
@Seifer
after disabling Labels in Setup and deleting the parallel_draw entry in cfg, no CTD so far.
I have had Labels “enabled” in Setup since the beginning; I only had a CTD after actually turning them on in 3D…
-
I already know where the CTD happens and how to fix it. What I don’t know is why it happens. Really tricky this one. Still working on it.
-
@Boldhead said in 4.37 CTD after several minutes:
@Seifer
the set g_sCpuPerfOptimizations “all-PARALLEL_DRAW_OBJLIST”
worked for my test, usually CTD after 15 min flight, now worked the whole 1 hr flight
This line worked for me too, no CTD on 3 missions in Balkans, in multiplayer with 4 other pilots.
-
@MLU said in 4.37 CTD after several minutes:
@Boldhead said in 4.37 CTD after several minutes:
@Seifer
the set g_sCpuPerfOptimizations “all-PARALLEL_DRAW_OBJLIST”
worked for my test, usually CTD after 15 min flight, now worked the whole 1 hr flight
This line worked for me too, no CTD on 3 missions in Balkans, in multiplayer with 4 other pilots.
I don’t even see that line in cfg?
-
-
@MLU not really a fix, but a workaround, as it disables a major performance component. I am still working on the fix. If you wish to test, let me know. But I will be away for a week, starting tomorrow.
-
@Seifer said in 4.37 CTD after several minutes:
@MLU not really a fix, but a workaround, as it disables a major performance component. I am still working on the fix. If you wish to test, let me know. But I will be away for a week, starting tomorrow.
Let’s git er done today then! Just sent you the .dmp from the test exe!
-
@Icer
have you tried Labels off in Setup? For me that works without the parallelism-entry in cfg.
-
@Eagle10 said in 4.37 CTD after several minutes:
@Icer
have you tried Labels off in Setup? For me that works without the parallelism-entry in cfg.
Yes, I believe running with no labels will not cause a CTD (you can actually leave it selected in Setup AFAIK). We are testing with new exe files that @Seifer has come up with specifically to try and trap the bug…
-
Very good news, I think I found the race. Tricky one… very tricky. And misleading.
Thanks Icer and Killroy for patiently helping me with this.
Fix will be included in U1.
-
-
@Seifer was it anywhere near that mutex lock … or something upstream, or different entirely?
-
@airtex2019 said in 4.37 CTD after several minutes:
@Seifer was it anywhere near that mutex lock … or something upstream, or different entirely?
There was another place in code, completely different, that called the same function and was racing since it was not protected by the mutex.
The CTD itself seems tied to some specific configuration that triggers it more often. I am not sure exactly which configuration that is. Labels seem to trigger it more.
The really tricky part is that if I stored the mutex in a different place, the race disappeared for some reason. So this was totally misleading (or maybe I found a different race, who knows!). It is a huge coincidence, but it seems this changed the timing characteristics enough to make the race go away (just as it never happened to me, for example).
Other curious things about this race: it always happened exactly the same way. The class has several fields, and it was always the same ones that raced (height was fine, width was 0). Also, none of the DMPs caught two threads inside the critical section, which made this a lot trickier to find. Neither of my 2 PCs showed it a single time (I left both running for long periods in 3D, like several hours). None of my testers had this problem (they play MP regularly with up to 5 players).
BTW this is an old race, the fact that we added more parallelism just made it easier to surface. But eventually, it would bite someone.
To catch the race I instrumented the code that crashed in several ways, placing some traps to see who was setting what and which mutex it was using. Then in one of the DMPs I saw: hey, someone set this viewport without a mutex… hummm… Started looking for it and there it was.
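As a rough illustration of the kind of trap described above, here is a minimal C++ sketch. It is not the BMS code: `TrackedMutex`, `Viewport`, `unguardedWrites` and `demoTrap` are hypothetical names made up for this example, and the sketch only counts suspicious writes instead of forcing a crash.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical mutex wrapper that remembers which thread currently owns it.
class TrackedMutex {
public:
    void lock() {
        m_.lock();
        owner_.store(std::this_thread::get_id());
    }
    void unlock() {
        owner_.store(std::thread::id{});  // default id means "no owner"
        m_.unlock();
    }
    // True only if the calling thread is the one holding the lock.
    bool heldByCaller() const {
        return owner_.load() == std::this_thread::get_id();
    }
private:
    std::mutex m_;
    std::atomic<std::thread::id> owner_{std::thread::id{}};
};

// Hypothetical shared object whose setter is supposed to run under a lock.
class Viewport {
public:
    explicit Viewport(TrackedMutex& guard) : guard_(guard) {}

    void set(int w, int h) {
        // Trap: note every write made without the expected mutex held.
        // A debug build could break or crash here to capture a dump instead.
        if (!guard_.heldByCaller()) {
            unguardedWrites_.fetch_add(1);
        }
        width_ = w;
        height_ = h;
    }
    int width() const { return width_; }
    int height() const { return height_; }
    int unguardedWrites() const { return unguardedWrites_.load(); }
private:
    TrackedMutex& guard_;
    std::atomic<int> unguardedWrites_{0};
    int width_ = 0;
    int height_ = 0;
};

// One correctly locked write and one that forgets the lock, run one after
// the other so the demo itself stays free of races.
int demoTrap() {
    TrackedMutex mtx;
    Viewport vp(mtx);

    std::thread good([&] {
        std::lock_guard<TrackedMutex> lock(mtx);
        vp.set(1024, 768);
    });
    good.join();

    std::thread bad([&] { vp.set(0, 768); });  // second call site, no lock
    bad.join();

    return vp.unguardedWrites();  // number of writes the trap flagged
}
```

In this sketch `demoTrap()` flags exactly one write, the one from the call site that skipped the lock. The point of the design is that the check lives inside the setter, so it catches every caller, including ones nobody remembered.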
For those curious about what a race is and how hard it is to fix those:
-
A race condition is when two or more threads operate on the same data without protection (such as a mutex). The result can be anything: a normal result, small errors, or things that make no sense. In this bug, we forced a crash when things did not make sense. It is never good to leave a race unchecked.
-
This bug took me 7 days this time, and 7 iterations with testers (assuming it is really fixed). Record so far is 21 days for a race in 4.36 that happened to only one tester who had a custom cockpit.
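To make the race description above a bit more concrete, here is a small C++ sketch of the pattern: a shared object with one setter that takes a mutex and one that does not. All names (`SharedViewport`, `setUnprotected`, `countTornSnapshots`) are invented for illustration and are not taken from the BMS source.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Viewport {
    int width = 0;
    int height = 0;
};

class SharedViewport {
public:
    // BUG pattern: a call site that skips the mutex. If this runs while
    // another thread reads or writes vp_, the behavior is undefined.
    void setUnprotected(int w, int h) {
        vp_.width = w;
        vp_.height = h;
    }
    // Fix: every writer takes the same mutex.
    void set(int w, int h) {
        std::lock_guard<std::mutex> lock(m_);
        vp_.width = w;
        vp_.height = h;
    }
    // Readers take it too, so they always see a matching pair.
    Viewport snapshot() {
        std::lock_guard<std::mutex> lock(m_);
        return vp_;
    }
private:
    std::mutex m_;
    Viewport vp_;
};

// Two writers and one reader, all going through the mutex. Each writer
// always stores width == height, so a snapshot with width != height would
// be a half-updated ("torn") value.
int countTornSnapshots(int iterations) {
    SharedViewport shared;
    shared.set(1, 1);
    std::thread a([&] {
        for (int i = 0; i < iterations; ++i) shared.set(640, 640);
    });
    std::thread b([&] {
        for (int i = 0; i < iterations; ++i) shared.set(1024, 1024);
    });
    int torn = 0;
    for (int i = 0; i < iterations; ++i) {
        Viewport v = shared.snapshot();
        if (v.width != v.height) ++torn;
    }
    a.join();
    b.join();
    return torn;
}
```

With every access under the mutex, `countTornSnapshots` returns 0. If one of the threads called `setUnprotected` instead, the program would have a data race: it might run fine for hours on one machine and show a half-updated value quickly on another, which matches how differently this bug behaved across PCs.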
-
-
@Seifer well done sir. yes, for those following along… on the 1-10 difficulty scale of software development, this is an 11.
-
@Seifer said in 4.37 CTD after several minutes:
Very good news, I think I found the race. Tricky one… very tricky. And misleading.
Thanks Icer and Killroy for patiently helping me with this.
Fix will be included in U1.
Great work @Seifer, I’ll spend some time today running the latest iteration and let you know!
-
Icer’s first report is promising. Let’s see if Killroy also reports good news.
-
@Seifer said in 4.37 CTD after several minutes:
Icer’s first report is promising. Let’s see if Killroy also reports good news.
Several Campaign and an IA session in, no issues…
-
@Seifer
thank you for your excellent work…
-
@Seifer That’s amazing. Thank you for the fix, and thank you for the fix report, very interesting