Bug 76664 - Metro: Last Light segfaults very often in level 10 (swamp) on loading last checkpoint
Status: RESOLVED WONTFIX
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core
Version: 10.1
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assignee: Tapani Pälli
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 77449
 
Reported: 2014-03-26 23:30 UTC by Darius Spitznagel
Modified: 2016-01-06 16:07 UTC
0 users

See Also:
i915 platform:
i915 features:


Attachments
Metro: LL segfault on loading last checkpoint (20.85 KB, text/plain)
2014-03-26 23:30 UTC, Darius Spitznagel
Savegame Folder (545.58 KB, application/x-bzip)
2014-03-27 16:26 UTC, Darius Spitznagel
Replaying apitrace on intel which was made with amd gpu (183.07 KB, text/plain)
2014-03-28 22:43 UTC, Darius Spitznagel
Apitrace made and replayed on intel IVB (203.72 KB, text/plain)
2014-03-28 22:44 UTC, Darius Spitznagel
gdb debug log (2.24 KB, text/plain)
2014-03-31 11:12 UTC, Tapani Pälli

Description Darius Spitznagel 2014-03-26 23:30:30 UTC
Created attachment 96432 [details]
Metro: LL segfault on loading last checkpoint

Hello Devs,

I very often get segfaults when loading the last checkpoint in the swamp level.
This occurs mostly on the second or third load of the last checkpoint and sometimes, though not often, during a fight.

dmesg shows this after crash (I have collected some of them):
MetroLL[3089]: segfault at 64 ip 08dd401f sp 996eccb0 error 4 in MetroLL[8048000+1336000]
MetroLL[3146]: segfault at 1 ip 08dd401f sp 9a020d50 error 4 in MetroLL[8048000+1336000]
MetroLL[2911]: segfault at 1 ip 08dd401f sp 99eedd50 error 4 in MetroLL[8048000+1336000]
MetroLL[3797]: segfault at 1 ip 08dd401f sp 9a6edd50 error 4 in MetroLL[8048000+1336000]
MetroLL[4363]: segfault at 64 ip 08dd401f sp adadace0 error 4 in MetroLL[8048000+1336000]
MetroLL[4416]: segfault at 1 ip 08dd401f sp 9a820d50 error 4 in MetroLL[8048000+1336000]
MetroLL[2840]: segfault at 1 ip 08dd401f sp 95dedd50 error 4 in MetroLL[8048000+1336000]
MetroLL[2862]: segfault at 1 ip 08dd401f sp ada32ce0 error 4 in MetroLL[8048000+1336000]
MetroLL[2739]: segfault at 1 ip 08dd401f sp 95720d50 error 4 in MetroLL[8048000+1336000]
MetroLL[3276]: segfault at 1 ip 08dd401f sp 99f20d50 error 4 in MetroLL[8048000+1336000]
MetroLL[4004]: segfault at 1 ip 08dd401f sp 9a020d50 error 4 in MetroLL[8048000+1336000]
MetroLL[2727]: segfault at 1 ip 08dd401f sp 99f20d50 error 4 in MetroLL[8048000+1336000]
MetroLL[2803]: segfault at 1 ip 08dd401f sp 99f20d50 error 4 in MetroLL[8048000+1336000]
MetroLL[2695]: segfault at 1 ip 08dd401f sp 9aaedd50 error 4 in MetroLL[8048000+1336000]
MetroLL[2757]: segfault at 64 ip 08dd401f sp 9a51fcb0 error 4 in MetroLL[8048000+1336000]
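
For readers unfamiliar with these kernel messages, here is a minimal Python sketch of how one such line can be decoded. It assumes the usual x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode) and that the bracketed pair after the binary name is the mapping's start address and size. The helper names are illustrative only and are not part of any tool used in this report.

```python
import re

# Matches x86 kernel segfault messages in the form shown in this report.
# All numeric fields except the PID are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_error(code):
    """Decode the low three bits of the x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # False: the page was not mapped
        "write": bool(code & 2),         # False: the access was a read
        "user_mode": bool(code & 4),     # True: fault happened in user space
    }


def parse_segfault(line):
    """Return the decoded fields of one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": ip,
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_object": ip - base,
        "object": m.group("obj"),
        "error": decode_error(int(m.group("err"), 16)),
    }


line = ("MetroLL[3089]: segfault at 64 ip 08dd401f sp 996eccb0 "
        "error 4 in MetroLL[8048000+1336000]")
info = parse_segfault(line)
print(hex(info["fault_addr"]), hex(info["offset_in_object"]), info["error"])
```

Under these assumptions, "error 4" is a user-mode read of an unmapped page, the instruction pointer is identical in every line above, and the fault addresses (0x1 and 0x64) are very small, which would be consistent with a read through a null or nearly null pointer inside the game binary itself.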

I have also attached a backtrace of one crash to this report.
I hope I picked the right threads!
If not, let me know and I will make another backtrace.
As you can see in the attachment, I started Metro LL with STEAM_RUNTIME enabled, so some debugging symbols are missing.
If it helps, I can of course run Metro LL with the runtime disabled and install more debugging symbols for other non-Mesa libs.

My current specs:
Debian Jessie i386
8 GB total RAM
Mesa 10.1.0
Intel Driver 2.99.911
CPU Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz
Xorg 1.15.0
Kernel 3.12.14

I hope this helps you find the problem, because many people (Metro LL Steam community) have it, though a few do not.

As a note:
I've also tested today's Mesa from git, but got new segfaults (more often than with 10.1 during fights) and some textures were broken too.
Comment 1 Tapani Pälli 2014-03-27 05:10:33 UTC
I'll go ahead and bisect
Comment 2 Tapani Pälli 2014-03-27 05:12:16 UTC
(In reply to comment #1)
> I'll go ahead and bisect

Or, let's say, try to reproduce first. Would you happen to have this save game available to share? Or the possibility to make an apitrace of it?
Comment 3 Tapani Pälli 2014-03-27 09:31:15 UTC
The texture corruption is caused by the following commit. Let's try to fix that first and see if it is related to the segfaults.


--- 8< ---

commit 9cd51bb0c4608258199c69bc7738e72f055799d2
Author: Matt Turner <mattst88@gmail.com>
Date:   Tue Mar 11 13:16:37 2014 -0700

    i965/vec4: Eliminate writes that are never read.
    
    With an awful O(n^2) algorithm that searches previous instructions for
    dead writes.
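
To illustrate what a quadratic "search previous instructions for dead writes" pass looks like, here is a minimal Python sketch. It is not the i965 code: the `Inst` type and register names are hypothetical, and it assumes straight-line code, whole-register writes, and instructions without side effects. The real vec4 backend also has to handle writemasks and control flow, which this sketch ignores.

```python
from typing import List, NamedTuple, Tuple


class Inst(NamedTuple):
    """Hypothetical instruction: writes one register, reads zero or more."""
    dst: str
    srcs: Tuple[str, ...]


def eliminate_dead_writes(insts: List[Inst]) -> List[Inst]:
    """Drop writes that are overwritten before anything reads them."""
    dead = set()
    for i, inst in enumerate(insts):
        if inst.dst in inst.srcs:
            # The instruction reads its own destination, so the most
            # recent earlier write to that register is still needed.
            continue
        # Walk backwards over every earlier instruction: O(n^2) overall.
        for j in range(i - 1, -1, -1):
            prev = insts[j]
            if prev.dst == inst.dst:
                # Overwritten at i with no read in between: dead write.
                dead.add(j)
                break
            if inst.dst in prev.srcs:
                # The old value is read here, so earlier writes are live.
                break
    return [inst for k, inst in enumerate(insts) if k not in dead]


program = [
    Inst("r0", ("a",)),        # dead: r0 is overwritten below before any read
    Inst("r1", ("b",)),
    Inst("r0", ("c",)),
    Inst("r2", ("r0", "r1")),
]
for inst in eliminate_dead_writes(program):
    print(inst.dst, "<-", ", ".join(inst.srcs))
```

A pass like this is only safe if the liveness reasoning matches what the hardware actually reads and writes. If a partial write were treated as a full overwrite, for example, a still-needed value could be removed.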
Comment 4 Darius Spitznagel 2014-03-27 12:33:12 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > I'll go ahead and bisect
> 
> Or let's say try to reproduce first .. Would you happen to have this save
> game available for share? Or possiblity to make apitrace of it?

OK, I will upload my full savegame directory as soon as I'm at home.
Do you have an FTP server where I can upload this?
Comment 5 Tapani Pälli 2014-03-27 12:58:42 UTC
(In reply to comment #4)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > I'll go ahead and bisect
> > 
> > Or let's say try to reproduce first .. Would you happen to have this save
> > game available for share? Or possiblity to make apitrace of it?
> 
> OK, I will upload my full savegame directory as soon as I'm at home.
> Do you habe a ftp server where I can upload this?

I'm afraid not :/ Let's see if we can find a share. I was actually able to get one crash, when the alien is first seen, and I think it was just at the start of a cutscene. I will try to reproduce it; it could be the same crash.
Comment 6 Darius Spitznagel 2014-03-27 16:26:54 UTC
Created attachment 96467 [details]
Savegame Folder
Comment 7 Darius Spitznagel 2014-03-27 16:29:29 UTC
OK, I have compressed my savegame folder as tar.bz2 and attached it. It isn't that big.
Hope you can reproduce my crashes.
Comment 8 Tapani Pälli 2014-03-27 18:00:48 UTC
(In reply to comment #7)
> OK,I have compressed my savegame folder as tar.bz2 and attached it. It isn't
> that big.
> Hope you can reproduce my crashes.

thanks, I'll take a shot
Comment 9 Tapani Pälli 2014-03-28 08:48:46 UTC
Yes, I'm able to reproduce the crash, will try to get backtrace.
Comment 10 Tapani Pälli 2014-03-28 13:05:21 UTC
(In reply to comment #9)
> Yes, I'm able to reproduce the crash, will try to get backtrace.

I'm not able to get a 'stable' backtrace (or the same one as Darius's); not sure what's wrong. Either something is trashing the memory or there is a problem with the symbols of my libs. Will try some more. The crash itself is fully reproducible now, which is great.

It looks like the crasher has been there for a very long time, it just wasn't seen until now.

A bit more info on comment #3: it might not be the fault of this exact commit; it could be that this commit simply reveals a bug elsewhere. By simply returning false from dead_code_eliminate(), all the artifacts disappear.
Comment 11 Darius Spitznagel 2014-03-28 18:59:38 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > Yes, I'm able to reproduce the crash, will try to get backtrace.
> 
> I'm not able to get 'stable' (or same as darius's) backtrace, not sure what
> wrong. Either something trashing the memory or problem with symbols of my
> libs. Will try some more. The crash itself is fully reproducible now which
> is great.
> 
> It looks like the crasher has been there for a very long time, it just
> wasn't seen until now.
> 
> A bit more info to comment #3, it might not be the fault of this exact
> commit but could be that this commit simply reveals a bug elsewhere. By
> simply returning false from dead_code_eliminate() all the artifacts
> disappear.

I will try testing Mesa from git with commit 9cd51bb0c4608258199c69bc7738e72f055799d2 reverted and report back.

I have also found some interesting things...
I took a PC with an AMD GPU at work and restored my system onto it via fsarchiver.
So the OS is absolutely the same.

I started MetroLL and played for many minutes, dying many times, without a single crash.
So we now know that the crashes definitely occur on the Intel iGPU (IVB on my side).
Hope this helps to narrow the problem down.

The specs with this system are:
Intel(R) Core(TM)2 Duo CPU     E7400  @ 2.80GHz (game ran slow but worked)
4GB RAM

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cedar [Radeon HD 5000/6000/7350/8350 Series] (prog-if 00 [VGA controller])
	Subsystem: Hightech Information System Ltd. Device 2291
	Flags: bus master, fast devsel, latency 0, IRQ 45
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at e0100000 (64-bit, non-prefetchable) [size=128K]
	I/O ports at 2000 [size=256]
	Expansion ROM at e0140000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: radeon

darius@pc1:~$ glxinfo | grep OpenGL
OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD CEDAR
OpenGL core profile version string: 3.3 (Core Profile) Mesa 10.1.0
OpenGL core profile shading language version string: 3.30
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 10.1.0
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:

After THAT I made an apitrace of the render scene at start. Besides the known flash of the outside world (or something different), I did not see a single error or warning during rendering!

This is all what apitrace showed...

darius@pc1:~/Downloads$ apitrace replay MetroLL_amd.trace 
0 57 glXSwapIntervalMESA(interval = 0) = 0
57: warning: unsupported glXSwapIntervalMESA call
1 143012 glXSwapIntervalMESA(interval = 0) = 0
143012: warning: unsupported glXSwapIntervalMESA call
Rendered 506 frames in 32.006 secs, average of 15.8096 fps

On Intel I have many many warnings...

6650435: glDebugOutputCallback: Medium severity API performance issue 9, Stalling on the GPU for mapping a busy buffer object
16652177: glDebugOutputCallback: Medium severity API performance issue 12, Flushing before mapping a referenced bo.
16652177: glDebugOutputCallback: Medium severity API performance issue 11, Mapping a busy BO, causing a stall on the GPU.

I will make fresh apitraces of both systems so you can investigate them.
But right now I'm short on time.
Will be back in some hours.
Comment 12 Darius Spitznagel 2014-03-28 22:43:12 UTC
Created attachment 96571 [details]
Replaying apitrace on intel which was made with amd gpu
Comment 13 Darius Spitznagel 2014-03-28 22:44:31 UTC
Created attachment 96572 [details]
Apitrace made and replayed on intel IVB
Comment 14 Darius Spitznagel 2014-03-28 23:12:01 UTC
I'm back:)

So far, I have replayed the apitrace which I made with the AMD GPU (see comment 11) on my IVB PC and got the results attached in https://bugs.freedesktop.org/attachment.cgi?id=96571

After that I made an apitrace of the same kind (render scene from the start of MetroLL until the game menu) on my Intel IVB and got the results attached in https://bugs.freedesktop.org/attachment.cgi?id=96572

Both have nearly the same output.
Solving these issues will definitely speed up MetroLL and maybe also solve the segfaults.
To be clear: the segfaults are reproducible when loading the last checkpoint during gameplay (NOT on first load) and happen MOSTLY ON THE SECOND or third reload.

As written in my previous comment, I will try current Mesa with commit 9cd51bb0c4608258199c69bc7738e72f055799d2 reverted and report later.
Comment 15 Darius Spitznagel 2014-03-29 00:55:41 UTC
As promised, I tried the following...

Mesa git until commit 4047263cb15e89d23cb145c74fb3f303904e8f14 > broken textures, same segfaults.

Mesa git before commit 9cd51bb0c4608258199c69bc7738e72f055799d2 > textures OK, same segfaults.

Mesa 10.0.x > textures OK, same segfaults.

What the heck leads an OpenGL app to crash on the second or third load of the same data?! Wrong memory or buffer allocation?!

I think its clear that there is something wrong with intel drm, ddx org glx driver.
Comment 16 Darius Spitznagel 2014-03-29 15:02:30 UTC
oooops!

> I think its clear that there is something wrong with intel drm, ddx org glx driver.

I meant intel drm, ddx or dri driver.
Comment 17 Tapani Pälli 2014-03-30 15:15:36 UTC
For the texture corruption issues: there has been another bug in the same area of the code; the issues might be related to bug #76616.
Comment 18 Darius Spitznagel 2014-03-30 21:48:38 UTC
(In reply to comment #17)
> for the texture corruption issues ... there has been another bug on the same
> area of the code, the issues might be related to bug #76616

Indeed, the patch included in bug #76616 fixed the texture corruption with mesa git master.

I also have other news for you.
At first I didn't want to mention it because it's off topic and, second, it's a beta, but...

Painkiller HD has similar crashes to Metro LL. It sometimes crashes on loading a save game and more often during gameplay (especially in the train station level).

The segfaults also report "error 4"...

[  350.235524] MetroLL[2652]: segfault at 1 ip 08dd401f sp 9ac20d50 error 4 in MetroLL[8048000+1336000]

[  645.899404] PKHDGame[2705]: segfault at 0 ip 084493ba sp bfc55f00 error 4 in PKHDGame[8048000+20d3000]
[  845.409142] PKHDGame[2771]: segfault at 0 ip 084493ba sp bfcc1d20 error 4 in PKHDGame[8048000+20d3000]
[ 1307.553163] PKHDGame[2855]: segfault at 0 ip 09000bf2 sp bfbe8c10 error 4 in PKHDGame[8048000+20d3000]

Beside this, Painkiller HD has a nice Launch.log.
This one ALWAYS tells on crash...

[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505
[0277.99] Critical: Error reentered: OpenGL error 0x505

I hope these crashes are related to the same problem Metro LL has.
If not, sorry I did not open another bug report.
Comment 19 Darius Spitznagel 2014-03-30 22:09:19 UTC
Very interesting, look here...

https://bugs.freedesktop.org/show_bug.cgi?id=74868

Especially this one...

<<<<<<<<<<<
Mesa: User error: GL_OUT_OF_MEMORY in glCompressedTexSubImage2D
err:d3d:wined3d_debug_callback 0x1c8178: "GL_OUT_OF_MEMORY in glCompressedTexSubImage2D".
err:d3d_surface:surface_upload_data >>>>>>>>>>>>>>>>> GL_OUT_OF_MEMORY (0x505) from glCompressedTexSubImage2DARB @ ../../../wine-1.7.12/dlls/wined3d/surface.c / 1688
EE r600_texture.c:1003 r600_texture_transfer_map - failed to create temporary texture to hold untiled copy
>>>>>>>>>>>

It also mentions the same error code (0x505) as Painkiller.
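
For reference, the numeric codes returned by glGetError map to names as in the small Python lookup below. The constants are the standard OpenGL error enums; the helper function name is only illustrative.

```python
# Standard OpenGL error codes as returned by glGetError().
GL_ERROR_NAMES = {
    0x0000: "GL_NO_ERROR",
    0x0500: "GL_INVALID_ENUM",
    0x0501: "GL_INVALID_VALUE",
    0x0502: "GL_INVALID_OPERATION",
    0x0503: "GL_STACK_OVERFLOW",
    0x0504: "GL_STACK_UNDERFLOW",
    0x0505: "GL_OUT_OF_MEMORY",
    0x0506: "GL_INVALID_FRAMEBUFFER_OPERATION",
}


def gl_error_name(code):
    """Translate a numeric GL error code into its symbolic name."""
    return GL_ERROR_NAMES.get(code, "unknown (0x%04x)" % code)


# The code reported in the Painkiller HD Launch.log above.
print(gl_error_name(0x505))
```

So the 0x505 in the Painkiller HD log is GL_OUT_OF_MEMORY, the same error named in the Wine output quoted above.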

A memory problem with Metro LL too sounds likely, as I already mentioned in comment 15.

Look also at comment 14 in bug 74868!
Unfortunately the patch is only for r600 :(
Comment 20 Matt Turner 2014-03-30 22:14:46 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > for the texture corruption issues ... there has been another bug on the same
> > area of the code, the issues might be related to bug #76616
> 
> Indeed, the patch included in bug #76616 fixed the texture corruption with
> mesa git master.

Thanks for testing. In the future please try to keep bug reports separate. Otherwise it gets pretty confusing.
Comment 21 Tapani Pälli 2014-03-31 10:10:12 UTC
I'm now able to reproduce the same backtrace as Darius has. The thread that is working with the Mesa stack is waiting for an ioctl and I cannot see anything bad going on there. Unfortunately the thread that segfaults does not have symbols, and to me it looks like there is a bug in the game itself, maybe related to threads. Using disassemble with the address in the backtrace and 'info registers', one can see that the game is accessing something at an offset of 1, maybe a struct member (?). Memory usage is not very high; for me, for example, it is 695396 kB at the time of the crash.

I will still verify these observations.
Comment 22 Tapani Pälli 2014-03-31 11:12:53 UTC
Created attachment 96650 [details]
gdb debug log

Here's some gdb log output. The segfaulting thread seems to always end up in the same place, accessing an array (or struct?) at some specified offset (stored in eax), and the member is null.
Comment 23 Darius Spitznagel 2014-04-01 18:52:32 UTC
@Tapani: Can you clarify this?

http://steamcommunity.com/groups/steamuniverse/announcements/detail/1837773658991804782

<<<<
Fixed "Metro: Last Light" on Intel graphics by backporting GLX support for ARB_create_context from newer X servers
>>>>

???
Comment 24 Tapani Pälli 2014-04-02 06:56:17 UTC
(In reply to comment #23)
> @Tapani: Can you clarify this?
> 
> http://steamcommunity.com/groups/steamuniverse/announcements/detail/
> 1837773658991804782
> 
> <<<<
> Fixed "Metro: Last Light" on Intel graphics by backporting GLX support for
> ARB_create_context from newer X servers
> >>>>
> 
> ???

This is not related to this bug.

With this extension an application can create a GL context in a very fine-grained way, specifying the version and required features it wants to use. It looks like SteamOS is still using an older version of the X server that does not support the extension, but Valve backported patches to have the support in place. They did not want to do a full X server upgrade.
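
As a rough illustration of what "fine-grained" means here, the Python sketch below builds the kind of zero-terminated attribute list that an application passes to glXCreateContextAttribsARB. The token values come from the GLX_ARB_create_context and GLX_ARB_create_context_profile specifications. The helper name is hypothetical, and the 3.3 core request is only an example, not necessarily what Metro: Last Light asks for.

```python
# Attribute tokens from GLX_ARB_create_context and
# GLX_ARB_create_context_profile.
GLX_CONTEXT_MAJOR_VERSION_ARB = 0x2091
GLX_CONTEXT_MINOR_VERSION_ARB = 0x2092
GLX_CONTEXT_FLAGS_ARB = 0x2094
GLX_CONTEXT_PROFILE_MASK_ARB = 0x9126
GLX_CONTEXT_DEBUG_BIT_ARB = 0x0001
GLX_CONTEXT_CORE_PROFILE_BIT_ARB = 0x0001


def context_attribs(major, minor, core_profile=True, debug=False):
    """Build a name/value attribute list terminated by 0 (None)."""
    attribs = [
        GLX_CONTEXT_MAJOR_VERSION_ARB, major,
        GLX_CONTEXT_MINOR_VERSION_ARB, minor,
    ]
    if core_profile:
        attribs += [GLX_CONTEXT_PROFILE_MASK_ARB,
                    GLX_CONTEXT_CORE_PROFILE_BIT_ARB]
    if debug:
        attribs += [GLX_CONTEXT_FLAGS_ARB, GLX_CONTEXT_DEBUG_BIT_ARB]
    attribs.append(0)  # terminator expected by the GLX call
    return attribs


# Example: request an OpenGL 3.3 core profile context (illustrative values).
print([hex(v) for v in context_attribs(3, 3)])
```

Without the extension on the X server side, an application can only use the older glXCreateContext-style calls, which offer no way to request a specific version or profile.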
Comment 25 Darius Spitznagel 2016-01-05 22:33:56 UTC
This bug report can be closed!
I have opened a new one for Metro: Last Light Redux and Metro 2033 Redux.
https://bugs.freedesktop.org/show_bug.cgi?id=93599

Thanks
Darius
Comment 26 Tapani Pälli 2016-01-06 16:07:24 UTC
Closing as WONTFIX, as this segfault was in Metro code (see comment #22); it would be *very* hard to track down without Metro symbols.

