Bug 13507 - Intermittent GPU crashes with compiz
Status: RESOLVED FIXED
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i830
Version: unspecified
Hardware: Other All
Importance: medium normal
Assigned To: Default DRI bug account
Depends on:
Blocks:
Reported: 2007-12-03 21:47 UTC by Ben Gamari
Modified: 2008-01-30 07:13 UTC (History)

See Also:
i915 platform:
i915 features:


Attachments
An Xorg of a crashed session (35.88 KB, text/plain)
2007-12-03 21:47 UTC, Ben Gamari
Another Xorg log (38.21 KB, text/plain)
2007-12-05 21:15 UTC, Ben Gamari
GDB log of Xorg crashing (7.40 KB, text/plain)
2007-12-06 12:13 UTC, Ben Gamari
A secondary crash (2.83 KB, text/plain)
2007-12-06 12:25 UTC, Ben Gamari
patch to avoid calling fb->Delete if NULL (372 bytes, patch)
2007-12-06 15:32 UTC, Brian Paul

Description Ben Gamari 2007-12-03 21:47:25 UTC
Every few hours while running compiz, the driver seems to generate an invalid operation or page table entry, causing system lockup and screen corruption.
Comment 1 Ben Gamari 2007-12-03 21:47:53 UTC
Created attachment 12923
An Xorg of a crashed session
Comment 2 Ben Gamari 2007-12-05 21:15:47 UTC
Created attachment 12973
Another Xorg log

This one is probably very similar to the first. The log gives very little useful information.
Comment 3 Ben Gamari 2007-12-05 22:19:46 UTC
Could someone offer some insight into how to approach this damn bug? X just crashed multiple times in the last 5 minutes (once while trying to select metacity as my window manager). The bug almost always manifests itself with a "page table error, instruction error" while compiz is rendering a blend or animation, leading to screen corruption and an eventual crash of X. Unfortunately, this has not once provided a stack trace.

Moreover, the bug seems to occur almost entirely randomly, so reproducing it while attached to a remote debugger has been extremely difficult. Any ideas to aid diagnostics? Thanks.
Comment 4 Jesse Barnes 2007-12-06 10:11:01 UTC
Is there a way of getting a core dump?

Nian, is this something you can reproduce?
Comment 5 Ben Gamari 2007-12-06 12:13:52 UTC
Created attachment 12980
GDB log of Xorg crashing

Couldn't get a backtrace from Xorg itself (where do backtraces of Xorg running under gdm go?), but here's a gdb backtrace, etc.
Comment 6 Ben Gamari 2007-12-06 12:16:05 UTC
I think my cast was wrong in the last gdb dump. All this information should be available anyway in the dump of *data but nevertheless, here's another attempt:

(gdb) print ((struct gl_framebuffer*)data)->Delete
$5 = (void (*)(struct gl_framebuffer *)) 0
Comment 7 Jesse Barnes 2007-12-06 12:23:36 UTC
I wonder if this is a DUP of 13545...
Comment 8 Ben Gamari 2007-12-06 12:25:23 UTC
Created attachment 12981
A secondary crash

After collecting the last gdb dump, I had gdb continue running the crashing Xorg process, expecting it to die at least fairly gracefully after catching the first signal. Much to my surprise, it caught another SIGSEGV at this point, apparently while killing the server ( AbortServer() -> AbortDDX() -> glxDRILeaveVT() -> glxSuspendClients() -> IgnoreClient() ). It seems like the osPrivate field of the client argument isn't set, causing a null pointer dereference when the connection fd is read through it.

Perhaps this explains why I rarely had error messages in the Xorg logs and never had Xorg produce a backtrace. Should I open another bug about this?

Thanks,
- Ben
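
For illustration, the failure mode described above (reading a connection fd through an unset osPrivate pointer) can be sketched in C as follows. This is a minimal sketch, not the X server's code: `demo_client`, `demo_os_comm` and `demo_client_fd` are hypothetical stand-ins, and only the idea of checking the pointer before following it is taken from the description.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-connection OS data that holds the fd. */
struct demo_os_comm {
    int fd;
};

/* Hypothetical stand-in for a client record; osPrivate may be unset. */
struct demo_client {
    void *osPrivate;
};

/* Return the client's connection fd, or -1 if the client (or its
 * osPrivate pointer) is NULL. Without the check, the read of oc->fd
 * would dereference a null pointer. */
static int demo_client_fd(const struct demo_client *client)
{
    const struct demo_os_comm *oc;

    if (client == NULL || client->osPrivate == NULL)
        return -1;

    oc = (const struct demo_os_comm *)client->osPrivate;
    return oc->fd;
}
```

A caller that gets -1 back can skip the client instead of faulting during shutdown. Whether skipping is the right behaviour for the real server is a separate question.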
Comment 9 Brian Paul 2007-12-06 15:32:46 UTC
Created attachment 12984
patch to avoid calling fb->Delete if NULL

Can you try the attached patch?

Also, could you print *fb from within delete_framebuffer_cb() when it crashes?
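
As a rough illustration of the kind of guard such a patch adds, here is a minimal C sketch. It is not the attached patch and not Mesa's code: `demo_framebuffer`, `demo_delete` and `demo_delete_framebuffer_cb` are hypothetical names, and the only thing carried over from this bug is the idea of skipping the call when the Delete function pointer is NULL.

```c
#include <stddef.h>

/* Hypothetical stand-in for the framebuffer struct; only the Delete
 * hook matters for this illustration. */
struct demo_framebuffer {
    int name;
    void (*Delete)(struct demo_framebuffer *fb);
};

/* Counts how often the hook actually ran (demonstration only). */
static int demo_delete_calls = 0;

static void demo_delete(struct demo_framebuffer *fb)
{
    (void)fb;
    demo_delete_calls++;
}

/* Table-walk style callback: the entry arrives as an untyped pointer.
 * Returns 1 if the Delete hook was invoked, 0 if it was skipped. */
static int demo_delete_framebuffer_cb(void *data)
{
    struct demo_framebuffer *fb = (struct demo_framebuffer *)data;

    /* Calling through a NULL function pointer would crash, so skip it. */
    if (fb == NULL || fb->Delete == NULL)
        return 0;

    fb->Delete(fb);
    return 1;
}
```

A guard like this stops the crash, but it does not explain how an entry ended up with an unset Delete hook in the first place.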
Comment 10 Ben Gamari 2007-12-06 19:28:33 UTC
(In reply to comment #9)
> Created an attachment (id=12984)
> patch to avoid calling fb->Delete if NULL
> 
> Can you try the attached patch?
> 
> Also, could you print *fb from within delete_framebuffer_cb() when it crashes?
> 

You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980, which should be equivalent to fb (the compiler optimized out the fb local variable, thus the indirect reference).

Insofar as I can't reproduce the problem anymore, the problem is ostensibly fixed. Nevertheless, should we be worried that the function was being called with an incompletely filled out gl_framebuffer?

Moreover, the secondary crash is still lurking in the source. Is it possible that this secondary crash was just caused by the first crash, or could it represent an actual bug?
Comment 11 Brian Paul 2007-12-14 13:59:28 UTC
> You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980

Unfortunately, your casting there was incorrect.  Try this instead:
    print *fb

I have a feeling that the patch I gave you is just hiding a deeper issue.
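
To spell out the casting point: when a callback only has an untyped `data` pointer, the pointer has to be cast to a pointer-to-struct type and then dereferenced. Casting the pointer value straight to the struct type does not give the struct contents. The C sketch below uses a hypothetical `demo_framebuffer` with made-up fields. If the `fb` local has been optimized out, the analogous gdb expression would presumably be `print *(struct gl_framebuffer *) data`.

```c
/* Hypothetical stand-in struct; the field names are made up for the demo. */
struct demo_framebuffer {
    int Name;
    int Width;
    int Height;
};

/* The callback argument is untyped, as in a generic table-walk callback. */
static int demo_width_from_cb_arg(void *data)
{
    /* Cast the pointer to the right pointer type first, then dereference.
     * Writing (struct demo_framebuffer)data instead is not valid C. */
    struct demo_framebuffer *fb = (struct demo_framebuffer *)data;

    return fb->Width;
}
```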
Comment 12 Ben Gamari 2007-12-14 14:16:11 UTC
(In reply to comment #11)
> > You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980
> 
> Unfortunately, your casting there was incorrect.  Try this instead:
>     print *fb
> 
> I have a feeling that the patch I gave you is just hiding a deeper issue.
> 
Sorry, I might not be able to continue working on this bug. My laptop died and will be replaced with an i965 model. How likely is it that this bug is common to both chipsets?
Comment 13 Brian Paul 2008-01-30 07:13:55 UTC
I've committed the fb->Delete != NULL check.
Note that bug 14293 is the same.