Every few hours while running compiz, the driver seems to generate an invalid operation or page table entry, causing system lockup and screen corruption.
Created attachment 12923 [details]
An Xorg log of a crashed session
Created attachment 12973 [details]
Another Xorg log
This one is probably very similar to the first. Very little useful information is given in the log.
Could someone offer some insight into how to approach this damn bug? X just crashed multiple times in the last 5 minutes (once while trying to select metacity as my window manager). The bug almost always manifests itself as a "page table error, instruction error" while compiz is rendering a blend or animation, leading to screen corruption and an eventual crash of X. Unfortunately, it has not once produced a stack trace.
Moreover, the bug seems to occur almost entirely randomly, so reproducing it while attached to a remote debugger has been extremely difficult. Any ideas to aid diagnostics? Thanks.
Is there a way of getting a core dump?
Nian, is this something you can reproduce?
Created attachment 12980 [details]
GDB log of Xorg crashing
Couldn't get a backtrace from Xorg itself (where do backtraces of Xorg running under gdm go?), but here's a gdb backtrace, etc.
I think my cast was wrong in the last gdb dump. All this information should be available anyway in the dump of *data but nevertheless, here's another attempt:
(gdb) print ((struct gl_framebuffer*)data)->Delete
$5 = (void (*)(struct gl_framebuffer *)) 0
I wonder if this is a DUP of 13545...
Created attachment 12981 [details]
A secondary crash
After collecting the last gdb dump, I had gdb continue running through the crashing Xorg process, expecting it to die at least fairly gracefully after catching the first signal. Much to my surprise, it caught another SIGSEGV at this point, apparently while killing the server ( AbortServer() -> AbortDDX() -> glxDRILeaveVT() -> glxSuspendClients() -> IgnoreClient() ). It seems that the osPrivate field of the client argument isn't set, causing a null pointer dereference when the connection fd is read through it.
Perhaps this explains why I rarely had error messages in the Xorg logs and never had Xorg produce a backtrace. Should I open another bug about this?
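To illustrate the failure mode described above, here is a minimal sketch of a guard against an unset osPrivate. The struct layouts and the helper name client_fd_or_invalid are hypothetical stand-ins, not the X server's real definitions; whether the server should guard here or guarantee that osPrivate is always set is a separate question.

```c
#include <stddef.h>

/* Hypothetical stand-ins; NOT the X server's real types. */
struct os_comm {
    int fd;              /* connection file descriptor */
};

struct client {
    void *osPrivate;     /* may be NULL if never initialised */
};

/* Returns the connection fd, or -1 when the client carries no
 * OS-private data, instead of dereferencing a NULL pointer. */
static int client_fd_or_invalid(const struct client *client)
{
    const struct os_comm *oc;

    if (client == NULL || client->osPrivate == NULL)
        return -1;

    oc = (const struct os_comm *)client->osPrivate;
    return oc->fd;
}
```

A caller on the shutdown path could skip any client for which this returns -1 rather than crashing a second time.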
Created attachment 12984 [details] [review]
patch to avoid calling fb->Delete if NULL
Can you try the attached patch?
Also, could you print *fb from in delete_framebuffer_cb() when it crashes?
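For readers following along, the idea behind the check can be sketched roughly as below. This is not the attached patch: struct framebuffer is a cut-down, hypothetical stand-in for gl_framebuffer, and the callback signature and return value are assumptions made so the sketch stands alone.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct gl_framebuffer. */
struct framebuffer {
    unsigned name;
    void (*Delete)(struct framebuffer *fb);  /* NULL if never filled in */
};

static int delete_calls = 0;

/* Example Delete hook that only counts how often it is invoked. */
static void count_delete(struct framebuffer *fb)
{
    (void)fb;
    delete_calls++;
}

/* Callback in the spirit of delete_framebuffer_cb(): invoke the
 * Delete hook only when it is set. Returns 1 if the hook ran and
 * 0 if the entry was skipped. */
static int delete_framebuffer_guarded(void *data)
{
    struct framebuffer *fb = (struct framebuffer *)data;

    if (fb == NULL || fb->Delete == NULL)
        return 0;

    fb->Delete(fb);
    return 1;
}
```

Note that a guard like this only avoids the jump through a NULL function pointer; it does not explain why a framebuffer with an unset Delete hook was in the table in the first place.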
(In reply to comment #9)
> Created an attachment (id=12984) [details]
> patch to avoid calling fb->Delete if NULL
> Can you try the attached patch?
> Also, could you print *fb from in delete_framebuffer_cb() when it crashes?
You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980 [details], which should be equivalent to fb (the compiler optimized out the fb local variable, thus the indirect reference).
Insofar as I can't reproduce the problem anymore, the problem is ostensibly fixed. Nevertheless, should we be worried that the function was being called with an incompletely filled out gl_framebuffer?
Moreover, the secondary crash is still lurking in the source. Is it possible that this secondary crash was just caused by the first crash, or could it represent an actual bug?
> You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980 [details]
Unfortunately, your casting there was incorrect. Try this instead:
print *fb
I have a feeling that the patch I gave you is just hiding a deeper issue.
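On the cast itself: in C, a cast to a struct type such as (struct gl_framebuffer)data is not valid when data is a pointer, since casts require scalar types. The object is reached by casting to the pointer type and then dereferencing. A minimal sketch, using a hypothetical cut-down struct rather than the real gl_framebuffer:

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct gl_framebuffer. */
struct framebuffer {
    unsigned name;
    void (*Delete)(struct framebuffer *fb);
};

/* Recover the struct behind an opaque pointer: cast the pointer to
 * the right pointer type first, then dereference it. */
static struct framebuffer copy_from_opaque(void *data)
{
    return *(struct framebuffer *)data;
}
```

In gdb the corresponding expression would be print *(struct gl_framebuffer *)data, which should show the same object as print *fb when fb has not been optimized out.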
(In reply to comment #11)
> > You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980 [details]
> Unfortunately, your casting there was incorrect. Try this instead:
> print *fb
> I have a feeling that the patch I gave you is just hiding a deeper issue.
Sorry, I might not be able to continue working on this bug. My laptop died and will be replaced with an i965 model. How likely is it that this bug is common to both chipsets?
I've committed the fb->Delete != NULL check.
Note that bug 14293 is the same.