Bug 13507

Summary:	Intermittent GPU crashes with compiz
Product:	Mesa	Reporter:	Ben Gamari <bgamari>
Component:	Drivers/DRI/i830	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	medium	CC:	nian.wu
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	An Xorg of a crashed session; Another Xorg log; GDB log of Xorg crashing; A secondary crash; patch to avoid calling fb->Delete if NULL

Description Ben Gamari 2007-12-03 21:47:25 UTC

Every few hours while running compiz, the driver seems to generate an invalid operation or page table entry, causing system lockup and screen corruption.

Comment 1 Ben Gamari 2007-12-03 21:47:53 UTC

Created attachment 12923 [details]
An Xorg of a crashed session

Comment 2 Ben Gamari 2007-12-05 21:15:47 UTC

Created attachment 12973 [details]
Another Xorg log

This one is probably very similar to the first. The log gives very little useful information.

Comment 3 Ben Gamari 2007-12-05 22:19:46 UTC

Could someone offer some insight into how to approach this damn bug? X just crashed multiple times in the last 5 minutes (once while trying to select metacity as my window manager). The bug almost always manifests itself with a "page table error, instruction error" while compiz is rendering a blend or animation, leading to screen corruption and an eventual crash of X. Unfortunately, this has not once produced a stack trace.

Moreover, the bug seems to occur almost entirely randomly, so reproducing it while attached to a remote debugger has been extremely difficult. Any ideas to aid diagnostics? Thanks.

Comment 4 Jesse Barnes 2007-12-06 10:11:01 UTC

Is there a way of getting a core dump?

Nian, is this something you can reproduce?

Comment 5 Ben Gamari 2007-12-06 12:13:52 UTC

Created attachment 12980 [details]
GDB log of Xorg crashing

Couldn't get a backtrace from Xorg itself (where do backtraces of Xorg running under gdm go?), but here's a gdb backtrace, etc.

Comment 6 Ben Gamari 2007-12-06 12:16:05 UTC

I think my cast was wrong in the last gdb dump. All this information should be available anyway in the dump of *data but nevertheless, here's another attempt:

(gdb) print ((struct gl_framebuffer*)data)->Delete
$5 = (void (*)(struct gl_framebuffer *)) 0
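
For anyone reading this in C rather than gdb, the expression above amounts to casting the opaque data pointer to a framebuffer pointer and then reading its Delete member. A minimal sketch, assuming a cut-down stand-in struct (fb_view and fb_has_delete are hypothetical names for illustration, not Mesa's):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct gl_framebuffer. */
struct fb_view {
    void (*Delete)(struct fb_view *fb);   /* destructor hook, may be NULL */
};

/* Cast the opaque pointer to a struct pointer first, then read the member.
 * This mirrors ((struct gl_framebuffer*)data)->Delete in the gdb session. */
static bool fb_has_delete(const void *data)
{
    const struct fb_view *fb = (const struct fb_view *)data;
    return fb != NULL && fb->Delete != NULL;
}
```

With the value printed above, a check like this would report false for the crashing object.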

Comment 7 Jesse Barnes 2007-12-06 12:23:36 UTC

I wonder if this is a DUP of 13545...

Comment 8 Ben Gamari 2007-12-06 12:25:23 UTC

Created attachment 12981 [details]
A secondary crash

After collecting the last gdb dump, I had gdb continue running the crashing Xorg process, expecting it to die at least fairly gracefully after catching the first signal. Much to my surprise, it caught another SIGSEGV at this point, apparently while killing the server ( AbortServer() -> AbortDDX() -> glxDRILeaveVT() -> glxSuspendClients() -> IgnoreClient() ). It seems like the osPrivate field of the client argument isn't set, causing a null pointer dereference when the connection fd is read through it.
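
To illustrate the failure mode only: the sketch below shows a lookup that reads the connection fd through osPrivate, with a guard for the unset case. It assumes simplified stand-in types (os_comm_stub, client_stub, and client_fd_or_invalid are hypothetical names, not the X server's actual definitions):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the server's per-client OS data. */
struct os_comm_stub { int fd; };
struct client_stub { void *osPrivate; };

/* Returns the connection fd, or -1 when osPrivate was never set.
 * Without the NULL check, reading oc->fd would fault as described. */
static int client_fd_or_invalid(const struct client_stub *client)
{
    const struct os_comm_stub *oc;

    if (client == NULL || client->osPrivate == NULL)
        return -1;

    oc = (const struct os_comm_stub *)client->osPrivate;
    return oc->fd;
}
```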

Perhaps this explains why I rarely had error messages in the Xorg logs and never had Xorg produce a backtrace. Should I open another bug about this?

Thanks,
- Ben

Comment 9 Brian Paul 2007-12-06 15:32:46 UTC

Created attachment 12984 [details] [review]
patch to avoid calling fb->Delete if NULL

Can you try the attached patch?

Also, could you print *fb from in delete_framebuffer_cb() when it crashes?
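
The attached patch itself is not reproduced here. As a rough sketch of the kind of guard it describes, assuming a cut-down stand-in struct (fb_stub, fb_stub_delete, and delete_framebuffer_guarded are hypothetical names, not the code in the attachment):

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct gl_framebuffer. */
struct fb_stub {
    int deleted;                          /* set by the destructor below */
    void (*Delete)(struct fb_stub *fb);   /* may be NULL, as seen in gdb */
};

/* Example destructor used to exercise the guard. */
static void fb_stub_delete(struct fb_stub *fb)
{
    fb->deleted = 1;
}

/* Sketch of a delete callback that skips a NULL Delete hook.
 * Returns 1 if the destructor ran, 0 if it was skipped. */
static int delete_framebuffer_guarded(void *data)
{
    struct fb_stub *fb = (struct fb_stub *)data;

    if (fb == NULL || fb->Delete == NULL)
        return 0;   /* nothing safe to call */

    fb->Delete(fb);
    return 1;
}
```

A guard like this avoids the call through a null function pointer, but it does not explain why the object had no Delete hook in the first place.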

Comment 10 Ben Gamari 2007-12-06 19:28:33 UTC

(In reply to comment #9)
> Created an attachment (id=12984) [details]
> patch to avoid calling fb->Delete if NULL
> 
> Can you try the attached patch?
> 
> Also, could you print *fb from in delete_framebuffer_cb() when it crashes?
> 

You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980 [details], which should be equivalent to fb (the compiler optimized out the fb local variable, thus the indirect reference).

Insofar as I can't reproduce the problem anymore, the problem is ostensibly fixed. Nevertheless, should we be worried that the function was being called with an incompletely filled out gl_framebuffer?

Moreover, the secondary crash is still lurking in the source. Is it possible that this secondary crash was just caused by the first crash, or could it represent an actual bug?

Comment 11 Brian Paul 2007-12-14 13:59:28 UTC

> You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980 [details]

Unfortunately, your casting there was incorrect.  Try this instead:
    print *fb

I have a feeling that the patch I gave you is just hiding a deeper issue.

Comment 12 Ben Gamari 2007-12-14 14:16:11 UTC

(In reply to comment #11)
> > You'll find that I print'ed (struct gl_framebuffer)data in attachment #12980 [details]
> 
> Unfortunately, your casting there was incorrect.  Try this instead:
>     print *fb
> 
> I have a feeling that the patch I gave you is just hiding a deeper issue.
> 
Sorry, I might not be able to continue working on this bug. My laptop died and will be replaced with an i965 model. How likely is it that this bug is common to both chipsets?

Comment 13 Brian Paul 2008-01-30 07:13:55 UTC

I've committed the fb->Delete != NULL check.
Note that bug 14293 is the same.
