20163 – [uxa g45] xorg randomly hang uxa_copy_area/drm_intel_gem_bo_start_gtt_access


Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	Other All

Importance:	medium critical
Assignee:	Eric Anholt
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:	NEEDINFO

Depends on:
Blocks:

Reported:	2009-02-17 07:20 UTC by martin
Modified:	2009-07-09 18:20 UTC
CC List:	2 users

See Also:
i915 platform:
i915 features:

Attachments
gdb backtrace (7.45 KB, text/plain) 2009-02-17 07:20 UTC, martin
dmesg (while hung) (42.79 KB, text/plain) 2009-02-17 07:21 UTC, martin
gem_objects (while hung) (104 bytes, text/plain) 2009-02-17 07:21 UTC, martin
intel_reg_dumper (while hung) (7.68 KB, text/plain) 2009-02-17 07:21 UTC, martin
i915_gem_interrupt (while hung) (265 bytes, text/plain) 2009-02-17 07:22 UTC, martin
xorg conf (while hung) (1.04 KB, application/octet-stream) 2009-02-17 07:22 UTC, martin
xorg log (while hung) (35.05 KB, text/x-log) 2009-02-17 07:22 UTC, martin
xorg.log.old (while hung) (38.21 KB, application/x-trash) 2009-02-17 07:23 UTC, martin

Description martin 2009-02-17 07:20:10 UTC

I just got this UXA hang on Gigabyte GA-EG45M-DS2H (G45) using:

vanilla upstream 2.6.29-020629rc5-generic
libdrm-intel1 and libdrm2 are 2.4.4-0ubuntu6
xserver-xorg-video-intel is 2:2.6.1-1ubuntu2
libgl1-mesa-dev is 7.3-1ubuntu1

I can move the mouse, but the mouse cursor is no longer animating. I could still ssh into the box and capture logs (attaching).

Comment 1 martin 2009-02-17 07:20:45 UTC

Created attachment 23023
gdb backtrace

Comment 2 martin 2009-02-17 07:21:10 UTC

Created attachment 23024
dmesg (while hung)

Comment 3 martin 2009-02-17 07:21:28 UTC

Created attachment 23025
gem_objects (while hung)

Comment 4 martin 2009-02-17 07:21:50 UTC

Created attachment 23026
intel_reg_dumper (while hung)

Comment 5 martin 2009-02-17 07:22:13 UTC

Created attachment 23027
i915_gem_interrupt (while hung)

Comment 6 martin 2009-02-17 07:22:35 UTC

Created attachment 23028
xorg conf (while hung)

Comment 7 martin 2009-02-17 07:22:51 UTC

Created attachment 23029
xorg log (while hung)

Comment 8 martin 2009-02-17 07:23:09 UTC

Created attachment 23030
xorg.log.old (while hung)

Comment 9 martin 2009-02-17 07:27:14 UTC

Because dmesg has a weird message about a stack trap in compiz.real just before the crash happened, I also captured the compiz.real stack:

(gdb) info threads 
  1 Thread 0x7f2256f58750 (LWP 9297)  0x00007f2254df16f3 in __select_nocancel () from /lib/libc.so.6
(gdb) bt full
#0  0x00007f2254df16f3 in __select_nocancel () from /lib/libc.so.6
No symbol table info available.
#1  0x00007f225378f19e in _xcb_conn_wait (c=0x23eef40, cond=<value optimized out>, vector=0x0, count=0x0)
    at /build/buildd/libxcb-1.1.93/./src/xcb_conn.c:283
	ret = 0
	rfds = {__fds_bits = {16, 0 <repeats 15 times>}}
	wfds = {__fds_bits = {0 <repeats 16 times>}}
#2  0x00007f2253790c8c in xcb_wait_for_reply (c=0x23eef40, request=984088, e=0x7fff5ef86c28) at /build/buildd/libxcb-1.1.93/./src/xcb_in.c:376
	cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, 
    __broadcast_seq = 0}, __size = '\0' <repeats 47 times>, __align = 0}
	reader = {request = 984088, data = 0x7fff5ef86ba0, next = 0x0}
	prev_reader = (reader_list **) 0x23efff8
	widened_request = <value optimized out>
	ret = (void *) 0x0
#3  0x00007f2254a55fbc in _XReply (dpy=0x23ee500, rep=0x7fff5ef86c70, extra=0, discard=0) at ../../src/xcb_io.c:454
	error = <value optimized out>
	c = (xcb_connection_t *) 0x23eef40
	__PRETTY_FUNCTION__ = "_XReply"
#4  0x00007f2255353ce2 in DRI2CopyRegion () from /usr/lib/libGL.so.1
No symbol table info available.
#5  0x00007f2255353a3f in ?? () from /usr/lib/libGL.so.1
No symbol table info available.
#6  0x00007f225532f2fb in ?? () from /usr/lib/libGL.so.1
No symbol table info available.
#7  0x00000000004123bf in eventLoop ()
No symbol table info available.
#8  0x000000000040d451 in main ()
No symbol table info available.
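
For comparing backtraces like the one above across reports, a minimal Python sketch follows. It is not part of the original report: the function name frame_functions and the regular expression are assumptions made for illustration, and the sample is a subset of the frames shown above. It only extracts the function name of each gdb frame, so it shows where a process is blocked, not why the GPU hung.

```python
import re

# Matches gdb "bt" frame lines such as:
#   #4  0x00007f2255353ce2 in DRI2CopyRegion () from /usr/lib/libGL.so.1
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def frame_functions(backtrace_text):
    """Return the function name of each frame, innermost frame first.

    Hypothetical helper for illustration; lines that are not frame
    headers (locals, "No symbol table info available.") are skipped.
    """
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(2))
    return names


# A few frames copied from the compiz.real backtrace above.
sample = """\
#0  0x00007f2254df16f3 in __select_nocancel () from /lib/libc.so.6
No symbol table info available.
#1  0x00007f225378f19e in _xcb_conn_wait (c=0x23eef40, cond=<value optimized out>, vector=0x0, count=0x0)
#4  0x00007f2255353ce2 in DRI2CopyRegion () from /usr/lib/libGL.so.1
#5  0x00007f2255353a3f in ?? () from /usr/lib/libGL.so.1
#7  0x00000000004123bf in eventLoop ()
"""

print(frame_functions(sample))
```

For the sample, this prints the list __select_nocancel, _xcb_conn_wait, DRI2CopyRegion, ??, eventLoop, which makes it easy to see that compiz.real is waiting for an X reply inside DRI2CopyRegion.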

Comment 10 Gordon Jin 2009-02-17 16:59:24 UTC

Are there reproducible steps for this?

Comment 11 martin 2009-02-17 23:45:07 UTC

No repro steps yet.

Comment 12 Eric Anholt 2009-02-18 12:34:00 UTC

This information shows pretty much the generic "the chip is hung" state, so there is not much to do without steps to reproduce.

Comment 13 martin 2009-02-18 13:18:28 UTC

Thanks for having a look Eric.

Gordon, what's your policy? Should it be marked as "NEEDINFO" asking for a repro or do you prefer to close it?

Question: Is it correct that once the chip hangs xorg can end up in many different stacks depending on what exactly xorg was doing _when_ the chip hung? 

I'm asking because I implicitly assumed that different stacks meant different bugs, but when I think about it, that doesn't feel like a solid assumption when the code involves a GPU (I'm used to working mostly on user space apps).

Comment 14 Eric Anholt 2009-02-18 15:20:49 UTC

There are piles of different stack traces you could get when apps end up waiting for the GPU to finish some task that it never finished.

Comment 15 Gordon Jin 2009-02-18 18:28:19 UTC

(In reply to comment #13)
> Gordon, what's your policy? Should it be marked as "NEEDINFO" asking for a
> repro or do you prefer to close it?

No. I can't mark it "NEEDINFO" to force you to provide info that you've already said you can't provide. We also can't close a bug just because there are no reliable steps to reproduce.

So I think we should just leave this bug open, but it will probably be lower priority from the developers' point of view, since there's no clear info.

Comment 16 Brian Rogers 2009-02-22 03:58:13 UTC

I'm getting this hang, too, on an X3100 and I get a similar backtrace. It usually happens right when I click a window, and it's being redrawn with the 'active' appearance. But it's rare and random.

Comment 17 Brian Rogers 2009-03-17 07:13:47 UTC

Hmm, this isn't good... I got a freeze with exa while running Google Earth.

Frame 1 was in drm_intel_gem_bo_start_gtt_access.

Comment 18 Florian Reinhard 2009-03-23 05:03:57 UTC

(In reply to comment #0)
> I can move the mouse, but the mouse cursor is no longer animating. I could
> still ssh into the box and capture logs (attaching).

The same applies to 915GM. I can move the mouse and ssh into the box, but not reboot.

kernel: ubuntu-jaunty 2.6.28-11-generic
libdrm 2.4.5
xserver-xorg-video-intel-dbg 2.6.3
mesa 7.3

The backtrace starts in drm_intel_gem_bo_start_gtt_access as well.

Comment 19 Eric Anholt 2009-04-08 14:50:39 UTC

Brian: If you're reliably getting a hang from googleearth, please open your own bug for that issue so we can track and fix it.

Everyone else: If you're looking at hopping in on this bug with "me too", please just open your own bug if you've got something specific you can do ("run this app, click this, go to this location", not "use the desktop for an hour") to reproduce.  Just because the backtrace is the same doesn't mean the cause is the same, and your own bug means individual attention to your problem.

Comment 20 Jesse Barnes 2009-05-11 11:21:44 UTC

Adjusting severity: crashes & hangs should be marked critical.

Comment 21 Eric Anholt 2009-05-19 09:21:28 UTC

If you're still experiencing this, could you use intel_gpu_dump on 2.6.30rc4 or newer when it's hung so we can look at what we did that angered the GPU?

Also, note that there are some fixes in git master of the 2D driver that may help with GPU hangs.

Comment 22 Gordon Jin 2009-07-08 23:57:01 UTC

Martin, we have many fixes for GPU hangs like this in the latest xf86-video-intel driver and kernel. Could you try them?
If the hang still occurs, please provide intel_gpu_dump output according to http://intellinuxgraphics.org/intel-gpu-dump.html.

Comment 23 martin 2009-07-09 09:05:26 UTC

When I first opened this bug I didn't know how hard it is to do something useful with a bug report for a GPU hang that lacks buffer dumps. I remember not seeing this bug again for at least a few weeks after I reported it (I was running the Ubuntu devel release back then, so my bits changed quite a lot). For the last few weeks, though (and also up until the end of Aug), my Intel G45 box has been packed away in a moving box in a storage facility. My suggestion is to close this bug report.

If I get another hang, I will capture buffers and open a new bug.

Comment 24 Gordon Jin 2009-07-09 18:20:16 UTC

closing
