Bug 30082

Summary: [965GM] hang during firefox-talos-svg
Product: DRI Reporter: Sitsofe Wheeler <sitsofe>
Component: DRM/IntelAssignee: Chris Wilson <chris>
Status: CLOSED WORKSFORME QA Contact:
Severity: normal    
Priority: medium CC: eugeni
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg output
none
Output of /sys/kernel/debug/dri/0/i915_error_state
none
dmesg output
none
i915_error_state 2.6.38-rc7+
none
dmesg 2.6.38-rc7+ none

Description Sitsofe Wheeler 2010-09-08 02:28:38 UTC
Created attachment 38541 [details]
dmesg output

Description of the problem:
When running cairo-perf-trace on the firefox-talos-svg benchmark X starts becoming slow before locking up completely on a desktop with a 965G. Lots of
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
messages appear in dmesg. The machine has to be rebooted to get a working X back - restarting X just results in a black screen with a cursor on top.

Steps to reproduce:
1. Run
CAIRO_TEST_TARGET=gl ./cairo-perf-trace -r -v -i3 benchmark/firefox-talos-svg

Expected result:
Benchmark to run to completion, X to be responsive throughout.

Actual result:
X becomes increasingly slower before becoming frozen with an unchanging picture on the screen. Messages like:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 1820 at 1797)
appear frequently in dmesg after the test starts.

How reproducible is the problem:
It is reproducible every time.

Version information:
Fedora 13 x86_64
00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated Graphics Controller (rev 02)
xorg-x11-server-Xorg-1.8.2-3.fc13.x86_64
xorg-x11-drv-intel-2.11.0-5.fc13.x86_64
Kernel: 2.6.36-rc3+8554048070906579ec9fa19ac381deddd2d7b155 from drm-intel-fixes
(Hangchecks are also triggered with Fedora 13's 2.6.34.6-47.fc13.x86_64)
Comment 1 Sitsofe Wheeler 2010-09-08 02:30:36 UTC
Created attachment 38542 [details]
Output of /sys/kernel/debug/dri/0/i915_error_state
Comment 2 Sitsofe Wheeler 2010-09-08 02:33:30 UTC
Version information:
mesa-dri-drivers-7.8.1-8.fc13.x86_64
mesa-libGL-7.8.1-8.fc13.x86_64
Comment 3 Chris Wilson 2010-09-30 10:20:34 UTC
Finally my i965 has a working mesa, previously it just segfaulted when trying to run cairo-gl. Don't see any hangs so far...
Comment 4 Sitsofe Wheeler 2010-10-02 10:37:22 UTC
With 0e49892ed0bd2c34c1c3461d38a409ae18bfdc64 from drm-intel-staging I still get a hang and the following is printed:


[  374.318024] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  374.318793] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 8052 at 8029)
[  374.415003] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[...]

[  449.829780] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 10689 at 8029)
[  449.829796] [drm:i915_gem_do_execbuffer] *ERROR* Failed to pin buffer 239 of 251, total 105504768 bytes, 0 fences: -5
[  449.829799] [drm:i915_gem_do_execbuffer] *ERROR* 5520 objects [241 pinned], 264699904 object bytes [106901504 pinned], 179945472/260308992 gtt bytes
[  449.844200] ------------[ cut here ]------------
[  449.845005] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:2575!
[  449.845005] invalid opcode: 0000 [#1] 
[  449.845005] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0/uevent
[  449.845005] CPU 0 
[  449.845005] Pid: 2755, comm: lt-cairo-perf-t Not tainted 2.6.36-rc5+ #4 DG965WH/        
[  449.845005] RIP: 0010:[<ffffffff811db1ba>]  [<ffffffff811db1ba>] i915_gem_object_bind_to_gtt+0x298/0x2f0
[  449.845005] RSP: 0018:ffff8801247e1bd8  EFLAGS: 00010202
[  449.845005] RAX: ffff8800c7e30e90 RBX: ffff8800ca16e200 RCX: ffff880000010000
[  449.845005] RDX: ffff8800ca16e290 RSI: ffff880128c77eb8 RDI: ffff880128c77e00
[  449.845005] RBP: ffff8801247e1c18 R08: 000000000000077f R09: 0000000000000246
[  449.845005] R10: 0000000000000280 R11: 00000000078b8000 R12: 0000000000000000
[  449.845005] R13: ffff88012be8e800 R14: ffff88012be23350 R15: 0000000000001000
[  449.845005] FS:  00007fb15d0cf840(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
[  449.845005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  449.845005] CR2: 00007fb159e05000 CR3: 00000001247f8000 CR4: 00000000000006b0
[  449.845005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  449.845005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  449.845005] Process lt-cairo-perf-t (pid: 2755, threadinfo ffff8801247e0000, task ffff88012b3405b0)
[  449.845005] Stack:
[  449.845005]  ffff88012be232c0 00001200247e1de8 ffff880037816c00 ffff8800ca16e200
[  449.845005] <0> ffff88012be8e800 0000000000000000 ffff8800ca16e200 000000000000003e
[  449.845005] <0> ffff8801247e1c48 ffffffff811dc2dd ffffc90019587460 0000000000000000
[  449.845005] Call Trace:
[  449.845005]  [<ffffffff811dc2dd>] i915_gem_object_pin+0xe5/0x170
[  449.845005]  [<ffffffff811dec4c>] i915_gem_do_execbuffer+0x518/0xe4f
[  449.845005]  [<ffffffff81090f5d>] ? __vmalloc_node+0x82/0x92
[  449.845005]  [<ffffffff811de732>] ? drm_malloc_ab+0x4e/0x50
[  449.845005]  [<ffffffff811df651>] i915_gem_execbuffer2+0xce/0x12c
[  449.845005]  [<ffffffff811bf736>] drm_ioctl+0x273/0x339
[  449.845005]  [<ffffffff811df583>] ? i915_gem_execbuffer2+0x0/0x12c
[  449.845005]  [<ffffffff8111f801>] ? file_has_perm+0x95/0xa3
[  449.845005]  [<ffffffff810a79ef>] do_vfs_ioctl+0x4a1/0x4f0
[  449.845005]  [<ffffffff812dcae7>] ? schedule+0x193/0x37e
[  449.845005]  [<ffffffff810a7a94>] sys_ioctl+0x56/0x7a
[  449.845005]  [<ffffffff81001e98>] system_call_fastpath+0x16/0x1b
[  449.845005] Code: 00 00 00 49 89 96 58 13 00 00 49 81 c6 50 13 00 00 48 89 83 98 00 00 00 4c 89 b3 90 00 00 00 48 89 10 f7 43 6c be ff ff ff 74 04 <0f> 0b eb fe f7 43 70 be ff ff ff 74 04 0f 0b eb fe 44 8b b3 d8 
[  449.845005] RIP  [<ffffffff811db1ba>] i915_gem_object_bind_to_gtt+0x298/0x2f0
[  449.845005]  RSP <ffff8801247e1bd8>
[  449.926679] ---[ end trace 75509941ef16036b ]---

I think X hangs in the D state when this happens...
Comment 5 Sitsofe Wheeler 2010-10-02 10:39:21 UTC
(The assertion being hit on i915_gem.c:2575 is
BUG_ON(obj->read_domains & I915_GEM_GPU_DOMAINS);
)
Comment 6 Sitsofe Wheeler 2010-10-02 10:53:41 UTC
Sigh - tested the wrong kernel. I should have used -next rather than -staging and the symptoms should really have been reported back in bug #30083...
Comment 7 Chris Wilson 2010-10-08 05:34:48 UTC
Sitsofe, thanks for testing again... I don't think I'm going to apply the domain fixup to -fixes since we so late in the cycle and this is an error path that we should not hit in the first place.

Any chance you can try a more recent version of mesa?
Comment 8 Sitsofe Wheeler 2010-10-13 11:35:07 UTC
I've just tried a 32 bit Ubuntu 10.10 live cd which had the following packages:

libgl1-mesa-dri 7.9~git20100924-0ubuntu2
libgl1-mesa-glx 7.9~git20100924-0ubuntu2
xserver-xorg 1:7.5+6ubuntu3
xserver-xorg-video-intel 2:2.12.0-1ubuntu5
linux-image-2.6.35-22-generic 2.6.35-22.33

and it had the same issue as described in comment #1.
Comment 9 Sitsofe Wheeler 2011-03-11 14:02:00 UTC
Still here in the latest Fedora 15 pre-pre-release. At first everything is frozen and then after a few seconds the screen goes black and stays black no matter what (the system still appears to be alive though).

00:02.0 VGA compatible controller [0300]: Intel Corporation 82G965 Integrated Graphics Controller [8086:29a2] (rev 02)
Kernel: 2.6.38-0.rc6.git6.1.fc15.x86_64
xorg-x11-drv-intel-2.14.0-2.fc15.x86_64
xorg-x11-server-Xorg-1.9.99.1-6.20101201.fc15.x86_64

dmesg snippet to be attached.
Comment 10 Sitsofe Wheeler 2011-03-11 14:02:31 UTC
Created attachment 44370 [details]
dmesg output
Comment 11 Chris Wilson 2011-03-13 01:02:06 UTC
Still here with rawhide on this t61 (965gm)... Rawhide with drm-intel-next, mesa.git and cairo.git (+ deps). WORKSFORME. :|

Just recently I fixed mesa to handle the URB_FENCE erratum on 965GM. If we cross our fingers, that might just be the issue. And we also found a bug where HWS could only handle being allocated in the low 4GiB, and there is are further unfixed errata regarding handling of pages >4GiB by the i965. Do you have more than 4GiB and, if so, does mem=3g help?

Can you try rawhide (or at least mesa.git and drm-intel-next) and grab a few i915_error_states if the hang persists? Thanks.
Comment 12 Sitsofe Wheeler 2011-03-15 11:03:40 UTC
> Do you have more than 4GiB

No - this machine has 4GiB exactly.

The previous results were from a LiveCD on http://download.fedora.redhat.com/pub/alt/nightly-composes/desktop/ which was probably rawhide (see http://www.mail-archive.com/fedora-devel-list@redhat.com/msg10108.html ).

Unfortunately, due to the nature of the system in question I can compile and test later kernels but I can't switch anything in userland unless I boot temporarily to a LiveCD.

I have tried a later kernel from drm-intel-next with the old Fedora 13 userland:
Kernel: 2.6.38-rc7+ (47ae63e0c2e5fdb582d471dc906eb29be94c732f drm-intel-next)
xorg-x11-drv-intel-2.11.0-5.fc13.x86_64
mesa-dri-drivers-7.8.2-1.fc13.x86_64

Output to be attached.
Comment 13 Sitsofe Wheeler 2011-03-15 11:04:38 UTC
Created attachment 44483 [details]
i915_error_state 2.6.38-rc7+
Comment 14 Sitsofe Wheeler 2011-03-15 11:06:15 UTC
Created attachment 44485 [details]
dmesg 2.6.38-rc7+
Comment 15 Chris Wilson 2011-03-15 11:26:34 UTC
Hmm, still nothing obvious. We need to find a way of getting mesa.git onto your test machine so we can experiment.
Comment 16 Sitsofe Wheeler 2011-03-16 16:07:48 UTC
A very quick heads up:

Testing with an Ubuntu Natty LiveCD with all the pieces updated to xorg-edgers and after pulling in goodness knows how many packages AND updating my cairo directory to the latest git version it ran the test fine. Unfortunately I don't know which piece fixed things.

Versions:
xserver-xorg-core 2:1.10.0+git20110307+server-1.10-branch.35503964-0ubuntu0sarvatt
libgl1-mesa-dri 7.11.0+git20110316.b47d5084-0ubuntu0sarvatt
libgl1-mesa-glx 7.10.1-0ubuntu2
linux-image-2.6.38-6-generic 2.6.38-6.34
libdrm-intel1 2.4.24+git20110307.3b04c736-0ubuntu0sarvatt
xserver-xorg-video-intel 2:2.14.901+git20110316.ee740778-0ubuntu0sarvatt
Comment 17 Sitsofe Wheeler 2011-03-21 12:40:13 UTC
It looks like a newer versions of cairo can't trigger this problem on setups that have problems with the older cairo. I tried cairo.git on the previously mentioned Fedora 15 LiveCD and it caused a segfault in one of the i915 libraries (but X was seemingly unharmed). However, an older cairo caused the kernel errors and a lock-up.

Newer cairo has also become much more difficult to test on older systems due to its dependency on a new pixman.
Comment 18 Sitsofe Wheeler 2011-03-21 12:45:10 UTC
I should also note that stock Ubuntu Natty alphas do not show the problem with a cairo.git but gcc refuses to compile my older cairo on Ubuntu Natty.
Comment 19 Andrey Batyiev 2011-07-10 15:57:27 UTC
Hello

I've just got the same bug:

Jul 11 01:07:36 valkyrie kernel: [ 5646.216051] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jul 11 01:07:36 valkyrie kernel: [ 5646.216104] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 227670 at 227669, next 227671)
Jul 11 01:07:36 valkyrie kernel: [ 5646.216205] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 00000000 head 00000000 tail 00000000 start 00000000
Jul 11 01:07:36 valkyrie kernel: [ 5646.216912] [drm] Changing LVDS panel from (+hsync, +vsync) to (+hsync, -vsync)
Jul 11 01:07:36 valkyrie kernel: [ 5646.217072] ------------[ cut here ]------------
Jul 11 01:07:36 valkyrie kernel: [ 5646.217089] WARNING: at drivers/gpu/drm/i915/intel_display.c:1079 intel_enable_pipe+0x111/0x160()
Jul 11 01:07:36 valkyrie kernel: [ 5646.217095] Hardware name: Vostro1310
Jul 11 01:07:36 valkyrie kernel: [ 5646.217101] PLL state assertion failure (expected on, current off)
Jul 11 01:07:36 valkyrie kernel: [ 5646.217105] Modules linked in: hidp bnep rfcomm sdhci_pci iwl3945 sdhci btusb bluetooth firewire_ohci iwl_legacy dell_wmi wmi
Jul 11 01:07:36 valkyrie kernel: [ 5646.217136] Pid: 735, comm: kworker/u:2 Not tainted 2.6.39-gentoo-r2 #1
Jul 11 01:07:36 valkyrie kernel: [ 5646.217142] Call Trace:
Jul 11 01:07:36 valkyrie kernel: [ 5646.217155]  [<c1035ab8>] ? warn_slowpath_common+0x78/0xb0
Jul 11 01:07:36 valkyrie kernel: [ 5646.217165]  [<c12c18e1>] ? intel_enable_pipe+0x111/0x160
Jul 11 01:07:36 valkyrie kernel: [ 5646.217174]  [<c12c18e1>] ? intel_enable_pipe+0x111/0x160
Jul 11 01:07:36 valkyrie kernel: [ 5646.217182]  [<c1035b83>] ? warn_slowpath_fmt+0x33/0x40
Jul 11 01:07:36 valkyrie kernel: [ 5646.217191]  [<c12c18e1>] ? intel_enable_pipe+0x111/0x160
Jul 11 01:07:36 valkyrie kernel: [ 5646.217200]  [<c12c6ad0>] ? intel_crtc_mode_set+0xab0/0x1b60
Jul 11 01:07:36 valkyrie kernel: [ 5646.217211]  [<c129d097>] ? drm_vblank_off+0x57/0x1b0
Jul 11 01:07:36 valkyrie kernel: [ 5646.217219]  [<c12bfe77>] ? i8xx_enable_fbc+0x167/0x1c0
Jul 11 01:07:36 valkyrie kernel: [ 5646.217227]  [<c12bea4a>] ? i965_update_wm+0x19a/0x220
Jul 11 01:07:36 valkyrie kernel: [ 5646.217236]  [<c12c2045>] ? i9xx_crtc_disable+0xc5/0x130
Jul 11 01:07:36 valkyrie kernel: [ 5646.217245]  [<c129565b>] ? drm_crtc_helper_set_mode+0x26b/0x3a0
Jul 11 01:07:36 valkyrie kernel: [ 5646.217258]  [<c1295814>] ? drm_helper_resume_force_mode+0x84/0x270
Jul 11 01:07:36 valkyrie kernel: [ 5646.217268]  [<c12ac335>] ? i915_reset+0x245/0x420
Jul 11 01:07:36 valkyrie kernel: [ 5646.217276]  [<c12b0447>] ? i915_error_work_func+0xc7/0x120
Jul 11 01:07:36 valkyrie kernel: [ 5646.217286]  [<c104a2b5>] ? process_one_work+0x105/0x350
Jul 11 01:07:36 valkyrie kernel: [ 5646.217293]  [<c12b0380>] ? i915_get_vblank_counter+0xb0/0xb0
Jul 11 01:07:36 valkyrie kernel: [ 5646.217302]  [<c104a8b1>] ? worker_thread+0x121/0x3c0
Jul 11 01:07:36 valkyrie kernel: [ 5646.217310]  [<c104a790>] ? rescuer_thread+0x290/0x290
Jul 11 01:07:36 valkyrie kernel: [ 5646.217317]  [<c104de74>] ? kthread+0x74/0x80
Jul 11 01:07:36 valkyrie kernel: [ 5646.217325]  [<c104de00>] ? kthread_worker_fn+0x170/0x170
Jul 11 01:07:36 valkyrie kernel: [ 5646.217335]  [<c15a8436>] ? kernel_thread_helper+0x6/0xd
Jul 11 01:07:36 valkyrie kernel: [ 5646.217341] ---[ end trace b94f1897cc05f11d ]---
Jul 11 01:07:36 valkyrie kernel: [ 5646.321139] ------------[ cut here ]------------


Before this bug occured I've only surfed web with one of webkit based browsers. No suspend-resume, no 3D graphics and desktop effects are off.

I'm using Gentoo with:
Vostro 1310 laptop
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)
00:02.1 Display controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)

linux-2.6.39.2
xorg-server-1.10.2
mesa-7.10.2, classic mode
xf86-video-intel-2.15.0-r1 (yes, gentoo patched)
libdrm-2.4.26
dri2proto-2.6

Reproduceablity: Yes, this bug happens *sometimes*. It may pass a week before appear or appear in several hours.

Sorry, I'll attach i915_error_state later. Anything else to add?
I can build debug kernel and do some debugging over ssh, what should I be looking for?

Thanks,
   Andrey
Comment 20 Chris Wilson 2011-07-11 06:58:43 UTC
Andrey, it is highly unlikely you have the same bug, so please when you come to attach your error-state can you do so in new bug report. And can you include the full dmesg along with Xorg.log? Thanks.
Comment 21 Chris Wilson 2011-10-18 14:50:33 UTC
Sitosfe, I hope your still having fun. :)

As we made no progress on identifying the crash scenario and it appears to have resolved itself by cairo doing things differently, I can only hope that the bug has been magically fixed with all the mesa/i965 work...

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.