Bug 40503

Summary: [Sandy Bridge] i915 System Freeze gpu hangcheck timer
Product: DRI
Reporter: jaschak1
Component: DRM/Intel
Assignee: Chris Wilson <chris>
Status: CLOSED INVALID
QA Contact:
Severity: critical
Priority: medium
CC: 384toregzteez, ben, chris, daniel, jbarnes, mike, slava, yunta83
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description (Flags)
dmesg (none)
glxinfo (none)
xorg (none)
i915_error_state (none)
i915 warnings from syslog (none)
dmesg (none)
Error state collected during warnings reported above (none)
i915_error_state (none)
dmesg (none)
error state (none)
dmesg corresponding to last error_state (none)
Error state for previous commend (none)

Description jaschak1 2011-08-30 18:24:06 UTC
Created attachment 50733 [details]
dmesg

Bug description:

The system freezes while playing an OpenGL game, sometimes after 60 minutes and other times after just 10 minutes. The whole system locks up completely: SysRq does not respond, and only the power button works.

System environment:
-- chipset: H67
-- system architecture: 64-bit
-- xf86-video-intel: 2.15.0
-- xserver: X.Org X Server 1.10.3
-- mesa: OpenGL version string: 2.1 Mesa 7.11
-- libdrm:
-- kernel: 2.6.40.3-0.fc15.x86_64
-- Linux distribution: Fedora 15, but it also happens with Ubuntu 11.04
-- Machine or mobo model: Asrock H67
-- Display connector: HDMI

Reproducing steps:

Start the game "Heroes of Newerth" (a free game with native Linux support). Log in, go to matchmaking, and join a game. Sometimes I can play for over an hour, and sometimes for just 10 minutes.

Additional info:

/sys/kernel/debug/dri/0/i915_error_state -> No error state collected.

Error occurs on
Ubuntu 11.04
Ubuntu 11.04 + xorg-edgers repo.
Fedora 15
Ubuntu 11.04 + xorg-edgers repo. + 3.1-rc4
Fedora 15 + 3.1-rc4
Comment 1 jaschak1 2011-08-30 18:24:37 UTC
Created attachment 50734 [details]
glxinfo
Comment 2 jaschak1 2011-08-30 18:25:01 UTC
Created attachment 50735 [details]
xorg
Comment 3 jaschak1 2011-08-31 15:37:15 UTC
Bug description:

I tested it again with a completely different OS setup. Now the system freeze occurs reproducibly after one minute. The whole system locks up completely: SysRq does not respond, and only the power button works.

System environment:
-- chipset: H67
-- system architecture: 64-bit
-- xf86-video-intel: 2.16.0
-- xserver: X.Org X Server 1.10.3.902 (1.10.4 RC 2)
-- mesa: OpenGL version string: 2.1 Mesa 7.12-devel
-- libdrm: 0.25
-- kernel: 2.6.38-11-generic
-- Linux distribution: Mint 11 + xorg-edgers Aug 31
-- Machine or mobo model: Asrock H67
-- Display connector: HDMI

Reproducing steps:

same as above.

Additional info:

/sys/kernel/debug/dri/0/i915_error_state -> No error state collected.

The errors are the same:

xorg: [   458.073] (EE) intel(0): Detected a hung GPU, disabling acceleration.
dmesg: Aug 31 18:18:39 desktop kernel: [  457.610441] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Aug 31 18:18:39 desktop kernel: [  457.612075] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 498706 at 498705, next 498707)
Comment 4 jaschak1 2011-08-31 18:08:30 UTC
I tested it one more time and managed to save an i915_error_state. I have uploaded it as an attachment.
Comment 5 jaschak1 2011-08-31 18:09:11 UTC
Created attachment 50790 [details]
i915_error_state
Comment 6 Slava Gorbunov 2011-10-06 01:34:04 UTC
I have this bug too. In my case, the bug is triggered by OpenGL screensavers from xscreensaver, randomly. Dmesg:

Oct  6 10:32:22 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct  6 10:32:22 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338481 at 6338472, next 6338482)
Oct  6 10:32:28 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct  6 10:32:28 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338482 at 6338472, next 6338483)
Oct  6 10:32:35 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct  6 10:32:35 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338483 at 6338472, next 6338484)
Oct  6 10:32:41 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct  6 10:32:41 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338484 at 6338472, next 6338485)
Oct  6 10:32:47 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct  6 10:32:47 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338485 at 6338472, next 6338486)
Oct  6 10:33:14 nout kernel: ------------[ cut here ]------------
Oct  6 10:33:14 nout kernel: kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3336!
Oct  6 10:33:14 nout kernel: invalid opcode: 0000 [#1] SMP 
Oct  6 10:33:14 nout kernel: CPU 2 
Oct  6 10:33:14 nout kernel: Modules linked in: cryptd aes_x86_64 aes_generic bnep rfcomm snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss reiserfs dm_mod uvcvideo videodev v4l2_compat_ioctl32 btusb bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek arc4 rtl8192ce rtl8192c_common i915 rtlwifi fbcon font bitblit softcursor drm_kms_helper snd_hda_intel drm mac80211 fb fbdev snd_hda_codec cfg80211 i2c_algo_bit snd_hwdep snd_pcm thinkpad_acpi ehci_hcd usbcore cfbcopyarea snd_timer thermal processor video snd rfkill thermal_sys r8169 battery ac hwmon cfbimgblt power_supply sg sr_mod cfbfillrect psmouse evdev wmi i2c_i801 soundcore intel_agp button intel_gtt nvram cdrom agpgart snd_page_alloc
Oct  6 10:33:14 nout kernel: 
Oct  6 10:33:14 nout kernel: Pid: 2170, comm: X Not tainted 3.0.4-gentoo #3 LENOVO 78543MG/78543MG
Oct  6 10:33:14 nout kernel: RIP: 0010:[<ffffffffa0438392>]  [<ffffffffa0438392>] i915_gem_object_unpin+0xa2/0xb0 [i915]
Oct  6 10:33:14 nout kernel: RSP: 0018:ffff880072967be0  EFLAGS: 00010246
Oct  6 10:33:14 nout kernel: RAX: ffff88006fd94000 RBX: ffff880070a34000Oct

The log is truncated, I suppose because of the hard reset that was required to reboot the system. Sometimes after a freeze the syslog contains no trace of it at all, perhaps because syslogd did not have time to write the messages before the system froze completely.
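
The repeated wait_request errors above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names `parse_wait_errors` and `ring_is_stuck` are made up for illustration, and it assumes that the value after "at" is the sequence number the ring had reached when the error was logged. Under that assumption, an "at" value that never changes while "awaiting" keeps growing suggests the ring is making no progress.

```python
import re

# Matches the "returns -11 (awaiting A at B, next C)" lines quoted in this report.
# Both function names seen in the logs (i915_wait_request, i915_do_wait_request) are accepted.
WAIT_RE = re.compile(
    r"i915_(?:do_)?wait_request returns (?P<ret>-?\d+) "
    r"\(awaiting (?P<awaiting>\d+) at (?P<at>\d+), next (?P<next>\d+)\)"
)


def parse_wait_errors(log_text):
    """Return one dict of integers per wait_request error line found in log_text."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in WAIT_RE.finditer(log_text)
    ]


def ring_is_stuck(entries):
    """True if there are at least two errors and the 'at' value never advances."""
    return len(entries) >= 2 and len({entry["at"] for entry in entries}) == 1


# Three of the lines from the dmesg excerpt above, copied verbatim.
SAMPLE = """\
Oct  6 10:32:22 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338481 at 6338472, next 6338482)
Oct  6 10:32:28 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338482 at 6338472, next 6338483)
Oct  6 10:32:35 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338483 at 6338472, next 6338484)
"""

entries = parse_wait_errors(SAMPLE)
for entry in entries:
    # Gap between the request being waited on and where the ring appears to be.
    print(entry["awaiting"] - entry["at"], entry)
print("ring stuck:", ring_is_stuck(entries))
```

For the excerpt above, the "at" value stays at 6338472 in every line, so the sketch reports the ring as stuck.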

The last relevant line in Xorg.log is:

[ 85260.319] (WW) intel(0): flip queue failed: Input/output error

My environment:
OS: Gentoo linux
Kernel: 3.0.4-gentoo
xf86-video-intel-2.16.0, xorg-server-1.11.0, mesa-7.11

Hardware: Lenovo L420, Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz.

lspci -vv:
00:02.0 VGA compatible controller: Intel Corporation Device 0116 (rev 09) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 21dd
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 43
        Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 1800 [size=64]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0f00c  Data: 4181
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a4] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel driver in use: i915
        Kernel modules: i915
Comment 7 Slava Gorbunov 2011-10-07 04:48:01 UTC
Created attachment 52079 [details]
i915 warnings from syslog

Here are several warnings with backtraces from i915.ko that I found in my syslog today. They didn't cause a system lockup, but XV stopped working.
Comment 8 Slava Gorbunov 2011-10-21 13:41:26 UTC
I've tried building i915.ko from the most recent git tree (taken from git://people.freedesktop.org/~keithp/linux, branch drm-intel-fixes, commit cd0de039bff32ee314046c0e4c047c38aa696f84). It required two small changes to compile with the current stable kernel (3.0.7). The problem is still present: the kernel module prints warnings to dmesg and acceleration gets disabled, although the whole system doesn't hang. I'll attach the dmesg and i915_error_state:
Comment 9 Slava Gorbunov 2011-10-21 13:42:22 UTC
Created attachment 52618 [details]
dmesg
Comment 10 Slava Gorbunov 2011-10-21 13:43:32 UTC
Created attachment 52619 [details]
Error state collected during warnings reported above
Comment 11 Chris Wilson 2011-10-30 02:33:01 UTC
Here's the culprit (finally):

0x042cb5ac:      0x7b003c04: 3DPRIMITIVE: rect list sequential
0x042cb5b0:      0xffffff4a:    vertex count
0x042cb5b4:      0x000001dd:    start vertex
0x042cb5b8:      0x00000001:    instance count
0x042cb5bc:      0x00000000:    start instance
0x042cb5c0:      0x00000000:    index bias

which I believe is fixed by

commit 786a770f528a0daee2971494352672cb89f48384
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 25 19:52:13 2011 +0100

    sna/video: Flush the video state at the end of the operation
    
    Or in the case where a second command is received prior to the batch
    being flushed, the vertex data is not flushed and leads to the a
    miscompution of the number of vertices emitted.
    
    Reported-by: Elias Probst <mail@eliasprobst.eu>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40332
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
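
As a side note on the dump above: the vertex count dword, 0xffffff4a, is what a small negative number looks like when stored in 32 bits, which would be consistent with the vertex-count miscomputation the commit message describes. This reading is an interpretation, not something stated in the decode itself. A minimal sketch of the arithmetic (the helper name `as_signed32` is made up for illustration):

```python
import struct


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


vertex_count = 0xFFFFFF4A  # dword at 0x042cb5b0 in the dump above

print(vertex_count)               # value if read as unsigned: 4294967114
print(as_signed32(vertex_count))  # value if read as signed: -182
```

Read as unsigned, the GPU would be asked to process over four billion vertices; read as signed, the count is -182.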
Comment 12 Slava Gorbunov 2011-11-03 01:41:19 UTC
It doesn't help. I've applied this patch to xf86-video-intel-2.16.0. The error occurs less frequently, but still occurs. Here is the snippet from dmesg:

Nov  1 22:38:21 nout kernel: [drm:ironlake_update_pch_refclk] *ERROR* enabling SSC on PCH
Nov  1 22:43:08 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Nov  1 22:43:08 nout kernel: [drm:kick_ring] *ERROR* Kicking stuck wait on render ring
Nov  1 22:43:14 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Nov  1 22:43:14 nout kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Nov  1 22:43:14 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 344308 at 344304, next 344309)

Unfortunately, I was in a hurry and forgot to copy i915_error_state :(

Next I'll try to update xf86-video-intel to version 2.16.901 (which should already include the patch above).
Comment 13 Slava Gorbunov 2011-11-08 01:06:13 UTC
Created attachment 53276 [details]
i915_error_state

Again "GPU hung", now with xf86-video-intel-2.16.901 (which should include the patch cited above) and kernel 3.1.0-gentoo. I've copied i915_error_state (see the attachment). Interestingly, copying i915_error_state was not easy: most attempts resulted in a "page allocation failed" error in dmesg, no matter which method I used (cp, cat, dd, etc.), but one attempt was successful.

The dmesg will be attached in next comment.
Comment 14 Slava Gorbunov 2011-11-08 01:07:59 UTC
Created attachment 53277 [details]
dmesg

Dmesg, which corresponds to last i915_error_state.
Comment 15 Slava Gorbunov 2011-11-19 18:18:12 UTC
Created attachment 53694 [details]
error state

Another "gpu hang". Now with xf86-video-intel-2.16.902 and linux-3.1.0.

I also occasionally observe messages like "[drm:ironlake_update_pch_refclk] *ERROR* enabling SSC on PCH" in syslog, but they don't necessarily lead to any other problems.
Comment 16 Slava Gorbunov 2011-11-19 18:19:24 UTC
Created attachment 53695 [details]
dmesg corresponding to last error_state
Comment 17 delete 2011-12-06 12:28:22 UTC
I have this issue too, with xf86-video-intel-2.17.0-r2 and kernel 2.6.39-r3 on Gentoo.
Comment 18 Chris Wilson 2011-12-06 12:46:02 UTC
The last error state is a different bug entirely,

commit 856da892d8caaeaf19748a1705e993a5eff2c28e in danvet/my-next
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 29 15:12:16 2011 +0000

    drm/i915: Only clear the GPU domains upon a successful finish
    
    By clearing the GPU read domains before waiting upon the buffer, we run
    the risk of the wait being interrupted and the domains prematurely
    cleared. The next time we attempt to wait upon the buffer (after
    userspace handles the signal), we believe that the buffer is idle and so
    skip the wait.
    
    There are a number of bugs across all generations which show signs of an
    overly haste reuse of active buffers.

should fix it.
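
The ordering problem described in that commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Buffer` class, its fields, and both wait functions are hypothetical stand-ins written for illustration, not the kernel's actual data structures or code.

```python
class Buffer:
    """Toy stand-in for a GPU buffer object (hypothetical, not the kernel struct)."""

    def __init__(self):
        self.gpu_read_domains = True  # driver bookkeeping: buffer is in use by the GPU
        self.gpu_busy = True          # what the simulated hardware is really doing


def wait_clear_first(buf, interrupted):
    """Problematic ordering: clear the GPU domains, then wait."""
    if not buf.gpu_read_domains:
        return "skipped"              # buffer believed idle, so no wait is performed
    buf.gpu_read_domains = False      # cleared before the wait has succeeded
    if interrupted:
        return "interrupted"          # e.g. a signal arrived; the GPU is still busy
    buf.gpu_busy = False
    return "waited"


def wait_clear_on_success(buf, interrupted):
    """Ordering described by the commit title: clear only after a successful wait."""
    if not buf.gpu_read_domains:
        return "skipped"
    if interrupted:
        return "interrupted"          # domains left intact for the retry
    buf.gpu_busy = False
    buf.gpu_read_domains = False
    return "waited"


def run(wait):
    """Interrupt the first wait, retry, and report whether the GPU is still busy."""
    buf = Buffer()
    first = wait(buf, interrupted=True)
    second = wait(buf, interrupted=False)
    return first, second, buf.gpu_busy


print(run(wait_clear_first))       # ('interrupted', 'skipped', True)
print(run(wait_clear_on_success))  # ('interrupted', 'waited', False)
```

In the first ordering the retry skips the wait while the simulated GPU is still busy, which is the premature reuse the commit message warns about.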

Nick, it would be best if you open a separate bug report with your own i915_error_state so that we don't make the mistake of assuming we have identified your issue.
Comment 19 Slava Gorbunov 2011-12-13 14:24:00 UTC
(In reply to comment #18)

I've just applied this patch, and got the error within a few minutes of rebooting:

Dec 14 01:48:59 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 14 01:48:59 nout kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 14 01:48:59 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 165705 at 165692, next 165708)
Dec 14 01:48:59 nout kernel: [drm:ironlake_update_pch_refclk] *ERROR* enabling SSC on PCH

The error state will be attached to the next comment.
Comment 20 Slava Gorbunov 2011-12-13 14:25:17 UTC
Created attachment 54404 [details]
Error state for previous commend
Comment 21 Slava Gorbunov 2011-12-13 14:31:01 UTC
I forgot to mention that I've upgraded xf86-video-intel to the latest available version, 2.17.0-r2 (Gentoo, which is 2.17.0 + patch "sna: Avoid the double application of drawable offsets for tiled spans")
Comment 22 Chris Wilson 2011-12-13 14:37:41 UTC
(In reply to comment #20)
> Created attachment 54404 [details]
> Error state for previous commend

That's a broken mesa missing the depth stall workarounds, at least,
commit 407785d0e97abd0cc51a6e360089111973748e7c
Author: Eric Anholt <eric@anholt.net>
Date:   Mon Jul 18 17:17:03 2011 -0700

    i965: Enable the PIPE_CONTROL workaround workaround out of paranoia.
    
    There's scary stuff going on in PIPE_CONTROL internals, and if the
    BSpec says to do this to make PIPE_CONTROL work, I'll go ahead and do
    it because we'll probably never be able to debug it after the fact.
    
    v2: Use stall at scoreboard instead of depth stall, as noted by Ken.
Comment 23 Slava Gorbunov 2011-12-14 06:13:06 UTC
(In reply to comment #22)
> That's a broken mesa missing the depth stall workarounds, at least,

Is it OK that an ordinary userspace program launched by an ordinary (non-root) user can cause a kernel error? Might this be a security issue?

Or is DRI insecure by design?
Comment 24 Daniel Vetter 2011-12-21 01:18:02 UTC
(In reply to comment #23)

GPUs are simply broken in reality (but not by design). In theory
the kernel could check the command stream that userspace generates for the
GPU more thoroughly, but in practice this is pointless because:
- it would make things really slow
- GPUs have tons of errata, so an earlier kernel won't know about all of
  them. So an old userspace on an old kernel would always be able to crash
  your kernel (if we missed a workaround, as seems to be the case
  here).

With that out of the way, did upgrading mesa fix your issues?
Comment 25 Slava Gorbunov 2011-12-28 23:42:23 UTC
(In reply to comment #24)
> With that out of the way, did upgrading mesa fix your issues?
No, it didn't. Furthermore, the mesa version on my system was initially (i.e. at the time of my first report) 7.11, which, as far as I understand, already includes the commit cited above. Yesterday I upgraded mesa to version 7.11.2, and this morning I found my notebook completely frozen. Again, I doubt that this upgrade could have helped, because 7.11 already includes this patch.
Comment 26 Chris Wilson 2012-04-14 06:40:21 UTC
How about Mesa 8.0? Do you have a recent error-state?
Comment 27 Chris Wilson 2012-10-21 14:29:48 UTC
Timeout. Please do reopen if you can still reproduce the issue and help us diagnose the problem, thanks.
