| Summary: | [Sandy Bridge] i915 System Freeze gpu hangcheck timer | | |
|---|---|---|---|
| Product: | DRI | Reporter: | jaschak1 |
| Component: | DRM/Intel | Assignee: | Chris Wilson <chris> |
| Status: | CLOSED INVALID | QA Contact: | |
| Severity: | critical | | |
| Priority: | medium | CC: | 384toregzteez, ben, chris, daniel, jbarnes, mike, slava, yunta83 |
| Version: | unspecified | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | Linux (All) | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Attachments: | | | |

Created attachment 50734 [details]
glxinfo
Created attachment 50735 [details]
xorg
Bug description: I tested it again with a completely different OS setup. Now the system freeze occurs reproducibly after one minute. The whole system locks up completely: no SysRq, only the power button works.

System environment:
-- chipset: H67
-- system architecture: 64-bit
-- xf86-video-intel: 2.16.0
-- xserver: X.Org X Server 1.10.3.902 (1.10.4 RC 2)
-- mesa: OpenGL version string: 2.1 Mesa 7.12-devel
-- libdrm: 0.25
-- kernel: 2.6.38-11-generic
-- Linux distribution: Mint 11 + xorg-edgers Aug 31
-- Machine or mobo model: Asrock H67
-- Display connector: HDMI

Reproducing steps: same as above.

Additional info: /sys/kernel/debug/dri/0/i915_error_state -> No error state collected.

The errors are the same.

xorg:
[ 458.073] (EE) intel(0): Detected a hung GPU, disabling acceleration.

dmesg:
Aug 31 18:18:39 desktop kernel: [ 457.610441] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Aug 31 18:18:39 desktop kernel: [ 457.612075] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 498706 at 498705, next 498707)

I tested it one more time and could save an i915_error_state. I have uploaded it to the attachments.

Created attachment 50790 [details]
i915_error_state
I have this bug too. In my case, the bug is triggered by OpenGL screensavers from xscreensaver, randomly.

Dmesg:
Oct 6 10:32:22 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 6 10:32:22 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338481 at 6338472, next 6338482)
Oct 6 10:32:28 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 6 10:32:28 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338482 at 6338472, next 6338483)
Oct 6 10:32:35 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 6 10:32:35 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338483 at 6338472, next 6338484)
Oct 6 10:32:41 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 6 10:32:41 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338484 at 6338472, next 6338485)
Oct 6 10:32:47 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 6 10:32:47 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6338485 at 6338472, next 6338486)
Oct 6 10:33:14 nout kernel: ------------[ cut here ]------------
Oct 6 10:33:14 nout kernel: kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3336!
Oct 6 10:33:14 nout kernel: invalid opcode: 0000 [#1] SMP
Oct 6 10:33:14 nout kernel: CPU 2
Oct 6 10:33:14 nout kernel: Modules linked in: cryptd aes_x86_64 aes_generic bnep rfcomm snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss reiserfs dm_mod uvcvideo videodev v4l2_compat_ioctl32 btusb bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek arc4 rtl8192ce rtl8192c_common i915 rtlwifi fbcon font bitblit softcursor drm_kms_helper snd_hda_intel drm mac80211 fb fbdev snd_hda_codec cfg80211 i2c_algo_bit snd_hwdep snd_pcm thinkpad_acpi ehci_hcd usbcore cfbcopyarea snd_timer thermal processor video snd rfkill thermal_sys r8169 battery ac hwmon cfbimgblt power_supply sg sr_mod cfbfillrect psmouse evdev wmi i2c_i801 soundcore intel_agp button intel_gtt nvram cdrom agpgart snd_page_alloc
Oct 6 10:33:14 nout kernel:
Oct 6 10:33:14 nout kernel: Pid: 2170, comm: X Not tainted 3.0.4-gentoo #3 LENOVO 78543MG/78543MG
Oct 6 10:33:14 nout kernel: RIP: 0010:[<ffffffffa0438392>] [<ffffffffa0438392>] i915_gem_object_unpin+0xa2/0xb0 [i915]
Oct 6 10:33:14 nout kernel: RSP: 0018:ffff880072967be0 EFLAGS: 00010246
Oct 6 10:33:14 nout kernel: RAX: ffff88006fd94000 RBX: ffff880070a34000Oct

The log is truncated, I suppose, due to hard reset which was required to reboot the system. Sometimes after freeze syslog doesn't even contain any traces of it - perhaps because syslogd didn't have time to write messages before system becomes completely frozen.

The last relevant line in Xorg.log is:
[ 85260.319] (WW) intel(0): flip queue failed: Input/output error

My environment:
OS: Gentoo linux
Kernel: 3.0.4-gentoo
xf86-video-intel-2.16.0, xorg-server-1.11.0, mesa-7.11
Hardware: Lenovo L420, Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz.

lspci -vv:
00:02.0 VGA compatible controller: Intel Corporation Device 0116 (rev 09) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device 21dd
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 43
Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Region 4: I/O ports at 1800 [size=64]
Expansion ROM at <unassigned> [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee0f00c Data: 4181
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a4] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: i915
Kernel modules: i915

Created attachment 52079 [details]
i915 warnings from syslog
Here are several warnings with backtraces from i915.ko which I found in my syslog today. They didn't cause a system lockup, but XV has stopped working.
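The repeated `i915_wait_request returns -11` lines in the dmesg excerpts of this report are easier to compare once the sequence numbers are pulled out. What follows is a minimal sketch, assuming the message format seen in these logs; `parse_wait_request` and `seqno_is_stuck` are hypothetical helper names made up for illustration, not part of any i915 tooling. On Linux, -11 is -EAGAIN, and the "at" value is read here as the last sequence number the ring reported, which is an interpretation of the message rather than something stated in the thread.

```python
import re

# Matches both spellings seen in this report:
#   i915_do_wait_request returns -11 (awaiting 498706 at 498705, next 498707)
#   i915_wait_request returns -11 (awaiting 6338481 at 6338472, next 6338482)
WAIT_RE = re.compile(
    r"i915_(?:do_)?wait_request returns (-?\d+) "
    r"\(awaiting (\d+) at (\d+), next (\d+)\)"
)


def parse_wait_request(line):
    """Return (errno, awaiting, current, next) as ints, or None if no match."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def seqno_is_stuck(lines):
    """True if two or more wait_request lines all report the same 'at' seqno."""
    parsed = (parse_wait_request(line) for line in lines)
    current = [fields[2] for fields in parsed if fields is not None]
    return len(current) >= 2 and len(set(current)) == 1


# Lines copied from the dmesg excerpt in this report.
SAMPLE = [
    "[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung",
    "[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 "
    "(awaiting 6338481 at 6338472, next 6338482)",
    "[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 "
    "(awaiting 6338482 at 6338472, next 6338483)",
    "[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 "
    "(awaiting 6338483 at 6338472, next 6338484)",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(parse_wait_request(line))
    print("stuck:", seqno_is_stuck(SAMPLE))
```

In the excerpt the awaited number keeps climbing while the "at" number stays at 6338472, which is consistent with a ring that has stopped making progress.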
I've tried to build i915.ko from the most recent git tree (taken from git://people.freedesktop.org/~keithp/linux, branch drm-intel-fixes, commit cd0de039bff32ee314046c0e4c047c38aa696f84). It required two small changes to make it compile with the current stable kernel (3.0.7). But the problem is still present: the kernel module prints warnings to dmesg and acceleration gets disabled, but the whole system doesn't hang. I'll attach the dmesg and i915_error_state: Created attachment 52618 [details]
dmesg
Created attachment 52619 [details]
Error state collected during warnings reported above
Here's the culprit (finally):

0x042cb5ac: 0x7b003c04: 3DPRIMITIVE: rect list sequential
0x042cb5b0: 0xffffff4a: vertex count
0x042cb5b4: 0x000001dd: start vertex
0x042cb5b8: 0x00000001: instance count
0x042cb5bc: 0x00000000: start instance
0x042cb5c0: 0x00000000: index bias

which I believe is fixed by

commit 786a770f528a0daee2971494352672cb89f48384
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Aug 25 19:52:13 2011 +0100

sna/video: Flush the video state at the end of the operation

Or in the case where a second command is received prior to the batch being flushed, the vertex data is not flushed and leads to the a miscompution of the number of vertices emitted.

Reported-by: Elias Probst <mail@eliasprobst.eu>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=40332
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

It doesn't help. I've applied this patch to xf86-video-intel-2.16.0. The error occurs less frequently, but still occurs. Here is the snippet from dmesg:

Nov 1 22:38:21 nout kernel: [drm:ironlake_update_pch_refclk] *ERROR* enabling SSC on PCH
Nov 1 22:43:08 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 1 22:43:08 nout kernel: [drm:kick_ring] *ERROR* Kicking stuck wait on render ring
Nov 1 22:43:14 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 1 22:43:14 nout kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Nov 1 22:43:14 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 344308 at 344304, next 344309)

Unfortunately, I was in a hurry and forgot to copy i915_error_state :( Next I'll try to update xf86-video-intel to version 2.16.901 (which should already include the patch above).

Created attachment 53276 [details]
i915_error_state
Again "GPU hung", now with xf86-video-intel-2.16.901 (which should include the patch cited above) and kernel 3.1.0-gentoo. I've copied i915_error_state (see the attachment). Interestingly, it was not easy to copy i915_error_state: most attempts resulted in a "page allocation failed" error (in dmesg), no matter which method I chose to copy it (cp, cat, dd, etc.). But one attempt was successful.
The dmesg will be attached in the next comment.
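A short worked example may help with the 3DPRIMITIVE dump quoted earlier in the thread, where the `vertex count` dword is 0xffffff4a. Read as an unsigned 32-bit value that is over four billion vertices; read as a two's-complement signed value it is a small negative number, which is what a counter that went below zero would look like. This reading is an interpretation that fits the "miscomputation of the number of vertices" wording in the commit message, not something the thread confirms. `as_signed32` is a hypothetical helper written for this illustration.

```python
import struct


def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as two's-complement signed."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


# The "vertex count" dword from the 3DPRIMITIVE dump in this report.
VERTEX_COUNT = 0xFFFFFF4A

if __name__ == "__main__":
    print("unsigned:", VERTEX_COUNT)
    print("signed:  ", as_signed32(VERTEX_COUNT))
```

For comparison, the `start vertex` dword in the same dump is 0x000001dd, an ordinary small positive value.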
Created attachment 53277 [details]
dmesg
Dmesg corresponding to the last i915_error_state.
Created attachment 53694 [details]
error state
Another "gpu hang". Now with xf86-video-intel-2.16.902 and linux-3.1.0.
I also occasionally observe messages like "[drm:ironlake_update_pch_refclk] *ERROR* enabling SSC on PCH" in syslog, but they don't necessarily lead to any other problems.
Created attachment 53695 [details]
dmesg corresponding to last error_state
I have this issue with xf86-video-intel-2.17.0-r2 and kernel 2.6.39-r3 on gentoo too.

The last error state is a different bug entirely,

commit 856da892d8caaeaf19748a1705e993a5eff2c28e in danvet/my-next
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Nov 29 15:12:16 2011 +0000

drm/i915: Only clear the GPU domains upon a successful finish

By clearing the GPU read domains before waiting upon the buffer, we run the risk of the wait being interrupted and the domains prematurely cleared. The next time we attempt to wait upon the buffer (after userspace handles the signal), we believe that the buffer is idle and so skip the wait.

There are a number of bugs across all generations which show signs of an overly haste reuse of active buffers.

should fix it.

Nick, it would be best if you open a separate bug report with your own i915_error_state so that we don't make the mistake of assuming we have identified your issue.

(In reply to comment #18)
> The last error state is a different bug entirely,
>
> commit 856da892d8caaeaf19748a1705e993a5eff2c28e in danvet/my-next
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue Nov 29 15:12:16 2011 +0000
>
> drm/i915: Only clear the GPU domains upon a successful finish
>
> By clearing the GPU read domains before waiting upon the buffer, we run
> the risk of the wait being interrupted and the domains prematurely
> cleared. The next time we attempt to wait upon the buffer (after
> userspace handles the signal), we believe that the buffer is idle and so
> skip the wait.
>
> There are a number of bugs across all generations which show signs of an
> overly haste reuse of active buffers.
>
> should fix it.
>
> Nick, it would be best if you open a separate bug report with your own
> i915_error_state so that we don't make the mistake of assuming we have
> identified your issue.

I've just applied this patch, and got the error within several minutes of rebooting:

Dec 14 01:48:59 nout kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 14 01:48:59 nout kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 14 01:48:59 nout kernel: [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 165705 at 165692, next 165708)
Dec 14 01:48:59 nout kernel: [drm:ironlake_update_pch_refclk] *ERROR* enabling SSC on PCH

The error state will be attached to the next comment.

Created attachment 54404 [details]
Error state for previous commend
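The ordering problem described in the commit message quoted above ("Only clear the GPU domains upon a successful finish") can be sketched with a toy model. This is not the kernel code: `Buffer`, `wait_for_gpu`, `finish_buggy` and `finish_fixed` are hypothetical names, and the model only captures the order of "clear the read domains" versus "wait", assuming the first wait is interrupted by a signal and then retried.

```python
class Buffer:
    """Toy stand-in for a GEM buffer object (hypothetical, not the kernel struct)."""

    def __init__(self):
        self.gpu_read_domains = True  # driver's belief: the GPU may still read it
        self.gpu_busy = True          # ground truth in this model


def wait_for_gpu(buf, interrupted):
    """Pretend to wait; return False if a signal interrupted the wait."""
    if interrupted:
        return False
    buf.gpu_busy = False
    return True


def finish_buggy(buf, interrupted):
    """Clears the domains before waiting, so an interrupted wait loses them."""
    if not buf.gpu_read_domains:
        return True  # believed idle: the wait is skipped
    buf.gpu_read_domains = False
    return wait_for_gpu(buf, interrupted)


def finish_fixed(buf, interrupted):
    """Clears the domains only after the wait has completed successfully."""
    if not buf.gpu_read_domains:
        return True
    if not wait_for_gpu(buf, interrupted):
        return False
    buf.gpu_read_domains = False
    return True


def busy_after_retry(finish):
    """Interrupt the first attempt, retry, and report whether the GPU is still busy."""
    buf = Buffer()
    finish(buf, interrupted=True)
    finish(buf, interrupted=False)
    return buf.gpu_busy


if __name__ == "__main__":
    print("buggy ordering, still busy after retry:", busy_after_retry(finish_buggy))
    print("fixed ordering, still busy after retry:", busy_after_retry(finish_fixed))
```

In the buggy ordering the retry returns success while the modelled GPU is still busy, which is the premature reuse of an active buffer that the commit message describes.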
I forgot to mention that I've upgraded xf86-video-intel to the latest available version, 2.17.0-r2 (Gentoo, which is 2.17.0 + patch "sna: Avoid the double application of drawable offsets for tiled spans").

(In reply to comment #20)
> Created attachment 54404 [details]
> Error state for previous commend

That's a broken mesa missing the depth stall workarounds, at least,

commit 407785d0e97abd0cc51a6e360089111973748e7c
Author: Eric Anholt <eric@anholt.net>
Date: Mon Jul 18 17:17:03 2011 -0700

i965: Enable the PIPE_CONTROL workaround workaround out of paranoia.

There's scary stuff going on in PIPE_CONTROL internals, and if the BSpec says to do this to make PIPE_CONTROL work, I'll go ahead and do it because we'll probably never be able to debug it after the fact.

v2: Use stall at scoreboard instead of depth stall, as noted by Ken.

(In reply to comment #22)
> That's a broken mesa missing the depth stall workarounds, at least,

Is it OK that ordinary userspace program launched by ordinary user (non-root) can cause kernel error? May be it is a security issue?

Or DRI is insecure by design?

On Wed, Dec 14, 2011 at 02:13:06PM +0000, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=40503
>
> --- Comment #23 from Slava Gorbunov <slava@fizlesh.org.ru> 2011-12-14 06:13:06 PST ---
> (In reply to comment #22)
> > That's a broken mesa missing the depth stall workarounds, at least,
>
> Is it OK that ordinary userspace program launched by ordinary user (non-root)
> can cause kernel error? May be it is a security issue?
>
> Or DRI is insecure by design?

GPUs are simply broken in reality (but not by design). In theory the kernel could check the command stream that userspace generates for the GPU more thoroughly, but in practice this is pointless because:
- it would make things really slow
- GPUs have tons of errata, so an earlier kernel won't know about all of them.

So an old userspace on an old kernel would always be able to crash your kernel (if we missed a workaround, as seems to be the case here).

With that out of the way, did upgrading mesa fix your issues?

(In reply to comment #24)
> With that out of the way, did upgrading mesa fix your issues?

No, it doesn't. Furthermore, the mesa version on my system was initially (i.e. at the moment of my first report) 7.11, which, as far as I understand, already includes the commit cited above. Yesterday I upgraded mesa to version 7.11.2, and this morning I found my notebook completely frozen... Again, I doubt that this upgrade could help, because 7.11 already includes this patch.

How about Mesa 8.0? Do you have a recent error-state?

Timeout. Please do reopen if you can still reproduce the issue and help us diagnose the problem, thanks.
Created attachment 50733 [details]
dmesg

Bug description: System freezes after playing an OpenGL game, sometimes after 60 minutes and other times after just 10 minutes. The whole system locks up completely: no SysRq, only the power button works.

System environment:
-- chipset: H67
-- system architecture: 64-bit
-- xf86-video-intel: 2.15.0
-- xserver: X.Org X Server 1.10.3
-- mesa: OpenGL version string: 2.1 Mesa 7.11
-- libdrm:
-- kernel: 2.6.40.3-0.fc15.x86_64
-- Linux distribution: Fedora 15, but it also happens with Ubuntu 11.04
-- Machine or mobo model: Asrock H67
-- Display connector: HDMI

Reproducing steps: Start the game "Heroes of Newerth" (a free game with native Linux support). Log in, go to matchmaking, and start a game. Sometimes I can play for over an hour and sometimes for just 10 minutes.

Additional info: /sys/kernel/debug/dri/0/i915_error_state -> No error state collected.

Error occurs on:
Ubuntu 11.04
Ubuntu 11.04 + xorg-edgers repo.
Fedora 15
Ubuntu 11.04 + xorg-edgers repo. + 3.1-rc4
Fedora 15 + 3.1-rc4