Bug 36515

Summary: GPU lockup WAIT_FOR_EVENT on disabled pipe
Product: DRI Reporter: Bryce Harrington <bryce>
Component: DRM/Intel Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED QA Contact:
Severity: major    
Priority: high CC: ben, bgamari, chris, daniel, devtry1, dries, eugeni, jbarnes, kamil.42920, leann.ogasawara, pcarns, przanoni
Version: unspecified   
Hardware: x86 (IA32)   
OS: Linux (All)   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=43088
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 42991, 44622    
Attachments:
Description Flags
BootDmesg.txt none
CurrentDmesg.txt none
CurrentDmesg.txt none
Apply the big hammer to finish the fb before disabling it. none
Apply the big hammer to finish the fb before disabling it. none
dmesg none
i915_error_state none

Description Bryce Harrington 2011-04-22 16:27:58 UTC
Forwarding this bug from Ubuntu reporter Stuart Langridge:
http://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/768184

[Problem]
Infrequent gpu lockup on i965.

We've had a handful of reports in the last couple weeks of a gpu lockup on i965 systems which had not had freeze troubles for a long while (>6 months).  Most reporters have experienced the freeze only once or twice; they don't know how to reproduce it, nor really have a way to definitively tell whether it is fixed or just occurs rarely.

I'm forwarding this report on the chance that upstream recognizes the bug; I don't think users are going to be able to pin this down any further.

Bugs I believe to be dupes, all on i965 systems:

768184  IPEHR: 0x01800020
767511  IPEHR: 0x60020100
767425  IPEHR: 0x08000000
757968  IPEHR: 0x14000000
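
As a rough aid for triaging the IPEHR values listed above, the sketch below splits each value into its command-client and opcode fields. It assumes the MI command layout used in the i915 register headers (client in bits 31:29, MI opcode in bits 28:23, with MI_WAIT_FOR_EVENT as opcode 0x03); the helper names are made up for illustration and nothing here is taken from the driver itself.

```python
# Hypothetical helpers: split an IPEHR value into command fields.
# Assumed layout: bits 31:29 = command client (0 = MI), bits 28:23 = MI opcode.
MI_CLIENT = 0x0
MI_WAIT_FOR_EVENT_OPCODE = 0x03  # assumption: corresponds to MI_INSTR(0x03, 0)


def decode_ipehr(ipehr):
    """Return (client, opcode) extracted from a 32-bit IPEHR value."""
    client = (ipehr >> 29) & 0x7
    opcode = (ipehr >> 23) & 0x3F
    return client, opcode


def is_wait_for_event(ipehr):
    """True if the value looks like an MI_WAIT_FOR_EVENT command header."""
    return decode_ipehr(ipehr) == (MI_CLIENT, MI_WAIT_FOR_EVENT_OPCODE)


if __name__ == "__main__":
    # The four values from the suspected duplicates listed above.
    for value in (0x01800020, 0x60020100, 0x08000000, 0x14000000):
        client, opcode = decode_ipehr(value)
        print(f"{value:#010x}: client={client} opcode={opcode:#04x} "
              f"wait_for_event={is_wait_for_event(value)}")
```

Under these assumptions only the value from 768184 decodes as a WAIT_FOR_EVENT; the other three have a different client or opcode, so whether they share a root cause cannot be read off IPEHR alone.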

These i965 reports started coming in shortly after we updated Ubuntu from xserver 1.10.0 to 1.10.1 and mesa from 7.10.1 to 7.10.2, and added patch 25521900d to -intel (bug #35808).  (Due to the intermittency of the bug I haven't had people try downgrading those packages.)

[Original Description]
Crash which required reboot. The crash itself is described in https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/768176 and this is after I persuaded apport-gpu-error-intel.py to run.

My screen went entirely black (both laptop screen and second monitor). Switching to a VC did not show anything on screen. At first I could still hear sounds from running applications, but eventually (after ~10 seconds) they stopped. I had to powercycle the machine to get control back. The "system problem detected" apport dialog offered to let me file a bug, but then I got another crash dialog saying "apport-gpu-error-intel.py closed unexpectedly".

ProblemType: Crash
DistroRelease: Ubuntu 11.04
Package: xserver-xorg-video-intel 2:2.14.0-4ubuntu7
ProcVersionSignature: Ubuntu 2.6.38-8.42-generic-pae 2.6.38.2
Uname: Linux 2.6.38-8-generic-pae i686
Architecture: i386
Chipset: i965gm
CompositorRunning: compiz
DRM.card0.HDMI.A.1:
status: disconnected
enabled: disabled
dpms: Off
modes:
edid-base64:
DRM.card0.LVDS.1:
status: connected
enabled: enabled
dpms: On
modes: 1280x800
edid-base64: AP///////wAwZAYjMjQ5NTISAQOAHRJ4Cof1lFdPjCcnUFQAAAABAQEBAQEBAQEBAQEBAQEBKhwAqFAgHjAQMCIAH7QQAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAA/gBSUDc3NKMxMzNFV0REAAAA/gAIDBAUKFB/2AEBCiAgAL4=
DRM.card0.VGA.1:
status: connected
enabled: enabled
dpms: On
modes: 1680x1050 1280x1024 1280x1024 1280x960 1152x864 1024x768 1024x768 1024x768 832x624 800x600 800x600 800x600 800x600 640x480 640x480 640x480 640x480 720x400
edid-base64: AP///////wBMLdIDMjJBSCMTAQMOMB54KtxVo1lIniQRUFS/74CzAIGAgUBxTwEBAQEBAQEBITmQMGIaJ0BosDYA2igRAAAcAAAA/QA4Sx5REAAKICAgICAgAAAA/ABTeW5jTWFzdGVyCiAgAAAA/wBIOUZTODM5NDg1CiAgAAI=
Date: Thu Apr 21 10:25:20 2011
DistUpgraded: Log time: 2011-01-18 17:25:59.814253
DistroCodename: natty
DistroVariant: ubuntu
DuplicateSignature: (ESR: 0x00000001 IPEHR: 0x01800020)
ExecutablePath: /home/aquarius/apport-gpu-error-intel.py
GraphicsCard:
Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (primary) [8086:2a02] (rev 0c) (prog-if 00 [VGA controller])
Subsystem: Dell Device [1028:0209]
InterpreterPath: /usr/bin/python2.7
MachineType: Dell Inc. XPS M1330
ProcCmdline: python apport-gpu-error-intel.py
ProcEnviron:
PATH=(custom, user)
LC_MESSAGES=en_GB.utf8
LANG=en_US.UTF-8
LANGUAGE=en_GB:en
ProcKernelCmdLine: root=UUID=b572742c-deea-43ec-92d3-b1d1e6b6802f ro quiet splash
ProcKernelCmdLine_: root=UUID=b572742c-deea-43ec-92d3-b1d1e6b6802f ro quiet splash
RelatedPackageVersions:
xserver-xorg             1:7.6+4ubuntu3
libdrm2                  2.4.23-1ubuntu6
xserver-xorg-video-intel 2:2.14.0-4ubuntu7
SourcePackage: xserver-xorg-video-intel
Title: [i965gm] GPU lockup (ESR: 0x00000001 IPEHR: 0x01800020)
UpgradeStatus: Upgraded to natty on 2011-01-18 (92 days ago)
UserGroups: adm admin cdrom couchdb dialout dip floppy fuse lpadmin plugdev video
dmi.bios.date: 12/26/2008
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A15
dmi.board.name: 0N6705
dmi.board.vendor: Dell Inc.
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA15:bd12/26/2008:svnDellInc.:pnXPSM1330:pvr:rvnDellInc.:rn0N6705:rvr:cvnDellInc.:ct8:cvr:
dmi.product.name: XPS M1330
dmi.sys.vendor: Dell Inc.
version.compiz: compiz 1:0.9.4+bzr20110415-0ubuntu2
version.libdrm2: libdrm2 2.4.23-1ubuntu6
version.libgl1-mesa-dri: libgl1-mesa-dri 7.10.2-0ubuntu2
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 7.10.2-0ubuntu2
version.xserver-xorg: xserver-xorg 1:7.6+4ubuntu3
version.xserver-xorg-video-ati: xserver-xorg-video-ati N/A
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.14.0-4ubuntu7
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:0.0.16+git20110107+b795ca6e-0ubuntu7
Comment 1 Bryce Harrington 2011-04-22 16:31:26 UTC
Created attachment 45980 [details]
BootDmesg.txt
Comment 2 Bryce Harrington 2011-04-22 16:31:47 UTC
Created attachment 45981 [details]
CurrentDmesg.txt
Comment 3 Bryce Harrington 2011-04-22 16:32:03 UTC
Created attachment 45982 [details]
CurrentDmesg.txt
Comment 5 Chris Wilson 2011-04-27 00:56:41 UTC
Bryce, one aspect that we are wary of with 965G[M] is that the early chipsets had severe issues with memory above 4G. Is the memory configuration captured in the LP reports? The attached dmesg has 4G + PAE, is that common?
Comment 6 Timo Jyrinki 2011-04-27 01:14:02 UTC
One affected 965gm user here (bug report with attachments https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/771655) - 4GB of memory but no PAE, i.e. 64-bit. On the other hand, my problem is simply X.org crashing/segfaulting; I don't get apport triggered for a GPU lockup bug report. So sorry for the (possible) noise, even though my problem is clearly coming from the same bunch of changes and is similarly random/rare.

To make up for that, I went through the mentioned lockup bug reports to answer the question: only two have PAE and four don't, but all of those i965gm GPU lockup reports so far seem to be i686, unlike mine.
Comment 7 Chris Wilson 2011-06-16 05:08:29 UTC
Created attachment 48039 [details] [review]
Apply the big hammer to finish the fb before disabling it.
Comment 8 Chris Wilson 2011-06-16 05:55:04 UTC
Created attachment 48043 [details] [review]
Apply the big hammer to finish the fb before disabling it.

When flushing before disabling, it helps to do it before and not after the disable.
Comment 9 Bryce Harrington 2011-06-16 15:41:41 UTC
Created attachment 48066 [details]
dmesg

I think I may have reproduced this same bug on my own i965 finally.  Not sure exactly how I did it, but it showed up after a lid open event (resume from sleep I guess).  The machine has been plugged into its docking station with external monitor continuously.
Comment 10 Bryce Harrington 2011-06-16 15:42:15 UTC
Created attachment 48067 [details]
i915_error_state

IPEHR=0x01820000
Comment 11 Chris Wilson 2011-06-16 16:15:59 UTC
I was hoping to see the contents of the display registers in the error state to confirm the theory about the WAIT_FOR_EVENT being on a disabled pipe. Alas, that feature isn't part of that kernel.
Comment 12 Chris Wilson 2011-06-16 16:18:52 UTC
May I also make a polite request that you enable pageflipping once more ;-)

I wonder if we should just be waiting for the VBLANK on a full screen blit rather than a range that is impossible. Hmm.
Comment 13 Chris Wilson 2011-07-08 03:17:30 UTC
*** Bug 35576 has been marked as a duplicate of this bug. ***
Comment 14 Chris Wilson 2011-07-08 03:22:19 UTC
*** Bug 37450 has been marked as a duplicate of this bug. ***
Comment 15 Kamil Iskra 2011-07-08 07:45:07 UTC
A bug I reported (Bug 37450) has been marked as a duplicate of this bug, and this bug is marked as NEEDINFO.

Since I can reproduce the bug I reported 100% of the time, please let me know if you would like me to provide any additional info.
Comment 16 Chris Wilson 2011-07-18 08:00:59 UTC
Kamil, can you try applying the patch https://bugs.freedesktop.org/attachment.cgi?id=48043 to your kernel and see whether that is sufficient?

I'm confident that's the fix, just waiting for testing.
Comment 17 Kamil Iskra 2011-07-19 21:39:08 UTC
I applied the patch to 2.6.39.3 kernel, but it did *not* help.  I'm seeing the same problem as before (enabling an output after suspend/resume hangs the server).  Do I need to be running a newer kernel perhaps?

xf86-video-intel: 2.15.0
xorg-server: 1.10.2
mesa: 7.10.3
libdrm: 2.4.26
kernel: 2.6.39.3

Comment 18 Chris Wilson 2011-07-20 02:45:38 UTC
Sigh. After applying the patch can you post an i915_error_state.
Comment 19 Kamil Iskra 2011-07-20 09:15:38 UTC
$ cat /sys/kernel/debug/dri/0/i915_error_state 
no error state collected

That's after a restart of the X server (Ctrl-Alt-Bcksp) so that I can access the machine again; I assume that would not reset i915_error_state?
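
For anyone scripting this check, a minimal sketch follows. It assumes the debugfs path shown above and that an empty capture is reported with the literal text "no error state collected", as in the output above; the function name and the `path` parameter are illustrative only.

```python
# Sketch: read i915_error_state and report whether a hang was captured.
# The default path is the one quoted above; reading it normally requires
# root and a mounted debugfs.
DEFAULT_PATH = "/sys/kernel/debug/dri/0/i915_error_state"
EMPTY_MARKER = "no error state collected"


def read_error_state(path=DEFAULT_PATH):
    """Return the captured error state text, or None if nothing was captured."""
    with open(path, "r", errors="replace") as handle:
        text = handle.read()
    stripped = text.strip()
    if not stripped or stripped == EMPTY_MARKER:
        return None
    return text


if __name__ == "__main__":
    try:
        state = read_error_state()
    except OSError as exc:
        print(f"could not read error state: {exc}")
    else:
        print("no hang recorded" if state is None else state)
```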

The only indication in the logs I can see is in /var/log/Xorg.0.log:

[   259.306] (WW) intel(0): flip queue failed: Invalid argument
[   259.306] (WW) intel(0): Page flip failed: Invalid argument
[   260.299] (WW) intel(0): flip queue failed: Device or resource busy
[   260.299] (WW) intel(0): Page flip failed: Device or resource busy
[last two lines repeating]
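
To see how often each of these warnings occurs in a longer log, something like the following sketch could be used. It only assumes the line format quoted above; the regular expression and function name are illustrative, not part of any Xorg tooling.

```python
import re
from collections import Counter

# Matches warnings of the form quoted above, e.g.
# "[   259.306] (WW) intel(0): Page flip failed: Invalid argument"
WARNING_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+\(WW\)\s+intel\(\d+\):\s+"
    r"(?P<what>flip queue failed|Page flip failed):\s+(?P<reason>.+?)\s*$"
)


def tally_flip_failures(lines):
    """Count (message, reason) pairs among page-flip warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            counts[(match.group("what"), match.group("reason"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[   259.306] (WW) intel(0): flip queue failed: Invalid argument",
        "[   259.306] (WW) intel(0): Page flip failed: Invalid argument",
        "[   260.299] (WW) intel(0): flip queue failed: Device or resource busy",
        "[   260.299] (WW) intel(0): Page flip failed: Device or resource busy",
    ]
    for (what, reason), count in tally_flip_failures(sample).items():
        print(f"{count:3d}  {what}: {reason}")
```

In practice the lines would come from iterating over /var/log/Xorg.0.log rather than the inline sample.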

These start occurring after I enable an output using xrandr (after a suspend/resume cycle); Xorg works for a while, but hangs immediately after I switch to a text console and back to X (a required action to actually see something via the new output, as per https://bugzilla.kernel.org/show_bug.cgi?id=24982).

A workaround that works for me is to modify xf86-video-intel to force intel->use_pageflipping to FALSE.  I believe there used to be a user-accessible option to turn it off, but it's been removed?  That is rather unfortunate, I must say.
Comment 20 Chris Wilson 2011-07-20 09:22:35 UTC
I was just about to add that you hit kernel bug #24982...

So we can't tell whether the GPU lockup itself has been fixed if the second bug prevents you from testing.
Comment 21 Kamil Iskra 2011-07-20 09:45:30 UTC
Are you saying that *this* bug is probably fixed, but X still hangs because of the (unrelated) DPMS bug in the kernel?  That could be, as I no longer see the GPU hung messages.

Well, I guess all I can do at this point is sit and wait for that kernel bug to be fixed, hopefully some time soon; it's been open since last year...  I'd be happy to try any patches you guys might have.
Comment 22 Chris Wilson 2011-07-29 02:28:04 UTC
Ok, to be really complicated, can you please retest this patch on top of keithp/drm-intel-fixes [git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6.git]. Hopefully we have the modeswitching bug fixed and so we can then successfully test the WAIT_FOR_EVENT fix...
Comment 23 Kamil Iskra 2011-07-29 22:22:22 UTC
Chris, drm-intel-fixes (last commit cda2bb78c24de7674eafa3210314dc75bed344a6) does *not* fix the modeswitching bug for me.  I guess no point in retesting your patch then?
Comment 24 Chris Wilson 2011-07-30 01:47:00 UTC
The patch should prevent the GPU hang upon turning off a pipe, but it is a nuisance: if the machine is dying for other reasons, we can't be sure that the patch is sufficient.
Comment 25 Eugeni Dodonov 2011-10-19 12:32:34 UTC
Hi,

does this still happen with the latest versions of the drivers, or is it no longer an issue?
Comment 26 Chris Wilson 2011-10-19 13:16:22 UTC
Yes, the patch is still required, just no one has volunteered to test it.
Comment 27 Kamil Iskra 2011-10-19 22:07:31 UTC
Well, I would've loved to test it, but I just tried kernel 3.1-rc10 and with vanilla xf86-video-intel 2.16.0 the kernel still crashes for me on enabling an output via xrandr.  I assume it's due to the infamous kernel bug 24982, which has probably been open for a year now with no resolution in sight, though with kernel bugzilla apparently still being down (pathetic), it's hard to tell.

For what it's worth, with your patch applied, the kernel seems to crash less easily for me than without it.
Comment 28 Chris Wilson 2011-10-20 02:53:36 UTC
(In reply to comment #27)
> Well, I would've loved to test it, but I just tried kernel 3.1-rc10 and with
> vanilla xf86-video-intel 2.16.0 the kernel still crashes for me on enabling an
> output via xrandr.  I assume it's due to the infamous kernel bug 24982, which
> has probably been open for a year now with no resolution in sight, though with
> kernel bugzilla apparently still being down (pathetic), it's hard to tell.

bugzilla.kernel.org and that I'm currently unaware of any crash inside i915.ko, so you're going to have to remind me...
Comment 29 Kamil Iskra 2011-10-24 21:47:09 UTC
(In reply to comment #28)
> I'm currently unaware of any crash inside i915.ko,
> so you're going to have to remind me...

Chris, please see comment #19 in this bugzilla entry, or, for a complete description, see bug #37450.  In essence, it seems that stale DPMS properties (kernel bug 24982), which normally just result in a blank screen, can in some situations result in a crash/hang.  When I originally reported it I could only trigger it after suspend/resume; nowadays I can reproduce it just by repeatedly enabling and disabling an output a few times.  The only workaround that works for me is modifying the xf86-video-intel driver to force page flipping off.
Comment 30 Chris Wilson 2011-10-26 02:08:42 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > I'm currently unaware of any crash inside i915.ko,
> > so you're going to have to remind me...
> 
> Chris, please see comment #19 in this bugzilla entry, or, for a complete
> description, see bug #37450.  In essence, it seems that stale DPMS properties
> (kernel bug 24982), which normally just result in a blank screen, can in some
> situations result in a crash/hang.  When I originally reported it I could only
> trigger it after suspend/resume; nowadays I can reproduce it just by repeatedly
> enabling and disabling an output a few times.  The only workaround that works
> for me is modifying the xf86-video-intel driver to force page flipping off.

Ok, I think we know that bug and had a fix for the races inside the page-flipping code, but I think Keith dropped them on the floor...
Comment 31 Chris Wilson 2011-11-09 07:45:27 UTC
*** Bug 40526 has been marked as a duplicate of this bug. ***
Comment 32 Chris Wilson 2011-11-09 07:45:34 UTC
*** Bug 40527 has been marked as a duplicate of this bug. ***
Comment 33 Paulo Zanoni 2011-11-09 08:47:22 UTC
All our 4 duplicates were high/major. Adjusting.
Comment 34 Gordon Jin 2012-01-03 21:53:36 UTC
(clear needinfo)
Comment 35 Chris Wilson 2012-01-20 12:48:05 UTC
*** Bug 45000 has been marked as a duplicate of this bug. ***
Comment 36 Leann Ogasawara 2012-02-03 14:08:26 UTC
Hi Stuart,

Per a recent request, I've built an Ubuntu test kernel based on the latest drm-intel-fixes branch (e.g. git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux.git drm-intel-fixes) and applied the patch noted in comment 16.  If you could please test and post your results that would be great.  Thanks in advance.

http://people.canonical.com/~ogasawara/fdo36515/
Comment 37 Chris Wilson 2012-02-08 07:48:41 UTC
*** Bug 40052 has been marked as a duplicate of this bug. ***
Comment 38 Chris Wilson 2012-04-10 15:07:03 UTC
*** Bug 48518 has been marked as a duplicate of this bug. ***
Comment 39 devtry1 2012-04-11 11:00:44 UTC
I'm the original reporter of bug 48518.

Here is the description of my problem (on Ubuntu 12.04):

I was working on an external monitor with my notebook.
I changed the monitor settings to turn off the external monitor and turn on the notebook integrated monitor.
When I applied the changes something went wrong: I couldn't see anything. Then I replugged the VGA cable for the external monitor and something appeared again, but it was all working strangely and not as it should.

Unfortunately I was working and I had to restart the computer; anyway nearly nothing worked: it was just a miracle that I managed to report the bug. I couldn't even open a terminal, and I don't think I would have managed to take a screenshot. Nor did I have a camera at hand to take a photo.

I'll do my best to describe what I saw:

 - no launcher appearing when approaching the left screen side
 - no panel (the top bar)
 - all the windows (which were already open) overlapping and cut
 - clicking on a window didn't bring it to the foreground, but I was able to interact with the application program (write, click buttons)
 - no keyboard shortcut worked (at least not visibly); I tried using these: CTRL-ALT-T (to open a Terminal), Windows button (to open the Dash), ALT-F4 (to close windows)
 - I was able to drag the windows

I didn't think about this when it happened: maybe even if I was "blind" everything was working and I could take a screenshot. Anyway it's too late and I'm still not sure I could.

For further details please refer to "https://bugs.freedesktop.org/show_bug.cgi?id=48518" and the original Ubuntu bug report.

If somehow I can help you, please let me know.

Thanks!
Comment 40 devtry1 2012-04-11 23:54:57 UTC
I tried to reproduce the bug but without success: everything went right.
Comment 41 Kamil Iskra 2012-04-12 07:51:03 UTC
devtry1, don't worry, you're not the only one with this problem.  Given your extended description of the issue in comment #39, I would say Chris was right to merge these two bug reports.  I have sporadically seen what appears to be the same problem, only with KDE.  The GPU will hang, the timer will time out and usually manage to reset the chip.  In the process, however, the compositing window manager dies, so you no longer have control over window positions, desktop shortcuts, etc.  For me the solution is generally to reboot, because even if I can successfully restart the X server, the GPU is in an ill-defined state so 3D acceleration doesn't work properly.  I have not been able to find a reliable way of reproducing it; sometimes it "just happens", luckily not very frequently.
Comment 42 Chris Wilson 2012-04-16 05:40:57 UTC
The first of the fixes has landed:

commit 14667a4bde4361b7ac420d68a2e9e9b9b2df5231
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 3 17:58:35 2012 +0100

    drm/i915: Finish any pending operations on the framebuffer before disabling
    
    Similar to the case where we are changing from one framebuffer to
    another, we need to be sure that there are no pending WAIT_FOR_EVENTs on
    the pipe for the current framebuffer before switching. If we disable the
    pipe, and then try to execute a WAIT_FOR_EVENT it will block
    indefinitely and cause a GPU hang.
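
The ordering problem described in the commit message can be illustrated with a toy model. This is not the kernel code: `Pipe`, `Ring`, and the two disable functions are invented for the illustration, and the model only assumes what the message states, namely that a WAIT_FOR_EVENT on a disabled pipe never completes.

```python
# Toy model of a command ring holding WAIT_FOR_EVENT entries (not kernel code).
from collections import deque


class Pipe:
    def __init__(self, name):
        self.name = name
        self.enabled = True


class Ring:
    def __init__(self):
        # Each entry is the pipe that a queued WAIT_FOR_EVENT is waiting on.
        self.commands = deque()

    def queue_wait_for_event(self, pipe):
        self.commands.append(pipe)

    def run(self):
        """Execute queued waits; return False if the ring would hang."""
        while self.commands:
            if not self.commands[0].enabled:
                return False  # the event never fires, so the wait blocks forever
            self.commands.popleft()  # the event fires and the wait retires
        return True


def disable_then_flush(ring, pipe):
    """Problematic order: the pipe goes away while a wait is still pending."""
    pipe.enabled = False
    return ring.run()


def flush_then_disable(ring, pipe):
    """Order described by the fix: finish pending operations, then disable."""
    completed = ring.run()
    pipe.enabled = False
    return completed


if __name__ == "__main__":
    for disable in (disable_then_flush, flush_then_disable):
        ring, pipe = Ring(), Pipe("A")
        ring.queue_wait_for_event(pipe)
        outcome = "ok" if disable(ring, pipe) else "GPU hang"
        print(disable.__name__, "->", outcome)
```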
