Forwarding this bug from Ubuntu reporter Stuart Langridge:
Infrequent gpu lockup on i965.
We've had a handful of reports in the last couple weeks of a gpu lockup on i965 systems which had not had freeze troubles for a long while (>6 months). Most reporters have experienced the freeze only once or twice; they don't know how to reproduce it, nor really have a way to definitively tell whether it is fixed or just occurs rarely.
I'm forwarding this report on the chance that upstream recognizes the bug; I don't think users are going to be able to pin this down any further.
Bugs I believe to be dupes, all on i965 systems:
768184 IPEHR: 0x01800020
767511 IPEHR: 0x60020100
767425 IPEHR: 0x08000000
757968 IPEHR: 0x14000000
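Each of the four suspected duplicates above reports a different IPEHR value. A minimal sketch for tabulating them, assuming the list is pasted in as plain text in the same "<bug id> IPEHR: <hex>" form (the `parse_reports` helper is hypothetical and not part of any apport tooling):

```python
import re

# Matches lines of the form "<bug id> IPEHR: 0x<hex>", as in the list above.
REPORT_RE = re.compile(r"^(\d+)\s+IPEHR:\s+(0x[0-9a-fA-F]+)$")


def parse_reports(text):
    """Return {bug_id: ipehr_value} for every matching line in text."""
    reports = {}
    for line in text.splitlines():
        m = REPORT_RE.match(line.strip())
        if m:
            reports[int(m.group(1))] = int(m.group(2), 16)
    return reports


REPORTS = """\
768184 IPEHR: 0x01800020
767511 IPEHR: 0x60020100
767425 IPEHR: 0x08000000
757968 IPEHR: 0x14000000
"""

parsed = parse_reports(REPORTS)
distinct = set(parsed.values())
print(f"{len(parsed)} reports, {len(distinct)} distinct IPEHR values")
```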
These i965 reports started coming in shortly after we updated Ubuntu from xserver 1.10.0 to 1.10.1 and mesa from 7.10.1 to 7.10.2, and added patch 25521900d to -intel (bug #35808). (Because the bug is intermittent, I haven't had people try downgrading those packages.)
Crash which required reboot. The crash itself is described in https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/768176 and this is after I persuaded apport-gpu-error-intel.py to run.
My screen went entirely black (both laptop screen and second monitor). Switching to a VC did not show anything on screen. At first I could still hear sounds from running applications, but eventually (after ~10 seconds) they stopped. I had to powercycle the machine to get control back. The "system problem detected" apport dialog offered to let me file a bug, but then I got another crash dialog saying "apport-gpu-error-intel.py closed unexpectedly".
DistroRelease: Ubuntu 11.04
Package: xserver-xorg-video-intel 2:2.14.0-4ubuntu7
ProcVersionSignature: Ubuntu 2.6.38-8.42-generic-pae 126.96.36.199
Uname: Linux 2.6.38-8-generic-pae i686
modes: 1680x1050 1280x1024 1280x1024 1280x960 1152x864 1024x768 1024x768 1024x768 832x624 800x600 800x600 800x600 800x600 640x480 640x480 640x480 640x480 720x400
Date: Thu Apr 21 10:25:20 2011
DistUpgraded: Log time: 2011-01-18 17:25:59.814253
DuplicateSignature: (ESR: 0x00000001 IPEHR: 0x01800020)
Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (primary) [8086:2a02] (rev 0c) (prog-if 00 [VGA controller])
Subsystem: Dell Device [1028:0209]
Subsystem: Dell Device [1028:0209]
MachineType: Dell Inc. XPS M1330
ProcCmdline: python apport-gpu-error-intel.py
ProcKernelCmdLine: root=UUID=b572742c-deea-43ec-92d3-b1d1e6b6802f ro quiet splash
ProcKernelCmdLine_: root=UUID=b572742c-deea-43ec-92d3-b1d1e6b6802f ro quiet splash
Title: [i965gm] GPU lockup (ESR: 0x00000001 IPEHR: 0x01800020)
UpgradeStatus: Upgraded to natty on 2011-01-18 (92 days ago)
UserGroups: adm admin cdrom couchdb dialout dip floppy fuse lpadmin plugdev video
dmi.bios.vendor: Dell Inc.
dmi.board.vendor: Dell Inc.
dmi.chassis.vendor: Dell Inc.
dmi.product.name: XPS M1330
dmi.sys.vendor: Dell Inc.
version.compiz: compiz 1:0.9.4+bzr20110415-0ubuntu2
version.libdrm2: libdrm2 2.4.23-1ubuntu6
version.libgl1-mesa-dri: libgl1-mesa-dri 7.10.2-0ubuntu2
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 7.10.2-0ubuntu2
version.xserver-xorg: xserver-xorg 1:7.6+4ubuntu3
version.xserver-xorg-video-ati: xserver-xorg-video-ati N/A
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.14.0-4ubuntu7
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:0.0.16+git20110107+b795ca6e-0ubuntu7
Created attachment 45980 [details]
Created attachment 45981 [details]
Created attachment 45982 [details]
Here are links to some of the i915_error_state files for the various (suspected dupe) bugs:
Bryce, one aspect that we are wary of with 965G[M] is that the early chipsets had severe issues with memory above 4G. Is the memory configuration captured in the LP reports? The attached dmesg has 4G + PAE; is that common?
One affected 965gm user here (bug report with attachments: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/771655): 4GB of memory but no PAE, i.e. 64-bit. On the other hand, my problem is simply X.org crashing/segfaulting; apport is not triggered for a GPU lockup bug report. So sorry for the (possible) noise, even though my problem clearly comes from the same bunch of changes and is similarly random/rare.
To make up for that, I went through the mentioned lockup bug reports to answer the question: only two have PAE and four don't, but all of the i965gm GPU lockup reports so far seem to be i686, unlike mine.
Created attachment 48039 [details] [review]
Apply the big hammer to finish the fb before disabling it.
Created attachment 48043 [details] [review]
Apply the big hammer to finish the fb before disabling it.
When flushing before disabling, it helps to do it before and not after the disable.
Created attachment 48066 [details]
I think I may have reproduced this same bug on my own i965 finally. Not sure exactly how I did it, but it showed up after a lid open event (resume from sleep I guess). The machine has been plugged into its docking station with external monitor continuously.
Created attachment 48067 [details]
I was hoping to see the contents of the display registers in the error state to confirm the theory about the WAIT_FOR_EVENT being on a disabled pipe. Alas, that feature isn't part of that kernel.
May I also make a polite request that you enable pageflipping once more ;-)
I wonder if we should just be waiting for the VBLANK on a full screen blit rather than a range that is impossible. Hmm.
*** Bug 35576 has been marked as a duplicate of this bug. ***
*** Bug 37450 has been marked as a duplicate of this bug. ***
A bug I reported (Bug 37450) has been marked as a duplicate of this bug, and this bug is marked as NEEDINFO.
Since I can reproduce the bug I reported 100% of the time, please let me know if you would like me to provide any additional info.
Kamil, can you try applying the patch https://bugs.freedesktop.org/attachment.cgi?id=48043 to your kernel and seeing if that is sufficient.
I'm confident that's the fix, just waiting for testing.
I applied the patch to 188.8.131.52 kernel, but it did *not* help. I'm seeing the same problem as before (enabling an output after suspend/resume hangs the server). Do I need to be running a newer kernel perhaps?
Sigh. After applying the patch, can you post an i915_error_state?
$ cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected
That's after a restart of the X server (Ctrl-Alt-Bcksp) so that I can access the machine again; I assume that would not reset i915_error_state?
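The check above can also be scripted. This is a minimal sketch, assuming the debugfs path and the "no error state collected" text are exactly as shown in the shell session above; the `read_error_state` helper is hypothetical, and reading debugfs normally requires root:

```python
from pathlib import Path

# Path and sentinel text are taken from the shell session above.
ERROR_STATE_PATH = Path("/sys/kernel/debug/dri/0/i915_error_state")
NO_ERROR_SENTINEL = "no error state collected"


def read_error_state(path=ERROR_STATE_PATH):
    """Return the error state text, or None if nothing was collected."""
    text = Path(path).read_text(errors="replace")
    if text.strip() == NO_ERROR_SENTINEL:
        return None
    return text


if __name__ == "__main__":
    try:
        state = read_error_state()
    except OSError as exc:
        # Missing debugfs mount or insufficient permissions.
        print(f"cannot read {ERROR_STATE_PATH}: {exc}")
    else:
        if state is None:
            print(NO_ERROR_SENTINEL)
        else:
            print(f"{len(state)} characters of error state captured")
```

Saving the returned text to a file before restarting anything makes it easier to attach to a report.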
The only indication in the logs I can see is in /var/log/Xorg.0.log:
[ 259.306] (WW) intel(0): flip queue failed: Invalid argument
[ 259.306] (WW) intel(0): Page flip failed: Invalid argument
[ 260.299] (WW) intel(0): flip queue failed: Device or resource busy
[ 260.299] (WW) intel(0): Page flip failed: Device or resource busy
[last two lines repeating]
These start occurring after I enable an output using xrandr (after a suspend/resume cycle); Xorg works for a while, but hangs immediately after I switch to a text console and back to X (a required action to actually see something via the new output, as per https://bugzilla.kernel.org/show_bug.cgi?id=24982).
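To see how often each failure repeats, the warnings can be tallied from the log. A minimal sketch, assuming the warnings keep exactly the format quoted above; the `summarize_flip_failures` helper and the sample text are illustrative, not part of the X server or driver:

```python
import re
from collections import Counter

# Matches warnings such as
# "[ 259.306] (WW) intel(0): Page flip failed: Invalid argument".
FLIP_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+\(WW\)\s+intel\(\d+\):\s+"
    r"(?P<what>flip queue failed|Page flip failed):\s+(?P<reason>.+)$"
)


def summarize_flip_failures(log_text):
    """Count (warning, reason) pairs among the page-flip warnings in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = FLIP_RE.match(line.strip())
        if m:
            counts[(m.group("what"), m.group("reason"))] += 1
    return counts


SAMPLE = """\
[ 259.306] (WW) intel(0): flip queue failed: Invalid argument
[ 259.306] (WW) intel(0): Page flip failed: Invalid argument
[ 260.299] (WW) intel(0): flip queue failed: Device or resource busy
[ 260.299] (WW) intel(0): Page flip failed: Device or resource busy
"""

for (what, reason), n in summarize_flip_failures(SAMPLE).items():
    print(f"{n:4d}  {what}: {reason}")
```

For a real log, pass the contents of /var/log/Xorg.0.log instead of `SAMPLE`.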
A workaround that works for me is to modify xf86-video-intel to force intel->use_pageflipping to FALSE. I believe there used to be a user-accessible option to turn it off, but it's been removed? That is rather unfortunate, I must say.
I was just about to add that you hit kernel bug # 24982...
So we can't tell whether the GPU lockup itself has been fixed if the second bug prevents you from testing.
Are you saying that *this* bug is probably fixed, but X still hangs because of the (unrelated) DPMS bug in the kernel? That could be, as I no longer see the GPU hung messages.
Well, I guess all I can do at this point is sit and wait for that kernel bug to be fixed, hopefully some time soon; it's been open since last year... I'd be happy to try any patches you guys might have.
Ok, to be really complicated, can you please retest this patch on top of keithp/drm-intel-fixes [git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6.git]? Hopefully we have the modeswitching bug fixed and so we can then successfully test the WAIT_FOR_EVENT fix...
Chris, drm-intel-fixes (last commit cda2bb78c24de7674eafa3210314dc75bed344a6) does *not* fix the modeswitching bug for me. I guess no point in retesting your patch then?
The patch should prevent the GPU hang upon turning off a pipe, but it is a nuisance: if the machine is dying for another reason, we can't be sure that the patch is sufficient.
Does this still happen with the latest versions of the drivers, or is it no longer an issue?
Yes, the patch is still required, just no one has volunteered to test it.
Well, I would've loved to test it, but I just tried kernel 3.1-rc10 and with vanilla xf86-video-intel 2.16.0 the kernel still crashes for me on enabling an output via xrandr. I assume it's due to the infamous kernel bug 24982, which has probably been open for a year now with no resolution in sight, though with kernel bugzilla apparently still being down (pathetic), it's hard to tell.
For what it's worth, with your patch applied, the kernel seems to crash less easily for me than without it.
(In reply to comment #27)
> Well, I would've loved to test it, but I just tried kernel 3.1-rc10 and with
> vanilla xf86-video-intel 2.16.0 the kernel still crashes for me on enabling an
> output via xrandr. I assume it's due to the infamous kernel bug 24982, which
> has probably been open for a year now with no resolution in sight, though with
> kernel bugzilla apparently still being down (pathetic), it's hard to tell.
bugzilla.kernel.org and that I'm currently unaware of any crash inside i915.ko, so you're going to have to remind me...
(In reply to comment #28)
> I'm currently unaware of any crash inside i915.ko,
> so you're going to have to remind me...
Chris, please see comment #19 in this bugzilla entry, or, for a complete description, see bug #37450. In essence, it seems that stale DPMS properties (kernel bug 24982), which normally just result in a blank screen, can in some situations result in a crash/hang. When I originally reported it I could only trigger it after suspend/resume; nowadays I can reproduce it just by repeatedly enabling and disabling an output a few times. The only workaround that works for me is modifying the xf86-video-intel driver to force page flipping off.
(In reply to comment #29)
> (In reply to comment #28)
> > I'm currently unaware of any crash inside i915.ko,
> > so you're going to have to remind me...
> Chris, please see comment #19 in this bugzilla entry, or, for a complete
> description, see bug #37450. In essence, it seems that stale DPMS properties
> (kernel bug 24982), which normally just result in a blank screen, can in some
> situations result in a crash/hang. When I originally reported it I could only
> trigger it after suspend/resume; nowadays I can reproduce it just by repeatedly
> enabling and disabling an output a few times. The only workaround that works
> for me is modifying the xf86-video-intel driver to force page flipping off.
Ok, I think we know that bug and had a fix for the races inside the page-flipping code, but I think Keith dropped them on the floor...
*** Bug 40526 has been marked as a duplicate of this bug. ***
*** Bug 40527 has been marked as a duplicate of this bug. ***
All our 4 duplicates were high/major. Adjusting.
*** Bug 45000 has been marked as a duplicate of this bug. ***
Per a recent request, I've built an Ubuntu test kernel based on the latest drm-intel-fixes branch (i.e. git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux.git drm-intel-fixes) and applied the patch noted in comment 16. If you could please test and post your results, that would be great. Thanks in advance.
*** Bug 40052 has been marked as a duplicate of this bug. ***
*** Bug 48518 has been marked as a duplicate of this bug. ***
I'm the original reporter of bug 48518.
This is the description of my problem (on Ubuntu 12.04):
I was working on an external monitor with my notebook.
I changed the monitor settings to turn off the external monitor and turn on the notebook integrated monitor.
When I applied the changes something went wrong: I couldn't see anything. Then I replugged the VGA cable for the external monitor and something appeared again, but everything was behaving strangely and not as it should.
Unfortunately I was working and I had to restart the computer; anyway, nearly nothing worked: it was just a miracle that I managed to report the bug. I couldn't even open a terminal, and I don't think I would have managed to take a screenshot. Nor did I have a camera at hand to take a photo.
I'll do my best to describe what I saw:
- no launcher appearing when approaching the left screen side
- no panel (the top bar)
- all the windows (which were already open) overlapping and cut
- clicking on a window didn't bring it to the foreground, but I was able to interact with the application program (write, click buttons)
- no keyboard shortcut worked (at least not visibly); I tried using these: CTRL-ALT-T (to open a Terminal), Windows button (to open the Dash), ALT-F4 (to close windows)
- I was able to drag the windows
I didn't think about this when it happened: maybe even if I was "blind" everything was working and I could take a screenshot. Anyway it's too late and I'm still not sure I could.
For further details please refer to "https://bugs.freedesktop.org/show_bug.cgi?id=48518" and the original Ubuntu bug report.
If somehow I can help you, please let me know.
I tried to reproduce the bug but without success: everything went right.
devtry1, don't worry, you're not the only one with this problem. Given your extended description of the issue in comment #39, I would say Chris was right to merge these two bug reports. I have sporadically seen what appears to be the same problem, only with KDE. The GPU will hang, the timer will time out and usually manage to reset the chip. In the process, however, the compositing window manager dies, so you no longer have control over window positions, desktop shortcuts, etc. For me the solution is generally to reboot, because even if I can successfully restart the X server, the GPU is in an ill-defined state so 3D acceleration doesn't work properly. I have not been able to find a reliable way of reproducing it; sometimes it "just happens", luckily not very frequently.
The first of the fixes has landed:
Author: Chris Wilson <firstname.lastname@example.org>
Date: Tue Apr 3 17:58:35 2012 +0100
drm/i915: Finish any pending operations on the framebuffer before disabling
Similar to the case where we are changing from one framebuffer to
another, we need to be sure that there are no pending WAIT_FOR_EVENTs on
the pipe for the current framebuffer before switching. If we disable the
pipe, and then try to execute a WAIT_FOR_EVENT it will block
indefinitely and cause a GPU hang.
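
The ordering problem described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyPipe`, `ToyRing` and `disable_pipe` are hypothetical names invented for illustration and do not correspond to i915 code; a pending WAIT_FOR_EVENT is modelled as an entry that can only complete while the pipe is enabled.

```python
from collections import deque


class ToyPipe:
    """Toy display pipe: it only generates events while enabled."""

    def __init__(self):
        self.enabled = True


class ToyRing:
    """Toy command queue holding WAIT_FOR_EVENT entries for one pipe."""

    def __init__(self, pipe):
        self.pipe = pipe
        self.pending = deque()

    def queue_wait_for_event(self):
        self.pending.append("WAIT_FOR_EVENT")

    def run(self):
        """Execute queued commands; return False if one would block forever."""
        while self.pending:
            if not self.pipe.enabled:
                return False  # the event never arrives: modelled GPU hang
            self.pending.popleft()
        return True


def disable_pipe(ring, finish_first):
    """Disable the pipe, optionally finishing pending waits beforehand."""
    if finish_first:
        ring.run()  # drain pending waits while events can still arrive
    ring.pipe.enabled = False
    return ring.run()  # anything left now runs against a disabled pipe


for finish_first in (False, True):
    ring = ToyRing(ToyPipe())
    ring.queue_wait_for_event()
    ok = disable_pipe(ring, finish_first)
    print(f"finish_first={finish_first}: {'ok' if ok else 'hang'}")
```

In this model only the order matters: draining the queue before clearing `enabled` leaves nothing that can block afterwards.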