Bug 39947

Summary: [945gm] Display B: Invalid GTT PTE (enable plane too early?)
Product: DRI
Reporter: Bryce Harrington <bryce>
Component: DRM/Intel
Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: ben, chris, daniel, eugeni, florian, freedesktop-bugzilla, jbarnes, przanoni
Version: unspecified
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description            Flags
BootDmesg.txt          none
CurrentDmesg.txt       none
XorgLog.txt            none
i915_error_state.txt   none

Description Bryce Harrington 2011-08-08 19:11:45 UTC
For several releases now we've been seeing "fake" GPU lockups flagged by the Intel driver.  The user's system (typically) does not lock up, but it is enough to trigger the apport crash handler, which displays a "GPU lockup" dialog to the user and prompts them to file a bug report.

The main problem is that this makes it hard to distinguish 'real' GPU lockups from these fake ones and to prioritize them.  I'd like to either figure out what is causing the fake GPU lockups and solve it, or identify a reliable way of detecting that a lockup is fake and fix our crash detector to ignore those.

Below is an example of one of these types of bugs, forwarded from:
  https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/815798

ProblemType: Crash
DistroRelease: Ubuntu 11.10
Package: xserver-xorg-video-intel 2:2.15.0-3ubuntu2
ProcVersionSignature: Ubuntu 3.0.0-6.7-generic 3.0.0-rc7
Uname: Linux 3.0.0-6-generic i686
Architecture: i386
BootLog:
 fsck from util-linux 2.19.1
 fsck from util-linux 2.19.1
 /dev/sda2: clean, 324507/655360 files, 1668623/2621184 blocks
 Linux_Home: clean, 79489/6545408 files, 12735361/26159616 blocks
 Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox
Chipset: i915gm
CompizPlugins: No value set for `/apps/compiz-1/general/screen0/options/active_plugins'
CompositorRunning: compiz
Date: Mon Jul 25 01:19:49 2011
DistUpgraded: Log time: 2011-07-22 00:26:16.817036
DistroCodename: oneiric
DistroVariant: ubuntu
DkmsStatus: virtualbox, 4.0.10, 3.0.0-6-generic, i686: installed
DuplicateSignature: [i915gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100 render.IPEHR: 0x02000004 Ubuntu 11.10
ExecutablePath: /usr/share/apport/apport-gpu-error-intel.py
GraphicsCard:
 Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller [8086:2592] (rev 04) (prog-if 00 [VGA controller])
   Subsystem: Uniwill Computer Corp Device [1584:9800]
   Subsystem: Uniwill Computer Corp Device [1584:9800]
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release i386 (20101007)
InterpreterPath: /usr/bin/python2.7
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: ALIENWARE 255/259 Series
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdline: /usr/bin/python /usr/share/apport/apport-gpu-error-intel.py
ProcEnviron:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.0.0-6-generic root=UUID=2a79b732-f48c-4ead-ac45-09b92b7ffee7 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 xserver-xorg 1:7.6+7ubuntu6
 libdrm2 2.4.26-1ubuntu1
 xserver-xorg-video-intel 2:2.15.0-3ubuntu2
SourcePackage: xserver-xorg-video-intel
Title: [i915gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100 render.IPEHR: 0x02000004
UpgradeStatus: Upgraded to oneiric on 2011-07-22 (3 days ago)
UserGroups:

dmi.bios.date: 04/21/2006
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.03W
dmi.board.name: 255/259 Series
dmi.board.vendor: ALIENWARE
dmi.chassis.type: 10
dmi.chassis.vendor: American Megatrends Inc
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.03W:bd04/21/2006:svnALIENWARE:pn255/259Series:pvr:rvnALIENWARE:rn255/259Series:rvr:cvnAmericanMegatrendsInc:ct10:cvr:
dmi.product.name: 255/259 Series
dmi.sys.vendor: ALIENWARE
version.compiz: compiz 1:0.9.5.0-0ubuntu1
version.libdrm2: libdrm2 2.4.26-1ubuntu1
version.libgl1-mesa-dri: libgl1-mesa-dri 7.11~1-0ubuntu4
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 7.11~1-0ubuntu4
version.xserver-xorg: xserver-xorg 1:7.6+7ubuntu6
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.6.0-1ubuntu13
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:6.14.2-1ubuntu2
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.15.0-3ubuntu2
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:0.0.16+git20110411+8378443-1
Comment 1 Bryce Harrington 2011-08-08 19:13:25 UTC
Created attachment 50048 [details]
BootDmesg.txt
Comment 2 Bryce Harrington 2011-08-08 19:13:44 UTC
Created attachment 50049 [details]
CurrentDmesg.txt
Comment 3 Bryce Harrington 2011-08-08 19:14:00 UTC
Created attachment 50050 [details]
XorgLog.txt
Comment 4 Bryce Harrington 2011-08-08 19:14:23 UTC
Created attachment 50051 [details]
i915_error_state.txt
Comment 5 Chris Wilson 2011-08-09 01:30:43 UTC
They are still bugs, in some ways much more frightening than performing an undefined operation - the chip has detected that we are accessing invalid memory. Who knows what illegal accesses we did before the invalid access!

The trick to determine if the GPU is truly wedged would be to cat /sys/kernel/debug/dri/0/i915_wedged (or you can try issuing a throttle command and look for an EIO error code).
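
A minimal sketch of how a crash handler might apply the check above, reading the debugfs file named in this comment. The content format is an assumption: the file is taken to hold an integer, possibly after a "wedged :" label, with non-zero meaning the GPU is wedged. The helper names are hypothetical, and reading debugfs normally requires root.

```python
import re
from pathlib import Path
from typing import Optional

# Path quoted in comment 5; card index 0 is an assumption for other systems.
WEDGED_PATH = Path("/sys/kernel/debug/dri/0/i915_wedged")


def parse_wedged(text: str) -> Optional[bool]:
    """Interpret the file contents (assumed format: an integer, optionally
    preceded by a label such as 'wedged :'). Returns None if no number found."""
    numbers = re.findall(r"\d+", text)
    if not numbers:
        return None
    return int(numbers[-1]) != 0


def read_wedged(path: Path = WEDGED_PATH) -> Optional[bool]:
    """Return True/False for the wedged state, or None if the file is
    missing, unreadable (e.g. not root), or not in the assumed format."""
    try:
        return parse_wedged(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    state = read_wedged()
    if state is None:
        print("wedged state unknown (debugfs not mounted or not readable)")
    else:
        print("GPU wedged" if state else "GPU not wedged")
```

A handler could treat an unknown result as "cannot tell" rather than as a fake lockup, since the error state still indicates an invalid memory access.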
Comment 6 Bryce Harrington 2011-08-18 14:19:17 UTC
From what I've seen, most of the false gpu hang reports have a hang which occurs late during boot, basically right at the point that the drm driver is loaded.  Could the issue be that some memory is not being initialized, or a race condition in initialization?

Do you have an idea if this problem is unique to Ubuntu?  I'm wondering if it boils down to some boot optimization we did ourselves, or if it is a legitimate bug in the driver?
Comment 7 Eugeni Dodonov 2011-08-22 12:12:12 UTC
(In reply to comment #6)
> Do you have an idea if this problem is unique to Ubuntu?  I'm wondering if it
> boils down to some boot optimization we did ourselves, or if it is a legitimate
> bug in the driver?

Don't know if it will help, but I haven't seen such issues in Mandriva/Mageia while maintaining their mesa/X/init stacks. At the same time, we have seen similar issues when booting Ubuntu on the same hardware for reference. Don't know if it is a coincidence (as compile flags, versions and so on do not always match), but Ubuntu was the only one to show this. But I admit that I could be wrong, and I certainly haven't tested it in depth.
Comment 8 Eugeni Dodonov 2011-09-22 16:30:50 UTC
I lowered the priority a bit to have it in the same priority scale as other false GPU lockups.
Comment 9 Chris Wilson 2012-04-16 05:30:16 UTC
So Display B is unbound but enabled... Big time modesetting screwup.
Comment 10 Jesse Barnes 2012-04-16 14:37:48 UTC
Shouldn't the sanitize function have disabled the planes? If so, this should be fixed, right?
Comment 11 Bryce Harrington 2012-04-16 17:27:22 UTC
Well, we're still seeing false lockups, although not exactly the same set of error codes as this bug.

[i915gm] False GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000010 render.IPEHR: 0x01000000
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/981171

[IGDgm] False GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00010000 render.IPEHR: 0x01000000
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/978968
(+4 dupes)

[i965gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/982021

[gm45] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00100000
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/981297

The latter two sound like actual misbehavior occurred.


Would you prefer I file new upstream reports on each of these, or do they seem like the same issue?
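
As a rough aid for comparing reports like the ones listed above, the register values can be pulled out of the bug titles and grouped. This is only a sketch: the title layout is inferred from the examples in this thread, the function names are hypothetical, and grouping by PGTBL_ER is one possible choice, not an established triage rule.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Patterns inferred from the titles quoted in this thread (an assumption).
CHIPSET_RE = re.compile(r"^\[(\w+)\]")
REGISTER_RE = re.compile(r"([A-Za-z_.]+):\s*(0x[0-9A-Fa-f]+)")


def parse_title(title: str) -> Tuple[str, Dict[str, int]]:
    """Split a title into its chipset tag and a register-name -> value map."""
    match = CHIPSET_RE.match(title.strip())
    chipset = match.group(1) if match else ""
    registers = {name: int(value, 16)
                 for name, value in REGISTER_RE.findall(title)}
    return chipset, registers


def group_by_pgtbl_er(titles: List[str]) -> Dict[Optional[int], List[str]]:
    """Group titles by their PGTBL_ER value (None if the field is absent)."""
    groups: Dict[Optional[int], List[str]] = defaultdict(list)
    for title in titles:
        _, registers = parse_title(title)
        groups[registers.get("PGTBL_ER")].append(title)
    return dict(groups)


if __name__ == "__main__":
    titles = [
        "[i915gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100 render.IPEHR: 0x02000004",
        "[i965gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100",
        "[gm45] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00100000",
    ]
    for value, members in group_by_pgtbl_er(titles).items():
        label = "none" if value is None else f"0x{value:08x}"
        print(label, len(members))
```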
Comment 12 Bryce Harrington 2012-04-16 17:30:55 UTC
Btw, for comparison, there were 142 bugs collected last cycle as dupes of this:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/828684
Comment 13 Chris Wilson 2012-04-17 01:08:10 UTC
My old favourite:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index e0e8cb5..7978e41 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -5846,7 +5846,6 @@ static int i9xx_crtc_mode_set(struct drm_crtc *crtc,
 
        I915_WRITE(DSPCNTR(plane), dspcntr);
        POSTING_READ(DSPCNTR(plane));
-       intel_enable_plane(dev_priv, plane, pipe);
 
        ret = intel_pipe_set_base(crtc, x, y, old_fb);
Comment 14 Chris Wilson 2012-04-25 02:28:10 UTC
These bugs all have similar symptoms that could be explained and fixed by the following patch. So please do test drm-intel-next-queued and report back. When I tried the equivalent patch in the past, it caused a modesetting regression for the initial switch from the BIOS configuration, so do look out for any glitches during boot. Thanks.

commit 969d380a39d33f7533b6dcee35e834109d23f9e9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 24 16:36:50 2012 +0100

    drm/i915: Remove too early plane enable on pre-PCH hardware
    
    Enabling the plane before we have assigned valid address means that it
    will access random PTE (often with conflicting memory types) and cause
    GPU lockups. However, enabling the plane too early appears to workaround
    a number of bugs in our modesetting code.
    
    Cc: Franz Melchior <melchior.franz@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=39947
    References: https://bugs.freedesktop.org/show_bug.cgi?id=41091
    References: https://bugs.freedesktop.org/show_bug.cgi?id=49041
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 15 Florian Mickler 2012-07-01 03:45:52 UTC
A patch referencing this bug report has been merged in Linux v3.5-rc1:

commit c7bd4c25650704d4d065eb4ce2a122d2a80ce804
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 24 16:36:50 2012 +0100

    drm/i915: Remove too early plane enable on pre-PCH hardware
