For several releases now we've been seeing "fake" GPU lockups flagged by the Intel driver. The user's system (typically) does not lock up, but the event is enough to trigger the apport crash handler, which displays a "GPU lockup" dialog to the user and prompts them to file a bug report.

The main problem is that this makes it hard to distinguish 'real' GPU lockups from these fake ones, and so to prioritize them. I'd like to either figure out what is causing the fake GPU lockups and solve it, or identify a good reliable way of detecting that it's a fake GPU lockup and fix our crash detector to ignore them.

Below is an example of one of these types of bugs, forwarded from:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/815798

ProblemType: Crash
DistroRelease: Ubuntu 11.10
Package: xserver-xorg-video-intel 2:2.15.0-3ubuntu2
ProcVersionSignature: Ubuntu 3.0.0-6.7-generic 3.0.0-rc7
Uname: Linux 3.0.0-6-generic i686
Architecture: i386
BootLog:
 fsck from util-linux 2.19.1
 fsck from util-linux 2.19.1
 /dev/sda2: clean, 324507/655360 files, 1668623/2621184 blocks
 Linux_Home: clean, 79489/6545408 files, 12735361/26159616 blocks
 Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox
Chipset: i915gm
CompizPlugins: No value set for `/apps/compiz-1/general/screen0/options/active_plugins'
CompositorRunning: compiz
Date: Mon Jul 25 01:19:49 2011
DistUpgraded: Log time: 2011-07-22 00:26:16.817036
DistroCodename: oneiric
DistroVariant: ubuntu
DkmsStatus: virtualbox, 4.0.10, 3.0.0-6-generic, i686: installed
DuplicateSignature: [i915gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100 render.IPEHR: 0x02000004 Ubuntu 11.10
ExecutablePath: /usr/share/apport/apport-gpu-error-intel.py
GraphicsCard:
 Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller [8086:2592] (rev 04) (prog-if 00 [VGA controller])
   Subsystem: Uniwill Computer Corp Device [1584:9800]
   Subsystem: Uniwill Computer Corp Device [1584:9800]
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release i386 (20101007)
InterpreterPath: /usr/bin/python2.7
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: ALIENWARE 255/259 Series
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdline: /usr/bin/python /usr/share/apport/apport-gpu-error-intel.py
ProcEnviron:
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.0.0-6-generic root=UUID=2a79b732-f48c-4ead-ac45-09b92b7ffee7 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 xserver-xorg 1:7.6+7ubuntu6
 libdrm2 2.4.26-1ubuntu1
 xserver-xorg-video-intel 2:2.15.0-3ubuntu2
SourcePackage: xserver-xorg-video-intel
Title: [i915gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100 render.IPEHR: 0x02000004
UpgradeStatus: Upgraded to oneiric on 2011-07-22 (3 days ago)
UserGroups:
dmi.bios.date: 04/21/2006
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.03W
dmi.board.name: 255/259 Series
dmi.board.vendor: ALIENWARE
dmi.chassis.type: 10
dmi.chassis.vendor: American Megatrends Inc
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.03W:bd04/21/2006:svnALIENWARE:pn255/259Series:pvr:rvnALIENWARE:rn255/259Series:rvr:cvnAmericanMegatrendsInc:ct10:cvr:
dmi.product.name: 255/259 Series
dmi.sys.vendor: ALIENWARE
version.compiz: compiz 1:0.9.5.0-0ubuntu1
version.libdrm2: libdrm2 2.4.26-1ubuntu1
version.libgl1-mesa-dri: libgl1-mesa-dri 7.11~1-0ubuntu4
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 7.11~1-0ubuntu4
version.xserver-xorg: xserver-xorg 1:7.6+7ubuntu6
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.6.0-1ubuntu13
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:6.14.2-1ubuntu2
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.15.0-3ubuntu2
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:0.0.16+git20110411+8378443-1
Created attachment 50048: BootDmesg.txt
Created attachment 50049: CurrentDmesg.txt
Created attachment 50050: XorgLog.txt
Created attachment 50051: i915_error_state.txt
They are still bugs, in some ways much more frightening than performing an undefined operation - the chip has detected that we are accessing invalid memory. Who knows what illegal accesses we did before the invalid access! The trick to determine if the GPU is truly wedged would be to cat /sys/kernel/debug/dri/0/i915_wedged (or you can try issuing a throttle command and look for an EIO error code).
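The wedged check suggested here could be scripted in a crash hook roughly as follows. This is only a minimal sketch, not the actual apport hook: the function names are hypothetical, the card index 0 is taken from the path quoted above, and it assumes the debugfs file reports the wedged state as a decimal integer somewhere in its text (the exact format differs between kernel versions), with nonzero meaning wedged. Reading debugfs normally requires root.

```python
import re

# Path quoted in the comment above; card index 0 is an assumption.
WEDGED_PATH = "/sys/kernel/debug/dri/0/i915_wedged"


def parse_wedged(text):
    """Return True/False for wedged/not wedged, or None if unparseable.

    Assumption: the file holds a decimal integer, possibly after a label,
    and a nonzero value means the GPU is wedged.
    """
    numbers = re.findall(r"-?\d+", text)
    if not numbers:
        return None
    # Use the last integer so a leading label does not confuse the parse.
    return int(numbers[-1]) != 0


def gpu_is_wedged(path=WEDGED_PATH):
    """Read the debugfs file; return None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return parse_wedged(f.read())
    except OSError:
        return None
```

A hook built this way might skip the "GPU lockup" dialog when `gpu_is_wedged()` returns `False`, and fall back to its current behaviour when the result is `None`, since an unreadable file says nothing either way.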
From what I've seen, most of the false GPU hang reports involve a hang that occurs late during boot, right at the point where the DRM driver is loaded. Could the issue be that some memory is not being initialized, or a race condition in initialization? Do you have an idea if this problem is unique to Ubuntu? I'm wondering if it boils down to some boot optimization we did ourselves, or if it is a legitimate bug in the driver?
(In reply to comment #6)
> Do you have an idea if this problem is unique to Ubuntu? I'm wondering if it
> boils down to some boot optimization we did ourselves, or if it is a legitimate
> bug in the driver?

Don't know if it will help, but I haven't seen such issues in Mandriva/Mageia while maintaining their mesa/X/init stacks. At the same time, for reference, we have seen similar issues when booting Ubuntu on the same hardware. Don't know if it is a coincidence (as compile flags, versions and so on do not always match), but Ubuntu was the only one to show this. But I admit that I could be wrong, and I certainly haven't tested it in depth.
I lowered the priority a bit to have it in the same priority scale as other false GPU lockups.
So Display B is unbound but enabled... Big time modesetting screwup.
Shouldn't the sanitize function have disabled the planes? If so, this should be fixed, right?
Well, we're still seeing false lockups, although not exactly the same set of error codes as this bug.

[i915gm] False GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000010 render.IPEHR: 0x01000000
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/981171

[IGDgm] False GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00010000 render.IPEHR: 0x01000000
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/978968 (+4 dupes)

[i965gm] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00000100
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/982021

[gm45] GPU lockup EIR: 0x00000010 PGTBL_ER: 0x00100000
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/981297

The latter two look like cases where actual misbehavior happened. Would you prefer I file new upstream reports on each of these, or do they seem like the same issue?
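For comparing signatures like the ones listed above, a small parser can pull the chipset and register values out of a report title. This is a hedged sketch, not part of any existing tool: the function names are hypothetical, and the constant assumes that bit 4 of EIR flags a page table error on these chips (I915_ERROR_PAGE_TABLE in the kernel's i915_reg.h). It only groups reports by error code; it does not decide on its own whether a lockup is real or false.

```python
import re

# Matches "NAME: 0xHEX" pairs, e.g. "EIR: 0x00000010" or "render.IPEHR: 0x02000004".
_REGISTER = re.compile(r"([\w.]+):\s*0x([0-9A-Fa-f]+)")
# Matches the leading "[chipset]" tag, e.g. "[i915gm]".
_CHIPSET = re.compile(r"^\[([^\]]+)\]")

# Assumption: EIR bit 4 indicates a page table error on these chips.
EIR_PAGE_TABLE_ERROR = 1 << 4


def parse_lockup_title(title):
    """Split a lockup title into its chipset tag and register values."""
    chipset = _CHIPSET.match(title.strip())
    registers = {name: int(value, 16)
                 for name, value in _REGISTER.findall(title)}
    return {"chipset": chipset.group(1) if chipset else None,
            "registers": registers}


def has_page_table_error(title):
    """True if the title's EIR value has the assumed page-table-error bit set."""
    eir = parse_lockup_title(title)["registers"].get("EIR", 0)
    return bool(eir & EIR_PAGE_TABLE_ERROR)
```

Keying duplicates on the pair (chipset, PGTBL_ER) rather than on the full title string would, for example, group reports that differ only in render.IPEHR.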
Btw, for comparison, there were 142 bugs collected last cycle as dupes of this: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/828684
My old favourite:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index e0e8cb5..7978e41 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -5846,7 +5846,6 @@ static int i9xx_crtc_mode_set(struct drm_crtc *crtc,
        I915_WRITE(DSPCNTR(plane), dspcntr);
        POSTING_READ(DSPCNTR(plane));
-       intel_enable_plane(dev_priv, plane, pipe);
        ret = intel_pipe_set_base(crtc, x, y, old_fb);
These bugs all have similar symptoms that could be explained and fixed by the following patch. So please do test drm-intel-next-queued and report back. When the equivalent patch was tried in the past, it caused modesetting regressions for the initial switch from the BIOS configuration, so do look out for any glitches during boot. Thanks.

commit 969d380a39d33f7533b6dcee35e834109d23f9e9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 24 16:36:50 2012 +0100

    drm/i915: Remove too early plane enable on pre-PCH hardware

    Enabling the plane before we have assigned valid address means that it
    will access random PTE (often with conflicting memory types) and cause
    GPU lockups.

    However, enabling the plane too early appears to workaround a number of
    bugs in our modesetting code.

    Cc: Franz Melchior <melchior.franz@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=39947
    References: https://bugs.freedesktop.org/show_bug.cgi?id=41091
    References: https://bugs.freedesktop.org/show_bug.cgi?id=49041
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
A patch referencing this bug report has been merged in Linux v3.5-rc1:

commit c7bd4c25650704d4d065eb4ce2a122d2a80ce804
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 24 16:36:50 2012 +0100

    drm/i915: Remove too early plane enable on pre-PCH hardware