Bug 87780

Summary: [GM45] relocation error in 3.14 -- causes hang with -intel-2.99.917
Product: DRI
Reporter: andreas.sturmlechner
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
20141228-0203_3.4.105-gentoo-stop_i915errdecode-ON.log.tar.xz (flags: none)
20141228-0045_3.4.105-gentoo-stop_dmesg-ON.log (flags: none)

Description andreas.sturmlechner 2014-12-28 01:38:01 UTC
Created attachment 111414 [details]
20141228-0203_3.4.105-gentoo-stop_i915errdecode-ON.log.tar.xz

For unrelated reasons I am bound to kernel 3.4.105, which has worked without any trouble with xf86-video-intel versions up to 2.99.916.

Upgrading xf86-video-intel to 2.99.917 breaks this setup; however, it works in combination with the latest linux-3.19-rc1+ (which I'm currently testing to see whether it fixes the reason I'm stuck with 3.4.x).

All tests used X.Org X Server 1.16.3; the regression is reproducible:

[   19.540530] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[   19.541519] render error detected, EIR: 0x00000010
[   19.541519]   IPEIR: 0x00000000
[   19.541519]   IPEHR: 0x00000000
[   19.541519]   INSTDONE: 0xffffffff
[   19.541519]   INSTPS: 0x4001e020
[   19.541519]   INSTDONE1: 0xbfffffff
[   19.541519]   ACTHD: 0x7ffff000
[   19.541519] page table error
[   19.541519]   PGTBL_ER: 0x00100000
[   19.541519] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
[   25.532145] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[   25.532154] render error detected, EIR: 0x00000010
[   25.532158]   IPEIR: 0x00000000
[   25.532162]   IPEHR: 0x00000000
[   25.532165]   INSTDONE: 0xffffffff
[   25.532169]   INSTPS: 0x4001e020
[   25.532172]   INSTDONE1: 0xbfffffff
[   25.532175]   ACTHD: 0x7ffff000
[   25.532179] page table error
[   25.532182]   PGTBL_ER: 0x00100000
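
The register dump above can be read mechanically. The following is a minimal Python sketch, not part of the original report. It assumes that bit 4 of EIR is the page-table-error flag, which is inferred only from the log itself (EIR 0x00000010 is followed by "page table error"). The names parse_registers and has_page_table_error are hypothetical helpers.

```python
import re

# Assumption: bit 4 of EIR flags a page table error. This is inferred from the
# log above, where EIR 0x00000010 is followed by "page table error".
EIR_PAGE_TABLE_ERROR = 1 << 4

# Matches upper-case register names followed by a hex value, e.g. "ACTHD: 0x7ffff000".
REGISTER_LINE = re.compile(r"\b([A-Z][A-Z0-9_]*):\s+(0x[0-9a-fA-F]+)")

SAMPLE = """\
[   19.541519] render error detected, EIR: 0x00000010
[   19.541519]   IPEIR: 0x00000000
[   19.541519]   IPEHR: 0x00000000
[   19.541519]   INSTDONE: 0xffffffff
[   19.541519]   INSTPS: 0x4001e020
[   19.541519]   INSTDONE1: 0xbfffffff
[   19.541519]   ACTHD: 0x7ffff000
[   19.541519] page table error
[   19.541519]   PGTBL_ER: 0x00100000
"""


def parse_registers(dmesg_text):
    """Return one dict of register values per 'render error detected' event."""
    events = []
    for line in dmesg_text.splitlines():
        if "render error detected" in line:
            events.append({})  # a new error event starts here
        match = REGISTER_LINE.search(line)
        if match and events:
            events[-1][match.group(1)] = int(match.group(2), 16)
    return events


def has_page_table_error(event):
    """True if the assumed page-table-error bit is set in the event's EIR."""
    return bool(event.get("EIR", 0) & EIR_PAGE_TABLE_ERROR)


if __name__ == "__main__":
    for event in parse_registers(SAMPLE):
        print({name: hex(value) for name, value in event.items()},
              "page table error:", has_page_table_error(event))
```

Feeding the full dmesg attachment to parse_registers would yield one entry per error event, which makes it easy to see that both events in the log carry identical register values.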
Comment 1 andreas.sturmlechner 2014-12-28 01:39:48 UTC
Created attachment 111415 [details]
20141228-0045_3.4.105-gentoo-stop_dmesg-ON.log
Comment 2 Chris Wilson 2014-12-28 09:31:01 UTC
Yes, it's a bug in the kernel relocation routines.
Comment 3 andreas.sturmlechner 2015-01-10 21:51:15 UTC
In case there is a kernel fix, just how big are my chances to get it backported to 3.4? ;)
Comment 4 Rodrigo Vivi 2015-01-12 21:18:55 UTC
Does it happen with latest drm-intel-nightly branch from cgit.freedesktop.org/drm-intel?

If not, a bisect could lead you to the fix commit. If it does, it is still an upstream issue.
Comment 5 andreas.sturmlechner 2015-01-12 21:29:45 UTC
As said, it works with 3.19-rc1+, so actually I would need to find out at which point in the past it was fixed. If there was no prominent fix you could point me to from memory, I will start going back the last few majors. I know why I keep my .configs around...
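
Finding the point where a bug was fixed is a binary search over an ordered history, with "fixed" taking the role that "bad" has in an ordinary bisect. The following is a minimal Python sketch of that search, not a tool used in this report. The version list and the is_fixed predicate are hypothetical; in practice the predicate would mean building and booting a kernel and checking for the hang.

```python
def first_fixed(versions, is_fixed):
    """Return the first entry of an ordered list for which is_fixed() is True.

    Assumes every entry before the fix hangs, every entry from the fix onward
    works, and the last entry is already known to work.
    """
    lo, hi = 0, len(versions) - 1  # versions[hi] is known to be fixed
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(versions[mid]):
            hi = mid        # the fix is at mid or earlier
        else:
            lo = mid + 1    # the fix is after mid
    return versions[lo]


if __name__ == "__main__":
    # Major releases from 3.4 to 3.19 as (major, minor) tuples.
    majors = [(3, minor) for minor in range(4, 20)]

    # Hypothetical predicate: pretend the fix landed in 3.12. The real answer
    # is unknown at this point in the thread.
    def pretend_fixed(version):
        return version >= (3, 12)

    print(first_fixed(majors, pretend_fixed))
```

With 16 candidate releases the search needs at most four test boots, which is why going back through the majors first is cheaper than bisecting individual commits across the whole range.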
Comment 6 Jesse Barnes 2015-03-20 22:03:48 UTC
I guess Chris might know offhand; otherwise I guess you get to do the long, painful bisect. :/
Comment 7 Chris Wilson 2015-03-20 22:10:30 UTC
Oh. I think I know what it might actually have been: 

commit 983d308cb8f602d1920a8c40196eb2ab6cc07bd2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 26 10:47:10 2015 +0000

    agp/intel: Serialise after GTT updates

That could explain a few of these similar bugs.
Comment 8 andreas.sturmlechner 2015-03-21 14:00:01 UTC
Some things I can tell now:

Kernel versions that hang with >=xf86-video-intel-2.99.917: 3.4.106, 3.10.53
What works: 3.14.33, 3.17.4

Trying to apply the patch on top of 3.14.33 (only to check for backportability) breaks the build, and the same happens for 3.4.106:

drivers/char/agp/intel-gtt.c: In function ‘i810_write_entry’:
drivers/char/agp/intel-gtt.c:331:2: error: implicit declaration of function ‘writel_relaxed’

If I remove the `writel_relaxed` hunks to make the patch succeed, 3.4.106 still hangs. But it seems that patch isn't the real fix anyway - it must be something between 3.10 and 3.14.
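
The window named above follows directly from the test results. A minimal Python sketch of that narrowing step, using the versions listed in this comment: it treats stable point releases as if they lay on a single line of history, which is a simplification, since stable branches carry backports. The helper names are hypothetical.

```python
def parse_version(text):
    """Turn '3.10.53' into (3, 10, 53) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def fix_window(hanging, working):
    """Return (newest hanging version, oldest working version)."""
    last_bad = max(hanging, key=parse_version)
    first_good = min(working, key=parse_version)
    if parse_version(last_bad) >= parse_version(first_good):
        raise ValueError("results are not consistent with a single fix point")
    return last_bad, first_good


if __name__ == "__main__":
    hanging = ["3.4.106", "3.10.53"]   # hang with >=xf86-video-intel-2.99.917
    working = ["3.14.33", "3.17.4"]    # no hang
    print(fix_window(hanging, working))
```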
