Bug 53385 - [GM45] i915: render error detected - Invalid GTT entry during fetch for host
Summary: [GM45] i915: render error detected - Invalid GTT entry during fetch for host
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Daniel Vetter
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 79222
Depends on:
Blocks:
 
Reported: 2012-08-11 18:22 UTC by Dmitry Nezhevenko
Modified: 2017-07-24 23:00 UTC
CC List: 9 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg (60.34 KB, text/plain)
2012-08-11 18:23 UTC, Dmitry Nezhevenko
Xorg.log (30.36 KB, text/plain)
2012-08-11 18:23 UTC, Dmitry Nezhevenko
i915_error_state (1.32 MB, text/plain)
2012-08-11 18:24 UTC, Dmitry Nezhevenko
output of intel_error_decode (2.59 MB, text/plain)
2012-09-02 16:23 UTC, andreas.sturmlechner
i915_error_state for 3.6.3 kernel (1.32 MB, text/plain)
2012-11-06 20:12 UTC, Dmitry Nezhevenko
intel_error_decode output for 3.6.3 kernel (2.45 MB, text/plain)
2012-11-06 20:14 UTC, Dmitry Nezhevenko
output of intel_error_decode (2.45 MB, text/plain)
2013-02-10 13:26 UTC, andreas.sturmlechner
3.8 dmesg with drm.debug=6 (195.32 KB, text/plain)
2013-02-27 18:21 UTC, andreas.sturmlechner
intel reg dump from drm-intel-nightly with i915.panel_ignore_lid=0 (13.55 KB, text/plain)
2013-07-04 20:47 UTC, andreas.sturmlechner
output of /sys/class/drm/card0/error (1.34 MB, text/plain)
2013-07-16 16:28 UTC, andreas.sturmlechner
intel error decode (3.10.0-rc7+ drm-intel-nightly from 13/07/15) (2.48 MB, text/plain)
2013-07-16 16:45 UTC, andreas.sturmlechner
intel-error-decode-131222.log (drm-intel-nightly-3.13.0-rc4+) (2.72 MB, text/plain)
2013-12-22 19:06 UTC, andreas.sturmlechner
intel-reg-dump-131222.log (drm-intel-nightly-3.13.0-rc4+) (13.71 KB, text/plain)
2013-12-22 19:08 UTC, andreas.sturmlechner
20140908-0828_3.16.1-gentoo-stop_i915errdecode-ON.log (2.46 MB, text/plain)
2014-09-11 23:17 UTC, andreas.sturmlechner
20140908-0828_3.16.1-gentoo-stop_i915regdump-ON.log (13.71 KB, text/plain)
2014-09-11 23:20 UTC, andreas.sturmlechner

Description Dmitry Nezhevenko 2012-08-11 18:22:42 UTC
I'm getting this just after booting kernel 3.5 (including 3.5.1) on GM45.

[  298.460954] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  298.462548] i915: render error detected, EIR: 0x00000010
[  298.462553] i915:   IPEIR: 0x00000000
[  298.462555] i915:   IPEHR: 0x01000000
[  298.462557] i915:   INSTDONE: 0xfffffffe
[  298.462559] i915:   INSTPS: 0x0001e000
[  298.462561] i915:   INSTDONE1: 0xffffffff
[  298.462563] i915:   ACTHD: 0x0021aaa0
[  298.462566] i915: page table error
[  298.462568] i915:   PGTBL_ER: 0x00000001
[  298.462573] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking

00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)

dmesg, xorg.log and i915_error_state attached. 

xorg    1:7.7+1
xserver-xorg    1:7.7+1
xserver-xorg-core       2:1.12.3-1
xserver-xorg-video-intel        2:2.19.0-5
libdrm-intel1:amd64     2.4.33-3
libdrm-intel1:i386      2.4.33-3

There was no such issue with kernel 3.4.
Comment 1 Dmitry Nezhevenko 2012-08-11 18:23:28 UTC
Created attachment 65441 [details]
dmesg
Comment 2 Dmitry Nezhevenko 2012-08-11 18:23:49 UTC
Created attachment 65442 [details]
Xorg.log
Comment 3 Dmitry Nezhevenko 2012-08-11 18:24:44 UTC
Created attachment 65443 [details]
i915_error_state
Comment 4 Chris Wilson 2012-08-17 15:11:33 UTC
The GPU is completely idle at the time of the error, and the source of the error is from the CPU accessing an invalid PTE through the GTT. There never should be an invalid PTE (the entire GTT is meant to only be pointing at buffer objects or the scratch page, valid entries one and all) so this is doubly concerning.

Is there any chance you can perform a bisection between 3.4 and 3.5?
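
When comparing occurrences of this error across kernels, it can help to pull the register dump out of dmesg into a dictionary. The following is a minimal sketch, assuming the line layout quoted in the description. `parse_i915_render_error` and `has_page_table_error` are hypothetical helper names, and the 0x10 mask is inferred only from the logs in this report, where EIR 0x00000010 is accompanied by "page table error".

```python
import re

# Matches "NAME: 0x..." pairs such as "EIR: 0x00000010" or "INSTDONE_0: 0xfffffffe".
_REG_RE = re.compile(r"\b([A-Z][A-Z0-9_]*):\s*(0x[0-9a-fA-F]+)")

# Assumption: inferred from the logs in this report, not from hardware documentation.
PAGE_TABLE_ERROR_BIT = 0x10


def parse_i915_render_error(dmesg_text):
    """Return {register_name: int} for every 'NAME: 0x...' pair on lines mentioning i915."""
    regs = {}
    for line in dmesg_text.splitlines():
        if "i915" not in line:
            continue
        for name, value in _REG_RE.findall(line):
            regs[name] = int(value, 16)
    return regs


def has_page_table_error(regs):
    """True if the parsed EIR value has the (assumed) page table error bit set."""
    return bool(regs.get("EIR", 0) & PAGE_TABLE_ERROR_BIT)


SAMPLE = """\
[  298.462548] i915: render error detected, EIR: 0x00000010
[  298.462553] i915:   IPEIR: 0x00000000
[  298.462555] i915:   IPEHR: 0x01000000
[  298.462563] i915:   ACTHD: 0x0021aaa0
[  298.462566] i915: page table error
[  298.462568] i915:   PGTBL_ER: 0x00000001
"""

if __name__ == "__main__":
    for name, value in parse_i915_render_error(SAMPLE).items():
        print(f"{name} = {value:#010x}")
```

Two dumps parsed this way can be compared field by field, which makes it easy to see that only ACTHD changes between most of the occurrences reported here.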
Comment 5 andreas.sturmlechner 2012-08-30 09:56:32 UTC
I've had the same error once, but without a method to reproduce it will be hard to bisect - I haven't seen it since.
Comment 6 andreas.sturmlechner 2012-09-01 12:34:54 UTC
It just happened again after a few hours runtime:

[14153.513354] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[14153.514006] i915: render error detected, EIR: 0x00000010
[14153.514006] i915:   IPEIR: 0x00000000
[14153.514006] i915:   IPEHR: 0x01000000
[14153.514006] i915:   INSTDONE: 0xfffffffe
[14153.514006] i915:   INSTPS: 0x0001e000
[14153.514006] i915:   INSTDONE1: 0xffffffff
[14153.514006] i915:   ACTHD: 0x1f80d0b8
[14153.514006] i915: page table error
[14153.514006] i915:   PGTBL_ER: 0x00000001
[14153.514006] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking


Package versions:
kernel-3.6_rc3-drm
xorg-server-1.12.4
xf86-video-intel-2.20.5
libdrm-2.4.38
mesa-8.1_rc1_pre20120814
Comment 7 andreas.sturmlechner 2012-09-02 16:23:18 UTC
Created attachment 66499 [details]
output of intel_error_decode

After today's upgrades to mesa master, libdrm-2.4.39 and 3.6_rc4-drm the error happened again. I still have no idea what causes it - the error occurred at wildly different uptimes and always under regular desktop workload.
Comment 8 Chris Wilson 2012-09-05 13:18:18 UTC
Pity this is irregular, otherwise I could ask you to switch to SNA and see if that helps.
Comment 9 Dmitry Nezhevenko 2012-09-05 13:20:25 UTC
I've tried to start bisecting but unfortunately it doesn't reproduce with my 'minimal' .config. So it's definitely something configuration-specific.


Any ideas which options to check first?
Comment 10 Daniel Vetter 2012-09-05 14:03:36 UTC
(In reply to comment #9)
> I've tried to start bisecting but unfortunately it doesn't reproduce with my
> 'minimal' .config. So it's definitely something configuration-specific
> 
> 
> Any ideas which options to check first?

That just means it's a timing-related race somewhere. Which makes this really hard to track down :(
Comment 11 andreas.sturmlechner 2012-09-07 00:35:33 UTC
Dmitry, since we both have GM45 hardware and share at least one symptom - would you mind testing kernel 3.4.10 for bug 54575?
Comment 12 andreas.sturmlechner 2012-09-09 20:47:17 UTC
(In reply to comment #8)
> Pity this is irregular, otherwise I could ask you to switch to SNA and see if
> that helps.

The error didn't occur in ~ 9 hours after switching from UXA to SNA.
Comment 13 andreas.sturmlechner 2012-09-16 12:28:55 UTC
Another week with SNA and I think it's safe to say that it really only happens with UXA.
Comment 14 Chris Wilson 2012-09-28 13:51:15 UTC
The hint here is that this appears to be a race with pageflipping. So UXA should receive the same level of protection as SNA with current xf86-video-intel.git, and there is yet another bug to be fixed in the kernel...

commit 5a6c82a097e23cadc73eb65ebe6634bd84d363bc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 27 21:17:28 2012 +0100

    drm/i915: Flush the pending flips on the CRTC before modification
Comment 15 Chris Wilson 2012-10-18 12:39:03 UTC
Working on the theory that this is also related to the cpu-relocs issue, does using 3.7 help?
Comment 16 Dmitry Nezhevenko 2012-10-18 12:40:46 UTC
It looks like I can't reproduce it using 3.6.2 kernel. So 3.7 is probably also ok.

So I think that it's ok to close this as resolved.

Thanks.
Comment 17 Daniel Vetter 2012-10-18 12:52:02 UTC
Ok, closing this as no longer reproducible on latest kernels, thanks a lot for the bug report and please reopen if this issue pops up again.
Comment 18 Dmitry Nezhevenko 2012-11-06 20:10:44 UTC
Well, it still reproduces somehow, but now it happens at very random times and I don't see any common conditions. Maybe switching to the console, watching video (mplayer -vo gl2) or just switching between multiple X11 sessions is the cause; I can't be sure, and I don't know any steps to reproduce it.

This is with kernel 3.6.3

[248021.375714] i915: render error detected, EIR: 0x00000010
[248021.375724] i915:   IPEIR: 0x00000000
[248021.375728] i915:   IPEHR: 0x01000000
[248021.375733] i915:   INSTDONE: 0xfffffffe
[248021.375736] i915:   INSTPS: 0x0001e000
[248021.375740] i915:   INSTDONE1: 0xffffffff
[248021.375744] i915:   ACTHD: 0x098151d0
[248021.375749] i915: page table error
[248021.375752] i915:   PGTBL_ER: 0x00000001
[248021.375760] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking

I'm attaching i915_error_state and intel_error_decode output for this case
Comment 19 Dmitry Nezhevenko 2012-11-06 20:12:04 UTC
Created attachment 69637 [details]
i915_error_state for 3.6.3 kernel
Comment 20 Dmitry Nezhevenko 2012-11-06 20:14:17 UTC
Created attachment 69638 [details]
intel_error_decode output for 3.6.3 kernel
Comment 21 Chris Wilson 2012-12-07 19:41:46 UTC
Smells like a missing mb().
Comment 22 Chris Wilson 2012-12-07 19:42:32 UTC
Can you please try http://cgit.freedesktop.org/~ickle/linux-2.6 #master which contains a review of the mb() around GTT access.
Comment 23 andreas.sturmlechner 2013-02-10 13:21:06 UTC
I've switched to SNA since comment #13, but the error pops up now and then, seen it in 3.7.x and also in 3.8 (right now in rc6 after 13+ hrs uptime).
Comment 24 andreas.sturmlechner 2013-02-10 13:26:21 UTC
Created attachment 74543 [details]
output of intel_error_decode

latest output of intel_error_decode
Comment 25 andreas.sturmlechner 2013-02-10 16:28:39 UTC
(In reply to comment #22)
> Can you please try http://cgit.freedesktop.org/~ickle/linux-2.6 #master
> which contains a review of the mb() around GTT access.

Chris, could you point out which patches I should take from there so I can try with a stable kernel? ~ickle/linux-2.6 master suffers from bug 58867 which I could patch, but right at startup I already see another ugly kernel oops...
Comment 26 Chris Wilson 2013-02-10 16:30:50 UTC
The most interesting of those patches are now in drm-intel-next (http://cgit.freedesktop.org/~danvet/drm-intel) or something like the drm-intel-experimental ppa.
Comment 27 andreas.sturmlechner 2013-02-10 18:35:36 UTC
Ok thx, I 'unblurred' current drm-intel-next branch for my system, testing the resulting image now. :)
Comment 28 andreas.sturmlechner 2013-02-16 17:42:58 UTC
I've manually picked your commits

d0a57789d5ec807fc218151b2fb2de4da30fbef5
97c809fd9cf5e914322b53773ad0d67efe503fde
a3e30cef4b84f92763ed54c9934d70e2dd591246
9ddcb7df360c62ac6d4090ae60376c26510022f1

from 2012-12-16, all about mb(), as the current drm-intel-next branch kernel image panics on my system after some time. Testing with 3.8_rc7 right now.

Other related packages updated since comment #6:

xf86-video-intel-2.20.19
xorg-server-1.13.2
libdrm-2.4.40
mesa-9.0.1
Comment 29 andreas.sturmlechner 2013-02-21 22:40:41 UTC
Happened again with the above mentioned patches, this time very early - before even wlan0 was up. :(
Comment 30 andreas.sturmlechner 2013-02-27 18:21:02 UTC
Created attachment 75646 [details]
3.8 dmesg with drm.debug=6

If anything, it seems to happen more often now...

Also updated to xf86-video-intel-2.21.3
Comment 31 Ben Widawsky 2013-06-04 20:05:16 UTC
(In reply to comment #30)
> Created attachment 75646 [details]
> 3.8 dmesg with drm.debug=6
> 
> If anything, it seems to happen more often now...
> 
> Also updated to xf86-video-intel-2.21.3

Can you please retest with the latest drm-intel-nightly?
Comment 32 Chris Wilson 2013-06-04 20:08:23 UTC
Do we have enough of fastboot upstream yet to fix the regression of not turning off the BIOS outputs whilst we overwrite its memory and PTE?
Comment 33 Daniel Vetter 2013-06-06 20:51:20 UTC
Nope, fastboot framebuffer reconstruction is still missing :(
Comment 34 andreas.sturmlechner 2013-06-25 22:11:55 UTC
In my small collection of dmesg logs, it was last seen in a 3.9.0 kernel. I should probably automate this and grep/save dmesg at each shutdown. However, these days I'm running 3.10 and so far haven't stumbled over it, while not exactly watching out for it. I will do that in the coming days and report back should it happen / then try out drm-intel-nightly.
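
The "grep/save dmesg at each shutdown" idea could look roughly like the sketch below. It is an illustration only: the patterns are taken from the messages quoted in this report, and `filter_i915_errors`, `save_if_errors` and the output file naming are hypothetical choices, not an existing tool.

```python
import re
from datetime import datetime
from pathlib import Path

# Patterns taken from the error messages quoted in this bug report.
ERROR_PATTERNS = re.compile(
    r"render error detected|page table error|PGTBL_ER|EIR stuck"
)


def filter_i915_errors(dmesg_text):
    """Return the dmesg lines that match one of the known error patterns."""
    return [line for line in dmesg_text.splitlines() if ERROR_PATTERNS.search(line)]


def save_if_errors(dmesg_text, out_dir, now=None):
    """Write matching lines to out_dir/i915-error-<timestamp>.log.

    Returns the path that was written, or None if nothing matched.
    """
    hits = filter_i915_errors(dmesg_text)
    if not hits:
        return None
    now = now or datetime.now()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"i915-error-{now:%Y%m%d-%H%M}.log"
    path.write_text("\n".join(hits) + "\n")
    return path
```

At shutdown the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; how the script is hooked into the shutdown sequence depends on the init system and is not shown here.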
Comment 35 andreas.sturmlechner 2013-07-01 00:51:13 UTC
OK, it just happened again in 3.10.0-rc7+ which brings me to drm-intel-nightly next.
Comment 36 andreas.sturmlechner 2013-07-01 19:58:00 UTC
Unfortunately both drm-intel-nightly as well as -next are currently unusable on my system - there's no external display output at all. KDE detects when I fire up the DP monitor but X hangs when I actually try to enable some output there.

Maybe I'll find a working state somewhere back in git history.
Comment 37 Chris Wilson 2013-07-01 20:06:05 UTC
Where? What? When? How? Don't leave us hanging like this!

cat /proc/`pidof Xorg`/stack or attach gdb would be useful as would the last traces from the log file.
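
For collecting the same information from a script, the following is a minimal sketch of the shell one-liner above. `find_pids` and `read_kernel_stack` are hypothetical helper names; matching on `/proc/<pid>/comm` only approximates what `pidof` does, and reading `/proc/<pid>/stack` normally requires root.

```python
from pathlib import Path


def find_pids(name, proc_root="/proc"):
    """Return the PIDs whose /proc/<pid>/comm equals name (roughly like pidof)."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            # The process exited or is not readable; skip it.
            continue
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


def read_kernel_stack(pid, proc_root="/proc"):
    """Return the contents of /proc/<pid>/stack, the task's kernel stack trace."""
    return (Path(proc_root) / str(pid) / "stack").read_text()


if __name__ == "__main__":
    for pid in find_pids("Xorg"):
        print(f"--- Xorg pid {pid} ---")
        print(read_kernel_stack(pid))
```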
Comment 38 andreas.sturmlechner 2013-07-04 20:43:48 UTC
Sorry, not much time for bug hunting with my current workload. :(

However, I found out that what's actually broken is one of my boot params that is needed for correct kms fbcon native resolution detection - after removing i915.panel_ignore_lid=0 I do have output again on DP.

So now I'm able to test today's state of drm-intel-nightly.
Comment 39 andreas.sturmlechner 2013-07-04 20:47:29 UTC
Created attachment 82046 [details]
intel reg dump from drm-intel-nightly with i915.panel_ignore_lid=0

fwiw, attaching the reg dump from nightly without dp output
Comment 40 andreas.sturmlechner 2013-07-07 11:41:28 UTC
I haven't seen the error so far using nightly, but it's too early to be safe.

Only this has appeared in dmesg when switching on the external display via xrandr (it wouldn't come up by itself, it's the troubled setup from bug 58876):

[   45.127774] [drm:intel_dp_aux_ch] *ERROR* dp_aux_ch not done status 0x11450085
Comment 41 andreas.sturmlechner 2013-07-16 15:56:14 UTC
OK, there it is again with a drm-intel-nightly image pulled and built yesterday evening: 

[15979.289716] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[15979.290709] i915: render error detected, EIR: 0x00000010
[15979.290709] i915:   IPEIR: 0x00000000
[15979.290709] i915:   IPEHR: 0x01000000
[15979.290709] i915:   INSTDONE_0: 0xfffffffe
[15979.290709] i915:   INSTDONE_1: 0xffffffff
[15979.290709] i915:   INSTDONE_2: 0x00000000
[15979.290709] i915:   INSTDONE_3: 0x00000000
[15979.290709] i915:   INSTPS: 0x0001e000
[15979.290709] i915:   ACTHD: 0x164041f8
[15979.290709] i915: page table error
[15979.290709] i915:   PGTBL_ER: 0x00000001
[15979.290709] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
Comment 42 andreas.sturmlechner 2013-07-16 16:28:38 UTC
Created attachment 82487 [details]
output of /sys/class/drm/card0/error

attaching error log.

Error happened while also using SNA, xf86-video-intel-2.21.12, xorg-server-1.13.4, libdrm-2.4.45, mesa-9.1.4
Comment 43 andreas.sturmlechner 2013-07-16 16:45:00 UTC
Created attachment 82488 [details]
intel error decode (3.10.0-rc7+ drm-intel-nightly from 13/07/15)
Comment 44 Chris Wilson 2013-07-20 14:44:35 UTC
Note that the errors immediately after boot and those after several hours of runtime are likely two different bugs. Or rather, I have two theories that explain each one independently...
Comment 45 andreas.sturmlechner 2013-09-15 20:31:02 UTC
Since I'm currently doing a lot of rebooting due to other issues with i915, I did notice that with 3.8.13 the early after boot error was more or less guaranteed, and that seems to have disappeared as I only noticed it late in the game with recent kernels. Which kind of confirms your theory and that there has been some progress indeed, I guess.
Comment 46 Jani Nikula 2013-12-16 14:13:56 UTC
Andreas, what's the situation with current drm-intel-nightly?
Comment 47 andreas.sturmlechner 2013-12-22 19:00:36 UTC
(In reply to comment #46)
> Andreas, what's the situation with current drm-intel-nightly?

I just tried the latest state of drm-intel-nightly on the setup that's troubled by bug 57461 and bug 69251 (external display via DisplayPort), and it's got a bit worse:


1.) System freezes every time that - presumably - EDID is accessed. At first there's a noticeable black screen delay between grub2 and init, then it proceeds fine to the login manager, all seems fine at that point.

2.) That 5-6 seconds freeze (total lock, any input is lost) then happens each time I switch between fbcon and login manager, and doing that I can soon provoke the following error in dmesg:

[   66.836072] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 5

3.) Starting the desktop environment results in a multitude of those freezes, presumably because KDE tries to detect and find out a few things about display capabilities, color management and whatnot, startup is considerably delayed by that.

4.) How to reproduce the freeze:

~ $ time oyranos-monitor -l
0: ":0.0" 1920,00x1200,00+0,00+0,00  S2243W

real    0m6.230s
user    0m0.034s
sys     0m0.033s

5.) During first startup of the new kernel image I also got an 'hpd interrupt storm' in dmesg, a few restarts later a familiar error has reappeared:

[  457.189291] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  457.189296] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  457.189297] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  457.189299] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  457.189300] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  457.190004] i915: render error detected, EIR: 0x00000010
[  457.190004] i915:   IPEIR: 0x00000000
[  457.190004] i915:   IPEHR: 0x54c00006
[  457.190004] i915:   INSTDONE_0: 0x808f837f
[  457.190004] i915:   INSTDONE_1: 0xbf2706ae
[  457.190004] i915:   INSTDONE_2: 0x00000000
[  457.190004] i915:   INSTDONE_3: 0x00000000
[  457.190004] i915:   INSTPS: 0x8001e025
[  457.190004] i915:   ACTHD: 0x01bcb45c
[  457.190004] i915: page table error
[  457.190004] i915:   PGTBL_ER: 0x00000001
[  457.190004] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
Comment 48 andreas.sturmlechner 2013-12-22 19:02:40 UTC
Sorry, the first bug number in above comment #47 should have pointed at bugzilla.kernel.org.
Comment 49 andreas.sturmlechner 2013-12-22 19:06:55 UTC
Created attachment 91132 [details]
intel-error-decode-131222.log (drm-intel-nightly-3.13.0-rc4+)
Comment 50 andreas.sturmlechner 2013-12-22 19:08:40 UTC
Created attachment 91133 [details]
intel-reg-dump-131222.log (drm-intel-nightly-3.13.0-rc4+)
Comment 51 andreas.sturmlechner 2013-12-27 17:00:31 UTC
Happened now as well on the DP-DVI setup from bug 58876 and as soon as [ 1716.048044], but here at least there are no 6sec freezes.
Comment 52 Jani Nikula 2014-09-11 16:45:33 UTC
Timeout, please try current drm-intel-nightly.
Comment 53 Dmitry Nezhevenko 2014-09-11 16:48:22 UTC
Hi,

I (the submitter of this bug) no longer have access to the affected GM45 laptop. We can probably wait a week for the others on the CC list...
Comment 54 andreas.sturmlechner 2014-09-11 23:17:19 UTC
Created attachment 106161 [details]
20140908-0828_3.16.1-gentoo-stop_i915errdecode-ON.log

*checks logs*

Error was last recorded on 2014-09-08 with kernel 3.16.1 for the first time in about a month since I started saving logs at shutdown:

[   81.622660] [drm] GPU HANG: ecode -1:0x00000000, reason: Command parser error, iir 0x00008000, action: continue
[   81.622660] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   81.622660] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   81.622660] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   81.622660] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   81.622660] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   81.622660] i915: render error detected, EIR: 0x00000010
[   81.622660] i915:   IPEIR: 0x00000000
[   81.622660] i915:   IPEHR: 0x01000000
[   81.622660] i915:   INSTDONE_0: 0xfffffffe
[   81.622660] i915:   INSTDONE_1: 0xffffffff
[   81.622660] i915:   INSTDONE_2: 0x00000000
[   81.622660] i915:   INSTDONE_3: 0x00000000
[   81.622660] i915:   INSTPS: 0x0001e000
[   81.622660] i915:   ACTHD: 0x0080b8a0
[   81.622660] i915: page table error
[   81.622660] i915:   PGTBL_ER: 0x00000001
[   81.622660] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
Comment 55 andreas.sturmlechner 2014-09-11 23:20:11 UTC
Created attachment 106162 [details]
20140908-0828_3.16.1-gentoo-stop_i915regdump-ON.log

regdump available as well
Comment 56 Chris Wilson 2015-01-27 12:33:32 UTC
*** Bug 79222 has been marked as a duplicate of this bug. ***
Comment 57 Chris Wilson 2015-01-27 12:34:26 UTC
This could be related to http://patchwork.freedesktop.org/patch/41094/
Comment 58 Chris Wilson 2015-02-06 11:24:46 UTC
I am going to take a risk and say this is fixed by:

commit 983d308cb8f602d1920a8c40196eb2ab6cc07bd2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 26 10:47:10 2015 +0000

    agp/intel: Serialise after GTT updates

