Summary: | [gm45][BISECTED] Invalid GTT entry during Command Fetch from CS prefetch crossing page boundary | | |
---|---|---|---|
Product: | DRI | Reporter: | Vincent Legoll <vincent.legoll> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | | |
Priority: | medium | CC: | intel-gfx-bugs, thad.fisch |
Version: | unspecified | | |
Hardware: | x86-64 (AMD64) | | |
OS: | Linux (All) | | |
URL: | http://patchwork.freedesktop.org/patch/55818/ | | |
Whiteboard: | | | |
i915 platform: | G45 | i915 features: | GPU hang |
Description
Vincent Legoll
2015-06-28 06:56:08 UTC
Created attachment 116758 [details]
/sys/class/drm/card0/error
# uname -a
Linux debian 4.1.0-10980-ge0dd880 #8 SMP PREEMPT Sat Jun 27 10:35:48 CEST 2015 x86_64 GNU/Linux
# cat /etc/debian_version
stretch/sid

I'm not reproducing this any more with a new kernel: 4.1.0-11202-g4a10a91, but I slightly tweaked the config, though not in the GFX options... Please ask if you want some more testing...

(In reply to Vincent Legoll from comment #3)
> I'm not reproducing this any more with a new kernel: 4.1.0-11202-g4a10a91
> but I slightly tweaked the config, but not in GFX options though...

How reproducible was it before? My hypothesis is that the CS prefetched a page as its PTE was being rewritten (since it crossed the boundary into an idle object) and that somehow corrupted the page address. I just haven't figured out how it can ever see an invalid GTT entry. Anyway, my hypothesis says that it would be very hard to trigger reliably (depends on timing and memory layout).

I don't really know how reproducible it was. It was the first time I saw that kind of hang. I have had that laptop for a long time, probably more than 7 years, and it has always been running Linux. The display went black for a few seconds, and then I saw the drm output in dmesg. After that I got the same kind of black display upon launching vlc; it stayed black until I ctrl-Q'ed it in the blind... I still have that kernel around, I'll reboot into it and see if it reproduces.

I'm on that kernel, and nothing wrong is happening, I can use vlc to watch videos...

It hung 4 times yesterday...
# grep 'GPU HANG' syslog.1
Jun 28 08:39:53 debian kernel: [ 1379.819934] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
Jun 28 08:46:11 debian kernel: [ 1757.820844] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
Jun 28 08:46:19 debian kernel: [ 1765.820315] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
Jun 28 23:05:24 debian kernel: [ 1764.820639] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [569], reason: Ring hung, action: reset

Oops spoke too soon, it happened again... I'm attaching my .config

Created attachment 116801 [details]
kernel config of the buggy kernel
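
For anyone collecting the same data, the `grep 'GPU HANG'` step above can be turned into structured records. The following is a minimal sketch, assuming the message layout seen in the syslog lines quoted in this report; the names `HANG_RE`, `parse_hangs` and `SAMPLE` are made up for illustration and are not part of any i915 tooling.

```python
import re

# Matches the "[drm] GPU HANG" lines as they appear in this report.
# The layout is taken from the quoted syslog output; other kernel
# versions may format the message differently.
HANG_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>[^,\s]+), in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_hangs(lines):
    """Return one dict per input line that contains a GPU HANG message."""
    hangs = []
    for line in lines:
        match = HANG_RE.search(line)
        if match is None:
            continue
        record = match.groupdict()
        record["uptime"] = float(record["uptime"])  # seconds since boot
        record["pid"] = int(record["pid"])
        hangs.append(record)
    return hangs


# The four lines quoted from syslog.1 in this report.
SAMPLE = [
    "Jun 28 08:39:53 debian kernel: [ 1379.819934] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset",
    "Jun 28 08:46:11 debian kernel: [ 1757.820844] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset",
    "Jun 28 08:46:19 debian kernel: [ 1765.820315] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset",
    "Jun 28 23:05:24 debian kernel: [ 1764.820639] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [569], reason: Ring hung, action: reset",
]

if __name__ == "__main__":
    for hang in parse_hangs(SAMPLE):
        print(hang["uptime"], hang["ecode"], hang["comm"], hang["pid"])
```

Grouping the records by `pid` separates the three hangs of the first Xorg session (pid 570) from the one after the restart (pid 569).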
This is today's...

[ 999.816107] [drm] stuck on render ring
[ 999.820923] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [606], reason: Ring hung, action: reset
[ 999.820927] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 999.820928] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 999.820930] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 999.820932] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 999.820934] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 999.820936] i915: render error detected, EIR: 0x00000010
[ 999.820940] i915: IPEIR: 0x00000000
[ 999.820941] i915: IPEHR: 0x05000000
[ 999.820943] i915: INSTDONE_0: 0xffffffff
[ 999.820945] i915: INSTDONE_1: 0xbfffffff
[ 999.820946] i915: INSTDONE_2: 0x00000000
[ 999.820948] i915: INSTDONE_3: 0x00000000
[ 999.820949] i915: INSTPS: 0x8001e120
[ 999.820951] i915: ACTHD: 0x04ec1eac
[ 999.820953] i915: page table error
[ 999.820955] i915: PGTBL_ER: 0x00100001
[ 999.820968] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 999.828112] drm/i915: Resetting chip after gpu hang
[ 1005.816108] [drm] stuck on render ring
[ 1005.821049] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [606], reason: Ring hung, action: reset
[ 1005.826409] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[ 1005.826453] drm/i915: Resetting chip after gpu hang

Created attachment 116802 [details]
new drm/card0/error
# apt-cache showpkg xserver-xorg-video-intel
Package: xserver-xorg-video-intel
Versions:
2:2.99.917-1 (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages) (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_unstable_main_binary-amd64_Packages) (/var/lib/dpkg/status)
Description Language:
File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages
MD5: 4c1c091bee575987f9997018db5db7a4
Description Language: en
File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_i18n_Translation-en
MD5: 4c1c091bee575987f9997018db5db7a4
Reverse Depends:
xserver-xorg-video-intel-dbg,xserver-xorg-video-intel 2:2.99.917-1
xserver-xorg-video-all,xserver-xorg-video-intel
intel-gpu-tools,xserver-xorg-video-intel 2.9.1
Dependencies:
2:2.99.917-1 - libc6 (2 2.17) libdrm-intel1 (2 2.4.38) libdrm2 (2 2.4.25) libpciaccess0 (2 0.8.0+git20071002) libpixman-1-0 (2 0.30.0) libudev1 (2 183) libx11-6 (0 (null)) libx11-xcb1 (0 (null)) libxcb-dri2-0 (0 (null)) libxcb-dri3-0 (0 (null)) libxcb-sync1 (0 (null)) libxcb-util0 (2 0.3.8) libxcb1 (0 (null)) libxcursor1 (4 1.1.2) libxdamage1 (2 1:1.1) libxext6 (0 (null)) libxfixes3 (0 (null)) libxinerama1 (0 (null)) libxrandr2 (2 2:1.2.99.2) libxrender1 (0 (null)) libxshmfence1 (0 (null)) libxtst6 (0 (null)) libxv1 (0 (null)) libxvmc1 (0 (null)) xorg-video-abi-19 (0 (null)) xserver-xorg-core (2 2:1.16.99.901)
Provides:
2:2.99.917-1 - xorg-driver-video
Reverse Provides:

# apt-cache showpkg xserver-xorg
Package: xserver-xorg
Versions:
1:7.7+9 (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages) (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_unstable_main_binary-amd64_Packages) (/var/lib/dpkg/status)
Description Language:
File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages
MD5: 3d8c1d268e8af6b69f54d86fbd5a3939
Description Language: en
File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_i18n_Translation-en
MD5: 3d8c1d268e8af6b69f54d86fbd5a3939
Reverse Depends:
lightdm,xserver-xorg
deejayd,xserver-xorg
parl-desktop,xserver-xorg
design-desktop,xserver-xorg
xorg,xserver-xorg 1:7.7+9
xinit,xserver-xorg
kde-plasma-netbook,xserver-xorg
kde-plasma-desktop,xserver-xorg
lxde,xserver-xorg
lightdm,xserver-xorg
ldm,xserver-xorg
keyboards-rg,xserver-xorg
kdm,xserver-xorg
gnome-session,xserver-xorg 1:7.4
gdm3,xserver-xorg
Dependencies:
1:7.7+9 - xserver-xorg-core (2 2:1.15.0.901) xserver-xorg-video-all (16 (null)) xorg-driver-video (0 (null)) xserver-xorg-input-all (16 (null)) xorg-driver-input (0 (null)) xserver-xorg-input-evdev (0 (null)) libc6 (2 2.7) xkb-data (2 1.4) x11-xkb-utils (0 (null)) libgl1-mesa-dri (0 (null))
Provides:
1:7.7+9 - xserver
Reverse Provides:

Created attachment 116803 [details]
dmesg from failing kernel
Please ask if you want to investigate more.

Another one today...

[ 2229.808173] [drm] stuck on render ring
[ 2229.809649] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [605], reason: Ring hung, action: reset
[ 2229.809653] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2229.809655] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2229.809657] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2229.809659] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2229.809661] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2229.809664] i915: render error detected, EIR: 0x00000010
[ 2229.809668] i915: IPEIR: 0x00000000
[ 2229.809671] i915: IPEHR: 0x05000000
[ 2229.809673] i915: INSTDONE_0: 0xffffffff
[ 2229.809675] i915: INSTDONE_1: 0xbfffffff
[ 2229.809678] i915: INSTDONE_2: 0x00000000
[ 2229.809680] i915: INSTDONE_3: 0x00000000
[ 2229.809682] i915: INSTPS: 0x8001e120
[ 2229.809684] i915: ACTHD: 0x00cf0ea4
[ 2229.809687] i915: page table error
[ 2229.809689] i915: PGTBL_ER: 0x00100000
[ 2229.809698] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 2229.809774] drm/i915: Resetting chip after gpu hang
[ 2241.808135] [drm] stuck on render ring
[ 2241.812307] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [605], reason: Ring hung, action: reset
[ 2241.812379] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[ 2241.812420] drm/i915: Resetting chip after gpu hang

root@debian:/home/vince# uname -a
Linux debian 4.1.0-12639-gb1be9ea #13 SMP PREEMPT Sat Jul 4 18:40:25 CEST 2015 x86_64 GNU/Linux
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +60.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +60.0°C (high = +105.0°C, crit = +105.0°C)

They were even a few degrees lower just after that hang...

Created attachment 116982 [details]
new one kernel 4.1.0-12639-gb1be9ea
Rebooted into the same kernel, ran some "make -j2" in a C++ project and vlc at the same time. It went well for a few minutes (~10 mins), then I tried resizing the vlc window and it crashed again. The X server had a problem, maybe crashed, as I was logged out. I'll retry that to see if it's a good reproducer...

[ 938.804033] [drm] stuck on render ring
[ 938.805597] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [616], reason: Ring hung, action: reset
[ 938.805601] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 938.805603] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 938.805605] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 938.805607] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 938.805609] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 938.805613] i915: render error detected, EIR: 0x00000010
[ 938.805621] i915: IPEIR: 0x00000000
[ 938.805624] i915: IPEHR: 0x05000000
[ 938.805627] i915: INSTDONE_0: 0xffffffff
[ 938.805629] i915: INSTDONE_1: 0xbfffffff
[ 938.805631] i915: INSTDONE_2: 0x00000000
[ 938.805634] i915: INSTDONE_3: 0x00000000
[ 938.805636] i915: INSTPS: 0x8001e120
[ 938.805638] i915: ACTHD: 0x00cf0eec
[ 938.805641] i915: page table error
[ 938.805643] i915: PGTBL_ER: 0x00100000
[ 938.805660] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 938.805745] drm/i915: Resetting chip after gpu hang
[ 946.804178] [drm] stuck on render ring
[ 946.805667] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [616], reason: Ring hung, action: reset
[ 946.805742] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[ 946.805782] drm/i915: Resetting chip after gpu hang
[ 947.995796] vlc[1200]: segfault at 18 ip 00007f831ea5da50 sp 00007f830c1fdfa8 error 4 in libxcb.so.1.1.0[7f831ea4f000+21000]

Created attachment 116986 [details]
second for 4.1.0-12639-gb1be9ea
Created attachment 116987 [details]
third in a row
This one right after reboot, logged in, ran vlc, resize window kind of aggressively => crash
Now things start to get interesting. I was suspecting some kind of HW failure, since this laptop is old and pushed to its limits, at least CPU-wise (lots of -j2 C++ compilation happening). But booting:

# uname -a
Linux debian 3.16.0 #94 SMP PREEMPT Mon Aug 4 08:52:38 CEST 2014 x86_64 GNU/Linux

Also self-compiled, config probably not so different (oldconfig'ed), but maybe with another (older) gcc: 4.9.1 vs 4.9.3 for the newer kernels... Trying to reproduce by resizing vlc did not crash it for a few minutes, even after launching a background make -j2 in addition. That smells a lot like a GPU driver related problem. Do you think it'd help if I try to bisect?

Linux acer-tm8471 4.2.0-rc2-patched #2 SMP Thu Jul 16 11:21:14 CEST 2015 x86_64 x86_64 x86_64 GNU/Linux
---
[25339.804085] [drm] stuck on render ring
[25339.805541] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [643], reason: Ring hung, action: reset
[25339.805544] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[25339.805546] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[25339.805548] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[25339.805551] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[25339.805553] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[25339.805556] i915: render error detected, EIR: 0x00000010
[25339.805560] i915: IPEIR: 0x00000000
[25339.805563] i915: IPEHR: 0x05000000
[25339.805565] i915: INSTDONE_0: 0xffffffff
[25339.805568] i915: INSTDONE_1: 0xbfffffff
[25339.805570] i915: INSTDONE_2: 0x00000000
[25339.805572] i915: INSTDONE_3: 0x00000000
[25339.805575] i915: INSTPS: 0x8001e120
[25339.805577] i915: ACTHD: 0x09f09eec
[25339.805580] i915: page table error
[25339.805582] i915: PGTBL_ER: 0x00100000
[25339.805631] [drm:i915_handle_error [i915]] *ERROR* EIR stuck: 0x00000010, masking
[25339.805726] drm/i915: Resetting chip after gpu hang

Vincent, I think it would greatly help if you could bisect this issue.

I tried, but it was not conclusive. It is perhaps config-dependent, and I could not find the culprit. It may also be linked to the gcc version, I dunno. I lost countless hours trying to find it. If anyone has the slightest clue, that would help... Did you capture the content of /sys/class/drm/card1/error? Can you reproduce it?

I was able to trigger it while switching between a Firefox window (forced hardware acceleration) and mpv, which was playing an H.264 encoded video (VA-API, g45-h264 branch).

Created attachment 117220 [details]
dmesg
Created attachment 117221 [details]
/sys/class/drm/card1/error
I am still getting this on 4.2.0-rc2-00322-g8be5701

[ 73.804083] [drm] stuck on render ring
[ 73.805632] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [589], reason: Ring hung, action: reset
[ 73.805636] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 73.805638] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 73.805640] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 73.805641] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 73.805644] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 73.805646] i915: render error detected, EIR: 0x00000010
[ 73.805650] i915: IPEIR: 0x00000000
[ 73.805652] i915: IPEHR: 0x05000000
[ 73.805654] i915: INSTDONE_0: 0xffffffff
[ 73.805656] i915: INSTDONE_1: 0xbfffffff
[ 73.805658] i915: INSTDONE_2: 0x00000000
[ 73.805660] i915: INSTDONE_3: 0x00000000
[ 73.805661] i915: INSTPS: 0x8001e120
[ 73.805663] i915: ACTHD: 0x02881eb4
[ 73.805666] i915: page table error
[ 73.805667] i915: PGTBL_ER: 0x00100000
[ 73.805687] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 73.805790] drm/i915: Resetting chip after gpu hang

Created attachment 117263 [details]
4.2.0-rc2-00322-g8be5701 - /sys/class/drm/card0/error
Created attachment 117264 [details]
4.2.0-rc2-00322-g8be5701 - Xorg.0.log
Created attachment 117265 [details]
4.2.0-rc2-00322-g8be5701 - /proc/config.gz
Created attachment 117266 [details]
4.2.0-rc2-00322-g8be5701 - dmesg
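
One way to look at the "prefetch crossing a page boundary" hypothesis is to check where the ACTHD values reported in the hangs above sit within their page. The sketch below assumes 4 KiB GTT pages and uses the five ACTHD values quoted in the dmesg excerpts of this report; `bytes_to_page_end` is a helper name made up here. It does not prove the hypothesis, since the report does not state how far ahead the CS prefetches.

```python
PAGE_SIZE = 4096  # assumed GTT page size in bytes (4 KiB)

# ACTHD values copied from the dmesg excerpts in this report.
ACTHD_VALUES = [
    0x04EC1EAC,
    0x00CF0EA4,
    0x00CF0EEC,
    0x09F09EEC,
    0x02881EB4,
]


def bytes_to_page_end(addr, page_size=PAGE_SIZE):
    """Distance in bytes from addr to the start of the next page."""
    return page_size - (addr % page_size)


if __name__ == "__main__":
    for acthd in ACTHD_VALUES:
        offset = acthd % PAGE_SIZE
        print(
            f"ACTHD {acthd:#010x}: page offset {offset:#05x}, "
            f"{bytes_to_page_end(acthd)} bytes before the next page"
        )
```

All five values fall within the last 348 bytes of their page, which is at least consistent with the command streamer reading ahead into the following page when the hang is detected.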
I have finally managed to bisect it properly to:

[75d04a3773ecee617847de963ae4195d6aa74c28] drm/i915/gtt: Allocate va range only if vma is not bound

which is related to performance bug report https://bugs.freedesktop.org/show_bug.cgi?id=90224

I tested a kernel just before that commit, no bug, and then that commit itself, where I reproduced the crash in a few seconds with the "vlc video resize wiggling" GPU-hanger. But unfortunately I cannot test current git without that single patch, as it won't revert properly...

In fact the patch does revert with a bit of fuzz on Linus's current tree, and I verified that the bug disappears with the revert.

Your machine doesn't use that allocate_va_range(), the only way that could have had an impact then is if the bind flags are inaccurate - or if there is a missing mb(). How confident are you in that bisect result? Can you run for a few days on the preceding commit and see if the error occurs?

I cannot be 100% sure of the bisect result, but I can reproduce the bug in a few seconds with that patch in, whereas I cannot if it's not. That does not mean the bug is not there, only that it becomes way harder to hit... Do you have any other way to try to pinpoint what it really is?

I can no longer reproduce this bug after reverting the commit mentioned in comment #33. Thanks Vincent for finding the culprit.

Thaddaeus, what is your GPU? I'm writing the changelog, and would like to provide as much info as possible on impacted systems, especially if some are newer than mine... I also hope that you'll allow me to put a "tested-by:" line for you in it, if you can test the final version I'll post later...

I did not mention my GPU simply because it's the same as yours (GM45). Also, I will gladly test any patches and report back as soon as possible.
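
The bisection discussed above is a binary search over the commit history. The sketch below shows the idea under stated assumptions: `first_bad`, `history` and `is_bad` are illustrative names, the commit labels are placeholders rather than real hashes, and `is_bad` stands in for the manual step of building, booting and wiggling the vlc window.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    that every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical history: placeholder labels, not real commit ids.
    history = [f"commit-{n}" for n in range(16)]
    culprit = "commit-11"

    def is_bad(commit):
        # Stand-in for "build, boot, try to hang the GPU".
        return history.index(commit) >= history.index(culprit)

    print(first_bad(history, is_bad))
```

The search is only as reliable as the test: with a timing-dependent hang, a "good" verdict may just mean the hang did not show up during that run, which is the caveat raised in the thread about the bug possibly only becoming harder to hit.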
It's just the last chunk of

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 9d3852c521c7..9b304f28b4d3 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1924,13 +1924,13 @@ static int ggtt_bind_vma(struct i915_vma *vma,
         pte_flags |= PTE_READ_ONLY;

-    if (!dev_priv->mm.aliasing_ppgtt || flags & GLOBAL_BIND) {
+    if (flags & GLOBAL_BIND) {
         vma->vm->insert_entries(vma->vm, pages,
                                 vma->node.start,
                                 cache_level, pte_flags);
     }

-    if (dev_priv->mm.aliasing_ppgtt && flags & LOCAL_BIND) {
+    if (flags & LOCAL_BIND && dev_priv->mm.aliasing_ppgtt) {
         struct i915_hw_ppgtt *appgtt = dev_priv->mm.aliasing_ppgtt;
         appgtt->base.insert_entries(&appgtt->base, pages,
                                     vma->node.start,
@@ -1953,7 +1953,7 @@ static void ggtt_unbind_vma(struct i915_vma *vma)
                                true);
     }

-    if (dev_priv->mm.aliasing_ppgtt && vma->bound & LOCAL_BIND) {
+    if (vma->bound & LOCAL_BIND && dev_priv->mm.aliasing_ppgtt) {
         struct i915_hw_ppgtt *appgtt = dev_priv->mm.aliasing_ppgtt;
         appgtt->base.clear_range(&appgtt->base,
                                  vma->node.start,
@@ -2809,7 +2809,7 @@ int i915_vma_bind(struct i915_vma *vma, enum i915_cache_level cache_level,
         return -EINVAL;

     bind_flags = 0;
-    if (flags & PIN_GLOBAL)
+    if (flags & PIN_GLOBAL || !dev_priv->mm.aliasing_ppgtt)
         bind_flags |= GLOBAL_BIND;
     if (flags & PIN_USER)
         bind_flags |= LOCAL_BIND;

Went with http://patchwork.freedesktop.org/patch/55818/ instead.

No GPU hangs while testing 4.2.0-rc5. Thanks.

commit d0e30adc42d979e4adc36b6c112b57337423b70c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Jul 29 20:02:48 2015 +0100

drm/i915: Mark PIN_USER binding as GLOBAL_BIND without the aliasing ppgtt

Sorry for the late testing, I was on holidays. This seems to be fixed on 4.2.0-rc5-01262-gdd2384a

Thanks a lot!
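
The bind-flag change in the diff quoted above can be modelled outside the kernel. This is a sketch of that diff only, not of the patch that was eventually merged (which the thread says came from the patchwork link instead). The bit values are placeholders chosen for illustration, and all function names are invented here.

```python
# Placeholder bit values for illustration; the kernel's real values may differ.
PIN_GLOBAL = 1 << 0
PIN_USER = 1 << 1
GLOBAL_BIND = 1 << 0
LOCAL_BIND = 1 << 1


def bind_flags_before(pin_flags, has_aliasing_ppgtt):
    """PIN_* to *_BIND translation on the '-' side of the quoted diff."""
    bind_flags = 0
    if pin_flags & PIN_GLOBAL:
        bind_flags |= GLOBAL_BIND
    if pin_flags & PIN_USER:
        bind_flags |= LOCAL_BIND
    return bind_flags


def bind_flags_after(pin_flags, has_aliasing_ppgtt):
    """Same translation on the '+' side: no aliasing ppgtt implies GLOBAL_BIND."""
    bind_flags = 0
    if pin_flags & PIN_GLOBAL or not has_aliasing_ppgtt:
        bind_flags |= GLOBAL_BIND
    if pin_flags & PIN_USER:
        bind_flags |= LOCAL_BIND
    return bind_flags


def writes_ggtt_before(bind_flags, has_aliasing_ppgtt):
    """Whether ggtt_bind_vma() inserts GGTT entries on the '-' side."""
    return (not has_aliasing_ppgtt) or bool(bind_flags & GLOBAL_BIND)


def writes_ggtt_after(bind_flags, has_aliasing_ppgtt):
    """Whether ggtt_bind_vma() inserts GGTT entries on the '+' side."""
    return bool(bind_flags & GLOBAL_BIND)


if __name__ == "__main__":
    # A PIN_USER request on a machine without an aliasing ppgtt (the GM45 case).
    old = bind_flags_before(PIN_USER, has_aliasing_ppgtt=False)
    new = bind_flags_after(PIN_USER, has_aliasing_ppgtt=False)
    print(f"before: flags={old:#04b} writes GGTT={writes_ggtt_before(old, False)}")
    print(f"after:  flags={new:#04b} writes GGTT={writes_ggtt_after(new, False)}")
```

In this model the GGTT entries are written in both versions for a PIN_USER request without an aliasing ppgtt. What changes is the bookkeeping: before the diff the binding is recorded as LOCAL_BIND only, afterwards it also carries GLOBAL_BIND, which matches the "bind flags are inaccurate" suspicion raised earlier in the thread.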