Bug 91133

Summary: [gm45][BISECTED] Invalid GTT entry during Command Fetch from CS prefetch crossing page boundary
Product: DRI
Reporter: Vincent Legoll <vincent.legoll>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs, thad.fisch
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
URL: http://patchwork.freedesktop.org/patch/55818/
Whiteboard:
i915 platform: G45
i915 features: GPU hang
Attachments:
Description | Flags
/sys/class/drm/card0/error | none
kernel config of the buggy kernel | none
new drm/card0/error | none
dmesg from failing kernel | none
new one kernel 4.1.0-12639-gb1be9ea | none
second for 4.1.0-12639-gb1be9ea | none
third in a row | none
dmesg | none
/sys/class/drm/card1/error | none
4.2.0-rc2-00322-g8be5701 - /sys/class/drm/card0/error | none
4.2.0-rc2-00322-g8be5701 - Xorg.0.log | none
4.2.0-rc2-00322-g8be5701 - /proc/config.gz | none
4.2.0-rc2-00322-g8be5701 - dmesg | none

Description Vincent Legoll 2015-06-28 06:56:08 UTC
[ 1379.816023] [drm] stuck on render ring
[ 1379.819934] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
[ 1379.819937] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1379.819938] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1379.819940] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1379.819941] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1379.819943] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1379.819945] i915: render error detected, EIR: 0x00000010
[ 1379.819949] i915:   IPEIR: 0x00000000
[ 1379.819951] i915:   IPEHR: 0x05000000
[ 1379.819953] i915:   INSTDONE_0: 0xffffffff
[ 1379.819954] i915:   INSTDONE_1: 0xbfffffff
[ 1379.819956] i915:   INSTDONE_2: 0x00000000
[ 1379.819957] i915:   INSTDONE_3: 0x00000000
[ 1379.819959] i915:   INSTPS: 0x8001e120
[ 1379.819961] i915:   ACTHD: 0x00cf1e8c
[ 1379.819963] i915: page table error
[ 1379.819964] i915:   PGTBL_ER: 0x00100000
[ 1379.819972] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 1379.824445] drm/i915: Resetting chip after gpu hang
[ 1757.816140] [drm] stuck on render ring
[ 1757.820844] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
[ 1757.824332] drm/i915: Resetting chip after gpu hang
[ 1765.816105] [drm] stuck on render ring
[ 1765.820315] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
[ 1765.823341] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[ 1765.823387] drm/i915: Resetting chip after gpu hang
Comment 1 Vincent Legoll 2015-06-28 06:58:10 UTC
Created attachment 116758 [details]
/sys/class/drm/card0/error
Comment 2 Vincent Legoll 2015-06-28 07:23:40 UTC
# uname -a

Linux debian 4.1.0-10980-ge0dd880 #8 SMP PREEMPT Sat Jun 27 10:35:48 CEST 2015 x86_64 GNU/Linux

# cat /etc/debian_version 
stretch/sid
Comment 3 Vincent Legoll 2015-06-29 18:58:33 UTC
I can no longer reproduce this with a newer kernel (4.1.0-11202-g4a10a91),
though I did tweak the config slightly, just not in the GFX options...

Please ask if you want some more testing...
Comment 4 Chris Wilson 2015-06-29 19:44:03 UTC
(In reply to Vincent Legoll from comment #3)
> I can no longer reproduce this with a newer kernel (4.1.0-11202-g4a10a91),
> though I did tweak the config slightly, just not in the GFX options...

How reproducible was it before? My hypothesis is that the CS prefetched a page as its PTE was being rewritten (since it crossed the boundary into an idle object) and that somehow corrupted the page address. I just haven't figured out how it can ever see an invalid GTT entry.

Anyway, my hypothesis says that it would be very hard to trigger reliably (depends on timing and memory layout).
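
One way to sanity-check the prefetch hypothesis is to look at where ACTHD sits inside its 4 KiB page in each error dump quoted in this report. The sketch below is only illustrative: PREFETCH_WINDOW is a made-up placeholder (the actual CS prefetch depth is not stated anywhere in this thread), and the helper names are hypothetical.

```python
PAGE_SIZE = 0x1000       # 4 KiB GTT page
PREFETCH_WINDOW = 512    # ASSUMPTION: placeholder, not a documented hardware value

# ACTHD values copied from the dmesg excerpts in this report
ACTHD_VALUES = [
    0x00cf1e8c, 0x04ec1eac, 0x00cf0ea4,
    0x00cf0eec, 0x09f09eec, 0x02881eb4,
]


def bytes_to_page_end(addr):
    """Number of bytes from addr up to the next page boundary."""
    return PAGE_SIZE - (addr % PAGE_SIZE)


def prefetch_crosses_page(addr, window=PREFETCH_WINDOW):
    """True if reading `window` bytes starting at addr runs into the next page."""
    return bytes_to_page_end(addr) < window


for acthd in ACTHD_VALUES:
    print(f"ACTHD {acthd:#010x}: {bytes_to_page_end(acthd)} bytes before the page boundary")
```

Every ACTHD in this thread falls within the last 372 bytes of its page, which fits the hypothesis but does not prove it.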
Comment 5 Vincent Legoll 2015-06-29 20:17:39 UTC
I don't really know how reproducible it was.

It was the first time I saw that kind of hang. I have had that laptop for a long time, probably more than 7 years, and it has always been running Linux.

Display went black for a few secs, and then I saw the drm output in dmesg. After that I got the same kind of black display when launching vlc; it stayed black until I quit it blind with Ctrl-Q...

I still have that kernel around, I'll reboot into it and see if it reproduces.
Comment 6 Vincent Legoll 2015-06-29 20:33:41 UTC
I'm on that kernel, and nothing wrong is happening, I can use vlc to watch videos...

It hung 4 times yesterday...

# grep 'GPU HANG' syslog.1
Jun 28 08:39:53 debian kernel: [ 1379.819934] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
Jun 28 08:46:11 debian kernel: [ 1757.820844] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
Jun 28 08:46:19 debian kernel: [ 1765.820315] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset
Jun 28 23:05:24 debian kernel: [ 1764.820639] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [569], reason: Ring hung, action: reset
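
For anyone who wants more than a grep, the hang lines above have a regular enough shape to parse. This is a rough sketch based only on the line format visible in this report; the regex and the `parse_hangs` helper are hypothetical and may need adjusting for other kernel versions.

```python
import re

# Pattern derived from the "GPU HANG" lines quoted in this report
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_hangs(lines):
    """Return one dict per line that reports a GPU hang; other lines are skipped."""
    hangs = []
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            record = match.groupdict()
            record["pid"] = int(record["pid"])
            hangs.append(record)
    return hangs


sample = [
    "Jun 28 08:39:53 debian kernel: [ 1379.819934] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [570], reason: Ring hung, action: reset",
    "Jun 28 23:05:24 debian kernel: [ 1764.820639] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [569], reason: Ring hung, action: reset",
    "[ 1379.816023] [drm] stuck on render ring",
]
hangs = parse_hangs(sample)
print(len(hangs), sorted({h["pid"] for h in hangs}))
```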
Comment 7 Vincent Legoll 2015-06-29 20:36:55 UTC
Oops, spoke too soon, it happened again...

I'm attaching my .config
Comment 8 Vincent Legoll 2015-06-29 20:37:53 UTC
Created attachment 116801 [details]
kernel config of the buggy kernel
Comment 9 Vincent Legoll 2015-06-29 20:40:11 UTC
This is today's...

[  999.816107] [drm] stuck on render ring
[  999.820923] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [606], reason: Ring hung, action: reset
[  999.820927] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  999.820928] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  999.820930] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  999.820932] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  999.820934] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  999.820936] i915: render error detected, EIR: 0x00000010
[  999.820940] i915:   IPEIR: 0x00000000
[  999.820941] i915:   IPEHR: 0x05000000
[  999.820943] i915:   INSTDONE_0: 0xffffffff
[  999.820945] i915:   INSTDONE_1: 0xbfffffff
[  999.820946] i915:   INSTDONE_2: 0x00000000
[  999.820948] i915:   INSTDONE_3: 0x00000000
[  999.820949] i915:   INSTPS: 0x8001e120
[  999.820951] i915:   ACTHD: 0x04ec1eac
[  999.820953] i915: page table error
[  999.820955] i915:   PGTBL_ER: 0x00100001
[  999.820968] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[  999.828112] drm/i915: Resetting chip after gpu hang
[ 1005.816108] [drm] stuck on render ring
[ 1005.821049] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [606], reason: Ring hung, action: reset
[ 1005.826409] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[ 1005.826453] drm/i915: Resetting chip after gpu hang
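
Note that PGTBL_ER in this dump (0x00100001) is not the same as in the original description (0x00100000). A tiny helper that lists the set bit positions shows how the two differ. It deliberately does not name the bits, since their meaning has to come from the hardware documentation, which is not quoted in this report; `set_bits` is a hypothetical helper.

```python
def set_bits(value):
    """Return the positions (0 = least significant) of all bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Register values copied from the dmesg excerpts in this report
for name, value in [
    ("EIR", 0x00000010),
    ("PGTBL_ER (description)", 0x00100000),
    ("PGTBL_ER (comment 9)", 0x00100001),
]:
    print(f"{name}: {value:#010x} -> bits {set_bits(value)}")
```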
Comment 10 Vincent Legoll 2015-06-29 20:41:31 UTC
Created attachment 116802 [details]
new drm/card0/error
Comment 11 Vincent Legoll 2015-06-29 20:45:10 UTC
# apt-cache showpkg xserver-xorg-video-intel
Package: xserver-xorg-video-intel
Versions: 
2:2.99.917-1 (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages) (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_unstable_main_binary-amd64_Packages) (/var/lib/dpkg/status)
 Description Language: 
                 File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages
                  MD5: 4c1c091bee575987f9997018db5db7a4
 Description Language: en
                 File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_i18n_Translation-en
                  MD5: 4c1c091bee575987f9997018db5db7a4


Reverse Depends: 
  xserver-xorg-video-intel-dbg,xserver-xorg-video-intel 2:2.99.917-1
  xserver-xorg-video-all,xserver-xorg-video-intel
  intel-gpu-tools,xserver-xorg-video-intel 2.9.1
Dependencies: 
2:2.99.917-1 - libc6 (2 2.17) libdrm-intel1 (2 2.4.38) libdrm2 (2 2.4.25) libpciaccess0 (2 0.8.0+git20071002) libpixman-1-0 (2 0.30.0) libudev1 (2 183) libx11-6 (0 (null)) libx11-xcb1 (0 (null)) libxcb-dri2-0 (0 (null)) libxcb-dri3-0 (0 (null)) libxcb-sync1 (0 (null)) libxcb-util0 (2 0.3.8) libxcb1 (0 (null)) libxcursor1 (4 1.1.2) libxdamage1 (2 1:1.1) libxext6 (0 (null)) libxfixes3 (0 (null)) libxinerama1 (0 (null)) libxrandr2 (2 2:1.2.99.2) libxrender1 (0 (null)) libxshmfence1 (0 (null)) libxtst6 (0 (null)) libxv1 (0 (null)) libxvmc1 (0 (null)) xorg-video-abi-19 (0 (null)) xserver-xorg-core (2 2:1.16.99.901) 
Provides: 
2:2.99.917-1 - xorg-driver-video 
Reverse Provides:
Comment 12 Vincent Legoll 2015-06-29 20:46:17 UTC
# apt-cache showpkg xserver-xorg
Package: xserver-xorg
Versions: 
1:7.7+9 (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages) (/var/lib/apt/lists/ftp.us.debian.org_debian_dists_unstable_main_binary-amd64_Packages) (/var/lib/dpkg/status)
 Description Language: 
                 File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_binary-amd64_Packages
                  MD5: 3d8c1d268e8af6b69f54d86fbd5a3939
 Description Language: en
                 File: /var/lib/apt/lists/ftp.us.debian.org_debian_dists_testing_main_i18n_Translation-en
                  MD5: 3d8c1d268e8af6b69f54d86fbd5a3939


Reverse Depends: 
  lightdm,xserver-xorg
  deejayd,xserver-xorg
  parl-desktop,xserver-xorg
  design-desktop,xserver-xorg
  xorg,xserver-xorg 1:7.7+9
  xinit,xserver-xorg
  kde-plasma-netbook,xserver-xorg
  kde-plasma-desktop,xserver-xorg
  lxde,xserver-xorg
  lightdm,xserver-xorg
  ldm,xserver-xorg
  keyboards-rg,xserver-xorg
  kdm,xserver-xorg
  gnome-session,xserver-xorg 1:7.4
  gdm3,xserver-xorg
Dependencies: 
1:7.7+9 - xserver-xorg-core (2 2:1.15.0.901) xserver-xorg-video-all (16 (null)) xorg-driver-video (0 (null)) xserver-xorg-input-all (16 (null)) xorg-driver-input (0 (null)) xserver-xorg-input-evdev (0 (null)) libc6 (2 2.7) xkb-data (2 1.4) x11-xkb-utils (0 (null)) libgl1-mesa-dri (0 (null)) 
Provides: 
1:7.7+9 - xserver 
Reverse Provides:
Comment 13 Vincent Legoll 2015-06-29 20:58:29 UTC
Created attachment 116803 [details]
dmesg from failing kernel
Comment 14 Vincent Legoll 2015-06-29 20:59:07 UTC
Please ask if you want to investigate more
Comment 15 Vincent Legoll 2015-07-06 17:28:51 UTC
Another one today...

[ 2229.808173] [drm] stuck on render ring
[ 2229.809649] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [605], reason: Ring hung, action: reset
[ 2229.809653] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2229.809655] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2229.809657] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2229.809659] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2229.809661] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2229.809664] i915: render error detected, EIR: 0x00000010
[ 2229.809668] i915:   IPEIR: 0x00000000
[ 2229.809671] i915:   IPEHR: 0x05000000
[ 2229.809673] i915:   INSTDONE_0: 0xffffffff
[ 2229.809675] i915:   INSTDONE_1: 0xbfffffff
[ 2229.809678] i915:   INSTDONE_2: 0x00000000
[ 2229.809680] i915:   INSTDONE_3: 0x00000000
[ 2229.809682] i915:   INSTPS: 0x8001e120
[ 2229.809684] i915:   ACTHD: 0x00cf0ea4
[ 2229.809687] i915: page table error
[ 2229.809689] i915:   PGTBL_ER: 0x00100000
[ 2229.809698] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 2229.809774] drm/i915: Resetting chip after gpu hang
[ 2241.808135] [drm] stuck on render ring
[ 2241.812307] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [605], reason: Ring hung, action: reset
[ 2241.812379] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[ 2241.812420] drm/i915: Resetting chip after gpu hang


root@debian:/home/vince# uname -a
Linux debian 4.1.0-12639-gb1be9ea #13 SMP PREEMPT Sat Jul 4 18:40:25 CEST 2015 x86_64 GNU/Linux


# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +60.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +60.0°C  (high = +105.0°C, crit = +105.0°C)

they were even a few degrees lower just after that hang...
Comment 16 Vincent Legoll 2015-07-06 17:31:09 UTC
Created attachment 116982 [details]
new one kernel 4.1.0-12639-gb1be9ea
Comment 17 Vincent Legoll 2015-07-06 18:05:20 UTC
Rebooted into the same kernel, ran "make -j2" in a C++ project with vlc playing at the same time. It went well for about 10 minutes, then I tried resizing the vlc window and it hung again. The X server had a problem too, and may have crashed, since I was logged out.

I'll retry that to see if it's a good reproducer...

[  938.804033] [drm] stuck on render ring
[  938.805597] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [616], reason: Ring hung, action: reset
[  938.805601] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  938.805603] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  938.805605] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  938.805607] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  938.805609] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  938.805613] i915: render error detected, EIR: 0x00000010
[  938.805621] i915:   IPEIR: 0x00000000
[  938.805624] i915:   IPEHR: 0x05000000
[  938.805627] i915:   INSTDONE_0: 0xffffffff
[  938.805629] i915:   INSTDONE_1: 0xbfffffff
[  938.805631] i915:   INSTDONE_2: 0x00000000
[  938.805634] i915:   INSTDONE_3: 0x00000000
[  938.805636] i915:   INSTPS: 0x8001e120
[  938.805638] i915:   ACTHD: 0x00cf0eec
[  938.805641] i915: page table error
[  938.805643] i915:   PGTBL_ER: 0x00100000
[  938.805660] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[  938.805745] drm/i915: Resetting chip after gpu hang
[  946.804178] [drm] stuck on render ring
[  946.805667] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [616], reason: Ring hung, action: reset
[  946.805742] [drm:i915_set_reset_status] *ERROR* gpu hanging too fast, banning!
[  946.805782] drm/i915: Resetting chip after gpu hang
[  947.995796] vlc[1200]: segfault at 18 ip 00007f831ea5da50 sp 00007f830c1fdfa8 error 4 in libxcb.so.1.1.0[7f831ea4f000+21000]
Comment 18 Vincent Legoll 2015-07-06 18:06:24 UTC
Created attachment 116986 [details]
second for 4.1.0-12639-gb1be9ea
Comment 19 Vincent Legoll 2015-07-06 18:15:43 UTC
Created attachment 116987 [details]
third in a row

This one happened right after a reboot: logged in, ran vlc, resized the window kind of aggressively => crash
Comment 20 Vincent Legoll 2015-07-06 18:30:10 UTC
Now things start to get interesting, I was suspecting some kind of HW failure, since this laptop is old, and pushed to its limits, at least CPU-wise (lots of -j2 C++ compilation happening).

But booting:

# uname -a
Linux debian 3.16.0 #94 SMP PREEMPT Mon Aug 4 08:52:38 CEST 2014 x86_64 GNU/Linux

Also self-compiled, config probably not so different (oldconfig'ed), but possibly built with an older gcc (4.9.1, versus 4.9.3 for the newer kernels)...

And trying to reproduce with resizing vlc, did not crash it for a few minutes. Even after having launched a background make -j2 in addition.

That smells a lot like a GPU-driver-related problem.

Do you think it'd help if I try to bisect?
Comment 21 Thaddaeus Tintenfisch 2015-07-17 21:13:41 UTC
Linux acer-tm8471 4.2.0-rc2-patched #2 SMP Thu Jul 16 11:21:14 CEST 2015 x86_64 x86_64 x86_64 GNU/Linux

---
[25339.804085] [drm] stuck on render ring
[25339.805541] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [643], reason: Ring hung, action: reset
[25339.805544] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[25339.805546] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[25339.805548] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[25339.805551] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[25339.805553] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[25339.805556] i915: render error detected, EIR: 0x00000010
[25339.805560] i915:   IPEIR: 0x00000000
[25339.805563] i915:   IPEHR: 0x05000000
[25339.805565] i915:   INSTDONE_0: 0xffffffff
[25339.805568] i915:   INSTDONE_1: 0xbfffffff
[25339.805570] i915:   INSTDONE_2: 0x00000000
[25339.805572] i915:   INSTDONE_3: 0x00000000
[25339.805575] i915:   INSTPS: 0x8001e120
[25339.805577] i915:   ACTHD: 0x09f09eec
[25339.805580] i915: page table error
[25339.805582] i915:   PGTBL_ER: 0x00100000
[25339.805631] [drm:i915_handle_error [i915]] *ERROR* EIR stuck: 0x00000010, masking
[25339.805726] drm/i915: Resetting chip after gpu hang
Comment 22 Thaddaeus Tintenfisch 2015-07-17 21:18:46 UTC
Vincent, I think it would greatly help if you could bisect this issue.
Comment 23 Vincent Legoll 2015-07-17 21:35:47 UTC
I tried, but it was not conclusive, it is perhaps config-dependent, and I could not find the culprit. It may also be linked to gcc-version, I dunno. I lost countless hours trying to find it.

If anyone has the slightest clue, that would help...
Comment 24 Vincent Legoll 2015-07-17 21:37:15 UTC
Did you capture the content of /sys/class/drm/card1/error ?
Can you reproduce it ?
Comment 25 Thaddaeus Tintenfisch 2015-07-18 11:24:33 UTC
I was able to trigger it while switching between a Firefox window (forced hardware acceleration) and mpv, which was playing an H.264-encoded video (VA-API, g45-h264 branch).
Comment 26 Thaddaeus Tintenfisch 2015-07-18 11:25:29 UTC
Created attachment 117220 [details]
dmesg
Comment 27 Thaddaeus Tintenfisch 2015-07-18 11:27:06 UTC
Created attachment 117221 [details]
/sys/class/drm/card1/error
Comment 28 Vincent Legoll 2015-07-20 17:16:25 UTC
I am still getting this on 4.2.0-rc2-00322-g8be5701

[   73.804083] [drm] stuck on render ring
[   73.805632] [drm] GPU HANG: ecode 4:0:0xfaffffff, in Xorg [589], reason: Ring hung, action: reset
[   73.805636] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   73.805638] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   73.805640] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   73.805641] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   73.805644] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   73.805646] i915: render error detected, EIR: 0x00000010
[   73.805650] i915:   IPEIR: 0x00000000
[   73.805652] i915:   IPEHR: 0x05000000
[   73.805654] i915:   INSTDONE_0: 0xffffffff
[   73.805656] i915:   INSTDONE_1: 0xbfffffff
[   73.805658] i915:   INSTDONE_2: 0x00000000
[   73.805660] i915:   INSTDONE_3: 0x00000000
[   73.805661] i915:   INSTPS: 0x8001e120
[   73.805663] i915:   ACTHD: 0x02881eb4
[   73.805666] i915: page table error
[   73.805667] i915:   PGTBL_ER: 0x00100000
[   73.805687] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[   73.805790] drm/i915: Resetting chip after gpu hang
Comment 29 Vincent Legoll 2015-07-20 17:17:18 UTC
Created attachment 117263 [details]
4.2.0-rc2-00322-g8be5701 - /sys/class/drm/card0/error
Comment 30 Vincent Legoll 2015-07-20 18:00:20 UTC
Created attachment 117264 [details]
4.2.0-rc2-00322-g8be5701 - Xorg.0.log
Comment 31 Vincent Legoll 2015-07-20 18:01:35 UTC
Created attachment 117265 [details]
4.2.0-rc2-00322-g8be5701 - /proc/config.gz
Comment 32 Vincent Legoll 2015-07-20 18:03:12 UTC
Created attachment 117266 [details]
4.2.0-rc2-00322-g8be5701 - dmesg
Comment 33 Vincent Legoll 2015-07-22 17:19:40 UTC
I have finally managed to bisect it properly to:
[75d04a3773ecee617847de963ae4195d6aa74c28] drm/i915/gtt: Allocate va range only if vma is not bound

which is related to the performance bug report https://bugs.freedesktop.org/show_bug.cgi?id=90224

I tested a kernel just before that commit (no bug), and then that commit itself, where I reproduced the crash in a few seconds with the "vlc video resize wiggling" GPU-hanger.

Unfortunately I cannot test current git without that single patch, as it won't revert properly...
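
The bisection described in this comment amounts to a binary search for the first commit at which the hang reproduces. Below is a minimal sketch, assuming a linear history and a reliable reproducer. Neither holds exactly here: git bisect works on the full commit graph, and this hang is timing-dependent, so a "good" verdict needs repeated testing. The commit labels and the `is_bad` callback are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. `is_bad` builds/boots/tests one commit.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # culprit is at mid or earlier
        else:
            good = mid   # culprit is after mid
    return commits[bad]


# Hypothetical history in which the hang first appears at "c5"
history = [f"c{i}" for i in range(8)]
culprit = first_bad_commit(history, lambda c: history.index(c) >= 5)
print(culprit)
```

With a flaky reproducer, `is_bad` would have to retry the test several times before declaring a commit good, which is where most of the bisection time goes.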
Comment 34 Vincent Legoll 2015-07-22 19:54:46 UTC
In fact the patch does revert with a bit of fuzz on Linus's current tree,
and I confirmed that the bug disappears with the revert.
Comment 35 Chris Wilson 2015-07-22 20:39:56 UTC
Your machine doesn't use that allocate_va_range(), so the only way that commit could have had an impact is if the bind flags are inaccurate, or if there is a missing mb().

How confident are you in that bisect result? Can you run for a few days on the preceding commit and see if the error occurs?
Comment 36 Vincent Legoll 2015-07-22 20:51:55 UTC
I cannot be 100% sure of the bisect result, but I can reproduce the bug in a few seconds with that patch in, whereas I cannot if it's not. That does not mean the bug is not there, only that it becomes way harder to hit...
Comment 37 Vincent Legoll 2015-07-22 20:53:17 UTC
Do you have any other way to try to pinpoint what it really is ?
Comment 38 Thaddaeus Tintenfisch 2015-07-27 15:32:36 UTC
I can no longer reproduce this bug after reverting the commit mentioned in comment #33.

Thanks Vincent for finding the culprit.
Comment 39 Vincent Legoll 2015-07-27 16:50:45 UTC
Thaddaeus, what is your GPU?

I'm writing the changelog, and would like to provide as much info as possible on impacted systems, especially if some are newer than mine...

I also hope that you'll allow me to put a "tested-by:" line for you in it, if you can test the final version I'll post later...
Comment 40 Thaddaeus Tintenfisch 2015-07-27 18:07:16 UTC
I did not mention my GPU simply because it's the same as yours (GM45).

Also, I will gladly test any patches and report back as soon as possible.
Comment 41 Chris Wilson 2015-07-29 18:51:25 UTC
It's just the last chunk of

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 9d3852c521c7..9b304f28b4d3 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -1924,13 +1924,13 @@ static int ggtt_bind_vma(struct i915_vma *vma,
                pte_flags |= PTE_READ_ONLY;
 
 
-       if (!dev_priv->mm.aliasing_ppgtt || flags & GLOBAL_BIND) {
+       if (flags & GLOBAL_BIND) {
                vma->vm->insert_entries(vma->vm, pages,
                                        vma->node.start,
                                        cache_level, pte_flags);
        }
 
-       if (dev_priv->mm.aliasing_ppgtt && flags & LOCAL_BIND) {
+       if (flags & LOCAL_BIND && dev_priv->mm.aliasing_ppgtt) {
                struct i915_hw_ppgtt *appgtt = dev_priv->mm.aliasing_ppgtt;
                appgtt->base.insert_entries(&appgtt->base, pages,
                                            vma->node.start,
@@ -1953,7 +1953,7 @@ static void ggtt_unbind_vma(struct i915_vma *vma)
                                     true);
        }
 
-       if (dev_priv->mm.aliasing_ppgtt && vma->bound & LOCAL_BIND) {
+       if (vma->bound & LOCAL_BIND && dev_priv->mm.aliasing_ppgtt) {
                struct i915_hw_ppgtt *appgtt = dev_priv->mm.aliasing_ppgtt;
                appgtt->base.clear_range(&appgtt->base,
                                         vma->node.start,
@@ -2809,7 +2809,7 @@ int i915_vma_bind(struct i915_vma *vma, enum i915_cache_level cache_level,
                return -EINVAL;
 
        bind_flags = 0;
-       if (flags & PIN_GLOBAL)
+       if (flags & PIN_GLOBAL || !dev_priv->mm.aliasing_ppgtt)
                bind_flags |= GLOBAL_BIND;
        if (flags & PIN_USER)
                bind_flags |= LOCAL_BIND;
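
To make the effect of that last hunk concrete, here is a small Python model of the bind_flags computation before and after it. It mirrors only the hunk shown above in i915_vma_bind(), not the rest of the driver and not necessarily the patch that was finally merged. The bit values are illustrative assumptions, not the real kernel constants.

```python
# ASSUMPTION: illustrative bit values; the real constants live in the i915 headers
PIN_GLOBAL = 1 << 0
PIN_USER = 1 << 1
GLOBAL_BIND = 1 << 0
LOCAL_BIND = 1 << 1


def bind_flags_before(pin_flags, has_aliasing_ppgtt):
    """bind_flags as computed before the last hunk (aliasing ppgtt is ignored)."""
    bind_flags = 0
    if pin_flags & PIN_GLOBAL:
        bind_flags |= GLOBAL_BIND
    if pin_flags & PIN_USER:
        bind_flags |= LOCAL_BIND
    return bind_flags


def bind_flags_after(pin_flags, has_aliasing_ppgtt):
    """bind_flags with the last hunk: no aliasing ppgtt forces GLOBAL_BIND."""
    bind_flags = 0
    if (pin_flags & PIN_GLOBAL) or not has_aliasing_ppgtt:
        bind_flags |= GLOBAL_BIND
    if pin_flags & PIN_USER:
        bind_flags |= LOCAL_BIND
    return bind_flags


# A PIN_USER-only request on a machine without the aliasing ppgtt
print(bind_flags_before(PIN_USER, False), bind_flags_after(PIN_USER, False))
```

In this model, a PIN_USER-only request without the aliasing ppgtt gets LOCAL_BIND alone before the change and GLOBAL_BIND | LOCAL_BIND after it; with the aliasing ppgtt present, the two versions agree.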
Comment 42 Chris Wilson 2015-07-29 19:06:04 UTC
Went with http://patchwork.freedesktop.org/patch/55818/ instead.
Comment 43 Thaddaeus Tintenfisch 2015-08-04 09:15:31 UTC
No GPU hangs while testing 4.2.0-rc5. Thanks.
Comment 44 Chris Wilson 2015-08-04 10:22:04 UTC
commit d0e30adc42d979e4adc36b6c112b57337423b70c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jul 29 20:02:48 2015 +0100

    drm/i915: Mark PIN_USER binding as GLOBAL_BIND without the aliasing ppgtt
Comment 45 Vincent Legoll 2015-08-08 15:08:12 UTC
Sorry for the late testing, I was on holiday.

This seems to be fixed on 4.2.0-rc5-01262-gdd2384a

Thanks a lot !
