Bug 96526

Summary: [BAT execlists] Sporadic - gem_exec_suspend basic-s4 GPU hang after resume
Product: DRI Reporter: Marius Vlad <marius.c.vlad>
Component: DRM/Intel Assignee: Humberto Israel Perez Rodriguez <humberto.i.perez.rodriguez>
Status: CLOSED FIXED QA Contact: Humberto Israel Perez Rodriguez <humberto.i.perez.rodriguez>
Severity: blocker    
Priority: highest CC: abchk1234, chris, chris.harris, christophe.prigent, ewfalor, fei.yang, harish.hyma, humberto.i.perez.rodriguez, intel-gfx-bugs, kassick, leonard, matwey.kornilov, Nikolaus, pj.crommen, ricardo.vega, rodrigo.vivi, slacker702, solitone, unki, xlionell, ziegler
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features: firmware/guc, GPU hang, power/suspend-resume
Attachments:
Description                                                Flags
dump from intel_error_decode                               none
/sys/class/drm/card0/error on APL                          none
APL-gem_exec_suspend_gpu-hang_kern.log                     none
dmes_bsw.log                                               none
dmesg_bxt.log                                              none
gpu_error_bxt                                              none
dmesg_gem_exec_suspend_apl.log                             none
APL_sys-class-drm-card0-error_without-guc                  none
APL-gem_exec_suspend_basic-S4_without-guc_kern.log         none
APL_sys-class-drm-card0-error_with-guc                     none
APL-gem_exec_suspend_basic-S4_with-guc_kern.log            none
APL_gem_exec_suspend_basic-S4_output-with-and-without-guc  none
bsw-gem_exec_suspend__basic_S4-kern.log                    none
bsw-gem_exec_suspend__basic_S4-output                      none
bsw-error                                                  none
BDW__gem_exec_suspend--basic-S4__kern.log                  none
BDW_error                                                  none
BDW__gem_exec_suspend--basic-S4__output                    none
BDW__with-patch-comment-54                                 none

Description Marius Vlad 2016-06-14 13:08:09 UTC
Created attachment 124525 [details]
dump from intel_error_decode

Time        0:01:18.061184
Stdout      IGT-Version: 1.15-g3ce58b6 (x86_64) (Linux: 4.7.0-rc1+ x86_64)
            rtcwake: wakeup from "disk" using /dev/rtc0 at Fri Jun  3 12:20:58 2016
            Stack trace:
              #0 [__igt_fail_assert+0xf1]
              #1 [sig_abort+0x3a]
              #2 [killpg+0x40]
              #3 [ioctl+0x7]
              #4 [drmIoctl+0x28]
              #5 [__gem_execbuf+0x15]
              #6 [gem_has_ring+0x54]
              #7 [test_all+0x40]
              #8 [run_test+0x3fe]
              #9 [__real_main227+0x26f]
              #10 [main+0x23]
              #11 [__libc_start_main+0xf0]
              #12 [_start+0x29]
              #13 [<unknown>+0x29]
            Subtest basic-S4: FAIL (7.956s)


Stderr      (gem_exec_suspend:7902) DEBUG: Test requirement passed: gem_has_ring(fd, 0)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: can_mi_store_dword(gen, 0)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: nengine
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: nengine
            (gem_exec_suspend:7902) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
            (gem_exec_suspend:7902) DEBUG: Verifying result
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: nengine
            (gem_exec_suspend:7902) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
            (gem_exec_suspend:7902) DEBUG: Verifying result
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: nengine
            (gem_exec_suspend:7902) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
            (gem_exec_suspend:7902) DEBUG: Verifying result
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: nengine
            (gem_exec_suspend:7902) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
            (gem_exec_suspend:7902) DEBUG: Verifying result
            (gem_exec_suspend:7902) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
            (gem_exec_suspend:7902) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
            (gem_exec_suspend:7902) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
            (gem_exec_suspend:7902) igt-aux-DEBUG: Test requirement passed: system("rtcwake -n -s 30 -m disk" SQUELCH) == 0
            (gem_exec_suspend:7902) DEBUG: Verifying result
            (gem_exec_suspend:7902) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:399:
            (gem_exec_suspend:7902) igt-aux-CRITICAL: Failed assertion: !"GPU hung"


dmesg:

[  642.918976] [drm] stuck on blitter ring
[  642.918987] [drm] stuck on bsd ring
[  642.918993] [drm] stuck on video enhancement ring
[  642.930715] [drm] GPU HANG: ecode 8:1:0x5ccddf92, in gem_exec_suspen [6045], reason: Engine(s) hung, action: reset
[  642.930921] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  642.930925] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  642.930929] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  642.930932] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  642.930935] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  642.938079] [drm:i915_set_reset_status [i915]] *ERROR* gpu hanging too fast, banning!
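
When comparing hangs across platforms it can help to pull the fields out of the "GPU HANG" line mechanically. The following is a minimal sketch, assuming the line layout shown in the dmesg excerpt above. The function name `parse_gpu_hang` is hypothetical, and the field names `gen` and `ring` are an assumption about what the first two ecode numbers mean; the log itself does not label them.

```python
import re

# Pattern modeled on the "GPU HANG" line quoted in this report.
# The names "gen" and "ring" are assumptions about the two leading numbers.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<ring>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], reason: (?P<reason>.*)"
)


def parse_gpu_hang(line):
    """Return the fields of a 'GPU HANG' dmesg line as a dict, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the purely numeric fields; keep the ecode hash as a string.
    for key in ("gen", "ring", "pid"):
        fields[key] = int(fields[key])
    return fields


SAMPLE = (
    "[  642.930715] [drm] GPU HANG: ecode 8:1:0x5ccddf92, "
    "in gem_exec_suspen [6045], reason: Engine(s) hung, action: reset"
)
info = parse_gpu_hang(SAMPLE)
```

The reason text is kept as one string, including the trailing "action:" part, because dmesg lines can be truncated and splitting further would be fragile.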


S3 and S4 work fine when no batch commands are issued to the GPU.
Comment 1 cprigent 2016-07-05 15:14:04 UTC
Created attachment 124910 [details]
/sys/class/drm/card0/error on APL

Reproduced on APL.

# ./gem_exec_suspend --r basic-S4
IGT-Version: 1.15-g88c1f7c (x86_64) (Linux: 4.7.0-rc5-nightly+ x86_64)
rtcwake: wakeup from "disk" using /dev/rtc0 at Tue Jul  5 15:06:28 2016
(gem_exec_suspend:2939) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:399:
(gem_exec_suspend:2939) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [sig_abort+0x3a]
  #2 [killpg+0x40]
  #3 [__write_nocancel+0x7]
  #4 [igt_drop_caches_set+0xa4]
  #5 [gem_quiescent_gpu+0xcf]
  #6 [run_test+0x403]
  #7 [__real_main227+0x159]
  #8 [main+0x29]
  #9 [__libc_start_main+0xf0]
  #10 [_start+0x29]
  #11 [<unknown>+0x29]
Subtest basic-S4 failed.
**** DEBUG ****
(gem_exec_suspend:2939) DEBUG: Test requirement passed: gem_has_ring(fd, 0)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: can_mi_store_dword(gen, 0)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:2939) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:2939) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:2939) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:2939) DEBUG: Verifying result
(gem_exec_suspend:2939) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:2939) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:2939) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:2939) DEBUG: Verifying result
(gem_exec_suspend:2939) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:2939) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:2939) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:2939) DEBUG: Verifying result
(gem_exec_suspend:2939) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:2939) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:2939) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:2939) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:2939) DEBUG: Verifying result
(gem_exec_suspend:2939) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:2939) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:2939) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_exec_suspend:2939) igt-aux-DEBUG: Test requirement passed: system("rtcwake -n -s 30 -m disk" SQUELCH) == 0
(gem_exec_suspend:2939) DEBUG: Verifying result
(gem_exec_suspend:2939) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:399:
(gem_exec_suspend:2939) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
****  END  ****
Subtest basic-S4: FAIL (14.214s)


Platform: APL system
CPU Name : Intel(R) Genuine Processor @ 1.1 GHz (family: 6, model: 12, stepping: 9) 4 cores
QDF : Q6HE
SoC : B1
CRB : Apollo Lake DDR3L RVP1A FAB2
Reworks : R19, R20

Software 
Bios: 144_B10 - APLK_B0_IFWI_X64_R_2016_06_27_0956_SPI_RVP1 from \\gar\ec\proj\ba\CCG\APL BIOS\External\BIOS_Release\Daily\v144_10_2016_WW27.1\IFWI\IFWI_RVP1_Release\IFWI
KSC: 1.15
Linux distribution: Ubuntu 16.04 64 bits
Kernel: drm-intel-nightly 4.7.0-rc4 5c244f4 from http://cgit.freedesktop.org/drm-intel/
   commit 5c244f4b128c6274755007e080d46e0a61b71534
   Author: Chris Wilson <chris@chris-wilson.co.uk>
   Date:   Fri Jun 24 16:17:56 2016 +0100
   drm-intel-nightly: 2016y-06m-24d-15h-17m-32s UTC integration manifest
drm: libdrm-2.4.68-9 625d181 from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 56cd706 from git://anongit.freedesktop.org/mesa/mesa
cairo: 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
server: xorg-server-1.18.0-419 7397a21 from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel: 2.99.917-670 cac7c8d from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva: libva-1.7.0-26 c36971c from git://git.freedesktop.org/git/vaapi/libva
vaapi-intel-driver: 1.7.0-52 f47e513 from git://git.freedesktop.org/git/vaapi/intel-driver
DMC 1.07
GuC 8.7
Intel-Gpu-Tools: 1.15-54 88c1f7c from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git
Comment 2 cprigent 2016-07-05 15:15:47 UTC
Created attachment 124911 [details]
APL-gem_exec_suspend_gpu-hang_kern.log
Comment 3 cprigent 2016-07-07 08:23:38 UTC
On APL it could be related to GuC:
[   16.937392] [drm:guc_fw_fetch] GuC fw fetch status FAIL; err -11, fw           (null), obj           (null)
[   16.937417] [drm:intel_guc_init [i915]] *ERROR* Failed to fetch GuC firmware from i915/bxt_guc_ver8_7.bin (error -11)
Adding Rodrigo as watcher.
Comment 4 Rodrigo Vivi 2016-07-07 18:23:26 UTC
About GuC: BSW doesn't have GuC, so it is probably good to file a separate issue for GuC not loading after S4 on APL (this probably happens on SKL and KBL as well).

About this on APL: can you please reproduce with GuC disabled, so we can see whether this happens only on BSW or also on APL regardless of GuC?
Comment 5 cprigent 2016-07-08 10:23:21 UTC
(In reply to Rodrigo Vivi from comment #4)
> About GuC: BSW doesn't have GuC, so it is probably good to file a separate
> issue for GuC not loading after S4 on APL (this probably happens on SKL and
> KBL as well).
> 
> About this on APL: can you please reproduce with GuC disabled, so we can see
> whether this happens only on BSW or also on APL regardless of GuC?

Reported internally.
On APL, the GPU Hang is reproduced with GuC not loaded.
Comment 6 cprigent 2016-07-12 08:20:08 UTC
Executed several times.

Sometimes the test is skipped:
 ./gem_exec_suspend --r basic-S4
IGT-Version: 1.15-g2038b24 (x86_64) (Linux: 4.7.0-rc6-testing+ x86_64)
Test requirement not met in function run_test, file gem_exec_suspend.c:157:
Test requirement: __gem_execbuf(fd, &execbuf) == 0
Subtest basic-S4: SKIP (0.001s)
The reason is different than in: http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/CI_DRM_1416/bxtp-1/html/bxtp-1@CI_DRM_1416@1/igt@gem_exec_suspend@basic-s4.html 

Most of the time the test fails with a GPU hang.

Tested with:
Platform: APL system
CPU Name : Intel(R) Genuine Processor @ 1.1 GHz (family: 6, model: 12, stepping: 9) 4 cores
QDF : Q6HE
SoC : B1
CRB : Apollo Lake DDR3L RVP1A FAB2
Reworks : R19, R20

Software 
Bios: 144_B10 APLK_B0_IFWI_X64_R_2016_06_27_0956_SPI_RVP1.bin from \\gar\ec\proj\ba\CCG\APL BIOS\External\BIOS_Release\Daily\v144_10_2016_WW27.1\IFWI\IFWI_RVP1_Release\IFWI
KSC: 1.15
Linux distribution: Ubuntu 16.04 64 bits
Kernel: tag drm-intel-testing-2016-07-11 4.7.0-rc6 0230e3c from http://cgit.freedesktop.org/drm-intel/
commit 0230e3c4eb76cf8f57cf40db0e908b96b84e3911
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 10 13:24:46 2016 +0100
drm-intel-nightly: 2016y-07m-10d-12h-23m-38s UTC integration manifest
drm: libdrm-2.4.68-14 8c8d5dd from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 3a9f628 from git://anongit.freedesktop.org/mesa/mesa
cairo: 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
xserver: xorg-server-1.18.0-454 033888e from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel: 2.99.917-676 26f8ab5 from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva: libva-1.7.0-26 c36971c from git://git.freedesktop.org/git/vaapi/libva
vaapi-intel-driver: 1.7.0-53 bcde10d from git://git.freedesktop.org/git/vaapi/intel-driver
GuC 8.7
DMC 1.07 from https://01.org/linuxgraphics/downloads/broxton-dmc-1.07
Intel-Gpu-Tools 1.15 2038b24 from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git
Comment 7 Ville Syrjala 2016-07-14 16:36:15 UTC
No hang in ring buffer mode on BSW, so looks like something execlist related.
Comment 8 Ville Syrjala 2016-07-14 18:20:34 UTC
And not using stolen for the ring buffer also cures it:

@@ -2049,7 +2049,7 @@ static int intel_alloc_ringbuffer_obj(struct drm_device *dev,
        struct drm_i915_gem_object *obj;
 
        obj = NULL;
-       if (!HAS_LLC(dev))
+       if (!HAS_LLC(dev) && !i915.enable_execlists)
                obj = i915_gem_object_create_stolen(dev, ringbuf->size);
        if (obj == NULL)
                obj = i915_gem_object_create(dev, ringbuf->size);

I would assume our ring should be empty when we resume, so it shouldn't matter that stolen gets clobbered. But this patch says otherwise.
Comment 9 Chris Wilson 2016-07-14 20:22:12 UTC
It's hard to tell because we don't record the request->head in the error state (review sigh), but my inkling is that it is actually dying with HEAD before our request, and that stolen merely holds invalid content which triggers the hang. Following that suspicion, we would need to flush the context image to coherent memory before the hibernation image is made.

Quickest way to test that theory would be to reset the HEAD/TAIL in the context image upon resume.
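
The idea of resetting HEAD/TAIL on resume can be illustrated with a toy model. This is only a sketch in Python, not driver code: the `ContextImage` class, its field names and the ring size are hypothetical, and it assumes that "reset" means treating the ring as empty so that nothing stale between HEAD and TAIL is replayed after resume.

```python
from dataclasses import dataclass

RING_SIZE = 4096  # hypothetical ring size in bytes, for illustration only


@dataclass
class ContextImage:
    """Toy stand-in for the HEAD/TAIL pair saved in a context image."""
    ring_head: int
    ring_tail: int


def pending_bytes(ctx, ring_size=RING_SIZE):
    """Bytes the hardware would still execute between HEAD and TAIL."""
    return (ctx.ring_tail - ctx.ring_head) % ring_size


def reset_ring_on_resume(ctx):
    """Treat the ring as empty so no stale ring contents are replayed."""
    ctx.ring_head = 0
    ctx.ring_tail = 0


# A stale image: HEAD was saved before the last request had retired.
ctx = ContextImage(ring_head=0x40, ring_tail=0x100)
stale = pending_bytes(ctx)       # non-zero: old commands would be replayed
reset_ring_on_resume(ctx)
after = pending_bytes(ctx)       # zero: the ring looks empty after resume
```

If the hang disappears with such a reset in place, that would support the theory that a stale HEAD in the saved image is what makes the hardware read clobbered ring contents.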
Comment 10 yann 2016-07-15 10:18:25 UTC
Humberto, can you re-test with https://patchwork.freedesktop.org/patch/98894/ ?
if this works, please sign as "tested-by"
Comment 11 yann 2016-07-15 12:07:53 UTC
Humberto please rather consider this patch set: https://patchwork.freedesktop.org/series/9926/
Comment 12 yann 2016-07-15 12:35:10 UTC
*** Bug 94698 has been marked as a duplicate of this bug. ***
Comment 13 yann 2016-07-15 12:36:52 UTC
Humberto, please also re-run igt@gem_softpin@noreloc-s4 to confirm that this patch fixes the issue.
Comment 14 yann 2016-07-15 13:17:08 UTC
*** Bug 96895 has been marked as a duplicate of this bug. ***
Comment 15 yann 2016-07-15 13:18:08 UTC
Finally, please re-run also igt@gem_exec_suspend
Comment 16 Humberto Israel Perez Rodriguez 2016-07-15 18:51:52 UTC
(In reply to yann from comment #11)
> Humberto please rather consider this patch set:
> https://patchwork.freedesktop.org/series/9926/

Hi, after testing this patch I got the following results:


test case : gem_exec_suspend basic-s4 
platform : bsw / status : pass
platform : bxt / status : fail

Please see the output on APL:
======================================
IGT-Version: 1.15-gee5d5c4 (x86_64) (Linux: 4.7.0-rc7drm-intel-nightly-bug-96526-commit-d416f56-mbox+ x86_64)
(gem_exec_suspend:1553) drmtest-DEBUG: Test requirement passed: fd >= 0
(gem_exec_suspend:1553) drmtest-DEBUG: Test requirement passed: fd >= 0
(gem_exec_suspend:1553) drmtest-DEBUG: Test requirement passed: drmSetMaster(fd) == 0
(gem_exec_suspend:1553) igt-core-DEBUG: Starting subtest: basic-S4
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, 0)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, 0)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_exec_suspend:1553) igt-aux-DEBUG: Test requirement passed: system("rtcwake -n -s 30 -m disk" SQUELCH) == 0
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "disk" using /dev/rtc0 at Fri Jul 15 18:41:53 2016
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:401:
(gem_exec_suspend:1553) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [sig_abort+0x3a]
  #2 [killpg+0x40]
  #3 [__write_nocancel+0x7]
  #4 [igt_drop_caches_set+0xa4]
  #5 [gem_quiescent_gpu+0xcf]
  #6 [run_test+0x3eb]
  #7 [__real_main227+0x2a8]
  #8 [main+0x23]
  #9 [__libc_start_main+0xf0]
  #10 [_start+0x29]
  #11 [<unknown>+0x29]
Subtest basic-S4 failed.
**** DEBUG ****
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, 0)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, 0)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) DEBUG: Test requirement passed: gem_has_ring(fd, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: can_mi_store_dword(gen, engine)
(gem_exec_suspend:1553) DEBUG: Test requirement passed: nengine
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
(gem_exec_suspend:1553) DEBUG: Test requirement passed: __gem_execbuf(fd, &execbuf) == 0
(gem_exec_suspend:1553) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_exec_suspend:1553) igt-aux-DEBUG: Test requirement passed: system("rtcwake -n -s 30 -m disk" SQUELCH) == 0
(gem_exec_suspend:1553) DEBUG: Verifying result
(gem_exec_suspend:1553) igt-aux-CRITICAL: Test assertion failure function sig_abort, file igt_aux.c:401:
(gem_exec_suspend:1553) igt-aux-CRITICAL: Failed assertion: !"GPU hung"
****  END  ****
Subtest basic-S4: FAIL (25.898s)
(gem_exec_suspend:1553) igt-core-DEBUG: Exiting with status code 99


relevant dmesg info
=====================
[   61.796101] [drm] GPU HANG: ecode 9:0:0x5931a887, in gem_exec_suspen [1553], reason: Hang on blitter ring, bsd ring, video enhancement ring, acti
[   61.796103] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   61.796104] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   61.796104] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   61.796122] drm/i915: Resetting chip after gpu hang


kernel used 
=====================
branch : nightly
commit d416f561e8fad56f2c6922ef3a703a5a829a99cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 15 13:03:40 2016 +0100


Gfx stack
=======================
Component         : drm
	tag       : libdrm-2.4.68-14-g8c8d5dd
	commit    : 8c8d5dd
Component         : cairo
	tag       : 1.15.2-44-g1a380ef
	commit    : 1a380ef
Component         : intel-gpu-tools
	tag       : intel-gpu-tools-1.15-127-gee5d5c4
	commit    : ee5d5c4



Attachments
==============================
dmesg_bsw.log
dmesg_bxt.log
gpu_error_bxt
Comment 17 yann 2016-07-15 19:20:51 UTC
Humberto, don't forget the attachments ;)
Comment 18 Chris Wilson 2016-07-15 19:44:09 UTC
I'll take the partial victory.

commit 5ab57c7020697942ea15f45ad14c69cecb164329
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 15 14:56:20 2016 +0100

    drm/i915: Flush logical context image out to memory upon suspend
    
    Before suspend, and especially before building the hibernation image, we
    need to context image to be coherent in memory. To do this we require
    that we perform a context switch to a disposable context (i.e. the
    dev_priv->kernel_context) - when that switch is complete, all other
    context images will be complete. This leaves the kernel_context image as
    incomplete, but fortunately that is disposable and we can do a quick
    fixup of the logical state after resuming.

But it looks like there's more fish in the ocean. Can you please attach the latest error state?
Comment 19 Humberto Israel Perez Rodriguez 2016-07-15 20:08:47 UTC
Created attachment 125090 [details]
dmes_bsw.log
Comment 20 Humberto Israel Perez Rodriguez 2016-07-15 20:09:05 UTC
Created attachment 125091 [details]
dmesg_bxt.log
Comment 21 Humberto Israel Perez Rodriguez 2016-07-15 20:09:19 UTC
Created attachment 125092 [details]
gpu_error_bxt
Comment 22 Chris Wilson 2016-07-15 20:31:55 UTC
We've seen that hang before; it looks like execlists failed to submit the next request to the hardware. It is worth checking whether this hang changes between GuC and plain execlists.
Comment 23 Humberto Israel Perez Rodriguez 2016-07-15 20:41:51 UTC
(In reply to yann from comment #15)
> Finally, please re-run also igt@gem_exec_suspend

Hi Yann:

After testing the whole "gem_exec_suspend" family, the following subtests fail with the configuration from my comment 16:

test cases
=============================
basic-S4
default-uncached-S4
default-cached-S4
render-uncached-S4
render-cached-S4
bsd-uncached-S4
bsd-cached-S4
bsd1-uncached
bsd1-cached
bsd1-uncached-S3
bsd1-cached-S3
bsd1-uncached-S4
bsd1-cached-S4
bsd2-uncached
bsd2-cached
bsd2-uncached-S3
bsd2-cached-S3
bsd2-uncached-S4
bsd2-cached-S4
blt-uncached-S4
blt-cached-S4
vebox-uncached-S4
vebox-cached-S4
Comment 24 Humberto Israel Perez Rodriguez 2016-07-15 20:44:45 UTC
Created attachment 125093 [details]
dmesg_gem_exec_suspend_apl.log

Please see the attachment "dmesg_gem_exec_suspend_apl.log" for my previous comment.
Comment 25 cprigent 2016-07-19 11:36:54 UTC
The test passes on APL with commit 5ab57c7020697942ea15f45ad14c69cecb164329 and the patch that reverts GuC loading and submission.

Platform: APL system
CPU Name : Intel(R) Genuine Processor @ 1.1 GHz (family: 6, model: 12, stepping: 9) 4 cores
QDF : Q6HE
SoC : B1
CRB : Apollo Lake DDR3L RVP1A FAB2
Reworks : R19, R20

Software 
Bios: 144_B10 APLK_B0_IFWI_X64_R_2016_06_27_0956_SPI_RVP1.bin from \\gar\ec\proj\ba\CCG\APL BIOS\External\BIOS_Release\Daily\v144_10_2016_WW27.1\IFWI\IFWI_RVP1_Release\IFWI
KSC: 1.15
Linux distribution: Ubuntu 16.04 64 bits
Kernel: 4.7.0-rc7 895a714 from http://cgit.freedesktop.org/drm-intel/ with https://patchwork.freedesktop.org/patch/99445/ applied
commit 895a714b0b596cfcbe82065f99376ad02d369125
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Jul 18 14:35:39 2016 +0200
drm-intel-nightly: 2016y-07m-18d-12h-35m-15s UTC integration manifest
drm: libdrm-2.4.68-15 2212a64 from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 3a9f628 from git://anongit.freedesktop.org/mesa/mesa
cairo: 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
xserver: xorg-server-1.18.0-460 e8e3675 from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel: 2.99.917-676 26f8ab5 from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva: libva-1.7.0-26 c36971c from git://git.freedesktop.org/git/vaapi/libva
vaapi-intel-driver: 1.7.0-53 bcde10d from git://git.freedesktop.org/git/vaapi/intel-driver
DMC 1.07 from https://01.org/linuxgraphics/downloads/broxton-dmc-1.07
Intel-Gpu-Tools 1.15-127 ee5d5c4 from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git
Comment 26 Chris Wilson 2016-07-19 11:44:30 UTC
(In reply to cprigent from comment #25)
> The test passes on APL with commit 5ab57c7020697942ea15f45ad14c69cecb164329
> and the patch that reverts GuC loading and submission.

Hmm. Can you verify this by leaving the test in a loop for hours and see how long it takes before it eventually fails?
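
Such a soak run can be scripted. The following is a minimal sketch, assuming the test binary sits in the current directory and is invoked as in comment 1. The helper names `run_until_failure` and `run_basic_s4` are hypothetical, and any non-zero exit status (including a skip) is treated as a failure here.

```python
import subprocess
import time


def run_until_failure(run_once, max_iterations=None, deadline_s=None):
    """Call run_once() repeatedly until it returns False.

    Returns the 1-based iteration number of the first failure, or None
    if the iteration limit or the time limit is reached first.
    """
    start = time.monotonic()
    iteration = 0
    while True:
        if max_iterations is not None and iteration >= max_iterations:
            return None
        if deadline_s is not None and time.monotonic() - start >= deadline_s:
            return None
        iteration += 1
        if not run_once():
            return iteration


def run_basic_s4():
    """Run the subtest once; True only if the process exits with status 0."""
    # Command line copied from comment 1; the relative path is an assumption.
    result = subprocess.run(["./gem_exec_suspend", "--r", "basic-S4"])
    return result.returncode == 0
```

Calling `run_until_failure(run_basic_s4, deadline_s=4 * 3600)` would loop for up to four hours and report the iteration at which the first failure occurred.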
Comment 27 cprigent 2016-07-19 12:59:26 UTC
You are right. The GPU Hang is reproduced after around 5 iterations.
Comment 28 cprigent 2016-07-19 13:03:55 UTC
And I tried again with GuC loaded. The GPU Hang is always reproduced at first iteration.
Comment 29 Chris Wilson 2016-07-19 13:05:39 UTC
(In reply to cprigent from comment #27)
> You are right. The GPU Hang is reproduced after around 5 iterations.

Can you please attach the error from the non-guc fail? I want to see if it has the same characteristics. And for completeness grab an error state after a hang with the guc enabled on your machine.
Comment 30 cprigent 2016-07-19 13:33:46 UTC
Created attachment 125142 [details]
APL_sys-class-drm-card0-error_without-guc
Comment 31 cprigent 2016-07-19 13:34:19 UTC
Created attachment 125143 [details]
APL-gem_exec_suspend_basic-S4_without-guc_kern.log
Comment 32 cprigent 2016-07-19 13:55:25 UTC
Created attachment 125145 [details]
APL_sys-class-drm-card0-error_with-guc
Comment 33 cprigent 2016-07-19 13:55:57 UTC
Created attachment 125146 [details]
APL-gem_exec_suspend_basic-S4_with-guc_kern.log
Comment 34 cprigent 2016-07-19 13:56:20 UTC
Created attachment 125147 [details]
APL_gem_exec_suspend_basic-S4_output-with-and-without-guc
Comment 35 yann 2016-07-19 15:03:11 UTC
Updating priority based on the fact that w/o GuC this becomes sporadic
Comment 36 Chris Wilson 2016-07-20 09:04:43 UTC
The failure is reasonably consistent, it looks like the execlists context-switch goes awry, and for whatever reason the guc is more susceptible. It is more likely to be something not quite right in the interrupt routing upon resume (first guess).
Comment 37 Dave Gordon 2016-07-26 16:25:04 UTC
This doesn't really appear to be GuC-related. For example, the APL-gem-exec-suspend-basic-s4 test logs show exactly the same failure (GPU HANG) with or without GuC submission. GuC mode may expose it more quickly but the issue itself is not caused by the GuC.

dmesg_gem_exec_suspend_apl.log shows:

[  331.343467] [drm:intel_guc_setup] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch FAIL, load NONE

In other words, the (correct) firmware is not present. Ditto for dmesg_bxt.log:

[    1.671603] i915 0000:00:02.0: Direct firmware load for i915/bxt_guc_ver8_7.bin failed with error -2

As for dmes_bsw.log, that contains:

[  351.405681] [drm:intel_guc_setup] GuC fw status: path (null), fetch NONE, load NONE

where the (null) path means that this kernel does not support GuC on BSW.
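
The status lines quoted above can be extracted from a kernel log with a small script. This is a minimal sketch that assumes only the "GuC fw status: path ..., fetch ..., load ..." layout seen in these logs; `parse_guc_status` and `summarize` are hypothetical helper names, and a "(null)" path is mapped to `None`.

```python
import re

# Layout taken from the intel_guc_setup lines quoted in this report.
GUC_STATUS_RE = re.compile(
    r"GuC fw status: path (?P<path>.+?), "
    r"fetch (?P<fetch>[A-Z]+), load (?P<load>[A-Z]+)"
)


def parse_guc_status(line):
    """Return path/fetch/load from a GuC status line, or None if absent."""
    match = GUC_STATUS_RE.search(line)
    if match is None:
        return None
    path = match.group("path")
    return {
        # "(null)" is what the kernel prints when no firmware path is set.
        "path": None if path == "(null)" else path,
        "fetch": match.group("fetch"),
        "load": match.group("load"),
    }


def summarize(lines):
    """Collect every GuC status found in an iterable of log lines."""
    return [s for s in map(parse_guc_status, lines) if s is not None]
```

Running `summarize(open("kern.log"))` over each attached log (the file name is a placeholder) gives one entry per boot, which makes it easy to spot cycles where the fetch status flips.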
Comment 38 Dave Gordon 2016-07-26 17:19:37 UTC
To contradict my previous comment, APL-gem_exec_suspend_gpu-hang_kern.log shows something very odd: the GuC firmware has disappeared. Early in the log we have

Jul  5 16:56:27 BXTP5 kernel: [    1.689144] [drm:intel_guc_setup] GuC fw status: path i915/bxt_guc_ver8_7.bin, fetch SUCCESS, load NONE

but 5 minutes later, on the next reboot cycle:

Jul  5 17:01:44 BXTP5 kernel: [    1.711802] i915 0000:00:02.0: Direct firmware load for i915/bxt_guc_ver8_7.bin failed with error -2

The kernel logs for each cycle look generally similar, but the order of some operations is not identical. In particular, the appearance of the MMC devices can come before OR after the attempt to load the GuC firmware.

So this is really a completely different issue, related to the way that devices are initialised asynchronously w.r.t one another. It should be moved to a separate bug report.

.Dave.
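
As a rough aid for reading the "Direct firmware load ... failed with error -2" line: the kernel reports failures as negative errno values, and errno 2 is ENOENT (no such file or directory), which fits a firmware file that could not be found at the moment the load was attempted. A minimal sketch, with the hypothetical helper name decode_firmware_failure and a pattern taken only from the line quoted above:

```python
import errno
import re

# Pattern taken from the "Direct firmware load" line quoted in this comment.
FW_FAIL = re.compile(
    r"Direct firmware load for (?P<path>\S+) failed with error (?P<err>-?\d+)"
)


def decode_firmware_failure(line):
    """Return (firmware path, symbolic errno name), or None if the line does not match."""
    m = FW_FAIL.search(line)
    if m is None:
        return None
    code = abs(int(m.group("err")))  # the kernel prints the negated errno
    return m.group("path"), errno.errorcode.get(code, "UNKNOWN")
```

Applied to each boot cycle in a kern.log, this would show on which boots the file lookup itself failed, as opposed to a fetch that succeeded but was never loaded.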
Comment 39 cprigent 2016-08-10 09:15:46 UTC
(In reply to david.s.gordon from comment #38)
> It should be moved to a separate bug report.

Reported here: bug 97275
Comment 40 cprigent 2016-08-10 09:17:14 UTC
Created attachment 125661 [details]
bsw-gem_exec_suspend__basic_S4-kern.log

I reproduced the GPU hang on a BSW production device

Hardware: Acer Desktop
Motherboard: Aspire XC-704
CPU: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz (Family 6, Model 76, Stepping 3)
GPU: Intel® HD Graphics - Intel Corporation Device 22b1 (rev 21)
Memory card: 1 card 4GB Hynix HMT451S6BFR8APB
HDD: Western Digital WDC WD10EZEX-21M (1TB)

Software:
Bios: R01-A2
Linux distribution: Ubuntu 16.04 64 bits
Kernel: 4.8.0-rc1 d0e3a4b from http://cgit.freedesktop.org/drm-intel/
  commit d0e3a4b2e1743e3ed20327718b5cd069f6a39414
  Author: Daniel Vetter <daniel.vetter@ffwll.ch>
  Date:   Tue Aug 9 22:19:07 2016 +0200
  drm-intel-nightly: 2016y-08m-09d-20h-18m-38s UTC integration manifest
drm: libdrm-2.4.70-2 b214b05 from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 3a9f628 from git://anongit.freedesktop.org/mesa/mesa
cairo: 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
xserver: xorg-server-1.18.0-502 c833c08 from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel: 2.99.917-691 a77397a from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva: libva-1.7.0-44 695f99e from git://git.freedesktop.org/git/vaapi/libva
vaapi-intel-driver: 1.7.0-66 fb7d6f5 from git://git.freedesktop.org/git/vaapi/intel-driver
Intel-Gpu-Tools 1.15-216 9afd545 from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git
Comment 41 cprigent 2016-08-10 09:17:34 UTC
Created attachment 125662 [details]
bsw-gem_exec_suspend__basic_S4-output
Comment 42 cprigent 2016-08-10 09:17:58 UTC
Created attachment 125663 [details]
bsw-error
Comment 43 Humberto Israel Perez Rodriguez 2016-08-29 21:24:10 UTC
Still occurs with the following configuration:


  Software information
============================================
Linux distribution              : Ubuntu 16.04.1 LTS
Architecture                    : 64-bit
Bios revision                   : 148.11
Bios release date               : 07/25/2016
KSC revision                    : 1.15


 Hardware information
============================================
Platform                        : BXT-P
Motherboard model               : Broxton P
Motherboard type                : NOTEBOOK Hand Held
Motherboard manufacturer        : Intel Corp.
CPU family                      : Other
CPU information                 : 06/5c
GPU Card                        : Intel Corporation Device 5a84 (rev 0a) (prog-if 00 [VGA controller])
Memory ram                      : 8 GB
CPU thread                      : 4
CPU core                        : 4

 Firmwares information
============================================
DMC fw loaded                   : yes
DMC version                     : 1.7



Gfx stack
================================================
Component         : drm
	tag       : libdrm-2.4.70-2-gb214b05
	commit    : b214b05 

Component         : cairo
	tag       : 1.15.2
	commit    : db8a7f1 

Component         : intel-gpu-tools
	tag       : intel-gpu-tools-1.15-245-g572a770
	commit    : 572a770


Kernel
================================================
commit f4f46e5544894b2198cdfd5a226ee587d9834cc4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 29 16:09:42 2016 +0200

    drm-intel-nightly: 2016y-08m-29d-14h-09m-23s UTC integration manifest
Comment 44 yann 2016-08-30 07:25:31 UTC
Upgrading priority since it is impacting APL. BTW, Christophe, Humberto, it looks like on FI CI, these tests pass. Please confirm on your side.
Comment 45 Humberto Israel Perez Rodriguez 2016-08-30 21:15:55 UTC
(In reply to yann from comment #44)
> Upgrading priority since it is impacting APL. BTW, Christophe, Humberto, it
> looks like on FI CI, these tests pass. Please confirm on your side.

Hi Yann, this test fails on our side. When the DUT tries to resume, it reboots instead. This is with the following configuration:

 Hardware information
============================================
Platform                        : BXT-P FAB2
Motherboard model               : Broxton P
Motherboard type                : NOTEBOOK Hand Held
Motherboard manufacturer        : Intel Corp.
CPU information                 : 06/5c
GPU Card                        : Intel Corporation Device 5a84 (rev 0a) (prog-if 00 [VGA controller])
Memory ram                      : 16 GB
Maximum memory ram allowed      : 16 GB
CPU thread                      : 4
CPU core                        : 4


 Firmwares information
============================================
DMC fw loaded                   : yes
DMC version                     : 1.7



Gfx Stack
=======================================================
Component         : drm
	tag       : libdrm-2.4.70-2-gb214b05
	commit    : b214b05ccd433c484a6a65e491a1a51b19e4811d 

Component         : cairo
	tag       : 1.15.2
	commit    : db8a7f1697c49ae4942d2aa49eed52dd73dd9c7a 


Component         : intel-gpu-tools
	tag       : intel-gpu-tools-1.15-245-g572a770
	commit    : 572a770f997cae6c3bcb76577e6eac61baa0afa3 

Kernel
=======================================================
commit f4f46e5544894b2198cdfd5a226ee587d9834cc4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 29 16:09:42 2016 +0200

    drm-intel-nightly: 2016y-08m-29d-14h-09m-23s UTC integration manifest
Comment 47 yann 2016-09-01 11:49:20 UTC
Humberto,

Do you still see the hang? If not, this is a different issue; please file a new bug and close this one.
Comment 48 Humberto Israel Perez Rodriguez 2016-09-06 17:27:07 UTC
This test keeps failing on BSW with the following configuration:



 Software information
============================================
Kernel version                  : 4.8.0-rc4-drm-intel-nightly-ww37-commit-507a1d9+
Linux distribution              : Ubuntu 16.04.1 LTS
Architecture                    : 64-bit
Kernel driver in use            : i915
Bios revision                   : 0.33
Bios release date               : 08/12/2015
KSC revision                    : 0.16


 Hardware information
============================================
Platform                        : BSW
Motherboard model               : 10G9000NUS
Motherboard type                : BRASWELL Desktop
Motherboard manufacturer        : LENOVO
CPU family                      : Pentium
CPU information                 : Intel(R) Pentium(R) CPU  N3700  @ 1.60GHz
GPU Card                        : Intel Corporation Device 22b1 (rev 21) (prog-if 00 [VGA controller])
Memory ram                      : 8 GB
CPU thread                      : 4
CPU core                        : 4
Socket                          : Socket BGA1155
Signature                       : Type 0, Family 6, Model 76, Stepping 3

Kernel
============================================
commit 507a1d98d13f18acd36d9b81f4b316a3f79af00e
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Tue Sep 6 16:55:52 2016 +0300

    drm-intel-nightly: 2016y-09m-06d-13h-55m-34s UTC integration manifest

Gfx Stack
==============================================
Component         : drm
	tag       : libdrm-2.4.68
	commit    : fc09c5ab84240e9b6bd0bed01685ef004f56c4fa 

Component         : cairo
	tag       : 1.15.2
	commit    : db8a7f1697c49ae4942d2aa49eed52dd73dd9c7a 

Component         : intel-gpu-tools
	tag       : intel-gpu-tools-1.16
	commit    : a28e9e38a9efc6daf5a08d60d29adcd3e328fe6f
Comment 49 Humberto Israel Perez Rodriguez 2016-09-06 21:37:18 UTC
(In reply to yann from comment #47)
> Humberto,
> 
> Do you still see the hang? If not, this is a different issue; please file a
> new bug and close this one.

Hi Yann :

Regarding APL, I have an issue with the rtcwake tool: run by itself it works well, but when it is launched from the C test program it appears to fail with the following error: "PM swap header not found". I'll investigate in order to reproduce this issue.
Comment 50 cprigent 2016-09-16 13:40:49 UTC
Created attachment 126574 [details]
BDW__gem_exec_suspend--basic-S4__kern.log

GPU Hang is reproduced on BDW with fresh setup

Platform: NUC5i3RYB
CPU: Intel(R) Core(TM) i3-5010U CPU @ 2.10GHz (family 6, model 61, stepping 4)
Motherboard version: H41000-503
GPU: Intel® HD Graphics 5500 - Intel Corporation Broadwell-U Integrated Graphics (rev 09)
Memory: two 4GB card Crucial CT51264BF160B.C16F
SSD: INTEL SSDSC2BW48 480 GB

Software
Bios: RYBDWi35.86A.0358.2016.0606.1423 from https://downloadcenter.intel.com/downloads/eula/26081/BIOS-Update-RYBDWi35-86A-?httpDown=https%3A%2F%2Fdownloadmirror.intel.com%2F26081%2Feng%2FRY0358.bio
Linux distribution: Ubuntu 16.04 64 bits
Kernel: 4.8.0-rc5 bef9c1f from http://cgit.freedesktop.org/drm-intel/
  commit bef9c1f4afe24cfff578d386bde349add65673eb
  Author: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
  Date:   Mon Sep 12 11:35:34 2016 +0300
  drm-intel-nightly: 2016y-09m-12d-08h-35m-02s UTC integration manifest
libdrm-2.4.70-12 2d00869 from git://anongit.freedesktop.org/mesa/drm
mesa: mesa-11.2.2 3a9f628 from git://anongit.freedesktop.org/mesa/mesa
cairo 1.15.2 db8a7f1 from git://anongit.freedesktop.org/cairo
xorg-server-1.18.0-549 527c6ba from git://git.freedesktop.org/git/xorg/xserver
xf86-video-intel 2.99.917-703 15c5ff1 from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
libva-1.7.0-47 2ebf897 from git://git.freedesktop.org/git/vaapi/libva 
vaapi-intel-driver: 1.7.0-117 8c11f51 from git://git.freedesktop.org/git/vaapi/intel-driver
Intel-Gpu-Tools 1.16 f565b6c from http://anongit.freedesktop.org/git/xorg/app/intel-gpu-tools.git
Comment 51 cprigent 2016-09-16 13:41:23 UTC
Created attachment 126575 [details]
BDW_error
Comment 52 cprigent 2016-09-16 13:43:19 UTC
Created attachment 126576 [details]
BDW__gem_exec_suspend--basic-S4__output

Tested 3 times, reproduced 3 times
Comment 53 Chris Wilson 2016-09-16 14:07:55 UTC
(In reply to cprigent from comment #51)
> Created attachment 126575 [details]
> BDW_error

Looks like the GPU resumed execution from before the saved portion of the ring buffer i.e. the context image was stale (RING_HEAD).
Comment 54 Chris Wilson 2016-09-16 14:30:51 UTC
Does

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index df10f4e95736..331c4a5c6822 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -945,7 +945,7 @@ int i915_gem_switch_to_kernel_context(struct drm_i915_private *dev_priv)
                        return PTR_ERR(req);
 
                ret = i915_switch_context(req);
-               i915_add_request_no_flush(req);
+               i915_add_request(req);
                if (ret)
                        return ret;
        }

help?
Comment 55 yann 2016-09-17 07:19:08 UTC
Patch from Chris available at: https://patchwork.freedesktop.org/series/12592/
Comment 56 cprigent 2016-09-21 13:36:50 UTC
Created attachment 126704 [details]
BDW__with-patch-comment-54

(In reply to Chris Wilson from comment #54)
> Does
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
> b/drivers/gpu/drm/i915/i915_gem_context.c
> index df10f4e95736..331c4a5c6822 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -945,7 +945,7 @@ int i915_gem_switch_to_kernel_context(struct
> drm_i915_private *dev_priv)
>                         return PTR_ERR(req);
>  
>                 ret = i915_switch_context(req);
> -               i915_add_request_no_flush(req);
> +               i915_add_request(req);
>                 if (ret)
>                         return ret;
>         }
> 
> help?

No. With this patch applied, I still reproduce the hang on BDW.
Comment 57 cprigent 2016-09-21 13:37:47 UTC
(In reply to yann from comment #55)
> Patch from Chris available at:
> https://patchwork.freedesktop.org/series/12592/

I tried several commits and tags. I'm not able to apply patch number 1.
Comment 58 Chris Wilson 2016-09-21 16:31:00 UTC
commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 21 14:51:08 2016 +0100

    drm/i915/execlists: Reset RING registers upon resume
    
    There is a disparity in the context image saved to disk and our own
    bookkeeping - that is we presume the RING_HEAD and RING_TAIL match our
    stored ce->ring->tail value. However, as we emit WA_TAIL_DWORDS into the
    ring but may not tell the GPU about them, the GPU may be lagging behind
    our bookkeeping. Upon hibernation we do not save stolen pages, presuming
    that their contents are volatile. This means that although we start
    writing into the ring at tail, the GPU starts executing from its HEAD
    and there may be some garbage in between and so the GPU promptly hangs
    upon resume.
    
    Testcase: igt/gem_exec_suspend/basic-S4
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96526
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20160921135108.29574-3-chris@chris-wilson.co.uk
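
The mechanism described in the commit message above can be illustrated with a toy model. This is a simplified sketch, not the driver code: the ring is modelled as a list of dwords, GARBAGE stands in for ring contents that were not preserved across hibernation, and the names gpu_execute and resume_and_run are hypothetical. The reset_ring flag assumes that "reset RING registers" means pointing the restored HEAD at the driver's own tail before new commands are submitted.

```python
NOOP = 0x00000000      # stand-in for a harmless command dword
GARBAGE = 0xDEADBEEF   # stand-in for ring contents lost across hibernation


def gpu_execute(ring, head, tail):
    """Walk the ring from head to tail with wraparound; True means the GPU would hang."""
    size = len(ring)
    while head != tail:
        if ring[head] == GARBAGE:
            return True
        head = (head + 1) % size
    return False


def resume_and_run(ring_size, saved_head, driver_tail, new_cmds, reset_ring):
    """Model a resume from hibernation followed by one submission.

    saved_head  -- HEAD as restored from the (possibly stale) context image
    driver_tail -- the tail the driver tracks in its own bookkeeping
    reset_ring  -- if True, HEAD is forced to driver_tail before submitting
    """
    ring = [GARBAGE] * ring_size          # old ring contents are assumed lost
    head = driver_tail if reset_ring else saved_head
    tail = driver_tail
    for cmd in new_cmds:                  # the driver writes new work at its tail
        ring[tail] = cmd
        tail = (tail + 1) % ring_size
    return gpu_execute(ring, head, tail)
```

With a saved HEAD of 4 and a driver tail of 6, the model reports a hang unless reset_ring is set, because the two dwords between HEAD and the driver's tail are never rewritten after resume.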
Comment 59 Humberto Israel Perez Rodriguez 2016-09-29 18:02:24 UTC
(In reply to Chris Wilson from comment #58)
> commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Sep 21 14:51:08 2016 +0100
> 
>     drm/i915/execlists: Reset RING registers upon resume
>     
>     There is a disparity in the context image saved to disk and our own
>     bookkeeping - that is we presume the RING_HEAD and RING_TAIL match our
>     stored ce->ring->tail value. However, as we emit WA_TAIL_DWORDS into the
>     ring but may not tell the GPU about them, the GPU may be lagging behind
>     our bookkeeping. Upon hibernation we do not save stolen pages, presuming
>     that their contents are volatile. This means that although we start
>     writing into the ring at tail, the GPU starts executing from its HEAD
>     and there may be some garbage in between and so the GPU promptly hangs
>     upon resume.
>     
>     Testcase: igt/gem_exec_suspend/basic-S4
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96526
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Link:
> http://patchwork.freedesktop.org/patch/msgid/20160921135108.29574-3-
> chris@chris-wilson.co.uk


With this commit and the following configuration on BXT, this test passes:

gem_exec_suspend basic-s4

Component         : drm
	tag       : libdrm-2.4.70-15-gabfa680
	commit    : abfa680 

Component         : cairo
	tag       : 1.15.2-58-gb207a93
	commit    : b207a93 

Component         : intel-gpu-tools
	tag       : intel-gpu-tools-1.16-36-gd16318a
	commit    : d16318a 



 Hardware information
============================================
Platform                        : BXT-P
Motherboard model               : Broxton P
Motherboard type                : NOTEBOOK Hand Held
Motherboard manufacturer        : Intel Corp.
CPU family                      : Other
CPU information                 : 06/5c
GPU Card                        : Intel Corporation Device 5a84 (rev 0a) (prog-if 00 [VGA controller])
Memory ram                      : 16 GB
Maximum memory ram allowed      : 16 GB
CPU thread                      : 4
CPU core                        : 4
Comment 60 Humberto Israel Perez Rodriguez 2016-09-29 19:00:52 UTC
(In reply to Humberto Israel Perez Rodriguez from comment #59)
> (In reply to Chris Wilson from comment #58)
> > commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Wed Sep 21 14:51:08 2016 +0100
> > 
> >     drm/i915/execlists: Reset RING registers upon resume
> >     
> >     There is a disparity in the context image saved to disk and our own
> >     bookkeeping - that is we presume the RING_HEAD and RING_TAIL match our
> >     stored ce->ring->tail value. However, as we emit WA_TAIL_DWORDS into the
> >     ring but may not tell the GPU about them, the GPU may be lagging behind
> >     our bookkeeping. Upon hibernation we do not save stolen pages, presuming
> >     that their contents are volatile. This means that although we start
> >     writing into the ring at tail, the GPU starts executing from its HEAD
> >     and there may be some garbage in between and so the GPU promptly hangs
> >     upon resume.
> >     
> >     Testcase: igt/gem_exec_suspend/basic-S4
> >     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96526
> >     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >     Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >     Link:
> > http://patchwork.freedesktop.org/patch/msgid/20160921135108.29574-3-
> > chris@chris-wilson.co.uk
> 
> 
> With this commit and the following configuration on BXT, this test passes:
> 
> gem_exec_suspend basic-s4
> 
> Component         : drm
> 	tag       : libdrm-2.4.70-15-gabfa680
> 	commit    : abfa680 
> 
> Component         : cairo
> 	tag       : 1.15.2-58-gb207a93
> 	commit    : b207a93 
> 
> Component         : intel-gpu-tools
> 	tag       : intel-gpu-tools-1.16-36-gd16318a
> 	commit    : d16318a 
> 
> 
> 
>  Hardware information
> ============================================
> Platform                        : BXT-P
> Motherboard model               : Broxton P
> Motherboard type                : NOTEBOOK Hand Held
> Motherboard manufacturer        : Intel Corp.
> CPU family                      : Other
> CPU information                 : 06/5c
> GPU Card                        : Intel Corporation Device 5a84 (rev 0a)
> (prog-if 00 [VGA controller])
> Memory ram                      : 16 GB
> Maximum memory ram allowed      : 16 GB
> CPU thread                      : 4
> CPU core                        : 4



With the same gfx stack configuration and the same kernel, this test passes on the BDW platform as well:

 Hardware information
============================================
Platform                        : BDW
Motherboard type                : NUC5i5RYB Desktop
CPU family                      : Core i5
CPU information                 : Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
GPU Card                        : Intel Corporation Broadwell-U Integrated Graphics (rev 09) (prog-if 00 [VGA controller])
Memory ram                      : 8 GB
Maximum memory ram allowed      : 16 GB
CPU thread                      : 4
CPU core                        : 2
Socket                          : Socket BGA1168
Signature                       : Type 0, Family 6, Model 61, Stepping 4
Comment 61 Chris Wilson 2016-12-01 08:42:59 UTC
*** Bug 98288 has been marked as a duplicate of this bug. ***
Comment 62 Chris Wilson 2017-02-05 17:39:59 UTC
*** Bug 99632 has been marked as a duplicate of this bug. ***
Comment 63 Chris Wilson 2017-02-05 21:13:53 UTC
*** Bug 99545 has been marked as a duplicate of this bug. ***
Comment 64 Chris Wilson 2017-02-09 07:59:32 UTC
*** Bug 99719 has been marked as a duplicate of this bug. ***
Comment 65 Chris Wilson 2017-02-12 14:05:42 UTC
*** Bug 99771 has been marked as a duplicate of this bug. ***
Comment 66 Chris Wilson 2017-02-14 16:54:26 UTC
*** Bug 99814 has been marked as a duplicate of this bug. ***
Comment 67 Chris Wilson 2017-06-03 15:45:11 UTC
*** Bug 101289 has been marked as a duplicate of this bug. ***
Comment 68 Chris Wilson 2017-07-23 11:21:32 UTC
*** Bug 101884 has been marked as a duplicate of this bug. ***
Comment 69 Chris Wilson 2017-07-28 09:55:25 UTC
*** Bug 101959 has been marked as a duplicate of this bug. ***
Comment 70 solitone 2017-07-29 07:55:41 UTC
(In reply to Chris Wilson from comment #18)
> commit 5ab57c7020697942ea15f45ad14c69cecb164329
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Jul 15 14:56:20 2016 +0100

This seems included in version 4.8-rc2:

$ git describe bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae
v4.8-rc2-641-gbafb2f7d4755

I have kernel version 4.9.30:

$ uname -v
#1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)

but I still experience the bug. This is because, if I understand it right, the commit containing this patch has been reverted in the production kernel:

################
commit 0ee72d8f9b8e17b8e4ccfebc7a25cbc2d395cd6a
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Wed Apr 12 15:49:39 2017 +0200

    Revert "drm/i915/execlists: Reset RING registers upon resume"
    
    This reverts commit f2a0409a08502d64fbe3990354dff5902b08d2fb which is
    commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae upstream.
    
    It was reported to have problems.
################

https://lists.freedesktop.org/archives/intel-gfx/2017-April/125833.html

I therefore wonder whether this means this bug is still there in the production kernel.
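
For reference, the git describe output quoted above has the form <tag>-<N>-g<abbreviated hash>, where N is the number of commits on top of the nearest reachable tag. A small sketch that splits it apart (the helper name parse_describe is hypothetical):

```python
import re

# <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text):
    """Return (tag, commits_since_tag, abbreviated_sha), or None if the form differs."""
    m = DESCRIBE.match(text.strip())
    if m is None:
        return None
    return m.group("tag"), int(m.group("count")), m.group("sha")
```

For the output above this yields the tag v4.8-rc2 and a count of 641, i.e. the commit sits 641 commits after that tag in the tree that was queried. By itself this does not say which release first shipped the commit; git describe --contains on the same hash is one way to look that up.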
Comment 71 solitone 2017-07-30 08:12:05 UTC
Jani Nikula explained the history of this patch [1]:

> (In reply to Damian Martinez Dreyer from comment #0)
> > Description: I have bisected Kernel 4.9.9 and determined the following to be
> > the cause:
> > 
> > commit f2a0409a08502d64fbe3990354dff5902b08d2fb
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Wed Sep 21 14:51:08 2016 +0100
> > 
> >     drm/i915/execlists: Reset RING registers upon resume
> >     
> >     commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae upstream.
> 
> The stable backport has been reverted in v4.9.23 by
> 
> commit 0ee72d8f9b8e17b8e4ccfebc7a25cbc2d395cd6a
> Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Date:   Wed Apr 12 15:49:39 2017 +0200
> 
>     Revert "drm/i915/execlists: Reset RING registers upon resume"
>     
>     This reverts commit f2a0409a08502d64fbe3990354dff5902b08d2fb which is
>     commit bafb2f7d4755bf1571bd5e9a03b97f3fc4fe69ae upstream.
>     
>     It was reported to have problems.
>     
>     Cc: Jani Nikula <jani.nikula@linux.intel.com>
>     Cc: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Cc: Eric Blau <eblau1@gmail.com>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> Thread http://mid.mail-archive.com/1489443835.5568.7.camel@mailbox.org has
> the details.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=100221#c10
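
The relationship between the three hashes in the revert message quoted above (the revert itself, the stable backport it reverts, and the upstream commit) can be extracted mechanically. A minimal sketch, assuming the message follows the "This reverts commit X which is commit Y upstream." wording seen here; the helper name parse_stable_revert is hypothetical:

```python
import re

# Wording taken from the revert message quoted in this comment.
REVERT = re.compile(
    r"This reverts commit (?P<stable>[0-9a-f]{7,40}) which is\s+"
    r"(?:>\s*)?commit (?P<upstream>[0-9a-f]{7,40}) upstream\."
)


def parse_stable_revert(message):
    """Return (stable_backport_sha, upstream_sha), or None if the wording differs."""
    m = REVERT.search(message)
    if m is None:
        return None
    return m.group("stable"), m.group("upstream")
```

Checking the returned upstream hash against the fix from comment 58 confirms that the reverted stable commit is a backport of that same change.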
Comment 72 Chris Wilson 2017-08-06 12:39:39 UTC
*** Bug 102056 has been marked as a duplicate of this bug. ***
Comment 73 Chris Wilson 2017-08-17 07:50:49 UTC
*** Bug 102269 has been marked as a duplicate of this bug. ***
Comment 74 Chris Wilson 2017-09-04 14:17:16 UTC
*** Bug 102534 has been marked as a duplicate of this bug. ***
Comment 75 Chris Wilson 2017-09-18 09:13:44 UTC
*** Bug 102831 has been marked as a duplicate of this bug. ***
Comment 76 Chris Wilson 2017-10-02 16:41:48 UTC
*** Bug 103065 has been marked as a duplicate of this bug. ***
Comment 77 Chris Wilson 2017-10-21 17:57:12 UTC
*** Bug 103394 has been marked as a duplicate of this bug. ***
Comment 78 Chris Wilson 2018-04-11 13:01:21 UTC
*** Bug 103275 has been marked as a duplicate of this bug. ***
