Bug 72631

Summary: [BDW]igt/kms_flip/flip-vs-modeset-vs-hang-interruptible randomly causes system hang
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Mika Kuoppala <mika.kuoppala>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: high
CC: ben, intel-gfx-bugs
Version: unspecified
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                           Flags
dmesg                                                 none
dmesg(netconsole)                                     none
dmesg with patch                                      none
drm/i915: fix forcewake counts for gen8               none
drm/i915: Fix forcewake counts for gen8               none
tests/kms_flip: stop only ring when hanging the gpu   none
dmesg(2 patches)                                      none
Fallback to mmio flips                                none
dmesg(commit 8ebc02a)                                 none

Description lu hua 2013-12-12 07:27:39 UTC
System Environment:
--------------------------
Platform: Broadwell
kernel    (drm-intel-nightly)639e4d5e3b595cdf3088e23092afd49532addf56

Bug detailed description:
-----------------------------
It causes a system hang on Broadwell with the -nightly kernel.

output:
Using monotonic timestamps
Beginning flip-vs-modeset-vs-hang-interruptible on crtc 3, connector 15
  1920x1200 60 1920 1968 2000 2080 1200 1203 1209 1235 0x9 0x48 154000
.

(BTW, I didn't get any dmesg. We use a USB network card, which doesn't support netconsole. I will try to use a serial port later.)
Steps:
---------------------------
./kms_flip --run-subtest flip-vs-modeset-vs-hang-interruptible
Comment 1 lu hua 2013-12-13 05:23:22 UTC
Created attachment 90689 [details]
dmesg
Comment 2 lu hua 2013-12-13 05:26:19 UTC
It's a regression. Tested on commit 798183c5; the system doesn't hang.
The latest known good commit: 798183c54799fbe1e5a5bfabb3a8c0505ffd2149
The latest known bad commit: c461562e84d180fb691af57f93a42bd9cc7eb69c
Comment 3 Damien Lespiau 2013-12-13 14:40:58 UTC
Is it possible to bisect to get the offending commit?
Comment 4 Ben Widawsky 2013-12-14 21:12:50 UTC
> 
> (BTW, I didn't get any dmesg. We use USB network card, it doesn't support
> netconsole.I will try to use serial port later.)

On board ethernet should work just fine. Do you require USB ethernet?
Comment 5 lu hua 2013-12-16 08:08:11 UTC
(In reply to comment #2)
> It's regression, Test on commit 798183c5, the system doesn't hang.
> The latest known good commit: 798183c54799fbe1e5a5bfabb3a8c0505ffd2149
> The latest known bad commit: c461562e84d180fb691af57f93a42bd9cc7eb69c

Retested on commit 798183c5: it also randomly causes a system hang.
It randomly causes a system hang on the latest -queued branch as well, in 4 out of 5 runs.
Comment 6 Ben Widawsky 2014-01-09 23:59:36 UTC
Can we please get the netconsole output using the onboard ethernet?
Comment 7 lu hua 2014-01-17 01:26:38 UTC
Created attachment 92250 [details]
dmesg(netconsole)
Comment 8 lu hua 2014-01-17 02:17:04 UTC
The kms_flip subtest flip-vs-panning-vs-hang-interruptible also causes a system hang.
Comment 9 Antti Koskipaa 2014-02-12 13:24:04 UTC
Assigning to Mika. Also, there is a rumor that this patch series might fix it:
http://lists.freedesktop.org/archives/intel-gfx/2014-January/038953.html
Comment 10 lu hua 2014-02-14 07:46:23 UTC
(In reply to comment #9)
> Assigning to Mika. Also, there is a rumor that this patch series might fix
> it:
> http://lists.freedesktop.org/archives/intel-gfx/2014-January/038953.html

Tested this patch; it still happens.
Comment 11 lu hua 2014-02-14 07:46:52 UTC
Created attachment 94048 [details]
dmesg with patch
Comment 12 Mika Kuoppala 2014-02-18 15:57:11 UTC
Created attachment 94292 [details] [review]
drm/i915: fix forcewake counts for gen8
Comment 13 Mika Kuoppala 2014-02-18 15:57:42 UTC
(In reply to comment #12)
> Created attachment 94292 [details] [review]
> drm/i915: fix forcewake counts for gen8

Could you please test with the attached patch if it makes a difference?
Comment 14 Mika Kuoppala 2014-02-18 17:25:24 UTC
Created attachment 94296 [details] [review]
drm/i915: Fix forcewake counts for gen8
Comment 15 lu hua 2014-02-19 05:14:01 UTC
(In reply to comment #14)
> Created attachment 94296 [details] [review]
> drm/i915: Fix forcewake counts for gen8

Tested this patch. The test core dumped, but it doesn't cause a system hang.

output:
Using monotonic timestamps
Beginning flip-vs-modeset-vs-hang-interruptible on crtc 3, connector 10
  1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780
..Test assertion failure function exec_nop, file kms_flip.c:693:
Last errno: 5, Input/output error
Failed assertion: drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) == 0
Subtest flip-vs-modeset-vs-hang-interruptible: FAIL
Test assertion failure function gem_quiescent_gpu, file drmtest.c:156:
Last errno: 5, Input/output error
Failed assertion: drmIoctl((fd), ((((1U) << (((0+8)+8)+14)) | ((('d')) << (0+8)) | (((0x40 + 0x29)) << 0) | ((((sizeof(struct drm_i915_gem_execbuffer2)))) << ((0+8)+8)))), (&execbuf)) == 0
kms_flip: drmtest.c:1113: igt_fail: Assertion `!test_with_subtests || in_fixture' failed.
Aborted (core dumped)
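
The second failed assertion above prints the ioctl request with its macros expanded, which hides the fact that it is the same DRM_IOCTL_I915_GEM_EXECBUFFER2 request as in the first assertion. A minimal Python sketch that rebuilds the number from the printed expression is shown below. The helper names `encode_ioctl` and `decode_ioctl` are hypothetical, and the 64-byte size of `struct drm_i915_gem_execbuffer2` is an assumption derived from a field layout of one u64, six u32 and four u64 members; check it against your own i915_drm.h.

```python
import errno
import struct

# Bit positions taken from the expanded expression in the log:
# nr at bit 0, type at bit 0+8, size at bit 0+8+8, direction at bit 0+8+8+14.
NR_SHIFT, TYPE_SHIFT, SIZE_SHIFT, DIR_SHIFT = 0, 8, 16, 30

# Assumption: struct drm_i915_gem_execbuffer2 is one u64, six u32 and
# four u64 fields, i.e. 64 bytes with no padding.
EXECBUFFER2_SIZE = struct.calcsize("=Q6I4Q")


def encode_ioctl(direction, ioc_type, nr, size):
    """Pack the four fields the same way the expanded macro does."""
    return (
        (direction << DIR_SHIFT)
        | (ord(ioc_type) << TYPE_SHIFT)
        | (nr << NR_SHIFT)
        | (size << SIZE_SHIFT)
    )


def decode_ioctl(request):
    """Split a request number back into its fields."""
    return {
        "dir": (request >> DIR_SHIFT) & 0x3,      # 2-bit direction
        "size": (request >> SIZE_SHIFT) & 0x3FFF,  # 14-bit argument size
        "type": chr((request >> TYPE_SHIFT) & 0xFF),
        "nr": (request >> NR_SHIFT) & 0xFF,
    }


# 1U is the "write" direction, 'd' is the DRM ioctl type, and
# 0x40 + 0x29 is the DRM command base plus the execbuffer2 command index.
REQUEST = encode_ioctl(1, "d", 0x40 + 0x29, EXECBUFFER2_SIZE)

if __name__ == "__main__":
    print(hex(REQUEST), decode_ioctl(REQUEST))
    # "Last errno: 5" in the log corresponds to this symbolic name.
    print(errno.errorcode[5])
```

Both assertions therefore report the same condition: the execbuffer2 ioctl returning errno 5 (EIO), first in exec_nop() and then in gem_quiescent_gpu().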
Comment 16 Mika Kuoppala 2014-02-19 16:55:03 UTC
Created attachment 94371 [details] [review]
tests/kms_flip: stop only ring when hanging the gpu
Comment 17 Mika Kuoppala 2014-02-19 17:08:17 UTC
Please retest with both attached patches applied.
Comment 18 lu hua 2014-02-20 07:48:42 UTC
Created attachment 94412 [details]
dmesg(2 patches)

(In reply to comment #17)
> Please retest with both attached patches applied.

Tested with these 2 patches; it still core dumped.
output:
Using monotonic timestamps
Beginning flip-vs-modeset-vs-hang-interruptible on crtc 3, connector 10
  1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780
...Test assertion failure function exec_nop, file kms_flip.c:693:
Last errno: 5, Input/output error
Failed assertion: drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) == 0
Subtest flip-vs-modeset-vs-hang-interruptible: FAIL
Test assertion failure function gem_quiescent_gpu, file drmtest.c:156:
Last errno: 5, Input/output error
Failed assertion: drmIoctl((fd), ((((1U) << (((0+8)+8)+14)) | ((('d')) << (0+8)) | (((0x40 + 0x29)) << 0) | ((((sizeof(struct drm_i915_gem_execbuffer2)))) << ((0+8)+8)))), (&execbuf)) == 0
kms_flip: drmtest.c:1113: igt_fail: Assertion `!test_with_subtests || in_fixture' failed.
Aborted (core dumped)
Comment 19 Chris Wilson 2014-02-20 09:25:51 UTC
Created attachment 94417 [details] [review]
Fallback to mmio flips

Can you please try this patch to see if it prevents the EIO being reported whilst flipping?
Comment 20 Ville Syrjala 2014-02-20 17:38:44 UTC
I think one problem is that the kernel decides to ban the default context. The test doesn't expect that.
Comment 21 lu hua 2014-02-21 08:55:55 UTC
(In reply to comment #19)
> Created attachment 94417 [details] [review]
> Fallback to mmio flips
> 
> Can you please try this patch to see if prevents the EIO being reported
> whilst flipping?

Tested this patch; it also core dumped.
output:
IGT-Version: 1.5-g06189c6 (x86_64) (Linux: 3.13.0_nightly_72631patch_20140221+ x86_64)
Using monotonic timestamps
Beginning flip-vs-modeset-vs-hang-interruptible on crtc 3, connector 10
  1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780
...Test assertion failure function exec_nop, file kms_flip.c:693:
Last errno: 5, Input/output error
Failed assertion: drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) == 0
Subtest flip-vs-modeset-vs-hang-interruptible: FAIL
Test assertion failure function gem_quiescent_gpu, file drmtest.c:156:
Last errno: 5, Input/output error
Failed assertion: drmIoctl((fd), ((((1U) << (((0+8)+8)+14)) | ((('d')) << (0+8)) | (((0x40 + 0x29)) << 0) | ((((sizeof(struct drm_i915_gem_execbuffer2)))) << ((0+8)+8)))), (&execbuf)) == 0
kms_flip: drmtest.c:1113: igt_fail: Assertion `!test_with_subtests || in_fixture' failed.
Aborted (core dumped)
Comment 22 Chris Wilson 2014-02-21 09:42:00 UTC
Yeah, as Ville pointed out, the core dump comes first from exec_nop() failing, not from the pageflip failing (which would be next).
Comment 23 Chris Wilson 2014-02-21 11:40:54 UTC
This should stop that particular assertion failure:

commit 3db29744f74017a99d1b430b30623dce405ebb1a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 21 09:38:43 2014 +0000

    kms_flip: Try to make hang_gpu() robust against hanging the GPU
    
    On a bad day, hanging the GPU may be terminal. Yet even if the GPU is
    terminally wedged we expect modesetting (and pageflips) to continue.
    That deserves to be a dedicated test, but in the meantime we should
    strive to avoid falling over just because the code is not resilient.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

but there are probably a couple of other spots that require help before we can run the testing against a terminally wedged GPU.
Comment 24 Ville Syrjala 2014-02-21 14:24:29 UTC
I pushed a couple of other things. One patch to prevent the kernel from getting stuck waiting for page flips, and the other to fail the hang tests when they didn't actually test anything.

commit 5f190f2d674222b27eff9f80d14761fde2e8fe7a
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Fri Feb 21 16:08:28 2014 +0200

    kms_flip: Fail the subtest if page flip hang recovery wasn't actually tested
    
    Context banning can prevent the page flip hang tests from actaully
    testing anything, so make the relevant subtests fail in that case.
    
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>


commit 48ba2cdf969698a2520193ec0c9cff99f89fe1f6
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Fri Feb 21 15:14:33 2014 +0200

    kms_flip: Restore rings to running state in unhang_gpu()
    
    If things go bad, make sure the rings aren't left in the stopped state.
    
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Comment 25 lu hua 2014-02-24 06:08:09 UTC
Run on igt commit 8ebc02a54c22b7a83a34c923153861848183cd96; it still fails with a system hang.

output:
IGT-Version: 1.5-g8ebc02a (x86_64) (Linux: 3.13.0_drm-intel-nightly_1be8f2_20140
Using monotonic timestamps
Beginning flip-vs-modeset-vs-hang-interruptible on crtc 3, connector 10
  1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780
.
Comment 26 lu hua 2014-02-24 06:08:45 UTC
Created attachment 94630 [details]
dmesg(commit 8ebc02a)
Comment 27 Mika Kuoppala 2014-03-07 12:43:11 UTC
Should be fixed with:

commit ccc7bed05e27a654db1e9e248ce5fb291c12add1
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Fri Feb 21 16:26:47 2014 +0200

    drm/i915: Don't ban default context when stop_rings!=0

fallout handled in:

https://bugs.freedesktop.org/show_bug.cgi?id=75876
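
To illustrate the idea behind the fix named above, together with Ville's observation in comment 20 that the kernel bans the default context, here is a small Python model of the ban decision. It is a sketch only and not the kernel's code: the `Context` class, the `should_ban` function and the `BAN_PERIOD_S` value are all hypothetical, and the assumption is that a context gets banned when it is blamed for hangs in quick succession, with the default context exempted while rings were stopped deliberately for testing (stop_rings != 0).

```python
from dataclasses import dataclass

# Hypothetical window (seconds) within which a repeated hang counts as
# "hanging too fast". The real kernel value is not stated in this thread.
BAN_PERIOD_S = 12.0


@dataclass
class Context:
    is_default: bool       # True for the default (shared) context
    last_guilty_ts: float  # time the context was last blamed for a hang
    banned: bool = False


def should_ban(ctx, now, stop_rings):
    """Model of the ban decision after the fix (assumed behaviour)."""
    if ctx.banned:
        return True
    if now - ctx.last_guilty_ts > BAN_PERIOD_S:
        return False  # hangs are far enough apart, no ban
    if not ctx.is_default:
        return True   # user contexts are banned for hanging too fast
    # Default context: only ban for real hangs, not for hangs that a test
    # injected by stopping the rings.
    return stop_rings == 0


if __name__ == "__main__":
    default_ctx = Context(is_default=True, last_guilty_ts=100.0)
    print(should_ban(default_ctx, now=105.0, stop_rings=1))
    print(should_ban(default_ctx, now=105.0, stop_rings=0))
```

In this model, the kms_flip hang subtests (which stop the rings on purpose) no longer get the default context banned, so the subsequent execbuffer calls are not rejected for that reason.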
Comment 28 lu hua 2014-03-10 06:52:18 UTC
Verified. Fixed.
Comment 29 Elizabeth 2017-10-06 14:41:24 UTC
Closing old verified.
