Bug 72917 - [HSW] kms_flip hangs the machine
Summary: [HSW] kms_flip hangs the machine
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 72982
Depends on:
Blocks:
 
Reported: 2013-12-20 15:19 UTC by Paulo Zanoni
Modified: 2017-07-24 22:56 UTC
CC List: 4 users

See Also:
i915 platform:
i915 features:



Description Paulo Zanoni 2013-12-20 15:19:41 UTC
On my HSW ULT machine, with just an eDP monitor attached, kms_flip never finishes when I run it: it stays stuck forever.

Also, I could identify that at least the flip-vs-modeset-vs-hang, flip-vs-panning-vs-hang and flip-vs-dpms-off-vs-modeset tests fail and cause lots of errors in dmesg.

I don't know if this is a recent regression: I spotted the problem today on -nightly.
Comment 1 Paulo Zanoni 2013-12-20 16:11:30 UTC
Seems to consistently freeze after this output:

Beginning flip-vs-panning-vs-hang on crtc 7, connector 10
1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780
...Test assertion failure function exec_nop, file kms_flip.c:648:
Last errno: 16, Device or resource busy
Failed assertion: drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) == 0
Subtest flip-vs-panning-vs-hang: FAIL
Test requirement not met in function run_pair, file kms_flip.c:1319:
Last errno: 16, Device or resource busy
Test requirement: (!(modes)) 
Subtest 2x-flip-vs-panning-vs-hang: SKIP
Beginning flip-vs-bad-tiling on crtc 3, connector 10
Test requirement not met in function set_y_tiling, file kms_flip.c:611:
Last errno: 16, Device or resource busy
Test requirement: (!(__gem_set_tiling(drm_fd, r->handle, 2, fb_info->stride) == 0))
Subtest flip-vs-bad-tiling: SKIP
Test requirement not met in function run_pair, file kms_flip.c:1319:
Last errno: 16, Device or resource busy
Test requirement: (!(modes))
Subtest 2x-flip-vs-bad-tiling: SKIP
Beginning flip-vs-dpms-off-vs-modeset on crtc 3, connector 10 
  1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780
Test assertion failure function wait_for_events, file kms_flip.c:1081:
Last errno: 16, Device or resource busy
Failed assertion: ret > 0 
select timed out or error (ret 0)
Subtest flip-vs-dpms-off-vs-modeset: FAIL
Test requirement not met in function run_pair, file kms_flip.c:1319:
Last errno: 16, Device or resource busy 
Test requirement: (!(modes))
Subtest 2x-flip-vs-dpms-off-vs-modeset: SKIP 
Beginning single-buffer-flip-vs-dpms-off-vs-modeset on crtc 3, connector 10
  1920x1080 60 1920 1966 1996 2080 1080 1082 1086 1112 0xa 0x48 138780

After it prints the last line above, it appears to be still running but never prints anything else.
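
For a long run like the one above, it can help to tally the per-subtest verdicts from a saved log. A minimal sketch, assuming the output was captured to a file and that result lines have the "Subtest <name>: <RESULT>" shape shown above; the helper name summarize_subtests and the log file name are hypothetical, not part of intel-gpu-tools:

```shell
#!/bin/sh
# Hypothetical helper: count subtest verdicts in a saved kms_flip log.
# Assumes result lines look like "Subtest <name>: FAIL" as in the output above.
summarize_subtests() {
    # $1: path to the saved log
    awk '/^Subtest / {
             verdict = $NF            # last field is the verdict (FAIL, SKIP, ...)
             count[verdict]++
         }
         END {
             for (v in count) printf "%s %d\n", v, count[v]
         }' "$1" | sort
}
```

Usage would be something like "sudo ./kms_flip 2>&1 | tee kms_flip.log" followed by "summarize_subtests kms_flip.log". A subtest that hangs never prints a verdict, so it will not appear in the tally.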
Comment 2 Paulo Zanoni 2013-12-20 16:12:55 UTC
Also spotted this once while running the suite:

[ 1650.790400] [drm:i915_error_work_func], resetting chip
[ 1650.791275] [drm] Simulated gpu hang, resetting stop_rings
[ 1650.791395] [drm:init_status_page], render ring hws offset: 0x00211000
[ 1650.791803] [drm:init_pipe_control], render ring pipe control offset: 0x00232000
[ 1650.791925] [drm:init_status_page], bsd ring hws offset: 0x00233000
[ 1650.792349] [drm:init_status_page], blitter ring hws offset: 0x00254000
[ 1650.792851] [drm:init_status_page], video enhancement ring hws offset: 0x00275000
[ 1650.793155] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe A
[ 1650.793161] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[ 1650.793348] [drm:ironlake_update_plane], Writing base 0CB4F000 00000028 10 0 15360
[ 1650.793723] [drm:i915_gem_open],
[ 1650.794657] [drm:gen6_ppgtt_init], Allocated pde space (2M) at GTT entry: 5c70
[ 1650.795087] [drm:i915_error_state_write], Resetting error state
[ 1650.795836] [drm:i915_gem_open],
[ 1650.796453] [drm:gen6_ppgtt_init], Allocated pde space (2M) at GTT entry: 5c70
[ 1650.796794] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[ 1650.796848] ------------[ cut here ]------------
[ 1650.796863] WARNING: CPU: 0 PID: 1467 at drivers/gpu/drm/i915/i915_gem.c:3899 i915_gem_object_pin+0x6fe/0x720 [i915]()
[ 1650.796864] Modules linked in: fuse ip6table_filter ip6_tables ebtable_nat ebtables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp microcode serio_raw pcspkr e1000e ptp mei_me mei i2c_i801 lpc_ich pps_core mfd_core uinput dm_crypt i915 i2c_algo_bit drm_kms_helper crc32_pclmul crc32c_intel drm ghash_clmulni_intel video
[ 1650.796881] CPU: 0 PID: 1467 Comm: kms_flip Not tainted 3.13.0-rc4+ #90
[ 1650.796883] Hardware name: Intel Corporation Shark Bay Client platform/WhiteTip Mountain 1, BIOS HSWLPTU1.86C.0133.R00.1309172123 09/17/2013
[ 1650.796884]  0000000000000009 ffff880080699b20 ffffffff81648376 0000000000000000
[ 1650.796887]  ffff880080699b58 ffffffff81054c6d ffff880035e43828 0000000000000000
[ 1650.796890]  ffff88007de4fc00 ffff880093f7a8a0 ffff88007de4fcf0 ffff880080699b68
[ 1650.796892] Call Trace:
[ 1650.796897]  [<ffffffff81648376>] dump_stack+0x4d/0x66
[ 1650.796900]  [<ffffffff81054c6d>] warn_slowpath_common+0x7d/0xa0
[ 1650.796902]  [<ffffffff81054d4a>] warn_slowpath_null+0x1a/0x20
[ 1650.796910]  [<ffffffffa00ae3ce>] i915_gem_object_pin+0x6fe/0x720 [i915]
[ 1650.796914]  [<ffffffff81190f00>] ? kmem_cache_alloc_trace+0x100/0x210
[ 1650.796928]  [<ffffffffa00ff8fe>] ? intel_ring_begin+0xbe/0x170 [i915]
[ 1650.796936]  [<ffffffffa00b0fa2>] do_switch+0x192/0x540 [i915]
[ 1650.796943]  [<ffffffffa00b2193>] i915_switch_context+0x53/0x80 [i915]
[ 1650.796951]  [<ffffffffa00b4422>] i915_gem_do_execbuffer.isra.22+0x8b2/0x12a0 [i915]
[ 1650.796958]  [<ffffffffa00b52c1>] ? i915_gem_execbuffer2+0x51/0x290 [i915]
[ 1650.796965]  [<ffffffffa00b5316>] i915_gem_execbuffer2+0xa6/0x290 [i915]
[ 1650.796972]  [<ffffffffa0029bb2>] drm_ioctl+0x4f2/0x630 [drm]
[ 1650.796976]  [<ffffffff811cbf3e>] ? mntput_no_expire+0x6e/0x1c0
[ 1650.796978]  [<ffffffff811cbee7>] ? mntput_no_expire+0x17/0x1c0
[ 1650.796981]  [<ffffffff811be970>] do_vfs_ioctl+0x300/0x520
[ 1650.796984]  [<ffffffff810ef73c>] ? __audit_syscall_entry+0x9c/0xf0
[ 1650.796986]  [<ffffffff811bebd5>] SyS_ioctl+0x45/0x80
[ 1650.796988]  [<ffffffff81659792>] system_call_fastpath+0x16/0x1b
[ 1650.796990] ---[ end trace 81da9ea827d78249 ]---
[ 1650.797067] [drm:drm_mode_getresources], CRTC[3] CONNECTORS[4] ENCODERS[3]
[ 1650.797072] [drm:drm_mode_getresources], CRTC[3] CONNECTORS[4] ENCODERS[3]
Comment 3 Paulo Zanoni 2013-12-20 17:55:38 UTC
Subtest flip-vs-panning-vs-hang seems to be the first one to fail. It fails on the following line:

..Test assertion failure function exec_nop, file kms_flip.c:648:
Last errno: 5, Input/output error
Failed assertion: drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) == 0 
Subtest flip-vs-panning-vs-hang: FAIL

On the kernel side, I see the execbuf ioctl returning -EIO and printing the following message:
[drm:i915_gem_validate_context], Context 0 tried to submit while banned

Ben, Mika, any comments on this?
Comment 4 Ben Widawsky 2013-12-22 19:59:25 UTC
(In reply to comment #3)
> Subtest flip-vs-panning-vs-hang seems to be the first one to fail. It fails
> on the following line:
> 
> ..Test assertion failure function exec_nop, file kms_flip.c:648:
> Last errno: 5, Input/output error
> Failed assertion: drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf) ==
> 0 
> Subtest flip-vs-panning-vs-hang: FAIL
> 
> On the kernel side, I see the execbuf ioctl returning -EIO and printing the
> following message:
> [drm:i915_gem_validate_context], Context 0 tried to submit while banned
> 
> Ben, Mika, any comments on this?

Just making sure: that backtrace only showed up once, but you always get the test failure, correct?

The backtrace is likely to hit after a GPU hang, and is somewhat benign as long as the execbuf is resubmitted.

Possible to bisect?
Comment 5 Ben Widawsky 2013-12-23 17:47:24 UTC
*** Bug 72982 has been marked as a duplicate of this bug. ***
Comment 6 Ben Widawsky 2013-12-23 18:28:16 UTC
Bisected:

commit bfca05275a594920ad5111f5a23ec6fadc0d0780
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Dec 18 16:40:38 2013 +0100

    Revert "drm/i915: Do not allow buffers at offset 0"

Assigning to Daniel since he decided to revert in the first place. As I noted in the original commit, I had unexplainable errors when using offset 0.

A simple revert of the SHA fixes the problem for me.
Comment 7 Paulo Zanoni 2013-12-26 15:30:17 UTC
(In reply to comment #6)
> Bisected:
> 
> commit bfca05275a594920ad5111f5a23ec6fadc0d0780
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Dec 18 16:40:38 2013 +0100
> 
>     Revert "drm/i915: Do not allow buffers at offset 0"
> 
> Assigning to Daniel since he decided to revert in the first place. As I
> noted in the original commit, I had unexplainable errors when using offset 0.
> 
> A simple revert of the SHA fixes the problem for me.

I can still get the backtrace even after reverting the revert :(

How can I help debug this?
Comment 8 Paulo Zanoni 2013-12-26 16:51:07 UTC
I couldn't reproduce the problem with plain drm-intel-next-queued (I was trying drm-intel-nightly).

It's really hard to bisect the problem since the whole suite takes 33 minutes to run on my machine, and it seems that the problem only happens when I run all subtests together (instead of each subtest separately).
Comment 9 Paulo Zanoni 2013-12-26 18:27:55 UTC
(In reply to comment #8)
> I couldn't reproduce the problem with plain drm-intel-next-queued (I was
> trying drm-intel-nightly).

Actually, if I run it on dinq (drm-intel-next-queued), the test runs longer and does not hit the backtrace mentioned before, but it dies with exit status 137 at some point later :(
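
As a side note on reading that status: shells conventionally report 128 plus the signal number when a process is terminated by a signal, so 137 would correspond to signal 9 (SIGKILL). The status alone does not say who sent the signal, and a program can also exit with a value above 128 on its own. A small sketch of the decoding, where the helper name describe_exit_status is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: interpret a shell-reported exit status.
# Statuses above 128 are treated as "128 + signal number", which is the
# usual shell convention but not a guarantee.
describe_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal $sig"
    elif [ "$status" -eq 0 ]; then
        echo "exited successfully"
    else
        echo "exited with status $status"
    fi
}
```

For example, running "sudo ./kms_flip; describe_exit_status $?" would report the signal number for a run that ended with status 137.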
Comment 10 Ben Widawsky 2013-12-26 21:55:00 UTC
(In reply to comment #7)
> 
> I can still get the backtrace even after reverting the revert :(
> 
> How can I help debug this?

Backtrace and hang I presume? I wonder if we have two problems then. Reverting this definitely fixes the problem for me.

Can you please try:
http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=vacation-nightly

and in addition, revert bfca05275a594920ad5111f5a23ec6fadc0d0780 on top of that?
Comment 11 Ben Widawsky 2013-12-26 22:00:34 UTC
de-assigning from Daniel until I can confirm what Paulo is seeing.
Comment 12 Ben Widawsky 2013-12-26 22:09:18 UTC
(In reply to comment #10)
> (In reply to comment #7)
> > 
> > I can still get the backtrace even after reverting the revert :(
> > 
> > How can I help debug this?
> 
> Backtrace and hang I presume? I wonder if we have two problems then.
> Reverting this definitely fixes the problem for me.
> 
> Can you please try:
> http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=vacation-nightly
> 
> and in addition, revert bfca05275a594920ad5111f5a23ec6fadc0d0780 on top of
> that?

BTW, to bisect this, I had been using:
sudo ./tests/kms_flip --run-subtest flip-vs-panning-vs-hang
Comment 13 Paulo Zanoni 2013-12-30 16:37:52 UTC
Ok, I reran some tests, and let me clarify:


With plain -nightly, the flip-vs-panning-vs-hang test *fails*, leaves a lot of error messages in dmesg, and occasionally triggers some ugly WARNs at drivers/gpu/drm/i915/i915_gem.c:3899. I was testing by running the whole kms_flip program, not a single test. If I run flip-vs-panning-vs-hang as a single test, it still FAILs.


So I reverted 4b5d8f2ce384d17d56ea7e83bbfbb93e5aff4740, and now if I run just this subtest, it returns SUCCESS, so it seems this revert is already an improvement. If I run the whole kms_flip suite, I will still get some WARNs.


I also verified that on drm-intel-nightly + 4b5d reverted, if I run "sudo ./kms_flip --run-subtest flip-vs-panning-vs-hang" a few times, I may occasionally get the following backtrace:

[  156.513242] WARNING: CPU: 3 PID: 1484 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
[  156.513245] list_del corruption. prev->next should be ffff880090f5fca0, but was ffff88009de7a878
[  156.513247] Modules linked in: fuse ip6table_filter ip6_tables ebtable_nat ebtables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp microcode serio_raw pcspkr e1000e i2c_i801 lpc_ich mei_me mfd_core mei ptp pps_core uinput dm_crypt i915 crc32_pclmul i2c_algo_bit drm_kms_helper drm crc32c_intel ghash_clmulni_intel video
[  156.513284] CPU: 3 PID: 1484 Comm: kms_flip Not tainted 3.13.0-rc4+ #95
[  156.513286] Hardware name: Intel Corporation Shark Bay Client platform/WhiteTip Mountain 1, BIOS HSWLPTU1.86C.0133.R00.1309172123 09/17/2013
[  156.513289]  0000000000000009 ffff880090f5fb38 ffffffff816475df ffff880090f5fb80
[  156.513312]  ffff880090f5fb70 ffffffff81054bed ffff880090f5fc88 ffff880090f5fca0
[  156.513316]  ffff88009de7a840 0000000000000292 ffff88009de78000 ffff880090f5fbd0
[  156.513321] Call Trace:
[  156.513328]  [<ffffffff816475df>] dump_stack+0x4d/0x66
[  156.513333]  [<ffffffff81054bed>] warn_slowpath_common+0x7d/0xa0
[  156.513336]  [<ffffffff81054c5c>] warn_slowpath_fmt+0x4c/0x50
[  156.513340]  [<ffffffff81307671>] __list_del_entry+0xa1/0xd0
[  156.513343] [drm:init_status_page], render ring hws offset: 0x00211000
[  156.513349]  [<ffffffff81099643>] finish_wait+0x43/0x70
[  156.513368]  [<ffffffffa00a7d11>] __wait_seqno+0x381/0x540 [i915]
[  156.513384]  [<ffffffff81099720>] ? abort_exclusive_wait+0xb0/0xb0
[  156.513403]  [<ffffffffa002e873>] ? drm_gem_object_lookup+0x23/0xb0 [drm]
[  156.513407]  [<ffffffff810a0ddd>] ? trace_hardirqs_on+0xd/0x10
[  156.513419]  [<ffffffffa00ae90f>] i915_gem_set_domain_ioctl+0x16f/0x260 [i915]
[  156.513427]  [<ffffffffa002cbb2>] drm_ioctl+0x4f2/0x630 [drm]
[  156.513432]  [<ffffffff811cb9d7>] ? mntput_no_expire+0x17/0x1c0
[  156.513436]  [<ffffffff811be490>] do_vfs_ioctl+0x300/0x520
[  156.513440]  [<ffffffff810ef48c>] ? __audit_syscall_entry+0x9c/0xf0
[  156.513443]  [<ffffffff811be6f5>] SyS_ioctl+0x45/0x80
[  156.513447]  [<ffffffff816589d2>] system_call_fastpath+0x16/0x1b
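
Since the backtrace only shows up occasionally, repeating the subtest by hand is tedious. A rough sketch of automating the repetition, assuming that a growing count of "WARNING: " lines in the kernel log is a good enough signal; the helper names and the LOG_CMD variable are hypothetical, and reading dmesg may require root on some systems:

```shell
#!/bin/sh
# Hypothetical helpers: rerun a command until the kernel log gains a WARNING.
# LOG_CMD is the command used to dump the kernel log (defaults to dmesg).
LOG_CMD=${LOG_CMD:-dmesg}

count_warnings() {
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
    $LOG_CMD | grep -c 'WARNING: ' || true
}

repeat_until_warning() {
    # $1: maximum number of runs; remaining arguments: the command to run
    max=$1; shift
    before=$(count_warnings)
    i=1
    while [ "$i" -le "$max" ]; do
        "$@" || echo "run $i: command exited with status $?"
        after=$(count_warnings)
        if [ "$after" -gt "$before" ]; then
            echo "new kernel WARNING after run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no new kernel WARNING in $max runs"
}
```

It could be invoked as "repeat_until_warning 20 sudo ./kms_flip --run-subtest flip-vs-panning-vs-hang". If the kernel log buffer wraps during the runs, old WARNING lines can drop out and the count comparison becomes unreliable, so this is only a convenience check.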
Comment 14 Paulo Zanoni 2013-12-30 16:42:29 UTC
(In reply to comment #13)
> Ok, I reran some tests, and let me clarify:
> 
> 
> With plain -nightly, the flip-vs-panning-vs-hang test *fails*, leaves a lot
> of error messages on dmesg, and occasionally some ugly WARNs at
> drivers/gpu/drm/i915/i915_gem.c:3899. I was testing by running the whole
> kms_flip program, not a single test. If I run flip-vs-panning-vs-hang as a
> single test, it still FAILs.
> 
> 
> So I reverted 4b5d8f2ce384d17d56ea7e83bbfbb93e5aff4740, 

I meant to say I reverted bfca05275a594920ad5111f5a23ec6fadc0d0780.

Also, I can confirm that if I run kms_flip on drm-intel-next-queued, these problems don't happen. But you need 6 patches on top of intel-gpu-tools if you want the test suite to finish (patches on the mailing list).
Comment 15 Jesse Barnes 2014-01-29 22:45:29 UTC
Looks like some kind of GEM-internal failure.
Comment 16 Chris Wilson 2014-01-29 22:50:22 UTC
Quickest way to check would seem to be i915.enable_ppgtt=1
Comment 17 Jani Nikula 2014-09-05 11:51:41 UTC
Paulo, what's the status of this bug? Please close if it's now fixed.
Comment 18 Paulo Zanoni 2014-09-05 14:15:35 UTC
(In reply to comment #17)
> Paulo, what's the status of this bug? Please close if it's now fixed.

As far as I remember, at that time we had PPGTT as a branch merged on -nightly. So I retested this with i915.enable_ppgtt=2 and couldn't reproduce any of those problems, so I'm closing the bug.
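
When retesting with a module option like this, it is worth confirming that the running kernel actually picked it up. A minimal sketch, assuming the parameter is exported under the usual /sys/module/<module>/parameters/<name> layout (whether enable_ppgtt exists and is readable depends on the kernel version and permissions); the helper name read_module_param is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the current value of a kernel module parameter.
# $1: module name, $2: parameter name, $3: optional sysfs root (default /sys)
read_module_param() {
    root=${3:-/sys}
    file="$root/module/$1/parameters/$2"
    if [ -r "$file" ]; then
        cat "$file"
    else
        # Parameter not exported, module not loaded, or insufficient permissions.
        echo "unavailable"
        return 1
    fi
}
```

For instance, "read_module_param i915 enable_ppgtt" run as root would be expected to print 2 after booting with i915.enable_ppgtt=2, if the parameter is exported on that kernel.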

