Bug 102259 - [CI][HSW] Failed assertion: !mismatch in igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-(legacy|atomic)
Summary: [CI][HSW] Failed assertion: !mismatch in igt@kms_cursor_legacy@flip-vs-cursor...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: highest critical
Assignee: Maarten Lankhorst
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-16 13:25 UTC by Martin Peres
Modified: 2017-09-21 07:25 UTC (History)
2 users (show)

See Also:
i915 platform: HSW
i915 features: display/Other


Attachments
dmesg snippets with inserted kmsg from test (21.26 KB, text/plain)
2017-09-14 12:32 UTC, Marta Löfstedt
no flags Details
dmesg snippets with inserted kmsg from test this time clean after reboot (11.51 KB, text/plain)
2017-09-14 12:50 UTC, Marta Löfstedt
no flags Details
Fix GPU hang? (2.30 KB, patch)
2017-09-19 08:00 UTC, Maarten Lankhorst
no flags Details | Splinter Review

Description Martin Peres 2017-08-16 13:25:45 UTC
The tests igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-(legacy|atomic) hit the following assertion when running on our shared haswell pool:
	
(kms_cursor_legacy:1642) igt-debugfs-CRITICAL: Test assertion failure function igt_assert_crc_equal, file igt_debugfs.c:325:
(kms_cursor_legacy:1642) igt-debugfs-CRITICAL: Failed assertion: !mismatch
(kms_cursor_legacy:1642) igt-debugfs-CRITICAL: Last errno: 25, Inappropriate ioctl for device

Full logs:
 - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2968/shard-hsw1/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2968/shard-hsw2/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-atomic.html
Comment 1 Marta Löfstedt 2017-08-31 12:19:08 UTC
I can't find the test in logs added above. However, from this:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3000/shard-hsw1/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-atomic.html

and forward. I see: 

[  109.729167] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? no, current seqno=255, last=256
[  116.724119] [drm] GPU HANG: ecode 7:0:0xe757feff, in kms_cursor_lega [1543], reason: Hang on rcs0, action: reset

I don't think this test is supposed to HANG the GPU.
Comment 2 Marta Löfstedt 2017-08-31 12:42:04 UTC
The is no GPU HANG on these test on KBL- nor SNB-shards
Comment 3 Marta Löfstedt 2017-09-04 09:54:31 UTC
if I remove the make_fb_busy call, the GPU_HANG stops, but the test still fails.
Comment 4 Marta Löfstedt 2017-09-04 09:57:07 UTC
Note, it would be good to have conditional dmesg filtering for IGT+piglit, to only allow test that is supposed to cause GPU_HANG to pass the dmesg check.
Comment 5 Marta Löfstedt 2017-09-06 06:26:25 UTC
This is now failing on BDW NUCi5
Comment 6 Marta Löfstedt 2017-09-06 06:28:18 UTC
<marta_> ickle, https://bugs.freedesktop.org/show_bug.cgi?id=102259 I noticed that the make_fb_busy cause GPU_HANG on hsw-shards, but not on SNBs nor KBLs. The test still fails without make_fb_busy, on HSW. But I am still wondering on why the GPU_HANG for hsw,  
* pie_ (~pie_@unaffiliated/pie-/x-0787662) has joined
<mupuf> ickle: it does more than just adding debugging
<ickle> the hang comes from after
<ickle> because the test failed and we don't have spinbatches cleanup after a subtest, it goes on to cause a hang
<mupuf> so nothing changes then
<ickle> try with https://patchwork.freedesktop.org/patch/171768/
Comment 7 Marta Löfstedt 2017-09-08 10:45:55 UTC
For BDW this is a regression that started with: drm-tip@6e644626945c

However, for HSW-shards this appear to have been around for a longer time.
Comment 9 Marta Löfstedt 2017-09-14 12:32:27 UTC
Created attachment 134222 [details]
dmesg snippets with inserted kmsg from test
Comment 10 Marta Löfstedt 2017-09-14 12:50:32 UTC
Created attachment 134223 [details]
dmesg snippets with inserted kmsg from test this time clean after reboot
Comment 11 Marta Löfstedt 2017-09-14 12:54:30 UTC
I have tested with Chris spinbatches cleanup patch, but I still see GPU_HANG on my BDW NUCi5.

Also, I still believe that the GPU_HANG is during test execution and not afterwards 

 static void stress(igt_display_t *display,
@@ -1376,7 +1400,7 @@ static void flip_vs_cursor_busy_crc(igt_display_t *display, bool atomic)
                uint32_t *busy;
 
                busy = make_fb_busy(display->drm_fd, &fb_info[1]);
-
+               kmsg(KERN_DEBUG "--- AFTER BUSY ---");
                vblank_start = get_vblank(display->drm_fd, pipe, DRM_VBLANK_NEXTONMISS);
 
                flip_nonblocking(display, pipe, atomic, &fb_info[1]);
@@ -1385,6 +1409,7 @@ static void flip_vs_cursor_busy_crc(igt_display_t *display, bool atomic)
                igt_assert_eq(get_vblank(display->drm_fd, pipe, 0), vblank_start);
 
                igt_pipe_crc_collect_crc(pipe_crc, &crcs[2]);
+               kmsg(KERN_DEBUG "--- BEFORE finish_fb_busy ---");
 
                finish_fb_busy(busy);

I added above to the test to insert kmsgs. The attached "dmesg snippets with inserted kmsg from test *" shows that the GPU is reset after "AFTER BUSY", but before "BEFORE finish_fb_busy". Also, I sometimes get message of timeout fences:

"[   81.824182] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? yes, current seqno=f6, last=f7
[   88.800051] asynchronous wait on fence i915:kms_cursor_lega[1771]/0:11 timed out
[   88.800056] asynchronous wait on fence i915:kms_cursor_lega[1771]/0:11 timed out
[   94.791508] [drm] GPU HANG: ecode 8:0:0xe75ffefe, in kms_cursor_lega [1771], reason: No progress on rcs0, action: reset".

Anyways, with this going on I am not surprised that the CRCs doesn't match.
Comment 12 Marta Löfstedt 2017-09-14 13:44:12 UTC
IRC conversation about this for documentation:

<marta_> ickle, I have done some testing of your spibatch cleanup: https://patchwork.freedesktop.org/patch/171768/ for https://bugs.freedesktop.org/show_bug.cgi?id=102259
<marta_> you patch doesn't appear to help for BDW, also I still believe that the GPU_HANG is during test extecition and not afterwards.
<marta_> spibatch -> spinbatch ... extecition -> execution
<ickle> ok, just an issue elsewhere, the test isn't as nonblocking as it is supposed to be
<ickle> oh, it doesn't use spinbatch anyway
<ickle> so it can still be residual batches on the gpu
<ickle> the other aspect is that a hang in the middle there isn't supposed to alter the crc
<marta_> I am wondering why this only appear to bother HSW and BDW, so I was thinking that there was some issue with the bacth itself. 
<marta_> which could be easily tested...
<ickle> the batch is nothing
<ickle> it's just a means for making the plane update wait behind the scenes
<marta_> I tested to remove everything between make_fb_busy and finish_fb_busy -> no GPU_HANG
<marta_> good
<marta_> next step only remove igt_pipe_crc_collect_crc(pipe_crc, &crcs[2]); --> no GPU_HANG
* YuGiOhJCJ has quit (Quit: YuGiOhJCJ)
<marta_> also, for completion I have previously concluded that by removing the whole busy thing there is no GPU_HANG
* Notify: jsaa is online (Ubuntu Servers (freenode))
* jsaa (jsaa@nat/intel/x-iuwomztxdxwslqob) has joined
<marta_> and what is even more puzzling to me is that for BDW this issue started with vsyrjalas drm-tip@6e644626
<ickle> "Forcing a modeset"
* jarit (~jtahvana@134.134.139.82) has joined
* simonle1 (~Thunderbi@192.55.54.42) has joined
<vsyrjala> modeset causes gpu hangs?
<ickle> no, it's the test
<ickle> the test doesn't like being blocked
<ickle> it relies on the nonblocking interface in order to cancel the spinner and let the plane update proceed
* jewins (~Thunderbi@192.55.55.37) has joined
<vsyrjala> which test is that? maybe it shouldn't enable/disable crc capture during the critical section
<ickle> that would be the easy fix
<ickle> kms_cursor_legacy
<ickle> flip_vs_cursor_crc()
* jewins has quit (Remote host closed the connection)
* _jewins (Thunderbir@nat/intel/x-wiswhomstokkdhzi) has joined
<marta_> it is the flip-vs-cursor-busy-crc-(legacy|atomic)
* _jewins has quit (Remote host closed the connection)
* jewins (Thunderbir@nat/intel/x-oivcjudratazilmy) has joined
<vsyrjala> i guess the main issue with not toggling crc capture all the time would be making sure it can deal with potentially getting multiple crcs at each step (in case something blocked it for more than a frame)
<vsyrjala> and also making sure we look at the correct crc after a nonblocking commit
Comment 13 Maarten Lankhorst 2017-09-19 07:59:26 UTC
Well that it starts at that commit makes the puzzle fall into place..

igt_pipe_crc_collect_crc now forces a modeset, which is bad when we hang the GPU..
Comment 14 Maarten Lankhorst 2017-09-19 08:00:28 UTC
Created attachment 134329 [details] [review]
Fix GPU hang?
Comment 15 Marta Löfstedt 2017-09-19 10:31:07 UTC
(In reply to Maarten Lankhorst from comment #14)
> Created attachment 134329 [details] [review] [review]
> Fix GPU hang?

This fix the test on BDW.
Comment 16 Maarten Lankhorst 2017-09-20 12:59:32 UTC
commit 1043c09ccbcba8e5c2ec5f2a358a442346348bd8 (HEAD -> master, origin/master, origin/HEAD)
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Tue Sep 19 13:31:13 2017 +0200

    tests/kms_cursor_legacy: Do not start collecting CRC after making FB busy


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.