The tests igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-(legacy|atomic) hit the following assertion when running on our shared Haswell pool:

(kms_cursor_legacy:1642) igt-debugfs-CRITICAL: Test assertion failure function igt_assert_crc_equal, file igt_debugfs.c:325:
(kms_cursor_legacy:1642) igt-debugfs-CRITICAL: Failed assertion: !mismatch
(kms_cursor_legacy:1642) igt-debugfs-CRITICAL: Last errno: 25, Inappropriate ioctl for device

Full logs:
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2968/shard-hsw1/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-legacy.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2968/shard-hsw2/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-atomic.html
I can't find the test in the logs added above. However, from this run onward: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3000/shard-hsw1/igt@kms_cursor_legacy@flip-vs-cursor-busy-crc-atomic.html I see:

[  109.729167] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? no, current seqno=255, last=256
[  116.724119] [drm] GPU HANG: ecode 7:0:0xe757feff, in kms_cursor_lega [1543], reason: Hang on rcs0, action: reset

I don't think this test is supposed to hang the GPU.
There is no GPU HANG in these tests on the KBL or SNB shards.
If I remove the make_fb_busy call, the GPU HANG stops, but the test still fails.
Note: it would be good to have conditional dmesg filtering for IGT+piglit, so that only tests that are expected to cause a GPU HANG can pass the dmesg check when one occurs.
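A minimal sketch of what such conditional filtering could look like. The whitelist contents, function names, and the "GPU HANG" pattern are assumptions for illustration, not the actual IGT/piglit implementation:

```python
import re

# Hypothetical whitelist of tests that are expected (and allowed)
# to hang the GPU; real IGT/piglit does not carry this list here.
HANG_ALLOWED = {"igt@gem_reset_stats@ban-render"}

# Match the driver's hang report as it appears in dmesg.
GPU_HANG_RE = re.compile(r"\[drm\] GPU HANG")

def dmesg_check(test_name, dmesg_lines):
    """Return True if the dmesg output is acceptable for this test:
    a GPU HANG only passes when the test is whitelisted for hangs."""
    hang_seen = any(GPU_HANG_RE.search(line) for line in dmesg_lines)
    return (not hang_seen) or (test_name in HANG_ALLOWED)
```

With this, a hang in flip-vs-cursor-busy-crc would still fail the dmesg check, while a whitelisted reset test would not.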
This is now failing on a BDW NUCi5.
<marta_> ickle, https://bugs.freedesktop.org/show_bug.cgi?id=102259 I noticed that make_fb_busy causes a GPU HANG on the HSW shards, but not on SNB nor KBL. The test still fails without make_fb_busy on HSW. But I am still wondering why the GPU HANG happens on HSW.
<mupuf> ickle: it does more than just adding debugging
<ickle> the hang comes from after
<ickle> because the test failed and we don't have spinbatch cleanup after a subtest, it goes on to cause a hang
<mupuf> so nothing changes then
<ickle> try with https://patchwork.freedesktop.org/patch/171768/
For BDW this is a regression that started with drm-tip@6e644626945c. However, on the HSW shards this appears to have been around for longer.
History: https://intel-gfx-ci.01.org/cibuglog/index.html?action_failures_history=209.html
Created attachment 134222 [details] dmesg snippets with inserted kmsg from test
Created attachment 134223 [details] dmesg snippets with inserted kmsg from test, this time clean after reboot
I have tested with Chris' spinbatch cleanup patch, but I still see a GPU HANG on my BDW NUCi5. Also, I still believe that the GPU HANG happens during test execution and not afterwards. I added the following kmsg markers to the test:

@@ -1376,7 +1400,7 @@ static void flip_vs_cursor_busy_crc(igt_display_t *display, bool atomic)
     uint32_t *busy;

     busy = make_fb_busy(display->drm_fd, &fb_info[1]);
-
+    kmsg(KERN_DEBUG "--- AFTER BUSY ---");
     vblank_start = get_vblank(display->drm_fd, pipe, DRM_VBLANK_NEXTONMISS);
     flip_nonblocking(display, pipe, atomic, &fb_info[1]);
@@ -1385,6 +1409,7 @@ static void flip_vs_cursor_busy_crc(igt_display_t *display, bool atomic)
     igt_assert_eq(get_vblank(display->drm_fd, pipe, 0), vblank_start);
     igt_pipe_crc_collect_crc(pipe_crc, &crcs[2]);
+    kmsg(KERN_DEBUG "--- BEFORE finish_fb_busy ---");
     finish_fb_busy(busy);

The attached "dmesg snippets with inserted kmsg from test *" show that the GPU is reset after "AFTER BUSY", but before "BEFORE finish_fb_busy". Also, I sometimes get fence timeout messages:

[   81.824182] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? yes, current seqno=f6, last=f7
[   88.800051] asynchronous wait on fence i915:kms_cursor_lega[1771]/0:11 timed out
[   88.800056] asynchronous wait on fence i915:kms_cursor_lega[1771]/0:11 timed out
[   94.791508] [drm] GPU HANG: ecode 8:0:0xe75ffefe, in kms_cursor_lega [1771], reason: No progress on rcs0, action: reset

Anyway, with this going on I am not surprised that the CRCs don't match.
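The marker trick above works because user space can inject lines into the kernel log by writing "<level>text" to /dev/kmsg, so test progress interleaves with the driver's own dmesg output. A sketch of the technique (the IGT kmsg() helper is assumed to do essentially this in C; the Python names here are made up):

```python
KERN_DEBUG = 7  # kernel log level for debug messages

def kmsg_line(text, level=KERN_DEBUG):
    """Format a message for /dev/kmsg with an explicit log level prefix."""
    return "<%d>%s\n" % (level, text)

def kmsg(text, path="/dev/kmsg"):
    """Write a marker into the kernel log (requires root)."""
    with open(path, "w") as f:
        f.write(kmsg_line(text))
```

The marker then shows up in dmesg with ordinary kernel timestamps, which is what lets the reset be placed between "AFTER BUSY" and "BEFORE finish_fb_busy".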
IRC conversation about this, for documentation:

<marta_> ickle, I have done some testing of your spinbatch cleanup: https://patchwork.freedesktop.org/patch/171768/ for https://bugs.freedesktop.org/show_bug.cgi?id=102259
<marta_> your patch doesn't appear to help for BDW; also, I still believe that the GPU HANG happens during test execution and not afterwards.
<ickle> ok, just an issue elsewhere, the test isn't as nonblocking as it is supposed to be
<ickle> oh, it doesn't use spinbatch anyway
<ickle> so it can still be residual batches on the gpu
<ickle> the other aspect is that a hang in the middle there isn't supposed to alter the crc
<marta_> I am wondering why this only appears to bother HSW and BDW, so I was thinking that there was some issue with the batch itself.
<marta_> which could be easily tested...
<ickle> the batch is nothing
<ickle> it's just a means for making the plane update wait behind the scenes
<marta_> I tested removing everything between make_fb_busy and finish_fb_busy -> no GPU HANG
<marta_> good
<marta_> next step: only remove igt_pipe_crc_collect_crc(pipe_crc, &crcs[2]); --> no GPU HANG
<marta_> also, for completeness, I have previously concluded that by removing the whole busy thing there is no GPU HANG
<marta_> and what is even more puzzling to me is that for BDW this issue started with vsyrjala's drm-tip@6e644626
<ickle> "Forcing a modeset"
<vsyrjala> modeset causes gpu hangs?
<ickle> no, it's the test
<ickle> the test doesn't like being blocked
<ickle> it relies on the nonblocking interface in order to cancel the spinner and let the plane update proceed
<vsyrjala> which test is that?
maybe it shouldn't enable/disable crc capture during the critical section
<ickle> that would be the easy fix
<ickle> kms_cursor_legacy
<ickle> flip_vs_cursor_crc()
<marta_> it is the flip-vs-cursor-busy-crc-(legacy|atomic)
<vsyrjala> i guess the main issue with not toggling crc capture all the time would be making sure it can deal with potentially getting multiple crcs at each step (in case something blocked it for more than a frame)
<vsyrjala> and also making sure we look at the correct crc after a nonblocking commit
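The bookkeeping vsyrjala describes can be sketched roughly as follows: if CRC capture stays enabled across the critical section, each step may drain several (frame, crc) entries and must pick the one belonging to the frame after the nonblocking commit. Function name and tuple layout are invented for illustration; this is not the IGT API:

```python
def crc_for_frame(entries, target_frame):
    """From drained (frame, crc) entries, return the CRC of the first
    frame at or after target_frame, or None if it hasn't arrived yet.

    More than one entry per step is expected when something blocked
    the test for longer than a frame."""
    for frame, crc in sorted(entries):
        if frame >= target_frame:
            return crc
    return None
```

The caller would drain the CRC stream once per step and keep any later entries around for the next comparison, instead of restarting capture each time.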
Well, that it starts at that commit makes the puzzle fall into place: igt_pipe_crc_collect_crc now forces a modeset, which is bad when we hang the GPU.
Created attachment 134329 [details] [review] Fix GPU hang?
(In reply to Maarten Lankhorst from comment #14)
> Created attachment 134329 [details] [review]
> Fix GPU hang?

This fixes the test on BDW.
commit 1043c09ccbcba8e5c2ec5f2a358a442346348bd8 (HEAD -> master, origin/master, origin/HEAD)
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Tue Sep 19 13:31:13 2017 +0200

    tests/kms_cursor_legacy: Do not start collecting CRC after making FB busy