Bug 103178

Summary: [BAT] igt@kms_cursor_legacy@basic-flip-after-cursor-legacy - Incomplete
Product: DRI
Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED DUPLICATE
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: highest
CC: bugs, chris, intel-gfx-bugs
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: GLK
i915 features: display/Other

Description Marta Löfstedt 2017-10-10 06:24:38 UTC
CI_DRM_3196 fi-glk-1 incomplete

Something weird is going on for this run:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/dmesg0.log
dmesg stops at:
<7>[  388.921308] [IGT] kms_cursor_legacy: executing
but there are two oops logs in which the test appears to keep running:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/dmesg-1507288911_Oops_1.log
<14>[  388.993250] [IGT] kms_cursor_legacy: starting subtest basic-flip-after-cursor-legacy
<2>[  390.244184] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:880!


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/dmesg-1507288911_Oops_1.log
<14>[  569.068084] [IGT] drv_module_reload: starting subtest basic-reload
...
<1>[  569.561140] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

The panic log https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/dmesg-1507563092_Panic_2.log
shows the same BUG:
 <2>[  390.244184] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:880!

So, what is the oops in the second oops file?

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/igt@kms_cursor_legacy@basic-flip-after-cursor-legacy.html
Comment 1 Imre Deak 2017-10-10 12:30:06 UTC
For the
<7>[  569.520217] [drm:intel_atomic_commit_tail [i915]] [CRTC:42:pipe A]
<1>[  569.561140] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
<1>[  569.561218] IP: intel_gpu_reset+0xef/0x1b0 [i915]


dev_priv->engines[RCS] is 0 (NULL), which looks like memory corruption. One possibility is that dev_priv is freed too early. Adding Chris for more insight.
Comment 2 Marta Löfstedt 2017-10-10 12:51:27 UTC

<4>[  569.561246] Modules linked in: vgem snd_hda_codec_realtek snd_hda_codec_generic i915(-) x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 mii prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel [last unloaded: snd_hda_intel]
<4>[  569.561324] CPU: 3 PID: 4239 Comm: drv_module_relo Tainted: G     U          4.14.0-rc3-CI-Patchwork_5921+ #1

So https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/dmesg-1507288911_Oops_1.log is from an "old" Patchwork run: the Tainted line shows a 4.14.0-rc3-CI-Patchwork_5921+ kernel, not the CI_DRM_3196 build.
Comment 3 Maarten Lankhorst 2017-10-10 17:13:55 UTC
How often does it happen? Does KASAN find anything?
Comment 4 Marta Löfstedt 2017-10-11 07:47:14 UTC
Note that the data from:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3196/fi-glk-1/dmesg-1507288911_Oops_1.log
<14>[  569.068084] [IGT] drv_module_reload: starting subtest basic-reload
...
<1>[  569.561140] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

should be ignored.

So, this is a "plain" kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:880!

hence it should be a duplicate of bug 103190, which is now itself a duplicate of bug 102035.
Also, a quote from IRC on 2017-10-11:


<marta_> dolphin, ickle: on the APL shards there have now been 3 occurrences of kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:880! in 7 CI runs. See bug: https://bugs.freedesktop.org/show_bug.cgi?id=103190
<dolphin> marta_: does it happen to have IOMMU on, by any chance?
<dolphin> aka. Intel VT-d turned on from BIOS
<dolphin> If you can try running "intel_iommu=off" instead of igfx_off, that'd give a datapoint
<dolphin> marta_: it's the same issue as with https://bugs.freedesktop.org/show_bug.cgi?id=102035
<marta_> thanks dolphin I will dup
<marta_> tsa: is dolphin's suggestion ^^ something to test?
<tsa> need to pick up one APL shard and check bios option / test kernel param
<tsa> looking at the 3206 missing htmls now
<dolphin> marta_: yep, double-checked it, there just have been additions to the file
<dolphin> it may be more convenient for you guys to look at the actual line of code in the file when filing a bug
<dolphin> in this case: GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
<dolphin> so the bug could be best identified as intel_lrc_irq_handler()/GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED)
<dolphin> If we had not been having the same error before we added any pre-emption code, one would think there's an error in the pre-emption code, but it failed before
Comment 5 Marta Löfstedt 2017-10-11 07:52:37 UTC

*** This bug has been marked as a duplicate of bug 102035 ***
