Bug 103989

Summary: [BAT][ELK only] igt@* - dmesg-warn - *ERROR* CPU pipe A FIFO underrun leading to incomplete.
Product: DRI
Component: DRM/Intel
Reporter: Marta Löfstedt <marta.lofstedt>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs, matthew.d.roper, philippe
Status: REOPENED
Severity: normal
Priority: medium
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: GM45, HSW
i915 features: display/Other
Bug Depends on:
Bug Blocks: 105980

Description Marta Löfstedt 2017-11-30 06:47:41 UTC
This is the first occurrence:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3409/fi-elk-e7500/igt@kms_flip@basic-plain-flip.html

Then these are from the latest results on cibuglog:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@gem_exec_suspend@basic-s4-devices.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_cursor_legacy@basic-flip-after-cursor-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@debugfs_test@read_all_entries.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_busy@basic-flip-b.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_cursor_legacy@basic-flip-after-cursor-varying-size.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_cursor_legacy@basic-flip-before-cursor-varying-size.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_flip@basic-plain-flip.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_force_connector_basic@force-connector-state.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_cursor_legacy@basic-flip-before-cursor-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_force_connector_basic@force-load-detect.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_flip@basic-flip-vs-wf_vblank.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_flip@basic-flip-vs-dpms.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_flip@basic-flip-vs-modeset.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@gem_exec_suspend@basic-s3.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3412/fi-elk-e7500/igt@kms_busy@basic-flip-a.html

[  213.977896] [drm:intel_set_cpu_fifo_underrun_reporting [i915]] *ERROR* pipe A underrun
[  219.448527] usb usb3: root hub lost power or was reset
[  219.448583] usb usb4: root hub lost power or was reset
[  219.448636] usb usb5: root hub lost power or was reset
[  219.448701] usb usb1: root hub lost power or was reset
[  219.449222] usb usb6: root hub lost power or was reset
[  219.449267] usb usb7: root hub lost power or was reset
[  219.449311] usb usb8: root hub lost power or was reset
[  219.449358] usb usb2: root hub lost power or was reset
[  219.464051] sd 0:0:0:0: [sda] Starting disk
[  219.477715] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[  219.477920] [drm:pipe_config_err [i915]] *ERROR* mismatch in has_audio (expected yes, found no)
[  219.477923] ------------[ cut here ]------------
[  219.477925] pipe state doesn't match!
[  219.477987] WARNING: CPU: 0 PID: 2937 at drivers/gpu/drm/i915/intel_display.c:11596 intel_atomic_commit_tail+0xc12/0xd10 [i915]
[  219.477989] Modules linked in: i915 snd_hda_codec_realtek snd_hda_codec_generic coretemp snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core e1000e snd_pcm mei_me lpc_ich ptp mei pps_core prime_numbers
[  219.478020] CPU: 0 PID: 2937 Comm: kworker/u8:24 Tainted: G     U  W        4.15.0-rc1-CI-CI_DRM_3412+ #1
[  219.478022] Hardware name: Hewlett-Packard HP Compaq 8000 Elite CMT PC/3647h, BIOS 786G7 v01.02 10/22/2009
[  219.478027] Workqueue: events_unbound async_run_entry_fn
[  219.478030] task: ffff880106f82880 task.stack: ffffc900005c0000
[  219.478063] RIP: 0010:intel_atomic_commit_tail+0xc12/0xd10 [i915]
[  219.478065] RSP: 0018:ffffc900005c3b78 EFLAGS: 00010296
[  219.478068] RAX: 0000000000000019 RBX: ffff880107ec07c0 RCX: 0000000000000006
[  219.478070] RDX: 0000000000001531 RSI: ffffffff81d05d01 RDI: ffffffff81cb7126
[  219.478072] RBP: ffff88010783d3d8 R08: 0000000000000000 R09: 0000000000000001
[  219.478074] R10: ffff880114a84d68 R11: 0000000000000000 R12: ffff880114a84a88
[  219.478076] R13: ffff880114a85d28 R14: 0000000000000000 R15: ffff880107ec0000
[  219.478079] FS:  0000000000000000(0000) GS:ffff88011bc00000(0000) knlGS:0000000000000000
[  219.478081] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  219.478083] CR2: 00007f0216922080 CR3: 0000000001e10000 CR4: 00000000000406f0
[  219.478085] Call Trace:
[  219.478127]  intel_atomic_commit+0x22a/0x2e0 [i915]
[  219.478134]  ? pci_pm_suspend_noirq+0x190/0x190
[  219.478140]  drm_atomic_helper_commit_duplicated_state+0xd4/0x100
[  219.478174]  __intel_display_resume+0x76/0xc0 [i915]
[  219.478208]  intel_display_resume+0xbc/0xf0 [i915]
[  219.478242]  i915_pm_restore+0xc4/0x140 [i915]
[  219.478248]  dpm_run_callback+0x5f/0x310
[  219.478253]  device_resume+0xa3/0x1b0
[  219.478259]  ? dpm_watchdog_set+0x60/0x60
[  219.478266]  async_resume+0x14/0x40
[  219.478270]  async_run_entry_fn+0x2e/0x160
[  219.478275]  process_one_work+0x227/0x650
[  219.478284]  worker_thread+0x48/0x3a0
[  219.478292]  kthread+0x173/0x1b0
[  219.478295]  ? process_one_work+0x650/0x650
[  219.478297]  ? _kthread_create_on_node+0x30/0x30
[  219.478303]  ret_from_fork+0x24/0x30
[  219.478316] Code: ff 0f b6 d0 0f b6 f1 48 c7 c7 68 64 2b a0 e8 46 af e7 e0 0f ff 41 0f b6 4d 09 e9 de f9 ff ff 48 c7 c7 9c 48 29 a0 e8 2e af e7 e0 <0f> ff e9 90 fa ff ff 80 3d 71 d1 13 00 00 0f b6 ca 0f 85 b3 00 
[  219.478419] ---[ end trace 25a5dc15b6f8ad45 ]---
[  220.167384] Setting dangerous option reset - tainting kernel
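
For triage, the underrun messages in a log like the one above can be pulled out mechanically. The following is a minimal Python sketch and not part of the CI tooling: the function name `find_underruns` and the regular expression are assumptions derived only from the two message formats visible in this excerpt.

```python
import re

# Matches the two underrun messages seen in the log above, e.g.
#   [  213.977896] [drm:...] *ERROR* pipe A underrun
#   [  219.477715] [drm:...] *ERROR* CPU pipe A FIFO underrun
# The pattern is an assumption based on this excerpt only.
UNDERRUN_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*\*ERROR\*\s+"
    r"(?:CPU\s+)?pipe\s+(?P<pipe>[A-Z])\s+(?:FIFO\s+)?underrun"
)


def find_underruns(dmesg_text):
    """Return a list of (timestamp_seconds, pipe) for each underrun line."""
    hits = []
    for line in dmesg_text.splitlines():
        m = UNDERRUN_RE.match(line)
        if m:
            hits.append((float(m.group("ts")), m.group("pipe")))
    return hits


# Three lines copied from the log above.
SAMPLE = """\
[  213.977896] [drm:intel_set_cpu_fifo_underrun_reporting [i915]] *ERROR* pipe A underrun
[  219.464051] sd 0:0:0:0: [sda] Starting disk
[  219.477715] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
"""

for ts, pipe in find_underruns(SAMPLE):
    print(f"{ts:.6f}s: underrun on pipe {pipe}")
```

Requiring the `*ERROR*` marker before the pipe name keeps the function name `intel_set_cpu_fifo_underrun_reporting`, which also contains "underrun", from producing a false match on its own.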
Comment 1 Marta Löfstedt 2017-11-30 06:49:32 UTC
The above is followed by this softdog:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3409/fi-elk-e7500/igt@kms_frontbuffer_tracking@basic.html
Comment 2 Marta Löfstedt 2017-11-30 06:51:30 UTC
From IRC:
"<mupuf> CI_DRM_3411 breaks fi-elk-e7500
<mupuf> *ERROR* mismatch in has_audio (expected yes, found no)
<mupuf> vsyrjala: is it from you?
<mupuf> does not look like it
<vsyrjala> caused by hdmi->dp display change on elk
<vsyrjala> fixes already on the list though ;)
<mupuf> marta_: Fun for you tomorrow ^
<mupuf> vsyrjala: thx :)
<vsyrjala> https://patchwork.freedesktop.org/series/34638/"
Comment 3 Marta Löfstedt 2017-11-30 11:54:06 UTC
Note this issue started when DP -> HDMI cable was replaced with DP -> DP.
Comment 4 Ville Syrjala 2017-12-01 15:14:00 UTC
commit 20ff39fa4312dfaee8d1314a208e6a5a3ee51cbc
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Wed Nov 29 18:43:01 2017 +0200

    drm/i915: Disable DP audio for g4x

fixes the has_audio issue. If the underruns persist we need a new bug for them.
Comment 5 Marta Löfstedt 2017-12-04 08:59:08 UTC
(In reply to Ville Syrjala from comment #4)
> commit 20ff39fa4312dfaee8d1314a208e6a5a3ee51cbc
> Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Date:   Wed Nov 29 18:43:01 2017 +0200
> 
>     drm/i915: Disable DP audio for g4x
> 
> fixes the has_audio issue. If the underruns persist we need a new bug for
> them.

The audio issue is a wrong analysis from Martin; the issue has always started with FIFO underruns.

The pattern of FIFO underruns followed by a softdog on:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3446/fi-elk-e7500/igt@kms_frontbuffer_tracking@basic.html
still persists. Please read the information on what the bug is actually about before stating that the issue is resolved.
Comment 6 Marta Löfstedt 2017-12-04 12:08:00 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3411/fi-elk-e7500/igt@debugfs_test@read_all_entries.html

is actually the first occurrence.
Comment 8 Marta Löfstedt 2018-02-01 14:08:55 UTC
NOTE: The fi-elk-e7500 hardware was changed: the DP monitor was swapped back to a DP-HDMI dongle and monitor. Now the issue can't be reproduced.

From IRC:
<marta_malmoe> danvet, tsa, I don't know I think there is something wrong with the igt@debugfs_test@read_all_entries where the elk thing starts. Also the SNB issue I linked above flip-flops incomplete on that test.
<danvet> would indicate a nasty bug somewhere in one of the debugfs files we're using
<danvet> might need to improve the tracing of the testcase to better understand where it fails
<ickle> it's the always the crc debugfs iirc
<danvet> the old or new ones?
<danvet> and I thought krisman had some patches for those
<danvet> or mlankhorst
<ickle> were not the old ones removed? I thought mlankhost already excised them
Comment 9 Francesco Balestrieri 2018-06-12 12:39:51 UTC
Last occurred 2 days, 20 hours ago according to cibuglogger
Comment 10 Daniel Vetter 2018-07-04 19:30:00 UTC
Note if bug #105225 is really holding up, then we should remove the flip_done filter from cibuglog for this one here.
Comment 11 CI Bug Log 2019-01-31 14:44:13 UTC
A CI Bug Log filter associated to this bug has been updated:

{- fi-elk-e7500: igt@debugfs_test@read_all_entries - INFO: Timed out: reading sysfs entry -}
{+  fi-hsw-peppy fi-elk-e7500: igt@debugfs_test@read_all_entries - INFO: Timed out: reading sysfs entry +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_202/fi-hsw-peppy/igt@debugfs_test@read_all_entries_display_on.html
Comment 12 CI Bug Log 2019-05-23 15:06:59 UTC
A CI Bug Log filter associated to this bug has been updated:

{- fi-elk-e7500: All tests - *ERROR* CPU pipe A FIFO underrun -}
{+ fi-elk-e7500: All tests - *ERROR* CPU pipe A FIFO underrun +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_288/fi-elk-e7500/igt@kms_cursor_crc@pipe-b-cursor-64x64-sliding.html
Comment 13 Matt Roper 2019-08-19 22:04:08 UTC
All of the CI links in this report are dead now, but my understanding is that when this bug was first reported, the pattern was that watermark underruns were seen and then the machine would hang shortly thereafter, leaving tests incomplete.  Digging through the more recent CI results for this defect, it looks like the cibuglog filters may be a bit overzealous now.  There have been some kernel panics resulting in incomplete runs recently, but as far as I can see, those panics are caused by the snd_hda driver (a NULL pointer dereference), which would be more appropriately attached to https://bugzilla.kernel.org/show_bug.cgi?id=204565 (CI issue #1670).

There are also still very occasional watermark underruns at the very end of cursor tests when the fb console is restored (last instance seen was ~1 week ago), but those are non-fatal now and the system continues to run properly after the underrun is detected.  Watermark underruns by themselves can lead to brief flickering/corruption, but given how rarely they're showing up now (and that they're only showing up when the fbcon is restored) it's likely that they'd be completely unnoticed in general usage.  Dropping the priority/severity of this ticket down to medium.
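
The distinction drawn above, between incomplete runs caused by an snd_hda NULL pointer dereference and runs that only show a FIFO underrun, could be sketched roughly as follows. This is a hedged illustration in Python, not how cibuglog filters work: the function name `triage`, the category labels, and the marker strings are assumptions. Only the underrun text and the `RIP: ... [module]` line layout are taken from the log in this report.

```python
import re

# Assumed markers. The underrun text and the RIP line layout
# ("RIP: 0010:func+0x.../0x... [module]") come from the log in this report;
# the NULL-dereference substring is an assumption about the oops message.
UNDERRUN = "FIFO underrun"
NULL_DEREF = "NULL pointer dereference"
RIP_MODULE_RE = re.compile(r"RIP: \S+ \[(?P<module>[^\]]+)\]")


def triage(dmesg_text):
    """Classify a log as 'snd_hda-panic', 'underrun', or 'other'."""
    has_null_deref = NULL_DEREF in dmesg_text
    # Inspect the faulting module on RIP lines instead of searching the whole
    # text, because "Modules linked in:" lists snd_hda_* on healthy runs too.
    rip_modules = [m.group("module") for m in RIP_MODULE_RE.finditer(dmesg_text)]
    if has_null_deref and any(mod.startswith("snd_hda") for mod in rip_modules):
        return "snd_hda-panic"
    if UNDERRUN in dmesg_text:
        return "underrun"
    return "other"
```

Applied to the log in the description, this returns `'underrun'`: the only RIP line points at `[i915]`, and the snd_hda modules appear only in the "Modules linked in:" list.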
