Bug 103334

Summary: [CI] igt@kms_atomic_transition@plane-all-modeset-[transition|transition-fencing] - Softdog incomplete
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Imre Deak <imre.deak>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high CC: bugs, imre.deak, intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: GLK i915 features: display/Other

Description Marta Löfstedt 2017-10-18 09:17:54 UTC
CI_DRM_3253 GLK-shards igt@kms_atomic_transition@plane-all-modeset-transition softdog incomplete

<7>[  700.691095] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 4 - cnt: 0
<7>[  700.691310] [drm:intel_dp_hpd_pulse [i915]] ignoring long hpd on eDP port A
<2>[  700.824913] softdog: Initiating panic
<0>[  700.824932] Kernel panic - not syncing: Software Watchdog Timer expired
<4>[  700.824943] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G     U          4.14.0-rc5-CI-CI_DRM_3253+ #1
<4>[  700.824950] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
<4>[  700.824955] Call Trace:
<4>[  700.824964]  <IRQ>
<4>[  700.824981]  dump_stack+0x68/0x9f
<4>[  700.824991]  panic+0xd4/0x21d
<4>[  700.825001]  ? printk+0x43/0x4b
<4>[  700.825015]  softdog_fire+0x6b/0x70
<4>[  700.825023]  __hrtimer_run_queues+0xed/0x4d0
<4>[  700.825032]  ? softdog_ping+0x50/0x50
<4>[  700.825041]  hrtimer_interrupt+0xb5/0x200
<4>[  700.825053]  smp_apic_timer_interrupt+0x7d/0x2e0
<4>[  700.825061]  apic_timer_interrupt+0x9a/0xa0
<4>[  700.825066]  </IRQ>
<4>[  700.825075] RIP: 0010:cpuidle_enter_state+0x136/0x370
<4>[  700.825080] RSP: 0018:ffffc900000c3e80 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
<4>[  700.825089] RAX: ffff88017aa30040 RBX: 0000000003380fa5 RCX: 0000000000000001
<4>[  700.825095] RDX: 0000000000000000 RSI: ffffffff81d0d8ac RDI: ffffffff81cc1886
<4>[  700.825100] RBP: ffffc900000c3eb8 R08: 0000000000007470 R09: 0000000000000007
<4>[  700.825106] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
<4>[  700.825111] R13: 0000000000000004 R14: ffff88017fca4b30 R15: 000000a3293285d4
<4>[  700.825129]  cpuidle_enter+0x17/0x20
<4>[  700.825138]  call_cpuidle+0x23/0x40
<4>[  700.825145]  do_idle+0x192/0x1e0
<4>[  700.825154]  cpu_startup_entry+0x1d/0x20
<4>[  700.825165]  start_secondary+0x11c/0x140
<4>[  700.825174]  secondary_startup_64+0xa5/0xa5
<0>[  700.825260] Kernel Offset: disabled

data:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3253/shard-glkb4/igt@kms_atomic_transition@plane-all-modeset-transition.html
Comment 2 Imre Deak 2017-10-18 09:48:00 UTC
Looks like the test is running properly, doing modesets while the softwachdog (owatch?) triggers. Could be because it's a long running test without any output. For owatch's sake then we could emit more progress info from these tests.
Comment 3 Jani Saarinen 2017-10-18 11:33:54 UTC
https://patchwork.freedesktop.org/series/32206/
Comment 4 Marta Löfstedt 2017-10-19 06:49:50 UTC
There is also this:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3260/shard-glkb5/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-pri-shrfb-draw-mmap-wc.html

However, it looks like something is wrong because there is no pstore only an OOPS indicating that the softdog was triggered.
Comment 5 Marta Löfstedt 2017-10-19 14:00:14 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3265/shard-glkb2/igt@kms_frontbuffer_tracking@psr-2p-primscrn-cur-indfb-draw-mmap-cpu.html

Note this also has the pattern with only Oops log usually there are 2 logs.
Comment 6 Imre Deak 2017-10-20 08:55:43 UTC
(In reply to Jani Saarinen from comment #3)
> https://patchwork.freedesktop.org/series/32206/

The above patchset won't help, owatch is only pinged by piglit's output and piglit doesn't forward the test's output itself.

The next step would be to check why one modeset takes so long (~1s) and if that can't be fixed consider
- increasing owatch's timeout
- break the subtests to smaller parts
- reduce the number of iterations in the subtests
Comment 7 Jari Tahvanainen 2017-10-25 13:01:48 UTC
I did some quick comparison with my glk and skl. Something is "wrong" GLK since the execution time for test is wayyyy more than on SKL:
GLK - Subtest plane-all-modeset-transition: SUCCESS (3346,877s)
SKL - Subtest plane-all-modeset-transition: SUCCESS (12,187s)
Test seems to be pass on this case, but on CI timer kicks in.
Comment 8 Marta Löfstedt 2017-10-25 13:37:13 UTC
(In reply to Jari Tahvanainen from comment #7)
> I did some quick comparison with my glk and skl. Something is "wrong" GLK
> since the execution time for test is wayyyy more than on SKL:
> GLK - Subtest plane-all-modeset-transition: SUCCESS (3346,877s)
> SKL - Subtest plane-all-modeset-transition: SUCCESS (12,187s)
> Test seems to be pass on this case, but on CI timer kicks in.

Jari, that is the defining criteria for the softdog to trigger.

If Imre want's longer softdog timeout he could just ask Tomi.
Comment 9 Imre Deak 2017-10-25 17:53:01 UTC
(In reply to Marta Löfstedt from comment #8)
> (In reply to Jari Tahvanainen from comment #7)
> > I did some quick comparison with my glk and skl. Something is "wrong" GLK
> > since the execution time for test is wayyyy more than on SKL:
> > GLK - Subtest plane-all-modeset-transition: SUCCESS (3346,877s)
> > SKL - Subtest plane-all-modeset-transition: SUCCESS (12,187s)
> > Test seems to be pass on this case, but on CI timer kicks in.
> 
> Jari, that is the defining criteria for the softdog to trigger.
> 
> If Imre want's longer softdog timeout he could just ask Tomi.

No, I think there is actually something wrong here, ~2 seconds for two modesets+CRC grabbing is too long. Adding also Maarten, but I can take a look at this on my GLK tomorrow.
Comment 10 Marta Löfstedt 2017-10-26 06:30:56 UTC
CI_DRM_3284 GLK-shards igt@gem_exec_whisper@normal incomplete:

NOTE FROM CI_DRM_3284 the CI system owatch has been updated. This is what the new Softdog looks like:


<12>[ 3757.298960] owatch: TIMEOUT!
<12>[ 3757.299104] owatch: timeout for /dev/watchdog0 set to 10 (requested 10)
<6>[ 3757.299319] sysrq: SysRq : Trigger a crash
<1>[ 3757.299428] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[ 3757.299468] IP: sysrq_handle_crash+0x45/0x80
<6>[ 3757.299484] PGD 16ebfb067 P4D 16ebfb067 PUD 164c6a067 PMD 0 
<4>[ 3757.299514] Oops: 0002 [#1] PREEMPT SMP
<4>[ 3757.299530] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul i915 ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 mii mei_me mei prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel
<4>[ 3757.299675] CPU: 0 PID: 10412 Comm: owatch Tainted: G     U  W       4.14.0-rc6-CI-CI_DRM_3284+ #1
<4>[ 3757.299702] Hardware name: Intel Corp. Geminilake/GLK RVP1 DDR4 (05), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017
<4>[ 3757.299733] task: ffff88017a9e4ec0 task.stack: ffffc900007a8000
<4>[ 3757.299755] RIP: 0010:sysrq_handle_crash+0x45/0x80
<4>[ 3757.299771] RSP: 0018:ffffc900007abde0 EFLAGS: 00010282
<4>[ 3757.299791] RAX: ffff88017a9e4ec0 RBX: 0000000000000063 RCX: b946b02c2f5ed020
<4>[ 3757.299814] RDX: 0000000000000001 RSI: 00000000cd19e7cc RDI: ffff88017a9e4ec0
<4>[ 3757.299836] RBP: ffffc900007abde0 R08: ffff88017a9e5780 R09: 0000000000000000
<4>[ 3757.299858] R10: 000000009d08356a R11: 000000006244e854 R12: 0000000000000000
<4>[ 3757.299880] R13: 0000000000000004 R14: ffffffff81ebeb60 R15: ffffc900007abf20
<4>[ 3757.299903] FS:  00007fea1f64c700(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000
<4>[ 3757.299928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 3757.299947] CR2: 0000000000000000 CR3: 000000017abbb000 CR4: 00000000003406f0
<4>[ 3757.299968] Call Trace:
<4>[ 3757.299983]  __handle_sysrq+0x132/0x210
<4>[ 3757.300001]  write_sysrq_trigger+0x51/0x60
<4>[ 3757.300019]  proc_reg_write+0x42/0x70
<4>[ 3757.300036]  __vfs_write+0x28/0x130
<4>[ 3757.300052]  ? rcu_read_lock_sched_held+0x7a/0x90
<4>[ 3757.300070]  ? rcu_sync_lockdep_assert+0x2f/0x60
<4>[ 3757.300088]  ? __sb_start_write+0x10c/0x200
<4>[ 3757.300106]  vfs_write+0xc8/0x1c0
<4>[ 3757.300121]  SyS_write+0x49/0xb0
<4>[ 3757.300140]  entry_SYSCALL_64_fastpath+0x1c/0xb1
<4>[ 3757.300157] RIP: 0033:0x7fea1f1788f0
<4>[ 3757.300170] RSP: 002b:00007ffd0f0abfa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[ 3757.300197] RAX: ffffffffffffffda RBX: ffffffff81491ef3 RCX: 00007fea1f1788f0
<4>[ 3757.300218] RDX: 0000000000000001 RSI: 0000000000f0f250 RDI: 0000000000000003
<4>[ 3757.300240] RBP: ffffc900007abf88 R08: 00007fea1f64c700 R09: 0000000000000410
<4>[ 3757.300262] R10: 00007fea1f441b58 R11: 0000000000000246 R12: 0000000000f0f020
<4>[ 3757.300284] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
<4>[ 3757.300310]  ? __this_cpu_preempt_check+0x13/0x20
<4>[ 3757.300329] Code: 34 e8 90 45 b6 ff 48 c7 c2 2b 76 59 81 be 01 00 00 00 48 c7 c7 80 d5 e4 81 e8 b8 60 b4 ff c7 05 be 47 b7 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 e8 1c 74 b6 ff 84 c0 75 c3 48 c7 
<1>[ 3757.300518] RIP: sysrq_handle_crash+0x45/0x80 RSP: ffffc900007abde0
<4>[ 3757.300537] CR2: 0000000000000000
<4>[ 3757.300553] ---[ end trace 05a1442557e12508 ]---
<0>[ 3757.404711] Kernel panic - not syncing: Fatal exception
<0>[ 3757.404784] Kernel Offset: disabled

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3284/shard-glkb4/igt@gem_exec_whisper@normal.html
Comment 11 Imre Deak 2017-10-30 11:31:58 UTC
The reason for modesets taking long:

[   34.061132] [drm:intel_pps_dump_state [i915]] cur t1_t3 0 t8 0 t9 0 t10 500 t11_t12 6000
[   34.061181] [drm:intel_pps_dump_state [i915]] vbt t1_t3 2000 t8 10 t9 2000 t10 500 t11_t12 6000
[   34.061220] [drm:intel_dp_init_panel_power_sequencer [i915]] panel power up delay 200, power down delay 50, power cycle delay 600
[   34.061259] [drm:intel_dp_init_panel_power_sequencer [i915]] backlight on delay 1, off delay 200

So each modeset incurs all the above delays ~1050ms. I think the test should set the number of iterations taking into account the above.
Comment 12 Jani Saarinen 2017-10-31 15:05:26 UTC
Reference: https://patchwork.freedesktop.org/series/32905/
Comment 13 Marta Löfstedt 2017-11-01 10:38:49 UTC
(In reply to Jani Saarinen from comment #12)
> Reference: https://patchwork.freedesktop.org/series/32905/

Note, the patch is only for the kms_atomic_transition issue. So, if/when Imres fix is merged and fixes the issue. I will remove the kms_atomic_transition and rename this bug. This is my bad I should have renamed the bug as soon as I started filing softdogs on other tests on this bug.
Comment 14 Imre Deak 2017-11-01 10:47:51 UTC
(In reply to Marta Löfstedt from comment #13)
> (In reply to Jani Saarinen from comment #12)
> > Reference: https://patchwork.freedesktop.org/series/32905/
> 
> Note, the patch is only for the kms_atomic_transition issue. So, if/when
> Imres fix is merged and fixes the issue. I will remove the
> kms_atomic_transition and rename this bug. This is my bad I should have
> renamed the bug as soon as I started filing softdogs on other tests on this
> bug.

I think for tracking purposes it's better to open a new bug instead of changing an existing one (think of coming back to this bug later from the commit's Bugzilla: tag). The fix covers the issues reported originally here, that is the kms_atomic_transition@plane-all-modeset-* on GLK.
Comment 15 Marta Löfstedt 2017-11-07 13:48:14 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/shard-glkb4/igt@kms_flip@flip-vs-fences.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/shard-glkb4/dmesg-1510055141_Oops_1.log

<7>[  586.496217] [drm:wait_panel_power_cycle [i915]] Wait for panel power cycle
<12>[  586.684918] owatch: TIMEOUT!
Comment 16 Imre Deak 2017-11-09 12:40:43 UTC
Should be fixed by the following IGT patch

Author: Imre Deak <imre.deak@intel.com>
Date:   Wed Nov 8 21:33:22 2017 +0200

    kms_atomic_transition: Split out modeset tests on internal panels
Comment 17 Marta Löfstedt 2017-11-09 13:53:03 UTC
(In reply to Imre Deak from comment #16)
> Should be fixed by the following IGT patch
> 
> Author: Imre Deak <imre.deak@intel.com>
> Date:   Wed Nov 8 21:33:22 2017 +0200
> 
>     kms_atomic_transition: Split out modeset tests on internal panels

Thanks Imre, I will start monitoring for this once we get Ci runs with the patch integrated.
Comment 18 Marta Löfstedt 2017-11-10 10:33:59 UTC
Imres patch was integrated in: CI_DRM_3326

Now, Imres patch only fixes kms_atomic_transition@plane-all-modeset-[transition|transition-fencing] and I believe that I have filed all GLK Softdogs on this bug.

However, since the patch came in there has been no Softdogs on GLK, so I will take a chance and close and archive the whole bug instead of spending time on time consuming bug splitting.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.