Bug 102973 - [CI][HSW] igt@drv_selftest@live_hangcheck - SW HANG - Incomplete
Summary: [CI][HSW] igt@drv_selftest@live_hangcheck - SW HANG - Incomplete
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 102970 102974
Depends on:
Blocks:
 
Reported: 2017-09-25 12:24 UTC by Marta Löfstedt
Modified: 2018-03-02 15:35 UTC

See Also:
i915 platform: HSW
i915 features:


Description Marta Löfstedt 2017-09-25 12:24:43 UTC
On CI_DRM_3126, the new IGT test
igt@drv_selftest@live_hangcheck triggers the soft-lockup watchdog:

<7>[  313.752576] [drm:intelfb_create [i915]] no BIOS fb, allocating a new one
<3>[  314.725474] Failed to start request b
<0>[  348.422049] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [swapper/4:0]
<4>[  348.422074] Modules linked in: i915(+) snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mei_me mii lpc_ich mei prime_numbers [last unloaded: i915]
<4>[  348.422146] irq event stamp: 15454047
<4>[  348.422152] hardirqs last  enabled at (15454046): [<ffffffff819107bd>] restore_regs_and_iret+0x0/0x1d
<4>[  348.422156] hardirqs last disabled at (15454047): [<ffffffff819117e5>] apic_timer_interrupt+0x95/0xa0
<4>[  348.422161] softirqs last  enabled at (12679772): [<ffffffff81085251>] _local_bh_enable+0x21/0x40
<4>[  348.422165] softirqs last disabled at (12679773): [<ffffffff81085645>] irq_exit+0xb5/0xd0
<4>[  348.422169] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G     U          4.14.0-rc1-CI-CI_DRM_3126+ #1
<4>[  348.422173] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4>[  348.422176] task: ffff88040d5a8040 task.stack: ffffc900000ac000
<4>[  348.422180] RIP: 0010:__do_softirq+0xa3/0x4e2
<4>[  348.422183] RSP: 0018:ffff88041fb03f58 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
<4>[  348.422190] RAX: 00000000ffffffff RBX: ffff88040d5a8040 RCX: 0000000000000000
<4>[  348.422194] RDX: 0000000000000000 RSI: ffffffff81d0ddbc RDI: ffffffff81cc1bee
<4>[  348.422197] RBP: ffff88041fb03fb8 R08: 0000000000000000 R09: 0000000000000000
<4>[  348.422200] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[  348.422203] R13: ffff88040d5a8040 R14: 0000000000000000 R15: 0000000000000000
<4>[  348.422207] FS:  0000000000000000(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
<4>[  348.422211] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  348.422214] CR2: 00007fc893427000 CR3: 0000000402329002 CR4: 00000000001606e0
<4>[  348.422217] Call Trace:
<4>[  348.422221]  <IRQ>
<4>[  348.422228]  irq_exit+0xb5/0xd0
<4>[  348.422232]  smp_apic_timer_interrupt+0x9e/0x2e0
<4>[  348.422236]  apic_timer_interrupt+0x9a/0xa0
<4>[  348.422239]  </IRQ>
<4>[  348.422244] RIP: 0010:tick_nohz_idle_exit+0x114/0x180
<4>[  348.422248] RSP: 0018:ffffc900000afed0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
<4>[  348.422254] RAX: ffff88040d5a8040 RBX: ffff88040d5a8040 RCX: 0000000000000001
<4>[  348.422258] RDX: 0000000000000000 RSI: ffffffff81d0ddbc RDI: ffffffff81cc1bee
<4>[  348.422261] RBP: ffffc900000afed8 R08: 0000000000000000 R09: 0000000000000001
<4>[  348.422264] R10: 0000000000000000 R11: 0000000000000000 R12: 0000004b38d4ed68
<4>[  348.422268] R13: ffff88040d5a8040 R14: 0000000000000000 R15: 0000000000000000
<4>[  348.422277]  do_idle+0x13d/0x1e0
<4>[  348.422282]  cpu_startup_entry+0x1d/0x20
<4>[  348.422286]  start_secondary+0x11c/0x140
<4>[  348.422291]  secondary_startup_64+0xa5/0xa5
<4>[  348.422299] Code: 00 00 e8 11 ac 7c ff c7 45 c8 0a 00 00 00 48 89 5d a8 48 c7 c0 40 86 01 00 65 c7 00 00 00 00 00 e8 23 76 7c ff fb b8 ff ff ff ff <48> c7 45 c0 00 51 e0 81 0f bc 45 d4 83 c0 01 89 45 d0 75 6a e9 
<0>[  348.422498] Kernel panic - not syncing: softlockup: hung tasks
<4>[  348.422517] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G     U       L  4.14.0-rc1-CI-CI_DRM_3126+ #1
<4>[  348.422542] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4>[  348.422564] Call Trace:
<4>[  348.422574]  <IRQ>
<4>[  348.422585]  dump_stack+0x68/0x9f
<4>[  348.422599]  panic+0xd4/0x21d
<4>[  348.422614]  watchdog_timer_fn+0x289/0x290
<4>[  348.422631]  __hrtimer_run_queues+0xed/0x4d0
<4>[  348.422646]  ? __touch_watchdog+0x30/0x30
<4>[  348.422662]  hrtimer_interrupt+0xc1/0x220
<4>[  348.422679]  smp_apic_timer_interrupt+0x7d/0x2e0
<4>[  348.422695]  apic_timer_interrupt+0x9a/0xa0
<4>[  348.422710] RIP: 0010:__do_softirq+0xa3/0x4e2
<4>[  348.422724] RSP: 0018:ffff88041fb03f58 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
<4>[  348.422750] RAX: 00000000ffffffff RBX: ffff88040d5a8040 RCX: 0000000000000000
<4>[  348.422771] RDX: 0000000000000000 RSI: ffffffff81d0ddbc RDI: ffffffff81cc1bee
<4>[  348.422793] RBP: ffff88041fb03fb8 R08: 0000000000000000 R09: 0000000000000000
<4>[  348.422814] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[  348.422835] R13: ffff88040d5a8040 R14: 0000000000000000 R15: 0000000000000000
<4>[  348.422861]  ? __do_softirq+0x9d/0x4e2
<4>[  348.422878]  irq_exit+0xb5/0xd0
<4>[  348.422890]  smp_apic_timer_interrupt+0x9e/0x2e0
<4>[  348.422906]  apic_timer_interrupt+0x9a/0xa0
<4>[  348.422920]  </IRQ>
<4>[  348.422931] RIP: 0010:tick_nohz_idle_exit+0x114/0x180
<4>[  348.422947] RSP: 0018:ffffc900000afed0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
<4>[  348.422973] RAX: ffff88040d5a8040 RBX: ffff88040d5a8040 RCX: 0000000000000001
<4>[  348.422994] RDX: 0000000000000000 RSI: ffffffff81d0ddbc RDI: ffffffff81cc1bee
<4>[  348.423015] RBP: ffffc900000afed8 R08: 0000000000000000 R09: 0000000000000001
<4>[  348.423036] R10: 0000000000000000 R11: 0000000000000000 R12: 0000004b38d4ed68
<4>[  348.423058] R13: ffff88040d5a8040 R14: 0000000000000000 R15: 0000000000000000
<4>[  348.423084]  do_idle+0x13d/0x1e0
<4>[  348.423099]  cpu_startup_entry+0x1d/0x20
<4>[  348.423113]  start_secondary+0x11c/0x140
<4>[  348.423128]  secondary_startup_64+0xa5/0xa5
<0>[  348.423382] Kernel Offset: disabled


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3126/shard-hsw6/igt@drv_selftest@live_hangcheck.html
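
For triage, the soft-lockup line in a dump like the one above can be picked out mechanically. The following is a minimal sketch, assuming the `<level>[ timestamp] message` layout shown in this log; the function name `find_soft_lockups` and the returned dictionary keys are hypothetical and not part of any CI tooling.

```python
import re

# Matches lines such as:
# <0>[  348.422049] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [swapper/4:0]
# The pattern is an assumption based on the log in this report.
SOFT_LOCKUP = re.compile(
    r"^<(?P<level>\d)>\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in the given log lines."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.match(line.strip())
        if match:
            hits.append({
                "timestamp": float(match.group("ts")),
                "cpu": int(match.group("cpu")),
                "stuck_seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return hits


# Three lines copied from the log above.
SAMPLE = [
    "<3>[  314.725474] Failed to start request b",
    "<0>[  348.422049] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [swapper/4:0]",
    "<0>[  348.422498] Kernel panic - not syncing: softlockup: hung tasks",
]

if __name__ == "__main__":
    for hit in find_soft_lockups(SAMPLE):
        print(hit)
```

Only the watchdog line is matched; the "Failed to start request" and panic lines are deliberately ignored, so extend the pattern if those are needed too.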
Comment 1 Marta Löfstedt 2017-09-25 12:39:52 UTC
NOTE: the referenced pstore file is identical to the one in bug 102974.
Comment 2 Chris Wilson 2017-09-25 13:17:05 UTC
*** Bug 102974 has been marked as a duplicate of this bug. ***
Comment 3 Marta Löfstedt 2017-09-25 13:18:06 UTC
<marta_> Adrinael, you mentioned something about the wrong testlist being used for shards on CI_DRM_3026, could you elaborate? I have already filed bugs for this run...
<Adrinael> CI_DRM_3126
<Adrinael> It was running everything ever on accident
<Adrinael> ivyl, ^ right?
<marta_> but it was only 3 new drv_selftests and 3 new gem tests, for sure we have more than that blacklisted
<ivyl> yep, due to the elaborate nature of the deployment method, and streamlining it to use just "make install", an inevitable error occurred on the human-Jenkins boundary.
<ivyl> marta_: it ran with ALL ALL, but it got cancelled pretty quickly
<ivyl> and then rerun properly
<ivyl> what you see is the merge of both
<Adrinael> tools_test@* got "broken" by make install -deployment btw
<ivyl> as Jenkins hasn't cleaned the staging area for results
<Adrinael> marta_, if you file a bug on igt@tools_test@tools_test, make it an IGT bug
<ivyl> so sorry about confusion, it wasn't intended and I hoped the rerun will fix it
<ivyl> but as you can see we have the few leftovers
<marta_> OK, I will archive if needed when I get results from the next run.
<ivyl> results from -27 already came in and they look clean, we also should have results for -28 in half an hour or so
Comment 4 Chris Wilson 2017-09-25 13:27:10 UTC
Fix here: https://patchwork.freedesktop.org/series/30419/
Comment 5 Chris Wilson 2017-09-25 13:29:31 UTC
*** Bug 102970 has been marked as a duplicate of this bug. ***
Comment 6 Chris Wilson 2017-09-26 13:24:48 UTC
commit 87dc03ad268f285065cdd2e2ac75701a1f04d0b8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 15 14:09:29 2017 +0100

    drm/i915/selftests: Try to recover from a wedged GPU during reset tests
    
    If we see the seqno stop progressing, we abandon the test for fear that
    the GPU died following the reset. However, during test teardown we still
    wait for the GPU to idle before continuing, but we have already
    confirmed that the GPU is dead. Furthermore, since we are inside a reset
    test, we have disabled the hangchecker, and so there is no safety net and
    we wait indefinitely. Detect the stuck GPU and declare it wedged as a
    state of emergency so we can escape.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20170915130929.18892-1-chris@chris-wilson.co.uk
    Tested-by: Jari Tahvanainen <jari.tahvanainen@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Comment 7 Elizabeth 2018-03-02 15:35:29 UTC
Closing. According to CI results, this test hasn't failed on HSW for a while now.

