Bug 88654 - [BDW ppgtt Bisected]igt/pm_rps/reset sporadically causes system hang
Summary: [BDW ppgtt Bisected]igt/pm_rps/reset sporadically causes system hang
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: highest critical
Assignee: Nick Hoath
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-21 07:26 UTC by lu hua
Modified: 2017-10-06 14:32 UTC

See Also:
i915 platform:
i915 features:


Attachments
dmesg (13.94 KB, text/plain)
2015-01-21 07:26 UTC, lu hua

Description lu hua 2015-01-21 07:26:22 UTC
Created attachment 112588
dmesg

==System Environment==
--------------------------
Regression: not sure; fail rate: 2/4

Non-working platforms: BDW

==kernel==
--------------------------
drm-intel-nightly/d6bc7a6a0a7573350e8be8ec54002c20d1dbe1e0
commit d6bc7a6a0a7573350e8be8ec54002c20d1dbe1e0
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Jan 20 15:10:59 2015 +0100

    drm-intel-nightly: 2015y-01m-20d-14h-10m-40s UTC integration manifest

==Bug detailed description==
-----------------------------
Running igt/pm_rps/reset sporadically hangs the system on BDW. This test case is also affected by bug 88096.

output:
IGT-Version: 1.9-ga29f28e (x86_64) (Linux: 3.19.0-rc4_drm-intel-nightly_d6bc7a_20150121+ x86_64)
(pm_rps:4777) intel-batchbuffer-CRITICAL: Test assertion failure function intel_batchbuffer_flush_on_ring, file intel_batchbuffer.c:180:
(pm_rps:4777) intel-batchbuffer-CRITICAL: Failed assertion: (drm_intel_gem_bo_context_exec(batch->bo, ctx, used, ring)) == 0
(pm_rps:4777) intel-batchbuffer-CRITICAL: Last errno: 16, Device or resource busy
Subtest reset: FAIL (85.746s)
(pm_rps:4776) CRITICAL: Test assertion failure function loaded_check, file pm_rps.c:509:
(pm_rps:4776) CRITICAL: Failed assertion: freqs[CUR] == freqs[MAX]
(pm_rps:4776) CRITICAL: error: 450 != 900
Subtest reset: FAIL (88.748s)
(pm_rps:4776) CRITICAL: Test assertion failure function load_helper_stop, file pm_rps.c:290:
(pm_rps:4776) CRITICAL: Failed assertion: igt_wait_helper(&lh.igt_proc) == 0
(pm_rps:4776) CRITICAL: Last errno: 10, No child processes
pm_rps: igt_core.c:880: igt_fail: Assertion `!test_with_subtests || in_fixture' failed.

dmesg:
[  135.692079] WARNING: CPU: 1 PID: 4777 at drivers/gpu/drm/i915/intel_lrc.c:506 intel_logical_ring_advance_and_submit+0x73/0x254 [i915]()
[  135.694123] execlist context submission without request
[  135.694167] Modules linked in: netconsole configfs ipv6 dm_mod ac acpi_cpufreq i915 button video drm_kms_helper drm cfbfillrect cfbimgblt cfbcopyarea [last unloaded: netconsole]
[  135.698342] CPU: 1 PID: 4777 Comm: pm_rps Not tainted 3.19.0-rc4_drm-intel-nightly_d6bc7a_20150121+ #718
[  135.700497]  0000000000000000 0000000000000009 ffffffff81799600 ffff880148eeba78
[  135.702677]  ffffffff8103bdcc 0000000000000000 ffffffffa00ba85f 0000000000020000
[  135.704854]  ffff880002ed18b8 0000000000000000 ffff880002ed0000 ffff8801493b5600
[  135.707025] Call Trace:
[  135.709171]  [<ffffffff81799600>] ? dump_stack+0x40/0x50
[  135.711307]  [<ffffffff8103bdcc>] ? warn_slowpath_common+0x98/0xb0
[  135.713437]  [<ffffffffa00ba85f>] ? intel_logical_ring_advance_and_submit+0x73/0x254 [i915]
[  135.715562]  [<ffffffff8103be7c>] ? warn_slowpath_fmt+0x45/0x4a
[  135.717675]  [<ffffffffa00ba85f>] ? intel_logical_ring_advance_and_submit+0x73/0x254 [i915]
[  135.719785]  [<ffffffffa010bb08>] ? logical_ring_wait_for_space+0xc2/0x150 [i915]
[  135.721887]  [<ffffffffa00bad26>] ? intel_logical_ring_begin+0xea/0x1ea [i915]
[  135.723992]  [<ffffffff8110d473>] ? kmem_cache_free+0xf6/0x134
[  135.726096]  [<ffffffffa00baf02>] ? gen8_emit_flush_render+0x3d/0xe1 [i915]
[  135.728198]  [<ffffffffa00bb37e>] ? intel_execlists_submission+0x230/0x34b [i915]
[  135.730306]  [<ffffffffa009e24b>] ? i915_gem_do_execbuffer.isra.12+0xc21/0xd08 [i915]
[  135.732425]  [<ffffffff813f40e5>] ? __pm_runtime_resume+0x5b/0x6a
[  135.734533]  [<ffffffff8110d850>] ? __kmalloc+0x66/0x151
[  135.736622]  [<ffffffffa009f285>] ? i915_gem_execbuffer2+0x172/0x209 [i915]
[  135.738709]  [<ffffffffa009f113>] ? i915_gem_execbuffer+0x350/0x350 [i915]
[  135.740782]  [<ffffffffa001070a>] ? drm_ioctl+0x279/0x3bc [drm]
[  135.742861]  [<ffffffff8113363d>] ? __inode_wait_for_writeback+0x5c/0xa2
[  135.744917]  [<ffffffff81081217>] ? ktime_get+0x44/0x80
[  135.746948]  [<ffffffffa009f113>] ? i915_gem_execbuffer+0x350/0x350 [i915]
[  135.748957]  [<ffffffff8107d542>] ? hrtimer_try_to_cancel+0xa0/0xab
[  135.750939]  [<ffffffff8107d559>] ? hrtimer_cancel+0xc/0x16
[  135.752891]  [<ffffffff811222f5>] ? do_vfs_ioctl+0x412/0x459
[  135.754847]  [<ffffffff8107dead>] ? hrtimer_nanosleep+0x89/0x10b
[  135.756787]  [<ffffffff8107d1bb>] ? update_rmtp+0x60/0x60
[  135.758711]  [<ffffffff81122385>] ? SyS_ioctl+0x49/0x78
[  135.760624]  [<ffffffff8179efd2>] ? system_call_fastpath+0x12/0x17
[  135.762552] ---[ end trace 9955234f7d4f624c ]---
[  135.764486] ------------[ cut here ]------------
[  135.766402] WARNING: CPU: 3 PID: 4777 at include/linux/kref.h:47 intel_logical_ring_advance_and_submit+0xed/0x254 [i915]()

==Reproduce steps==
---------------------------- 
1.  ./pm_rps --run-subtest reset
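Because the failure is sporadic (2 out of 4 runs in this report), it can help to save the output of each run and tally the failures afterwards. The following is a minimal sketch, assuming the output was captured to a log file; the helper name count_subtest_fails is hypothetical and only relies on the "Subtest reset: FAIL" lines shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: count "Subtest <name>: FAIL" lines in IGT output.
# $1 = subtest name; the log text is read from stdin.
count_subtest_fails() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the helper's status at 0.
    grep -c "^Subtest $1: FAIL" || true
}

# Usage (assumes a saved log named pm_rps.log, not part of this report):
#   count_subtest_fails reset < pm_rps.log
```

A run that hangs the whole machine leaves no complete log, so this only counts the runs that failed cleanly.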
Comment 1 Ding Heng 2015-02-02 06:07:13 UTC
b8d24a06568368076ebd5a858a011699a97bfa42 is the first bad commit.
commit b8d24a06568368076ebd5a858a011699a97bfa42
Author:     Mika Kuoppala <mika.kuoppala@linux.intel.com>
AuthorDate: Wed Jan 28 17:03:14 2015 +0200
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu Jan 29 18:03:07 2015 +0100

    drm/i915: Remove nested work in gpu error handling
    
    Now when we declare gpu errors only through our own dedicated
    hangcheck workqueue there is no need to have a separate workqueue
    for handling the resetting and waking up the clients as the deadlock
    concerns are no more.
    
    The only exception is i915_debugfs::i915_set_wedged, which triggers
    error handling through process context. However as this is only used through
    test harness it is responsibility for test harness not to introduce hangs
    through both debug interface and through hangcheck mechanism at the same time.
    
    Remove gpu_error.work and let the hangcheck work do the tasks it used to.
    
    v2: Add a big warning sign into i915_debugfs::i915_set_wedged (Chris)
    
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 2 lu hua 2015-02-03 02:25:29 UTC
(In reply to Ding Heng from comment #1)
> b8d24a06568368076ebd5a858a011699a97bfa42 is the first bad commit.
> commit b8d24a06568368076ebd5a858a011699a97bfa42
> Author:     Mika Kuoppala <mika.kuoppala@linux.intel.com>
> AuthorDate: Wed Jan 28 17:03:14 2015 +0200
> Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
> CommitDate: Thu Jan 29 18:03:07 2015 +0100
> 
>     drm/i915: Remove nested work in gpu error handling

That is a separate bug; it is reported as bug 88928.
Comment 3 lu hua 2015-02-03 05:53:15 UTC
I tested on the drm-intel-nightly kernel (98592c_20150122) with i915.enable_execlists=0, and it works well.
On the latest drm-intel-nightly kernel (8b4216_20150203), the test hits bug 88928.
Comment 4 lu hua 2015-02-09 06:41:13 UTC
With i915.enable_ppgtt=0 added, it works well.
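When comparing runs with and without these workarounds, it is worth confirming that the running kernel was actually booted with the intended parameter. A minimal sketch, assuming the parameter was passed on the kernel command line; has_param is a hypothetical helper name, not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line contains
# the exact parameter word.
# $1 = command-line string, $2 = parameter (e.g. i915.enable_ppgtt=0)
has_param() {
    # Pad both sides with spaces so only whole words match.
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on the test machine:
#   has_param "$(cat /proc/cmdline)" "i915.enable_ppgtt=0" && echo "ppgtt disabled"
#   has_param "$(cat /proc/cmdline)" "i915.enable_execlists=0" && echo "execlists disabled"
```

Matching whole words avoids treating i915.enable_ppgtt=01 or a similarly prefixed option as a hit.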
Comment 5 lu hua 2015-02-10 09:15:19 UTC
Bisected it. Result:
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
2d12955a3e539f0938b4b90d1eade852105ba290
72f95afa5faaf899f7344879b6ccd5f0cb271b28
We cannot bisect more!
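git prints the message above when every remaining candidate was marked with "git bisect skip", so it cannot narrow the range further. For a scripted bisect, "git bisect run" reads the script's exit code: 0 means good, 125 means skip, and any other code from 1 to 127 means bad. The sketch below shows that mapping only; bisect_verdict is a hypothetical helper, and treating a failed build as "skip" is an assumption, not something stated in this report.

```shell
#!/bin/sh
# Hypothetical helper: translate a build status and a test status into
# the exit code that "git bisect run" expects.
# $1 = exit status of the kernel build, $2 = exit status of the test
bisect_verdict() {
    if [ "$1" -ne 0 ]; then
        echo 125    # commit cannot be tested: skip it
    elif [ "$2" -eq 0 ]; then
        echo 0      # test passed: good commit
    else
        echo 1      # test failed: bad commit
    fi
}

# Usage inside a bisect script (build and test steps are placeholders):
#   exit "$(bisect_verdict "$build_status" "$test_status")"
```

Since this bug hangs the whole system, a script like this would have to run on a separate control machine that can detect the hang and reboot the target.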
Comment 6 Thomas Daniel 2015-02-16 16:42:30 UTC
Test failure should be fixed by the same fix as bug 88096

Warning trace should probably be investigated separately
Comment 7 Paulo Zanoni 2015-02-24 18:26:18 UTC
The patch mentioned in comment #6 is merged now. Can you please retest against -nightly?
Comment 8 Ding Heng 2015-02-27 06:12:51 UTC
Passes with the latest nightly branch, b18ca534ab790c19aefe8ecbec46d1bc7a31ce1e (2015-02-26). Changing state to verified.
Comment 9 Elizabeth 2017-10-06 14:32:05 UTC
Closing old verified bug.