Bug 84207

Summary: [SNB+ Regression] igt/gem_render_copy_redux sporadically causes system hang
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Chris Wilson <chris>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: intel-gfx-bugs, jinxianx.guo
Version: XOrg git
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
dmesg (flags: none)
dmesg (flags: none)

Description lu hua 2014-09-23 02:14:20 UTC
Created attachment 106699 [details]
dmesg

System Environment:
--------------------------
Platform: IVB
Kernel:   (drm-intel-nightly)c5660b4ad395f1e34eacc22cf81c687edfc9c83c

Bug detailed description:
---------------------------
The test sporadically causes a system hang; fail rate: 2/3.
It happens on -queued, -fixes and -nightly kernels.

Running 10 cycles on -queued branch commit 6e47e3f097cc6c4cb470a805a3fa07a8e8376dab works well.

good commit: 6e47e3f097cc6c4cb470a805a3fa07a8e8376dab
bad commit:  b680c37a4d145cf4d8f2b24e46b1163e5ceb1d35

output:
IGT-Version: 1.8-g4b81e9c (x86_64) (Linux: 3.17.0-rc5_drm-intel-nightly_c5660b_20140922_debug+ x86_64)
Subtest normal: SUCCESS (0.218s)
Subtest interruptible: SUCCESS (0.202s)
Subtest flink: SUCCESS (0.817s)
Subtest flink-interruptible: SUCCESS (0.646s)

Reproduce steps:
-------------------------
1. xinit
2. Run ./gem_render_copy_redux for 3 cycles.
Comment 1 lu hua 2014-09-23 03:15:37 UTC
It impacts SNB+ platforms.
Comment 2 Rodrigo Vivi 2014-09-23 23:16:52 UTC
The good commit sha doesn't exist on current -nightly.
So please paste the commit subject along with the sha.

Also, the bad commit is a Docbook integration, so I assume there was still a gap between the bad and good commits, right?
Comment 3 lu hua 2014-09-24 02:16:54 UTC
good commit:9c787942907face82da505c2c5493998b56cfc5a
Comment 4 Rodrigo Vivi 2014-09-24 21:13:21 UTC
Yeah, a big gap between those 2. Could you please go a bit deeper on this bisect, trying to find the offending commit?
Comment 5 lu hua 2014-09-25 05:22:29 UTC
The bisect is difficult. I bisected it, running each step for 10 rounds, and it shows 9430dfa67d7 as the first bad commit. Then I tested commit b478e336b3e755057; it also has this issue, fail rate: 1/25.
Comment 6 Chris Wilson 2014-09-25 09:02:30 UTC
Could you run with CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_OBJECTS, CONFIG_SLUB_DEBUG_ON, CONFIG_PROVE_LOCKING and CONFIG_DEBUG_LIST enabled?
Comment 7 lu hua 2014-09-26 05:50:59 UTC
Created attachment 106898 [details]
dmesg

(In reply to comment #6)
> Could you run with CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_OBJECTS,
> CONFIG_SLUB_DEBUG_ON, CONFIG_PROVE_LOCKING and CONFIG_DEBUG_LIST enabled?

Attached the dmesg with these settings.
Comment 8 Chris Wilson 2014-09-26 06:44:42 UTC
Hmm, that shows no signs of poisoning, so I think we can rule out a double-free of the i915_mm_struct. Could you gdb vmlinux (it will be in the root directory of your build tree) and "list * mmu_notifier_unregister+0x18"
Comment 9 Chris Wilson 2014-09-26 08:12:39 UTC
You run nothing else after boot? (i.e. you boot the machine and the first thing you run is gem_render_copy_redux in a loop?)

I can't see how that would trigger userptr and creation of the i915_mm_struct.
Comment 10 Chris Wilson 2014-09-26 08:46:33 UTC
Oh, the userptr is from libdrm. And the invalid dereference is an error pointer.
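
For readers unfamiliar with the term: a kernel "error pointer" is a negative errno value encoded in the pointer itself, so it is non-NULL and passes a plain NULL check, but faults when dereferenced. The following is a minimal userspace sketch of that convention. The macro names mirror the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers, but these definitions are simplified re-implementations written for illustration, not the kernel's own code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the top 4095 address values are reserved for encoded errnos,
 * mirroring the kernel convention. */
#define MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the negative errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True when ptr lies in the reserved range at the top of the address space. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* A NULL-only check treats an error pointer as a usable object. */
static inline int looks_valid_to_null_check(const void *ptr)
{
    return ptr != NULL;
}
```

With these definitions, ERR_PTR(-EINTR) satisfies both looks_valid_to_null_check() and IS_ERR(), which is why code that only tests for NULL can end up dereferencing it.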
Comment 11 Chris Wilson 2014-09-26 08:49:20 UTC
*** Bug 84358 has been marked as a duplicate of this bug. ***
Comment 12 Chris Wilson 2014-09-26 09:29:47 UTC
commit f2775039b1d2f3c24876622e4528604496de8abc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 26 10:22:33 2014 +0100

    igt/gem_userptr_blits: Test interruptible create-destroy
    
    In order to exercise https://bugs.freedesktop.org/show_bug.cgi?id=84207
    we need to interrupt the mmu_notifier_register with a signal. This is
    likely to be quite difficult, but let's just try running the
    create-destroy test in an interruptible loop for 5s.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 13 Chris Wilson 2014-09-26 09:35:12 UTC
http://patchwork.freedesktop.org/patch/34166/
Comment 14 lu hua 2014-09-28 03:17:56 UTC
(In reply to comment #13)
> http://patchwork.freedesktop.org/patch/34166/

Applied this patch and tested on the latest igt for 30 cycles; it works well.
Comment 15 shuo.wang 2014-09-28 03:18:26 UTC
OOO on Sep 28 – Sep 30. Sorry for no mail access.

Best Regards,
Shuo
Comment 16 Guo Jinxian 2014-09-29 07:08:16 UTC
The test below causes a system hang on HSW and IVB.

[root@x-hsw27 tests]# time ./gem_userptr_blits --run-subtest create-destroy-sync
IGT-Version: 1.8-g32a0308 (x86_64) (Linux: 3.17.0-rc6_drm-intel-fixes_c84db7_20140929+ x86_64)
Aperture size is 2048 MiB
Total RAM is 7669 MiB
Testing unsynchronized mappings...
Testing synchronized mappings...




^C

[root@x-ivb9 tests]# time ./gem_userptr_blits --run-subtest create-destroy-sync
IGT-Version: 1.8-g32a0308 (x86_64) (Linux: 3.17.0-rc6_drm-intel-nightly_7101d8_20140929+ x86_64)
Aperture size is 2048 MiB
Total RAM is 3836 MiB
Testing unsynchronized mappings...
Testing synchronized mappings...






^C^C
Comment 17 Jani Nikula 2014-09-29 13:24:23 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > http://patchwork.freedesktop.org/patch/34166/
> 
> Apply this patch and test on latest igt, run 30 cycles, it works well.

Pushed to drm-intel-fixes as

commit 72e59c89131606106452f1773a316b90d9f54423
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 26 10:31:02 2014 +0100

    drm/i915: Do not store the error pointer for a failed userptr registration

(In reply to comment #16)
> Test below cause system hang on HSW IVB

Please file new reports for them, AFAICT these are separate issues.
Comment 18 Chris Wilson 2014-09-29 13:30:55 UTC
(In reply to comment #17) 
> (In reply to comment #16)
> > Test below cause system hang on HSW IVB
> 
> Please file new reports for them, AFAICT these are separate issues.

Nope. They were test failures added to explicitly reproduce this bug, see comment 12.
Comment 19 Jani Nikula 2014-09-29 13:44:51 UTC
(In reply to comment #17)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > http://patchwork.freedesktop.org/patch/34166/
> > 
> > Apply this patch and test on latest igt, run 30 cycles, it works well.
> 
> Pushed to drm-intel-fixes as
> 
> commit 72e59c89131606106452f1773a316b90d9f54423
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Sep 26 10:31:02 2014 +0100
> 
>     drm/i915: Do not store the error pointer for a failed userptr
> registration
> 

scratch that, it's drm-intel-next-fixes as

commit e9681366ea9e76ab8f75e84351f2f3ca63ee542c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 26 10:31:02 2014 +0100

    drm/i915: Do not store the error pointer for a failed userptr registration



> (In reply to comment #16)
> > Test below cause system hang on HSW IVB
> 
> Please file new reports for them, AFAICT these are separate issues.
Comment 20 Jani Nikula 2014-09-29 13:45:33 UTC
(In reply to comment #18)
> Nope. They were test failures added to explicitly reproduce this bug, see
> comment 12.

Reopen?
Comment 21 Chris Wilson 2014-09-29 15:07:02 UTC
commit e9681366ea9e76ab8f75e84351f2f3ca63ee542c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 26 10:31:02 2014 +0100

    drm/i915: Do not store the error pointer for a failed userptr registration

    Fixes regression from commit ad46cb533d586fdb256855437af876617c6cf609
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Thu Aug 7 14:20:40 2014 +0100
    
        drm/i915: Prevent recursive deadlock on releasing a busy userptr
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=84207
    Testcase: igt/gem_render_copy_redux
    Testcase: igt/gem_userptr_blits/create-destroy-sync
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Jacek Danecki <jacek.danecki@intel.com>
    Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com>
    Cc: Jacek Danecki <jacek.danecki@intel.com>
    Cc: "Ursulin, Tvrtko" <tvrtko.ursulin@intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Note that the commit references both this bugzilla and the new test case from c12 that QA reported a failure for in c16.
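
The idea in the commit subject ("Do not store the error pointer for a failed userptr registration") can be modeled in userspace as below. This is a sketch under stated assumptions, not the i915 code: struct mm_model, struct notifier, notifier_create() and the find/release helpers are hypothetical names, and the "interrupted" flag simply simulates a registration that fails with -EINTR, as in the signal-interrupted case from comment 12.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical model types; not the driver's real structures. */
struct notifier { int registered; };
struct mm_model { struct notifier *mn; };

/* Hypothetical registration; a non-zero 'interrupted' simulates a signal. */
static struct notifier *notifier_create(int interrupted)
{
    struct notifier *mn;

    if (interrupted)
        return ERR_PTR(-EINTR);
    mn = calloc(1, sizeof(*mn));
    if (!mn)
        return ERR_PTR(-ENOMEM);
    mn->registered = 1;
    return mn;
}

/* Buggy pattern: caches whatever comes back, including an error pointer. */
static struct notifier *notifier_find_buggy(struct mm_model *mm, int interrupted)
{
    if (!mm->mn)
        mm->mn = notifier_create(interrupted);
    return mm->mn;
}

/* Fixed pattern: only cache the result when registration succeeded. */
static struct notifier *notifier_find_fixed(struct mm_model *mm, int interrupted)
{
    struct notifier *mn = mm->mn;

    if (!mn) {
        mn = notifier_create(interrupted);
        if (!IS_ERR(mn))
            mm->mn = mn;
    }
    return mn;
}

/* Teardown checks only for NULL, so a cached error pointer would be
 * dereferenced here; with the fixed lookup the field stays NULL on failure. */
static void mm_release(struct mm_model *mm)
{
    if (mm->mn) {
        mm->mn->registered = 0;
        free(mm->mn);
        mm->mn = NULL;
    }
}
```

In this model, a failed notifier_find_buggy() leaves an error pointer in mm->mn that mm_release() would later dereference, whereas a failed notifier_find_fixed() leaves mm->mn NULL so that teardown is a no-op and a later retry can succeed.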
Comment 22 lu hua 2014-09-30 03:40:55 UTC
Verified. Fixed.
Comment 23 Jari Tahvanainen 2017-07-03 12:23:55 UTC
Closing old verified+fixed.
