Bug 103964 - BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2017-11-29 00:44 UTC by Rafael Antognolli
Modified: 2018-01-05 17:09 UTC
i915 platform: CNL

Attachments
Dmesg output with trace. (179.12 KB, text/plain)
2017-11-29 00:44 UTC, Rafael Antognolli
Dmesg output. (116.63 KB, application/octet-stream)
2017-11-29 18:30 UTC, Rafael Antognolli

Description Rafael Antognolli 2017-11-29 00:44:08 UTC
Created attachment 135791
Dmesg output with trace.

After a couple of GPU hangs, I get the attached trace output.

Sometimes the machine freezes completely, although I couldn't reproduce that with the serial cable attached.

drm-tip from today has the issue:

5144438448829ec2a3d94fd16a9e69a52cfa7b3b
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Tue Nov 28 17:05:46 2017 +0000

    drm-tip: 2017y-11m-28d-17h-04m-56s UTC integration manifest



It does not happen with the one from last week:

42670ce69ef7a03b53bea25fad60a9b3931402cf
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date:   Mon Nov 20 16:11:13 2017 +0000

    drm-tip: 2017y-11m-20d-16h-09m-47s UTC integration manifest
Comment 1 Chris Wilson 2017-11-29 00:49:53 UTC
The dmesg is incomplete; lost by systemd not keeping up.

Off the top of my head, we've made no changes around ppgtt in the last week. I'm guessing a bisect will head off into -rc1. Please do bisect this if possible, and try to capture the dmesg, maybe disable ftrace-dump-on-oops.
Comment 2 Chris Wilson 2017-11-29 00:51:31 UTC
I would also recommend enabling kasan.
Comment 3 Rafael Antognolli 2017-11-29 18:30:32 UTC
Created attachment 135809
Dmesg output.

For some reason I couldn't reproduce it with kasan enabled.

This last log was with drm-tip from today:

commit 8f873adc152c899d0e1b9ccaf9caa468955c1fdd (HEAD -> wip/cnl, drm-tip/drm-tip, tip/2017-11-29)
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Nov 29 17:29:54 2017 +0100

    drm-tip: 2017y-11m-29d-16h-28m-40s UTC integration manifest
Comment 4 Chris Wilson 2017-11-29 18:35:43 UTC
Wowser; that's pretty random explosions. If not kasan, perhaps some of the other memdebug options will catch something.

I hope that system is expendable.

Bisection is becoming increasingly invaluable; something is severely amiss, and I'm 95% certain it's not us!
Comment 5 Rafael Antognolli 2017-11-29 18:42:51 UTC
I'm sorry, I can't afford to bisect this :(

If anyone is interested, it should be pretty trivial to reproduce with upstream mesa and something like:

$ PIGLIT_PLATFORM=gbm ./piglit run -t spec@ext_framebuffer_multisample@accuracy gpu piglit_results

No Xorg required.

I'm rolling back to 4.14 for now.
Comment 6 Rodrigo Vivi 2017-11-30 22:13:41 UTC
My bisect with "./gem_concurrent_all --run-subtest 4KiB-tiny-partial-gtt-early-read-forked-hang-all"

pointed to this:

commit 4e90a6e222720dd0ec529f87eca990c736ba8ede
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Nov 26 22:09:01 2017 +0000

    drm/i915: Record default HW state in the GPU error state

With it reverted on drm-tip, the test passes.

Chris, any quick idea? Maybe a concurrency issue here?

Rafael will still check whether the same commit is really what he saw...
Comment 7 Chris Wilson 2017-11-30 22:16:42 UTC
Try
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index e07b5247cd96..c69bd0a8c48c 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1413,6 +1413,7 @@ capture_object(struct drm_i915_private *dev_priv,
        if (obj && i915_gem_object_has_pages(obj)) {
                struct i915_vma fake = {
                        .node = { .start = U64_MAX, .size = obj->base.size },
+                       .size = obj->base.size;
                        .pages = obj->mm.pages,
                        .obj = obj,
                };
Comment 8 Chris Wilson 2017-11-30 22:18:10 UTC
(And that 5%!)
Comment 9 Rodrigo Vivi 2017-12-01 00:04:25 UTC
With .size defined (s/;/,/) on top of drm-tip,
./gem_concurrent_all runs fine for me...
Comment 10 Rafael Antognolli 2017-12-01 00:18:10 UTC
piglit also seems happy with that patch.

Thanks for the quick fix.
Comment 11 Chris Wilson 2017-12-01 09:23:07 UTC
commit b5e0a9418e09a7b6df1728a26832c7c34aa1adf8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 1 00:15:36 2017 +0000

    drm/i915: Set fake_vma.size as well as fake_vma.node.size for capture
    
    When capturing the bo, we allocate an error object with an array of
    min(vma->size, vma->node.size) pages, plus a bit for compression overhead.
    However, when creating the fake vma to describe the bo, only one of the
    sizes was filled in, resulting in a too small array. Through my and CI
    testing, this was sufficient for the mostly empty NULL context as
    it compressed well (or the out-of-bounds access simply didn't cause an
    issue). However, in real workloads on Cannonlake, we were overflowing
    that array and causing havoc with the random memory corruption.
    
    Reported-by: Rafael Antognolli <rafael.antognolli@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103964
    Fixes: 4e90a6e22272 ("drm/i915: Record default HW state in the GPU error state")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Tested-by: Rodrigo Vivi <rodrigo.vivi@gmail.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171201001536.13941-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
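
For readers following the commit message above, here is a minimal sketch of why the capture array came out too small. It is not the i915 code: the `sketch_*` types and helpers are hypothetical stand-ins, a 4 KiB page size is assumed, and the extra allowance for compression overhead mentioned in the commit message is left out. The point it illustrates is plain C: a member omitted from a designated initializer is zero-initialized, so a min() over the two sizes collapses to 0 when only `.node.size` is set.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct sketch_node {
	uint64_t start;
	uint64_t size;
};

struct sketch_vma {
	struct sketch_node node;
	uint64_t size;
};

static uint64_t sketch_min(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Pages the capture array is sized for: min(vma->size, vma->node.size). */
static uint64_t sketch_num_pages(const struct sketch_vma *vma)
{
	return sketch_min(vma->size, vma->node.size) >> SKETCH_PAGE_SHIFT;
}

static void sketch_demo(void)
{
	const uint64_t obj_size = 16 * 4096;

	/* Before the fix: .size is omitted, so C zero-initializes it. */
	struct sketch_vma broken = {
		.node = { .start = UINT64_MAX, .size = obj_size },
	};

	/* After the fix: both sizes are filled in. */
	struct sketch_vma fixed = {
		.node = { .start = UINT64_MAX, .size = obj_size },
		.size = obj_size,
	};

	assert(sketch_num_pages(&broken) == 0);  /* array far too small */
	assert(sketch_num_pages(&fixed) == 16);  /* one slot per page */
}
```

Calling `sketch_demo()` from a test program exercises both cases. In this sketch the "broken" initializer sizes the array for zero pages, which is consistent with the overflow the commit message describes once a real, poorly compressible context is captured.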

