Bug 42180

Summary: System hang while running gem_linear_blits of Intel-gpu-tools
Product: DRI
Reporter: Guang Yang <guang.a.yang>
Component: DRM/Intel
Assignee: Ben Widawsky <ben>
Status: CLOSED FIXED
QA Contact:
Severity: major
Priority: high
CC: ben, chris, daniel, jbarnes, keithp, yi.sun
Version: unspecified
Keywords: patch
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Bug Depends on:
Bug Blocks: 40928, 42991, 44622
Attachments:
  BUG capture on my snb with netconsole (flags: none)

Description Guang Yang 2011-10-24 19:28:31 UTC
System Environment:
--------------------------
Platform:        All
Kernel: (drm-intel-next)82d165557ef094d4b4dfc05871aee618ec7102b0

Bug detailed description:
-------------------------
On all platforms, running gem_linear_blits from intel-gpu-tools crashes the system.

The last known good kernel is:
 
Kernel: (drm-intel-next)64a742fac3a22f57303d8f1b7e347350a1c48254
Comment 1 Daniel Vetter 2011-10-25 00:40:24 UTC
That's ... unexpected. Can you please bisect this one? Also check whether the issue isn't due to an update of intel-gpu-tool.
Comment 2 Chris Wilson 2011-10-26 01:52:37 UTC
And perhaps the details of the kernel crash?
Comment 3 Guang Yang 2011-10-26 02:25:19 UTC
(In reply to comment #1)
> That's ... unexpected. Can you please bisect this one? Also check whether the
> issue isn't due to an update of intel-gpu-tool.

The two kernel commits above are close together; the second one comes before the first. I have tried some older intel-gpu-tools commits and they are good, so I think the issue probably isn't due to an intel-gpu-tools update.
Comment 4 Daniel Vetter 2011-10-26 02:32:23 UTC
> --- Comment #3 from yangguang <guang.a.yang@intel.com> 2011-10-26 02:25:19 PDT ---
> The two kernel commits above are close together; the second one comes before
> the first. I have tried some older intel-gpu-tools commits and they are good,
> so I think the issue probably isn't due to an intel-gpu-tools update.

Just to clarify: The kernel still crashes with an older i-g-t? Also,
please attach the dmesg after the kernel crashed.

Thanks, Daniel
Comment 5 Guang Yang 2011-10-26 18:33:44 UTC
(In reply to comment #4)
> Just to clarify: The kernel still crashes with an older i-g-t? Also,
> please attach the dmesg after the kernel crashed.
> Thanks, Daniel

Oh, sorry Daniel, I meant that the kernel still crashes with an older i-g-t. I can't get the dmesg because I can't ssh in once the kernel has crashed.
Comment 6 Daniel Vetter 2011-10-27 07:02:40 UTC
Ok, I've bisected this to

commit 5c0422878fcdc279ae9a8e8b66972a15b5efb67f
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Mon Oct 17 15:51:55 2011 -0700

    drm/i915: ILK + VT-d workaround

And a small rant towards our qa-team:
- When filing a bug against the kernel, please always attach the full dmesg. If the machine crashes, try to capture as much with netconsole or something similar.
- When the bug is a regression, _always_ bisect it. Really.

Without these 2 things done, I consider the bug report rather incomplete.
Comment 7 Daniel Vetter 2011-10-27 07:19:35 UTC
Created attachment 52826 [details]
BUG capture on my snb with netconsole

"Thread overran stack, or stack corrupted" is the important bit ... everything else kinda stops making sense with that ;-)
Comment 8 Chris Wilson 2011-10-28 05:08:35 UTC
See id:20111028114241.GA13603@elgon.mountain

Ok, so we are doing the idle-flushes. Why is that destabilising the system?
Comment 9 Chris Wilson 2011-10-28 05:16:52 UTC
Ah, recursion.

remove-pte -> wait -> retire -> move-to-inactive -> unref -> unbind -> remove-pte

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index a546a71..6ce1396 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2106,7 +2106,7 @@ i915_wait_request(struct intel_ring_buffer *ring,
         * buffer to have made it to the inactive list, and we would need
         * a separate wait queue to handle that.
         */
-       if (ret == 0)
+       if (ret == 0 && dev_priv->mm.interruptible)
                i915_gem_retire_requests_ring(ring);
 
        return ret;
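
The call chain above can be illustrated with a small user-space toy model. This is only a sketch, not kernel code: every `toy_*` name is hypothetical, and it assumes the idle-flush in remove-pte waits with the interruptible flag cleared, which is the condition the guard in the diff tests for. It records how deeply remove-pte nests with and without that guard.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, none is a kernel symbol. */
struct toy_dev {
    bool interruptible;  /* plays the role of dev_priv->mm.interruptible */
    bool guarded;        /* true = apply the check added by the diff */
    int active_objects;  /* objects that retire would move to inactive */
    int depth;           /* current nesting of toy_remove_pte() */
    int max_depth;       /* deepest nesting seen so far */
};

static void toy_remove_pte(struct toy_dev *dev);

/* retire -> move-to-inactive -> unref -> unbind -> remove-pte */
static void toy_retire(struct toy_dev *dev)
{
    while (dev->active_objects > 0) {
        dev->active_objects--;
        toy_remove_pte(dev);
    }
}

/* wait: once the request completes, optionally retire finished work */
static void toy_wait_request(struct toy_dev *dev)
{
    if (!dev->guarded || dev->interruptible)
        toy_retire(dev);
}

/* remove-pte: idle the GPU first (assumed to be an uninterruptible wait) */
static void toy_remove_pte(struct toy_dev *dev)
{
    bool saved = dev->interruptible;

    dev->depth++;
    if (dev->depth > dev->max_depth)
        dev->max_depth = dev->depth;

    dev->interruptible = false;
    toy_wait_request(dev);
    dev->interruptible = saved;

    dev->depth--;
}

/* Returns how deeply toy_remove_pte() nested for a given workload. */
static int toy_max_depth(int objects, bool guarded)
{
    struct toy_dev dev = {
        .interruptible = true,
        .guarded = guarded,
        .active_objects = objects,
    };

    assert(objects >= 0);
    toy_remove_pte(&dev);
    return dev.max_depth;
}
```

In this model the unguarded path nests once more for every object still pending, so the depth grows with the workload, while the guarded path stays at a single level. A depth that grows with the workload on a small, fixed-size kernel stack would be consistent with the "Thread overran stack" message in the capture.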
Comment 10 Ben Widawsky 2011-10-28 10:18:17 UTC
(In reply to comment #9)
> Ah, recursion.
> 
> remove-pte -> wait -> retire -> move-to-inactive -> unref -> unbind ->
> remove-pte
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index a546a71..6ce1396 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2106,7 +2106,7 @@ i915_wait_request(struct intel_ring_buffer *ring,
>          * buffer to have made it to the inactive list, and we would need
>          * a separate wait queue to handle that.
>          */
> -       if (ret == 0)
> +       if (ret == 0 && dev_priv->mm.interruptible)
>                 i915_gem_retire_requests_ring(ring);
> 
>         return ret;

Looks good to me. Feel free to r-b me when you submit this patch.
Comment 11 Guang Yang 2011-10-31 18:15:04 UTC
(In reply to comment #7)
> Created attachment 52826 [details]
> BUG capture on my snb with netconsole
> "Thread overran stack, or stack corrupted" is the important bit ... everything
> else kinda stops making sense with that ;-)

BTW, I found that this bad commit:
Kernel: (drm-intel-next)82d165557ef094d4b4dfc05871aee618ec7102b0
is contained in the 3.1 release kernel. When we run i-g-t on the 3.1 release, it crashes.
Comment 12 Ben Widawsky 2011-10-31 21:56:36 UTC
Can you try this patch:
http://lists.freedesktop.org/archives/intel-gfx/2011-October/012984.html
Comment 13 Daniel Vetter 2011-11-01 02:17:27 UTC
> --- Comment #11 from yangguang <guang.a.yang@intel.com> 2011-10-31 18:15:04 UTC ---
> BTW, I found that this bad commit:
> Kernel: (drm-intel-next)82d165557ef094d4b4dfc05871aee618ec7102b0
> is contained in the 3.1 release kernel. When we run i-g-t on the 3.1
> release, it crashes.

This is not how it works. The commit you've mentioned changes a few things
in the PCH modeset code. It's extremely unlikely that this will break
gem_linear_blits. So it's probably a new bug somewhere else.

So _please_ gather all the required details (machine details, what
kind of crash, dmesg, crash output over netconsole if there's nothing
in the logs, which test exactly fails, ...) and open a new bug report.

Yours, Daniel
Comment 14 Yi Sun 2011-11-01 20:31:14 UTC
(In reply to comment #13)
> This is not how it works. The commit you've mentioned changes a few things
> in the PCH modeset code. It's extremely unlikely that this will break
> gem_linear_blits. So it's probably a new bug somewhere else.
> 
> So _please_ gather all the required details (machine details, what
> kind of crash, dmesg, crash output over netconsole if there's nothing
> in the logs, which test exactly fails, ...) and open a new bug report.
> 
Hi Daniel,

I think Guang was emphasizing that the issue has appeared on the master branch.
Ben's patch does fix the issue.
Comment 15 Yi Sun 2011-11-01 20:32:08 UTC
(In reply to comment #12)
> Can you try this patch:
> http://lists.freedesktop.org/archives/intel-gfx/2011-October/012984.html

Okay, it works well
Comment 16 Chris Wilson 2011-11-09 11:56:53 UTC
Ben has already submitted a patch to fix this, so please close when it lands in Keith's tree.
Comment 17 Gordon Jin 2011-12-04 22:33:24 UTC
Has the patch been committed?
Comment 18 Ben Widawsky 2011-12-21 11:32:10 UTC
(In reply to comment #17)
> Has the patch been committed?

Keith took Daniel's patch which doesn't work for unknown reasons. I believe nobody (except me) has ever tested my patch.

Please refer to this email/thread, and ping Keith if you'd like him to try merging my patch to -next. Otherwise we have nothing.

http://lists.freedesktop.org/archives/dri-devel/2011-December/017520.html
Comment 19 Guang Yang 2011-12-21 17:54:55 UTC
(In reply to comment #18)
> Keith took Daniel's patch which doesn't work for unknown reasons. I believe
> nobody (except me) has ever tested my patch.
> Please refer to this email/thread, and ping Keith if you'd like him to try
> merging my patch to -next. Otherwise we have nothing.
> http://lists.freedesktop.org/archives/dri-devel/2011-December/017520.html
Hi Ben, as Yi said in comment 15, we have already tested your patch and it works well.
  Tested-by: Guang Yang <guang.a.yang.intel.com>
Comment 20 Ben Widawsky 2011-12-21 18:03:00 UTC
(In reply to comment #19)
> Hi Ben, as Yi said in comment 15, we have already tested your patch and it
> works well.
>   Tested-by: Guang Yang <guang.a.yang.intel.com>

I pinged Keith on IRC. It's up to him whether or not he takes it.
Comment 21 Florian Mickler 2012-01-12 14:16:27 UTC
A patch referencing this bug report has been merged in Linux v3.2-rc5:

commit eb1711bb94991e93669c5a1b5f84f11be2d51ea1
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Dec 6 12:12:33 2011 +0100

    drm/i915: fix infinite recursion on unbind due to ilk vt-d w/a
Comment 22 Florian Mickler 2012-01-21 13:17:59 UTC
A patch referencing a commit referencing this bug report has been merged in Linux v3.2-rc6:

commit ed4a51842a9d9e618d4f4c31349b15b974dba5df
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Dec 16 12:58:39 2011 -0800

    Revert "drm/i915: fix infinite recursion on unbind due to ilk vt-d w/a"
Comment 23 Daniel Vetter 2012-02-09 01:56:09 UTC
Ben's patch is now merged to -next.
Comment 24 Gordon Jin 2012-02-09 21:21:19 UTC
(In reply to comment #23)
> Ben's patch is now merged to -next.

Is it targeted only at the 3.4 kernel? If so, I'd suggest removing the blocking relationship with the 3.2 and 3.3 trackers.
Comment 25 Florian Mickler 2012-04-05 06:57:53 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 8436473a4b10243fd4c3009b97b6646c2ba642f7
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Tue Jan 24 20:36:15 2012 -0800

    drm/i915: drm/i915: Fix recursive calls to unmap
Comment 26 Elizabeth 2017-10-06 14:51:38 UTC
Closing old verified.
