Bug 52429 - [IVB Regression] GPU hung after S3 and glxgears with hw contexts
Summary: [IVB Regression] GPU hung after S3 and glxgears with hw contexts
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: high normal
Assignee: Ben Widawsky
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-24 07:05 UTC by Guang Yang
Modified: 2017-09-04 10:11 UTC

See Also:
i915 platform:
i915 features:


Attachments
after doing S3 with X and glxgears dmesg info (7.96 KB, text/plain)
2012-07-24 07:05 UTC, Guang Yang
Unconditionally disable contexts (961 bytes, patch)
2012-07-30 17:53 UTC, Ben Widawsky
dmesg info after S3 (58.09 KB, text/plain)
2012-07-31 02:23 UTC, Guang Yang
Make sure we see idle message (913 bytes, patch)
2012-08-08 05:00 UTC, Ben Widawsky
dmesg info with Ben's patch (122.67 KB, text/plain)
2012-08-09 06:42 UTC, Guang Yang
dmesg info with mesa 8.0.4 (122.70 KB, text/plain)
2012-08-09 06:43 UTC, Guang Yang
with ben's new branch of bug_52429 's debug info (122.68 KB, text/plain)
2012-08-13 02:27 UTC, Guang Yang
error_state with ben's branch (2.06 MB, text/plain)
2012-08-13 02:41 UTC, Guang Yang
dmesg info with newest ben's branch (122.59 KB, text/plain)
2012-08-14 05:28 UTC, Guang Yang

Description Guang Yang 2012-07-24 07:05:47 UTC
Created attachment 64586
after doing S3 with X and glxgears dmesg info

System Environment:
--------------------------
Platform:        IvyBridge
Kernel:(drm-intel-testing)b5430f2760caadd38009e2290d070c700f

Bug detailed description:
-------------------------
After resuming from S3 with X and glxgears running, dmesg shows a GPU hang, so I have attached the dmesg. I also tried S3 without X, S4 with X and glxgears, and S4 without X; those all work well.
Comment 1 Daniel Vetter 2012-07-24 08:31:44 UTC
Is this a regression?
Comment 2 Daniel Vetter 2012-07-24 08:38:05 UTC
In the reset code it dies on

BUG_ON(obj->base.write_domain & ~I915_GEM_GPU_DOMAINS);

in move_to_inactive.
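
For illustration, a minimal sketch of what that assertion tests: it fires when the object's write domain contains any bit outside the GPU domains, i.e. a CPU or GTT write domain. The SKETCH_* names and the helper are hypothetical, and the bit values are assumed to mirror the I915_GEM_DOMAIN_* definitions in the i915 uapi header; this is not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed domain bits, modelled on the I915_GEM_DOMAIN_* values. */
#define SKETCH_DOMAIN_CPU         0x00000001u
#define SKETCH_DOMAIN_RENDER      0x00000002u
#define SKETCH_DOMAIN_SAMPLER     0x00000004u
#define SKETCH_DOMAIN_COMMAND     0x00000008u
#define SKETCH_DOMAIN_INSTRUCTION 0x00000010u
#define SKETCH_DOMAIN_VERTEX      0x00000020u
#define SKETCH_DOMAIN_GTT         0x00000040u

/* Everything the GPU itself can write through. */
#define SKETCH_GPU_DOMAINS                                   \
	(SKETCH_DOMAIN_RENDER | SKETCH_DOMAIN_SAMPLER |      \
	 SKETCH_DOMAIN_COMMAND | SKETCH_DOMAIN_INSTRUCTION | \
	 SKETCH_DOMAIN_VERTEX)

/* The CPU and GTT domains must never be counted as GPU domains. */
static_assert((SKETCH_GPU_DOMAINS &
	       (SKETCH_DOMAIN_CPU | SKETCH_DOMAIN_GTT)) == 0,
	      "CPU/GTT bits leaked into the GPU domain mask");

/* Hypothetical helper: true when the quoted BUG_ON condition would hold,
 * i.e. write_domain has a bit set that is not a GPU domain. */
bool sketch_write_domain_trips_bug_on(uint32_t write_domain)
{
	return (write_domain & ~SKETCH_GPU_DOMAINS) != 0;
}
```

With these assumed values, a render-only write domain passes the check, while any value with the CPU or GTT bit set would trip it.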
Comment 3 Guang Yang 2012-07-24 08:39:42 UTC
(In reply to comment #1)
> Is this a regression?
It's a regression.
Comment 4 Daniel Vetter 2012-07-24 09:24:37 UTC
What are the versions of the other driver components (mesa is especially important here)?
Comment 5 Guang Yang 2012-07-25 01:06:06 UTC
(In reply to comment #4)
> What are the version of the other driver components (especially mesa is
> important here)?
Here is the environment:
Libdrm:         (master)libdrm-2.4.37-11-gfaf26b689d4a2a6d1e851a1ea2fd657406eebfff
Mesa:           (master)cfdf60f236a525a0309146ce2da156bd3856c8b7
Xserver:                (master)xorg-server-1.12.99.902
Xf86_video_intel:               (master)2.20.1
Cairo:          (master)21e3f2e9034b64131075d82a4e34868dc72f2249
Libva:          (staging)f12f80371fb534e6bbf248586b3c17c298a31f4e
Libva_intel_driver:             (staging)82fa52510a37ab645daaa3bb7091ff5096a20d0b
Comment 6 Daniel Vetter 2012-07-25 08:36:22 UTC
Can you please check whether this issue is caused by the same patch as bug #52424, i.e. whether reverting 74792b53cfc2f235bc0e2eef39029817dc2cb726 fixes it?

If not, I guess we need the bisect for this one here, too - I've tried to reproduce it (even tried to manually hang the gpu), but couldn't.
Comment 7 Guang Yang 2012-07-30 07:10:47 UTC
(In reply to comment #6)
> Can you please check whether this issue is caused by the same patch as bug
> #52424, i.e. whether reverting 74792b53cfc2f235bc0e2eef39029817dc2cb726 fixes
> it?
> 
> If not, I guess we need the bisect for this one here, too - I've tried to
> reproduce it (even tried to manually hang the gpu), but couldn't.
This is a different issue from bug #52424. I bisected and found that:

e158c5aa1776372cd751e2c395300a3a6ff0bc9c is the first bad commit
commit e158c5aa1776372cd751e2c395300a3a6ff0bc9c
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Sun Jun 17 09:37:24 2012 -0700

    drm/i915: disable contexts on old HW
    This got dropped as a result of the last round of comments. I didn't test it on unsupported HW (which this is likely the case).
    Note that this prevents hw context from blowing up on any pre-gen6 hw.
    Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51142
    [danvet: Added note and buglink.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

When I revert this commit, the issue is gone.
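
The bisect above is, in effect, a binary search over the commit history. A minimal sketch of the idea follows, assuming the history flips from good to bad exactly once; the function names and the index-based predicate are hypothetical and stand in for building and testing a kernel at each step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: reports whether the commit at `index` is bad. */
typedef bool (*sketch_is_bad_fn)(size_t index, void *ctx);

/* Returns the first bad index in (good, bad], assuming the commit at `good`
 * is good, the commit at `bad` is bad, and there is a single good-to-bad
 * transition in between. */
size_t sketch_first_bad_index(size_t good, size_t bad,
			      sketch_is_bad_fn is_bad, void *ctx)
{
	assert(good < bad);

	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;	/* first bad commit is at or before mid */
		else
			good = mid;	/* first bad commit is after mid */
	}
	return bad;
}

/* Example predicate: every commit at or after *ctx is bad. */
bool sketch_bad_from_threshold(size_t index, void *ctx)
{
	return index >= *(const size_t *)ctx;
}
```

If the failure does not reproduce reliably, or depends on something other than the kernel commit, the single-transition assumption breaks and the reported first bad commit can be misleading.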
Comment 8 Chris Wilson 2012-07-30 13:20:50 UTC
Ben, can you add a module option to disable hw contexts for ease of debugging?
Comment 9 Ben Widawsky 2012-07-30 17:52:05 UTC
(In reply to comment #8)
> Ben, can you add a module option to disable hw contexts for ease of debugging?

I'd prefer to not add a module option since that creates a slippery slope. I'll attach a patch to unconditionally disable contexts. Yanguang, can you apply this patch on top of whatever repo you are using and report the results?
Comment 10 Ben Widawsky 2012-07-30 17:53:09 UTC
Created attachment 64962
Unconditionally disable contexts
Comment 11 Guang Yang 2012-07-31 02:23:57 UTC
Created attachment 64970
dmesg info after S3

(In reply to comment #10)
> Created attachment 64962
> Unconditionally disable contexts
I tried your patch on kernel
(drm-intel-next-queued)ab3951eb74e7c33a2f5b7b64d72e82f1eea61571;
the issue is gone, and I have attached the dmesg taken after resuming from S3.
Comment 12 Ben Widawsky 2012-08-08 03:12:19 UTC
Yangguang, can you reproduce this every time? Do you run glxgears with vsync? Do you see it on multiple platforms? I am unable to hit this on my IVB.

The bisection point doesn't make much sense as it should have no effect on IVB.
Comment 13 Guang Yang 2012-08-08 03:29:49 UTC
(In reply to comment #12)
> Yangguang, can you reproduce this every time? Do you run glxgears with vsync?
> Do you see it on multiple platforms? I am unable to hit this on my IVB
> 
> The bisection point doesn't make much sense as it should have no effect on IVB.
Yes, I can reproduce this issue every time. We run glxgears with vsync by default. I have only seen this on IVB. I'm confused by the bisect result too, but when I revert this commit, the issue is gone.
Comment 14 Ben Widawsky 2012-08-08 03:39:39 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Yangguang, can you reproduce this every time? Do you run glxgears with vsync?
> > Do you see it on multiple platforms? I am unable to hit this on my IVB
> > 
> > The bisection point doesn't make much sense as it should have no effect on IVB.
>  Yes,I can reproduce this issue every time,we run glxgears with vsync as
> default, I only catch this with IVB, I'm confused with that bisect
> result,too.But when I revert this commit,the issue is gone.


Is this a composited desktop? 
Can you get the error state?
Can you try to reproduce this with a mesa that doesn't use contexts (8.0.4 or something should be fine)?
Comment 15 Ben Widawsky 2012-08-08 04:14:57 UTC
Hmm, also; do you always see these messages before the hang? And always 3 of them? 
[ 1049.967346] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle
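
One way to answer the "always 3 of them?" question from a saved dmesg is to count occurrences of the message text. A minimal sketch, assuming the log has already been read into a NUL-terminated string; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of `needle` in `text`.
 * An empty needle is treated as matching nothing. */
size_t sketch_count_occurrences(const char *text, const char *needle)
{
	size_t count = 0;
	size_t len;
	const char *p;

	assert(text != NULL && needle != NULL);

	len = strlen(needle);
	if (len == 0)
		return 0;

	/* Jump past each match so the same bytes are never counted twice. */
	for (p = strstr(text, needle); p != NULL; p = strstr(p + len, needle))
		count++;

	return count;
}
```

Searching for "Hangcheck timer elapsed... render ring idle" rather than the full line avoids depending on the timestamp prefix, which differs on every line.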
Comment 16 Ben Widawsky 2012-08-08 05:00:21 UTC
Created attachment 65260
Make sure we see idle message

Yanguang, can you please apply this patch to make sure we don't miss idle errors. Send another dmesg after the error with this patch.
Comment 17 Guang Yang 2012-08-09 06:42:21 UTC
Created attachment 65326
dmesg info with Ben's patch

(In reply to comment #16)
> Created attachment 65260
> Make sure we see idle message
> 
> Yanguang, can you please apply this patch to make sure we don't miss idle
> errors. Send another dmesg after the error with this patch.
Ben, I tried your patch with the latest upstream kernel:
Kernel: (drm-intel-next-queued)65bccb5c708bd9f00d24f041f4f7c45130359448
I get a call trace after S3 with glxgears, and I have attached the dmesg info.

(In reply to comment #15)
> Hmm, also; do you always see these messages before the hang? And always 3 of
> them? 
> [ 1049.967346] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer
> elapsed... render ring idle
With the new dmesg from your patch, I can find 3 of these messages:
[drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... render ring idle

(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > Yangguang, can you reproduce this every time? Do you run glxgears with vsync?
> > > Do you see it on multiple platforms? I am unable to hit this on my IVB
> > > 
> > > The bisection point doesn't make much sense as it should have no effect on IVB.
> >  Yes,I can reproduce this issue every time,we run glxgears with vsync as
> > default, I only catch this with IVB, I'm confused with that bisect
> > result,too.But when I revert this commit,the issue is gone.
> 
> 
> Is this a composited desktop? 
> Can you get the error state?
> Can you try to reproduce this with a mesa that doesn't use contexts (8.0.4 or
> something should be fine)?
I only run X, without any compositing manager.
I can't get the error state because the GPU hangs. After rebooting, the error state is empty.
I tried with mesa 8.0.4 and the issue is gone. I have attached the dmesg.
Comment 18 Guang Yang 2012-08-09 06:43:29 UTC
Created attachment 65327
dmesg info with mesa 8.0.4

This is the dmesg with mesa 8.0.4
Comment 19 Kenneth Graunke 2012-08-10 17:56:53 UTC
Mesa 8.0.x doesn't use contexts.
Comment 20 Ben Widawsky 2012-08-10 23:20:23 UTC
(In reply to comment #19)
> Mesa 8.0.x doesn't use contexts.

Yes. I asked for this to verify it doesn't occur with just the default context.
Comment 21 Ben Widawsky 2012-08-11 23:56:30 UTC
There appears to be list corruption occurring with this test case. I've been unable thus far to track down how the list is getting corrupted, and have no theories about it either. It only occurs at resume.

However, I've created some patches which address other potential issues. I'll update the patches with better commit messages later, but for now we can just test them.

Yangguang, please try this:
git://people.freedesktop.org/~bwidawsk/drm-intel bug_52429
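
As a generic illustration of what list corruption means for a circular doubly linked list (the kind of structure the kernel's list helpers maintain), the sketch below checks the invariant that every node's next->prev points back to the node. All names are hypothetical, and this does not reproduce the actual corruption in this bug, whose cause is not yet known at this point in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an intrusive list node. */
struct sketch_node {
	struct sketch_node *next;
	struct sketch_node *prev;
};

/* An empty circular list is a head that points at itself. */
void sketch_list_init(struct sketch_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Links `n` in just before `head`, i.e. at the tail of the list. */
void sketch_list_add_tail(struct sketch_node *n, struct sketch_node *head)
{
	struct sketch_node *last = head->prev;

	n->next = head;
	n->prev = last;
	last->next = n;
	head->prev = n;
}

/* Walks the list and verifies next->prev == current for every link.
 * `max_nodes` bounds the walk so a corrupted list cannot loop forever. */
bool sketch_list_is_consistent(const struct sketch_node *head,
			       size_t max_nodes)
{
	const struct sketch_node *cur = head;
	size_t i;

	assert(head != NULL);

	for (i = 0; i <= max_nodes; i++) {
		if (cur->next == NULL || cur->next->prev != cur)
			return false;
		cur = cur->next;
		if (cur == head)
			return true;
	}
	return false;	/* never came back to head within the bound */
}
```

Adding the same node to a list twice without removing it first is one common way to break this invariant.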
Comment 22 Ben Widawsky 2012-08-12 00:58:06 UTC
There appears to be list corruption occurring with this test case. I've been unable thus far to track down how the list is getting corrupted, and have no theories about it either. It only occurs at resume.

However, I've created some patches which address other potential issues. I'll update the patches with better commit messages later, but for now we can just test them.

Yangguang, please try this:
git://people.freedesktop.org/~bwidawsk/drm-intel bug_52429
Comment 23 Ben Widawsky 2012-08-12 05:27:05 UTC
Just pushed a fix for the list corruption.

I have no more issues on my IVB with S3 now.
Comment 24 Guang Yang 2012-08-13 02:27:44 UTC
Created attachment 65486
with ben's new branch of bug_52429 's debug info

(In reply to comment #23)
> Just pushed a fix for the list corruption.
> 
> I have no more issues on my IVB with S3 now.
Ben, where did you push your fix?
I tried with the repo:
git://people.freedesktop.org/~bwidawsk/drm-intel bug_52429
at your latest commit:
Kernel: (context_support_rev2)f1b8d863ac4b4ac7edc1107b19a7ce90b116ff96.
I still get a call trace. I have attached the dmesg info.
Comment 25 Ben Widawsky 2012-08-13 02:32:14 UTC
How about the error state?
Comment 26 Ben Widawsky 2012-08-13 02:41:26 UTC
To elaborate a bit, the order of events seems to be:

1. resume
2. gpu hang
3. ring init fail
4. pin fail

The last one may very well be my fault, but also seems to be the lowest priority.
Comment 27 Guang Yang 2012-08-13 02:41:30 UTC
Created attachment 65487
error_state with ben's branch

(In reply to comment #25)
> How about the error state?
Sorry for the delay; here is the error state.
Comment 28 Ben Widawsky 2012-08-13 02:52:24 UTC
I've just pushed a test patch. In the error state, instdone1 is 0, which seems quite odd. I want to try to ignore it when detecting hangs to see what happens.

This is mostly just a guess, just want to try it while our timezones overlap :-)

The relevant sha is 4ea7e2c74f43f4798f5e1494b69b9720e5aa0846

It is still here:
git://people.freedesktop.org/~bwidawsk/drm-intel bug_52429

Thank you.
Comment 29 Guang Yang 2012-08-13 06:45:42 UTC
(In reply to comment #28)
> I've just pushed a test patch. In the error state, instdone1 is 0, which seems
> quite odd. I want to try to ignore it when detecting hangs to see what happens.
> 
> This is mostly just a guess, just want to try it while our timezones overlap
> :-)
> 
> The relevant sha is 4ea7e2c74f43f4798f5e1494b69b9720e5aa0846
> 
> It is still here:
> git://people.freedesktop.org/~bwidawsk/drm-intel bug_52429
> 
> Thank you.

Ben, commit 4ea7e2c74f43f4798f5e1494b69b9720e5aa0846 works well.
Comment 30 Ben Widawsky 2012-08-13 19:03:06 UTC
Yanguang, I've just force-pushed the patch series which I would like to submit to intel-gfx. Would you please try it out and tell me how it goes? It's a bit different than what was there previously.

Thanks.
Comment 31 Guang Yang 2012-08-14 01:37:27 UTC
(In reply to comment #30)
> yanguang, I've just forced push the patch series which I would like to submit
> to intel-gfx. Would you please try it out and tell me how it goes? It's a bit
> different than what was there previously.
> 
> Thanks.
Ben, I see the bug_52429 branch has been updated. Do you mean I need to try the newest commit 9b524fe712f7d6c7c7cc83947920aefcf9fb8867?
Comment 32 Ben Widawsky 2012-08-14 02:53:41 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > yanguang, I've just forced push the patch series which I would like to submit
> > to intel-gfx. Would you please try it out and tell me how it goes? It's a bit
> > different than what was there previously.
> > 
> > Thanks.
> Ben, I see branch of bug_52429 has been updated, you mean I need to try the
> newest commit 9b524fe712f7d6c7c7cc83947920aefcf9fb8867?

Just try the whole branch like you did before. I just wanted to point out that I did a force push, so you should do something like `git reset --hard bwidawsk/bug_52429`. There are 4 patches in there in all which I wanted tested. If it works, I'll add your tested-by and submit it to intel-gfx mailing list.

Thanks.
Comment 33 Guang Yang 2012-08-14 05:28:54 UTC
Created attachment 65527
dmesg info with newest ben's branch

(In reply to comment #32)
> (In reply to comment #31)
> > (In reply to comment #30)
> > > yanguang, I've just forced push the patch series which I would like to submit
> > > to intel-gfx. Would you please try it out and tell me how it goes? It's a bit
> > > different than what was there previously.
> > > 
> > > Thanks.
> > Ben, I see branch of bug_52429 has been updated, you mean I need to try the
> > newest commit 9b524fe712f7d6c7c7cc83947920aefcf9fb8867?
> 
> Just try the whole branch like you did before. I just wanted to point out that
> I did a force push, so you should do something like `git reset --hard
> bwidawsk/bug_52429`. There are 4 patches in there in all which I wanted tested.
> If it works, I'll add your tested-by and submit it to intel-gfx mailing list.
> 
> THanks.
I tried the latest commit 9b524fe712f7d6c7c7cc83947920aefcf9fb8867; it works well and the issue is gone. I have attached the dmesg.
Comment 34 Ben Widawsky 2012-08-20 03:09:04 UTC
Hi Yanguang. Daniel has taken one of the patches already for drm-intel-next-queued. Can you tell whether or not that patch alone makes the issue go away?

If it doesn't I'll work on getting the other patches upstream as well.
Comment 35 Guang Yang 2012-08-20 05:54:06 UTC
(In reply to comment #34)
> Hi Yanguang. Daniel has taken one of the patches already for
> drm-intel-next-queued. Can you tell whether or not that patch alone makes the
> issue go away?
> 
> If it doesn't I'll work on getting the other patches upstream as well.
Ben, I have tried the newest drm-intel-next-queued, and the issue still occurs.
Comment 36 Daniel Vetter 2012-08-20 17:47:14 UTC
Erhm, I've merged b6c7488df68ae3660d81b into -fixes (and nothing yet into -queued), can you please test whether -fixes works better?
Comment 37 Guang Yang 2012-08-21 08:23:52 UTC
(In reply to comment #36)
> Erhm, I've merged b6c7488df68ae3660d81b into -fixes (and nothing yet into
> -queued), can you please test whether -fixes works better?
I tried the newest -fixes kernel; it works well and the issue is gone.
Comment 38 Daniel Vetter 2012-08-21 08:39:30 UTC
Ok, thanks for testing, I'll close this as fixed.
Comment 39 Guang Yang 2012-08-21 08:43:12 UTC
Confirmed: the -fixes kernel fixes this issue.
Comment 40 Ben Widawsky 2012-08-21 15:45:20 UTC
I'm happy that it's fixed with that one patch, but I also think we shouldn't throw out the other patches just yet.
Comment 41 Florian Mickler 2012-09-05 20:41:54 UTC
A patch referencing this bug report has been merged in Linux v3.6-rc3:

commit b6c7488df68ae3660d81b149b61b55b97929da83
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Tue Aug 14 14:35:14 2012 -0700

    drm/i915/contexts: fix list corruption
Comment 42 Jari Tahvanainen 2017-09-04 10:11:35 UTC
Closing old verified+fixed.

