Bug 93079 - Tonga faults and oopses on agd5f drm-fixes-4.4 since fix incorrect mutex usage v3
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2015-11-23 15:54 UTC by Andy Furniss
Modified: 2015-12-13 10:33 UTC
CC List: 2 users

Attachments
some backtraces (13.48 KB, text/plain), 2015-11-23 15:54 UTC, Andy Furniss
dmesg (60.35 KB, text/plain), 2015-11-23 15:55 UTC, Andy Furniss
Another trace (2.18 KB, text/plain), 2015-11-25 10:05 UTC, Andy Furniss
0001-drm-amdgpu-fix-race-condition-in-amd_sched_entity_pu.patch (2.11 KB, patch), 2015-12-02 13:18 UTC, Nicolai Hähnle

Description Andy Furniss 2015-11-23 15:54:09 UTC
Created attachment 120050
some backtraces

Haven't tried agd5f drm-fixes-4.4 before today.

It seems that it is unstable on R9 285.

Twice I've been close to submitting a bisect, but false goods have spoilt things.
 
Varying backtraces attached.

Will try to bisect again but it's not going to be quick.

I am stable on powerplay and was OK on 4.4-next-wip.
Comment 1 Andy Furniss 2015-11-23 15:55:52 UTC
Created attachment 120051
dmesg
Comment 2 Andy Furniss 2015-11-23 20:58:05 UTC
Unless I eventually manage to lock up on the commit before it (and so far I am failing to), it looks like this starts with:

commit e284022163716ecf11c37fd1057c35d689ef2c11
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Nov 5 19:49:48 2015 +0100

    drm/amdgpu: fix incorrect mutex usage v3
Comment 3 Andy Furniss 2015-11-25 10:04:17 UTC
Seems this is the correct bad commit.

Ran all day yesterday without issue on the commit before it.

Booted into a kernel built at the "bad" commit this morning and it locked up within 10 minutes of browsing.

Slightly different this time: no Oops, a similar trace, and "rcu_preempt detected stalls on CPUs/tasks".
Comment 4 Andy Furniss 2015-11-25 10:05:00 UTC
Created attachment 120106
Another trace
Comment 5 Christian König 2015-11-26 09:39:05 UTC
Well, the good news first: it looks like I can reproduce the issue.

Now the bad news is that I don't have the slightest idea what's causing it.

The patch you mentioned results in different timing around those functions, but I can't see how it should cause something like this.
Comment 6 Nicolai Hähnle 2015-12-02 13:18:22 UTC
Created attachment 120274
0001-drm-amdgpu-fix-race-condition-in-amd_sched_entity_pu.patch

Please try the attached patch. Thanks to Christian for pointing me at this bug (I got lucky by running into a similar but cleaner stack trace...)
Comment 7 Andy Furniss 2015-12-02 21:46:45 UTC
(In reply to Nicolai Hähnle from comment #6)
> Created attachment 120274
> 0001-drm-amdgpu-fix-race-condition-in-amd_sched_entity_pu.patch
> 
> Please try the attached patch. Thanks to Christian for pointing me at this
> bug (I got lucky by running into a similar but cleaner stack trace...)

Seems good so far, I'll stay on this kernel for a few days to be sure.

"Sure" may be a bit strong, though. I said above that powerplay was OK, but despite it running fine for several days in the past, I managed to hit this bug on it yesterday.
Comment 8 Andy Furniss 2015-12-13 10:33:27 UTC
Still seems good.

