Bug 93079

Summary: Tonga faults and oopses on agd5f drm-fixes-4.4 since fix incorrect mutex usage v3
Product: DRI Reporter: Andy Furniss <adf.lists>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: ckoenig.leichtzumerken, nhaehnle
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
Attachments:
Description Flags
some backtraces none
dmesg none
Another trace none
0001-drm-amdgpu-fix-race-condition-in-amd_sched_entity_pu.patch none

Description Andy Furniss 2015-11-23 15:54:09 UTC
Created attachment 120050 [details]
some backtraces

Haven't tried agd5f drm-fixes-4.4 before today.

It seems that it is unstable on R9 285.

Twice I've been close to submitting a bisect, but false goods have spoilt things.
 
Varying backtraces attached.

Will try to bisect again but it's not going to be quick.

I am stable on powerplay and was OK on 4.4-next-wip.
Comment 1 Andy Furniss 2015-11-23 15:55:52 UTC
Created attachment 120051 [details]
dmesg
Comment 2 Andy Furniss 2015-11-23 20:58:05 UTC
Unless I eventually manage to lock up on the commit before it (and so far I have failed to), it looks like this starts with:

commit e284022163716ecf11c37fd1057c35d689ef2c11
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Nov 5 19:49:48 2015 +0100

    drm/amdgpu: fix incorrect mutex usage v3
Comment 3 Andy Furniss 2015-11-25 10:04:17 UTC
Seems this is the correct bad.

Ran all day yesterday without issue on the one before.

Booted into a kernel set on the "bad" commit this morning and it locked up within 10 minutes of browsing.

Slightly different this time: no Oops, a similar trace, and "rcu_preempt detected stalls on CPUs/tasks".
Comment 4 Andy Furniss 2015-11-25 10:05:00 UTC
Created attachment 120106 [details]
Another trace
Comment 5 Christian König 2015-11-26 09:39:05 UTC
Well, the good news first: it looks like I can reproduce the issue.

Now the bad news is that I don't have the slightest idea what's causing it.

The patch you mentioned results in different timing around those functions, but I can't see how it should cause something like this.
Comment 6 Nicolai Hähnle 2015-12-02 13:18:22 UTC
Created attachment 120274 [details] [review]
0001-drm-amdgpu-fix-race-condition-in-amd_sched_entity_pu.patch

Please try the attached patch. Thanks to Christian for pointing me at this bug (I got lucky by running into a similar but cleaner stack trace...)
Comment 7 Andy Furniss 2015-12-02 21:46:45 UTC
(In reply to Nicolai Hähnle from comment #6)
> Created attachment 120274 [details] [review]
> 0001-drm-amdgpu-fix-race-condition-in-amd_sched_entity_pu.patch
> 
> Please try the attached patch. Thanks to Christian for pointing me at this
> bug (I got lucky by running into a similar but cleaner stack trace...)

Seems good so far, I'll stay on this kernel for a few days to be sure.

"Sure" may be a bit strong though, I said above powerplay was OK, but despite being OK on it for several days in the past I managed to hit this bug on it yesterday.
Comment 8 Andy Furniss 2015-12-13 10:33:27 UTC
Still seems good.
