Bug 91790

Summary: TONGA hang in amdgpu_ring_lock
Product: DRI
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Reporter: Mathias Tillman <master.homer>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
CC: master.homer
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
Attachments (all flags: none):
  dmesg of hang
  mplayer X hung task
  dmesg with added debug output
  possible fix
  Output of amdgpu_regs and amdgpu_fence_info

Description Mathias Tillman 2015-08-28 08:49:06 UTC
Created attachment 117962 [details]
dmesg of hang

I've been getting random hangs in amdgpu_ring_lock, which cause X to hang, meaning I can't use the computer at all. I can sometimes switch to a tty, but even that doesn't always work.

I'm running Ubuntu 15.04 with mesa and libdrm from the oibaf ppa, with a self-compiled xf86-video-amdgpu and a self-compiled kernel from agd5f, drm-next-4.3-wip (9066b0c318589f47b754a3def4fe8ec4688dc21a).

I haven't been able to predict when the hang will happen; sometimes I can use the machine for several hours before it hangs, other times it happens just a few minutes after booting.
Comment 1 Andy Furniss 2015-08-28 10:30:38 UTC
Created attachment 117963 [details]
mplayer X hung task

I got a similar trace yesterday on current agd5f drm-next-4.3 while trying to break UVD by repeatedly starting mplayer.

I am slightly hopeful this is a different issue from the UVD one, as the trace starts with X and I got far more starts than I have recently - 360 to get this trace, after a couple of clean 250-start runs.

I haven't locked up in normal use, but then my desktop setup is simple (fluxbox).
Comment 2 Mathias Tillman 2015-08-28 12:31:34 UTC
Created attachment 117967 [details]
dmesg with added debug output

I've done some more testing; it turns out that in certain cases it never reaches amdgpu_ring_unlock_commit, and that's what causes the hang, since the mutex is never unlocked.
I added some debug output to the code: gfx/sdma0 is ring->name, 0/9 is ring->idx, and the address is that of the ring struct.
As you can see in the log, it calls amdgpu_ring_lock on ring 9 with name sdma0, and then afterwards calls it again on ring 0 with name gfx, without ever calling amdgpu_ring_unlock_commit in between.
I will add some more debug output in the hope of finding out exactly why it's never unlocked, and whether it is fixable. I should mention that these random lockups do not happen with the proprietary Catalyst driver, so it must be something in the amdgpu driver.
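For reference, a rough sketch of the lock/commit pairing being described here (illustrative only, based on the function names in this report; the exact signatures and locking details in drm-next-4.3-wip may differ):

	/* Illustrative sketch, not the actual driver code. */
	r = amdgpu_ring_lock(ring, ndw);	/* take the ring's mutex and reserve ndw dwords */
	if (r)
		return r;			/* lock is not held on failure */

	/* ... emit commands into the ring ... */

	amdgpu_ring_unlock_commit(ring);	/* commit the write pointer and release the mutex */

If any path returns after a successful amdgpu_ring_lock without reaching amdgpu_ring_unlock_commit (or an equivalent unlock), the mutex stays held and the next amdgpu_ring_lock call on that ring blocks forever, which would match the hung-task traces attached here.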
Comment 3 Christian König 2015-08-28 12:46:43 UTC
That could just be a symptom of a hardware hang which isn't detected for some reason.

Please take a look at amdgpu_fence_info as well to see if there are any outstanding submissions.
Comment 4 Andy Furniss 2015-08-28 12:54:43 UTC
(In reply to Christian König from comment #3)
> That could just be a symptom of a hardware hang which isn't detected for
> some reason.

There's this - drm/amdgpu: disable GPU reset by default

http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-4.3&id=a895c222e7ab5f50ec10e209cd4548ecd5dd9443
Comment 5 Mathias Tillman 2015-08-28 16:03:28 UTC
(In reply to Christian König from comment #3)
> That could just be a symptom of a hardware hang which isn't detected for
> some reason.
> 
> Please take a look at amdgpu_fence_info as well to see if there are any
> outstanding submissions.

If it's a hardware hang, wouldn't it also happen when using catalyst? It doesn't happen there, so it should at least be possible to work around (if it is a hardware problem).
I will continue investigating why this happens, but it does seem to me like this bug, #91278, and #91676 are all caused by the same thing, with different log output depending on whether you use drm-next-4.3 or drm-next-4.2.
Comment 6 Christian König 2015-08-28 16:07:52 UTC
No, the currently released Catalyst doesn't use anything from the amdgpu module yet.

It's clearly not a hardware problem, but invalid render commands can cause the hardware to lock up.
Comment 7 Mathias Tillman 2015-09-01 18:20:27 UTC
Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've been running it all day without a single lockup; before, it used to lock up several times a day. I just wanted someone to confirm whether it is in fact working, or if it's just me.
Comment 8 Andy Furniss 2015-09-01 19:58:17 UTC
(In reply to Mathias Tillman from comment #7)
> Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've
> been running it all day without a single lock up, before it used to lock up
> several times a day. Just wanted someone to confirm if it is in fact
> working, or if it's just me.

I can imagine that it's far better for desktop lockups - I moved onto it when it got updated.

Initially, testing with Unigine Valley, I thought it was going to be good - I got further than ever before (about four times through all the scenes, having not got through once previously), but it did lock up.
Comment 9 Mathias Tillman 2015-09-02 09:15:37 UTC
(In reply to Andy Furniss from comment #8)
> (In reply to Mathias Tillman from comment #7)
> > Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've
> > been running it all day without a single lock up, before it used to lock up
> > several times a day. Just wanted someone to confirm if it is in fact
> > working, or if it's just me.
> 
> I can imaging that it's far better for desktop locks - I moved onto it when
> it got updated.
> 
> Initially testing with Unigine Valley I thought it was going to be good - I
> got further than ever before (about 4x through all the scenes having not got
> through once previously), but it did lock.

That's a shame. I'll try and see if I can find out what has caused the lockups to stop for me; maybe that could help in finding out what's still causing them for you.
Comment 10 Alex Deucher 2015-09-02 20:09:42 UTC
Created attachment 118056 [details] [review]
possible fix

I think this patch should fix it.
Comment 11 Mathias Tillman 2015-09-02 20:40:04 UTC
(In reply to Alex Deucher from comment #10)
> Created attachment 118056 [details] [review]
> possible fix
> 
> I think this patch should fix it.

No luck here, I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
Comment 12 Christian König 2015-09-03 08:56:42 UTC
(In reply to Mathias Tillman from comment #11)
> No luck here I'm afraid - I'm having a hard time reproducing it during
> normal desktop usage (with or without the patch), but it did lockup while
> running Unigine Valley.

Assuming you can still access the box over the network after the lockup, please provide the output of the following as root:

cat /sys/kernel/debug/dri/0/amdgpu_fence_info
hexdump -s 0x14fc -n 4 /sys/kernel/debug/dri/0/amdgpu_regs
Comment 13 Andy Furniss 2015-09-03 10:24:11 UTC
(In reply to Mathias Tillman from comment #11)
> (In reply to Alex Deucher from comment #10)
> > Created attachment 118056 [details] [review]
> > possible fix
> > 
> > I think this patch should fix it.
> 
> No luck here I'm afraid - I'm having a hard time reproducing it during
> normal desktop usage (with or without the patch), but it did lockup while
> running Unigine Valley.

I see drm-next-4.3 is now ahead again, haven't tested that yet.

With the patch + drm-next-4.3-wip, I haven't yet managed to lock up Valley - but I've only had time to do a couple of runs (45 min then 90 min) from a clean boot. Maybe later, when I've been up a while doing other things, I'll try harder.

The patch doesn't apply with git apply - I applied it by hand.
Comment 14 Mathias Tillman 2015-09-03 11:34:56 UTC
Created attachment 118060 [details]
Output of amdgpu_regs and amdgpu_fence_info

I have attached the output of amdgpu_regs and amdgpu_fence_info. "Hang" was captured right after the hang happened, and "Normal" right after a reboot following the hang (for comparison).
Comment 15 Andy Furniss 2015-09-03 11:46:00 UTC
(In reply to Andy Furniss from comment #13)
> (In reply to Mathias Tillman from comment #11)
> > (In reply to Alex Deucher from comment #10)
> > > Created attachment 118056 [details] [review]
> > > possible fix
> > > 
> > > I think this patch should fix it.
> > 
> > No luck here I'm afraid - I'm having a hard time reproducing it during
> > normal desktop usage (with or without the patch), but it did lockup while
> > running Unigine Valley.
> 
> I see drm-next-4.3 is now ahead again, haven't tested that yet.
> 
> With patch + drm-next-4.3-wip, I haven't yet managed to lock valley - but
> I've only had time to do a couple of runs (45 min then 90 min) from a clean
> boot. Maybe later when I've been up a while doing other things I'll try
> harder.
> 
> Patch doesn't apply with git apply - did it by hand.

I managed to lock it up; it seems that doing "something" between runs changes things, or the first runs were just lucky.

FWIW I tried running Unreal 4.5 ElementalDemo after my long runs and I got a signal 7.

After Valley later locked/hung, I rebooted and tried Elemental again from a clean boot and it ran OK, but after quitting it now gives signal 7 again if I try to start it.
Comment 16 Martin Peres 2019-11-19 08:06:36 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/57.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.