Bug 97980

Summary: [amdgpu] New kernel warning during shutdown
Product: DRI
Reporter: Mike Lothian <mike>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: blocker
Priority: lowest
CC: ernstp, mike, pankaj.baware1
Version: XOrg git
Keywords: have-backtrace
Hardware: IA64 (Itanium)
OS: NetBSD
URL: https://bugs.freedesktop.org
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=98638
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description          Flags
Screenshot           none
Updated screenshot   none
Dmesg                none
Updated screenshot   none
New Screenshot       none
Updated dmesg        none
possible fix         none
alternative patch    none

Description Mike Lothian 2016-09-29 18:31:16 UTC
Created attachment 126886 [details]
Screenshot

I might have spoken too soon about the memory manager patches: I'm seeing a stack trace just as the machine is about to switch off.

Also, it now takes about 30 seconds to switch off my laptop. I think it's amdgpu related: it seems to wait, fire up the card, then switch off. It could also be hard disk or even systemd related, though.

I'm attaching the screenshot; it looks like an issue with ttm_bo_force_list_clean.

Sorry about the bad quality, but I had to record a video in slow motion to capture it and then take a screenshot of that.
Comment 1 Alex Deucher 2016-09-29 18:41:35 UTC
Does cherry-picking this patch over help?
https://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=a951ed85abd4615e98e36b536e3b3b07b22a88ac
Comment 2 Mike Lothian 2016-09-29 19:03:33 UTC
Yes, that fixes it.

I've been having a more and more difficult time testing stuff of late. There have been quite a few regressions, and I've been carrying more and more patches across various branches - let's hope the next cycle will be better.

What's your handle on IRC?
Comment 3 Alex Deucher 2016-09-29 19:17:59 UTC
(In reply to Mike Lothian from comment #2)
> Yes, that fixes it.
> 
> I've been having a more and more difficult time testing stuff of late.
> There have been quite a few regressions, and I've been carrying more and more
> patches across various branches - let's hope the next cycle will be better.
> 

Well, bug fixes go to -fixes and new features go to -next.  If you want everything, you'd need to merge -fixes into -next.

> What's your handle on IRC?

agd5f
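
A rough sketch of the merge described above, for anyone following along. The helper name `merge_fixes_into_next` and the scratch branch name `testing` are hypothetical, and it assumes both branches have already been fetched into the local repository.

```shell
#!/bin/sh
# Hypothetical helper: build a scratch branch that contains a -next branch
# plus everything from a -fixes branch.
merge_fixes_into_next() {
    next_branch=$1
    fixes_branch=$2

    # Create (or reset) a scratch branch at the tip of -next, so the
    # original -next ref is left untouched.
    git checkout -B testing "$next_branch" &&
    # Merge the fixes on top; --no-edit accepts the default merge message.
    git merge --no-edit "$fixes_branch"
}
```

It would be called as `merge_fixes_into_next <next-branch> <fixes-branch>`, with the two names replaced by whichever -next and -fixes branches are being tracked. Conflicts, if any, still have to be resolved by hand.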
Comment 4 Mike Lothian 2016-09-29 21:24:12 UTC
Sorry, I spoke too soon: the issue is still there; it's just harder to see because the reboot is so quick now.
Comment 5 Andy Furniss 2016-09-30 15:24:12 UTC
Maybe a different issue, but I've just started getting shutdown issues with agd5f drm-next-4.9-wip.

It seems the monitor blanks early, so I don't get to see anything - just that with halt it doesn't power off.

On the current kernel, reverting

0ea8cba5ef7b783f11cb1a0b900b7c18d2ce0b6
drm/amdgpu: always apply pci shutdown callbacks (v2)

apparently fixes it, but it's not that simple. I first saw the issue on the 25th, but it went away with the next update the branch got, so I thought it was fixed. It reappeared with more recent updates.

Unfortunately, it seems my recent working kernel (26th) has the above commit - so maybe there is some interaction/timing issue with something else.
Comment 6 Mike Lothian 2016-10-16 14:04:17 UTC
I'm still seeing this issue on the 4.9-wip branch and that has this patch included:

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1708,11 +1708,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 
        DRM_INFO("amdgpu: finishing device.\n");
        adev->shutdown = true;
+       drm_crtc_force_disable_all(adev->ddev);
        /* evict vram memory */
        amdgpu_bo_evict_vram(adev);
        amdgpu_ib_pool_fini(adev);
        amdgpu_fence_driver_fini(adev);
-       drm_crtc_force_disable_all(adev->ddev);
        amdgpu_fbdev_fini(adev);
        r = amdgpu_fini(adev);
        kfree(adev->ip_block_status);
Comment 7 Mike Lothian 2016-10-16 14:06:17 UTC
Created attachment 127331 [details]
Updated screenshot
Comment 8 Mike Lothian 2016-10-17 01:16:36 UTC
OK, I followed the advice you gave in the other bug about compiling amdgpu as a module and got the following dmesg using:

modprobe -r amdgpu && dmesg > dmesg && sync
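
A minimal sketch of a slightly more defensive variant of the command above. Because the original chains everything with `&&`, the log is only written once `modprobe -r` returns; the sketch saves and syncs the kernel log before the action as well as after it, so the earlier messages are already on disk if the unload hangs. `capture_around` and the output file names are hypothetical, and reading the log with `dmesg` may need root on systems that restrict it.

```shell
#!/bin/sh
# Hypothetical helper: save the kernel log before and after running a command.
# Usage: capture_around <output-dir> <command> [args...]
capture_around() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1

    # Snapshot the log first and flush it to disk before doing anything risky.
    dmesg > "$outdir/dmesg-before.txt" || echo "warning: could not read kernel log" >&2
    sync

    # Run the requested command and remember its exit status.
    status=0
    "$@" || status=$?

    # If the machine is still alive, snapshot the log again.
    dmesg > "$outdir/dmesg-after.txt" || echo "warning: could not read kernel log" >&2
    sync

    return "$status"
}
```

For the case in this comment it might be run as `capture_around ./unload-logs modprobe -r amdgpu`. It does not capture messages printed after a hard freeze; those would still need a screenshot or some form of remote logging.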
Comment 9 Mike Lothian 2016-10-17 01:17:31 UTC
Created attachment 127340 [details]
Dmesg
Comment 10 Mike Lothian 2016-10-17 01:18:35 UTC
After I issue the modprobe -r amdgpu command, the system freezes up entirely.

I took a screenshot of the final messages - could this be TTM related?
Comment 11 Mike Lothian 2016-10-17 01:21:33 UTC
Created attachment 127341 [details]
Updated screenshot

This captures the BUG that freezes up the system
Comment 12 Mike Lothian 2016-10-27 18:35:39 UTC
Created attachment 127565 [details]
New Screenshot

The first stack trace in the dmesg is the same; the one captured after the system freezes up is slightly different.
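
When comparing traces between two saved logs, it can help to list where each trace starts. A small sketch, assuming the logs were saved as plain text; `list_trace_markers` is a hypothetical helper, and the marker strings are the common kernel ones (`WARNING:`, `BUG:`, `Call Trace:`), not an exhaustive set.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a saved kernel log that usually
# mark the start of a warning, a BUG, or a stack trace.
list_trace_markers() {
    # -n prefixes each match with its line number in the log file.
    # -E enables the alternation between the three marker strings.
    grep -n -E 'WARNING:|BUG:|Call Trace:' "$1"
}
```

Running it on the older and the newer log in turn gives two short lists that are easier to compare by eye than the full logs.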
Comment 13 Mike Lothian 2016-10-27 18:41:23 UTC
Created attachment 127566 [details]
Updated dmesg
Comment 14 Mike Lothian 2016-11-15 13:21:48 UTC
I've tested this again with the latest drm-next-4.10-wip branch and I still get the same errors
Comment 15 Alex Deucher 2016-12-06 15:45:43 UTC
Created attachment 128355 [details] [review]
possible fix

Does this patch help?
Comment 16 Mike Lothian 2016-12-06 18:53:42 UTC
It helps the original issue where I saw a panic / stack trace on shutdown and shutdown took a while - so that's great news.

I've retested compiling amdgpu as a module and modprobe -r(ing) it - this still kills my machine. Would you be interested in me collecting more diagnostics? Or can that now be considered a separate bug?
Comment 17 Alex Deucher 2016-12-06 18:57:02 UTC
(In reply to Mike Lothian from comment #16)
> It helps the original issue where I saw a panic / stack trace on shutdown
> and shutdown took a while - so that's great news
> 
> I've retested compiling amdgpu as a module and modprobe -r(ing) it - this
> still kills my machine, would you be interested in me taking more
> diagnostics? Or can that now be considered a separate bug?

Separate bug.  With this patch, the two code paths (module unload and shutdown) are now separate.
Comment 18 Alex Deucher 2016-12-07 20:28:03 UTC
*** Bug 98638 has been marked as a duplicate of this bug. ***
Comment 19 Alex Deucher 2016-12-07 20:28:57 UTC
Created attachment 128372 [details] [review]
alternative patch

Does this patch also work?
Comment 20 Mike Lothian 2016-12-08 07:43:49 UTC
So I removed your previous patch and applied the new one; I get a panic on shutdown again.
