Bug 97980 - [amdgpu] New kernel warning during shutdown
Summary: [amdgpu] New kernel warning during shutdown
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: IA64 (Itanium) NetBSD
Importance: lowest blocker
Assignee: Default DRI bug account
QA Contact:
URL: https://bugs.freedesktop.org
Whiteboard:
Keywords: have-backtrace
Duplicates: 98638
Depends on:
Blocks:
 
Reported: 2016-09-29 18:31 UTC by Mike Lothian
Modified: 2016-12-20 03:14 UTC
CC List: 3 users


Attachments
Screenshot (1.85 MB, image/png), 2016-09-29 18:31 UTC, Mike Lothian
Updated screenshot (2.34 MB, image/png), 2016-10-16 14:06 UTC, Mike Lothian
Dmesg (70.44 KB, text/plain), 2016-10-17 01:17 UTC, Mike Lothian
Updated screenshot (1.18 MB, image/jpeg), 2016-10-17 01:21 UTC, Mike Lothian
New Screenshot (2.77 MB, image/jpeg), 2016-10-27 18:35 UTC, Mike Lothian
Updated dmesg (70.75 KB, text/plain), 2016-10-27 18:41 UTC, Mike Lothian
possible fix (2.53 KB, patch), 2016-12-06 15:45 UTC, Alex Deucher
alternative patch (1.35 KB, patch), 2016-12-07 20:28 UTC, Alex Deucher

Description Mike Lothian 2016-09-29 18:31:16 UTC
Created attachment 126886
Screenshot

I might have spoken too soon about the memory manager patches: I'm seeing a stack trace just as the machine is about to switch off.

Also, it now takes about 30 seconds to switch off my laptop. I think it's amdgpu related: it seems to wait, then fire up the card, then switch off. It could also be hard disk or even systemd related, though.

I'm attaching the screenshot, but it looks like an issue with ttm_bo_force_list_clean.

Sorry about the bad quality, but I had to record a video in slow motion to capture it, then take a screenshot of that.
Comment 1 Alex Deucher 2016-09-29 18:41:35 UTC
Does cherry-picking this patch over help?
https://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=a951ed85abd4615e98e36b536e3b3b07b22a88ac
Comment 2 Mike Lothian 2016-09-29 19:03:33 UTC
Yes, that fixes it.

I've been having a more and more difficult time testing stuff of late. There have been quite a few regressions and I've been carrying more and more patches amongst various branches - let's hope the next cycle will be better.

What's your handle on IRC?
Comment 3 Alex Deucher 2016-09-29 19:17:59 UTC
(In reply to Mike Lothian from comment #2)
> Yes that fixes it
> 
> I've been having a more and more difficult time testing stuff of late,
> there's been quite a few regressions and I've been carrying more and more
> patches amongst various branches - lets hope the next cycle will be better
> 

Well, bug fixes go to -fixes and new features go to -next.  If you want everything, you'd need to merge -fixes into -next.

> What's your handle on IRC?

agd5f
Comment 4 Mike Lothian 2016-09-29 21:24:12 UTC
Sorry, I spoke too soon: the issue is still there, it's just harder to see because the reboot is so quick now.
Comment 5 Andy Furniss 2016-09-30 15:24:12 UTC
Maybe a different issue, but I've just started getting shutdown issues with agd5f drm-next-4.9-wip.

It seems the monitor blanks early, so I don't get to see anything; with halt, the machine just doesn't power off.

On the current kernel, reverting

0ea8cba5ef7b783f11cb1a0b900b7c18d2ce0b6
drm/amdgpu: always apply pci shutdown callbacks (v2)

apparently fixes it, but it's not that simple. I first saw the issue on the 25th, but it went away with the next update the branch got, so I thought it was fixed. It reappeared with more recent updates.

Unfortunately, it seems my recent working kernel (26th) has the above commit, so maybe there is some interaction/timing issue with something else.
Comment 6 Mike Lothian 2016-10-16 14:04:17 UTC
I'm still seeing this issue on the 4.9-wip branch, which already includes this patch:

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1708,11 +1708,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 
        DRM_INFO("amdgpu: finishing device.\n");
        adev->shutdown = true;
+       drm_crtc_force_disable_all(adev->ddev);
        /* evict vram memory */
        amdgpu_bo_evict_vram(adev);
        amdgpu_ib_pool_fini(adev);
        amdgpu_fence_driver_fini(adev);
-       drm_crtc_force_disable_all(adev->ddev);
        amdgpu_fbdev_fini(adev);
        r = amdgpu_fini(adev);
        kfree(adev->ip_block_status);
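
A note on the reordering in the diff above: the thread does not say why moving drm_crtc_force_disable_all() ahead of amdgpu_bo_evict_vram() should matter. One plausible reading, offered purely as an assumption, is that active CRTCs keep their scanout buffers pinned, so an eviction that runs first leaves those buffers behind. The toy C model below sketches that ordering effect. It is not kernel code, and every toy_* name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all toy_* names are hypothetical, not kernel APIs. */
struct toy_bo {
    bool in_vram;  /* buffer currently lives in VRAM */
    bool pinned;   /* pinned buffers cannot be evicted */
};

/* Assumption: disabling the display releases the scanout buffers it pinned. */
static void toy_disable_crtcs(struct toy_bo *bos, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bos[i].pinned = false;
}

/* Evict every unpinned buffer; return how many buffers stay in VRAM. */
static size_t toy_evict_vram(struct toy_bo *bos, size_t n)
{
    size_t left = 0;

    for (size_t i = 0; i < n; i++) {
        if (bos[i].in_vram && !bos[i].pinned)
            bos[i].in_vram = false;
        if (bos[i].in_vram)
            left++;
    }
    return left;
}

/* Run a teardown in either order; return the buffers left in VRAM. */
size_t toy_teardown(bool disable_crtcs_first)
{
    struct toy_bo bos[] = {
        { .in_vram = true, .pinned = true  },  /* scanout framebuffer */
        { .in_vram = true, .pinned = false },  /* ordinary buffer */
    };
    size_t n = sizeof(bos) / sizeof(bos[0]);
    size_t left;

    if (disable_crtcs_first)
        toy_disable_crtcs(bos, n);
    left = toy_evict_vram(bos, n);
    if (!disable_crtcs_first)
        toy_disable_crtcs(bos, n);  /* too late: eviction already ran */
    return left;
}
```

In this model toy_teardown(true) leaves nothing in VRAM, while toy_teardown(false) leaves the pinned framebuffer behind. Whether that is what triggers the ttm_bo_force_list_clean trace here is not established by the thread.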
Comment 7 Mike Lothian 2016-10-16 14:06:17 UTC
Created attachment 127331
Updated screenshot
Comment 8 Mike Lothian 2016-10-17 01:16:36 UTC
OK, I followed the advice you gave in the other bug about compiling amdgpu as a module, and got the following dmesg using:

modprobe -r amdgpu && dmesg > dmesg && sync
Comment 9 Mike Lothian 2016-10-17 01:17:31 UTC
Created attachment 127340
Dmesg
Comment 10 Mike Lothian 2016-10-17 01:18:35 UTC
After I issue the modprobe -r amdgpu command, the system freezes up entirely.

I took a screenshot of the final messages - could this be TTM related?
Comment 11 Mike Lothian 2016-10-17 01:21:33 UTC
Created attachment 127341
Updated screenshot

This captures the BUG that freezes up the system
Comment 12 Mike Lothian 2016-10-27 18:35:39 UTC
Created attachment 127565
New Screenshot

The first stack trace in the dmesg is the same; the one captured after the system freezes up is slightly different.
Comment 13 Mike Lothian 2016-10-27 18:41:23 UTC
Created attachment 127566
Updated dmesg
Comment 14 Mike Lothian 2016-11-15 13:21:48 UTC
I've tested this again with the latest drm-next-4.10-wip branch, and I still get the same errors.
Comment 15 Alex Deucher 2016-12-06 15:45:43 UTC
Created attachment 128355
possible fix

Does this patch help?
Comment 16 Mike Lothian 2016-12-06 18:53:42 UTC
It helps the original issue where I saw a panic / stack trace on shutdown and shutdown took a while - so that's great news.

I've retested compiling amdgpu as a module and running modprobe -r on it, and this still kills my machine. Would you be interested in me taking more diagnostics, or can that now be considered a separate bug?
Comment 17 Alex Deucher 2016-12-06 18:57:02 UTC
(In reply to Mike Lothian from comment #16)
> It helps the original issue where a saw a panic / stack trace on shutdown
> and shutdown took a while - so that's great news
> 
> I've retested compiling amdgpu as a module and modprobe -r(ing) it - this
> still kills my machine, would you be interested in me taking more
> diagnostics? Or can that now be considered a separate bug?

Separate bug.  With this patch, the two code paths (module unload and shutdown) are now separate.
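
As a rough illustration of what separate paths can look like: Linux's struct pci_driver has distinct .remove and .shutdown callbacks. The attached patch is not reproduced in this thread, so the sketch below is only a user-space toy model. It assumes that the shutdown path merely quiesces the hardware while the remove path does the full teardown. Every toy_* name is hypothetical.

```c
#include <stdbool.h>

/* Toy model only: all toy_* names are hypothetical, not kernel APIs. */
struct toy_dev {
    bool hw_active;            /* hardware is still running */
    bool resources_allocated;  /* driver state and memory still exist */
};

/* Module unload path: stop the hardware and free everything. */
static void toy_remove(struct toy_dev *dev)
{
    dev->hw_active = false;
    dev->resources_allocated = false;
}

/* System shutdown path (assumed behaviour): only quiesce the hardware. */
static void toy_shutdown(struct toy_dev *dev)
{
    dev->hw_active = false;
}

/* Two independent callbacks, loosely mirroring .remove and .shutdown. */
struct toy_driver_ops {
    void (*remove)(struct toy_dev *dev);
    void (*shutdown)(struct toy_dev *dev);
};

const struct toy_driver_ops toy_ops = {
    .remove = toy_remove,
    .shutdown = toy_shutdown,
};
```

With the callbacks split like this, a failure in the unload path (the modprobe -r freeze) can be tracked independently of the shutdown path, which fits treating it as a separate bug.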
Comment 18 Alex Deucher 2016-12-07 20:28:03 UTC
*** Bug 98638 has been marked as a duplicate of this bug. ***
Comment 19 Alex Deucher 2016-12-07 20:28:57 UTC
Created attachment 128372
alternative patch

Does this patch also work?
Comment 20 Mike Lothian 2016-12-08 07:43:49 UTC
So I removed your previous patch and applied the new one; I get a panic on shutdown again.

