104142 – Stack trace in runpm when Tonga card powers down

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2017-12-06 10:32 UTC by Mike Lothian
Modified:	2018-02-27 01:18 UTC
CC List:	4 users

Attachments
Dmesg (84.07 KB, text/plain) 2017-12-06 10:32 UTC, Mike Lothian

Description Mike Lothian 2017-12-06 10:32:34 UTC

Created attachment 135997
Dmesg

[ 9087.801615] WARNING: CPU: 6 PID: 1002 at dm_suspend+0x49/0x50
[ 9087.801617] Modules linked in:
[ 9087.801620] CPU: 6 PID: 1002 Comm: kworker/6:0 Tainted: G        W        4.15.0-rc2-agd5f+ #300
[ 9087.801621] Hardware name: Alienware Alienware 15 R2/0H6J09, BIOS 1.3.12 07/28/2017
[ 9087.801622] Workqueue: pm pm_runtime_work
[ 9087.801623] task: 00000000fc3d3872 task.stack: 00000000a1a771ba
[ 9087.801624] RIP: 0010:dm_suspend+0x49/0x50
[ 9087.801625] RSP: 0018:ffffc900000fbca0 EFLAGS: 00010282
[ 9087.801626] RAX: 0000000000000000 RBX: ffff88089c9e0000 RCX: 0000000000000000
[ 9087.801627] RDX: 0000000000000001 RSI: 0000000000000282 RDI: ffff88089c9ea8f0
[ 9087.801627] RBP: 0000000000000003 R08: 00000000c0000000 R09: ffffffff824e7648
[ 9087.801628] R10: ffffea001e7c8020 R11: ffff8808c1d19480 R12: ffff88089c9e0000
[ 9087.801628] R13: ffffffff82241e98 R14: 0000000000000004 R15: ffffffff816b7670
[ 9087.801629] FS:  0000000000000000(0000) GS:ffff8808c1d80000(0000) knlGS:0000000000000000
[ 9087.801630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9087.801631] CR2: 00007f43c544dbb8 CR3: 000000000240a004 CR4: 00000000001606e0
[ 9087.801631] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9087.801632] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9087.801632] Call Trace:
[ 9087.801636]  amdgpu_suspend+0x61/0x170
[ 9087.801637]  amdgpu_device_suspend+0x195/0x390
[ 9087.801639]  ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801640]  amdgpu_pmops_runtime_suspend+0x4d/0xc0
[ 9087.801642]  pci_pm_runtime_suspend+0x4d/0x120
[ 9087.801644]  vga_switcheroo_runtime_suspend+0x19/0x90
[ 9087.801645]  __rpm_callback+0xb5/0x1e0
[ 9087.801647]  ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801648]  rpm_callback+0x1a/0x70
[ 9087.801649]  ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801650]  rpm_suspend+0x124/0x650
[ 9087.801652]  pm_runtime_work+0x58/0x80
[ 9087.801653]  process_one_work+0x1d5/0x3d0
[ 9087.801655]  worker_thread+0x42/0x3e0
[ 9087.801656]  kthread+0xf0/0x130
[ 9087.801657]  ? cancel_delayed_work+0x10/0x10
[ 9087.801658]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 9087.801660]  ret_from_fork+0x1f/0x30
[ 9087.801661] Code: a9 00 00 00 75 25 48 8b 7b 08 e8 93 4f f0 ff 48 8b bb 00 92 00 00 be 08 00 00 00 48 89 83 68 a9 00 00 e8 bb 4f 04 00 31 c0 5b c3 <0f> ff eb d7 0f 1f 00 53 48 89 fb 48 8b bf 20 92 00 00 e8 20 8d
[ 9087.801675] ---[ end trace 8e3cd942fb9ca189 ]---
[ 9087.996944] amdgpu 0000:01:00.0: GPU pci config reset


I'm seeing these warnings a lot, including on Linus's tree.

I'll try to bisect later, though I have a funny feeling it might be DC related.
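
To put a number on how often the warning fires, a saved dmesg can be scanned for WARNING lines and tallied by the function that raised them. The following is only a sketch: the helper name `warning_sites` and the regular expression are assumptions, and the pattern only targets the line layout seen in the traces in this report (other kernel versions may print a file path before the symbol).

```python
import re
from collections import Counter

# Matches lines shaped like the ones in this report, e.g.
# "[ 9087.801615] WARNING: CPU: 6 PID: 1002 at dm_suspend+0x49/0x50"
WARNING_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] WARNING: CPU: (?P<cpu>\d+) "
    r"PID: (?P<pid>\d+) at (?P<site>\S+)"
)


def warning_sites(dmesg_text):
    """Count WARNING lines per symbol (hypothetical helper, not a kernel tool)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            # "dm_suspend+0x49/0x50" -> "dm_suspend"
            symbol = match.group("site").split("+", 1)[0]
            counts[symbol] += 1
    return counts


# Sample input: WARNING lines copied from the traces in this report.
SAMPLE = """\
[ 9087.801615] WARNING: CPU: 6 PID: 1002 at dm_suspend+0x49/0x50
[ 9087.801617] Modules linked in:
[ 8481.834519] WARNING: CPU: 2 PID: 62 at dm_suspend+0x49/0x50
"""

if __name__ == "__main__":
    for symbol, count in warning_sites(SAMPLE).items():
        print(symbol, count)
```

Feeding it the output of `dmesg` from before and after a kernel change gives a rough way to compare how often the runtime-suspend path trips the warning.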

Comment 1 Harry Wentland 2017-12-07 14:15:33 UTC

We do a WARN_ON(adev->dm.cached_state). We shouldn't have a cached state when doing suspend. Not sure right now why this is happening. Is this with an Intel iGPU + AMD dGPU laptop?
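
As an illustration of the invariant behind that check, here is a toy Python model. It is not the driver code: the `DisplayManager` class, its methods, and `count_warnings` are hypothetical names, and the only detail taken from this report is that suspend warns when a cached state is already present. It shows one way such a check could fire (two suspends with no release in between) and is not a diagnosis of this bug.

```python
import warnings


class DisplayManager:
    """Toy stand-in for the display-manager state (hypothetical)."""

    def __init__(self):
        self.cached_state = None  # models adev->dm.cached_state

    def suspend(self, current_state):
        # Models WARN_ON(adev->dm.cached_state): a leftover snapshot means
        # an earlier suspend was never paired with a release.
        if self.cached_state is not None:
            warnings.warn("cached_state still set on suspend", RuntimeWarning)
        self.cached_state = current_state

    def resume(self):
        # The paired path hands the snapshot back and clears the slot.
        state, self.cached_state = self.cached_state, None
        return state


def count_warnings(resume_between):
    """Suspend twice, optionally resuming in between; return warnings raised."""
    dm = DisplayManager()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        dm.suspend("state-1")
        if resume_between:
            dm.resume()
        dm.suspend("state-2")
    return len(caught)


if __name__ == "__main__":
    print("paired suspend/resume:", count_warnings(True))
    print("suspend without release:", count_warnings(False))
```

In this model the warning only appears when the second suspend finds the slot still occupied, which is the situation the check is there to flag.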

Comment 2 Mike Lothian 2017-12-07 15:36:54 UTC

Yes, it's an Intel Skylake iGPU with an AMD Tonga dGPU.

I just tested it with Alex's 4.16-wip branch:

[ 8476.275162] [drm] PCIE GART of 1024M enabled (table at 0x000000F400040000).
[ 8476.472088] [drm] UVD initialized successfully.
[ 8476.684149] [drm] VCE initialized successfully.
[ 8481.834519] WARNING: CPU: 2 PID: 62 at dm_suspend+0x49/0x50
[ 8481.834521] Modules linked in:
[ 8481.834523] CPU: 2 PID: 62 Comm: kworker/2:1 Not tainted 4.15.0-rc2-agd5f+ #303
[ 8481.834524] Hardware name: Alienware Alienware 15 R2/0H6J09, BIOS 1.3.12 07/28/2017
[ 8481.834525] Workqueue: pm pm_runtime_work
[ 8481.834527] task: 000000002f4790b3 task.stack: 000000003f1bbe2b
[ 8481.834528] RIP: 0010:dm_suspend+0x49/0x50
[ 8481.834529] RSP: 0018:ffffc90000273ca0 EFLAGS: 00010286
[ 8481.834530] RAX: 0000000000000000 RBX: ffff88089c9f0000 RCX: 0000000000000000
[ 8481.834530] RDX: 0000000000000001 RSI: 0000000000000282 RDI: ffff88089c9fa8f0
[ 8481.834531] RBP: 0000000000000003 R08: 00000000c0000000 R09: ffffffff824e80c8
[ 8481.834532] R10: ffffea0021940020 R11: ffff8808c1c99480 R12: ffff88089c9f0000
[ 8481.834532] R13: ffffffff82243fc0 R14: 0000000000000004 R15: ffffffff816b7630
[ 8481.834533] FS:  0000000000000000(0000) GS:ffff8808c1c80000(0000) knlGS:0000000000000000
[ 8481.834534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8481.834535] CR2: 00007fd9e466f000 CR3: 000000000240a005 CR4: 00000000001606e0
[ 8481.834535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8481.834536] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8481.834536] Call Trace:
[ 8481.834540]  amdgpu_suspend+0x61/0x170
[ 8481.834541]  amdgpu_device_suspend+0x195/0x390
[ 8481.834543]  ? vga_switcheroo_runtime_resume+0x50/0x50
[ 8481.834544]  amdgpu_pmops_runtime_suspend+0x4d/0xc0
[ 8481.834547]  pci_pm_runtime_suspend+0x4d/0x120
[ 8481.834548]  vga_switcheroo_runtime_suspend+0x19/0x90
[ 8481.834550]  __rpm_callback+0xb5/0x1e0
[ 8481.834551]  ? vga_switcheroo_runtime_resume+0x50/0x50
[ 8481.834552]  rpm_callback+0x1a/0x70
[ 8481.834554]  ? vga_switcheroo_runtime_resume+0x50/0x50
[ 8481.834555]  rpm_suspend+0x124/0x650
[ 8481.834556]  pm_runtime_work+0x58/0x80
[ 8481.834558]  process_one_work+0x1d5/0x3d0
[ 8481.834559]  worker_thread+0x42/0x3e0
[ 8481.834560]  kthread+0xf0/0x130
[ 8481.834562]  ? cancel_delayed_work+0x10/0x10
[ 8481.834562]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 8481.834564]  ret_from_fork+0x1f/0x30
[ 8481.834565] Code: a9 00 00 00 75 25 48 8b 7b 08 e8 d3 4f f0 ff 48 8b bb 00 92 00 00 be 08 00 00 00 48 89 83 68 a9 00 00 e8 bb 4f 04 00 31 c0 5b c3 <0f> ff eb d7 0f 1f 00 53 48 89 fb 48 8b bf 20 92 00 00 e8 60 8d
[ 8481.834579] ---[ end trace 86b596a21b2ff6ee ]---
[ 8482.018625] amdgpu 0000:01:00.0: GPU pci config reset

Is there anything else you need from me, or any debugging I could turn on?

Comment 3 Mike Lothian 2017-12-14 14:57:13 UTC

I bisected this back to:

d21becbe0225de0e2582d17d4fbc73fbd103b1f7 is the first bad commit
commit d21becbe0225de0e2582d17d4fbc73fbd103b1f7
Author: Tony Cheng <tony.cheng@amd.com>
Date:   Wed Jul 12 11:54:10 2017 -0400

    drm/amd/display: avoid disabling opp clk before hubp is blanked.

    Signed-off-by: Tony Cheng <tony.cheng@amd.com>
    Reviewed-by: Eric Yang <eric.yang2@amd.com>
    Acked-by: Harry Wentland <Harry.Wentland@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 61debba3cf73670d29975bc136d01862c2a54576 3d2315a1843d6276655b1550cb9f18fab47c5ce4 M      drivers

Comment 4 Mike Lothian 2017-12-15 00:57:01 UTC

I tried the bisect again and the result looks a little more promising:

0a214e2fb6b0a56519b6d5efab4b21475c233ee0 is the first bad commit
commit 0a214e2fb6b0a56519b6d5efab4b21475c233ee0
Author: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Date:   Thu Jul 13 10:56:48 2017 -0400

    drm/amd/display: Release cached atomic state in S3.
    
    Fixes memory leak.
    
    Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
    Reviewed-by: Tony Cheng <Tony.Cheng@amd.com>
    Acked-by: Harry Wentland <Harry.Wentland@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 494f25ce4ad407678f88d6c85128905762c9fbfb fb36845ef2ccca7bf823c9fec4d13d0a6e71ea2b M      drivers

That makes sense, as it's the commit that adds the WARN_ON. I guess that takes us back to the question of why there is a cached state.

Comment 5 frederik 2018-02-09 00:40:07 UTC

I ran into this bug when upgrading to 4.15.1. Anything I can do to help?

Comment 6 Harry Wentland 2018-02-27 00:46:10 UTC

We have a few new patches in our staging trees relating to suspend and driver unload.

Would you be able to try amd-staging-drm-next or drm-next-4.17-wip from https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.17-wip and see if the issue is fixed there?

Comment 7 Mike Lothian 2018-02-27 01:08:02 UTC

I'm running agd5f's drm-next-4.17-wip branch with https://patchwork.freedesktop.org/series/38985/ applied on top, along with https://github.com/FireBurn/KernelStuff/blob/master/05-remove-warn.patch, which removes the WARN_ON(adev->dm.cached_state);

I've now reverted the removal of the WARN_ON and the issue seems to be fixed, thanks.

Comment 8 Harry Wentland 2018-02-27 01:18:44 UTC

Thanks for the update.