Summary: | Stack trace in runpm when Tonga card powers down
---|---
Product: | DRI
Component: | DRM/AMDgpu
Status: | RESOLVED FIXED
Severity: | normal
Priority: | medium
Version: | XOrg git
Hardware: | Other
OS: | All
Reporter: | Mike Lothian <mike>
Assignee: | Default DRI bug account <dri-devel>
CC: | frederik.schwan, harry.wentland, jordan.lazare, mike

We do a WARN_ON(adev->dm.cached_state). We shouldn't have a cached state when doing suspend. Not sure right now why this is happening.

Is this with an Intel iGPU + AMD dGPU laptop?

Yes it's Intel Skylake and AMD Tonga

Just tested it with Alex's 4.16-wip branch

[ 8476.275162] [drm] PCIE GART of 1024M enabled (table at 0x000000F400040000).
[ 8476.472088] [drm] UVD initialized successfully.
[ 8476.684149] [drm] VCE initialized successfully.
[ 8481.834519] WARNING: CPU: 2 PID: 62 at dm_suspend+0x49/0x50
[ 8481.834521] Modules linked in:
[ 8481.834523] CPU: 2 PID: 62 Comm: kworker/2:1 Not tainted 4.15.0-rc2-agd5f+ #303
[ 8481.834524] Hardware name: Alienware Alienware 15 R2/0H6J09, BIOS 1.3.12 07/28/2017
[ 8481.834525] Workqueue: pm pm_runtime_work
[ 8481.834527] task: 000000002f4790b3 task.stack: 000000003f1bbe2b
[ 8481.834528] RIP: 0010:dm_suspend+0x49/0x50
[ 8481.834529] RSP: 0018:ffffc90000273ca0 EFLAGS: 00010286
[ 8481.834530] RAX: 0000000000000000 RBX: ffff88089c9f0000 RCX: 0000000000000000
[ 8481.834530] RDX: 0000000000000001 RSI: 0000000000000282 RDI: ffff88089c9fa8f0
[ 8481.834531] RBP: 0000000000000003 R08: 00000000c0000000 R09: ffffffff824e80c8
[ 8481.834532] R10: ffffea0021940020 R11: ffff8808c1c99480 R12: ffff88089c9f0000
[ 8481.834532] R13: ffffffff82243fc0 R14: 0000000000000004 R15: ffffffff816b7630
[ 8481.834533] FS: 0000000000000000(0000) GS:ffff8808c1c80000(0000) knlGS:0000000000000000
[ 8481.834534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8481.834535] CR2: 00007fd9e466f000 CR3: 000000000240a005 CR4: 00000000001606e0
[ 8481.834535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8481.834536] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8481.834536] Call Trace:
[ 8481.834540] amdgpu_suspend+0x61/0x170
[ 8481.834541] amdgpu_device_suspend+0x195/0x390
[ 8481.834543] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 8481.834544] amdgpu_pmops_runtime_suspend+0x4d/0xc0
[ 8481.834547] pci_pm_runtime_suspend+0x4d/0x120
[ 8481.834548] vga_switcheroo_runtime_suspend+0x19/0x90
[ 8481.834550] __rpm_callback+0xb5/0x1e0
[ 8481.834551] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 8481.834552] rpm_callback+0x1a/0x70
[ 8481.834554] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 8481.834555] rpm_suspend+0x124/0x650
[ 8481.834556] pm_runtime_work+0x58/0x80
[ 8481.834558] process_one_work+0x1d5/0x3d0
[ 8481.834559] worker_thread+0x42/0x3e0
[ 8481.834560] kthread+0xf0/0x130
[ 8481.834562] ? cancel_delayed_work+0x10/0x10
[ 8481.834562] ? kthread_create_worker_on_cpu+0x70/0x70
[ 8481.834564] ret_from_fork+0x1f/0x30
[ 8481.834565] Code: a9 00 00 00 75 25 48 8b 7b 08 e8 d3 4f f0 ff 48 8b bb 00 92 00 00 be 08 00 00 00 48 89 83 68 a9 00 00 e8 bb 4f 04 00 31 c0 5b c3 <0f> ff eb d7 0f 1f 00 53 48 89 fb 48 8b bf 20 92 00 00 e8 60 8d
[ 8481.834579] ---[ end trace 86b596a21b2ff6ee ]---
[ 8482.018625] amdgpu 0000:01:00.0: GPU pci config reset

Is there anything else you need from me? Or debugging I could turn on?

I bisected this back to:

d21becbe0225de0e2582d17d4fbc73fbd103b1f7 is the first bad commit
commit d21becbe0225de0e2582d17d4fbc73fbd103b1f7
Author: Tony Cheng <tony.cheng@amd.com>
Date: Wed Jul 12 11:54:10 2017 -0400

drm/amd/display: avoid disabling opp clk before hubp is blanked.

Signed-off-by: Tony Cheng <tony.cheng@amd.com>
Reviewed-by: Eric Yang <eric.yang2@amd.com>
Acked-by: Harry Wentland <Harry.Wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 61debba3cf73670d29975bc136d01862c2a54576 3d2315a1843d6276655b1550cb9f18fab47c5ce4 M drivers

Tried again and it looks a little more promising:

0a214e2fb6b0a56519b6d5efab4b21475c233ee0 is the first bad commit
commit 0a214e2fb6b0a56519b6d5efab4b21475c233ee0
Author: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Date: Thu Jul 13 10:56:48 2017 -0400

drm/amd/display: Release cached atomic state in S3.

Fixes memory leak.

Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Reviewed-by: Tony Cheng <Tony.Cheng@amd.com>
Acked-by: Harry Wentland <Harry.Wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 494f25ce4ad407678f88d6c85128905762c9fbfb fb36845ef2ccca7bf823c9fec4d13d0a6e71ea2b M drivers

Which makes sense as that's the commit that adds the WARN_ON, I guess that takes us back to why is there a cached state

I ran into this bug when upgrading to 4.15.1. Anything I can do to help?

We have a few new patches in our staging trees relating to suspend and driver unload. Would you be able to try amd-staging-drm-next or drm-next-4.17-wip from https://cgit.freedesktop.org/~agd5f/linux/?h=drm-next-4.17-wip and see if the issue is fixed there?

I'm running agd5f's drm-next-4.17-wip branch with https://patchwork.freedesktop.org/series/38985/ applied on top along with https://github.com/FireBurn/KernelStuff/blob/master/05-remove-warn.patch removing the WARN_ON(adev->dm.cached_state);

I've reverted the removal of the WARN_ON and it seems to be fixed thanks

Thanks for the update.
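
The invariant behind WARN_ON(adev->dm.cached_state) can be illustrated with a toy model. This is only a sketch, not the amdgpu code: `DisplayManager`, `suspend`, `resume` and `count_warnings` are hypothetical names, and the idea that resume is what consumes and clears the cached state is an assumption, not something stated in this thread. It only shows why a suspend that finds a leftover cached state points at an unbalanced suspend/resume sequence.

```python
import warnings


class DisplayManager:
    """Toy stand-in for the driver's display-manager state (hypothetical)."""

    def __init__(self):
        # Plays the role of adev->dm.cached_state in the discussion above.
        self.cached_state = None

    def suspend(self, current_state):
        # Mirrors the WARN_ON: a state left over from an earlier suspend
        # means the previous cycle never cleaned up after itself.
        if self.cached_state is not None:
            warnings.warn("cached_state already set on suspend", RuntimeWarning)
        self.cached_state = current_state

    def resume(self):
        # Assumption: resume hands the saved state back and clears the cache.
        state, self.cached_state = self.cached_state, None
        return state


def count_warnings(steps):
    """Run a sequence of 'suspend'/'resume' steps and count the warnings."""
    dm = DisplayManager()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for step in steps:
            if step == "suspend":
                dm.suspend({"crtc": "on"})
            else:
                dm.resume()
    return len(caught)


if __name__ == "__main__":
    print(count_warnings(["suspend", "resume", "suspend", "resume"]))
    print(count_warnings(["suspend", "suspend"]))
```

In this model a balanced suspend/resume sequence stays quiet, while two suspends in a row trip the warning, which is the situation the stack trace suggests the runtime PM path was hitting.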
Created attachment 135997
Dmesg

[ 9087.801615] WARNING: CPU: 6 PID: 1002 at dm_suspend+0x49/0x50
[ 9087.801617] Modules linked in:
[ 9087.801620] CPU: 6 PID: 1002 Comm: kworker/6:0 Tainted: G W 4.15.0-rc2-agd5f+ #300
[ 9087.801621] Hardware name: Alienware Alienware 15 R2/0H6J09, BIOS 1.3.12 07/28/2017
[ 9087.801622] Workqueue: pm pm_runtime_work
[ 9087.801623] task: 00000000fc3d3872 task.stack: 00000000a1a771ba
[ 9087.801624] RIP: 0010:dm_suspend+0x49/0x50
[ 9087.801625] RSP: 0018:ffffc900000fbca0 EFLAGS: 00010282
[ 9087.801626] RAX: 0000000000000000 RBX: ffff88089c9e0000 RCX: 0000000000000000
[ 9087.801627] RDX: 0000000000000001 RSI: 0000000000000282 RDI: ffff88089c9ea8f0
[ 9087.801627] RBP: 0000000000000003 R08: 00000000c0000000 R09: ffffffff824e7648
[ 9087.801628] R10: ffffea001e7c8020 R11: ffff8808c1d19480 R12: ffff88089c9e0000
[ 9087.801628] R13: ffffffff82241e98 R14: 0000000000000004 R15: ffffffff816b7670
[ 9087.801629] FS: 0000000000000000(0000) GS:ffff8808c1d80000(0000) knlGS:0000000000000000
[ 9087.801630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9087.801631] CR2: 00007f43c544dbb8 CR3: 000000000240a004 CR4: 00000000001606e0
[ 9087.801631] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9087.801632] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9087.801632] Call Trace:
[ 9087.801636] amdgpu_suspend+0x61/0x170
[ 9087.801637] amdgpu_device_suspend+0x195/0x390
[ 9087.801639] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801640] amdgpu_pmops_runtime_suspend+0x4d/0xc0
[ 9087.801642] pci_pm_runtime_suspend+0x4d/0x120
[ 9087.801644] vga_switcheroo_runtime_suspend+0x19/0x90
[ 9087.801645] __rpm_callback+0xb5/0x1e0
[ 9087.801647] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801648] rpm_callback+0x1a/0x70
[ 9087.801649] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801650] rpm_suspend+0x124/0x650
[ 9087.801652] pm_runtime_work+0x58/0x80
[ 9087.801653] process_one_work+0x1d5/0x3d0
[ 9087.801655] worker_thread+0x42/0x3e0
[ 9087.801656] kthread+0xf0/0x130
[ 9087.801657] ? cancel_delayed_work+0x10/0x10
[ 9087.801658] ? kthread_create_worker_on_cpu+0x70/0x70
[ 9087.801660] ret_from_fork+0x1f/0x30
[ 9087.801661] Code: a9 00 00 00 75 25 48 8b 7b 08 e8 93 4f f0 ff 48 8b bb 00 92 00 00 be 08 00 00 00 48 89 83 68 a9 00 00 e8 bb 4f 04 00 31 c0 5b c3 <0f> ff eb d7 0f 1f 00 53 48 89 fb 48 8b bf 20 92 00 00 e8 20 8d
[ 9087.801675] ---[ end trace 8e3cd942fb9ca189 ]---
[ 9087.996944] amdgpu 0000:01:00.0: GPU pci config reset

I'm seeing these a lot, also in Linus's tree

Will try and bisect later though I have a funny feeling it might be DC related
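
For anyone reading the trace above, a small helper can pull out the call chain and drop the entries marked with "?", which the kernel prints for addresses found on the stack that the unwinder could not confirm as part of the chain. This is a rough sketch only: `FRAME_RE`, `reliable_frames` and `SAMPLE` are hypothetical names, and the pattern assumes the `[ timestamp ] symbol+0xoff/0xsize` layout seen in this report.

```python
import re

# One call-trace entry: "[ ts ]", an optional "? " marker, then symbol+0xoff/0xsize.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(dmesg_text):
    """Return the symbols listed after 'Call Trace:' that lack the '?' marker."""
    _, _, trace = dmesg_text.partition("Call Trace:")
    return [m.group(2) for m in FRAME_RE.finditer(trace) if not m.group(1)]


# A few entries copied from the trace in this report.
SAMPLE = """\
[ 9087.801632] Call Trace:
[ 9087.801636] amdgpu_suspend+0x61/0x170
[ 9087.801637] amdgpu_device_suspend+0x195/0x390
[ 9087.801639] ? vga_switcheroo_runtime_resume+0x50/0x50
[ 9087.801640] amdgpu_pmops_runtime_suspend+0x4d/0xc0
[ 9087.801642] pci_pm_runtime_suspend+0x4d/0x120
"""

if __name__ == "__main__":
    for symbol in reliable_frames(SAMPLE):
        print(symbol)
```

Read bottom to top, the remaining entries show the warning being reached from the runtime PM worker (pm_runtime_work) rather than from a system-wide suspend.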