Summary: | "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921 | ||
---|---|---|---|
Product: | DRI | Reporter: | dwagner <jb5sgc1n.nya> |
Component: | DRM/AMDgpu | Assignee: | Andrey Grodzovsky <andrey.grodzovsky> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | andrey.grodzovsky, a_ruhier, ckoenig.leichtzumerken, samuel, wes |
Version: | DRI git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Description
dwagner
2018-06-28 19:33:52 UTC
Created attachment 140383 [details]
dmesg of the system boot and before and at the crash at S3 resume
(Just for reference: This bug report is for a different kind of S3-resume-crash than reported in https://bugs.freedesktop.org/show_bug.cgi?id=103277)

Can you use addr2line or gdb with 'list' command to give the line number matching amdgpu_vm_cpu_set_ptes+0x76/0xf0?

(In reply to Andrey Grodzovsky from comment #3)
That would have been easy had I used my self-compiled kernel - but it seems there is no debuginfo file available for the Arch Linux supplied kernels, which I ran in this case. So I can only provide a disassembled listing of that function, with offset 0x76 aka +118 inside:

Dump of assembler code for function amdgpu_vm_cpu_set_ptes:
0x0000000000027c80 <+0>: callq 0x27c85 <amdgpu_vm_cpu_set_ptes+5>
0x0000000000027c85 <+5>: push %r15
0x0000000000027c87 <+7>: mov %rcx,%r15
0x0000000000027c8a <+10>: push %r14
0x0000000000027c8c <+12>: mov %rdi,%r14
0x0000000000027c8f <+15>: mov %rsi,%rdi
0x0000000000027c92 <+18>: push %r13
0x0000000000027c94 <+20>: mov %r8d,%r13d
0x0000000000027c97 <+23>: push %r12
0x0000000000027c99 <+25>: mov %rdx,%r12
0x0000000000027c9c <+28>: push %rbp
0x0000000000027c9d <+29>: mov %r9d,%ebp
0x0000000000027ca0 <+32>: push %rbx
0x0000000000027ca1 <+33>: callq 0x27ca6 <amdgpu_vm_cpu_set_ptes+38>
0x0000000000027ca6 <+38>: add %rax,%r12
0x0000000000027ca9 <+41>: nopl 0x0(%rax,%rax,1)
0x0000000000027cae <+46>: xor %ebx,%ebx
0x0000000000027cb0 <+48>: test %r13d,%r13d
0x0000000000027cb3 <+51>: je 0x27cfb <amdgpu_vm_cpu_set_ptes+123>
0x0000000000027cb5 <+53>: mov 0x28(%r14),%rax
0x0000000000027cb9 <+57>: mov %r15,%rcx
0x0000000000027cbc <+60>: test %rax,%rax
0x0000000000027cbf <+63>: je 0x27cd3 <amdgpu_vm_cpu_set_ptes+83>
0x0000000000027cc1 <+65>: mov %r15,%rdx
0x0000000000027cc4 <+68>: mov $0xfffffffffffff000,%rcx
0x0000000000027ccb <+75>: shr $0xc,%rdx
0x0000000000027ccf <+79>: and (%rax,%rdx,8),%rcx
0x0000000000027cd3 <+83>: mov (%r14),%rdi
0x0000000000027cd6 <+86>: mov %ebx,%edx
0x0000000000027cd8 <+88>: add $0x1,%ebx
0x0000000000027cdb <+91>: mov 0x38(%rsp),%r8
0x0000000000027ce0 <+96>: mov %r12,%rsi
0x0000000000027ce3 <+99>: add %rbp,%r15
0x0000000000027ce6 <+102>: mov 0x968(%rdi),%rax
0x0000000000027ced <+109>: mov 0x18(%rax),%rax
0x0000000000027cf1 <+113>: callq 0x27cf6 <amdgpu_vm_cpu_set_ptes+118>
0x0000000000027cf6 <+118>: cmp %ebx,%r13d
0x0000000000027cf9 <+121>: jne 0x27cb5 <amdgpu_vm_cpu_set_ptes+53>
0x0000000000027cfb <+123>: pop %rbx
0x0000000000027cfc <+124>: pop %rbp
0x0000000000027cfd <+125>: pop %r12
0x0000000000027cff <+127>: pop %r13
0x0000000000027d01 <+129>: pop %r14
0x0000000000027d03 <+131>: pop %r15
0x0000000000027d05 <+133>: retq
0x0000000000027d06 <+134>: mov %gs:0x0(%rip),%eax # 0x27d0d <amdgpu_vm_cpu_set_ptes+141>
0x0000000000027d0d <+141>: mov %eax,%eax
0x0000000000027d0f <+143>: bt %rax,0x0(%rip) # 0x27d17 <amdgpu_vm_cpu_set_ptes+151>
0x0000000000027d17 <+151>: jae 0x27cae <amdgpu_vm_cpu_set_ptes+46>
0x0000000000027d19 <+153>: incl %gs:0x0(%rip) # 0x27d20 <amdgpu_vm_cpu_set_ptes+160>
0x0000000000027d20 <+160>: mov 0x0(%rip),%rbx # 0x27d27 <amdgpu_vm_cpu_set_ptes+167>
0x0000000000027d27 <+167>: test %rbx,%rbx
0x0000000000027d2a <+170>: je 0x27d55 <amdgpu_vm_cpu_set_ptes+213>
0x0000000000027d2c <+172>: mov (%rbx),%rax
0x0000000000027d2f <+175>: mov 0x8(%rbx),%rdi
0x0000000000027d33 <+179>: add $0x18,%rbx
0x0000000000027d37 <+183>: mov 0x38(%rsp),%r9
0x0000000000027d3c <+188>: mov %ebp,%r8d
0x0000000000027d3f <+191>: mov %r13d,%ecx
0x0000000000027d42 <+194>: mov %r15,%rdx
0x0000000000027d45 <+197>: mov %r12,%rsi
0x0000000000027d48 <+200>: callq 0x27d4d <amdgpu_vm_cpu_set_ptes+205>
0x0000000000027d4d <+205>: mov (%rbx),%rax
0x0000000000027d50 <+208>: test %rax,%rax
0x0000000000027d53 <+211>: jne 0x27d2f <amdgpu_vm_cpu_set_ptes+175>
0x0000000000027d55 <+213>: decl %gs:0x0(%rip) # 0x27d5c <amdgpu_vm_cpu_set_ptes+220>
0x0000000000027d5c <+220>: jne 0x27cae <amdgpu_vm_cpu_set_ptes+46>
0x0000000000027d62 <+226>: callq 0x27d67 <amdgpu_vm_cpu_set_ptes+231>
0x0000000000027d67 <+231>: jmpq 0x27cae <amdgpu_vm_cpu_set_ptes+46>

Interesting: With amd-staging-drm-next, I see the same crash at https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c?h=amd-staging-drm-next#n921 with the same backtrace with vm_update_mode=3 immediately upon starting X11 - not only after S3 resume. Here with symbols translated to source lines:

Jun 29 01:49:05 ryzen kernel: amdgpu_vm_cpu_set_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:921 (discriminator 2)) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_vm_update_directories (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:989 /drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1096) amdgpu
Jun 29 01:49:05 ryzen kernel: ? amdgpu_vm_do_copy_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:913) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_gem_va_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:542 /drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:674) amdgpu
Jun 29 01:49:05 ryzen kernel: ? __alloc_pages_nodemask (/mm/page_alloc.c:4355)
Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
Jun 29 01:49:05 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 drm
Jun 29 01:49:05 ryzen kernel: drm_ioctl+0x2f1/0x3c0 drm
Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_drm_ioctl (/./include/linux/pm_runtime.h:108 /drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:842) amdgpu
Jun 29 01:49:05 ryzen kernel: do_vfs_ioctl (/fs/ioctl.c:46 /fs/ioctl.c:500 /fs/ioctl.c:684)
Jun 29 01:49:05 ryzen kernel: ? handle_mm_fault (/mm/memory.c:4133)
Jun 29 01:49:05 ryzen kernel: ksys_ioctl (/./include/linux/file.h:39 /fs/ioctl.c:702)
Jun 29 01:49:05 ryzen kernel: __x64_sys_ioctl (/fs/ioctl.c:708 /fs/ioctl.c:706 /fs/ioctl.c:706)
Jun 29 01:49:05 ryzen kernel: do_syscall_64 (/arch/x86/entry/common.c:290)
Jun 29 01:49:05 ryzen kernel: entry_SYSCALL_64_after_hwframe (/./include/trace/events/initcall.h:10 /./include/trace/events/initcall.h:10)

(In reply to dwagner from comment #5)
So with Arch Linux kernel it happens only during S3 but with amd-staging-drm-next it happens once you start X?

(In reply to Andrey Grodzovsky from comment #6)
Yes.
I know it sounds strange, but it's currently 100% reproducible for me:

Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0:
X11 starts fine, but system crashes after minutes of firefox browsing

Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3:
X11 starts fine, system does not crash (for at least hours of use) but crashes as above if resumed from S3 sleep

Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=0:
X11 starts fine, but system crashes after minutes of firefox browsing

Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=3:
X11 does not start, crashes immediately with the same above pasted kernel BUG message and backtrace

So something with CPU-based vm_update_mode is broken, but in a different way than the SDMA-based method.

I will change the subject of this report to reflect that this crash is not necessarily S3-resume-related.

(In reply to dwagner from comment #7)
I am going to try to reproduce the crash with CPU update mode here. Please describe exactly which ASIC you are using.

(In reply to Andrey Grodzovsky from comment #8)
Got it already.

Created attachment 140418 [details] [review]
drm/amdgpu: Verify root PD is mapped into kernel address space.

dwagner, please try this patch. It fixes the issue for me, and I observed no suspend/resume issues.

Christian, please take a look at the patch. The problem was that in amdgpu_vm_update_directories the parent BO didn't have a kernel mapping, and so later, inside amdgpu_vm_cpu_set_ptes, pe += (unsigned long)amdgpu_bo_kptr(bo); would equal 0000000000002000, since amdgpu_bo_kptr for the parent would return NULL. The parent was the root PD.

This was still working in 67b8d5c Linus Torvalds 7 weeks ago Linux 4.17-rc5 (tag: v4.17-rc5), but I wasn't able to pinpoint exactly which change broke it.
I am not sure my fix is the right one, so please advise.

(In reply to Andrey Grodzovsky from comment #10)
No idea when that broke either; CPU-based updates are not something we usually test. Anyway, it's a good catch, but I would rather add that to amdgpu_vm_bo_base_init() (with the appropriate checks). That would also allow us to remove the duplicated code from amdgpu_vm_alloc_levels().

(In reply to Andrey Grodzovsky from comment #10)
While I can start X11 with this patch applied to current amd-staging-drm-next, attempts to resume from S3 fail consistently.

The following related output is emitted right before the suspend:

Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend to debug)
Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...

(I wonder if that "[TTM] Buffer eviction failed" is a bad sign - as I have seen it some other times in conjunction with heavy uses of the amdgpu driver.)

Then, upon resume, the following messages are emitted:

Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 148 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 189 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 306 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 5e ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 18a ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 148 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Jul 02 21:31:33 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22
Jul 02 21:31:33 ryzen kernel: OOM killer enabled.
Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done.
Jul 02 21:31:33 ryzen kernel: PM: suspend exit
Jul 02 21:31:33 ryzen kernel: BUG: unable to handle kernel paging request at 0000000000001000
Jul 02 21:31:33 ryzen kernel: PGD 0 P4D 0
Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP
Jul 02 21:31:33 ryzen kernel: CPU: 14 PID: 791 Comm: amdgpu_cs:0 Tainted: G W O 4.18.0-rc1-amd+ #45
Jul 02 21:31:33 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1
Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000
Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000
Jul 02 21:31:33 ryzen kernel: FS: 00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000
Jul 02 21:31:33 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0
Jul 02 21:31:33 ryzen kernel: Call Trace:
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_frag_ptes+0xae/0x130 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_bo_update_mapping+0xed/0x410 [amdgpu]
Jul 02 21:31:33 ryzen kernel: ? amdgpu_vm_do_copy_ptes+0xa0/0xa0 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_bo_update+0x310/0x680 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_cs_ioctl+0x1092/0x1a50 [amdgpu]
Jul 02 21:31:33 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jul 02 21:31:33 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 [drm]
Jul 02 21:31:33 ryzen kernel: drm_ioctl+0x2f1/0x3c0 [drm]
Jul 02 21:31:33 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jul 02 21:31:33 ryzen kernel: do_vfs_ioctl+0xa4/0x620
Jul 02 21:31:33 ryzen kernel: ? __se_sys_futex+0x138/0x180
Jul 02 21:31:33 ryzen kernel: ksys_ioctl+0x60/0x90
Jul 02 21:31:33 ryzen kernel: __x64_sys_ioctl+0x16/0x20
Jul 02 21:31:33 ryzen kernel: do_syscall_64+0x48/0xf0
Jul 02 21:31:33 ryzen kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 02 21:31:33 ryzen kernel: RIP: 0033:0x7f8b66c92667
Jul 02 21:31:33 ryzen kernel: Code: 00 00 90 48 8b 05 e9 67 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 8>
Jul 02 21:31:33 ryzen kernel: RSP: 002b:00007f8b57265a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jul 02 21:31:33 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f8b57265b88 RCX: 00007f8b66c92667
Jul 02 21:31:33 ryzen kernel: RDX: 00007f8b57265b00 RSI: 00000000c0186444 RDI: 000000000000000b
Jul 02 21:31:33 ryzen kernel: RBP: 00007f8b57265b00 R08: 00007f8b57265bb0 R09: 0000000000000010
Jul 02 21:31:33 ryzen kernel: R10: 00007f8b57265bb0 R11: 0000000000000246 R12: 00000000c0186444
Jul 02 21:31:33 ryzen kernel: R13: 000000000000000b R14: 0000000000000002 R15: 0000000000000000
Jul 02 21:31:33 ryzen kernel: Modules linked in: it87(O) joydev mousedev hid_generic hidp hid ipt_REJECT nf_reject_ipv4 nf_l>
Jul 02 21:31:33 ryzen kernel: serio_raw crc32_pclmul atkbd ghash_clmulni_intel libps2 pcbc ahci libahci xhci_pci libata aes>
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000
Jul 02 21:31:33 ryzen kernel: ---[ end trace 517a8a72887251f0 ]---
Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1
Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000
Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000
Jul 02 21:31:33 ryzen kernel: FS: 00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000
Jul 02 21:31:33 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0

(At this point, the machine is just dead and does not react to anything.)

So something is still wrong at amdgpu_vm_cpu_set_ptes+0x76

(In reply to dwagner from comment #12)
My guess is that on resume from S3 the root PD needs to be mapped into the CPU address space again. Maybe changing the patch according to Christian's advice will be enough. I will take a look tomorrow. Or it has to do with the resume failure you are experiencing. What ASIC are you using? I also tested with a gfx8 ASIC and haven't observed any issues with resume. Did you update the firmware for this ASIC to latest #

(In reply to Andrey Grodzovsky from comment #13)
The GPU is an RX460 "POLARIS11 0x1002:0x67EF 0x1682:0x9460 0xCF", with the latest firmware from the kernel git; you can see the details in https://bugs.freedesktop.org/attachment.cgi?id=140383 uploaded earlier.

(In reply to dwagner from comment #14)
We have only minor differences, but I can't reproduce it. Maybe the resume failure is indeed due to the eviction failure during suspend. Is the S3 failure happening only when you switch to CPU update mode?

(In reply to Andrey Grodzovsky from comment #15)
No, when I boot amd-staging-drm-next with amdgpu.vm_update_mode=0 and suspend to S3, resuming also crashes, but with different messages - _not_ with "BUG: unable to handle kernel paging request at 0000000000002000" like in the vm_update_mode=3 case.

In the journal, I can see after a vm_update_mode=0 S3 resume attempt:

Jul 05 00:41:59 ryzen kernel: [TTM] Buffer eviction failed
Jul 05 00:41:59 ryzen kernel: ACPI: Preparing to enter system sleep state S3
...
Jul 05 00:42:00 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Jul 05 00:42:00 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Jul 05 00:42:00 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22
...
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
... many more of this... but no kernel BUG or Oops.

Interesting observation: If I first switch from the X11 display to the console display (with Alt-F2), and then enter "echo mem >/sys/power/state" on the console, above described crashes upon S3 resume do not occur, and I do not see the "[TTM] Buffer eviction failed" in the kernel log, neither with vm_update_mode=0, nor with vm_update_mode=3.

Switching back to the X11 display after a successful S3 resume to the console also works fine.

What could be the relevant difference here?
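For comparing journals from different runs (vm_update_mode=0 vs. vm_update_mode=3, suspend from X11 vs. from the console), a small script can count the failure messages quoted in this report. This is only a sketch: the patterns are copied from the log excerpts above, while the function name `classify_resume_log` and the signature labels are made up for illustration and are not part of any amdgpu tooling.

```python
import re

# Failure signatures taken from the log lines quoted in this report.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "eviction_failed": re.compile(r"\[TTM\] Buffer eviction failed"),
    "paging_request_bug": re.compile(r"BUG: unable to handle kernel paging request"),
    "ring_test_failed": re.compile(r"ring \d+ test failed"),
    "ip_block_resume_failed": re.compile(r"resume of IP block <(\w+)> failed"),
    "ib_schedule_failed": re.compile(r"couldn't schedule ib on ring <(\w+)>"),
}


def classify_resume_log(lines):
    """Count how many of the given log lines match each failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jul 05 00:41:59 ryzen kernel: [TTM] Buffer eviction failed",
        "Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] "
        "*ERROR* resume of IP block <gfx_v8_0> failed -22",
        "Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>",
    ]
    for name, count in classify_resume_log(sample).items():
        print(f"{name}: {count}")
```

Feeding it the saved output of each boot makes it easier to see whether "[TTM] Buffer eviction failed" shows up before the suspend in a given configuration.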
(In reply to dwagner from comment #17)
> Interesting observation: If I first switch from the X11 display to the
> console display (with Alt-F2), and then enter "echo mem >/sys/power/state"
> on the console, above described crashes upon S3 resume do not occur, and I
> do not see the "[TTM] Buffer eviction failed" in the kernel log, neither
> with vm_update_mode=0, nor with vm_update_mode=3.
>
> Switching back to the X11 display after a successful S3 resume to the
> console also works fine.
>
> What could be the relevant difference here?

Well, there is no acceleration involved when in console mode. So maybe this has something to do with it.

Anyway, I am sidetracked a bit by an internal requirement but once I finish I will get back to this issue especially because I got another report with the same failure as you describe.

(In reply to Andrey Grodzovsky from comment #18)
> (In reply to dwagner from comment #17)
> > Interesting observation: If I first switch from the X11 display to the
> > console display (with Alt-F2), and then enter "echo mem >/sys/power/state"
> > on the console, above described crashes upon S3 resume do not occur, and I
> > do not see the "[TTM] Buffer eviction failed" in the kernel log, neither
> > with vm_update_mode=0, nor with vm_update_mode=3.
> >
> > Switching back to the X11 display after a successful S3 resume to the
> > console also works fine.
> >
> > What could be the relevant difference here?
>
> Well, there is no acceleration involved when in console mode. So maybe this
> has something to do with it.
>
> Anyway, I am sidetracked a bit by an internal requirement but once I finish
> I will get back to this issue especially because I got another report with
> the same failure as you describe.

I was able to reproduce this instantly without even using page tables CPU update mode. Looks like a regression since S3 was working fine for a long time. Were you able to find a regression point for this ?
(In reply to Andrey Grodzovsky from comment #19)
> I was able to reproduce this instantly without even using page tables CPU
> update mode. Looks like a regression since S3 was working fine for a long
> time. Were you able to find a regression point for this ?

Not for the exact symptom described in this report, but for an older S3 resume issue that was partially resolved - https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find the regression caused by the "drm/amd/display: Match actual state during S3 resume" commit.

Unluckily, the many changes that followed thereafter no longer allow bisecting the symptom there to one specific commit, but given that it still occurs if I use the option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I think there is still some bug in the order of things done during re-initialization upon S3 resumes, and setting some fixed EDID seems to expose it as a crash.

(In reply to dwagner from comment #20)
> (In reply to Andrey Grodzovsky from comment #19)
> > I was able to reproduce this instantly without even using page tables CPU
> > update mode. Looks like a regression since S3 was working fine for a long
> > time. Were you able to find a regression point for this ?
>
> Not for the exact symptom described in this report, but for an older S3
> resume issue that was partially resolved -
> https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find the
> regression caused by the "drm/amd/display: Match actual state during S3
> resume" commit.
>
> Unluckily, the many changes that followed thereafter no longer allow
> bisecting the symptom there to one specific commit, but given that it still
> occurs if I use the option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I
> think there is still some bug in the order of things done during
> re-initialization upon S3 resumes, and setting some fixed EDID seems to
> expose it as a crash.
I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on atomic drivers
Not sure yet what's going on there and not sure it will fix your issue with the amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observed it here. Still worth a try on your side to revert it and see what happens.

(In reply to Andrey Grodzovsky from comment #21)
> I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
> atomic drivers
> Not sure yet what's going on there and not sure it will fix your issue with
> the amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observed it
> here. Still worth a try on your side to revert it and see what happens.

Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic drivers" for me only changes that after S3 resume, the very picture that was visible before S3 sleep is displayed again - but the kernel crash at "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as frozen as the system is dead.

(In reply to dwagner from comment #22)
> (In reply to Andrey Grodzovsky from comment #21)
> > I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
> > atomic drivers
> > Not sure yet what's going on there and not sure it will fix your issue with
> > the amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observed it
> > here. Still worth a try on your side to revert it and see what happens.
> Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
> drivers" for me only changes that after S3 resume, the very picture that was
> visible before S3 sleep is displayed again - but the kernel crash at
> "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as
> frozen as the system is dead.

Can you attach dmesg from the system with reverted patch ?

> > Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
> > drivers" for me only changes that after S3 resume, the very picture that was
> > visible before S3 sleep is displayed again - but the kernel crash at
> > "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as
> > frozen as the system is dead.
>
> Can you attach dmesg from the system with reverted patch ?
Sure, will do
Created attachment 140634 [details]
dmesg before and after S3 sleep with commit "updating plane ..." reverted
(In reply to dwagner from comment #25)
> Created attachment 140634 [details]
> dmesg before and after S3 sleep with commit "updating plane ..." reverted

Reverting the patch makes the TTM eviction failure + following driver resume failure go away. So that is one issue. Another issue is that you still experience a page table update related fault during S3. I can't reproduce that issue. I am currently looking into how this patch broke S3, this is the more burning issue as other people experience it too. Later I will try to give you some debug printk patch to sort out your page fault issue.

Created attachment 140715 [details] [review]
0001-drm-amdgpu-Fix-S3-resume-failre.patch

Please try the attached patch for the S3 issue, it might still not be the final fix but still. It's not a fix for your CPU page table updates fault.

(In reply to Andrey Grodzovsky from comment #27)
> Created attachment 140715 [details] [review]
> 0001-drm-amdgpu-Fix-S3-resume-failre.patch
>
> Please try the attached patch for the S3 issue, it might still not be the
> final fix but still. It's not a fix for your CPU page table updates fault.

Alas, this patch does not change the symptom relative to the revert mentioned above: Screen comes back on, but amdgpu crashes at amdgpu_vm_cpu_set_ptes+0x76 immediately thereafter. Will attach kernel messages below.

Created attachment 140721 [details]
dmesg before and after S3 with above patch applied
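The crash site is quoted throughout this report in the kernel's `symbol+0xoffset/0xsize` notation, for example amdgpu_vm_cpu_set_ptes+0x76/0xf0 (offset 0x76 is the +118 mentioned earlier for the disassembly). To check whether two dmesg attachments fault at the same place, the entries can be pulled out mechanically. The following is only a sketch; `FRAME_RE` and `parse_frame` are names made up for this example and are not part of any kernel tooling.

```python
import re

# Matches trace entries such as "amdgpu_vm_cpu_set_ptes+0x76/0xf0 [amdgpu]".
# A leading "? " marks an entry the kernel flags as unreliable.
FRAME_RE = re.compile(
    r"(?P<unreliable>\? )?"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_frame(text):
    """Return the first symbol+offset/size entry found in text, or None."""
    match = FRAME_RE.search(text)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),  # bytes into the function
        "size": int(match.group("size"), 16),      # total function size in bytes
        "module": match.group("module"),           # None for built-in code
        "unreliable": match.group("unreliable") is not None,
    }


if __name__ == "__main__":
    line = "Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]"
    print(parse_frame(line))
```

The decimal offset is what gdb shows as `<+N>` in a disassembly listing, so the parsed value can be matched directly against a dump of the function when no debuginfo is available.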
(In reply to dwagner from comment #28)
> (In reply to Andrey Grodzovsky from comment #27)
> > Created attachment 140715 [details] [review]
> > 0001-drm-amdgpu-Fix-S3-resume-failre.patch
> >
> > Please try the attached patch for the S3 issue, it might still not be the
> > final fix but still. It's not a fix for your CPU page table updates fault.
>
> Alas, this patch does not change the symptom relative to the revert
> mentioned above: Screen comes back on, but amdgpu crashes at
> amdgpu_vm_cpu_set_ptes+0x76 immediately thereafter. Will attach kernel
> messages below.

Yes, as expected. The page fault issue is something different.

Small update: I wrote on 2018-07-04 above: "when I boot amd-staging-drm-next with amdgpu.vm_update_mode=0 and suspend to S3 then resuming does also crash, but with different messages".

This has changed with some recent commits to amd-staging-drm-next as of the last 7 days. Now, with amdgpu.vm_update_mode=0, I can put the machine to S3 sleep (even from X11, not only from the console) and have it resume fine.

As positive as this is, it does not solve my general stability problem, amd-staging-drm-next with vm_update_mode=0 is still crashing after < 1h of mundane use, and it does not change the fact that with vm_update_mode=3 still every S3 resume ends in a crash (with the symptoms described above).

In other news, the amdgpu.dc_log option has vanished, and messages are now fewer and less verbose. (Not an improvement, IMHO.)

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/430.