Bug 107065 - "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Andrey Grodzovsky
Reported: 2018-06-28 19:33 UTC by dwagner
Modified: 2019-11-19 08:42 UTC

Attachments
dmesg of the system boot and before and at the crash at S3 resume (84.19 KB, text/plain)
2018-06-28 19:42 UTC, dwagner
drm/amdgpu: Verify root PD is mapped into kernel address space. (1.08 KB, patch)
2018-07-02 03:11 UTC, Andrey Grodzovsky
dmesg before and after S3 sleep with commit "updating plane ..." reverted (97.61 KB, text/plain)
2018-07-14 13:16 UTC, dwagner
0001-drm-amdgpu-Fix-S3-resume-failre.patch (1.68 KB, patch)
2018-07-19 16:42 UTC, Andrey Grodzovsky
dmesg before and after S3 with above patch applied (97.54 KB, text/plain)
2018-07-20 00:19 UTC, dwagner

Description dwagner 2018-06-28 19:33:52 UTC
When I resume from S3 using the 4.17.2-1-ARCH kernel with amdgpu.vm_update_mode=3 (for reasons explained in https://bugs.freedesktop.org/show_bug.cgi?id=102322 ), first the amdgpu driver crashes and shortly thereafter the whole system, with the following kernel messages:

Jun 28 21:14:25 ryzen kernel: ACPI: Low-level resume complete
Jun 28 21:14:25 ryzen kernel: PM: Restoring platform NVS memory
Jun 28 21:14:25 ryzen kernel: Enabling non-boot CPUs ...
...
Jun 28 21:14:25 ryzen kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400040000).
Jun 28 21:14:25 ryzen kernel: [drm] UVD and UVD ENC initialized successfully.
Jun 28 21:14:25 ryzen kernel: [drm] VCE initialized successfully.
Jun 28 21:14:25 ryzen kernel: OOM killer enabled.
Jun 28 21:14:25 ryzen kernel: Restarting tasks ... done.
Jun 28 21:14:25 ryzen kernel: PM: suspend exit
Jun 28 21:14:25 ryzen kernel: BUG: unable to handle kernel paging request at 0000000000002000
Jun 28 21:14:25 ryzen kernel: PGD 0 P4D 0 
Jun 28 21:14:25 ryzen kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Jun 28 21:14:25 ryzen kernel: Modules linked in: arc4 md4 sha512_ssse3 sha512_generic nls_utf8 cifs ccm dns_resolver fscache>
Jun 28 21:14:25 ryzen kernel:  bluetooth snd_hwdep snd_pcm eeepc_wmi snd_timer asus_wmi snd sparse_keymap mxm_wmi wmi_bmof i>
Jun 28 21:14:25 ryzen kernel:  dm_crypt dm_mod i2c_dev
Jun 28 21:14:25 ryzen kernel: CPU: 3 PID: 882 Comm: amdgpu_cs:0 Tainted: G        W  O      4.17.2-1-ARCH #1
Jun 28 21:14:25 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
Jun 28 21:14:25 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jun 28 21:14:25 ryzen kernel: RSP: 0018:ffffb8b8c3fa7a70 EFLAGS: 00010202
Jun 28 21:14:25 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000f400956001
Jun 28 21:14:25 ryzen kernel: RDX: 0000000000002000 RSI: 0000000000002000 RDI: ffff9edab48a0000
Jun 28 21:14:25 ryzen kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
Jun 28 21:14:25 ryzen kernel: R10: ffffffffc03e4c50 R11: ffff9edab30d0800 R12: 0000000000002000
Jun 28 21:14:25 ryzen kernel: R13: 0000000000000001 R14: ffffb8b8c3fa7ae8 R15: 000000f400956000
Jun 28 21:14:25 ryzen kernel: FS:  00007f622bb59700(0000) GS:ffff9edadecc0000(0000) knlGS:0000000000000000
Jun 28 21:14:25 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 28 21:14:25 ryzen kernel: CR2: 0000000000002000 CR3: 00000007e03f8000 CR4: 00000000003406e0
Jun 28 21:14:25 ryzen kernel: Call Trace:
Jun 28 21:14:25 ryzen kernel:  amdgpu_vm_cpu_set_ptes+0x76/0xf0 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  amdgpu_vm_update_directories+0x1ca/0x3c0 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  ? amdgpu_vm_do_copy_ptes+0xc0/0xc0 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  amdgpu_cs_ioctl+0x1169/0x1a70 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  ? dequeue_entity+0x156/0x950
Jun 28 21:14:25 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  drm_ioctl_kernel+0x5b/0xb0 [drm]
Jun 28 21:14:25 ryzen kernel:  drm_ioctl+0x1b7/0x370 [drm]
Jun 28 21:14:25 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jun 28 21:14:25 ryzen kernel:  do_vfs_ioctl+0xa4/0x610
Jun 28 21:14:25 ryzen kernel:  ksys_ioctl+0x60/0x90
Jun 28 21:14:25 ryzen kernel:  __x64_sys_ioctl+0x16/0x20
Jun 28 21:14:25 ryzen kernel:  do_syscall_64+0x5b/0x170
Jun 28 21:14:25 ryzen kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 28 21:14:25 ryzen kernel: RIP: 0033:0x7f623b586667
Jun 28 21:14:25 ryzen kernel: RSP: 002b:00007f622bb58a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 28 21:14:25 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f622bb58b88 RCX: 00007f623b586667
Jun 28 21:14:25 ryzen kernel: RDX: 00007f622bb58b00 RSI: 00000000c0186444 RDI: 000000000000000b
Jun 28 21:14:25 ryzen kernel: RBP: 00007f622bb58b00 R08: 00007f622bb58bb0 R09: 0000000000000010
Jun 28 21:14:25 ryzen kernel: R10: 00007f622bb58bb0 R11: 0000000000000246 R12: 00000000c0186444
Jun 28 21:14:25 ryzen kernel: R13: 000000000000000b R14: 000000000000000a R15: 0000000000000000
Jun 28 21:14:25 ryzen kernel: Code: 8b 80 d8 00 00 00 e9 85 ed 5c c2 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 0>
Jun 28 21:14:25 ryzen kernel: RIP: gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu] RSP: ffffb8b8c3fa7a70
Jun 28 21:14:25 ryzen kernel: CR2: 0000000000002000
Jun 28 21:14:25 ryzen kernel: ---[ end trace 6fce4be2faa5be7e ]---
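The "Oops: 0002" field in the trace above is the x86 page-fault error code. As a reading aid, here is a minimal sketch that decodes its low bits. The function name and the output wording are illustrative only and do not come from any kernel tool.

```python
# Decode the low bits of an x86 page-fault error code as printed after "Oops:".
# Each entry: (bit mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]

def decode_oops_code(code_hex: str) -> list:
    """Return a human-readable description for each relevant error-code bit."""
    code = int(code_hex, 16)
    out = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            out.append(text)
    return out

print(decode_oops_code("0002"))
```

For 0002 this reads as a write, from kernel mode, to a page that is not present, which is consistent with a store through the near-NULL address shown in CR2 (0000000000002000).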
Comment 1 dwagner 2018-06-28 19:42:59 UTC
Created attachment 140383
dmesg of the system boot and before and at the crash at S3 resume
Comment 2 dwagner 2018-06-28 19:52:17 UTC
(Just for reference: This bug report is for a different kind of S3-resume-crash than reported in https://bugs.freedesktop.org/show_bug.cgi?id=103277 )
Comment 3 Andrey Grodzovsky 2018-06-28 20:49:19 UTC
Can you use addr2line, or gdb with the 'list' command, to find the source line matching
amdgpu_vm_cpu_set_ptes+0x76/0xf0?
Comment 4 dwagner 2018-06-28 22:50:19 UTC
(In reply to Andrey Grodzovsky from comment #3)
> Can you use addr2line or gdb with 'list' command to give the line number
> matching
> amdgpu_vm_cpu_set_ptes+0x76/0xf0 ?

That would have been easy had I used my self-compiled kernel - but it seems there is no debuginfo file available for the Arch Linux supplied kernels, which I ran in this case.

So I can only provide a disassembly of that function; offset 0x76 corresponds to <+118> in the listing:

Dump of assembler code for function amdgpu_vm_cpu_set_ptes:
   0x0000000000027c80 <+0>:     callq  0x27c85 <amdgpu_vm_cpu_set_ptes+5>
   0x0000000000027c85 <+5>:     push   %r15
   0x0000000000027c87 <+7>:     mov    %rcx,%r15
   0x0000000000027c8a <+10>:    push   %r14
   0x0000000000027c8c <+12>:    mov    %rdi,%r14
   0x0000000000027c8f <+15>:    mov    %rsi,%rdi
   0x0000000000027c92 <+18>:    push   %r13
   0x0000000000027c94 <+20>:    mov    %r8d,%r13d
   0x0000000000027c97 <+23>:    push   %r12
   0x0000000000027c99 <+25>:    mov    %rdx,%r12
   0x0000000000027c9c <+28>:    push   %rbp
   0x0000000000027c9d <+29>:    mov    %r9d,%ebp
   0x0000000000027ca0 <+32>:    push   %rbx
   0x0000000000027ca1 <+33>:    callq  0x27ca6 <amdgpu_vm_cpu_set_ptes+38>
   0x0000000000027ca6 <+38>:    add    %rax,%r12
   0x0000000000027ca9 <+41>:    nopl   0x0(%rax,%rax,1)
   0x0000000000027cae <+46>:    xor    %ebx,%ebx
   0x0000000000027cb0 <+48>:    test   %r13d,%r13d
   0x0000000000027cb3 <+51>:    je     0x27cfb <amdgpu_vm_cpu_set_ptes+123>
   0x0000000000027cb5 <+53>:    mov    0x28(%r14),%rax
   0x0000000000027cb9 <+57>:    mov    %r15,%rcx
   0x0000000000027cbc <+60>:    test   %rax,%rax
   0x0000000000027cbf <+63>:    je     0x27cd3 <amdgpu_vm_cpu_set_ptes+83>
   0x0000000000027cc1 <+65>:    mov    %r15,%rdx
   0x0000000000027cc4 <+68>:    mov    $0xfffffffffffff000,%rcx
   0x0000000000027ccb <+75>:    shr    $0xc,%rdx
   0x0000000000027ccf <+79>:    and    (%rax,%rdx,8),%rcx
   0x0000000000027cd3 <+83>:    mov    (%r14),%rdi
   0x0000000000027cd6 <+86>:    mov    %ebx,%edx
   0x0000000000027cd8 <+88>:    add    $0x1,%ebx
   0x0000000000027cdb <+91>:    mov    0x38(%rsp),%r8
   0x0000000000027ce0 <+96>:    mov    %r12,%rsi
   0x0000000000027ce3 <+99>:    add    %rbp,%r15
   0x0000000000027ce6 <+102>:   mov    0x968(%rdi),%rax
   0x0000000000027ced <+109>:   mov    0x18(%rax),%rax
   0x0000000000027cf1 <+113>:   callq  0x27cf6 <amdgpu_vm_cpu_set_ptes+118>
   0x0000000000027cf6 <+118>:   cmp    %ebx,%r13d
   0x0000000000027cf9 <+121>:   jne    0x27cb5 <amdgpu_vm_cpu_set_ptes+53>
   0x0000000000027cfb <+123>:   pop    %rbx
   0x0000000000027cfc <+124>:   pop    %rbp
   0x0000000000027cfd <+125>:   pop    %r12
   0x0000000000027cff <+127>:   pop    %r13
   0x0000000000027d01 <+129>:   pop    %r14
   0x0000000000027d03 <+131>:   pop    %r15
   0x0000000000027d05 <+133>:   retq   
   0x0000000000027d06 <+134>:   mov    %gs:0x0(%rip),%eax        # 0x27d0d <amdgpu_vm_cpu_set_ptes+141>
   0x0000000000027d0d <+141>:   mov    %eax,%eax
   0x0000000000027d0f <+143>:   bt     %rax,0x0(%rip)        # 0x27d17 <amdgpu_vm_cpu_set_ptes+151>
   0x0000000000027d17 <+151>:   jae    0x27cae <amdgpu_vm_cpu_set_ptes+46>
   0x0000000000027d19 <+153>:   incl   %gs:0x0(%rip)        # 0x27d20 <amdgpu_vm_cpu_set_ptes+160>
   0x0000000000027d20 <+160>:   mov    0x0(%rip),%rbx        # 0x27d27 <amdgpu_vm_cpu_set_ptes+167>
   0x0000000000027d27 <+167>:   test   %rbx,%rbx
   0x0000000000027d2a <+170>:   je     0x27d55 <amdgpu_vm_cpu_set_ptes+213>
   0x0000000000027d2c <+172>:   mov    (%rbx),%rax
   0x0000000000027d2f <+175>:   mov    0x8(%rbx),%rdi
   0x0000000000027d33 <+179>:   add    $0x18,%rbx
   0x0000000000027d37 <+183>:   mov    0x38(%rsp),%r9
   0x0000000000027d3c <+188>:   mov    %ebp,%r8d
   0x0000000000027d3f <+191>:   mov    %r13d,%ecx
   0x0000000000027d42 <+194>:   mov    %r15,%rdx
   0x0000000000027d45 <+197>:   mov    %r12,%rsi
   0x0000000000027d48 <+200>:   callq  0x27d4d <amdgpu_vm_cpu_set_ptes+205>
   0x0000000000027d4d <+205>:   mov    (%rbx),%rax
   0x0000000000027d50 <+208>:   test   %rax,%rax
   0x0000000000027d53 <+211>:   jne    0x27d2f <amdgpu_vm_cpu_set_ptes+175>
   0x0000000000027d55 <+213>:   decl   %gs:0x0(%rip)        # 0x27d5c <amdgpu_vm_cpu_set_ptes+220>
   0x0000000000027d5c <+220>:   jne    0x27cae <amdgpu_vm_cpu_set_ptes+46>
   0x0000000000027d62 <+226>:   callq  0x27d67 <amdgpu_vm_cpu_set_ptes+231>
   0x0000000000027d67 <+231>:   jmpq   0x27cae <amdgpu_vm_cpu_set_ptes+46>
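
To relate the call-trace entry to this listing: the trace gives amdgpu_vm_cpu_set_ptes+0x76/0xf0, that is, offset 0x76 into a function of size 0xf0. The sketch below does that arithmetic using the function start address 0x27c80 taken from the dump above. The helper name is made up for illustration.

```python
import re

def resolve(entry: str, func_start: int) -> tuple:
    """Split 'symbol+0xOFF/0xSIZE' and return (symbol, offset, size, address)."""
    m = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", entry.strip())
    if m is None:
        raise ValueError(f"not a symbol+offset/size entry: {entry!r}")
    symbol = m.group(1)
    offset = int(m.group(2), 16)
    size = int(m.group(3), 16)
    if offset >= size:
        raise ValueError("offset lies outside the function")
    # The address within the object file is simply start + offset.
    return symbol, offset, size, func_start + offset

sym, off, size, addr = resolve("amdgpu_vm_cpu_set_ptes+0x76/0xf0", 0x27C80)
print(f"{sym} <+{off}> -> {addr:#x}")
```

The result, 0x27cf6 (<+118>), is the instruction right after the callq at <+113>. A call-trace entry records a return address, so the call at <+113> is presumably the one that entered gmc_v8_0_set_pte_pde, where the fault was reported.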
Comment 5 dwagner 2018-06-29 00:37:45 UTC
Interesting: With amd-staging-drm-next, I see the same crash at https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c?h=amd-staging-drm-next#n921 with the same backtrace with vm_update_mode=3 immediately upon starting X11 - not only after S3 resume. Here with symbols translated to source lines:

Jun 29 01:49:05 ryzen kernel: amdgpu_vm_cpu_set_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:921 (discriminator 2)) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_vm_update_directories (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:989 /drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1096) amdgpu
Jun 29 01:49:05 ryzen kernel: ? amdgpu_vm_do_copy_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:913) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_gem_va_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:542 /drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:674) amdgpu
Jun 29 01:49:05 ryzen kernel: ? __alloc_pages_nodemask (/mm/page_alloc.c:4355) 
Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
Jun 29 01:49:05 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 drm
Jun 29 01:49:05 ryzen kernel: drm_ioctl+0x2f1/0x3c0 drm
Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_drm_ioctl (/./include/linux/pm_runtime.h:108 /drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:842) amdgpu
Jun 29 01:49:05 ryzen kernel: do_vfs_ioctl (/fs/ioctl.c:46 /fs/ioctl.c:500 /fs/ioctl.c:684) 
Jun 29 01:49:05 ryzen kernel: ? handle_mm_fault (/mm/memory.c:4133) 
Jun 29 01:49:05 ryzen kernel: ksys_ioctl (/./include/linux/file.h:39 /fs/ioctl.c:702) 
Jun 29 01:49:05 ryzen kernel: __x64_sys_ioctl (/fs/ioctl.c:708 /fs/ioctl.c:706 /fs/ioctl.c:706) 
Jun 29 01:49:05 ryzen kernel: do_syscall_64 (/arch/x86/entry/common.c:290) 
Jun 29 01:49:05 ryzen kernel: entry_SYSCALL_64_after_hwframe (/./include/trace/events/initcall.h:10 /./include/trace/events/initcall.h:10)
Comment 6 Andrey Grodzovsky 2018-06-29 16:16:37 UTC
(In reply to dwagner from comment #5)

So with the Arch Linux kernel it happens only on S3 resume, but with amd-staging-drm-next it happens as soon as you start X?
Comment 7 dwagner 2018-06-29 19:10:02 UTC
(In reply to Andrey Grodzovsky from comment #6)
> So with Arch Linux kernel it happens only during S3 but with
> amd-staging-drm-next it happens once you start X ?

Yes. I know it sounds strange, but it is currently 100% reproducible for me:

Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0:
 X11 starts fine, but system crashes after minutes of firefox browsing

Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3:
 X11 starts fine, system does not crash (for at least hours of use)
 but crashes as above if resumed from S3 sleep

Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=0:
 X11 starts fine, but system crashes after minutes of firefox browsing

Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=3:
 X11 does not start, crashes immediately with the same above pasted kernel BUG message and backtrace


So something with CPU-based vm_update_mode is broken, but in a different way than the SDMA-based method.

I will change the subject of this report to reflect that this crash is not necessarily S3-resume-related.
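
For context on the parameter varied above: as far as I understand the amdgpu module parameter documentation, amdgpu.vm_update_mode is a bitmask in which bit 0 selects CPU page-table updates for graphics and bit 1 selects them for compute, with 0 meaning SDMA for both. The sketch below encodes that interpretation. The function is hypothetical, and the bit meanings are an assumption that should be checked against the kernel version in use.

```python
# Assumed bit meanings of amdgpu.vm_update_mode (verify against your kernel).
CPU_FOR_GFX = 1 << 0
CPU_FOR_COMPUTE = 1 << 1

def describe_vm_update_mode(mode: int) -> dict:
    """Return which engine updates page tables for graphics and compute VMs."""
    return {
        "graphics": "CPU" if mode & CPU_FOR_GFX else "SDMA",
        "compute": "CPU" if mode & CPU_FOR_COMPUTE else "SDMA",
    }

# The two settings compared in this report.
for mode in (0, 3):
    print(mode, describe_vm_update_mode(mode))
```

Under that reading, vm_update_mode=3 routes every page-table write through the CPU path (amdgpu_vm_cpu_set_ptes), which is the path that faults in the traces above.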
Comment 8 Andrey Grodzovsky 2018-06-29 19:17:50 UTC
(In reply to dwagner from comment #7)

I am going to try to reproduce the crash with CPU update mode here. Which ASIC exactly are you using?
Comment 9 Andrey Grodzovsky 2018-06-29 19:21:34 UTC
(In reply to Andrey Grodzovsky from comment #8)
> I am going to try and reproduce the crash with CPU update mode here, please
> describe exactly what ASIC are you using ?

Got it already.
Comment 10 Andrey Grodzovsky 2018-07-02 03:11:27 UTC
Created attachment 140418
drm/amdgpu: Verify root PD is mapped into kernel address space.

dwagner, please try this patch. It fixes the issue for me, and I observed no suspend/resume issues.

Christian, please take a look at the patch. The problem was that in amdgpu_vm_update_directories the parent BO did not have a kernel mapping, so later, inside amdgpu_vm_cpu_set_ptes,
pe += (unsigned long)amdgpu_bo_kptr(bo); would evaluate to 0000000000002000, since
amdgpu_bo_kptr would return NULL for the parent. The parent was the root PD.

This was still working in commit 67b8d5c (Linus Torvalds, 7 weeks ago, "Linux 4.17-rc5", tag: v4.17-rc5), but I was not able to pinpoint exactly which change broke it. I am not sure my fix is the right one, so please advise.
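
The failure mode described in this comment can be modelled in user space. This is only an illustration of the arithmetic, not the driver code and not the attached patch. The function names and the example mapped address are invented.

```python
from typing import Optional

def pte_cpu_address(kptr: Optional[int], pe_offset: int) -> int:
    """Model of 'pe += (unsigned long)amdgpu_bo_kptr(bo)': NULL counts as 0."""
    base = 0 if kptr is None else kptr
    return base + pe_offset

def pte_cpu_address_checked(kptr: Optional[int], pe_offset: int) -> int:
    """Same computation, but refuse to proceed without a kernel mapping."""
    if not kptr:
        raise ValueError("page directory BO has no kernel mapping")
    return kptr + pe_offset

# Unmapped root PD: the "address" is just the offset, matching CR2 in the oops.
print(hex(pte_cpu_address(None, 0x2000)))
```

With no kernel mapping, the computed address collapses to the bare offset 0x2000, which is exactly the faulting address in the original oops. The checked variant shows the general idea of verifying the mapping before the write; where such a check belongs in the driver is the open question in this thread.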
Comment 11 Christian König 2018-07-02 11:03:45 UTC
(In reply to Andrey Grodzovsky from comment #10)

No idea when that broke either; CPU-based updates are not something we usually test.

Anyway, it's a good catch, but I would rather add that check to amdgpu_vm_bo_base_init() (with the appropriate checks).

That would also allow us to remove the duplicated code from amdgpu_vm_alloc_levels().
Comment 12 dwagner 2018-07-02 19:48:48 UTC
(In reply to Andrey Grodzovsky from comment #10)
> Created attachment 140418
> drm/amdgpu: Verify root PD is mapped into kernel address space.
> 
> dwagner, please try this patch. Fixes the issue for me and I observed no
> suspend/resume issues.

While I can start X11 with this patch applied to current amd-staging-drm-next, attempts to resume from S3 fail consistently.

The following related output is emitted right before the suspend:

Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend to debug)
Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...

(I wonder if that "[TTM] Buffer eviction failed" is a bad sign, as I have seen it at other times in conjunction with heavy use of the amdgpu driver.)


Then, upon resume, the following messages are emitted:

Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 148 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 145 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 189 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 306 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 5e ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 18a ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 145 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 148 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 145 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
                               failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Jul 02 21:31:33 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22
Jul 02 21:31:33 ryzen kernel: OOM killer enabled.
Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done.
Jul 02 21:31:33 ryzen kernel: PM: suspend exit
Jul 02 21:31:33 ryzen kernel: BUG: unable to handle kernel paging request at 0000000000001000
Jul 02 21:31:33 ryzen kernel: PGD 0 P4D 0 
Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP
Jul 02 21:31:33 ryzen kernel: CPU: 14 PID: 791 Comm: amdgpu_cs:0 Tainted: G        W  O      4.18.0-rc1-amd+ #45
Jul 02 21:31:33 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1
Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000
Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000
Jul 02 21:31:33 ryzen kernel: FS:  00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000
Jul 02 21:31:33 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0
Jul 02 21:31:33 ryzen kernel: Call Trace:
Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_frag_ptes+0xae/0x130 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_bo_update_mapping+0xed/0x410 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  ? amdgpu_vm_do_copy_ptes+0xa0/0xa0 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_bo_update+0x310/0x680 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  amdgpu_cs_ioctl+0x1092/0x1a50 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  drm_ioctl_kernel+0xa7/0xf0 [drm]
Jul 02 21:31:33 ryzen kernel:  drm_ioctl+0x2f1/0x3c0 [drm]
Jul 02 21:31:33 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jul 02 21:31:33 ryzen kernel:  do_vfs_ioctl+0xa4/0x620
Jul 02 21:31:33 ryzen kernel:  ? __se_sys_futex+0x138/0x180
Jul 02 21:31:33 ryzen kernel:  ksys_ioctl+0x60/0x90
Jul 02 21:31:33 ryzen kernel:  __x64_sys_ioctl+0x16/0x20
Jul 02 21:31:33 ryzen kernel:  do_syscall_64+0x48/0xf0
Jul 02 21:31:33 ryzen kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 02 21:31:33 ryzen kernel: RIP: 0033:0x7f8b66c92667
Jul 02 21:31:33 ryzen kernel: Code: 00 00 90 48 8b 05 e9 67 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 8>
Jul 02 21:31:33 ryzen kernel: RSP: 002b:00007f8b57265a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jul 02 21:31:33 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f8b57265b88 RCX: 00007f8b66c92667
Jul 02 21:31:33 ryzen kernel: RDX: 00007f8b57265b00 RSI: 00000000c0186444 RDI: 000000000000000b
Jul 02 21:31:33 ryzen kernel: RBP: 00007f8b57265b00 R08: 00007f8b57265bb0 R09: 0000000000000010
Jul 02 21:31:33 ryzen kernel: R10: 00007f8b57265bb0 R11: 0000000000000246 R12: 00000000c0186444
Jul 02 21:31:33 ryzen kernel: R13: 000000000000000b R14: 0000000000000002 R15: 0000000000000000
Jul 02 21:31:33 ryzen kernel: Modules linked in: it87(O) joydev mousedev hid_generic hidp hid ipt_REJECT nf_reject_ipv4 nf_l>
Jul 02 21:31:33 ryzen kernel:  serio_raw crc32_pclmul atkbd ghash_clmulni_intel libps2 pcbc ahci libahci xhci_pci libata aes>
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000
Jul 02 21:31:33 ryzen kernel: ---[ end trace 517a8a72887251f0 ]---
Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1
Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000
Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000
Jul 02 21:31:33 ryzen kernel: FS:  00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000
Jul 02 21:31:33 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0

(At this point, the machine is just dead and does not react to anything.)

So something is still wrong at amdgpu_vm_cpu_set_ptes+0x76.
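
To compare this trace with the one in the original description, it helps to list only the frames that are not marked unreliable with "?". The sketch below does that for a few lines copied from both traces. The regular expression and the function name are my own and only cover the journal format seen in this report.

```python
import re

# Matches a journal line such as
# "Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]"
FRAME = re.compile(r"kernel:\s+(\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def reliable_frames(lines):
    """Return function names of call-trace frames not marked with '?'."""
    names = []
    for line in lines:
        m = FRAME.search(line)
        if m and m.group(1) is None:
            names.append(m.group(2))
    return names

first = [
    "Jun 28 21:14:25 ryzen kernel:  amdgpu_vm_cpu_set_ptes+0x76/0xf0 [amdgpu]",
    "Jun 28 21:14:25 ryzen kernel:  amdgpu_vm_update_directories+0x1ca/0x3c0 [amdgpu]",
    "Jun 28 21:14:25 ryzen kernel:  ? amdgpu_vm_do_copy_ptes+0xc0/0xc0 [amdgpu]",
]
second = [
    "Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu]",
    "Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]",
]
# Print the direct caller of amdgpu_vm_cpu_set_ptes in each trace.
print(reliable_frames(first)[1], "vs", reliable_frames(second)[1])
```

Both traces end in amdgpu_vm_cpu_set_ptes, but the caller differs: amdgpu_vm_update_directories in the original report and amdgpu_vm_update_ptes here. That suggests, without proving it, that a different buffer than the root PD may be missing its kernel mapping after resume.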
Comment 13 Andrey Grodzovsky 2018-07-02 22:55:24 UTC
(In reply to dwagner from comment #12)
> (In reply to Andrey Grodzovsky from comment #10)
> > Created attachment 140418
> > drm/amdgpu: Verify root PD is mapped into kernel address space.
> > 
> > dwagner, please try this patch. Fixes the issue for me and I observed no
> > suspend/resume issues.
> 
> While I can start X11 with this patch applied to current
> amd-staging-drm-next, attempts to resume from S3 fail consistently.
> 
> The following related output is emitted right before the suspend:
> 
> Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ...
> (elapsed 0.000 seconds) done.
> Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend
> to debug)
> Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
> Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
> Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
> Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
> Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...
> 
> (I wonder if that "[TTM] Buffer eviction failed" is a bad sign - as I have
> seen it some other times in conjunction with heavy uses of the amdgpu
> driver.)
> 
> 
> Then, upon resume, the following messages are emitted:
> 
> Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
> Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at
> 0x000000F400300000).
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 148 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 145 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 189 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 306 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 5e ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 18a ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 145 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 148 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 145 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR*
> amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]]
> *ERROR* resume of IP block <gfx_v8_0> failed -22
> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR*
> amdgpu_device_ip_resume failed (-22).
> Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0
> returns -22
> Jul 02 21:31:33 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume
> async: error -22
> Jul 02 21:31:33 ryzen kernel: OOM killer enabled.
> Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done.
> Jul 02 21:31:33 ryzen kernel: PM: suspend exit
> Jul 02 21:31:33 ryzen kernel: BUG: unable to handle kernel paging request at
> 0000000000001000
> Jul 02 21:31:33 ryzen kernel: PGD 0 P4D 0 
> Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP
> Jul 02 21:31:33 ryzen kernel: CPU: 14 PID: 791 Comm: amdgpu_cs:0 Tainted: G 
> W  O      4.18.0-rc1-amd+ #45
> Jul 02 21:31:33 ryzen kernel: Hardware name: System manufacturer System
> Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30
> [amdgpu]
> Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44
> 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
> Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
> Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001
> RCX: 000000000fe004f1
> Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000
> RDI: ffff8807e2f70000
> Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1
> R09: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000
> R12: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18
> R15: 000000000fe01000
> Jul 02 21:31:33 ryzen kernel: FS:  00007f8b57266700(0000)
> GS:ffff88081ef80000(0000) knlGS:0000000000000000
> Jul 02 21:31:33 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000
> CR4: 00000000003406e0
> Jul 02 21:31:33 ryzen kernel: Call Trace:
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_frag_ptes+0xae/0x130 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_bo_update_mapping+0xed/0x410
> [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  ? amdgpu_vm_do_copy_ptes+0xa0/0xa0 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_bo_update+0x310/0x680 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_cs_ioctl+0x1092/0x1a50 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  drm_ioctl_kernel+0xa7/0xf0 [drm]
> Jul 02 21:31:33 ryzen kernel:  drm_ioctl+0x2f1/0x3c0 [drm]
> Jul 02 21:31:33 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  do_vfs_ioctl+0xa4/0x620
> Jul 02 21:31:33 ryzen kernel:  ? __se_sys_futex+0x138/0x180
> Jul 02 21:31:33 ryzen kernel:  ksys_ioctl+0x60/0x90
> Jul 02 21:31:33 ryzen kernel:  __x64_sys_ioctl+0x16/0x20
> Jul 02 21:31:33 ryzen kernel:  do_syscall_64+0x48/0xf0
> Jul 02 21:31:33 ryzen kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Jul 02 21:31:33 ryzen kernel: RIP: 0033:0x7f8b66c92667
> Jul 02 21:31:33 ryzen kernel: Code: 00 00 90 48 8b 05 e9 67 2c 00 64 c7 00
> 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 8>
> Jul 02 21:31:33 ryzen kernel: RSP: 002b:00007f8b57265a98 EFLAGS: 00000246
> ORIG_RAX: 0000000000000010
> Jul 02 21:31:33 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f8b57265b88
> RCX: 00007f8b66c92667
> Jul 02 21:31:33 ryzen kernel: RDX: 00007f8b57265b00 RSI: 00000000c0186444
> RDI: 000000000000000b
> Jul 02 21:31:33 ryzen kernel: RBP: 00007f8b57265b00 R08: 00007f8b57265bb0
> R09: 0000000000000010
> Jul 02 21:31:33 ryzen kernel: R10: 00007f8b57265bb0 R11: 0000000000000246
> R12: 00000000c0186444
> Jul 02 21:31:33 ryzen kernel: R13: 000000000000000b R14: 0000000000000002
> R15: 0000000000000000
> Jul 02 21:31:33 ryzen kernel: Modules linked in: it87(O) joydev mousedev
> hid_generic hidp hid ipt_REJECT nf_reject_ipv4 nf_l>
> Jul 02 21:31:33 ryzen kernel:  serio_raw crc32_pclmul atkbd
> ghash_clmulni_intel libps2 pcbc ahci libahci xhci_pci libata aes>
> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: ---[ end trace 517a8a72887251f0 ]---
> Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30
> [amdgpu]
> Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44
> 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
> Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
> Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001
> RCX: 000000000fe004f1
> Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000
> RDI: ffff8807e2f70000
> Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1
> R09: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000
> R12: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18
> R15: 000000000fe01000
> Jul 02 21:31:33 ryzen kernel: FS:  00007f8b57266700(0000)
> GS:ffff88081ef80000(0000) knlGS:0000000000000000
> Jul 02 21:31:33 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000
> CR4: 00000000003406e0
> 
> (At this point, the machine is just dead, and reacts upon nothing.)
> 
> So something is still wrong at amdgpu_vm_cpu_set_ptes+0x76


My guess is that on resume from S3 the root PD needs to be mapped into the CPU address space again. Maybe changing the patch according to Christian's advice will be enough; I will take a look tomorrow. Or it may have to do with the resume failure you are experiencing. What ASIC are you using? I also tested with a gfx8 ASIC and haven't observed any issues with resume. Did you update the firmware for this ASIC to the latest?
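
As a reading aid for the trace quoted above: the "Oops: 0002" line carries the x86 page-fault error code, and CR2 holds the faulting address (0x1000 here). The following is a minimal Python sketch for decoding that code. The helper names are made up for illustration and are not part of any kernel tooling; the bit meanings are the standard x86 ones.

```python
import re

# (mask, meaning if the bit is set, meaning if the bit is clear)
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def parse_oops_line(line):
    """Extract the hex error code from a line such as 'Oops: 0002 [#1] SMP'."""
    match = re.search(r"Oops: ([0-9a-fA-F]{4})", line)
    return int(match.group(1), 16) if match else None


def decode_oops_code(code):
    """Return human-readable descriptions for a page-fault error code."""
    result = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text:
            result.append(text)
    return result


if __name__ == "__main__":
    line = "Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP"
    print(decode_oops_code(parse_oops_line(line)))
```

For the trace in this report, the decoded result is a kernel-mode write to a page that is not present. Together with the very low fault address, that is at least consistent with a write through a NULL-based CPU pointer plus an offset, although the trace alone does not prove which pointer was unmapped.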
Comment 14 dwagner 2018-07-03 20:42:22 UTC
(In reply to Andrey Grodzovsky from comment #13)
> What ASIC are you using ? I also tested with
> gfx8 ASIC and haven't observed any issues with resume. Did you update the
> firmware for this ASIC to latest #

The GPU is an RX460 "POLARIS11 0x1002:0x67EF 0x1682:0x9460 0xCF",
with the latest firmware from the kernel git, you can see the
details from https://bugs.freedesktop.org/attachment.cgi?id=140383
uploaded earlier.
Comment 15 Andrey Grodzovsky 2018-07-03 22:58:20 UTC
(In reply to dwagner from comment #14)
> (In reply to Andrey Grodzovsky from comment #13)
> > What ASIC are you using ? I also tested with
> > gfx8 ASIC and haven't observed any issues with resume. Did you update the
> > firmware for this ASIC to latest #
> 
> The GPU is an RX460 "POLARIS11 0x1002:0x67EF 0x1682:0x9460 0xCF",
> with the latest firmware from the kernel git, you can see the
> details from https://bugs.freedesktop.org/attachment.cgi?id=140383
> uploaded earlier.

We have only minor differences, but I can't reproduce it. Maybe the resume failure is indeed due to the eviction failure during suspend. Is the S3 failure happening only when you switch to CPU update mode?
Comment 16 dwagner 2018-07-04 22:55:09 UTC
(In reply to Andrey Grodzovsky from comment #15)
> We have only minor differences but I can't reproduce it. Maybe the resume
> failure is indeed due the eviction failure during suspend. Is S3 failure is
> happening only when you switch to CPU update mode ?

No, when I boot amd-staging-drm-next with amdgpu.vm_update_mode=0
and suspend to S3 then resuming does also crash, but with different
messages - _not_ with
 "BUG: unable to handle kernel paging request at 0000000000002000"
like in the vm_update_mode=3 case.

In the journal, I can see after a vm_update_mode=0 S3 resume attempt:

Jul 05 00:41:59 ryzen kernel: [TTM] Buffer eviction failed
Jul 05 00:41:59 ryzen kernel: ACPI: Preparing to enter system sleep state S3
...
Jul 05 00:42:00 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Jul 05 00:42:00 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Jul 05 00:42:00 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22
...
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
... many more of these ... but no kernel BUG or Oops.
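
For readers unfamiliar with the two settings compared here: as far as the amdgpu module parameter description goes, vm_update_mode is a small bitmask, where bit 0 selects CPU page table updates for graphics VMs and bit 1 for compute VMs, and 0 means updates go through SDMA. A minimal Python sketch of that reading follows. The function and constant names are illustrative only, and the mapping should be checked against the parameter description of the kernel actually in use.

```python
# Bit meanings as understood from the amdgpu.vm_update_mode parameter
# description; the names below are illustrative, not kernel identifiers.
USE_CPU_FOR_GFX = 1 << 0
USE_CPU_FOR_COMPUTE = 1 << 1


def describe_vm_update_mode(mode):
    """Map a vm_update_mode value to the updater used per VM type."""
    if mode < 0:
        # Negative values leave the choice to the driver default.
        return {"gfx": "driver default", "compute": "driver default"}
    return {
        "gfx": "CPU" if mode & USE_CPU_FOR_GFX else "SDMA",
        "compute": "CPU" if mode & USE_CPU_FOR_COMPUTE else "SDMA",
    }


if __name__ == "__main__":
    for mode in (0, 3):
        print(mode, describe_vm_update_mode(mode))
```

Under this reading, vm_update_mode=3 routes all page table writes through the CPU path (amdgpu_vm_cpu_set_ptes in the trace), while vm_update_mode=0 uses SDMA, which would explain why only the former shows the paging-request BUG.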
Comment 17 dwagner 2018-07-06 23:03:07 UTC
Interesting observation: If I first switch from the X11 display to the console display (with Alt-F2), and then enter "echo mem >/sys/power/state" on the console, the crashes described above do not occur upon S3 resume, and I do not see the "[TTM] Buffer eviction failed" message in the kernel log, neither with vm_update_mode=0 nor with vm_update_mode=3.

Switching back to the X11 display after a successful S3 resume to the console also works fine.

What could be the relevant difference here?
Comment 18 Andrey Grodzovsky 2018-07-09 18:16:46 UTC
(In reply to dwagner from comment #17)
> Interesting observation: If I first switch from the X11 display to the
> console display (with Alt-F2), and then enter "echo mem >/sys/power/state"
> on the console, above described crashes upon S3 resume do not occur, and I
> do not see the "[TTM] Buffer eviction failed" in the kernel log, neither
> with vm_update_mode=0, nor with vm_update_mode=3.
> 
> Switching back to the X11 display after a successful S3 resume to the
> console also works fine.
> 
> What could be the relevant difference here?

Well, there is no acceleration involved when in console mode. So maybe this has something to do with it.

Anyway, I am sidetracked a bit by an internal requirement, but once I finish I will get back to this issue, especially because I got another report with the same failure as you describe.
Comment 19 Andrey Grodzovsky 2018-07-11 22:04:03 UTC
(In reply to Andrey Grodzovsky from comment #18)
> (In reply to dwagner from comment #17)
> > Interesting observation: If I first switch from the X11 display to the
> > console display (with Alt-F2), and then enter "echo mem >/sys/power/state"
> > on the console, above described crashes upon S3 resume do not occur, and I
> > do not see the "[TTM] Buffer eviction failed" in the kernel log, neither
> > with vm_update_mode=0, nor with vm_update_mode=3.
> > 
> > Switching back to the X11 display after a successful S3 resume to the
> > console also works fine.
> > 
> > What could be the relevant difference here?
> 
> Well, there is no acceleration involved when in console mode. So maybe this
> has something to do with it.
> 
> Anyway, i am sidetracked a bit by an internal requirement but once i finish
> I will get back to this issue especially because I got another report with
> the same failure as you describe.

I was able to reproduce this instantly without even using page tables CPU update mode. Looks like a regression, since S3 was working fine for a long time. Were you able to find a regression point for this?
Comment 20 dwagner 2018-07-11 22:23:05 UTC
(In reply to Andrey Grodzovsky from comment #19)
> I was able to reproduce this instantly without even using page tables CPU
> update mode. Looks like a regression since S3 was working fine for long
> time. Were you able to find a regression point for this ?

Not for the exact symptom described in this report, but for an older S3 resume issue that was partially resolved - https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find the regression caused by the "drm/amd/display: Match actual state during S3 resume" commit.

Unluckily, the many changes that followed no longer allow bisecting the symptom there to one specific commit, but given that it still occurs if I use the option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I think there is still some bug in the order of things done during re-initialization upon S3 resume, and setting a fixed EDID seems to expose it as a crash.
Comment 21 Andrey Grodzovsky 2018-07-13 21:01:58 UTC
(In reply to dwagner from comment #20)
> (In reply to Andrey Grodzovsky from comment #19)
> > I was able to reproduce this instantly without even using page tables CPU
> > update mode. Looks like a regression since S3 was working fine for long
> > time. Were you able to find a regression point for this ?
> 
> Not for the exact symptom described in this report, but for an older S3
> resume issue that was partially resolved -
> https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find the
> regression caused by the "drm/amd/display: Match actual state during S3
> resume" commit.
> 
> Unluckily, the many changes that followed thereafter do no longer allow to
> bisect the symptom there to one specific commit, but given that it still
> occurs if I use the option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I
> think there is still some bug in the order of things done during
> re-initialization upon S3 resumes, and setting some fixed EDID seems to
> expose it as crash.

I found the offending patch - "drm: Stop updating plane->crtc/fb/old_fb on atomic drivers".
Not sure yet what's going on there, and not sure reverting it will fix your issue with the amdgpu_vm_cpu_set_ptes page fault after S3, since I haven't observed it here.
Still, it is worth a try on your side to revert it and see what happens.
Comment 22 dwagner 2018-07-13 23:45:19 UTC
(In reply to Andrey Grodzovsky from comment #21)
> I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
> atomic drivers
> Not sure yet what's going on there and not sure it will fix you issue with
> amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observe it here.
> Still worth a try on your side to revert it and see what happens.
Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic drivers" changes only one thing for me: after S3 resume, the very picture that was visible before S3 sleep is displayed again - but the kernel crash at "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as frozen as the system is dead.
Comment 23 Andrey Grodzovsky 2018-07-14 04:28:30 UTC
(In reply to dwagner from comment #22)
> (In reply to Andrey Grodzovsky from comment #21)
> > I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
> > atomic drivers
> > Not sure yet what's going on there and not sure it will fix you issue with
> > amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observe it here.
> > Still worth a try on your side to revert it and see what happens.
> Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
> drivers" for me only changes that after S3 resume, the very picture that was
> visible before S3 sleep is displayed again - but the kernel crash at
> "amdgpu_vm_cpu_set_ptes+0x76" still happenes, so the "resumed picture" is as
> frozen as the system is dead.

Can you attach the dmesg from the system with the patch reverted?
Comment 24 dwagner 2018-07-14 13:15:39 UTC
> > Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
> > drivers" for me only changes that after S3 resume, the very picture that was
> > visible before S3 sleep is displayed again - but the kernel crash at
> > "amdgpu_vm_cpu_set_ptes+0x76" still happenes, so the "resumed picture" is as
> > frozen as the system is dead.
> 
> Can you attach dmesg from the system with reverted patch ?

Sure, will do
Comment 25 dwagner 2018-07-14 13:16:46 UTC
Created attachment 140634 [details]
dmesg before and after S3 sleep with commit "updating plane ..." reverted
Comment 26 Andrey Grodzovsky 2018-07-16 13:52:00 UTC
(In reply to dwagner from comment #25)
> Created attachment 140634 [details]
> dmesg before and after S3 sleep with commit "updating plane ..." reverted

Reverting the patch makes the TTM eviction failure and the subsequent driver resume failure go away. So that is one issue. Another issue is that you still experience a page-table-update-related fault during S3. I can't reproduce that issue.

I am currently looking into how this patch broke S3; this is the more burning issue, as other people experience it too. Later I will try to give you a debug printk patch to sort out your page fault issue.
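
To tell the two issues apart in a saved kernel log, the message strings already quoted in this report can serve as markers. The following Python sketch is only an illustration under that assumption: the marker list, the issue labels and the function name are made up here, and the strings may differ on other kernel versions.

```python
import re

# Marker strings taken from the logs quoted in this report.
MARKERS = {
    "eviction_failed": re.compile(r"\[TTM\] Buffer eviction failed"),
    "ring_test_failed": re.compile(r"ring \d+ test failed"),
    "resume_failed": re.compile(r"amdgpu_device_ip_resume failed"),
    "cpu_pte_fault": re.compile(r"amdgpu_vm_cpu_set_ptes\+0x"),
}


def classify(lines):
    """Return the list of issue labels whose markers appear in the log."""
    found = {name: False for name in MARKERS}
    for line in lines:
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                found[name] = True

    issues = []
    if (found["eviction_failed"] or found["ring_test_failed"]
            or found["resume_failed"]):
        issues.append("s3-resume-failure")
    if found["cpu_pte_fault"]:
        issues.append("cpu-page-table-fault")
    return issues


if __name__ == "__main__":
    import sys
    # Reads a kernel log from standard input and prints the labels found.
    print(classify(sys.stdin))
```

A log that shows only the second label would match the situation described above, where the resume failure is gone but the CPU page table update fault remains.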
Comment 27 Andrey Grodzovsky 2018-07-19 16:42:26 UTC
Created attachment 140715 [details] [review]
0001-drm-amdgpu-Fix-S3-resume-failre.patch

Please try the attached patch for the S3 issue; it might not be the final fix yet, but it is worth a try. It's not a fix for your CPU page table updates fault.
Comment 28 dwagner 2018-07-20 00:18:05 UTC
(In reply to Andrey Grodzovsky from comment #27)
> Created attachment 140715 [details] [review]
> 0001-drm-amdgpu-Fix-S3-resume-failre.patch
> 
> Please try the attached patch for the S3 issue, it's might still not be the
> final fix but still. It's not a fix for your CPU page table updates fault.

Alas, this patch does not change the symptom relative to the revert mentioned above: Screen comes back on, but amdgpu crashes at amdgpu_vm_cpu_set_ptes+0x76 immediately thereafter. Will attach kernel messages below.
Comment 29 dwagner 2018-07-20 00:19:10 UTC
Created attachment 140721 [details]
dmesg before and after S3 with above patch applied
Comment 30 Andrey Grodzovsky 2018-07-20 13:45:06 UTC
(In reply to dwagner from comment #28)
> (In reply to Andrey Grodzovsky from comment #27)
> > Created attachment 140715 [details] [review]
> > 0001-drm-amdgpu-Fix-S3-resume-failre.patch
> > 
> > Please try the attached patch for the S3 issue, it's might still not be the
> > final fix but still. It's not a fix for your CPU page table updates fault.
> 
> Alas, this patch does not change the symptom relative to the revert
> mentioned above: Screen comes back on, but amdgpu crashes at
> amdgpu_vm_cpu_set_ptes+0x76 immediately thereafter. Will attach kernel
> messages below.

Yes, as expected. The page fault issue is something different.
Comment 31 dwagner 2018-07-25 23:05:27 UTC
Small update: I wrote on 2018-07-04 above:
"when I boot amd-staging-drm-next with amdgpu.vm_update_mode=0 and suspend to S3 then resuming does also crash, but with different messages".

This has changed with some recent commits to amd-staging-drm-next as of the last 7 days. Now, with amdgpu.vm_update_mode=0, I can put the machine to S3 sleep (even from X11, not only from the console) and have it resume fine.

As positive as this is, it does not solve my general stability problem: amd-staging-drm-next with vm_update_mode=0 is still crashing after < 1h of mundane use, and it does not change the fact that with vm_update_mode=3 every S3 resume still ends in a crash (with the symptoms described above).

In other news, the amdgpu.dc_log option has vanished, and messages are now fewer and less verbose. (Not an improvement, IMHO.)
Comment 32 Martin Peres 2019-11-19 08:42:24 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/430.

