Bug 109234 - amdgpu random hangs with 5.0-rc2/4.21+
Summary: amdgpu random hangs with 5.0-rc2/4.21+
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2019-01-06 20:50 UTC by bmilreu
Modified: 2019-01-18 04:31 UTC

Attachments
Bisect result (4.21 KB, text/plain)
2019-01-09 15:21 UTC, Sibren Vasse
dmesg kfd and amdgpu hangs (12.69 KB, text/plain)
2019-01-11 06:38 UTC, bmilreu

Description bmilreu 2019-01-06 20:50:19 UTC
This bug happens for me about once a day, at seemingly random times, with the latest kernel from the torvalds tree. The last merge I built was yesterday's (https://github.com/torvalds/linux/commit/0fe4e2d5cd931ad2ff99d61cfdd5c6dc0c3ec60b), but the bug was also present with the previous drm merge.

System:    Host: mjb Kernel: 4.20.0-1-tkg-cfs x86_64 bits: 64 Desktop: KDE Plasma 5.14.4 Distro: Manjaro Linux 
Machine:   Type: Desktop Mobo: ASUSTeK model: TUF B450M-PLUS GAMING v: Rev X.0x serial: <root required> 
           UEFI: American Megatrends v: 0601 date: 10/29/2018 
CPU:       6-Core: AMD Ryzen 5 2600 type: MT MCP speed: 3885 MHz min/max: 1550/3900 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] 
           driver: amdgpu v: kernel 
           Display: x11 server: X.Org 1.20.3 driver: amdgpu resolution: 1920x1080~60Hz 
           OpenGL: renderer: Radeon RX 580 Series (POLARIS10 DRM 3.27.0 4.21.0-torvaldsgit LLVM 8.0.0) 
           v: 4.5 Mesa 19.0.0-devel (git-8847370424) 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 
Drives:    Local Storage: total: 1.59 TiB used: 283.09 GiB (17.3%) 
Info:      Processes: 262 Uptime: 6m Memory: 15.66 GiB used: 1.22 GiB (7.8%) Shell: zsh inxi: 3.0.28 

dmesg from previous boot always shows a trace like this:

jan 06 18:37:32 mjb kernel: general protection fault: 0000 [#1] PREEMPT SMP NOPTI
jan 06 18:37:32 mjb kernel: CPU: 4 PID: 676 Comm: Xorg:cs0 Tainted: G           O      4.21.0-torvaldsgit #1
jan 06 18:37:32 mjb kernel: Hardware name: System manufacturer System Product Name/TUF B450M-PLUS GAMING, BIOS 0601 10/29/2018
jan 06 18:37:32 mjb kernel: RIP: 0010:__memcpy+0x12/0x20
jan 06 18:37:32 mjb kernel: Code: 48 89 c8 e9 f9 fc ff ff 48 89 f0 e9 f1 fc ff ff 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66>
jan 06 18:37:32 mjb kernel: RSP: 0018:ffffc9000327bc30 EFLAGS: 00010246
jan 06 18:37:32 mjb kernel: RAX: 0000a0050f003b80 RBX: ffff888105fdb0b0 RCX: 0000000000000200
jan 06 18:37:32 mjb kernel: RDX: 0000000000000000 RSI: ffff8880d3369000 RDI: 0000a0050f003b80
jan 06 18:37:32 mjb kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000001cc
jan 06 18:37:32 mjb kernel: R10: 0000000000000000 R11: ffff8883fa6a4828 R12: 0000000000001000
jan 06 18:37:32 mjb kernel: R13: 0000000000000000 R14: 00000000d3369000 R15: ffff88840bf6fb28
jan 06 18:37:32 mjb kernel: FS:  00007fc0419a4700(0000) GS:ffff88840eb00000(0000) knlGS:0000000000000000
jan 06 18:37:32 mjb kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 06 18:37:32 mjb kernel: CR2: 00007fa2d142d210 CR3: 0000000401080000 CR4: 00000000003406e0
jan 06 18:37:32 mjb kernel: Call Trace:
jan 06 18:37:32 mjb kernel:  dma_direct_unmap_page+0x92/0xa0
jan 06 18:37:32 mjb kernel:  ttm_unmap_and_unpopulate_pages+0x148/0x170 [ttm]
jan 06 18:37:32 mjb kernel:  ttm_tt_destroy+0x81/0xd0 [ttm]
jan 06 18:37:32 mjb kernel:  ttm_bo_put+0x25e/0x2f0 [ttm]
jan 06 18:37:32 mjb kernel:  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
jan 06 18:37:32 mjb kernel:  amdgpu_gem_object_free+0x23/0x30 [amdgpu]
jan 06 18:37:32 mjb kernel:  drm_gem_handle_delete+0x9b/0x130 [drm]
jan 06 18:37:32 mjb kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
jan 06 18:37:32 mjb kernel:  drm_ioctl_kernel+0x8b/0xd0 [drm]
jan 06 18:37:32 mjb kernel:  drm_ioctl+0x1e5/0x390 [drm]
jan 06 18:37:32 mjb kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
jan 06 18:37:32 mjb kernel:  ? tlb_finish_mmu+0x1f/0x30
jan 06 18:37:32 mjb kernel:  ? unmap_region+0xc9/0xf0
jan 06 18:37:32 mjb kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
jan 06 18:37:32 mjb kernel:  do_vfs_ioctl+0x97/0x720
jan 06 18:37:32 mjb kernel:  ? __do_munmap.constprop.9+0x263/0x3a0
jan 06 18:37:32 mjb kernel:  __x64_sys_ioctl+0x62/0x90
jan 06 18:37:32 mjb kernel:  do_syscall_64+0x55/0x100
jan 06 18:37:32 mjb kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
jan 06 18:37:32 mjb kernel: RIP: 0033:0x7fc04ba0580b
jan 06 18:37:32 mjb kernel: Code: 0f 1e fa 48 8b 05 55 b6 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3>
jan 06 18:37:32 mjb kernel: RSP: 002b:00007fc0419a3968 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
jan 06 18:37:32 mjb kernel: RAX: ffffffffffffffda RBX: 0000562b657e9710 RCX: 00007fc04ba0580b
jan 06 18:37:32 mjb kernel: RDX: 00007fc0419a39a0 RSI: 0000000040086409 RDI: 000000000000000e
jan 06 18:37:32 mjb kernel: RBP: 00007fc0419a39a0 R08: 0000562b6333bc48 R09: 0000000000000007
jan 06 18:37:32 mjb kernel: R10: 0000000000000026 R11: 0000000000000246 R12: 0000000040086409
jan 06 18:37:32 mjb kernel: R13: 000000000000000e R14: 0000562b63390960 R15: 0000562b657e3fa0
jan 06 18:37:32 mjb kernel: Modules linked in: devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nf_tables nfnetlink edac_mce_amd kvm_amd kvm irqbypass snd_hda_code>
jan 06 18:37:32 mjb kernel: ---[ end trace 28938eb196cb96ca ]---
jan 06 18:37:32 mjb kernel: RIP: 0010:__memcpy+0x12/0x20
jan 06 18:37:32 mjb kernel: Code: 48 89 c8 e9 f9 fc ff ff 48 89 f0 e9 f1 fc ff ff 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66>
jan 06 18:37:32 mjb kernel: RSP: 0018:ffffc9000327bc30 EFLAGS: 00010246
jan 06 18:37:32 mjb kernel: RAX: 0000a0050f003b80 RBX: ffff888105fdb0b0 RCX: 0000000000000200
jan 06 18:37:32 mjb kernel: RDX: 0000000000000000 RSI: ffff8880d3369000 RDI: 0000a0050f003b80
jan 06 18:37:32 mjb kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000001cc
jan 06 18:37:32 mjb kernel: R10: 0000000000000000 R11: ffff8883fa6a4828 R12: 0000000000001000
jan 06 18:37:32 mjb kernel: R13: 0000000000000000 R14: 00000000d3369000 R15: ffff88840bf6fb28
jan 06 18:37:32 mjb kernel: FS:  00007fc0419a4700(0000) GS:ffff88840eb00000(0000) knlGS:0000000000000000
jan 06 18:37:32 mjb kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 06 18:37:32 mjb kernel: CR2: 00007fa2d142d210 CR3: 0000000401080000 CR4: 00000000003406e0
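To compare traces like the one above across several oopses, it can help to reduce each one to its list of reliable frames. The following is only a minimal sketch, assuming journal-style lines in the format shown in this report; the function name `parse_frames` and the `"vmlinux"` label for frames without a module tag are illustrative choices, not part of any kernel tooling.

```python
import re

# Matches journal lines such as:
#   "jan 06 18:37:32 mjb kernel:  ttm_bo_put+0x25e/0x2f0 [ttm]"
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unreliable>\? )?(?P<func>[\w.]+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_frames(log_text):
    """Return the reliable call-trace frames as (function, module) tuples."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = FRAME_RE.search(line)
        if m is None:
            in_trace = False  # the first non-frame line ends the trace
            continue
        if m.group("unreliable"):
            continue  # skip "? " frames
        # "vmlinux" is a hypothetical label for frames with no module tag
        frames.append((m.group("func"), m.group("module") or "vmlinux"))
    return frames
```

Feeding it the trace above should yield the chain from dma_direct_unmap_page up through the ttm, amdgpu and drm ioctl frames, which makes it easy to see whether two oopses share the same path.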
Comment 1 bmilreu 2019-01-07 03:33:44 UTC
Still happens in 5.0-rc1.
Comment 2 Michel Dänzer 2019-01-07 10:28:49 UTC
Can you bisect?
Comment 3 bmilreu 2019-01-07 17:56:15 UTC
(In reply to Michel Dänzer from comment #2)
> Can you bisect?

I could, but I don't have a reliable way to reproduce it yet. I don't know what triggers the bug.
Comment 4 bmilreu 2019-01-07 22:01:17 UTC
I haven't triggered it again yet in 5.0-rc1 after a BIOS update; let's see what happens in the next few days.
Comment 5 bmilreu 2019-01-07 23:32:59 UTC
Got it just now while playing a Steam game in Wine, but I still can't reproduce it reliably:

jan 07 21:27:20 mjb kernel: BUG: unable to handle kernel paging request at ffff8e08888b4c00
jan 07 21:27:20 mjb kernel: #PF error: [WRITE]
jan 07 21:27:20 mjb kernel: PGD 0 P4D 0 
jan 07 21:27:20 mjb kernel: Oops: 0002 [#1] SMP NOPTI
jan 07 21:27:20 mjb kernel: CPU: 1 PID: 18040 Comm: Steam.exe Tainted: G           O      5.0.0-1-tkg-cfs #1
jan 07 21:27:20 mjb kernel: Hardware name: System manufacturer System Product Name/TUF B450M-PLUS GAMING, BIOS 0604 12/07/2018
jan 07 21:27:20 mjb kernel: RIP: 0010:__memcpy+0x12/0x20
jan 07 21:27:20 mjb kernel: Code: 48 89 c8 e9 f9 fc ff ff 48 89 f0 e9 f1 fc ff ff 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66>
jan 07 21:27:20 mjb kernel: RSP: 0018:ffffc90001b73cc0 EFLAGS: 00210246
jan 07 21:27:20 mjb kernel: RAX: ffff8e08888b4c00 RBX: ffff888105fd80b0 RCX: 0000000000000200
jan 07 21:27:20 mjb kernel: RDX: 0000000000000000 RSI: ffff8880d50f0000 RDI: ffff8e08888b4c00
jan 07 21:27:20 mjb kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
jan 07 21:27:20 mjb kernel: R10: ffffea000d8bf580 R11: ffff888143d89710 R12: 0000000000001000
jan 07 21:27:20 mjb kernel: R13: 0000000000000000 R14: 00000000d50f0000 R15: ffff8883fcaefd28
jan 07 21:27:20 mjb kernel: FS:  000000007ffd8000(0063) GS:ffff88840ea40000(006b) knlGS:00000000f7b810c0
jan 07 21:27:20 mjb kernel: CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
jan 07 21:27:20 mjb kernel: CR2: ffff8e08888b4c00 CR3: 000000033c700000 CR4: 00000000003406e0
jan 07 21:27:20 mjb kernel: Call Trace:
jan 07 21:27:20 mjb kernel:  dma_direct_unmap_page+0x92/0xa0
jan 07 21:27:20 mjb kernel:  ttm_unmap_and_unpopulate_pages+0x148/0x170 [ttm]
jan 07 21:27:20 mjb kernel:  ttm_tt_destroy+0x81/0xd0 [ttm]
jan 07 21:27:20 mjb kernel:  ttm_bo_put+0x262/0x2f0 [ttm]
jan 07 21:27:20 mjb kernel:  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
jan 07 21:27:20 mjb kernel:  amdgpu_gem_object_free+0x23/0x30 [amdgpu]
jan 07 21:27:20 mjb kernel:  drm_gem_handle_delete+0x9e/0x130 [drm]
jan 07 21:27:20 mjb kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
jan 07 21:27:20 mjb kernel:  drm_ioctl_kernel+0x8b/0xd0 [drm]
jan 07 21:27:20 mjb kernel:  drm_ioctl+0x1e5/0x390 [drm]
jan 07 21:27:20 mjb kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
jan 07 21:27:20 mjb kernel:  ? kmem_cache_free+0x18e/0x1b0
jan 07 21:27:20 mjb kernel:  ? remove_vma_list+0xe6/0x140
jan 07 21:27:20 mjb kernel:  ? __do_munmap.constprop.9+0x263/0x3a0
jan 07 21:27:20 mjb kernel:  __se_compat_sys_ioctl+0x2e3/0xe10
jan 07 21:27:20 mjb kernel:  ? __ia32_sys_munmap+0x75/0x90
jan 07 21:27:20 mjb kernel:  do_fast_syscall_32+0x98/0x210
jan 07 21:27:20 mjb kernel:  entry_SYSCALL_compat_after_hwframe+0x45/0x4d
jan 07 21:27:20 mjb kernel: Modules linked in: edac_mce_amd kvm_amd kvm snd_hda_codec_realtek amdgpu irqbypass snd_hda_codec_generic ledtrig_audio chash snd_hda_codec_hdmi amd_iommu_v2 gpu>
jan 07 21:27:20 mjb kernel: CR2: ffff8e08888b4c00
jan 07 21:27:20 mjb kernel: ---[ end trace b2ffa643a20c80fe ]---
jan 07 21:27:20 mjb kernel: RIP: 0010:__memcpy+0x12/0x20
jan 07 21:27:20 mjb kernel: Code: 48 89 c8 e9 f9 fc ff ff 48 89 f0 e9 f1 fc ff ff 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66>
jan 07 21:27:20 mjb kernel: RSP: 0018:ffffc90001b73cc0 EFLAGS: 00210246
jan 07 21:27:20 mjb kernel: RAX: ffff8e08888b4c00 RBX: ffff888105fd80b0 RCX: 0000000000000200
jan 07 21:27:20 mjb kernel: RDX: 0000000000000000 RSI: ffff8880d50f0000 RDI: ffff8e08888b4c00
jan 07 21:27:20 mjb kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
jan 07 21:27:20 mjb kernel: R10: ffffea000d8bf580 R11: ffff888143d89710 R12: 0000000000001000
jan 07 21:27:20 mjb kernel: R13: 0000000000000000 R14: 00000000d50f0000 R15: ffff8883fcaefd28
jan 07 21:27:20 mjb kernel: FS:  000000007ffd8000(0063) GS:ffff88840ea40000(006b) knlGS:00000000f7b810c0
jan 07 21:27:20 mjb kernel: CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
jan 07 21:27:20 mjb kernel: CR2: ffff8e08888b4c00 CR3: 000000033c700000 CR4: 00000000003406e0
jan 07 21:27:22 mjb kernel: general protection fault: 0000 [#2] SMP NOPTI
jan 07 21:27:22 mjb kernel: CPU: 0 PID: 649 Comm: Xorg Tainted: G      D    O      5.0.0-1-tkg-cfs #1
jan 07 21:27:22 mjb kernel: Hardware name: System manufacturer System Product Name/TUF B450M-PLUS GAMING, BIOS 0604 12/07/2018
jan 07 21:27:22 mjb kernel: RIP: 0010:__memcpy+0x12/0x20
jan 07 21:27:22 mjb kernel: Code: 48 89 c8 e9 f9 fc ff ff 48 89 f0 e9 f1 fc ff ff 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66>
jan 07 21:27:22 mjb kernel: RSP: 0018:ffffc90002203c30 EFLAGS: 00010246
jan 07 21:27:22 mjb kernel: RAX: c930ce4031168b49 RBX: ffff888105fd80b0 RCX: 0000000000000200
jan 07 21:27:22 mjb kernel: RDX: 0000000000000000 RSI: ffff8880d5297000 RDI: c930ce4031168b49
jan 07 21:27:22 mjb kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000041
jan 07 21:27:22 mjb kernel: R10: ffffea00060c1d40 R11: ffff8883fa0950f8 R12: 0000000000001000
jan 07 21:27:22 mjb kernel: R13: 0000000000000000 R14: 00000000d5297000 R15: ffff8883fc85ef28
jan 07 21:27:22 mjb kernel: FS:  00007fed1f70bdc0(0000) GS:ffff88840ea00000(0000) knlGS:0000000000000000
jan 07 21:27:22 mjb kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 07 21:27:22 mjb kernel: CR2: 0000561332707448 CR3: 0000000402090000 CR4: 00000000003406f0
jan 07 21:27:22 mjb kernel: Call Trace:
jan 07 21:27:22 mjb kernel:  dma_direct_unmap_page+0x92/0xa0
jan 07 21:27:22 mjb kernel:  ttm_unmap_and_unpopulate_pages+0x148/0x170 [ttm]
jan 07 21:27:22 mjb kernel:  ttm_tt_destroy+0x81/0xd0 [ttm]
jan 07 21:27:22 mjb kernel:  ttm_bo_put+0x262/0x2f0 [ttm]
jan 07 21:27:22 mjb kernel:  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
jan 07 21:27:22 mjb kernel:  amdgpu_gem_object_free+0x23/0x30 [amdgpu]
jan 07 21:27:22 mjb kernel:  drm_gem_handle_delete+0x9e/0x130 [drm]
jan 07 21:27:22 mjb kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
jan 07 21:27:22 mjb kernel:  drm_ioctl_kernel+0x8b/0xd0 [drm]
jan 07 21:27:22 mjb kernel:  drm_ioctl+0x1e5/0x390 [drm]
jan 07 21:27:22 mjb kernel:  ? drm_gem_handle_create+0x40/0x40 [drm]
jan 07 21:27:22 mjb kernel:  ? tlb_finish_mmu+0x1f/0x30
jan 07 21:27:22 mjb kernel:  ? unmap_region+0xc9/0xf0
jan 07 21:27:22 mjb kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
jan 07 21:27:22 mjb kernel:  do_vfs_ioctl+0x97/0x720
jan 07 21:27:22 mjb kernel:  ? __do_munmap.constprop.9+0x263/0x3a0
jan 07 21:27:22 mjb kernel:  __x64_sys_ioctl+0x62/0x90
jan 07 21:27:22 mjb kernel:  do_syscall_64+0x55/0x100
jan 07 21:27:22 mjb kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
jan 07 21:27:22 mjb kernel: RIP: 0033:0x7fed21f6480b
jan 07 21:27:22 mjb kernel: Code: 0f 1e fa 48 8b 05 55 b6 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3>
jan 07 21:27:22 mjb kernel: RSP: 002b:00007ffe6b968648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
jan 07 21:27:22 mjb kernel: RAX: ffffffffffffffda RBX: 00005630ead1d5e0 RCX: 00007fed21f6480b
jan 07 21:27:22 mjb kernel: RDX: 00007ffe6b968680 RSI: 0000000040086409 RDI: 000000000000000e
jan 07 21:27:22 mjb kernel: RBP: 00007ffe6b968680 R08: 00005630e9527c48 R09: 0000000000000000
jan 07 21:27:22 mjb kernel: R10: 000000000000001c R11: 0000000000000246 R12: 0000000040086409
jan 07 21:27:22 mjb kernel: R13: 000000000000000e R14: 00005630eaee5c80 R15: 00005630e957c960
jan 07 21:27:22 mjb kernel: Modules linked in: edac_mce_amd kvm_amd kvm snd_hda_codec_realtek amdgpu irqbypass snd_hda_codec_generic ledtrig_audio chash snd_hda_codec_hdmi amd_iommu_v2 gpu>
jan 07 21:27:22 mjb kernel: ---[ end trace b2ffa643a20c80ff ]---
jan 07 21:27:22 mjb kernel: RIP: 0010:__memcpy+0x12/0x20
jan 07 21:27:22 mjb kernel: Code: 48 89 c8 e9 f9 fc ff ff 48 89 f0 e9 f1 fc ff ff 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66>
jan 07 21:27:22 mjb kernel: RSP: 0018:ffffc90001b73cc0 EFLAGS: 00210246
jan 07 21:27:22 mjb kernel: RAX: ffff8e08888b4c00 RBX: ffff888105fd80b0 RCX: 0000000000000200
jan 07 21:27:22 mjb kernel: RDX: 0000000000000000 RSI: ffff8880d50f0000 RDI: ffff8e08888b4c00
jan 07 21:27:22 mjb kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
jan 07 21:27:22 mjb kernel: R10: ffffea000d8bf580 R11: ffff888143d89710 R12: 0000000000001000
jan 07 21:27:22 mjb kernel: R13: 0000000000000000 R14: 00000000d50f0000 R15: ffff8883fcaefd28
jan 07 21:27:22 mjb kernel: FS:  00007fed1f70bdc0(0000) GS:ffff88840ea00000(0000) knlGS:0000000000000000
jan 07 21:27:22 mjb kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 07 21:27:22 mjb kernel: CR2: 0000561332707448 CR3: 0000000402090000 CR4: 00000000003406f0
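One possible reading of why the same instruction in __memcpy shows up both as a "general protection fault" and as an "unable to handle kernel paging request": on x86-64, a store to a non-canonical address raises #GP, while a store to a canonical but unmapped address raises a page fault. The sketch below checks the three destination pointers (RDI) from the traces in this report. It assumes 4-level paging (48-bit virtual addresses); `is_canonical` is an illustrative helper, not a kernel API.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits=48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 set bits when va_bits=48
    return top == 0 or top == all_ones


# RDI values taken from the three traces in this report
for rdi in (0x0000a0050f003b80, 0xffff8e08888b4c00, 0xc930ce4031168b49):
    kind = "canonical" if is_canonical(rdi) else "non-canonical"
    print(f"{rdi:#018x}: {kind}")
```

Under that assumption, only the second pointer is canonical, which matches the fault types logged; in all three cases the destination looks like garbage rather than a valid buffer address.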
Comment 6 fin4478 2019-01-08 08:12:31 UTC
Arch Linux distributions are for Nvidia GPUs. Use Debian testing/sid with Xfce and the cosmic version of Mesa from the Oibaf PPA. The AMD WIP kernel is the best kernel for AMD GPUs:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.21-wip

System:
  Host: ryzenpc Kernel: 5.0.0-rc1 x86_64 bits: 64 Desktop: Xfce 4.12.4 
  Distro: Debian GNU/Linux buster/sid 
Machine:
  Type: Desktop Mobo: ASUSTeK model: PRIME B350M-K v: Rev X.0x 
  serial: <root required> UEFI [Legacy]: American Megatrends v: 4207 
  date: 12/07/2018 
CPU:
  6-Core: AMD Ryzen 5 1600 type: MT MCP speed: 2959 MHz 
Graphics:
  Device-1: AMD Ellesmere [Radeon RX 470/480] driver: amdgpu v: kernel 
  Display: x11 server: X.Org 1.20.3 driver: amdgpu 
  resolution: 3840x2160~60Hz 
  OpenGL: 
  renderer: Radeon RX 570 Series (POLARIS10 DRM 3.27.0 5.0.0-rc1 LLVM 7.0.1) 
  v: 4.5 Mesa 19.0.0-devel (git-70be9af 2019-01-02 cosmic-oibaf-ppa)
Comment 7 bmilreu 2019-01-08 15:15:11 UTC
(In reply to fin4478 from comment #6)
> blablabla

This makes zero sense and is totally uncalled for, especially here. Go back to posting your usual bs on Phoronix, debianxfce; this is not the place. You are polluting this and other bug reports without adding anything.
Comment 8 fin4478 2019-01-08 16:40:50 UTC
(In reply to bmilreu from comment #7)
> (In reply to fin4478 from comment #6)
> > blablabla
> 
> This makes zero sense and is totally uncalled for, especially here. Go back
> to posting your usual bs in Phoronix debianxfce, this is not the place. You
> are polluting this and other bug reports without adding anything.

Look in the mirror: Arch Linux, Ubuntu and Fedora users are polluting this system, see:
https://bugs.freedesktop.org/buglist.cgi?bug_status=__open__&component=DRM%2FAMDgpu&list_id=663649&product=DRI

You are using old kernels, old Mesa, buggy LLVM 8, etc. Don't you know that kernel configuration and BIOS settings can cause instability? Steam games support only Ubuntu and SteamOS.
Comment 9 bmilreu 2019-01-08 17:01:34 UTC
(In reply to fin4478 from comment #8)
> (In reply to bmilreu from comment #7)
> > (In reply to fin4478 from comment #6)
> > > blablabla
> > 
> > This makes zero sense and is totally uncalled for, especially here. Go back
> > to posting your usual bs in Phoronix debianxfce, this is not the place. You
> > are polluting this and other bug reports without adding anything.
> 
> Look in the mirror: Arch Linux, Ubuntu and Fedora users are polluting this
> system, see:
> https://bugs.freedesktop.org/buglist.cgi?bug_status=__open__&component=DRM%2FAMDgpu&list_id=663649&product=DRI
>
> You are using old kernels, old Mesa, buggy LLVM 8, etc. Don't you know that
> kernel configuration and BIOS settings can cause instability? Steam games
> support only Ubuntu and SteamOS.

Old kernels? This report is specifically about 4.21 wip/5.0-rc1, so that doesn't make any sense either.
Old mesa? My mesa builds daily and is usually newer than oibaf's by a couple of days.
Buggy llvm8? Maybe, but unless you can point out specific bugs it works just fine and is very close to a stable release.

Lastly, the reported bug is very likely to be in kernel code anyway so mesa and llvm are mostly irrelevant here.

I'm not answering you anymore; I'll leave it up to moderation to take care of this.
Comment 10 Sibren Vasse 2019-01-09 15:21:57 UTC
Created attachment 143038
Bisect result
Comment 11 Sibren Vasse 2019-01-09 15:22:16 UTC
I've been running into this issue multiple times a day. I noticed I hit the oops a lot more frequently when my system was under load (e.g. compiling a kernel) and I then opened a new tab in Firefox.

Don't ask me how, but eventually I figured out I could reproduce the problem reliably on my system by starting many instances of my terminal emulator until I hit the OOM killer.

v4.20 (good): OOM Killer kills processes and/or my user session and I can login again.
v5.0-rc1 (bad): System hangs with OOPS in dmesg.

So I started bisecting, result attached.

I have not been able to reproduce the problem after reverting the parent merge commit [af7ddd8a627c62a835524b3f5b471edbbbcce025]
and these related commits:
06f55fd2d22742ed7e725124dfea68936d12ce40
2e05ea5cdc1ac55d9ef678ed5ea6c38acf7fd2a3
d7076f07840851bbe57cb21ba052d6a4a9b1efa9
4788ba5792cc1368ba4867e1488dc168b4fe97b7
ed6ccf10f24bdfc1955bc8b976ddedc370fc3869

See the full tree here: https://github.com/SibrenVasse/linux/tree/revert

Hope this helps!
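
For readers unfamiliar with bisecting: with a reliable reproducer and one known-good and one known-bad kernel (v4.20 and v5.0-rc1 above), finding the offending change is essentially a binary search. The sketch below is a simplified illustration over a linear list of commits, not how git itself is implemented (real history is a graph, which git bisect handles). `first_bad` and the `is_bad` callback are hypothetical names; in practice `is_bad` would mean "build this commit, boot it, run the reproducer, and check dmesg for the oops".

```python
def first_bad(commits, is_bad):
    """Binary-search an ordered commit list for the first bad commit.

    `commits` runs oldest to newest; commits[0] is known good and
    commits[-1] is known bad. `is_bad` is a hypothetical test callback.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # first bad commit is at mid or earlier
        else:
            lo = mid   # first bad commit is after mid
    return commits[hi]
```

The number of kernels to build and test grows only logarithmically with the number of commits between the good and bad versions, which is why a fast, reliable reproducer matters so much here.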
Comment 12 Michel Dänzer 2019-01-09 15:36:28 UTC
Looks like this should be reported to Christoph Hellwig and other kernel DMA mapping helper developers then. Please Cc the dri-devel mailing list when doing so.
Comment 13 bmilreu 2019-01-11 06:36:46 UTC
There are a few new DMA fixes in the torvalds tree, but I'm still triggering the bug. I now got something similar but slightly different while watching a real-time 60fps interpolated video that uses OpenCL acceleration via ROCm. I attached the log; the first error is from the kfd driver and the second looks like the one reported in the original post.
Comment 14 bmilreu 2019-01-11 06:38:39 UTC
Created attachment 143066
dmesg kfd and amdgpu hangs

attachment for last comment
Comment 15 bmilreu 2019-01-14 03:38:39 UTC
@Sibren Vasse
Have you forwarded this to the DMA devs yet?
Comment 16 Sibren Vasse 2019-01-14 10:53:21 UTC
@bmilreu: No, Michel beat me to it.

See thread here: https://lists.linuxfoundation.org/pipermail/iommu/2019-January/032528.html
Comment 17 Sibren Vasse 2019-01-14 19:33:45 UTC
@bmilreu: Could you try this patch? It works for me.
https://lists.linuxfoundation.org/pipermail/iommu/2019-January/032651.html
Comment 18 bmilreu 2019-01-14 21:30:30 UTC
(In reply to Sibren Vasse from comment #17)
> @bmilreu: Could you try this patch? It works for me.
> https://lists.linuxfoundation.org/pipermail/iommu/2019-January/032651.html

Sure, will report if it fixes it for me.
Comment 19 Michel Dänzer 2019-01-15 08:15:51 UTC
Thanks for the report. It turned out to be a bug in the DMA subsystem.
Comment 20 mikhail.v.gavrilov 2019-01-18 03:45:20 UTC
Michel, thanks.
I tested this patch https://patchwork.codeaurora.org/patch/699617/ for several days and can confirm that it fixes the problem.
Comment 21 mikhail.v.gavrilov 2019-01-18 03:47:21 UTC
Forgot to ask: when will it be merged into Linus' tree?
Comment 22 bmilreu 2019-01-18 04:31:08 UTC
(In reply to mikhail.v.gavrilov from comment #21)
> Forgot to ask: when it will be merged in Linus tree?

https://github.com/torvalds/linux/commit/6d060fa39035d5ff6bb3e720a8119aeb50453e3b

Can confirm my system has been stable for 3 days with the patch.

