Summary: | [regression][vega10] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout on exiting certain Steam games |
---|---|
Product: | DRI |
Reporter: | Vedran Miletić <vedran> |
Component: | DRM/AMDgpu |
Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED |
QA Contact: | |
Severity: | normal |
Priority: | medium |
Version: | DRI git |
Hardware: | Other |
OS: | All |
Whiteboard: | |
i915 platform: | |
i915 features: | |
Attachments: | |

Description
Vedran Miletić
2017-12-16 13:53:22 UTC
Comment 1

Also happens with American Truck Simulator.

Comment 2

Can you bisect?

Comment 3

Yes, I can, but it will take some time because there is an unrelated bug in between which makes many revisions unbootable.

Comment 4

You can restrict that to changes to drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c.

The problem is that we use more dw than expected for clearing the page tables. No idea what exactly goes wrong, but bisecting the commit which introduced it would certainly help.

Comment 5

(In reply to Christian König from comment #4)
> You can restrict that to changes to drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c.
>
> The problem is that we use more dw than expected for clearing the page
> tables. No idea what exactly goes wrong, but bisecting the commit which
> introduced it would certainly help.

I'm sorry, but I will not be able to bisect this. Checkouts of relevant commits don't boot and simple reverts do apply cleanly, but don't compile.

Comment 6

Created attachment 136340 [details] [review]
Possible fix

Complete shot into the dark, but while double checking the code I've found that at least this calculation isn't correct.

Comment 7

(In reply to Vedran Miletić from comment #5)
> I'm sorry, but I will not be able to bisect this. Checkouts of relevant
> commits don't boot and simple reverts do apply cleanly, but don't compile.

FWIW, you may still be able to at least narrow things down with git bisect. If you can't test a selected commit, run "git bisect skip". That will select another commit to test. You can also manually check out another commit to test. In the worst case, the bisection process will end with identifying the minimal set of candidates instead of a single commit.

Comment 8

I think I've figured out what is going on here. Give me a moment to provide a new patch.

Comment 9

Created attachment 136343 [details] [review]
Possible fix v2

Please try that one instead.

Comment 10

I could reliably reproduce this by starting Fallout 4 in Wine, getting the same or similar crashes in dmesg. However, with the last attachment Christian König posted, it now runs.
https://bugs.freedesktop.org/attachment.cgi?id=136343

dmesg:

    dec 31 15:01:22 tom-pc kernel: WARNING: CPU: 6 PID: 25993 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1641 amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: Modules linked in: fuse mousedev msr nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp
    dec 31 15:01:22 tom-pc kernel: gpu_sched drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm agpgart
    dec 31 15:01:22 tom-pc kernel: CPU: 6 PID: 25993 Comm: amdgpu_cs:0 Tainted: G W 4.15.0-rc2-mainline #1
    dec 31 15:01:22 tom-pc kernel: Hardware name: Gigabyte Technology Co., Ltd. Z170-HD3P/Z170-HD3P-CF, BIOS F20 11/04/2016
    dec 31 15:01:22 tom-pc kernel: task: 00000000569a51e8 task.stack: 00000000bc284a6f
    dec 31 15:01:22 tom-pc kernel: RIP: 0010:amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: RSP: 0018:fffface501b7b9e0 EFLAGS: 00010216
    dec 31 15:01:22 tom-pc kernel: RAX: ffff92a0f7ac6e58 RBX: ffff92a0c072d800 RCX: ffff92a1682b6550
    dec 31 15:01:22 tom-pc kernel: RDX: fffface50336c700 RSI: ffff92a0f7ac6e58 RDI: ffff92a1682b6560
    dec 31 15:01:22 tom-pc kernel: RBP: ffff92a1682b0000 R08: 0000000000000002 R09: 0000000000000000
    dec 31 15:01:22 tom-pc kernel: R10: 00000000000007fb R11: 00000000000007f9 R12: 000000000000078e
    dec 31 15:01:22 tom-pc kernel: R13: ffff92a1682b6560 R14: 0000000000109200 R15: 0000000000000000
    dec 31 15:01:22 tom-pc kernel: FS: 00007fc349c21700(0000) GS:ffff92a17ed80000(0000) knlGS:00007fffffea8000
    dec 31 15:01:22 tom-pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    dec 31 15:01:22 tom-pc kernel: CR2: 00007fc296881fa8 CR3: 00000003e8fbd003 CR4: 00000000003606e0
    dec 31 15:01:22 tom-pc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    dec 31 15:01:22 tom-pc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    dec 31 15:01:22 tom-pc kernel: Call Trace:
    dec 31 15:01:22 tom-pc kernel: ? amdgpu_vm_free_mapping.isra.24+0x20/0x20 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: amdgpu_vm_bo_update+0x327/0x5e0 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: amdgpu_vm_handle_moved+0x73/0xa0 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: amdgpu_cs_ioctl+0x1a4a/0x1ae0 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: drm_ioctl_kernel+0x59/0xb0 [drm]
    dec 31 15:01:22 tom-pc kernel: drm_ioctl+0x2d5/0x370 [drm]
    dec 31 15:01:22 tom-pc kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
    dec 31 15:01:22 tom-pc kernel: do_vfs_ioctl+0xa1/0x610
    dec 31 15:01:22 tom-pc kernel: ? SyS_futex+0x12d/0x180
    dec 31 15:01:22 tom-pc kernel: SyS_ioctl+0x74/0x80
    dec 31 15:01:22 tom-pc kernel: entry_SYSCALL_64_fastpath+0x1a/0x7d
    dec 31 15:01:22 tom-pc kernel: RIP: 0033:0x7fc41e3b1a07
    dec 31 15:01:22 tom-pc kernel: RSP: 002b:00007fc349c20c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    dec 31 15:01:22 tom-pc kernel: RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007fc41e3b1a07
    dec 31 15:01:22 tom-pc kernel: RDX: 00007fc349c20ce0 RSI: 00000000c0186444 RDI: 000000000000001e
    dec 31 15:01:22 tom-pc kernel: RBP: 00007fc349c20e00 R08: 00007fc349c20d80 R09: 00007fc349c20cc0
    dec 31 15:01:22 tom-pc kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 000000007cdf0a98
    dec 31 15:01:22 tom-pc kernel: R13: 0000000000000001 R14: 00007fc349c20cf0 R15: 0000000000000000
    dec 31 15:01:22 tom-pc kernel: Code: ff 74 16 f0 ff 0f 0f 88 3c d4 12 00 75 0b 89 04 24 e8 c8 44 0a e3 8b 04 24 48 8b 54 24 38 48 8b 5c 24 08 48 89 13 e9 0b fd
    dec 31 15:01:22 tom-pc kernel: ---[ end trace 425bb209c57fc66b ]---
    dec 31 15:01:32 tom-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=53896, last emitted seq=53898
    dec 31 15:01:32 tom-pc kernel: [drm] No hardware hang detected. Did some blocks stall?
    dec 31 15:01:35 tom-pc systemd-logind[561]: Power key pressed.
    dec 31 15:01:35 tom-pc systemd-logind[561]: Powering Off...
    dec 31 15:01:35 tom-pc systemd-logind[561]: System is powering down.

Comment 11

Code fix is now in amd-staging-drm-next

Comment 12

(In reply to Michel Dänzer from comment #7)
> (In reply to Vedran Miletić from comment #5)
> > I'm sorry, but I will not be able to bisect this. Checkouts of relevant
> > commits don't boot and simple reverts do apply cleanly, but don't compile.
>
> FWIW, you may still be able to at least narrow things down with git bisect.
> If you can't test a selected commit, run "git bisect skip". That will select
> another commit to test. You can also manually check out another commit to
> test. In the worst case, the bisection process will end with identifying the
> minimal set of candidates instead of a single commit.

Thanks for the suggestion. Tried that and didn't get anywhere (all the relevant commits were broken in one way or another).

(In reply to Christian König from comment #11)
> Code fix is now in amd-staging-drm-next

Verified as fixed. (Would have checked earlier, but was away from the computer with Vega.)

Comment 13

Sorry to post into this already closed bug. Should this issue be fixed in 4.17.12? I am asking because I see sporadic system hangs that start with these messages:

    Aug 09 08:20:18 thinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=2260291, last emitted seq=2260293
    Aug 09 08:20:18 thinkpad kernel: [drm] No hardware hang detected. Did some blocks stall?
    Aug 09 08:20:35 thinkpad kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kwin_x11:915]

Sounds similar to this bug.

Comment 14

My Radeon Pro Duo (polaris) is experiencing ring sdma0 timeouts when trying to move to newer kernels. I’m running a custom build of 4.17.0-rc2-180424-fkxamd (from ROCm Kernel https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/fkxamd/drm-next-wip) without issues.
When I build either of these kernels, the card gets ring timeouts on boot. I tried both amdgpu-pro 18.20 and 18.30 for userland; it didn't matter.

amd-staging-drm-next (built Oct 7 2018)

    [ 61.701281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=888, emitted seq=890
    [ 61.701285] [drm] GPU recovery disabled.
    [ 61.701397] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=902, emitted seq=904
    [ 61.701399] [drm] GPU recovery disabled.

drm-next-4.20-wip (built Oct 8 2018)

    [ 60.840847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=914, emitted seq=916
    [ 60.840851] [drm] GPU recovery disabled.
    [ 60.840962] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=907, emitted seq=909
    [ 60.840964] [drm] GPU recovery disabled.

Both of these kernels work fine on my Vega 56 and Vega 64s; only the Pro Duo has the ring timeouts.

Comment 15

(In reply to dallase from comment #14)
> My Radeon Pro Duo (polaris) is experiencing ring sdma0 timeouts when trying
> to move to newer kernels.

Please file your own report. Per comment 12, the issue this report is about is fixed.
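For anyone comparing their own logs against the ones quoted in this report, here is a rough Python sketch that pulls the ring name and sequence numbers out of the `amdgpu_job_timedout` lines. The pattern is an assumption derived only from the lines pasted above (some say "last signaled seq=", others just "signaled seq="), and the function name, the `SAMPLE` text and the `pending` field are made up for illustration. `pending` is simply emitted minus signaled, which I read as the number of submissions still outstanding; in every log quoted in this report that difference is 2.

```python
import re

# Assumed pattern, based only on the timeout lines quoted in this report.
# The "last " prefix appears in some of the quoted logs and not in others,
# so it is treated as optional.
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"(?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return one dict per ring timeout message found in log_text."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Hypothetical convenience field: emitted minus signaled.
            "pending": emitted - signaled,
        })
    return results


# Two lines copied from the logs in this report, one in each format.
SAMPLE = """\
dec 31 15:01:32 tom-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=53896, last emitted seq=53898
[ 61.701281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=888, emitted seq=890
"""

if __name__ == "__main__":
    for entry in parse_ring_timeouts(SAMPLE):
        print(entry)
```

To run it against a saved log, pass the file contents to `parse_ring_timeouts()` instead of `SAMPLE`. A matching message only shows the same symptom; as the last comment says, it does not mean the cause is the one fixed here.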