Bug 104289

Summary: [regression][vega10] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout on exiting certain Steam games
Product: DRI
Reporter: Vedran Miletić <vedran>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal    
Priority: medium    
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
Attachments:
Description        Flags
Possible fix       none
Possible fix v2    none

Description Vedran Miletić 2017-12-16 13:53:22 UTC
Vega 10, amd-staging-drm-next 4021f6f628ee6cd621a22e768f7f3ae94f330790: when exiting Hitman 2016 (but not Xonotic) I get the warnings and the ring timeout below. It's a regression and it's reliably reproducible.

[ 5261.568646] WARNING: CPU: 0 PID: 25480 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1641 amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
[ 5261.568648] Modules linked in: rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack devlink libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat arc4 iwlmvm mac80211 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel iwlwifi snd_hda_codec edac_mce_amd btusb snd_hda_core btrtl btbcm kvm btintel snd_hwdep bluetooth snd_seq irqbypass snd_seq_device
[ 5261.568665]  cfg80211 crct10dif_pclmul snd_pcm joydev mxm_wmi wmi_bmof crc32_pclmul snd_timer ecdh_generic ghash_clmulni_intel snd rfkill soundcore ccp sp5100_tco pcspkr shpchp i2c_piix4 k10temp wmi acpi_cpufreq binfmt_misc amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper ttm drm crc32c_intel alx hid_holtek_mouse mdio
[ 5261.568675] CPU: 0 PID: 25480 Comm: gallium_drv:0 Not tainted 4.15.0-rc2+ #27
[ 5261.568676] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS Gaming 7/X399 AORUS Gaming 7, BIOS F2 08/31/2017
[ 5261.568676] task: 0000000045d1804c task.stack: 00000000e1257dad
[ 5261.568694] RIP: 0010:amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
[ 5261.568694] RSP: 0018:ffffa3c751d83998 EFLAGS: 00010212
[ 5261.568695] RAX: ffff9875fd962a58 RBX: ffff987a9942e000 RCX: ffff987a73d86550
[ 5261.568695] RDX: ffffa3c744e4e000 RSI: ffff9875fd962a58 RDI: ffff987a73d86560
[ 5261.568696] RBP: ffff987a73d80000 R08: 0000000000000002 R09: 0000000000000000
[ 5261.568696] R10: 0000000000001ffb R11: 0000000000001ff9 R12: 0000000000001b0e
[ 5261.568697] R13: ffff987a73d86560 R14: 0000000000101a00 R15: 0000000000000000
[ 5261.568698] FS:  00007fcd9f7fe700(0000) GS:ffff987a9de00000(0000) knlGS:0000000000000000
[ 5261.568698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5261.568698] CR2: 00007f2581124010 CR3: 00000008011ad000 CR4: 00000000003406f0
[ 5261.568699] Call Trace:
[ 5261.568705]  ? _cond_resched+0x15/0x40
[ 5261.568706]  ? __ww_mutex_lock.isra.2+0x42/0x640
[ 5261.568721]  ? amdgpu_vm_free_mapping.isra.23+0x20/0x20 [amdgpu]
[ 5261.568736]  amdgpu_vm_clear_freed+0xbb/0x190 [amdgpu]
[ 5261.568751]  amdgpu_gem_object_close+0x19c/0x210 [amdgpu]
[ 5261.568760]  ? drm_gem_object_release_handle+0x2c/0x90 [drm]
[ 5261.568764]  drm_gem_object_release_handle+0x2c/0x90 [drm]
[ 5261.568769]  ? drm_gem_object_handle_put_unlocked+0xb0/0xb0 [drm]
[ 5261.568771]  idr_for_each+0x48/0xe0
[ 5261.568776]  drm_gem_release+0x1c/0x30 [drm]
[ 5261.568780]  drm_release+0x342/0x3b0 [drm]
[ 5261.568783]  __fput+0xcd/0x1d0
[ 5261.568785]  task_work_run+0x81/0xa0
[ 5261.568787]  do_exit+0x2de/0xba0
[ 5261.568792]  ? drm_ioctl_kernel+0x59/0xb0 [drm]
[ 5261.568793]  do_group_exit+0x3a/0xa0
[ 5261.568794]  get_signal+0x26c/0x570
[ 5261.568796]  do_signal+0x36/0x610
[ 5261.568798]  ? do_vfs_ioctl+0xa1/0x610
[ 5261.568800]  ? SyS_futex+0x12d/0x180
[ 5261.568802]  exit_to_usermode_loop+0x69/0xa0
[ 5261.568803]  syscall_return_slowpath+0xbf/0xd0
[ 5261.568804]  entry_SYSCALL_64_fastpath+0x7b/0x7d
[ 5261.568805] RIP: 0033:0x7fceb27696dc
[ 5261.568805] RSP: 002b:00007fcd9f7fd620 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 5261.568806] RAX: fffffffffffffe00 RBX: 0000000008bc2028 RCX: 00007fceb27696dc
[ 5261.568806] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000008bc2050
[ 5261.568807] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000008020100
[ 5261.568807] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000008bc2000
[ 5261.568807] R13: 0000000001687f40 R14: 0000000000000000 R15: 0000000008bc2050
[ 5261.568808] Code: ff 74 16 f0 ff 0f 0f 88 db e5 14 00 75 0b 89 04 24 e8 e8 1f f8 d1 8b 04 24 48 8b 54 24 38 48 8b 5c 24 08 48 89 13 e9 0b fd ff ff <0f> ff eb 88 e8 1a 15 a7 d1 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
[ 5261.568822] ---[ end trace 700e4f5bfa6fddf4 ]---
[ 5261.569536] WARNING: CPU: 0 PID: 25480 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1641 amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
[ 5261.569538] Modules linked in: rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack devlink libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat arc4 iwlmvm mac80211 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel iwlwifi snd_hda_codec edac_mce_amd btusb snd_hda_core btrtl btbcm kvm btintel snd_hwdep bluetooth snd_seq irqbypass snd_seq_device
[ 5261.569552]  cfg80211 crct10dif_pclmul snd_pcm joydev mxm_wmi wmi_bmof crc32_pclmul snd_timer ecdh_generic ghash_clmulni_intel snd rfkill soundcore ccp sp5100_tco pcspkr shpchp i2c_piix4 k10temp wmi acpi_cpufreq binfmt_misc amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper ttm drm crc32c_intel alx hid_holtek_mouse mdio
[ 5261.569558] CPU: 0 PID: 25480 Comm: gallium_drv:0 Tainted: G        W        4.15.0-rc2+ #27
[ 5261.569559] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS Gaming 7/X399 AORUS Gaming 7, BIOS F2 08/31/2017
[ 5261.569559] task: 0000000045d1804c task.stack: 00000000e1257dad
[ 5261.569572] RIP: 0010:amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
[ 5261.569574] RSP: 0018:ffffa3c751d83998 EFLAGS: 00010216
[ 5261.569575] RAX: ffff987a86601258 RBX: ffff987a9942e000 RCX: ffff987a73d86550
[ 5261.569575] RDX: 0000000000000004 RSI: ffff987a86601258 RDI: ffffa3c744d7f300
[ 5261.569575] RBP: ffff987a73d80000 R08: 0000000000001018 R09: 0000000000000004
[ 5261.569576] R10: 0000000000030000 R11: 000000000000100d R12: 0000000000000e0e
[ 5261.569576] R13: ffff987a73d86560 R14: 000000000013f000 R15: 0000000000000000
[ 5261.569577] FS:  00007fcd9f7fe700(0000) GS:ffff987a9de00000(0000) knlGS:0000000000000000
[ 5261.569577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5261.569579] CR2: 00007f2581124010 CR3: 00000001bbe09000 CR4: 00000000003406f0
[ 5261.569579] Call Trace:
[ 5261.569581]  ? _cond_resched+0x15/0x40
[ 5261.569582]  ? __ww_mutex_lock.isra.2+0x42/0x640
[ 5261.569595]  ? amdgpu_vm_free_mapping.isra.23+0x20/0x20 [amdgpu]
[ 5261.569609]  amdgpu_vm_clear_freed+0xbb/0x190 [amdgpu]
[ 5261.569623]  amdgpu_gem_object_close+0x19c/0x210 [amdgpu]
[ 5261.569629]  ? drm_gem_object_release_handle+0x2c/0x90 [drm]
[ 5261.569633]  drm_gem_object_release_handle+0x2c/0x90 [drm]
[ 5261.569637]  ? drm_gem_object_handle_put_unlocked+0xb0/0xb0 [drm]
[ 5261.569638]  idr_for_each+0x48/0xe0
[ 5261.569642]  drm_gem_release+0x1c/0x30 [drm]
[ 5261.569646]  drm_release+0x342/0x3b0 [drm]
[ 5261.569649]  __fput+0xcd/0x1d0
[ 5261.569650]  task_work_run+0x81/0xa0
[ 5261.569651]  do_exit+0x2de/0xba0
[ 5261.569655]  ? drm_ioctl_kernel+0x59/0xb0 [drm]
[ 5261.569656]  do_group_exit+0x3a/0xa0
[ 5261.569657]  get_signal+0x26c/0x570
[ 5261.569658]  do_signal+0x36/0x610
[ 5261.569659]  ? do_vfs_ioctl+0xa1/0x610
[ 5261.569660]  ? SyS_futex+0x12d/0x180
[ 5261.569662]  exit_to_usermode_loop+0x69/0xa0
[ 5261.569662]  syscall_return_slowpath+0xbf/0xd0
[ 5261.569663]  entry_SYSCALL_64_fastpath+0x7b/0x7d
[ 5261.569664] RIP: 0033:0x7fceb27696dc
[ 5261.569664] RSP: 002b:00007fcd9f7fd620 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 5261.569665] RAX: fffffffffffffe00 RBX: 0000000008bc2028 RCX: 00007fceb27696dc
[ 5261.569665] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000008bc2050
[ 5261.569665] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000008020100
[ 5261.569666] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000008bc2000
[ 5261.569666] R13: 0000000001687f40 R14: 0000000000000000 R15: 0000000008bc2050
[ 5261.569667] Code: ff 74 16 f0 ff 0f 0f 88 db e5 14 00 75 0b 89 04 24 e8 e8 1f f8 d1 8b 04 24 48 8b 54 24 38 48 8b 5c 24 08 48 89 13 e9 0b fd ff ff <0f> ff eb 88 e8 1a 15 a7 d1 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
[ 5261.569681] ---[ end trace 700e4f5bfa6fddf5 ]---
[ 5261.569753] WARNING: CPU: 0 PID: 25480 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1641 amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
[ 5261.569754] Modules linked in: rfcomm fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack devlink libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat arc4 iwlmvm mac80211 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel iwlwifi snd_hda_codec edac_mce_amd btusb snd_hda_core btrtl btbcm kvm btintel snd_hwdep bluetooth snd_seq irqbypass snd_seq_device
[ 5261.569766]  cfg80211 crct10dif_pclmul snd_pcm joydev mxm_wmi wmi_bmof crc32_pclmul snd_timer ecdh_generic ghash_clmulni_intel snd rfkill soundcore ccp sp5100_tco pcspkr shpchp i2c_piix4 k10temp wmi acpi_cpufreq binfmt_misc amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper ttm drm crc32c_intel alx hid_holtek_mouse mdio
[ 5261.569772] CPU: 0 PID: 25480 Comm: gallium_drv:0 Tainted: G        W        4.15.0-rc2+ #27
[ 5261.569772] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS Gaming 7/X399 AORUS Gaming 7, BIOS F2 08/31/2017
[ 5261.569773] task: 0000000045d1804c task.stack: 00000000e1257dad
[ 5261.569785] RIP: 0010:amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
[ 5261.569787] RSP: 0018:ffffa3c751d83998 EFLAGS: 00010212
[ 5261.569787] RAX: ffff987a86604258 RBX: ffff987a9942e000 RCX: ffff987a73d86550
[ 5261.569788] RDX: 0000000000000004 RSI: ffff987a86604258 RDI: ffffa3c744d83700
[ 5261.569788] RBP: ffff987a73d80000 R08: 0000000000000c18 R09: 0000000000000004
[ 5261.569788] R10: 0000000000030000 R11: 0000000000000c0d R12: 0000000000000ace
[ 5261.569789] R13: ffff987a73d86560 R14: 0000000000151400 R15: 0000000000000000
[ 5261.569789] FS:  00007fcd9f7fe700(0000) GS:ffff987a9de00000(0000) knlGS:0000000000000000
[ 5261.569790] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5261.569792] CR2: 00007f2581124010 CR3: 00000001bbe09000 CR4: 00000000003406f0
[ 5261.569792] Call Trace:
[ 5261.569794]  ? _cond_resched+0x15/0x40
[ 5261.569795]  ? __ww_mutex_lock.isra.2+0x42/0x640
[ 5261.569807]  ? amdgpu_vm_free_mapping.isra.23+0x20/0x20 [amdgpu]
[ 5261.569819]  amdgpu_vm_clear_freed+0xbb/0x190 [amdgpu]
[ 5261.569833]  amdgpu_gem_object_close+0x19c/0x210 [amdgpu]
[ 5261.569839]  ? drm_gem_object_release_handle+0x2c/0x90 [drm]
[ 5261.569842]  drm_gem_object_release_handle+0x2c/0x90 [drm]
[ 5261.569846]  ? drm_gem_object_handle_put_unlocked+0xb0/0xb0 [drm]
[ 5261.569847]  idr_for_each+0x48/0xe0
[ 5261.569851]  drm_gem_release+0x1c/0x30 [drm]
[ 5261.569854]  drm_release+0x342/0x3b0 [drm]
[ 5261.569857]  __fput+0xcd/0x1d0
[ 5261.569858]  task_work_run+0x81/0xa0
[ 5261.569859]  do_exit+0x2de/0xba0
[ 5261.569864]  ? drm_ioctl_kernel+0x59/0xb0 [drm]
[ 5261.569865]  do_group_exit+0x3a/0xa0
[ 5261.569866]  get_signal+0x26c/0x570
[ 5261.569867]  do_signal+0x36/0x610
[ 5261.569868]  ? do_vfs_ioctl+0xa1/0x610
[ 5261.569869]  ? SyS_futex+0x12d/0x180
[ 5261.569870]  exit_to_usermode_loop+0x69/0xa0
[ 5261.569871]  syscall_return_slowpath+0xbf/0xd0
[ 5261.569872]  entry_SYSCALL_64_fastpath+0x7b/0x7d
[ 5261.569872] RIP: 0033:0x7fceb27696dc
[ 5261.569873] RSP: 002b:00007fcd9f7fd620 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 5261.569873] RAX: fffffffffffffe00 RBX: 0000000008bc2028 RCX: 00007fceb27696dc
[ 5261.569874] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000008bc2050
[ 5261.569874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000008020100
[ 5261.569874] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000008bc2000
[ 5261.569875] R13: 0000000001687f40 R14: 0000000000000000 R15: 0000000008bc2050
[ 5261.569875] Code: ff 74 16 f0 ff 0f 0f 88 db e5 14 00 75 0b 89 04 24 e8 e8 1f f8 d1 8b 04 24 48 8b 54 24 38 48 8b 5c 24 08 48 89 13 e9 0b fd ff ff <0f> ff eb 88 e8 1a 15 a7 d1 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
[ 5261.569890] ---[ end trace 700e4f5bfa6fddf6 ]---
[ 5271.969130] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=135113, last emitted seq=135115
[ 5271.969137] [drm] No hardware hang detected. Did some blocks stall?
Comment 1 Vedran Miletić 2017-12-16 22:13:11 UTC
Also happens with American Truck Simulator.
Comment 2 Michel Dänzer 2017-12-18 10:18:44 UTC
Can you bisect?
Comment 3 Vedran Miletić 2017-12-19 02:38:59 UTC
Yes, I can, but it will take some time because there is an unrelated bug in between which makes many revisions unbootable.
Comment 4 Christian König 2017-12-19 13:38:24 UTC
You can restrict that to changes to drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c.

The problem is that we use more dw than expected for clearing the page tables. No idea what exactly goes wrong, but bisecting the commit which introduced it would certainly help.
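
For reference, a bisect is restricted to one file by passing the path after `--` to `git bisect start`; on the kernel tree that would be `git bisect start <bad> <good> -- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c`. Below is a minimal sketch on a throwaway repository. The file names, commit messages and `check.sh` are made up for illustration, and the word `bug` in `vm.c` stands in for the regression:

```shell
#!/bin/sh
# Sketch only: a throwaway repository, not the kernel tree. File names and the
# "bug" marker are hypothetical stand-ins for amdgpu_vm.c and the regression.
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

commit() {
    git add -A
    git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "$1"
}

echo ok > vm.c; echo 0 > other.c; commit "1: initial, good"
echo 1 > other.c;                 commit "2: unrelated"
echo "ok v2" > vm.c;              commit "3: vm.c change, still good"
echo 2 > other.c;                 commit "4: unrelated"
echo bug > vm.c;                  commit "5: vm.c change, introduces the bug"
first_bad=$(git rev-parse HEAD)
echo 3 > other.c;                 commit "6: unrelated"
echo "bug v2" > vm.c;             commit "7: vm.c change"
echo 4 > other.c;                 commit "8: unrelated"

# Test script for "git bisect run": exit 0 = good, exit 1 = bad.
printf '%s\n' '#!/bin/sh' '! grep -q bug vm.c' > "$work/check.sh"

# bad = HEAD (commit 8), good = HEAD~7 (commit 1); only commits that touch
# vm.c are considered as candidates.
git bisect start HEAD HEAD~7 -- vm.c >/dev/null
result=$(git bisect run sh "$work/check.sh" 2>&1)
git bisect reset >/dev/null 2>&1
echo "$result"
```

The unrelated commits are never checked out for testing, which is the point of the restriction: fewer revisions to build and boot.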
Comment 5 Vedran Miletić 2017-12-21 12:27:56 UTC
(In reply to Christian König from comment #4)
> You can restrict that to changes to drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c.
> 
> The problem is that we use more dw than expected for clearing the page
> tables. No idea what exactly goes wrong, but bisecting the commit which
> introduced it would certainly help.

I'm sorry, but I will not be able to bisect this. Checkouts of relevant commits don't boot and simple reverts do apply cleanly, but don't compile.
Comment 6 Christian König 2017-12-21 12:29:20 UTC
Created attachment 136340 [details] [review]
Possible fix

A complete shot in the dark, but while double-checking the code I found that at least this calculation isn't correct.
Comment 7 Michel Dänzer 2017-12-21 14:03:05 UTC
(In reply to Vedran Miletić from comment #5)
> I'm sorry, but I will not be able to bisect this. Checkouts of relevant
> commits don't boot and simple reverts do apply cleanly, but don't compile.

FWIW, you may still be able to at least narrow things down with git bisect. If you can't test a selected commit, run "git bisect skip". That will select another commit to test. You can also manually check out another commit to test. In the worst case, the bisection process will end with identifying the minimal set of candidates instead of a single commit.
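
The skip workflow described above can be tried on a throwaway repository. This is a sketch under stated assumptions: the eight-commit history and `check.sh` are invented for illustration, commits 4 and 5 stand in for revisions that don't boot, and commit 6 stands in for the regression. The script uses `git bisect run`, where exit status 125 has the same effect as running `git bisect skip` by hand:

```shell
#!/bin/sh
# Sketch only: a throwaway repository with a made-up history. Commits 4 and 5
# play the role of revisions that cannot be tested; commit 6 introduces the bug.
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > rev
    git add rev
    git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "commit $i"
done
tip=$(git rev-parse HEAD)

# Test script for "git bisect run": exit 125 = cannot test (skip),
# exit 0 = good, exit 1 = bad.
cat > "$work/check.sh" <<'EOF'
#!/bin/sh
case "$(cat rev)" in
    4|5)   exit 125 ;;
    1|2|3) exit 0 ;;
    *)     exit 1 ;;
esac
EOF

# bad = commit 8, good = commit 1.
git bisect start "$tip" "$tip~7" >/dev/null
# Bisect cannot narrow this down to one commit, so a non-zero status is expected.
result=$(git bisect run sh "$work/check.sh" 2>&1) || true
git bisect reset >/dev/null 2>&1
echo "$result"
```

Because the skipped commits sit directly before the first bad one, git cannot pick a single culprit and instead prints the remaining candidates (here commits 4, 5 and 6).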
Comment 8 Christian König 2017-12-21 14:15:10 UTC
I think I've figured out what is going on here. Give me a moment to provide a new patch.
Comment 9 Christian König 2017-12-21 14:50:42 UTC
Created attachment 136343 [details] [review]
Possible fix v2

Please try that one instead.
Comment 10 Tom Englund 2017-12-31 14:07:26 UTC
I could reliably reproduce this by starting Fallout 4 in Wine, getting the same or similar crashes in dmesg.

However, with the last attachment Christian König posted, it now runs.
https://bugs.freedesktop.org/attachment.cgi?id=136343

dmesg: 

dec 31 15:01:22 tom-pc kernel: WARNING: CPU: 6 PID: 25993 at drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1641 amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
dec 31 15:01:22 tom-pc kernel: Modules linked in: fuse mousedev msr nls_iso8859_1 nls_cp437 vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp 
dec 31 15:01:22 tom-pc kernel:  gpu_sched drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm agpgart
dec 31 15:01:22 tom-pc kernel: CPU: 6 PID: 25993 Comm: amdgpu_cs:0 Tainted: G        W        4.15.0-rc2-mainline #1
dec 31 15:01:22 tom-pc kernel: Hardware name: Gigabyte Technology Co., Ltd. Z170-HD3P/Z170-HD3P-CF, BIOS F20 11/04/2016
dec 31 15:01:22 tom-pc kernel: task: 00000000569a51e8 task.stack: 00000000bc284a6f
dec 31 15:01:22 tom-pc kernel: RIP: 0010:amdgpu_vm_bo_update_mapping+0x3dd/0x3f0 [amdgpu]
dec 31 15:01:22 tom-pc kernel: RSP: 0018:fffface501b7b9e0 EFLAGS: 00010216
dec 31 15:01:22 tom-pc kernel: RAX: ffff92a0f7ac6e58 RBX: ffff92a0c072d800 RCX: ffff92a1682b6550
dec 31 15:01:22 tom-pc kernel: RDX: fffface50336c700 RSI: ffff92a0f7ac6e58 RDI: ffff92a1682b6560
dec 31 15:01:22 tom-pc kernel: RBP: ffff92a1682b0000 R08: 0000000000000002 R09: 0000000000000000
dec 31 15:01:22 tom-pc kernel: R10: 00000000000007fb R11: 00000000000007f9 R12: 000000000000078e
dec 31 15:01:22 tom-pc kernel: R13: ffff92a1682b6560 R14: 0000000000109200 R15: 0000000000000000
dec 31 15:01:22 tom-pc kernel: FS:  00007fc349c21700(0000) GS:ffff92a17ed80000(0000) knlGS:00007fffffea8000
dec 31 15:01:22 tom-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
dec 31 15:01:22 tom-pc kernel: CR2: 00007fc296881fa8 CR3: 00000003e8fbd003 CR4: 00000000003606e0
dec 31 15:01:22 tom-pc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
dec 31 15:01:22 tom-pc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
dec 31 15:01:22 tom-pc kernel: Call Trace:
dec 31 15:01:22 tom-pc kernel:  ? amdgpu_vm_free_mapping.isra.24+0x20/0x20 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  amdgpu_vm_bo_update+0x327/0x5e0 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  amdgpu_vm_handle_moved+0x73/0xa0 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  amdgpu_cs_ioctl+0x1a4a/0x1ae0 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  drm_ioctl_kernel+0x59/0xb0 [drm]
dec 31 15:01:22 tom-pc kernel:  drm_ioctl+0x2d5/0x370 [drm]
dec 31 15:01:22 tom-pc kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
dec 31 15:01:22 tom-pc kernel:  do_vfs_ioctl+0xa1/0x610
dec 31 15:01:22 tom-pc kernel:  ? SyS_futex+0x12d/0x180
dec 31 15:01:22 tom-pc kernel:  SyS_ioctl+0x74/0x80
dec 31 15:01:22 tom-pc kernel:  entry_SYSCALL_64_fastpath+0x1a/0x7d
dec 31 15:01:22 tom-pc kernel: RIP: 0033:0x7fc41e3b1a07
dec 31 15:01:22 tom-pc kernel: RSP: 002b:00007fc349c20c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
dec 31 15:01:22 tom-pc kernel: RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007fc41e3b1a07
dec 31 15:01:22 tom-pc kernel: RDX: 00007fc349c20ce0 RSI: 00000000c0186444 RDI: 000000000000001e
dec 31 15:01:22 tom-pc kernel: RBP: 00007fc349c20e00 R08: 00007fc349c20d80 R09: 00007fc349c20cc0
dec 31 15:01:22 tom-pc kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 000000007cdf0a98
dec 31 15:01:22 tom-pc kernel: R13: 0000000000000001 R14: 00007fc349c20cf0 R15: 0000000000000000
dec 31 15:01:22 tom-pc kernel: Code: ff 74 16 f0 ff 0f 0f 88 3c d4 12 00 75 0b 89 04 24 e8 c8 44 0a e3 8b 04 24 48 8b 54 24 38 48 8b 5c 24 08 48 89 13 e9 0b fd
dec 31 15:01:22 tom-pc kernel: ---[ end trace 425bb209c57fc66b ]---
dec 31 15:01:32 tom-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=53896, last emitted seq=53898
dec 31 15:01:32 tom-pc kernel: [drm] No hardware hang detected. Did some blocks stall?
dec 31 15:01:35 tom-pc systemd-logind[561]: Power key pressed.
dec 31 15:01:35 tom-pc systemd-logind[561]: Powering Off...
dec 31 15:01:35 tom-pc systemd-logind[561]: System is powering down.
Comment 11 Christian König 2018-01-03 18:36:22 UTC
Code fix is now in amd-staging-drm-next
Comment 12 Vedran Miletić 2018-01-07 17:57:21 UTC
(In reply to Michel Dänzer from comment #7)
> (In reply to Vedran Miletić from comment #5)
> > I'm sorry, but I will not be able to bisect this. Checkouts of relevant
> > commits don't boot and simple reverts do apply cleanly, but don't compile.
> 
> FWIW, you may still be able to at least narrow things down with git bisect.
> If you can't test a selected commit, run "git bisect skip". That will select
> another commit to test. You can also manually check out another commit to
> test. In the worst case, the bisection process will end with identifying the
> minimal set of candidates instead of a single commit.

Thanks for the suggestion. Tried that and didn't get anywhere (all the relevant commits were broken in one way or another).

(In reply to Christian König from comment #11)
> Code fix is now in amd-staging-drm-next

Verified as fixed. (Would have checked earlier, but was away from the computer with Vega.)
Comment 13 Peter Klotz 2018-08-09 06:46:09 UTC
Sorry to post into this already closed bug.

Should this issue be fixed in 4.17.12?

I am asking because I see sporadic system hangs that start with these messages:

Aug 09 08:20:18 thinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=2260291, last emitted seq=2260293
Aug 09 08:20:18 thinkpad kernel: [drm] No hardware hang detected. Did some blocks stall?
Aug 09 08:20:35 thinkpad kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kwin_x11:915]


Sounds similar to this bug.
Comment 14 dallase 2018-10-08 13:14:41 UTC
My Radeon Pro Duo (polaris) is experiencing ring sdma0 timeouts when trying to move to newer kernels.  I’m running
a custom build of 4.17.0-rc2-180424-fkxamd (from ROCm Kernel https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/fkxamd/drm-next-wip) without issues.

When I build either of these kernels, the card gets ring timeouts on boot. I tried both amdgpu-pro 18.20 and 18.30 for userland; it made no difference.


amd-staging-drm-next (built Oct 7 2018)

[   61.701281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=888, emitted seq=890
[   61.701285] [drm] GPU recovery disabled.
[   61.701397] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=902, emitted seq=904
[   61.701399] [drm] GPU recovery disabled.

drm-next-4.20-wip (built Oct 8 2018)

[   60.840847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=914, emitted seq=916
[   60.840851] [drm] GPU recovery disabled.
[   60.840962] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=907, emitted seq=909
[   60.840964] [drm] GPU recovery disabled.



Both of these kernels work fine on my Vega 56 and Vega 64s; only the Pro Duo has the ring timeouts.
Comment 15 Michel Dänzer 2018-10-08 14:45:37 UTC
(In reply to dallase from comment #14)
> My Radeon Pro Duo (polaris) is experiencing ring sdma0 timeouts when trying
> to move to newer kernels.

Please file your own report. Per comment 12, the issue this report is about is fixed.
