Description
mikhail.v.gavrilov
2017-11-30 20:53:16 UTC
Does "it occurs on latest staging kernel" mean it doesn't happen with an earlier staging kernel or with another kernel version? If so, can you provide more details about what kernels it doesn't happen with?
Earlier kernels don't support the Vega GPU, so I can't recheck it with an earlier kernel, which works fine with the iGPU on the same machine.
Created attachment 135984 [details]
dmesg
Created attachment 136012 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Created attachment 136036 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Created attachment 136200 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
With the latest build, this message appears in dmesg when the hang occurs again:
[ 341.475043] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=110200, last emitted seq=110202
[ 341.475059] [drm] No hardware hang detected. Did some blocks stall?
Created attachment 136346 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
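
For readers comparing several dmesg captures, the `ring gfx timeout` line quoted above can be picked apart mechanically. The following is only a minimal sketch: the function name `parse_ring_timeout` and the `pending` field are illustrative assumptions, and the pattern is written against the exact message format shown in this report, which may differ in other kernel versions.

```python
import re

# Pattern for the amdgpu job-timeout line in the form quoted in this report.
# Other kernel versions may word the message differently.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and sequence numbers from a timeout line, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Sequence numbers emitted but not yet signaled (hypothetical field name).
        "pending": emitted - signaled,
    }


sample = (
    "[ 341.475043] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=110200, last emitted seq=110202"
)
print(parse_ring_timeout(sample))
```

For the line above, the gap between the last emitted and the last signaled sequence number is 2, which suggests that two submissions on the gfx ring had not completed when the timeout fired.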
Created attachment 136517 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
Yeah, I enabled more error messages on amd-staging-drm-next. But please don't change the bug subject to something less descriptive.
(In reply to Christian König from comment #10)
> Yeah, I enabled more error messages on amd-staging-drm-next.
But is it still not enough to understand the root cause?
Can you also rebase amd-staging-drm-next to RC5 with the KPTI patch enabled? I do not want to sit on a vulnerable kernel. The default kernel shipped in Fedora is already patched but does not have AMD Vega support.
Created attachment 136579 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next
(In reply to mikhail.v.gavrilov from comment #11)
> (In reply to Christian König from comment #10)
> > Yeah, I enabled more error messages on amd-staging-drm-next.
> But it still not enough for understand root cause?
>
>
> Can you also rebase amd-staging-drm-next to RC5 with enabled KPTI patch? I
> do not want to sit on a vulnerable kernel. The default shipped kernel in
> Fedora already patched but not having AMD Vega support.
Christian and Alex, when you do this (rebase to RCx with the KPTI patch enabled), please rebase at least to RC7 (which does NOT penalize AMD chips), even though I'm on an Intel Xeon currently... ;-)
Created attachment 136599 [details]
dmesg with 4.15.0-rc2 amd-staging-drm-next with SysRq : Show State
Created attachment 136809 [details]
dmesg with 4.15.0-rc4 amd-staging-drm-next with SysRq : Show State
Created attachment 136836 [details]
dmesg with 4.15.0-rc4 amd-staging-drm-next e6555e61902c with SysRq : Show State
Please stop attaching more and more dmesg with unrelated information to the bug report. The initial one is perfectly sufficient.
I am sorry for the misunderstanding. Every time I see new commits in the branch, I hope that this issue has been fixed, so I rebuild the kernel for testing, and every time I reproduce this annoying bug again. I still hope that somebody is working on it and improving the logging so that the root cause of this hang can be understood, which is why I attach a new dmesg log each time.
Is anybody investigating this bug? It is not necessary to watch a video to make the computer hang. It hangs right after starting the Steam client, or during a game if the computer has already been running for some time. I'm already tired of pressing the reset button, because "init 6" is not able to restart the computer after such a hang. Today alone I have pressed the reset button more than 30 times. But no one cares about it :(
Created attachment 137680 [details]
dmesg with 4.16.0-rc1 amd-staging-drm-next
Sadly still present in 4.16-rc1.
Found you another crash case:
The @GraphicsFuzz demo found 1 issue (14/15 tests passed) on my desktop device, affecting my @AMD GPU driver. Give it a try: www.graphicsfuzz.com/#demo #GraphicsFuzz
The computer always hangs on shader15.
Created attachment 137710 [details]
photo of test when computer is hang
(In reply to mikhail.v.gavrilov from comment #23)
> Found you another crash case:
That's unlikely to be the exact same cause as that of the Steam hang this report is about, so it needs to be tracked separately.
(In reply to Michel Dänzer from comment #25)
> That's unlikely to be the exact same cause as that of the Steam hang this
> report is about, so it needs to be tracked separately.
Ok, https://bugs.freedesktop.org/show_bug.cgi?id=105317
I run into this issue regularly with an AMD Ryzen 5 2400G (primary display, connected via DP to the monitor) and an AMD Radeon RX 560 (not connected to a monitor, secondary display according to mainboard firmware configuration). After using my computer for some time, the graphics suddenly freeze and I see lines like the following in dmesg (after logging in via SSH):
[Fri Mar 2 21:05:33 2018] amdgpu: [powerplay] pp_dpm_get_temperature was not implemented.
[Fri Mar 2 21:06:03 2018] INFO: task X:898 blocked for more than 120 seconds.
[Fri Mar 2 21:06:03 2018] Tainted: G W 4.15.7-gentoo-r1 #2
[Fri Mar 2 21:06:03 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 2 21:06:03 2018] X D 0 898 881 0x00000004
[Fri Mar 2 21:06:03 2018] Call Trace:
[Fri Mar 2 21:06:03 2018] ? __schedule+0x2a7/0x8b0
[Fri Mar 2 21:06:03 2018] schedule+0x28/0x80
[Fri Mar 2 21:06:03 2018] schedule_preempt_disabled+0xa/0x10
[Fri Mar 2 21:06:03 2018] __ww_mutex_lock.isra.3+0x224/0x690
[Fri Mar 2 21:06:03 2018] ? drm_modeset_backoff+0x3e/0xb0 [drm]
[Fri Mar 2 21:06:03 2018] drm_modeset_backoff+0x3e/0xb0 [drm]
[Fri Mar 2 21:06:03 2018] drm_mode_gamma_set_ioctl+0xb4/0x200 [drm]
[Fri Mar 2 21:06:03 2018] ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
[Fri Mar 2 21:06:03 2018] drm_ioctl_kernel+0x5b/0xb0 [drm]
[Fri Mar 2 21:06:03 2018] drm_ioctl+0x2d5/0x370 [drm]
[Fri Mar 2 21:06:03 2018] ? drm_mode_crtc_set_gamma_size+0xa0/0xa0 [drm]
[Fri Mar 2 21:06:03 2018] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[Fri Mar 2 21:06:03 2018] do_vfs_ioctl+0xa4/0x670
[Fri Mar 2 21:06:03 2018] ? __sys_recvmsg+0x64/0xa0
[Fri Mar 2 21:06:03 2018] ? __sys_recvmsg+0x95/0xa0
[Fri Mar 2 21:06:03 2018] SyS_ioctl+0x74/0x80
[Fri Mar 2 21:06:03 2018] do_syscall_64+0x6e/0x120
[Fri Mar 2 21:06:03 2018] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Fri Mar 2 21:06:03 2018] RIP: 0033:0x7fd8924c0467
[Fri Mar 2 21:06:03 2018] RSP: 002b:00007ffcb17d7b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Fri Mar 2 21:06:03 2018] RAX: ffffffffffffffda RBX: 0000560ddb4480e0 RCX: 00007fd8924c0467
[Fri Mar 2 21:06:03 2018] RDX: 00007ffcb17d7b40 RSI: 00000000c02064a5 RDI: 0000000000000016
[Fri Mar 2 21:06:03 2018] RBP: 00007ffcb17d7b40 R08: 0000560ddb4487a0 R09: 0000560ddb4489a0
[Fri Mar 2 21:06:03 2018] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000c02064a5
[Fri Mar 2 21:06:03 2018] R13: 0000000000000016 R14: 0000560ddb448bb0 R15: 0000560ddb4485a0
[Fri Mar 2 21:06:03 2018] INFO: task kworker/u32:2:32344 blocked for more than 120 seconds.
[Fri Mar 2 21:06:03 2018] Tainted: G W 4.15.7-gentoo-r1 #2
[Fri Mar 2 21:06:03 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 2 21:06:03 2018] kworker/u32:2 D 0 32344 2 0x80000000
[Fri Mar 2 21:06:03 2018] Workqueue: events_unbound commit_work [drm_kms_helper]
[Fri Mar 2 21:06:03 2018] Call Trace:
[Fri Mar 2 21:06:03 2018] ? __schedule+0x2a7/0x8b0
[Fri Mar 2 21:06:03 2018] schedule+0x28/0x80
[Fri Mar 2 21:06:03 2018] schedule_timeout+0x1e7/0x370
[Fri Mar 2 21:06:03 2018] ? generic_reg_get+0x21/0x30 [amdgpu]
[Fri Mar 2 21:06:03 2018] dma_fence_default_wait+0x1f0/0x280
[Fri Mar 2 21:06:03 2018] ? dma_fence_release+0x90/0x90
[Fri Mar 2 21:06:03 2018] dma_fence_wait_timeout+0x39/0xf0
[Fri Mar 2 21:06:03 2018] reservation_object_wait_timeout_rcu+0x17b/0x370
[Fri Mar 2 21:06:03 2018] amdgpu_dm_do_flip+0x11f/0x360 [amdgpu]
[Fri Mar 2 21:06:03 2018] amdgpu_dm_atomic_commit_tail+0x8a1/0x9a0 [amdgpu]
[Fri Mar 2 21:06:03 2018] ? _cond_resched+0x15/0x40
[Fri Mar 2 21:06:03 2018] ? wait_for_completion_timeout+0x35/0x180
[Fri Mar 2 21:06:03 2018] commit_tail+0x3d/0x70 [drm_kms_helper]
[Fri Mar 2 21:06:03 2018] process_one_work+0x1da/0x3d0
[Fri Mar 2 21:06:03 2018] worker_thread+0x2b/0x3f0
[Fri Mar 2 21:06:03 2018] ? process_one_work+0x3d0/0x3d0
[Fri Mar 2 21:06:03 2018] kthread+0x113/0x130
[Fri Mar 2 21:06:03 2018] ? kthread_create_worker_on_cpu+0x70/0x70
[Fri Mar 2 21:06:03 2018] ? SyS_exit_group+0x10/0x10
[Fri Mar 2 21:06:03 2018] ret_from_fork+0x22/0x40
[Fri Mar 2 21:06:33 2018] i2c /dev entries driver
Everything apart from the graphics appears to continue to run fine, except any application (e.g. started on the command line) that tries to talk to the X server: They will hang. Most applications that hang can be killed with SIGKILL, except the X server and a few others, which will be zombies forever.
Linux kernel is at 4.15.7-gentoo-r1, LLVM at 5.0.1, Mesa at 18.0.0_rc4.
[ 69.089101] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=897, last emitted seq=899
[ 69.089176] [drm] No hardware hang detected. Did some blocks stall?
[ 85.813890] sysrq: SysRq : Show Blocked State
[ 85.813982] task PC stack pid father
[ 85.814019] kworker/u16:4 D14104 146 2 0x80000000
[ 85.814055] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 85.814058] Call Trace:
[ 85.814064] ? __schedule+0x2ed/0xba0
[ 85.814070] ? dma_fence_default_wait+0x14f/0x370
[ 85.814073] schedule+0x2f/0x90
[ 85.814076] schedule_timeout+0x23d/0x540
[ 85.814079] ? find_held_lock+0x34/0xa0
[ 85.814084] ? mark_held_locks+0x56/0x80
[ 85.814087] ? _raw_spin_unlock_irqrestore+0x32/0x60
[ 85.814091] ? dma_fence_default_wait+0x14f/0x370
[ 85.814094] dma_fence_default_wait+0x23b/0x370
[ 85.814097] ? dma_fence_release+0x170/0x170
[ 85.814101] dma_fence_wait_timeout+0x4f/0x270
[ 85.814105] reservation_object_wait_timeout_rcu+0x193/0x4d0
[ 85.814148] amdgpu_dm_do_flip+0x112/0x350 [amdgpu]
[ 85.814188] amdgpu_dm_atomic_commit_tail+0xb66/0xdc0 [amdgpu]
[ 85.814194] ? wait_for_completion_timeout+0x76/0x1b0
[ 85.814206] commit_tail+0x3d/0x70 [drm_kms_helper]
[ 85.814211] process_one_work+0x266/0x6b0
[ 85.814218] worker_thread+0x3a/0x390
[ 85.814222] ? process_one_work+0x6b0/0x6b0
[ 85.814225] kthread+0x121/0x140
[ 85.814228] ? kthread_create_worker_on_cpu+0x70/0x70
[ 85.814231] ret_from_fork+0x3a/0x50
[ 85.814391] tracker-store D12184 2786 2167 0x00000000
[ 85.814395] Call Trace:
[ 85.814400] ? __schedule+0x2ed/0xba0
[ 85.814406] schedule+0x2f/0x90
[ 85.814409] io_schedule+0x12/0x40
[ 85.814413] generic_file_read_iter+0x39e/0xdb0
[ 85.814420] ? page_cache_tree_insert+0x130/0x130
[ 85.814474] xfs_file_buffered_aio_read+0x65/0x1a0 [xfs]
[ 85.814498] xfs_file_read_iter+0x64/0xc0 [xfs]
[ 85.814504] __vfs_read+0x102/0x170
[ 85.814511] vfs_read+0x9e/0x150
[ 85.814515] SyS_pread64+0x93/0xb0
[ 85.814518] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 85.814523] do_syscall_64+0x79/0x220
[ 85.814526] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 85.814528] RIP: 0033:0x7ff185a6a873
[ 85.814530] RSP: 002b:00007ffe0e646780 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[ 85.814533] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007ff185a6a873
[ 85.814535] RDX: 0000000000001000 RSI: 00005613cc6fbe48 RDI: 0000000000000008
[ 85.814536] RBP: 0000000000001000 R08: 00005613cc6fbe48 R09: 000000000ff80fff
[ 85.814538] R10: 000000001982a000 R11: 0000000000000293 R12: 0000000000000000
[ 85.814539] R13: 00005613cc6fbe48 R14: 000000001982a000 R15: 00005613cc446580
[ 85.814601] gldriverquery D12856 4120 4072 0xa0020002
[ 85.814606] Call Trace:
[ 85.814611] ? __schedule+0x2ed/0xba0
[ 85.814617] schedule+0x2f/0x90
[ 85.814621] drm_sched_entity_fini+0xbe/0x2b0 [gpu_sched]
[ 85.814626] ? finish_wait+0x80/0x80
[ 85.814649] amdgpu_ctx_fini+0xbf/0x100 [amdgpu]
[ 85.814672] amdgpu_ctx_mgr_fini+0x7c/0xc0 [amdgpu]
[ 85.814692] amdgpu_driver_postclose_kms+0x57/0x220 [amdgpu]
[ 85.814708] drm_release+0x2a0/0x3c0 [drm]
[ 85.814714] __fput+0xe9/0x200
[ 85.814719] task_work_run+0x87/0xb0
[ 85.814723] do_exit+0x345/0xd70
[ 85.814727] ? up_read+0x1c/0x40
[ 85.814730] ? __do_page_fault+0x2af/0x530
[ 85.814735] do_group_exit+0x47/0xc0
[ 85.814738] SyS_exit_group+0x10/0x10
[ 85.814740] do_fast_syscall_32+0xbf/0x376
[ 85.814744] entry_SYSENTER_compat+0x84/0x96
This might fix it: https://cgit.freedesktop.org/mesa/mesa/commit/?id=d15fb766aa3c98ffbe16d050b2af4804e4b12c57
(In reply to Marek Olšák from comment #30)
> This might fix it:
> https://cgit.freedesktop.org/mesa/mesa/commit/?id=d15fb766aa3c98ffbe16d050b2af4804e4b12c57
Which Mesa version is this patch for? My si_pipe.c (Mesa 18.0.1) looks different: https://imgur.com/a/dc3RoHi
The patch is already backported in the 18.0 branch: https://cgit.freedesktop.org/mesa/mesa/log/?h=18.0
(In reply to Marek Olšák from comment #32)
> The patch is already backported in the 18.0 branch:
> https://cgit.freedesktop.org/mesa/mesa/log/?h=18.0
How can I be sure that the patch is already applied in my Mesa?
Created attachment 139363 [details]
my si_pipe.c file
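
One way to approach the question of whether a backported patch is present in a stable branch is to search that branch's commit log for a cherry-pick trailer naming the upstream commit. The sketch below is not an official Mesa tool: the helper name `has_backport` is hypothetical, and it assumes the stable branch was populated with `git cherry-pick -x`, which appends a "(cherry picked from commit ...)" line to each backported commit message. The sample log text is made up for illustration.

```python
import re


def has_backport(log_text, upstream_id):
    """Return True if log_text records a cherry-pick of upstream_id.

    Assumption: backports were created with `git cherry-pick -x`, so the
    commit message contains "(cherry picked from commit <full hash>)".
    """
    # Match on a 12-character prefix so abbreviated IDs also work.
    prefix = upstream_id[:12]
    pattern = re.compile(r"cherry picked from commit\s+" + re.escape(prefix))
    return pattern.search(log_text) is not None


# Hypothetical excerpt of `git log` output, for illustration only.
sample_log = """commit 0000000000000000000000000000000000000000

    placeholder subject line

    (cherry picked from commit d15fb766aa3c98ffbe16d050b2af4804e4b12c57)
"""

print(has_backport(sample_log, "d15fb766aa3c98ffbe16d050b2af4804e4b12c57"))
```

In practice the text passed in would be the output of `git log` on the 18.0 branch of a Mesa checkout. Note that this only says something about the source tree; whether the installed Mesa build actually contains the change still depends on which commit the distribution package was built from.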
It looks like my si_pipe.c is already patched. But my GPU still hangs whenever I try to get past one particular place in the game Rise of the Tomb Raider.
Created attachment 139364 [details]
Here GPU VEGA always hungs
Rise of Tomb Raider is a Vulkan game. You can file a RADV bug for it. I'm closing this since you are not reporting any issues with Youtube.
(In reply to Marek Olšák from comment #37)
> Rise of Tomb Raider is a Vulkan game. You can file a RADV bug for it. I'm
> closing this since you are not reporting any issues with Youtube.
My bug report about the game Rise of the Tomb Raider was unfortunately closed without any explanation of which patches are needed to fix the hang: https://bugs.freedesktop.org/show_bug.cgi?id=106196
And the symptoms of the problem are the same as in this bug report:
- The system stops responding.
- All the LEDs on the video card that show power consumption light up.
- The fan on the video card starts to make a lot of noise.
A long time ago the Intel driver used to hang too: https://bugs.freedesktop.org/show_bug.cgi?id=54226 but the Intel developers added a GPU reset for such situations. Why not add a GPU reset for AMD GPUs as well?
We are working on the GPU reset, we just don't have any ETA.
The GPU reset is something you can't rely on to save you. In most cases, a successful GPU reset needs a complete restart of X or Wayland, so you'll lose the whole desktop and all running desktop applications. You are likely to get into an infinite GPU-hang+GPU-reset loop if the driver doesn't kill all apps. With the current Linux desktop architecture that isn't aware of GPU resets, a GPU reset is mostly unusable. The implementation of the GPU reset is secondary to making sure that GPU hangs don't occur. Thus, bugs about GPU hangs are only about fixing GPU hangs.
Rise Of The Tomb Raider is a Vulkan game, so any hangs within the game are RADV bugs. Filing a bug against DRM/AMDgpu for a GPU hang within that game is less effective than filing a bug against RADV.
Ok, now video playback results in a hang even without the Steam client if VAAPI is used: https://bugs.freedesktop.org/show_bug.cgi?id=106430