Bug 106196

Summary: GPU randomly hangs while playing the game Rise of the Tomb Raider
Product: DRI
Reporter: mikhail.v.gavrilov
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: sonichedgehog_hyperblast00
Version: XOrg git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
  Description  Flags
  dmesg        none

Description mikhail.v.gavrilov 2018-04-23 20:28:10 UTC
Created attachment 139028
dmesg

* Fedora 28 - https://download.fedoraproject.org/pub/fedora/linux/releases/test/28_Beta/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-28_Beta-1.3.iso
* Latest system updates:
 - kernel 4.16.3
 - mesa 18.0.1
 - llvm 6.0.0
* Steam client version 1523923735

Steps to reproduce:
1) Play the game for several hours until the GPU hang occurs

Symptoms:
1. The system stops responding.
2. All the load-indicator LEDs on the video card light up.
3. The fan on the video card starts to make a lot of noise.

Kernel output after the GPU hang:
[10918.342576] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:7 pas_id:0)
[10918.342582] amdgpu 0000:07:00.0:   at page 0x00001891a90f0000 from 27
[10918.342585] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070113D
[10918.342591] amdgpu 0000:07:00.0: [gfxhub] VMC page fault (src_id:0 ring:158 vmid:7 pas_id:0)
[10918.342594] amdgpu 0000:07:00.0:   at page 0x00001891a90f0000 from 27
[10918.342597] amdgpu 0000:07:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[10928.687661] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1874360, last emitted seq=1874362
[10928.687666] [drm] No hardware hang detected. Did some blocks stall?
[11016.291301] sysrq: SysRq : Show Blocked State
[11016.291315]   task                        PC stack   pid father
[11016.291509] Xwayland        D10616  1956   1882 0x00000004
[11016.291522] Call Trace:
[11016.291541]  ? __schedule+0x2bd/0xb00
[11016.291555]  ? dma_fence_wait_any_timeout+0x264/0x2f0
[11016.291564]  schedule+0x2f/0x90
[11016.291573]  schedule_timeout+0x35c/0x520
[11016.291592]  ? dma_fence_wait_any_timeout+0x264/0x2f0
[11016.291602]  dma_fence_wait_any_timeout+0x230/0x2f0
[11016.291734]  amdgpu_sa_bo_new+0x444/0x510 [amdgpu]
[11016.291900]  amdgpu_ib_get+0x31/0x90 [amdgpu]
[11016.292048]  amdgpu_job_alloc_with_ib+0x46/0x80 [amdgpu]
[11016.292128]  amdgpu_map_buffer.isra.10+0xa3/0x1f0 [amdgpu]
[11016.292215]  amdgpu_ttm_copy_mem_to_mem+0x3c6/0x5d0 [amdgpu]
[11016.292305]  ? amdgpu_vm_bo_invalidate+0x3b/0x210 [amdgpu]
[11016.292385]  amdgpu_move_blit.constprop.13+0x82/0x110 [amdgpu]
[11016.292467]  amdgpu_bo_move+0x94/0x1c0 [amdgpu]
[11016.292486]  ttm_bo_handle_move_mem+0x10d/0x540 [ttm]
[11016.292509]  ? ttm_bo_evict+0x155/0x1e0 [ttm]
[11016.292530]  ? mutex_trylock+0xcd/0xe0
[11016.292552]  ? ttm_mem_evict_first+0x1cf/0x260 [ttm]
[11016.292574]  ? ttm_bo_mem_space+0x2da/0x4a0 [ttm]
[11016.292599]  ? ttm_bo_validate+0xe3/0x1a0 [ttm]
[11016.292612]  ? ttm_bo_init_reserved+0x40e/0x470 [ttm]
[11016.292628]  ? mutex_trylock+0xcd/0xe0
[11016.292645]  ? ttm_bo_init_reserved+0x42a/0x470 [ttm]
[11016.292723]  ? amdgpu_bo_do_create+0x1da/0x570 [amdgpu]
[11016.292799]  ? amdgpu_fill_buffer+0x320/0x320 [amdgpu]
[11016.292885]  ? amdgpu_bo_create+0x4f/0x2c0 [amdgpu]
[11016.292993]  ? amdgpu_gem_object_create+0x80/0x110 [amdgpu]
[11016.293075]  ? amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu]
[11016.293153]  ? amdgpu_gem_create_ioctl+0x1eb/0x2a0 [amdgpu]
[11016.293165]  ? __might_fault+0x3e/0x90
[11016.293244]  ? amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu]
[11016.293277]  ? drm_ioctl_kernel+0x5b/0xb0 [drm]
[11016.293308]  ? drm_ioctl+0x1c0/0x380 [drm]
[11016.293417]  ? amdgpu_gem_object_close+0x1e0/0x1e0 [amdgpu]
[11016.293529]  ? amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[11016.293544]  ? do_vfs_ioctl+0xa5/0x6d0
[11016.293559]  ? SyS_ioctl+0x74/0x80
[11016.293574]  ? do_syscall_64+0x79/0x210
[11016.293584]  ? entry_SYSCALL_64_after_hwframe+0x42/0xb7
[11016.294195] kworker/u16:2   D12472 10521      2 0x80000000
[11016.294228] Workqueue: events_unbound commit_work [drm_kms_helper]
[11016.294237] Call Trace:
[11016.294252]  ? __schedule+0x2bd/0xb00
[11016.294266]  ? dma_fence_default_wait+0x231/0x370
[11016.294278]  schedule+0x2f/0x90
[11016.294288]  schedule_timeout+0x35c/0x520
[11016.294301]  ? dma_fence_default_wait+0x72/0x370
[11016.294316]  ? dma_fence_default_wait+0x231/0x370
[11016.294325]  dma_fence_default_wait+0x25d/0x370
[11016.294334]  ? dma_fence_release+0x160/0x160
[11016.294347]  dma_fence_wait_timeout+0x4f/0x270
[11016.294358]  reservation_object_wait_timeout_rcu+0x236/0x4e0
[11016.294485]  amdgpu_dm_do_flip+0x112/0x360 [amdgpu]
[11016.294624]  amdgpu_dm_atomic_commit_tail+0xac7/0xda0 [amdgpu]
[11016.294640]  ? wait_for_completion_timeout+0x73/0x1a0
[11016.294673]  commit_tail+0x3d/0x70 [drm_kms_helper]
[11016.294685]  process_one_work+0x261/0x630
[11016.294703]  worker_thread+0x3a/0x390
[11016.294715]  ? process_one_work+0x630/0x630
[11016.294725]  kthread+0x120/0x140
[11016.294739]  ? kthread_create_worker_on_cpu+0x70/0x70
[11016.294750]  ret_from_fork+0x3a/0x50
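
For anyone triaging similar logs, here is a minimal sketch (not part of the original report) that pulls the VMC page fault and ring timeout lines out of dmesg text. The function name `summarize` and the regular expressions are my own assumptions, written against the exact message format shown above; other kernel versions may word these messages differently. The `pending` value is simply the last emitted seq minus the last signaled seq, which I read as the number of fences submitted but not yet signaled.

```python
import re

# Three lines copied verbatim from the kernel output in this report.
SAMPLE = (
    "[10918.342576] amdgpu 0000:07:00.0: [gfxhub] VMC page fault "
    "(src_id:0 ring:158 vmid:7 pas_id:0)\n"
    "[10918.342582] amdgpu 0000:07:00.0:   at page 0x00001891a90f0000 from 27\n"
    "[10928.687661] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=1874360, last emitted seq=1874362\n"
)

# Hypothetical patterns, matched only against the 4.16.3 format seen above.
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] amdgpu (?P<dev>\S+): \[(?P<hub>\w+)\] "
    r"VMC page fault \(src_id:(?P<src>\d+) ring:(?P<ring>\d+) vmid:(?P<vmid>\d+)"
)
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] \[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\w+) timeout, last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def summarize(log_text):
    """Collect page faults and ring timeouts from dmesg-style text."""
    faults, timeouts = [], []
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "ts": float(m["ts"]),
                "device": m["dev"],
                "hub": m["hub"],
                "ring": int(m["ring"]),
                "vmid": int(m["vmid"]),
            })
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            signaled = int(m["signaled"])
            emitted = int(m["emitted"])
            timeouts.append({
                "ts": float(m["ts"]),
                "ring": m["ring"],
                "signaled": signaled,
                "emitted": emitted,
                # Fences emitted but not yet signaled when the timeout fired.
                "pending": emitted - signaled,
            })
    return {"faults": faults, "timeouts": timeouts}


if __name__ == "__main__":
    result = summarize(SAMPLE)
    for fault in result["faults"]:
        print("page fault:", fault)
    for timeout in result["timeouts"]:
        print("ring timeout:", timeout)
```

To run it against a live system, pass the saved output of `dmesg` to `summarize` instead of `SAMPLE`.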

Comment 1 fin4478 2018-04-29 23:35:21 UTC
Mainline kernels ship an only partially implemented amdgpu driver; compare the diff column at kernel.org with this branch:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.18-wip

Fixed-release Mesa contains old and buggy code, so use a rolling-release repository. Fedora has one, and with Debian testing you can use the bionic Mesa build from the Oibaf PPA.

Use the drm-next-4.18-wip or amd-staging-drm-next kernel from the ~agd5f repository.

I am in the final chapter of Rise of the Tomb Raider, and the game works fine on my system:

xfce@ryzenpc:~$ inxi -bM
System:    Host: ryzenpc Kernel: 4.16.0-rc7+ x86_64 bits: 64 Desktop: Xfce 4.12.4 
           Distro: Debian GNU/Linux buster/sid 
Machine:   Type: Desktop Mobo: ASUSTeK model: PRIME B350M-K v: Rev X.0x serial: N/A UEFI: American Megatrends 
           v: 4008 date: 04/13/2018 
CPU:       6-Core: AMD Ryzen 5 1600 type: MT MCP speed: 2872 MHz 
Graphics:  Card-1: Advanced Micro Devices [AMD/ATI] Baffin [Polaris11] driver: amdgpu v: kernel 
           Display: x11 server: X.Org 1.19.6 driver: amdgpu,ati unloaded: fbdev,modesetting,radeon,vesa 
           resolution: 1920x1080~60Hz 
           OpenGL: renderer: Radeon RX 560 Series (POLARIS11 DRM 3.26.0 4.16.0-rc7+ LLVM 6.0.0) 
           v: 4.5 Mesa 18.2.0-devel 
Network:   Card-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 
Drives:    HDD Total Size: 238.47 GiB used: 92.00 GiB (38.6%) 
Info:      Processes: 236 Uptime: 46m Memory: 7.79 GiB used: 704.6 MiB (8.8%) Shell: bash inxi: 3.0.07
