Summary: ring gfx timeout during particular shader generation on yuzu emulator
Product: Mesa
Reporter: e88z4 <felix.adrianto>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact: Default DRI bug account <dri-devel>
Severity: major
Priority: medium
CC: glencoesmith, saliksaeed
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments: freeze-screenshot, dmesg, apitrace log
Description
e88z4
2018-12-12 03:31:03 UTC
This is a revision of my initial comment. I was able to reproduce the bug once while replaying the trace, with the same error in the dmesg output, but I am currently no longer able to reproduce it reliably while replaying the trace, which is very odd: out of 10 replays of the trace, I reproduced the bug only twice.

[  540.189992] amdgpu 0000:23:00.0: GPU fault detected: 147 0x09088402 for process glretrace pid 2006 thread glretrace:cs0 pid 2007
[  540.190000] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00001121
[  540.190004] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E084002
[  540.190010] amdgpu 0000:23:00.0: VM fault (0x02, vmid 7, pasid 32771) at page 4385, read from 'TC7' (0x54433700) (132)
[  550.275970] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=20845, emitted seq=20848

The error message in dmesg is identical when reproducing the bug from apitrace. I did a little more troubleshooting on my apitrace: call #36840880 is the culprit of the VM page fault error. The GPU does not crash right away, but continuing 100-200 calls past #36840880 can randomly trigger the ring gfx timeout error.
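For reference, a minimal sketch of how that call range can be re-examined with the apitrace tools; the trace file name is a placeholder, and the call numbers are taken from the description above:

    # replay the whole trace (intermittently reproduces the VM fault / ring gfx timeout)
    glretrace yuzu.trace

    # dump the GL calls around the suspected call #36840880
    apitrace dump --calls=36840780-36840980 yuzu.trace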
Created attachment 142865 [details]
freeze-screenshot
Created attachment 142866 [details]
dmesg
Same problem here on an R9 Fury, Mesa 18.3. Complete freeze in various games in yuzu (easy to reproduce: Super Mario Odyssey). I attached a screenshot and the dmesg log obtained via SSH.

I can reproduce the error during the FMV as well. You can quickly bypass this FMV by skipping it before the error occurs (I couldn't remember how to trigger the skip selection on screen). I am wondering if you would also get the crash when Mario is about to jump for the first time after being woken up by Cappy?

Hi, this bug still occurs with kernel 5.0-rc4 and Mesa 19.0-devel (commit 932ed9c00b99e6ec92146ec9e820f546cf3e6551). There are more error logs in dmesg, like below.

[ 1132.040752] amdgpu 0000:23:00.0: GPU fault detected: 147 0x0a808802 for process yuzu pid 19573 thread yuzu:cs0 pid 19577
[ 1132.040761] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0003F350
[ 1132.040766] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C088002
[ 1132.040773] amdgpu 0000:23:00.0: VM fault (0x02, vmid 6, pasid 32770) at page 258896, read from 'TC6' (0x54433600) (136)
[ 1142.103758] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=50481, emitted seq=50484
[ 1142.103820] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process yuzu pid 19573 thread yuzu:cs0 pid 19577
[ 1142.103826] amdgpu 0000:23:00.0: GPU reset begin!
[ 1149.487464] WARNING: CPU: 9 PID: 19117 at kernel/kthread.c:501 kthread_park+0x20/0x67
[ 1149.487467] Modules linked in: hid_switchcon hidp tda18271 s5h1411 rfcomm fuse cmac bnep binfmt_misc amdgpu mfd_core kvm irqbypass ghash_clmulni_intel btusb btrtl btbcm btintel bluetooth snd_hda_codec_realtek snd_hda_codec_generic snd_usb_audio snd_usbmidi_lib snd_hda_codec_hdmi sha256_generic snd_rawmidi snd_pcsp chash gpu_sched saa7164 xpad snd_hda_intel tveeprom drm_kms_helper joydev evdev aesni_intel syscopyarea dvb_core sysfillrect snd_hda_codec sysimgblt fb_sys_fops aes_x86_64 snd_hwdep ttm crypto_simd snd_hda_core cryptd ff_memless drbg glue_helper snd_pcm drm snd_timer i2c_algo_bit videodev snd ecdh_generic sp5100_tco soundcore k10temp media sg 8250 8250_base serial_core button acpi_cpufreq nct6775 hwmon_vid autofs4 ext4 crc16 mbcache jbd2 sd_mod hid_generic usbhid hid i2c_piix4 ahci xhci_pci crc32c_intel i2c_core libahci xhci_hcd r8169 libata realtek libphy usbcore scsi_mod usb_common
[ 1149.487525] CPU: 9 PID: 19117 Comm: MemoryInfra Not tainted 5.0.0-rc4-htpc+ #3
[ 1149.487526] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P2.00 12/18/2018
[ 1149.487530] RIP: 0010:kthread_park+0x20/0x67
[ 1149.487533] Code: e8 09 ff ff ff 48 89 c7 eb 8f 55 53 e8 fd fe ff ff f6 47 24 04 74 09 0f 0b b8 da ff ff ff eb 4e 48 89 c3 48 8b 00 a8 04 74 09 <0f> 0b b8 f0 ff ff ff eb 3b 48 89 fd f0 80 0b 04 65 48 8b 04 25 40
[ 1149.487534] RSP: 0018:ffffc900030d7bc0 EFLAGS: 00010202
[ 1149.487537] RAX: 0000000000000004 RBX: ffff88840868f680 RCX: 0000000000000000
[ 1149.487538] RDX: ffff8883d0794330 RSI: ffff88827243f330 RDI: ffff88840732ee80
[ 1149.487540] RBP: ffff888407ad7398 R08: 0000000000000001 R09: ffff888288facd80
[ 1149.487541] R10: 0000000000000040 R11: ffffc900030d7bc0 R12: 000000000000000d
[ 1149.487543] R13: 0000000000000006 R14: ffff88840b01d000 R15: ffff88840b01d0f0
[ 1149.487545] FS: 00007f96304a6700(0000) GS:ffff88841ee40000(0000) knlGS:0000000000000000
[ 1149.487547] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1149.487548] CR2: 00007f229032b000 CR3: 0000000407b90000 CR4: 00000000003406e0
[ 1149.487549] Call Trace:
[ 1149.487559]  drm_sched_entity_fini+0x4c/0xeb [gpu_sched]
[ 1149.487612]  amdgpu_ctx_mgr_entity_fini+0x80/0xa8 [amdgpu]
[ 1149.487662]  amdgpu_ctx_mgr_fini+0x22/0x8b [amdgpu]
[ 1149.487706]  amdgpu_driver_postclose_kms+0x188/0x242 [amdgpu]
[ 1149.487717]  drm_file_free+0x1e6/0x22d [drm]
[ 1149.487728]  drm_release+0x75/0xdf [drm]
[ 1149.487733]  __fput+0xe9/0x18f
[ 1149.487737]  task_work_run+0x68/0x7c
[ 1149.487741]  do_exit+0x450/0x917
[ 1149.487744]  do_group_exit+0x95/0x95
[ 1149.487748]  get_signal+0x439/0x461
[ 1149.487752]  do_signal+0x31/0x4cc
[ 1149.487757]  prepare_exit_to_usermode+0x8b/0xf5
[ 1149.487762]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1149.487764] RIP: 0033:0x7f964651ffac
[ 1149.487772] Code: Bad RIP value.
[ 1149.487773] RSP: 002b:00007f96304a5640 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 1149.487776] RAX: fffffffffffffe00 RBX: 00007f96304a57b0 RCX: 00007f964651ffac
[ 1149.487777] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f96304a57d8
[ 1149.487778] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fff49705080
[ 1149.487779] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1149.487781] R13: 00007f96304a5788 R14: 0000000000000000 R15: 00007f96304a57d8
[ 1149.487784] ---[ end trace 540c7632a8289097 ]---
[ 1152.333861] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out

If you need more clarification or detail, please let me know.

I tried to enable NIR using R600_DEBUG=nir and got a different behaviour on the GFX timeout; the resulting dmesg output is below.
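A hedged sketch of how that variable is typically set at launch (the executable name is a placeholder; I believe newer Mesa releases also read AMD_DEBUG for the same flags, but treat that as an assumption):

    # run yuzu with the radeonsi NIR backend enabled
    R600_DEBUG=nir yuzu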
[Feb 1 19:44] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Feb 1 19:45] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240145] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240170] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.239953] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240061] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240340] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[Feb 1 19:46] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240003] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240173] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.240074] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ +10.239890] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I get a looping error message like the above, but I can kill the application from SSH. The display is once again usable and no hard reset is required.

*** Bug 109870 has been marked as a duplicate of this bug. ***

Can you reproduce the bug on 18.3.5?

I have been building mesa master weekly since November 2018. This bug can be reproduced with the latest Mesa from the git master branch. Do you want me to try to build 18.3.5 and reproduce it?

(In reply to e88z4 from comment #12)
> I usually build mesa master weekly since November 2018. This bug can be
> reproduced with latest Mesa from git master branch.
>
> Do you want me to try to build 18.3.5 and reproduce it?

No, I was just curious if the bug showed up in 18.3.5.

I tried to reproduce this bug again with the latest mesa master (659aa3dd651). My kernel is 5.1.3 and yuzu-canary #2318. The behaviour of the softlock has changed slightly: yuzu no longer crashes, but rendering is super slow, and I was able to kill the application without using SSH. The following errors were produced in dmesg.

[ 5179.131132] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5183.739102] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5189.883024] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5193.979037] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5204.218939] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5214.458837] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5224.698730] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5229.818653] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5234.938622] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5234.938680] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5241.850516] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5245.178534] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 5251.322426] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5255.418427] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Maybe related to 110472

(In reply to Timothy Arceri from comment #15)
> Maybe related to 110472

Sorry, bug #110472

(In reply to Timothy Arceri from comment #16)
> (In reply to Timothy Arceri from comment #15)
> > Maybe related to 110472
>
> Sorry, bug #110472

It definitely is, as it occurs with the same drivers and the exact same program, and causes the same crash.

(In reply to e88z4 from comment #14)
> I tried to reproduce this bug again with latest mesa master (659aa3dd651).

Yay, progress: https://github.com/yuzu-emu/yuzu/issues/2523

WOOOOOOO! This is getting a yuzu-side fix!
Hi Timothy, I have tried your suggestion from the other bug report, https://bugs.freedesktop.org/show_bug.cgi?id=110472. You suggested adding the environment variable allow_glsl_extension_directive_midshader=true; this does not resolve the issue. I also noticed that the yuzu developers followed your suggestion from the bug report https://github.com/yuzu-emu/yuzu/issues/2523. I compiled the latest source code with the merged change, but the issue is still not resolved. One thing I noticed is that the "ring gfx timeout" error message is no longer produced. Instead, the error messages below are produced continuously until I kill the application.

May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: amdgpu_cs_query_fence_status failed.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: amdgpu_cs_query_fence_status failed.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:21 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:25 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:25 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.
May 27 18:53:29 htpc org.gnome.Shell.desktop[1578]: amdgpu: The CS has been cancelled because the context is lost.

I will produce the apitrace shortly and post it in a shared location.

Hi, I uploaded the apitrace to my Google Drive:
https://drive.google.com/file/d/1hg1ep4rJ2Y_g7WmA_HbmNndrHd-x7Etc/view?usp=sharing

When I replayed the trace, I was able to reproduce the error. My system is:
AMD Ryzen 2600
Radeon RX 580
Mesa master 659aa3dd651
Kernel 5.1.3

Let me know if you need anything else.
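For completeness, a hedged sketch of how the workaround and the trace capture mentioned in this comment would typically be invoked from a shell; the executable name is a placeholder, and the option name comes from bug #110472:

    # launch with the suggested driconf override set via the environment (did not help here)
    allow_glsl_extension_directive_midshader=true yuzu

    # record a new GL trace of the failing run so it can be attached or shared
    apitrace trace --api gl yuzu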
Created attachment 144472 [details]
apitrace log
Hi,
I attached the apitrace log that was written to my standard output while replaying the apitrace file from my Google Drive.
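A minimal sketch of how such a replay log is typically captured, assuming the trace file downloaded from the Google Drive link is named yuzu.trace:

    # replay the trace and save everything printed to stdout/stderr into a log file
    glretrace yuzu.trace 2>&1 | tee apitrace-replay.log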
Hi, is there an update on this ticket? The issue can still easily be replicated with the latest Mesa on the radeonsi driver.

System information:
Linux 5.2.2
Mesa master
Radeon RX 580 Series (POLARIS10, DRM 3.32.0, 5.2.2-htpc, LLVM 10.0.0)

I have tried different combinations of the LLVM backend, such as llvm-8, llvm-9, and llvm-10; they all produce the same error. Could this be a bug in the AMD LLVM backend? I previously provided the apitrace; I hope the trace can be used to debug this issue. Below is the latest error from today; it is the same error as before.

[Jul31 16:10] amdgpu 0000:23:00.0: GPU fault detected: 147 0x0d604401 for process yuzu pid 15061 thread yuzu:cs0 pid 15068
[  +0.000004] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0C03F7AC
[  +0.000001] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E044001
[  +0.000002] amdgpu 0000:23:00.0: VM fault (0x01, vmid 7, pasid 32770) at page 201586604, read from 'TC5' (0x54433500) (68)
[  +5.312442] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  +4.874028] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=24383, emitted seq=24385
[  +0.000066] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process yuzu pid 15061 thread yuzu:cs0 pid 15068
[  +0.000006] amdgpu 0000:23:00.0: GPU reset begin!
[  +0.584424] cp is busy, skip halt cp
[  +0.328786] rlc is busy, skip halt rlc
[  +0.001031] amdgpu 0000:23:00.0: GPU pci config reset
[  +0.463461] amdgpu 0000:23:00.0: GPU reset succeeded, trying to resume
[  +0.002152] [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
[  +0.000016] [drm] VRAM is lost due to GPU reset!
[  +0.087583] [drm] UVD and UVD ENC initialized successfully.
[  +0.099970] [drm] VCE initialized successfully.
[  +0.018504] amdgpu 0000:23:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[  +0.003336] amdgpu 0000:23:00.0: ib ring test failed (-110).

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1348.