Summary: | Unrecoverable GPU crash with DiRT 4 | ||
---|---|---|---|
Product: | Mesa | Reporter: | Thomas Rohloff <v10lator> |
Component: | Drivers/Vulkan/radeon | Assignee: | mesa-dev |
Status: | RESOLVED MOVED | QA Contact: | mesa-dev |
Severity: | normal | ||
Priority: | medium | CC: | f.pinamartins |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Attachments: | force wd switch on eop; RADV_DEBUG=info vulkaninfo; RADV_DEBUG=info vulkaninfo with workaround |
Description
Thomas Rohloff
2019-04-06 06:27:31 UTC
What preset and what llvm are you using?

(In reply to Samuel Pitoiset from comment #1)
> What preset and what llvm are you using?
With preset you mean the in-game graphics preset, right? If so I'm using Ultra (everything maxed out). I started with LLVM 7 but my last test (there's no log as it didn't differ) was with LLVM 8:
> OpenGL renderer string: Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.5, LLVM 8.0.0)
Should I try LLVM 9 (git)? Also note that it seems every update (be it mesa or LLVM) triggers the bug less frequently. So my last test with LLVM 8 needed around 6 hours to trigger it. That's of course highly subjective, and the freezing is random anyway.

If you need a way to reproduce: Drive a track but instead of continuing after finish just let the reply run (overnight).

(In reply to Thomas Rohloff from comment #2)
> instead of continuing after finish just let the reply run
I meant "replay", sorry.

I got a hang one time when starting a new race, I'm not sure how to reproduce it again. Like you said, it's random and that will be a huge pain to figure out.

(In reply to Samuel Pitoiset from comment #4)
> I got a hang one time when starting a new race, I'm not sure how to
> reproduce it again. Like you said, it's random and that will be a huge pain
> to figure out.
Sad to hear. In the meantime I tested LLVM git (6aaa856e179f33a05a47db3666821d2ae44c9afc) in combination with Mesa git (75a3dd97aa) and it froze with the same info in dmesg.

If you can find a good way for reproducing the issue that would be highly appreciated. You can also try different combinations of debug options, like: RADV_DEBUG=nodcc,nohiz,nofastclears,zerovram

Thanks for all the helpful advice. Sadly I still didn't find a way to reproduce it better. Anyway, I tried your env var for around 8 hours without a freeze. While that might be just luck I'll do more testing (just started a replay with only two of the four options to let it run overnight).
Hopefully it will pinpoint which of the four options does the trick, so expect a new reply in two to four days. :)

Very nice, thanks for your time!

Sorry to say but all of these combinations froze:
RADV_DEBUG=nodcc,nohiz
RADV_DEBUG=nofastclears
RADV_DEBUG=zerovram
So I'll re-run with RADV_DEBUG=nodcc,nohiz,nofastclears,zerovram to confirm it was just luck before. Not sure what to do after that to help pinpoint this, as the distribution I'm using has some problems with vktrace at the moment, and even if I were able to run it, it would most likely produce a trace way too large to upload.

(In reply to Thomas Rohloff from comment #9)
> So I'll re-run with RADV_DEBUG=nodcc,nohiz,nofastclears,zerovram to confirm
> it was just luck before.
And it froze, too.

Does the problem still happen if you try to downgrade your kernel?

The 4.19 and 4.20 series won't boot for me because of other bugs. I could try 4.18.12 but the FPS is very low with it (so it might take longer to trigger the bug) and I can't bisect (because, as said above, 4.19/4.20 won't boot for me).

Shortly after my last reply I started a test with 4.18.12. At the moment it's still not frozen but I want to give it more time (because of the low FPS).

Anyway, I saw somebody CCing this bug report. So @Riikka: Are you experiencing the same bug and if so would you mind sharing your system details (kernel, mesa and LLVM versions)? Also do you have the time and skills to test other kernel versions and maybe even bisect (wait until my test on 4.18.12 is finished before doing your own testing to not waste time and energy)?

(In reply to Thomas Rohloff from comment #13)
> Shortly after my last reply I started a test with 4.18.12. At the moment
> it's still not frozen but I want to give it more time (because of the low FPS).
>
> Anyway, I saw somebody CCing this bug report. So @Riikka: Are you
> experiencing the same bug and if so would you mind sharing your system
> details (kernel, mesa and LLVM versions)?
> Also do you have the time and
> skills to test other kernel versions and maybe even bisect (wait until my
> test on 4.18.12 is finished before doing your own testing to not waste time
> and energy)?
I knew I should've been sneakier... I don't have DiRT 4 (I think) so I can't say I have a crash when I play it. I am, however, using an old card because it doesn't randomly crash things and hang my system (sort of...??) with dmesg output like what's been posted all over this place.

I don't know how much of my story I should post here since it's mostly more of the same and this isn't necessarily the issue I had (discounting the particular application's involvement). Also I apparently didn't keep records nearly as much as I thought.

I can handle building a bunch of kernels (thanks to my lovely Ryzen CPU that only causes compiler segfaults sometimes (also I use Gentoo; please feel sorry for me now)) and should be able to handle some debugging work though I'm unfamiliar with graphics mechanisms at this level; it's more a matter of patience. My latest efforts resulted in a system so flaky I was thinking I'd need to replace the whole thing. I'm running on an old HD 6870 instead of my RX 580. It's not /physically/ hard to swap the cards, only emotionally :P

Shorter story: similar to several reports on this site: unpredictable hanging, familiar dmesg output. SSH works, cursor works, audio works. Tried many things (newer kernels (4.2x, 5.0.5, 5.0.7) (similar bugs keep getting fixed, just never my one :( ), Mesa (late 18s, 19.x), all with LLVM 7.0.1 it seems), no luck, just more pain with our friend "ring gfx timeout." Latest results were veeeerryy strange crashiness in programs but seemingly less of the hanging. Now using a cruddy old card instead because terminal-based roguelikes don't need frame rates, will be trying things again soon.

I added myself just to see if there's any progress on these issues or something particular I can try or do to help. Someone save us.
amdgpu is hurting our souls.

Also, I do want to bring up a couple of questions. What are the relevant debugging tools, and is there any way to determine whether the problem is just a bad video card?

@Samuel Pitoiset It doesn't seem to freeze on 4.18.12.

Are you still able to reproduce with latest mesa/llvm git? We fixed some issues that might or might not help. Since last time, I tried (a bunch of times) to reproduce the GPU hang with Dirt 4 without success, unfortunately.

(In reply to Samuel Pitoiset from comment #16)
> Are you still able to reproduce with latest mesa/llvm git?
Yes.

What graphics settings are you using? Can you eventually record a video of the Video panel? That way I could try again with the same settings as you.

Cities: Skylines will crash in the same way (but on Ryzen PRO 7 2700U Vega 10) so might help with reproducing and testing? This is guaranteed to crash every time just by loading a map. Have tested with kernel 5.1 and previous, a continuous issue.

(In reply to Samuel Pitoiset from comment #18)
> What graphics settings are you using? Can you eventually record a video of
> the Video panel? That way I could try again with the same settings as you.
I made some screenshots: https://imgur.com/a/C8JTQuI

Can you attach the output of "RADV_DEBUG=info vulkaninfo" please?

Created attachment 144219
force wd switch on eop

Does the attached patch help?

Created attachment 144232
RADV_DEBUG=info vulkaninfo
(In reply to Samuel Pitoiset from comment #22)
> Created attachment 144219
> force wd switch on eop
>
> Does the attached patch help?
No.
(In reply to Samuel Pitoiset from comment #21)
> Can you attach the output of "RADV_DEBUG=info vulkaninfo" please?
Sure.

You have two Polaris10 in your machine, right?

No, just one (Sapphire NITRO+ Radeon RX 580):
> # lspci | grep VGA
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
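
For anyone repeating the RADV_DEBUG experiments from earlier in this thread, here is a minimal sketch (not from the original reporters) that lists every non-empty subset of the four suggested options, so each can get its own overnight run. Only the option names come from the thread; the helper name `radv_debug_subsets` is made up for illustration. Since the hang is random, a run that survives does not prove that an option helps: the full set ran for about 8 hours without a freeze and still froze on a later run.

```python
from itertools import combinations

# The four options suggested in the thread.
OPTIONS = ["nodcc", "nohiz", "nofastclears", "zerovram"]


def radv_debug_subsets(options):
    """Yield every non-empty subset as a RADV_DEBUG value, largest first."""
    for size in range(len(options), 0, -1):
        for combo in combinations(options, size):
            yield ",".join(combo)


if __name__ == "__main__":
    # Print one environment assignment per test run.
    for value in radv_debug_subsets(OPTIONS):
        print(f"RADV_DEBUG={value}")
```

Starting with the largest subsets means a freeze with all options enabled rules out the whole approach early, before time is spent on the smaller combinations.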
(In reply to Cameron Banfield from comment #19)
> Cities: Skylines will crash in the same way (but on Ryzen PRO 7 2700U Vega
> 10) so might help with reproducing and testing?
>
> This is guaranteed to crash every time just by loading a map.
>
> Have tested with kernel 5.1 and previous, a continuous issue.
Just tried to reproduce a hang with that game on both Polaris10/Vega10 with different mesa/llvm combinations, no success. What mesa/llvm do you use and what graphics preset?

(In reply to Thomas Rohloff from comment #25)
> No, just one (Sapphire NITRO+ Radeon RX 580):
>
> > # lspci | grep VGA
> > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
vulkaninfo reports 2 devices, which means you likely have two drivers installed.
Can you first clean up your install to make sure the number of reported devices is 1 (i.e. Device Count = 1)?

Also, can you try to remove the shader cache (i.e. rm -rf ~/.cache/mesa_shader_cache)?

Btw, I played this game for a few hours on my Polaris10 with the same graphics settings and it didn't hang. :/

(In reply to Samuel Pitoiset from comment #27)
> vulkaninfo reports 2 devices, which means you likely have two drivers
> installed.
> Can you first clean up your install to make sure the number of reported
> devices is 1 (i.e. Device Count = 1)?
I would love to do that but I have no idea what's happening here.
This looks as if the same driver claims two devices (same apiVersion and driverVersion):
> GPU0
> VkPhysicalDeviceProperties:
> ===========================
> apiVersion = 0x40105a (1.1.90)
> driverVersion = 79695971 (0x4c01063)
> vendorID = 0x1002
> deviceID = 0x67df
> deviceType = DISCRETE_GPU
> deviceName = AMD RADV POLARIS10 (LLVM 9.0.0)
> GPU1
> VkPhysicalDeviceProperties:
> ===========================
> apiVersion = 0x40105a (1.1.90)
> driverVersion = 79695971 (0x4c01063)
> vendorID = 0x1002
> deviceID = 0x67df
> deviceType = DISCRETE_GPU
> deviceName = AMD RADV POLARIS10 (LLVM 9.0.0)
Also there aren't multiple drivers available:
> $ ls -l /usr/lib64/ | grep vulkan
> lrwxrwxrwx 1 root root 14 7. Mär 16:57 libvulkan.so -> libvulkan.so.1
> lrwxrwxrwx 1 root root 19 7. Mär 16:57 libvulkan.so.1 -> libvulkan.so.1.1.92
> -rwxr-xr-x 1 root root 317992 7. Mär 16:57 libvulkan.so.1.1.92
> -rwxr-xr-x 1 root root 3645808 10. Mai 18:57 libvulkan_radeon.so
> $ equery b /usr/lib64/libvulkan.so.1.1.92 /usr/lib64/libvulkan_radeon.so
> * Searching for /usr/lib64/libvulkan.so.1.1.92,/usr/lib64/libvulkan_radeon.so ...
> media-libs/mesa-9999 (/usr/lib64/libvulkan_radeon.so)
> media-libs/vulkan-loader-1.1.92.1 (/usr/lib64/libvulkan.so.1.1.92)
> $ ls /usr/share/vulkan/icd.d/
> radeon_icd.i686.json radeon_icd.x86_64.json
> $ cat /usr/share/vulkan/icd.d/radeon_icd.x86_64.json
> {
> "ICD": {
> "api_version": "1.1.90",
> "library_path": "/usr/lib64/libvulkan_radeon.so"
> },
> "file_format_version": "1.0.0"
> }
> $ strace vulkaninfo 2>&1 | grep libvulkan
> openat(AT_FDCWD, "/usr/lib64/libvulkan.so.1", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib64/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib64/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib64/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> write(2, "ERROR: [Loader Message] Code 0 :"..., 93ERROR: [Loader Message] Code 0 : /usr/lib32/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> write(2, "ERROR: [Loader Message] Code 0 :"..., 93ERROR: [Loader Message] Code 0 : /usr/lib32/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
> stat("/usr/lib64/libvulkan_radeon.so", {st_mode=S_IFREG|0755, st_size=3645808, ...}) = 0
> stat("/usr/lib64/libvulkan_radeon.so", {st_mode=S_IFREG|0755, st_size=3645808, ...}) = 0

Created attachment 144246
RADV_DEBUG=info vulkaninfo with workaround
Never mind, I found the error. I also found a quick workaround (setting VK_ICD_FILENAMES to /usr/share/vulkan/icd.d/radeon_icd.x86_64.json) but I think this is either a Gentoo or ICD loader bug. Anyway, will retest with the workaround now.
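
The workaround above pins the loader to a single manifest by hand. As a rough illustration of the same idea, the sketch below scans an ICD manifest directory and keeps only manifests whose `library_path` points at a 64-bit ELF file, then prints a matching `VK_ICD_FILENAMES` value. It is a sketch under assumptions, not what the reporter ran: the function names are made up, it checks only absolute library paths that exist, and it assumes (the thread does not show it) that the i686 manifest points at the 32-bit library. It does not explain why two devices were reported.

```python
import json
import os
from pathlib import Path

ELF_MAGIC = b"\x7fELF"
ELFCLASS64 = 2  # e_ident[EI_CLASS]: 1 is 32-bit, 2 is 64-bit


def elf_class(path):
    """Return the EI_CLASS byte of an ELF file, or None if it is not ELF."""
    with open(path, "rb") as handle:
        ident = handle.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return ident[4]


def manifests_for_class(icd_dir, wanted_class=ELFCLASS64):
    """List manifests in icd_dir whose library has the wanted ELF class."""
    selected = []
    directory = Path(icd_dir)
    if not directory.is_dir():
        return selected
    for manifest in sorted(directory.glob("*.json")):
        try:
            data = json.loads(manifest.read_text())
            library = data.get("ICD", {}).get("library_path")
            # Bare library names (resolved by the dynamic linker) are skipped.
            if library and os.path.isabs(library) and os.path.exists(library):
                if elf_class(library) == wanted_class:
                    selected.append(str(manifest))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest: ignore it
    return selected


if __name__ == "__main__":
    # Directory taken from the thread; adjust for other distributions.
    paths = manifests_for_class("/usr/share/vulkan/icd.d")
    print("VK_ICD_FILENAMES=" + ":".join(paths))
```

On Linux the loader accepts a colon-separated list in `VK_ICD_FILENAMES`, which is why the paths are joined with ":".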
And it crashed again:
> [ 1104.395697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=231618, emitted seq=231620
> [ 1104.395701] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 3969 thread WinMain pid 4016
> [ 1104.395704] amdgpu 0000:01:00.0: GPU reset begin!
> [ 1114.625442] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
> [ 1276.700561] WARNING: CPU: 0 PID: 1868 at kernel/kthread.c:529 kthread_park+0x67/0x78
> [ 1276.700564] Modules linked in: nfsd vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
> [ 1276.700569] CPU: 0 PID: 1868 Comm: TaskSchedulerSi Tainted: G O 5.0.5 #1
> [ 1276.700570] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [ 1276.700572] RIP: 0010:kthread_park+0x67/0x78
> [ 1276.700574] Code: 18 e8 fd c9 a9 00 be 40 00 00 00 48 89 df e8 60 72 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 40 00 f6 47 26 20 74
> [ 1276.700575] RSP: 0018:ffffaee303127b78 EFLAGS: 00010202
> [ 1276.700576] RAX: 0000000000000004 RBX: ffffa32ac4dd88c0 RCX: 0000000000000000
> [ 1276.700577] RDX: ffffa32b2b407428 RSI: ffffa32ac4dd88c0 RDI: ffffa32b2b59b300
> [ 1276.700578] RBP: ffffa32b2bbbbde0 R08: 0000000000000000 R09: ffffa32b2ea14a00
> [ 1276.700580] R10: 0000052fd586b10b R11: 00000144012984ae R12: ffffa32b2b402790
> [ 1276.700580] R13: ffffa32b1cbe4e00 R14: 0000000000000206 R15: dead000000000100
> [ 1276.700582] FS: 00007f5d91bd6700(0000) GS:ffffa32b2ea00000(0000) knlGS:0000000000000000
> [ 1276.700583] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1276.700584] CR2: 00001f0f0872fc00 CR3: 00000003f7c74000 CR4: 00000000000406f0
> [ 1276.700585] Call Trace:
> [ 1276.700590] drm_sched_entity_fini+0x32/0x180
> [ 1276.700592] amdgpu_vm_fini+0xa8/0x520
> [ 1276.700595] ? idr_destroy+0x78/0xc0
> [ 1276.700597] amdgpu_driver_postclose_kms+0x14c/0x268
> [ 1276.700600] drm_file_free.part.7+0x21a/0x2f8
> [ 1276.700601] drm_release+0xa5/0x120
> [ 1276.700604] __fput+0x9a/0x1c8
> [ 1276.700606] task_work_run+0x8a/0xb0
> [ 1276.700608] do_exit+0x2b5/0xb28
> [ 1276.700610] do_group_exit+0x35/0x98
> [ 1276.700612] get_signal+0xbd/0x690
> [ 1276.700614] ? do_signal+0x2b/0x6b8
> [ 1276.700616] ? __x64_sys_futex+0x137/0x178
> [ 1276.700618] ? exit_to_usermode_loop+0x46/0xa0
> [ 1276.700619] ? do_syscall_64+0x14c/0x178
> [ 1276.700621] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1276.700623] ---[ end trace 523e60dbe51727f8 ]---
> [ 1278.790751] [drm] Skip scheduling IBs!
I did remove the shader cache (stopped X.org, removed everything inside of ~/.cache/mesa_shader_cache and rebooted) before starting the test.
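
When a replay runs unattended overnight, it helps to pull the timeout lines out of a saved dmesg log automatically. The following is a minimal sketch that matches the "ring gfx timeout" line format shown in the log above and reports the ring name and the two sequence numbers. The regular expression is written against that one sample line only, so other kernel versions may print a different format; the function name and the `pending` field (emitted minus signaled) are illustrative additions, not kernel terminology.

```python
import re

# Pattern based on the amdgpu_job_timedout line quoted in this thread.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)

# Two lines copied from the log in this thread, used as sample input.
SAMPLE = (
    "[ 1104.395697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, "
    "signaled seq=231618, emitted seq=231620\n"
    "[ 1104.395704] amdgpu 0000:01:00.0: GPU reset begin!\n"
)


def find_ring_timeouts(log_text):
    """Return one dict per ring timeout line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            events.append({
                "time": float(match.group("time")),
                "ring": match.group("ring"),
                "signaled": signaled,
                "emitted": emitted,
                # Rough indicator only: sequence numbers emitted but not signaled.
                "pending": emitted - signaled,
            })
    return events


if __name__ == "__main__":
    for event in find_ring_timeouts(SAMPLE):
        print(event)
```

Because the pattern is searched for anywhere in the line, it also matches lines that carry a leading "> " quote marker, as in the copy of the log above.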
Is there a special race where it crashes more often? Can you explain again the steps to reproduce? I'm sorry but I can't do anything better without being able to repro myself.

(In reply to Samuel Pitoiset from comment #31)
> Is there a special race where it crashes more often?
No, it's just random.
> Can you explain again the steps to reproduce?
I start a new race and finish it, but instead of exiting I click on "view replay" and just let it run until it crashes.

Ok, I will replay a race all night and hopefully I will hit that hang.

No hangs so far.

I do experience the same hang in Far Cry New Dawn. It occurs very randomly. Using latest LLVM9 git and mesa 19.2 git with kernel 5.0.13.

Just as an update, the issue in Far Cry New Dawn doesn't seem to occur when using mesa 19.0.4 and llvm8 from the official Arch repo.

(In reply to antonh from comment #36)
> Just as an update, the issue in Far Cry New Dawn doesn't seem to occur when
> using mesa 19.0.4 and llvm8 from the official Arch repo.
I'm not using Arch but will test if this combination helps here, too. Might just need some time as I'm not home right now. Anyway, could you try to pinpoint if it's Mesa or LLVM and maybe even bisect?

Running mesa-git 19.2 together with LLVM8 does not expose the issue for me either. So I assume it's LLVM causing the issue.

To my embarrassment, my issue seems to have been a memory problem that testing had failed to reveal. Whoops. Sorry.

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/859. |