Bug 110345

Summary:	Unrecoverable GPU crash with DiRT 4
Product:	Mesa	Reporter:	Thomas Rohloff <v10lator>
Component:	Drivers/Vulkan/radeon	Assignee:	mesa-dev
Status:	RESOLVED MOVED	QA Contact:	mesa-dev
Severity:	normal
Priority:	medium	CC:	f.pinamartins
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	force wd switch on eop; RADV_DEBUG=info vulkaninfo; RADV_DEBUG=info vulkaninfo with workaround

Description Thomas Rohloff 2019-04-06 06:27:31 UTC

At first I thought this was a kernel bug, but Alex Deucher told me it's most likely a Mesa bug.

The game randomly freezes. When that happens the screen is frozen and input by mouse or keyboard doesn't work (LEDs on the keyboard are frozen, too). It's still possible to SSH to the PC to get some logs but that's about it: Even the reboot command freezes.

Here's a log from Mesa 18.3.4 and kernel 5.0.4:

> [52700.498697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=1423558, emitted seq=1423560
> [52700.498702] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 10332 thread WebViewRenderer pid 10391
> [52700.498705] amdgpu 0000:01:00.0: GPU reset begin!
> [52710.728397] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
> [52873.699280] WARNING: CPU: 2 PID: 4034 at kernel/kthread.c:529 kthread_park+0x67/0x78
> [52873.699283] Modules linked in: nfsd
> [52873.699287] CPU: 2 PID: 4034 Comm: TaskSchedulerFo Not tainted 5.0.4 #1
> [52873.699288] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [52873.699290] RIP: 0010:kthread_park+0x67/0x78
> [52873.699291] Code: 18 e8 9d 78 aa 00 be 40 00 00 00 48 89 df e8 60 72 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 40 00 f6 47 26 20 74
> [52873.699293] RSP: 0018:ffffa0144460fb78 EFLAGS: 00210202
> [52873.699294] RAX: 0000000000000004 RBX: ffff9155631210c0 RCX: 0000000000000000
> [52873.699295] RDX: ffff9155ef427428 RSI: ffff9155631210c0 RDI: ffff9155ef9bbfc0
> [52873.699296] RBP: ffff9155f013b8a0 R08: ffff9155f2a97480 R09: ffff9155f2a94a00
> [52873.699297] R10: 0000d46d0abbfe3a R11: 000033d8b581bc78 R12: ffff9155ef422790
> [52873.699298] R13: ffff9155a2f83c00 R14: 0000000000000202 R15: dead000000000100
> [52873.699299] FS:  00007fc756cff700(0000) GS:ffff9155f2a80000(0000) knlGS:0000000000000000
> [52873.699301] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [52873.699302] CR2: 00007fc7650b8070 CR3: 0000000322b86000 CR4: 00000000000406e0
> [52873.699302] Call Trace:
> [52873.699307]  drm_sched_entity_fini+0x32/0x180
> [52873.699309]  amdgpu_vm_fini+0xa8/0x520
> [52873.699311]  ? idr_destroy+0x78/0xc0
> [52873.699313]  amdgpu_driver_postclose_kms+0x14c/0x268
> [52873.699316]  drm_file_free.part.7+0x21a/0x2f8
> [52873.699318]  drm_release+0xa5/0x120
> [52873.699320]  __fput+0x9a/0x1c8
> [52873.699323]  task_work_run+0x8a/0xb0
> [52873.699325]  do_exit+0x2b5/0xb30
> [52873.699326]  do_group_exit+0x35/0x98
> [52873.699328]  get_signal+0xbd/0x690
> [52873.699331]  ? _raw_spin_unlock+0xd/0x20
> [52873.699333]  ? do_signal+0x2b/0x6b8
> [52873.699335]  ? __x64_sys_futex+0x137/0x178
> [52873.699337]  ? exit_to_usermode_loop+0x46/0xa0
> [52873.699338]  ? do_syscall_64+0x14c/0x178
> [52873.699339]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [52873.699341] ---[ end trace 1e1efc0508ef22df ]---
> [52875.619562] [drm] Skip scheduling IBs!
> [52875.625247] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
> [52885.826983] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:47:crtc-0] flip_done timed out
> [52896.066581] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [CRTC:47:crtc-0] flip_done timed out
> [52906.306280] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR* [PLANE:45:plane-5] flip_done timed out

Mesa 19.0.1 / kernel 5.0.5:

> [178793.032358] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=12332054, emitted seq=12332056
> [178793.032362] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 31348 thread WebViewRenderer pid 31422
> [178793.032365] amdgpu 0000:01:00.0: GPU reset begin!
> [178803.262008] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out

Mesa git (26e161b1e9) / kernel 5.0.5:

> [ 7819.095648] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=2652771, emitted seq=2652773
> [ 7819.095652] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 3075 thread WebViewRenderer pid 3152
> [ 7819.095655] amdgpu 0000:01:00.0: GPU reset begin!
> [ 7829.315220] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out

This is with a Radeon RX 580 (Sapphire NITRO+).

Link to the kernel bug report: https://bugzilla.kernel.org/show_bug.cgi?id=203111
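
For comparing hangs across runs, the timeout lines in the logs above can be pulled out of a saved dmesg dump with a short script. This is only a sketch: the function name and the "pending" field are made up for illustration, and reading "emitted minus signaled" as the number of jobs still in flight is an assumption, not something the driver log states. In each of the three logs above, the emitted sequence number is two ahead of the signaled one.

```python
import re

# Matches the amdgpu timeout line in the form it has in the logs above.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout\] \*ERROR\* "
    r"ring (?P<ring>\w+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return one dict per 'ring ... timeout' line found in dmesg_text."""
    events = []
    for match in TIMEOUT_RE.finditer(dmesg_text):
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        events.append({
            "timestamp": float(match["ts"]),
            "ring": match["ring"],
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: the difference counts jobs emitted but not yet signaled.
            "pending": emitted - signaled,
        })
    return events
```

Feeding it the text of `dmesg` (or a log file captured over SSH after the freeze) gives one entry per timeout, which makes it easier to see whether the ring and the sequence gap are the same every time.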

Comment 1 Samuel Pitoiset 2019-04-09 10:32:01 UTC

What preset and what llvm are you using?

Comment 2 Thomas Rohloff 2019-04-09 10:54:01 UTC

(In reply to Samuel Pitoiset from comment #1)
> What preset and what llvm are you using?

By preset you mean the in-game graphics preset, right? If so, I'm using Ultra (everything maxed out).
I started with LLVM 7 but my last test (there's no log as it didn't differ) was with LLVM 8:

> OpenGL renderer string: Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.5, LLVM 8.0.0)

Should I try LLVM 9 (git)? Also note that every update (be it Mesa or LLVM) seems to make the bug trigger less frequently: my last test with LLVM 8 needed around 6 hours to trigger it. That's of course highly subjective, and the freezing is random anyway. If you need a way to reproduce: drive a track, but instead of continuing after the finish just let the reply run (overnight).

Comment 3 Thomas Rohloff 2019-04-09 10:54:53 UTC

(In reply to Thomas Rohloff from comment #2)
> instead of continuing after finish just let the reply run

I meant "replay", sorry.

Comment 4 Samuel Pitoiset 2019-04-09 16:37:20 UTC

I got a hang one time when starting a new race, I'm not sure how to reproduce it again. Like you said, it's random and that will be a huge pain to figure out.

Comment 5 Thomas Rohloff 2019-04-09 17:25:01 UTC

(In reply to Samuel Pitoiset from comment #4)
> I got a hang one time when starting a new race, I'm not sure how to
> reproduce it again. Like you said, it's random and that will be a huge pain
> to figure out.

Sad to hear. In the meantime I tested LLVM git (6aaa856e179f33a05a47db3666821d2ae44c9afc) in combination with Mesa git (75a3dd97aa) and it froze with the same info in dmesg.

Comment 6 Samuel Pitoiset 2019-04-10 07:38:53 UTC

If you can find a good way to reproduce the issue, that would be highly appreciated. You can also try different combinations of debug options, like:

RADV_DEBUG=nodcc,nohiz,nofastclears,zerovram
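
To go through the combinations systematically, something like the following could generate the values to test. It is a minimal sketch that only builds the strings; the function name is made up, and starting with the full set and working down to single options is just one possible order.

```python
from itertools import combinations

# The four RADV_DEBUG options named in this comment.
FLAGS = ["nodcc", "nohiz", "nofastclears", "zerovram"]


def radv_debug_values(flags=FLAGS):
    """Yield every non-empty subset of flags as a RADV_DEBUG value, largest first."""
    for size in range(len(flags), 0, -1):
        for subset in combinations(flags, size):
            yield ",".join(subset)


if __name__ == "__main__":
    for value in radv_debug_values():
        print(f"RADV_DEBUG={value}")
```

With four options this gives 15 candidates. Since the hang is random, a run that survives with one of them is weak evidence unless it is repeated.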

Comment 7 Thomas Rohloff 2019-04-10 19:05:21 UTC

Thanks for all the helpful advice. Sadly I still didn't find a way to reproduce it better. Anyway, I tried your env var for around 8 hours without a freeze. While that might be just luck I'll do more testing (just started a replay with only two of the four options to let it run overnight). Hopefully it will pinpoint which of the four options does the trick, so expect a new reply in two to four days. :)

Comment 8 Samuel Pitoiset 2019-04-11 07:25:03 UTC

Very nice, thanks for your time!

Comment 9 Thomas Rohloff 2019-04-11 14:51:02 UTC

Sorry to say but all of these combinations froze:

RADV_DEBUG=nodcc,nohiz
RADV_DEBUG=nofastclears
RADV_DEBUG=zerovram

So I'll re-run with RADV_DEBUG=nodcc,nohiz,nofastclears,zerovram to confirm it was just luck before. Not sure what to do after that to help pinpoint this, as the distribution I'm using has some problems with vktrace at the moment, and even if I were able to run it, it would most likely produce a trace way too large to upload.

Comment 10 Thomas Rohloff 2019-04-11 18:44:04 UTC

(In reply to Thomas Rohloff from comment #9)
> So I'll re-run with RADV_DEBUG=nodcc,nohiz,nofastclears,zerovram to confirm
> it was just luck before.

And it froze, too.

Comment 11 Samuel Pitoiset 2019-04-12 16:16:42 UTC

Does the problem still happen if you try to downgrade your kernel?

Comment 12 Thomas Rohloff 2019-04-13 09:11:36 UTC

The 4.19 and 4.20 series won't boot for me because of other bugs.
I could try 4.18.12, but the FPS is very low with it (so it might take longer to trigger the bug) and I can't bisect (because, as said above, 4.19/4.20 won't boot for me).

Comment 13 Thomas Rohloff 2019-04-14 08:21:44 UTC

Shortly after my last reply I started a test with 4.18.12. At the moment it's still not frozen but I want to give it more time (cause of the low FPS).

Anyway, I saw somebody CCing this bug report. So @Riikka: Are you experiencing the same bug and if so would you mind sharing your system details (kernel, mesa and LLVM versions) ? Also do you have the time and skills to test other kernel versions and maybe even bisect (wait before my test on 4.18.12 is finished before doing your own testing to not waste time and energy) ?

Comment 14 Riikka 2019-04-14 10:26:47 UTC

(In reply to Thomas Rohloff from comment #13)
> Shortly after my last reply I started a test with 4.18.12. At the moment
> it's still not frozen but I want to give it more time (cause of the low FPS).
> 
> Anyway, I saw somebody CCing this bug report. So @Riikka: Are you
> experiencing the same bug and if so would you mind sharing your system
> details (kernel, mesa and LLVM versions) ? Also do you have the time and
> skills to test other kernel versions and maybe even bisect (wait before my
> test on 4.18.12 is finished before doing your own testing to not waste time
> and energy) ?

I knew I should've been sneakier...

I don't have DiRT 4 (I think) so I can't say I have a crash when I play it. I am, however, using an old card because it doesn't randomly crash things and hang my system (sort of...??) with dmesg output like what's been posted all over this place.

I don't know how much of my story I should post here since it's mostly more of the same and this isn't necessarily the issue I had (discounting the particular application's involvement). Also I apparently didn't keep records nearly as much as I thought. 

I can handle building a bunch of kernels (thanks to my lovely Ryzen CPU that only causes compiler segfaults sometimes (also I use Gentoo; please feel sorry for me now)) and should be able to handle some debugging work though I'm unfamiliar with graphics mechanisms at this level; it's more a matter of patience. My latest efforts resulted in a system so flaky I was thinking I'd need to replace the whole thing. I'm running on an old HD 6870 instead of my RX 580. It's not /physically/ hard to swap the cards, only emotionally :P

Shorter story: similar to several reports on this site: unpredictable hanging, familiar dmesg output. SSH works, cursor works, audio works. Tried many things (newer kernels (4.2x, 5.0.5, 5.0.7) (similar bugs keep getting fixed, just never my one :( ), Mesa (late 18s, 19.x), all with LLVM 7.0.1 it seems), no luck, just more pain with our friend "ring gfx timeout." Latest results were veeeerryy strange crashiness in programs but seemingly less of the hanging. Now using a cruddy old card instead because terminal-based roguelikes don't need frame rates, will be trying things again soon. I added myself just to see if there's any progress on these issues or something particular I can try or do to help.

Someone save us. amdgpu is hurting our souls.

Also, I do want to bring up a couple of questions. What are the relevant debugging tools, and is there any way to determine whether the problem is just a bad video card?

Comment 15 Thomas Rohloff 2019-04-16 17:53:45 UTC

@Samuel Pitoiset It doesn't seem to freeze on 4.18.12.

Comment 16 Samuel Pitoiset 2019-05-07 07:55:30 UTC

Are you still able to reproduce with latest mesa/llvm git?
We fixed some issues that might or might not help.
Since last time, I tried (a bunch of times) to reproduce the GPU hang with Dirt 4 without success, unfortunately.

Comment 17 Thomas Rohloff 2019-05-08 06:45:24 UTC

(In reply to Samuel Pitoiset from comment #16)
> Are you still able to reproduce with latest mesa/llvm git?

Yes.

Comment 18 Samuel Pitoiset 2019-05-09 16:28:18 UTC

What graphics settings are you using? Can you eventually record a video of the Video panel? That way I could try again with the same settings as you.

Comment 19 Cameron Banfield 2019-05-09 20:01:53 UTC

Cities: Skylines will crash in the same way (but on Ryzen PRO 7 2700U Vega 10) so might help with re-producing and testing?

This is guaranteed to crash every time just by loading a map.

Have tested with kernel 5.1 and previous, a continuous issue

Comment 20 Thomas Rohloff 2019-05-09 20:27:11 UTC

(In reply to Samuel Pitoiset from comment #18)
> What graphics settings are you using? Can you eventually record a video of
> the Video panel? That way I could try again with the same settings as you.

I made some screenshots: https://imgur.com/a/C8JTQuI

Comment 21 Samuel Pitoiset 2019-05-10 12:48:20 UTC

Can you attach the output of "RADV_DEBUG=info vulkaninfo" please?

Comment 22 Samuel Pitoiset 2019-05-10 13:23:05 UTC

Created attachment 144219 [details] [review]
force wd switch on eop

Does the attached patch help?

Comment 23 Thomas Rohloff 2019-05-11 07:03:00 UTC

Created attachment 144232 [details]
RADV_DEBUG=info vulkaninfo

(In reply to Samuel Pitoiset from comment #22)
> Created attachment 144219 [details] [review]
> force wd switch on eop
> 
> Does the attached patch help?

No.

(In reply to Samuel Pitoiset from comment #21)
> Can you attach the output of "RADV_DEBUG=info vulkaninfo" please?

Sure.

Comment 24 Samuel Pitoiset 2019-05-13 08:03:05 UTC

You have two Polaris10 in your machine right?

Comment 25 Thomas Rohloff 2019-05-13 08:33:29 UTC

No, just one (Sapphire NITRO+ Radeon RX 580):

> # lspci | grep VGA
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)

Comment 26 Samuel Pitoiset 2019-05-13 08:53:45 UTC

(In reply to Cameron Banfield from comment #19)
> Cities: Skylines will crash in the same way (but on Ryzen PRO 7 2700U Vega
> 10) so might help with re-producing and testing?
> 
> This is guaranteed to crash every time just by loading a map.
> 
> Have tested with kernel 5.1 and previous, a continuous issue

Just tried to reproduce a hang with that game on both Polaris10/Vega10 with different mesa/llvm combinations, no success.

What mesa/llvm do you use and what graphics preset?

Comment 27 Samuel Pitoiset 2019-05-13 10:13:07 UTC

(In reply to Thomas Rohloff from comment #25)
> No, just one (Sapphire NITRO+ Radeon RX 580):
> 
> > # lspci | grep VGA
> > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)

vulkaninfo reports 2 devices which means you likely have two drivers installed.
Can you first clean up your install to make sure the number of reported devices is 1 (ie. Device Count = 1).

Also, can you try removing the shader cache (i.e. rm -rf ~/.cache/mesa_shader_cache)?

Btw, I played this game for a few hours on my Polaris10 with the same graphics settings and it didn't hang. :/

Comment 28 Thomas Rohloff 2019-05-13 10:45:05 UTC

(In reply to Samuel Pitoiset from comment #27)
> vulkaninfo reports 2 devices which means you likely have two drivers
> installed.
> Can you first clean up your install to make sure the number of reported
> devices is 1 (ie. Device Count = 1).

I would love to do that but I have no idea what's happening here. It looks as if the same driver is claiming two devices (same apiVersion and driverVersion):

> GPU0
> VkPhysicalDeviceProperties:
> ===========================
> 	apiVersion     = 0x40105a  (1.1.90)
> 	driverVersion  = 79695971 (0x4c01063)
> 	vendorID       = 0x1002
> 	deviceID       = 0x67df
> 	deviceType     = DISCRETE_GPU
> 	deviceName     = AMD RADV POLARIS10 (LLVM 9.0.0)

> GPU1
> VkPhysicalDeviceProperties:
> ===========================
> 	apiVersion     = 0x40105a  (1.1.90)
> 	driverVersion  = 79695971 (0x4c01063)
> 	vendorID       = 0x1002
> 	deviceID       = 0x67df
> 	deviceType     = DISCRETE_GPU
> 	deviceName     = AMD RADV POLARIS10 (LLVM 9.0.0)

Also there aren't multiple drivers available:

> $ ls -l /usr/lib64/ | grep vulkan
> lrwxrwxrwx  1 root     root            14  7. Mär 16:57 libvulkan.so -> libvulkan.so.1
> lrwxrwxrwx  1 root     root            19  7. Mär 16:57 libvulkan.so.1 -> libvulkan.so.1.1.92
> -rwxr-xr-x  1 root     root        317992  7. Mär 16:57 libvulkan.so.1.1.92
> -rwxr-xr-x  1 root     root       3645808 10. Mai 18:57 libvulkan_radeon.so
> $ equery b /usr/lib64/libvulkan.so.1.1.92 /usr/lib64/libvulkan_radeon.so
>  * Searching for /usr/lib64/libvulkan.so.1.1.92,/usr/lib64/libvulkan_radeon.so ... 
> media-libs/mesa-9999 (/usr/lib64/libvulkan_radeon.so)
> media-libs/vulkan-loader-1.1.92.1 (/usr/lib64/libvulkan.so.1.1.92)
> $ ls /usr/share/vulkan/icd.d/
> radeon_icd.i686.json  radeon_icd.x86_64.json
> $ cat /usr/share/vulkan/icd.d/radeon_icd.x86_64.json
> {
>     "ICD": {
>         "api_version": "1.1.90",
>         "library_path": "/usr/lib64/libvulkan_radeon.so"
>     },
>     "file_format_version": "1.0.0"
> }
> $ strace vulkaninfo 2>&1 | grep libvulkan
> openat(AT_FDCWD, "/usr/lib64/libvulkan.so.1", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib64/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib64/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib64/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> write(2, "ERROR: [Loader Message] Code 0 :"..., 93ERROR: [Loader Message] Code 0 : /usr/lib32/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
> openat(AT_FDCWD, "/usr/lib32/libvulkan_radeon.so", O_RDONLY|O_CLOEXEC) = 3
> write(2, "ERROR: [Loader Message] Code 0 :"..., 93ERROR: [Loader Message] Code 0 : /usr/lib32/libvulkan_radeon.so: wrong ELF class: ELFCLASS32
> stat("/usr/lib64/libvulkan_radeon.so", {st_mode=S_IFREG|0755, st_size=3645808, ...}) = 0
> stat("/usr/lib64/libvulkan_radeon.so", {st_mode=S_IFREG|0755, st_size=3645808, ...}) = 0

Comment 29 Thomas Rohloff 2019-05-13 11:02:14 UTC

Created attachment 144246 [details]
RADV_DEBUG=info vulkaninfo with workaround

Never mind, I found the error. I also found a quick workaround (setting VK_ICD_FILENAMES to /usr/share/vulkan/icd.d/radeon_icd.x86_64.json), but I think this is either a Gentoo or an ICD loader bug. Anyway, I'll retest with the workaround now.
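
For anyone hitting the same thing, the workaround could be automated roughly as below. This is a hedged sketch, not a fix for whatever the underlying error is: it assumes every manifest uses an absolute library_path like the one shown in comment 28, the helper names are made up, and it simply keeps the manifests whose driver library is a 64-bit ELF file (EI_CLASS byte at offset 4 equal to 2).

```python
import json
import os
from pathlib import Path

ELFCLASS64 = 2  # value of the EI_CLASS byte (offset 4) for 64-bit ELF files


def is_elf64(path):
    """Return True if path starts with an ELF header marked as 64-bit."""
    try:
        with open(path, "rb") as f:
            header = f.read(5)
    except OSError:
        return False
    return len(header) == 5 and header[:4] == b"\x7fELF" and header[4] == ELFCLASS64


def elf64_icd_manifests(icd_dir):
    """Return the manifests in icd_dir whose library_path is a 64-bit ELF file."""
    selected = []
    for manifest in sorted(Path(icd_dir).glob("*.json")):
        try:
            library = json.loads(manifest.read_text())["ICD"]["library_path"]
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or unexpected manifest, skip it
        if is_elf64(library):
            selected.append(str(manifest))
    return selected


if __name__ == "__main__":
    manifests = elf64_icd_manifests("/usr/share/vulkan/icd.d")
    # VK_ICD_FILENAMES takes a list separated by os.pathsep (":" on Linux).
    print("VK_ICD_FILENAMES=" + os.pathsep.join(manifests))
```

The printed line can be exported before starting vulkaninfo or the game, so that only the 64-bit manifest is considered by the loader.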

Comment 30 Thomas Rohloff 2019-05-13 11:31:39 UTC

And it crashed again:

> [ 1104.395697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled seq=231618, emitted seq=231620
> [ 1104.395701] [drm:amdgpu_job_timedout] *ERROR* Process information: process Dirt4 pid 3969 thread WinMain pid 4016
> [ 1104.395704] amdgpu 0000:01:00.0: GPU reset begin!
> [ 1114.625442] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
> [ 1276.700561] WARNING: CPU: 0 PID: 1868 at kernel/kthread.c:529 kthread_park+0x67/0x78
> [ 1276.700564] Modules linked in: nfsd vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
> [ 1276.700569] CPU: 0 PID: 1868 Comm: TaskSchedulerSi Tainted: G           O      5.0.5 #1
> [ 1276.700570] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [ 1276.700572] RIP: 0010:kthread_park+0x67/0x78
> [ 1276.700574] Code: 18 e8 fd c9 a9 00 be 40 00 00 00 48 89 df e8 60 72 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 40 00 f6 47 26 20 74
> [ 1276.700575] RSP: 0018:ffffaee303127b78 EFLAGS: 00010202
> [ 1276.700576] RAX: 0000000000000004 RBX: ffffa32ac4dd88c0 RCX: 0000000000000000
> [ 1276.700577] RDX: ffffa32b2b407428 RSI: ffffa32ac4dd88c0 RDI: ffffa32b2b59b300
> [ 1276.700578] RBP: ffffa32b2bbbbde0 R08: 0000000000000000 R09: ffffa32b2ea14a00
> [ 1276.700580] R10: 0000052fd586b10b R11: 00000144012984ae R12: ffffa32b2b402790
> [ 1276.700580] R13: ffffa32b1cbe4e00 R14: 0000000000000206 R15: dead000000000100
> [ 1276.700582] FS:  00007f5d91bd6700(0000) GS:ffffa32b2ea00000(0000) knlGS:0000000000000000
> [ 1276.700583] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1276.700584] CR2: 00001f0f0872fc00 CR3: 00000003f7c74000 CR4: 00000000000406f0
> [ 1276.700585] Call Trace:
> [ 1276.700590]  drm_sched_entity_fini+0x32/0x180
> [ 1276.700592]  amdgpu_vm_fini+0xa8/0x520
> [ 1276.700595]  ? idr_destroy+0x78/0xc0
> [ 1276.700597]  amdgpu_driver_postclose_kms+0x14c/0x268
> [ 1276.700600]  drm_file_free.part.7+0x21a/0x2f8
> [ 1276.700601]  drm_release+0xa5/0x120
> [ 1276.700604]  __fput+0x9a/0x1c8
> [ 1276.700606]  task_work_run+0x8a/0xb0
> [ 1276.700608]  do_exit+0x2b5/0xb28
> [ 1276.700610]  do_group_exit+0x35/0x98
> [ 1276.700612]  get_signal+0xbd/0x690
> [ 1276.700614]  ? do_signal+0x2b/0x6b8
> [ 1276.700616]  ? __x64_sys_futex+0x137/0x178
> [ 1276.700618]  ? exit_to_usermode_loop+0x46/0xa0
> [ 1276.700619]  ? do_syscall_64+0x14c/0x178
> [ 1276.700621]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1276.700623] ---[ end trace 523e60dbe51727f8 ]---
> [ 1278.790751] [drm] Skip scheduling IBs!

I did remove the shader cache (stopped X.org, removed everything inside of ~/.cache/mesa_shader_cache and rebooted) before starting the test.

Comment 31 Samuel Pitoiset 2019-05-13 11:38:20 UTC

Is there a special race where it crashes more often?
Can you explain again the steps to reproduce?
I'm sorry but I can't do anything better without being able to repro myself.

Comment 32 Thomas Rohloff 2019-05-13 11:55:31 UTC

(In reply to Samuel Pitoiset from comment #31)
> Is there a special race where it crashes more often?

No, it's just random.

> Can you explain again the steps to reproduce?

I start a new race and finish it, but instead of exiting I click on "view replay" and just let it run until it crashes.

Comment 33 Samuel Pitoiset 2019-05-13 20:06:13 UTC

Ok, I will let a race replay run all night and hopefully I will hit that hang.

Comment 34 Samuel Pitoiset 2019-05-14 07:59:37 UTC

No hangs so far.

Comment 35 Anton Herzfeld 2019-05-17 06:06:49 UTC

I do experience the same hang in Far Cry New Dawn. It occurs very randomly. Using latest LLVM9 git and mesa 19.2 git with kernel 5.0.13.

Comment 36 Anton Herzfeld 2019-05-18 06:27:00 UTC

Just as an update the issue in Far Cry New Dawn doesn't seem to occur when using mesa 19.0.4 and llvm8 from the official arch repo.

Comment 37 Thomas Rohloff 2019-05-24 13:00:52 UTC

(In reply to antonh from comment #36)
> Just as an update the issue in Far Cry New Dawn doesn't seem to occur when
> using mesa 19.0.4 and llvm8 from the official arch repo.

I'm not using Arch but will test if this combination helps here, too. Might just need some time as I'm not home right now. Anyway, could you try to pinpoint if it's Mesa or LLVM and maybe even bisect?

Comment 38 Anton Herzfeld 2019-06-03 17:13:48 UTC

Running mesa-git 19.2 together with LLVM8 does not trigger the issue for me either. So I assume it's LLVM causing the issue.

Comment 39 Riikka 2019-06-05 01:07:30 UTC

To my embarrassment, my issue seems to have been a memory problem that testing had failed to reveal. Whoops. Sorry.

Comment 40 GitLab Migration User 2019-09-18 19:55:56 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/859.
