Bug 101334 - AMD SI cards: Some vulkan apps freeze the system
Status: CLOSED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
Reported: 2017-06-07 20:05 UTC by John
Modified: 2018-07-20 09:22 UTC


Attachments
dmesg after the crash (71.47 KB, text/plain), 2017-06-07 20:05 UTC, John
no_si_prefetch (754 bytes, patch), 2017-06-10 14:45 UTC, Grazvydas Ignotas
gdb backtrace (1.79 KB, text/plain), 2017-06-11 12:42 UTC, John
trace everything (956 bytes, patch), 2017-06-11 14:15 UTC, Grazvydas Ignotas
trace (1.79 KB, text/plain), 2017-06-11 14:48 UTC, John
trace (15.51 KB, text/plain), 2017-06-11 15:24 UTC, John
flush try (2.01 KB, patch), 2017-06-15 20:27 UTC, Grazvydas Ignotas
trace (16.64 KB, text/plain), 2017-06-15 21:16 UTC, John
gdb backtrace (557 bytes, text/plain), 2017-06-15 21:27 UTC, John
radv trace (1.48 KB, text/plain), 2017-08-02 07:04 UTC, John
another trace (1.48 KB, text/plain), 2017-08-05 08:13 UTC, John
hacky patch? (2.26 KB, patch), 2017-08-07 04:04 UTC, Dave Airlie
radv trace after 2nd patch (1.48 KB, text/plain), 2017-08-07 04:42 UTC, John
fix MSAA rendering. (1.43 KB, patch), 2017-08-07 06:57 UTC, Dave Airlie

Description John 2017-06-07 20:05:34 UTC
Created attachment 131788 [details]
dmesg after the crash

Hello,

I haven't tried radv/Vulkan in a few months. Last time I tried, most apps and games worked and only a few crashed the system; now it seems all of them have issues.

I am using a 280x, on agd5f's linux-staging 4.11 (but the system freezes as well on standard 4.11.3). Xorg is 1.19.3.

Once the computer hangs, I'm unable to use it directly but I can still connect from another machine through SSH. I do have to manually force a restart though.

Games I've tried which froze the system:
- Mad Max (it always was bad for me).
- Serious Sam Fusion (that one used to work, as long as the Steam overlay was off).

Apps I've tried with similar results:
- Some projects from SaschaWillems/Vulkan (some used to run, others would just refuse to, but no crash in the past).
- The Dolphin emulator (which used to work as well).

I am on mesa-git, llvm-svn and ddx-git as well.
I'm happy to test patches as needed.

Thanks!
Comment 1 Eric Engestrom 2017-06-08 13:10:02 UTC
Hi John,

You correctly gave the version number for the release-version of software you use, but you didn't give any information for the development-version of software you mentioned.
Taking Mesa as an example (and assuming you're using the master branch), there are currently 92809 different versions of mesa-git you could be using ;)
For git, the commit hash identifies the version, and for svn I think the commit number (eg 302914 for llvm-svn) is best.

Cheers,
  Eric
Comment 2 John 2017-06-08 23:27:06 UTC
Hello Eric,

Good point!

I am currently on mesa ce53e8e61b (commit number 92807), and llvm 304967.

Would anything else help?

Thank you!
John
Comment 3 Eric Engestrom 2017-06-09 17:29:28 UTC
(In reply to John from comment #2)
> Would anything else help?

It would be really helpful if you could bisect the issue.
This means picking an app (game) that was working and doesn't work anymore, and running `git bisect` using this app to determine if each commit has the issue.

This page can help you if you don't know how the commands work:
https://git-scm.com/docs/git-bisect

Note that I barely know anything about radv, so unless there's something fairly obvious in the bad commit I won't be able to help past this.
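
As a rough illustration of why bisecting is tractable even with tens of thousands of commits, here is a toy Python sketch of the binary search that `git bisect` automates. It is not git's implementation: `first_bad` and `is_bad` are hypothetical names, and `is_bad` stands in for "build this commit and check whether the app freezes". It also assumes the history is linear and that every commit after the first bad one is bad as well.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of commits that had to be tested).

    Assumptions (hypothetical, for illustration only):
    - commits are ordered oldest to newest,
    - commits[0] is known good and commits[-1] is known bad,
    - once a commit is bad, all later commits are bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1  # one "build and run the app" round
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], tests
```

Under these assumptions, a history of 92809 commits needs at most 17 build-and-test rounds, and fewer if you already know a more recent good commit.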
Comment 4 John 2017-06-09 21:55:52 UTC
Alright, after bisecting here's the problematic commit:

https://cgit.freedesktop.org/mesa/mesa/commit/?id=bcae3274692954ad2cd6dfc253579ec98d50856f

Thanks!
Comment 5 John 2017-06-09 21:58:41 UTC
Adding Dave.
Comment 6 Grazvydas Ignotas 2017-06-10 14:45:17 UTC
Created attachment 131842 [details] [review]
no_si_prefetch

Does this patch help?
Comment 7 Grazvydas Ignotas 2017-06-10 15:57:22 UTC
I've sent a different version to the ML; testing that one would be preferred:
https://lists.freedesktop.org/archives/mesa-dev/2017-June/158700.html
Comment 8 John 2017-06-10 20:57:27 UTC
Grazvydas,
I've rebuilt mesa at the faulty commit with your 2 patches and it worked as well as before that commit.

Thank you for the quick fix!



Now if possible I'd love to look at the freezes I've had since my first test months ago, and still have.

The behavior is somewhat different from the one fixed by Grazvydas' patches: the application starts and runs fine for a while, from a few seconds to a few minutes, and then the PC freezes much like in the issue I described earlier. I still have SSH access, yet trying to restart never works.

These are much more annoying to debug, though, as no error gets displayed in dmesg, and since it has always been a problem for me I have no good commit to bisect from... I've looked at the Xorg logs as well but saw nothing there either.

A simple test for this is to run SaschaWillems/Vulkan/Raytracing, after moving around for a few seconds the issue will be triggered.
Mad Max's vulkan benchmark is another, this one always freezes in some sort of cave, I think in the 3rd scene. Maybe something in there triggers it...

I'd be happy to try a patch or a R600_DEBUG parameter to make it a lot more verbose, or whatever you think is best of course!


Thank you!
Comment 9 Grazvydas Ignotas 2017-06-11 11:48:52 UTC
Is the process still alive when you ssh to the system with a hung GPU? If it is, you could attach gdb and try to get a backtrace of a hung thread.

You can try at least a few other things:
* compile mesa with --enable-debug if you aren't already, it will enable asserts that might detect something bad
* set a RADV_TRACE_FILE=/path/to/file environment variable; radv will then try to write a trace of GPU commands to that file if/when it detects a hang.

The trace file sometimes takes a few tries to produce successfully, but if you can get it, it might help to find the cause of the hang.
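
Since the trace may take several attempts, it can help to script the launch. The following is a minimal Python sketch of setting RADV_TRACE_FILE for a single run; `trace_env` and `run_with_trace` are hypothetical helper names, and the command and trace path you pass in are placeholders for your own.

```python
import os
import subprocess
from typing import Dict, List, Optional


def trace_env(trace_path: str,
              base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with RADV_TRACE_FILE set.

    The caller's mapping (or os.environ) is left unmodified.
    """
    env = dict(os.environ if base is None else base)
    env["RADV_TRACE_FILE"] = trace_path
    return env


def run_with_trace(cmd: List[str], trace_path: str) -> subprocess.CompletedProcess:
    """Run cmd with RADV_TRACE_FILE pointing at trace_path.

    check=False because a hung or crashed app is the expected case here.
    """
    return subprocess.run(cmd, env=trace_env(trace_path), check=False)
```

Using a fresh trace path for each attempt makes it easier to tell which runs actually produced a non-empty trace.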
Comment 10 John 2017-06-11 12:34:31 UTC
I'm not sure if it's thanks to debug, but now I get something in dmesg, not that helpful I'm afraid:

[  141.325269] raytracing[2417]: segfault at 8 ip 00007fd0b21e74d2 sp 00007ffc604d5520 error 4 in libvulkan_radeon.so[7fd0b2170000+1b3000]


The trace file has been empty every time I've tried. Is there a way to get a full trace of everything it's doing? Maybe that would make the last line or so useful.

As for gdb, it gets stuck on "attaching to process" and the process command in ps is displayed in square brackets.
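
The segfault line above does carry a little information. Below is a small Python sketch that decodes it, assuming the usual x86 Linux format (`ip` is the faulting instruction, the bracketed pair is the mapping's base and size, and the error code uses the x86 page-fault bits). `decode_segfault` is a hypothetical helper name.

```python
import re

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line: str) -> dict:
    """Decode a kernel 'segfault at ...' line (x86 format assumed)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"])
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped object.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


LINE = ("raytracing[2417]: segfault at 8 ip 00007fd0b21e74d2 sp 00007ffc604d5520 "
        "error 4 in libvulkan_radeon.so[7fd0b2170000+1b3000]")
info = decode_segfault(LINE)
```

Here the offset works out to 0x774d2, and error 4 decodes to a user-mode read of an unmapped page at address 8, which is consistent with reading a field through a NULL pointer. If the mapping shown starts at the beginning of the library file, that offset can be given to a symbolizer such as addr2line together with a debug build of libvulkan_radeon.so.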
Comment 11 John 2017-06-11 12:42:23 UTC
Created attachment 131874 [details]
gdb backtrace

Well, I've been able to get a backtrace thanks to screen.

That looks more interesting already.
Comment 12 Grazvydas Ignotas 2017-06-11 14:15:43 UTC
Created attachment 131877 [details] [review]
trace everything

I've sent a patch that should fix trace dumping for SI:
https://lists.freedesktop.org/archives/mesa-dev/2017-June/158739.html

If you want to trace everything, use the attached patch.
Comment 13 John 2017-06-11 14:48:43 UTC
Created attachment 131878 [details]
trace

The ML patch worked!

Here's the trace.

Thank you!
Comment 14 Grazvydas Ignotas 2017-06-11 15:18:50 UTC
Looks like you attached the wrong file.
Comment 15 John 2017-06-11 15:24:20 UTC
Created attachment 131879 [details]
trace

ooops
Comment 16 Grazvydas Ignotas 2017-06-15 20:27:06 UTC
Created attachment 131985 [details] [review]
flush try

From the trace it seems the hang is compute related. Comparing to radeonsi, radv seems to be missing some SI flushing workaround (from si_launch_grid()), maybe this patch will help?
Comment 17 John 2017-06-15 21:16:43 UTC
Created attachment 131986 [details]
trace

Hello, I don't think this patch changes the behavior. It froze in the same way and at the same point.
Here's the new trace.


One thing I just saw when running raytracing: if I don't interact with it, it seems to run fine, but as soon as I do interact with it, it'll freeze.

Thank you!
Comment 18 John 2017-06-15 21:27:34 UTC
Created attachment 131987 [details]
gdb backtrace

And here's the gdb backtrace, it looks different, so maybe something did change.
Comment 19 John 2017-06-24 06:46:17 UTC
Is there anything else I can provide to help?

Thanks!
Comment 20 Marko 2017-07-07 06:42:18 UTC
Hello,

I have a similar problem on an R9-270x with the latest Mesa git. I can check which commit my Mesa was built from, but honestly, I've had this issue since day one (it never worked, and I check periodically).

Dota2 and DOOM (over wine) both hang on the loading screen (the system freezes completely).
Talos Principle usually works, with texture corruption (there was a commit or two that hung it as well).


Let me know if I can help.

Regards,

Marko
Comment 21 John 2017-07-07 10:51:03 UTC
I believe that's a same generation card, so it would make sense to behave similarly.
Comment 22 Marko 2017-07-11 06:47:40 UTC
(In reply to John from comment #21)
> I believe that's a same generation card, so it would make sense to behave
> similarly.

Yeah. Did it ever work for you? RADV was a no-go on my card from day one.
Comment 23 John 2017-07-11 06:58:31 UTC
Some apps worked, others froze the system.

I'm still hopeful to find a fix here :)
Comment 24 Marko 2017-07-31 07:54:22 UTC
-ping
Comment 25 John 2017-08-01 09:14:23 UTC
Since I have seen Dave posting many SI-related fixes/workarounds on the ML, I've just tried again (on the latest amd-staging as well, in case there was something there too). On the radv side I am on "radv: handle 10-bit format clamping workaround."

I have not been able to freeze the system anymore using SaschaWillems/Vulkan/Raytracing. It also looks a lot better than previously, I don't know what it should look like, but this seems fine. So there's definitely some progress!

Alas, Mad Max's benchmark still hangs the system, it is in the same scene as usual (the third). As always, I was able to log in through SSH and saw nothing of value in dmesg or Xorg.log, but this time restarting from SSH did not really work, and I had to manually shutdown the computer and boot it. That could be unrelated to radv/amdgpu of course.



I was waiting to see more of the SI ML patches hitting the repo before testing again, but since Marko was asking I thought I'd try :)
Do things look somewhat better for you as well Marko?

Should we open a cleaner new bug report?

Thanks!
Comment 26 Marko 2017-08-02 05:29:54 UTC
Hi,

For the life of me I can't find the commit I was using yesterday, since I did a fresh checkout and deleted the source before testing, but it was one of the early 2017-08-01 commits IIRC.

DOOM and Dota2 still hang but get a bit further than before. In Dota2 I was actually able to click the "your client is out of date" message before it froze.

In DOOM I was able to get past the loading screen, and it froze just before displaying the main menu.

Regards,

Marko
Comment 27 John 2017-08-02 05:41:08 UTC
That does not seem much better in your case :/

Maybe I'll try a few emulators such as Dolphin and RPCS3 to see what happens.

It'd be great if we could get a dev back here :)
Comment 28 Dave Airlie 2017-08-02 05:51:14 UTC
can you try the patch on master?

https://patchwork.freedesktop.org/series/27906/
Comment 29 Marko 2017-08-02 06:20:35 UTC
(In reply to Dave Airlie from comment #28)
> can you try the patch on master?
> 
> https://patchwork.freedesktop.org/series/27906/

Thanks Dave, will try!

Marko
Comment 30 John 2017-08-02 06:22:17 UTC
(In reply to Dave Airlie from comment #28)
> can you try the patch on master?
> 
> https://patchwork.freedesktop.org/series/27906/

The behavior is the same for me as a day ago. Mad Max hung at the exact same spot, with nothing in dmesg or Xorg.0.log (and again I had to manually shut down the computer).
Comment 31 John 2017-08-02 07:04:55 UTC
Created attachment 133190 [details]
radv trace

Well I somewhat take it back.

I restarted my computer (which I hadn't done in my previous reply) and reran the game to get a trace with RADV_TRACE_FILE. This time I made it to the 4th benchmark, I believe it's the first time ever since I've tried radv & Mad Max together. It ended up freezing in a similar way in the 4th benchmark, so maybe it was luck.

I've attached the tiny trace, I'm not sure if it's helpful.
Comment 32 Marko 2017-08-03 08:25:37 UTC
Sorry, couldn't test yesterday due to build bug:
https://bugs.freedesktop.org/show_bug.cgi?id=102014

I'll recompile today and test this evening with any luck.

glhf, 

Marko
Comment 33 John 2017-08-05 08:13:06 UTC
Created attachment 133252 [details]
another trace

Since I saw Dave's commit about fixing a GPU hang, I thought of trying again but it's still no good. :/

I tried with the patch provided in #28 and latest master.

This time I got to freeze again during the third benchmark, at a different time than the usual one (if that means anything).


I've attached the trace.
Should I try to get a gdb one as well?
Comment 34 Dave Airlie 2017-08-07 04:04:28 UTC
Created attachment 133273 [details] [review]
hacky patch?

does this patch make any difference? (on top of the one from the list).
Comment 35 John 2017-08-07 04:42:01 UTC
Created attachment 133274 [details]
radv trace after 2nd patch

I've just tested with the hacky patch and the one from #28 and it seemed about the same.
Comment 36 Dave Airlie 2017-08-07 06:57:28 UTC
Created attachment 133276 [details] [review]
fix MSAA rendering.

I've sent this to the list, it at least fixes bad rendering in talos for me.

I've also just run the Phoronix Test Suite madmax benchmark to completion.
Comment 37 John 2017-08-07 07:25:44 UTC
I've just tried this patch on top of the other 2, but it didn't change much :/

I'll try running with PTS, maybe it's related to some settings that PTS may not use.
Comment 38 John 2017-08-07 07:48:27 UTC
Actually, could it be a kernel difference?

I'm on the last of amd-staging 4.11, what about you?
Comment 39 Dave Airlie 2017-08-07 08:07:41 UTC
https://patchwork.freedesktop.org/patch/169012/

I also have this patch in my tree.
Comment 40 Dave Airlie 2017-08-07 08:08:23 UTC
I'm on some random drm-next kernel from a while back based on 4.12, I'm still seeing hangs on vulkan cts anyways, but they are fairly random.
Comment 41 Marko 2017-08-07 08:47:31 UTC
> Created attachment 133252 [details]
> another trace
> 
> Since I saw Dave's commit about fixing a GPU hang, I thought of trying again
> but it's still no good. :/
> 
> I tried with the patch provided in #28 and latest master.
> 
> This time I got to freeze again during the third benchmark, at a different
> time than the usual one (if that means anything).
> 
> 
> I've attached the trace.
> Should I try to get a gdb one as well?

Didn't have time to properly test over the weekend, busy schedule. I did try DOOM only, it hung even before intro video which usually works - as soon as the game window was opened. 

FYI I'm on some 4.12 kernel if that means anything.

Btw, should I apply all the patches John suggested or only the latest?

Cheers,

Marko
Comment 42 John 2017-08-07 09:00:48 UTC
(In reply to Dave Airlie from comment #40)
> I'm on some random drm-next kernel from a while back based on 4.12, I'm
> still seeing hangs on vulkan cts anyways, but they are fairly random.


Then I guess my kernel should be unrelated.

(In reply to Dave Airlie from comment #39)
> https://patchwork.freedesktop.org/patch/169012/
> 
> I also have this patch in my tree.

So this one changed things a bit!

When it froze, I noticed very high disk I/O, but I didn't have time to SSH connect to see what it was about before it dropped. Also, I got this in dmesg:

[ 3124.363414] amdgpu 0000:01:00.0: GPU fault detected: 147 0x02e20402
[ 3124.363418] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00116A97
[ 3124.363419] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02004002
[ 3124.363421] amdgpu 0000:01:00.0: VM fault (0x02, vmid 1) at page 1141399, read from '' (0x00000000) (4)

It's the first time I got something in dmesg for that madmax freeze!

As Marko asked, should I keep testing with all patches suggested here or using less?


Finally I've tried with PTS and it froze as well :/ (that was before using the last patch).
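
As a side note on reading the VM fault lines above: the page number in the summary line is the same value as the VM_CONTEXT1_PROTECTION_FAULT_ADDR register, just printed in decimal. The Python sketch below checks that and converts it to a byte address. The 4 KiB page size is an assumption about how the driver reports page numbers, and `decode_vm_fault` is a hypothetical helper name.

```python
import re

GPU_PAGE_SIZE = 4096  # assumption: the driver reports 4 KiB page numbers

LOG = """\
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00116A97
amdgpu 0000:01:00.0: VM fault (0x02, vmid 1) at page 1141399, read from '' (0x00000000) (4)
"""


def decode_vm_fault(log: str) -> dict:
    """Extract the faulting page, VMID and access type from amdgpu fault lines."""
    reg = re.search(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)", log)
    summary = re.search(
        r"VM fault \(0x[0-9A-Fa-f]+, vmid (\d+)\) at page (\d+), (read|write)", log
    )
    if reg is None or summary is None:
        raise ValueError("log does not contain a complete VM fault report")
    page = int(summary.group(2))
    return {
        "vmid": int(summary.group(1)),
        "access": summary.group(3),
        "page": page,
        "register_page": int(reg.group(1), 16),
        "byte_address": page * GPU_PAGE_SIZE,
    }


info = decode_vm_fault(LOG)
```

Under that page-size assumption, the fault is a read at GPU virtual address 0x116A97000 in VMID 1, which may be useful to compare against buffer addresses in a radv trace.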
Comment 43 Marko 2017-08-07 16:25:12 UTC
(In reply to Dave Airlie from comment #39)
> https://patchwork.freedesktop.org/patch/169012/
> 
> I also have this patch in my tree.

Tested on fresh 4.13 RC3, today's Mesa git + this patch, still no joy. Freezes system as per usual.

Cheers,
Marko
Comment 44 John 2017-08-09 06:14:12 UTC
I've just tried on amd-staging 4.12, and without the hacky patch, and it still froze the same (still heavy I/O when it did, but nothing in dmesg this time).
Comment 45 Dave Airlie 2017-08-09 06:51:11 UTC
https://patchwork.freedesktop.org/series/28535/

is a replacement for the cs flush, might be worth trying on master on its own.

and possibly with the hack patch on top (it might not apply cleanly though).
Comment 46 John 2017-08-09 07:34:05 UTC
This patch only on master worked!

Congratulations Dave, this is the first time I've passed the whole benchmark!

I'll try a few other applications and wait on Marko's test before closing this, but so far it looks good.


Feel free to add if you couldn't reproduce the original issue:
Tested-by: John Ettedgui <john.ettedgui@gmail.com>
Comment 47 Marko 2017-08-09 08:16:59 UTC
That's great news then!

I'm compiling as we speak on Suse OBS, but I'll only be able to try it out this afternoon after I get home from work.

Cheers,
Marko
Comment 48 Marko 2017-08-09 17:11:40 UTC
It's working!

Today's pull + https://patchwork.freedesktop.org/series/28535/ + kernel 4.13-rc4.

Doom and Dota2 both work with no freezing. I'll test (read: play) some more later, but this finally seems to have fixed it.

Thanks!

Marko
Comment 49 John 2017-08-10 07:25:09 UTC
I've rerun the madmax benchmark in case it was luck, but it was fine.

I've also played with Dolphin for hours and it was fine (well, I think it was already fine before that patch).

I've tried a bit with RPCS3 as well, though that emulator has its own issue, and got no freeze either.

So far I have not found anything that causes problems. With Marko confirming and the patch making it to git, I think that's good enough to resolve this bug.

Thanks Grazvydas and Dave for the fixes!
Comment 50 Marko 2017-08-10 10:22:02 UTC
Hi everyone,

I've tested DOOM yesterday for maybe half an hour and you can color me impressed.
The FPS is in the 50's range for High/ultra, no glitches, stuttering or artifacts/texture corruption. No freezes either. The performance is on par with Windows, give or take. 

Dota2 also works like a charm with comparable, solid framerates (50+ on best quality).

Thanks for fixing this guys!

Marko

