Summary: | AMD SI cards: Some vulkan apps freeze the system | ||
---|---|---|---|
Product: | Mesa | Reporter: | John <john.ettedgui> |
Component: | Drivers/Vulkan/radeon | Assignee: | mesa-dev |
Status: | CLOSED FIXED | QA Contact: | mesa-dev |
Severity: | normal | ||
Priority: | medium | CC: | airlied, notasas, vedran |
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: | dmesg after the crash; no_si_prefetch; gdb backtrace; trace everything; trace; trace; flush try; trace; gdb backtrace; radv trace; another trace; hacky patch?; radv trace after 2nd patch; fix MSAA rendering. |
Hi John,

You correctly gave the version number for the release version of the software you use, but you didn't give any information for the development versions you mentioned. Taking Mesa as an example (and assuming you're using the master branch), there are currently 92809 different versions of mesa-git you could be using ;)

For git, the commit hash identifies the version, and for svn I think the commit number (e.g. 302914 for llvm-svn) is best.

Cheers,
Eric

Hello Eric,

Good point! I am currently on mesa ce53e8e61b (commit number 92807), and llvm 304967.
Would anything else help?

Thank you!
John

(In reply to John from comment #2)
> Would anything else help?

It would be really helpful if you could bisect the issue. This means picking an app (game) that used to work and no longer does, and running `git bisect` with this app to determine whether each tested commit has the issue. This page can help if you don't know how the commands work: https://git-scm.com/docs/git-bisect

Note that I barely know anything about radv, so unless there's something fairly obvious in the bad commit I won't be able to help past this.

Alright, after bisecting, here's the problematic commit: https://cgit.freedesktop.org/mesa/mesa/commit/?id=bcae3274692954ad2cd6dfc253579ec98d50856f
Thanks!

Adding Dave.

Created attachment 131842 [details] [review]
no_si_prefetch

Does this patch help?

I've sent a different version to the ML; testing that one would be preferred: https://lists.freedesktop.org/archives/mesa-dev/2017-June/158700.html

Grazvydas, I've rebuilt mesa at the faulty commit with your 2 patches and it worked as well as before that commit. Thank you for the quick fix!

Now, if possible, I'd love to look at the freezes I've had since my first test months ago, and still have.
The behavior is somewhat different from the one fixed by Grazvydas' patches: the application starts and runs fine for a bit, from a few seconds to a few minutes, and then the PC seems to freeze much like in my previously described issue. I still have SSH access, yet trying to restart never works.

These freezes are considerably more annoying to debug, though, as no error gets displayed in dmesg, and since it has always been a problem for me I have no good commit for a bisection... I've looked at the Xorg logs as well but saw nothing there either.

A simple test for this is to run SaschaWillems/Vulkan/Raytracing; after moving around for a few seconds the issue will be triggered. Mad Max's Vulkan benchmark is another; this one always freezes in some sort of cave, I think in the 3rd scene. Maybe something in there triggers it...

I'd be happy to try a patch or a R600_DEBUG parameter to make it a lot more verbose, or whatever you think is best of course!
Thank you!

Is the process still alive when you ssh to the system with a hung GPU? If it is, you could attach gdb and try to get a backtrace of a hung thread.

You can try at least a few other things:
* compile mesa with --enable-debug if you aren't already; it will enable asserts that might detect something bad
* set a RADV_TRACE_FILE=/path/to/file environment variable; it will then try to write out a trace of GPU commands to that file if/when it detects a hang.

The trace file sometimes takes a few tries to produce successfully, but if you can get it, it might help to find the cause of the hang.

I'm not sure if it's thanks to debug, but now I get something in dmesg, not that helpful I'm afraid:
[ 141.325269] raytracing[2417]: segfault at 8 ip 00007fd0b21e74d2 sp 00007ffc604d5520 error 4 in libvulkan_radeon.so[7fd0b2170000+1b3000]

The trace file has been empty the various times I've tried. Is there a way to get a full trace of everything it's doing? Maybe that would allow the last line or so to be useful.

As for gdb, it gets stuck on "attaching to process" and the process command in ps is displayed in square brackets.

Created attachment 131874 [details]
gdb backtrace
Well, I've been able to get a backtrace thanks to screen.
That looks more interesting already.
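The segfault line from dmesg quoted earlier in the thread can be narrowed down a little even without a backtrace: subtracting the mapping base shown in square brackets from the instruction pointer gives an offset inside libvulkan_radeon.so. The following is only a sketch of that arithmetic; the function name `library_offset` and the regular expression are illustrative, and it assumes the bracketed value is the start of the mapping that contains the faulting instruction.

```python
import re

# Pattern for the kernel's user-space segfault message as quoted in this thread.
# The regex and the function name are illustrative, not part of any tool.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def library_offset(line):
    """Return (library name, offset of the faulting instruction in its mapping)."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a segfault line")
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    size = int(match["size"], 16)
    offset = ip - base
    # Sanity check: the instruction pointer must fall inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer outside the reported mapping")
    return match["lib"], offset


LINE = (
    "[ 141.325269] raytracing[2417]: segfault at 8 ip 00007fd0b21e74d2 "
    "sp 00007ffc604d5520 error 4 in libvulkan_radeon.so[7fd0b2170000+1b3000]"
)

lib, offset = library_offset(LINE)
print(f"{lib}+{offset:#x}")  # prints "libvulkan_radeon.so+0x774d2"
```

With a build that has debug symbols, an offset like this can usually be handed to a tool such as `addr2line -e libvulkan_radeon.so` to find the source line, though whether the offset maps directly onto the file depends on how the library was loaded.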
Created attachment 131877 [details] [review]
trace everything

I've sent a patch that should fix trace dumping for SI: https://lists.freedesktop.org/archives/mesa-dev/2017-June/158739.html
If you want to trace everything, use the attached patch.

Created attachment 131878 [details]
trace
The ML patch worked!
Here's the trace.
Thank you!
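Since the trace file reportedly takes a few tries to produce, it can be convenient to wrap the launch in a small script. The sketch below only sets the RADV_TRACE_FILE variable mentioned in this thread and starts a command; the helper names `radv_trace_env` and `run_with_radv_trace` and the path `/tmp/radv.trace` are made up for illustration, and the child command is a stand-in that just prints the variable (a real run would start the Vulkan application instead).

```python
import os
import subprocess
import sys


def radv_trace_env(trace_path, base_env=None):
    """Copy the environment and point RADV_TRACE_FILE at trace_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["RADV_TRACE_FILE"] = os.fspath(trace_path)
    return env


def run_with_radv_trace(command, trace_path, **run_kwargs):
    """Run command (a list of arguments) with RADV_TRACE_FILE set."""
    return subprocess.run(command, env=radv_trace_env(trace_path), **run_kwargs)


# Stand-in child process: prints the variable it inherited.
# Replace with the real application, e.g. the raytracing demo binary.
child = [sys.executable, "-c", "import os; print(os.environ['RADV_TRACE_FILE'])"]
result = run_with_radv_trace(child, "/tmp/radv.trace", capture_output=True, text=True)
print(result.stdout.strip())
```

Copying the parent environment rather than building a fresh one keeps variables such as DISPLAY intact, which a graphical application would need.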
Looks like you attached the wrong file.

Created attachment 131879 [details]
trace
ooops
Created attachment 131985 [details] [review]
flush try

From the trace it seems the hang is compute related. Compared to radeonsi, radv seems to be missing some SI flushing workaround (from si_launch_grid()); maybe this patch will help?

Created attachment 131986 [details]
trace
Hello, I don't think this patch changes the behavior. It froze in the same way and time.
Here's the new trace.
One thing I just saw when running raytracing: if I don't interact with it, it seems to run fine, but as soon as I do interact with it, it'll freeze.
Thank you!
Created attachment 131987 [details]
gdb backtrace
And here's the gdb backtrace, it looks different, so maybe something did change.
Is there anything else I can provide to help?
Thanks!

Hello,

I have a similar problem on an R9-270x with the latest Mesa-git. I can check which commit Mesa was built from, but honestly, I have had this issue since day one (it never worked, and I check periodically).

Dota2 and DOOM (over wine) both hang on the loading screen (the system freezes completely). Talos Principle usually works, with texture corruption (there was a commit or two that hung it as well).

Let me know if I can help.

Regards,
Marko

I believe that's a same-generation card, so it would make sense for it to behave similarly.

(In reply to John from comment #21)
> I believe that's a same-generation card, so it would make sense for it to behave
> similarly.

Yeah. Did it ever work for you? RADV was a no-go on my card from day one.

Some apps worked, others froze the system. I'm still hopeful to find a fix here :)

-ping

Since I have seen Dave posting many SI-related fixes/workarounds on the ML, I've just tried again (on the latest amd-staging as well, in case there was something there too...). radv-wise I am on "radv: handle 10-bit format clamping workaround."

I have not been able to freeze the system anymore using SaschaWillems/Vulkan/Raytracing. It also looks a lot better than previously; I don't know what it should look like, but this seems fine. So there's definitely some progress!

Alas, Mad Max's benchmark still hangs the system, in the same scene as usual (the third). As always, I was able to log in through SSH and saw nothing of value in dmesg or Xorg.log, but this time restarting from SSH did not really work, and I had to manually shut down the computer and boot it. That could be unrelated to radv/amdgpu of course.

I was waiting to see more of the SI ML patches hitting the repo before testing again, but since Marko was asking I thought I'd try :)

Do things look somewhat better for you as well, Marko? Should we open a cleaner new bug report?
Thanks!

Hi,

For the life of me I can't find the commit I was using yesterday, since I did a fresh checkout and deleted the source before testing, but it was something like the early 2017-08-01 commits IIRC.

DOOM and Dota2 still hang but get a bit further than before. In Dota2 I was actually able to click the "your client is out of date" message before it froze. In DOOM I was able to get past the loading screen, and it froze just before displaying the main menu.

Regards,
Marko

That does not seem much better in your case :/
Maybe I'll try a few emulators such as Dolphin and RPCS3 to see what happens.
It'd be great if we could get a dev back here :)

can you try the patch on master?
https://patchwork.freedesktop.org/series/27906/

(In reply to Dave Airlie from comment #28)
> can you try the patch on master?
>
> https://patchwork.freedesktop.org/series/27906/

Thanks Dave, will try!
Marko

(In reply to Dave Airlie from comment #28)
> can you try the patch on master?
>
> https://patchwork.freedesktop.org/series/27906/

The behavior is the same for me as a day ago. Mad Max hung at the exact same spot, with nothing in dmesg or Xorg.0.log (and again I had to manually shut down the computer).

Created attachment 133190 [details]
radv trace
Well I somewhat take it back.
I restarted my computer (which I hadn't done in my previous reply) and reran the game to get a trace with RADV_TRACE_FILE. This time I made it to the 4th benchmark, I believe it's the first time ever since I've tried radv & Mad Max together. It ended up freezing in a similar way in the 4th benchmark, so maybe it was luck.
I've attached the tiny trace, I'm not sure if it's helpful.
Sorry, couldn't test yesterday due to a build bug: https://bugs.freedesktop.org/show_bug.cgi?id=102014
I'll recompile today and test this evening with any luck.

glhf,
Marko

Created attachment 133252 [details]
another trace
Since I saw Dave's commit about fixing a GPU hang, I thought of trying again but it's still no good. :/
I tried with the patch provided in #28 and latest master.
This time I got to freeze again during the third benchmark, at a different time than the usual one (if that means anything).
I've attached the trace.
Should I try to get a gdb one as well?
Created attachment 133273 [details] [review]
hacky patch?

Does this patch make any difference (on top of the one from the list)?

Created attachment 133274 [details]
radv trace after 2nd patch
I've just tested with the hacky patch and the one from #28 and it seemed about the same.
Created attachment 133276 [details] [review]
fix MSAA rendering.

I've sent this to the list; it at least fixes bad rendering in Talos for me. I've also just run the Phoronix Test Suite Mad Max benchmark to completion.

I've just tried this patch on top of the other 2, but it didn't change much :/
I'll try running with PTS; maybe it's related to some settings that PTS may not use.

Actually, could it be a kernel difference? I'm on the latest amd-staging 4.11, what about you?

https://patchwork.freedesktop.org/patch/169012/

I also have this patch in my tree.

I'm on some random drm-next kernel from a while back based on 4.12, I'm still seeing hangs on vulkan cts anyways, but they are fairly random.
> Created attachment 133252 [details]
> another trace
>
> Since I saw Dave's commit about fixing a GPU hang, I thought of trying again
> but it's still no good. :/
>
> I tried with the patch provided in #28 and latest master.
>
> This time I got to freeze again during the third benchmark, at a different
> time than the usual one (if that means anything).
>
>
> I've attached the trace.
> Should I try to get a gdb one as well?
Didn't have time to properly test over the weekend, busy schedule. I did try DOOM only, it hung even before intro video which usually works - as soon as the game window was opened.
FYI I'm on some 4.12 kernel if that means anything.
Btw, should I apply all the patches John suggested or only the latest?
Cheers,
Marko
(In reply to Dave Airlie from comment #40)
> I'm on some random drm-next kernel from a while back based on 4.12, I'm
> still seeing hangs on vulkan cts anyways, but they are fairly random.

Then I guess my kernel should be unrelated.

(In reply to Dave Airlie from comment #39)
> https://patchwork.freedesktop.org/patch/169012/
>
> I also have this patch in my tree.

So this one changed things a bit! When it froze, I noticed very high disk I/O, but I didn't have time to connect over SSH to see what it was about before it dropped.
Also, I got this in dmesg:
[ 3124.363414] amdgpu 0000:01:00.0: GPU fault detected: 147 0x02e20402
[ 3124.363418] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00116A97
[ 3124.363419] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02004002
[ 3124.363421] amdgpu 0000:01:00.0: VM fault (0x02, vmid 1) at page 1141399, read from '' (0x00000000) (4)

It's the first time I got something in dmesg for that Mad Max freeze!

As Marko asked, should I keep testing with all the patches suggested here, or with fewer?

Finally, I've tried with PTS and it froze as well :/ (that was before using the last patch).

(In reply to Dave Airlie from comment #39)
> https://patchwork.freedesktop.org/patch/169012/
>
> I also have this patch in my tree.

Tested on fresh 4.13 RC3, today's Mesa git + this patch, still no joy. Freezes system as per usual.

Cheers,
Marko

I've just tried on amd-staging 4.12, without the hacky patch, and it still froze the same (still heavy I/O when it did, but nothing in dmesg this time).

https://patchwork.freedesktop.org/series/28535/ is a replacement for the cs flush; it might be worth trying on master on its own, and possibly with the hack patch on top (it might not apply cleanly though).

This patch alone on master worked!
Congratulations Dave, this is the first time I've passed the whole benchmark!

I'll try a few other applications and wait for Marko's test before closing this, but so far it looks good.

Feel free to add if you couldn't reproduce the original issue:
Tested-by: John Ettedgui <john.ettedgui@gmail.com>

That's great news then!
I'm compiling as we speak on Suse OBS, but will be able to try it out only this afternoon after I get home from work.

Cheers,
Marko

It's working!
Today's pull + https://patchwork.freedesktop.org/series/28535/ + kernel 4.13-rc4.
Doom and Dota2 both work with no freezing. I'll test (read: play) some more later, but this finally seems to have fixed it.

Thanks!
Marko

I've rerun the Mad Max benchmark in case it was luck, but it was fine. I've also played with Dolphin for hours and it was fine (well, I think it was already fine before that patch). I've tried a bit with RPCS3 as well, though that emulator has its own issues, and got no freeze either.

So far I have not found anything that would cause a problem; with Marko confirming, and with the patch making it to git, I think that's good enough for resolving this bug.

Thanks Grazvydas and Dave for the fixes!

Hi everyone,

I tested DOOM yesterday for maybe half an hour and you can color me impressed. The FPS is in the 50s range for High/Ultra, with no glitches, stuttering or artifacts/texture corruption. No freezes either. The performance is on par with Windows, give or take.

Dota2 also works like a charm with comparable, solid framerates (50+ on best quality).

Thanks for fixing this, guys!
Marko |
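For anyone reading the amdgpu VM fault lines quoted in this thread: the value in VM_CONTEXT1_PROTECTION_FAULT_ADDR and the "at page" number appear to be the same quantity, once in hex and once in decimal. The sketch below parses the quoted lines and checks that; the helper name `parse_vm_fault` and the regular expressions are illustrative, and the 4 KiB page size used to derive a byte address is an assumption, not something stated in the thread.

```python
import re

PAGE_SIZE = 4096  # assumption: the fault address is counted in 4 KiB pages

FAULT_ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
VM_FAULT_RE = re.compile(
    r"VM fault \(0x[0-9A-Fa-f]+, vmid (\d+)\) at page (\d+), (read|write) from"
)


def parse_vm_fault(log_text):
    """Extract the fault register, vmid, page and access type from dmesg text."""
    addr = FAULT_ADDR_RE.search(log_text)
    fault = VM_FAULT_RE.search(log_text)
    if addr is None or fault is None:
        raise ValueError("no VM fault found in log")
    page = int(fault.group(2))
    return {
        "fault_addr_reg": int(addr.group(1), 16),
        "vmid": int(fault.group(1)),
        "page": page,
        "access": fault.group(3),
        "byte_address": page * PAGE_SIZE,  # relies on the PAGE_SIZE assumption
    }


# The four dmesg lines exactly as quoted in the thread.
DMESG = """\
[ 3124.363414] amdgpu 0000:01:00.0: GPU fault detected: 147 0x02e20402
[ 3124.363418] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00116A97
[ 3124.363419] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02004002
[ 3124.363421] amdgpu 0000:01:00.0: VM fault (0x02, vmid 1) at page 1141399, read from '' (0x00000000) (4)
"""

info = parse_vm_fault(DMESG)
print(hex(info["fault_addr_reg"]), info["page"], info["access"], hex(info["byte_address"]))
```

Here 0x00116A97 is 1141399 in decimal, so the two lines agree, and the fault is a read by VMID 1 from a page that was presumably not mapped at that moment.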
Created attachment 131788 [details]
dmesg after the crash

Hello,

I haven't tried using radv/Vulkan in a few months, but last time I tried, most of the apps/games worked and only a few crashed the system; now it seems all of them have issues.

I am using a 280x, on agd5f's linux-staging 4.11 (but the system freezes as well on standard 4.11.3). Xorg is 1.19.3.

Once the computer hangs, I'm unable to use it directly, but I can still connect from another machine through SSH. I do have to manually force a restart though.

Games I've tried which froze the system:
- Mad Max (it always was bad for me).
- Serious Sam Fusion (that one used to work, as long as no steam overlay).

Apps I've tried with similar results:
- Some projects from SaschaWillems/Vulkan (some used to run, others would just refuse to, but no crash in the past).
- The Dolphin emulator (which used to work as well).

I am on mesa-git, llvm-svn and ddx-git as well.

I'm happy to test patches as needed.
Thanks!
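Because several of these applications used to work, the regression lends itself to the bisection suggested in the thread. The sketch below is not `git bisect` itself, only a toy model of the binary search it performs, to show why even tens of thousands of commits need only a handful of test runs. The function `first_bad`, the commit names and the `is_bad` predicate are all hypothetical; in practice "is bad" means building Mesa at that commit and checking whether the chosen app freezes.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    Returns (first bad commit, number of commits that had to be tested).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], tested


# Toy history: 1000 commits, with the (made-up) regression introduced at c700.
history = [f"c{i}" for i in range(1000)]
culprit, tested = first_bad(history, lambda commit: int(commit[1:]) >= 700)
print(culprit, tested)
```

The search halves the candidate range on every test, so the number of builds grows with the logarithm of the number of commits between the known-good and known-bad points.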