Summary: | TF2: GPU lockup on HD 7950 | ||
---|---|---|---|
Product: | Mesa | Reporter: | Stephen Liang <hi+freedesktop> |
Component: | Drivers/Gallium/radeonsi | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | Default DRI bug account <dri-devel> |
Severity: | normal | ||
Priority: | medium | ||
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
dmesg on kernel 4.7
output from GALLIUM_DDEBUG="800 noflush" steam steam://rungameid/440
Four different dmesg outputs during a hang with GPU soft reset enabled
Kernel log, dmesg, gallium debug logs during GPU hang and system crash |
Description
Stephen Liang
2016-05-29 21:07:11 UTC
Created attachment 124166 [details]
output from GALLIUM_DDEBUG="800 noflush" steam steam://rungameid/440
One more thing: I ran an apitrace and replayed it to see whether that reproduces the hang. The replay didn't hang my machine initially, but after running it 2-3 times it did. I would attach the apitrace, but it is 2-3 GB and the hang is several hundred thousand frames in (and it's not reproducible on every replay, just sometimes).

If the trace reproduces the hang at least occasionally, it may still be useful. Can you upload it to Google Drive or a similar service?

Here's the trace: https://drive.google.com/file/d/0B9JbOVmvb9ugd3ZlbGJRUVFDdmc/view?usp=sharing
SHA Sum: 7a368bb2a95dd939df604e7c8f239953e2957ec9
The hang occurs near the end, after the team selection screen, while walking out. If the GPU doesn't hang, then you'll see it stutter and then eventually recover. This particular trace was with a GPU soft reset so it recovered. I'll see if I can get another one without GPU soft reset.

I just ran the apitrace 6 times in a row and it finally hung on the 6th try, so it's fairly difficult to get apitrace to hang the GPU, but it eventually does.
On this run, I'm using:
Kernel: 4.4.9-300.fc23.x86_64 with radeon.lockup_timeout=0
xorg-x11-drv-ati: 7.6.1-3.20160215gitd41fccc.fc23
strace on apitrace:
strace: Process 5913 attached
wait4(5914,
strace on glretrace:
strace: Process 5914 attached
futex(0xff92ec, FUTEX_WAIT_PRIVATE, 5359, NULL

Created attachment 124197 [details]
Four different dmesg outputs during a hang with GPU soft reset enabled
I posted the previous comment too soon. The hang while using apitrace hangs everything except the mouse cursor. I'm unable to switch to text mode (ctrl+f1) and there is nothing new in dmesg other than the bootup messages (I was able to get this via SSH).
As an aside, I've also attached 4 different GPU hang messages from dmesg from various different hangs while I was playing TF2.
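The checksum-and-retry routine described for the trace can be sketched in shell. This is only an illustration under stated assumptions: the quoted checksum is 40 hex digits, which matches the length of a SHA-1 digest, so `sha1sum` is assumed; the function names `verify_sha1` and `replay_until_timeout` are hypothetical; and a run that exceeds its time limit is treated as a suspected hang. A hang that freezes the whole machine would not be caught this way.

```shell
#!/bin/sh
# Sketch only: function names and the timeout heuristic are assumptions,
# not part of apitrace or Mesa.

# Succeed if the SHA-1 digest of file "$1" equals the expected hex digest "$2".
verify_sha1() {
    actual=$(sha1sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Usage: replay_until_timeout MAX_RUNS LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to MAX_RUNS times, each under a time limit.
# Prints the number of the first run that exceeds the limit and returns 0;
# returns 1 if every run finished in time.
replay_until_timeout() {
    max_runs=$1
    limit=$2
    shift 2
    run=1
    while [ "$run" -le "$max_runs" ]; do
        status=0
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        # GNU timeout exits with status 124 when the time limit expires.
        if [ "$status" -eq 124 ]; then
            echo "run $run exceeded ${limit}s"
            return 0
        fi
        run=$((run + 1))
    done
    echo "no run exceeded ${limit}s in $max_runs attempts"
    return 1
}
```

With a downloaded trace saved under a hypothetical name such as `tf2.trace`, one might run `verify_sha1 tf2.trace 7a368bb2a95dd939df604e7c8f239953e2957ec9 && replay_until_timeout 6 3600 glretrace tf2.trace`. The limit has to be longer than a normal replay of the trace, and the 3600 seconds here is a placeholder.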
Does the issue go away if you set any of these options?
R600_DEBUG=noce,nohyperz

Created attachment 124225 [details]
Kernel log, dmesg, gallium debug logs during GPU hang and system crash
No good but I've got quite a bit of information! I tried with this command:
R600_DEBUG=noce,nohyperz GALLIUM_DDEBUG="800 noflush" steam steam://rungameid/440
The GPU recovered and I was able to alt-tab to gnome terminal and grab the dmesg. See "dmesg_after_alt_tab.txt".
I then continued to play the game and it hung again. This time, the entire screen froze with only the mouse being able to move. Luckily, this time I was able to drop to TTY4, and grab a dmesg. See "dmesg_after_hard_hang.txt", there is certainly a lot more output here with quite a bit of call traces which I've never seen before.
Since the mouse still moved, I wondered whether this was an X issue, so I decided to restart X. I ran "sudo service gdm restart" and it crashed the system, with a call trace displayed on screen in text mode. After rebooting, I checked the kernel logs and found a lot of call traces, so it seems whatever happened left the kernel in a bad state that required a hard reset. See "kernel_log_during_gdm_restart.txt".
Finally, since I ran this with GALLIUM_DDEBUG="800 noflush", I was able to collect another gallium debug that goes along with this. I really hope this information helps capture the state of the system at the time of crash. Please let me know if you want me to try anything else.
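The test above set both R600_DEBUG options together. To check whether either one alone changes the behaviour, the same launch could be repeated once per option set, as in the sketch below. The helper name `try_debug_options` is hypothetical, and the sketch assumes the launched command blocks until the game exits, which may not hold for the `steam` launcher.

```shell
#!/bin/sh
# Sketch only: try_debug_options is a hypothetical helper, not a Mesa tool.

# Usage: try_debug_options COMMAND [ARGS...]
# Runs COMMAND once for each candidate R600_DEBUG value, keeping
# GALLIUM_DDEBUG="800 noflush" as in the original test.
try_debug_options() {
    for opts in noce nohyperz noce,nohyperz; do
        echo "== R600_DEBUG=$opts =="
        # The variables are set for this one invocation only.
        R600_DEBUG=$opts GALLIUM_DDEBUG="800 noflush" "$@" \
            || echo "exit status $?"
    done
}
```

For example, `try_debug_options steam steam://rungameid/440` would launch the game three times, printing a header before each run so that logs collected afterwards can be matched to the option set in use.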
Hi Stephen, if you still have the same machine/setup, it may be worth trying the apitrace 1) that I mentioned here: https://bugs.freedesktop.org/show_bug.cgi?id=100712#c7 . I wonder if it is the same issue.
About "dmesg_after_hard_hang.txt" / "kernel_log_during_gdm_restart.txt": all these identical backtraces happen because ring 0 (GFX) is still stuck after the GPU reset, so acceleration is marked as disabled. Mesa then clears some buffers that were cached before the reset, and this hits the WARN_ON in radeon_ttm_bo_destroy; see https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon_object.c?h=amd-staging-4.9#n72. I recently submitted a patch to change it to a WARN_ON_ONCE. But the real problem is that ring 0 stalled in the first place. For me, setting R600_DEBUG=nowc works around the problem.
Also, I have not found a full dmesg log in the attachments, in particular the messages generated at startup.

This old commit should fix the lockups: https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea

Assuming this was fixed as per comment 10 and closing. Please reopen if this is not the case. |