96271 – TF2: GPU lockup on HD 7950

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Reported:	2016-05-29 21:07 UTC by Stephen Liang
Modified:	2018-03-31 23:40 UTC
CC List:	0 users

Attachments
dmesg on kernel 4.7 (2.38 KB, text/plain) 2016-05-29 21:07 UTC, Stephen Liang
output from GALLIUM_DDEBUG="800 noflush" steam steam://rungameid/440 (1.15 MB, text/plain) 2016-05-29 21:09 UTC, Stephen Liang
Four different dmesg outputs during a hang with GPU soft reset enabled (695 bytes, application/force-download) 2016-05-30 23:29 UTC, Stephen Liang
Kernel log, dmesg, gallium debug logs during GPU hang and system crash (68.98 KB, application/force-download) 2016-06-01 03:51 UTC, Stephen Liang

Description Stephen Liang 2016-05-29 21:07:11 UTC

Created attachment 124165 [details]
dmesg on kernel 4.7

I'm getting a GPU lockup in Team Fortress 2. It seems to be related to bug #80419, since my dmesg matches the one in comment #44 there, but here it occurs in Team Fortress 2 rather than XCOM.

If I run native TF2 with GALLIUM_DDEBUG=800, I cannot reproduce the bug, but the frame rate drops significantly. With GALLIUM_DDEBUG="800 noflush", I was able to reproduce the hang. This leads me to believe that the issue is related to how high the frame rate is.

My kernel is: Linux localhost.localdomain 4.4.9-300.fc23.x86_64
I've also tried 4.6 and 4.7, all of which hang.

My ATI driver is: 7.7.99-3.20160524git040a7b8.fc23 (latest commit in git for radeon)
I've also tried 7.6.1-3.20160215gitd41fccc.fc23 (the latest stable from Fedora 23), which also hangs.

My mesa version is: 11.3.0-0.37.git357495b.fc23
I've also tried 11.1.0-2.20151218.fc23 (the latest stable from Fedora 23), which also hangs.

I've attached the GALLIUM_DDEBUG output and dmesg. One curious thing is that on kernel 4.7, the kernel detects the hang and is able to reset the GPU, resulting in a temporary freeze that eventually resumes. However, the freeze recurs after a short while (and again recovers).

As a data point, I don't get any lockups while playing Skyrim on Wine.

Comment 1 Stephen Liang 2016-05-29 21:09:12 UTC

Created attachment 124166 [details]
output from GALLIUM_DDEBUG="800 noflush" steam steam://rungameid/440

Comment 2 Stephen Liang 2016-05-29 21:15:24 UTC

One more thing: I recorded an apitrace and replayed it to see if that reproduces the hang. The replay didn't hang my machine initially, but after running it 2-3 times, it did hang the machine. I would attach the apitrace, but it is 2-3 GB and the hang is several hundred thousand frames in (and it's not reproducible every time during replay, just sometimes).

Comment 3 Nicolai Hähnle 2016-05-30 22:01:26 UTC

If the trace reproduces the hang at least occasionally, it may still be useful. Can you upload it to Google Drive or a similar service?

Comment 4 Stephen Liang 2016-05-30 22:14:31 UTC

Here's the trace: https://drive.google.com/file/d/0B9JbOVmvb9ugd3ZlbGJRUVFDdmc/view?usp=sharing

SHA Sum: 7a368bb2a95dd939df604e7c8f239953e2957ec9

The hang occurs near the end after the team selection screen while walking out. If the GPU doesn't hang, then you'll see it stutter and then eventually recover. This particular trace was with a GPU soft reset so it recovered. 

I'll see if I can get another one without GPU soft reset.
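
For anyone downloading the trace, here is a minimal sketch for checking the file against the checksum above. It assumes the 40-hex-digit sum is SHA-1 (the comment does not name the algorithm), and the file name used in the usage note is hypothetical.

```python
import hashlib

# Checksum quoted in the comment above; assumed to be SHA-1 (40 hex digits).
EXPECTED_SHA1 = "7a368bb2a95dd939df604e7c8f239953e2957ec9"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks
    so a multi-gigabyte trace does not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA1):
    """True if the file's SHA-1 equals the expected digest (case-insensitive)."""
    return sha1_of_file(path) == expected.lower()
```

For example, matches_expected("tf2.trace") would return True for an intact download, where "tf2.trace" stands in for wherever the file was saved.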

Comment 5 Stephen Liang 2016-05-30 23:17:23 UTC

I just ran the apitrace 6 times in a row and it finally hung on the 6th try, so it's fairly difficult to get apitrace to hang the GPU but it eventually does do it. 

On this run, I'm using:

Kernel: 4.4.9-300.fc23.x86_64 with radeon.lockup_timeout=0
xorg-x11-drv-ati: 7.6.1-3.20160215gitd41fccc.fc23

strace on apitrace:

strace: Process 5913 attached
wait4(5914,

strace on glretrace:

strace: Process 5914 attached
futex(0xff92ec, FUTEX_WAIT_PRIVATE, 5359, NULL
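
Since the hang only shows up on some replays, the repeated runs described above could be automated roughly as follows. This is a sketch, not what was actually used for this report: the function name, the run count, and the timeout are illustrative assumptions, and a run is only "suspected" hung because it exceeded the timeout.

```python
import subprocess


def replay_until_hang(cmd, max_runs=10, timeout_s=600):
    """Run `cmd` up to `max_runs` times.

    Returns the 1-based index of the first run that did not finish within
    `timeout_s` seconds (treated as a suspected hang), or None if every
    run completed in time.
    """
    for run in range(1, max_runs + 1):
        try:
            # check=False: a non-zero exit status is not treated as a hang.
            subprocess.run(cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            return run
    return None
```

A call such as replay_until_hang(["apitrace", "replay", "tf2.trace"]) would replay a (hypothetically named) trace file until one run times out. The timeout has to be longer than a normal full replay. Note that when the GPU is really locked up, the replay process may not be killable, in which case this wrapper itself can block; running it from an SSH session keeps the machine reachable.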

Comment 6 Stephen Liang 2016-05-30 23:29:32 UTC

Created attachment 124197 [details]
Four different dmesg outputs during a hang with GPU soft reset enabled

I posted the previous comment too soon. The hang while using apitrace hangs everything except the mouse cursor. I'm unable to switch to text mode (ctrl+f1) and there is nothing new in dmesg other than the bootup messages (I was able to get this via SSH).

As an aside, I've also attached 4 different GPU hang messages from dmesg from various different hangs while I was playing TF2.

Comment 7 Marek Olšák 2016-05-31 11:03:20 UTC

Does the issue go away if you set any of these options? R600_DEBUG=noce,nohyperz

Comment 8 Stephen Liang 2016-06-01 03:51:58 UTC

Created attachment 124225 [details]
Kernel log, dmesg, gallium debug logs during GPU hang and system crash

No good but I've got quite a bit of information! I tried with this command:

R600_DEBUG=noce,nohyperz GALLIUM_DDEBUG="800 noflush" steam steam://rungameid/440

The GPU recovered and I was able to alt-tab to gnome terminal and grab the dmesg. See "dmesg_after_alt_tab.txt".

I then continued to play the game and it hung again. This time, the entire screen froze and only the mouse could still move. Luckily, I was able to drop to TTY4 and grab a dmesg. See "dmesg_after_hard_hang.txt"; there is certainly a lot more output here, with quite a few call traces that I've never seen before.

Since the mouse still moved, I wondered whether this was an X issue and decided to restart X. I ran "sudo service gdm restart" and it crashed the system, with a call trace displayed on screen in text mode. After rebooting, I checked the kernel logs and could see a lot of call traces, so it seems that whatever happened ended up tainting the kernel, and the system needed a hard reset. See "kernel_log_during_gdm_restart.txt".

Finally, since I ran this with GALLIUM_DDEBUG="800 noflush", I was able to collect another gallium debug that goes along with this. I really hope this information helps capture the state of the system at the time of crash. Please let me know if you want me to try anything else.
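
The command above sets both R600_DEBUG options at once. To try them one at a time as well, the launch could be scripted along these lines. This is only a sketch: the helper names are made up for illustration, and the only variables and values it sets are the ones already used in this report.

```python
import os
import subprocess


def debug_env(r600_debug="noce,nohyperz", ddebug="800 noflush", base=None):
    """Return a copy of the environment with the two debug variables set.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = r600_debug
    env["GALLIUM_DDEBUG"] = ddebug
    return env


def launch_tf2(env):
    """Start TF2 through Steam with the given environment, as in the command above."""
    return subprocess.Popen(["steam", "steam://rungameid/440"], env=env)
```

For example, launch_tf2(debug_env(r600_debug="noce")) followed later by launch_tf2(debug_env(r600_debug="nohyperz")) would test each option separately, which may help narrow down whether either one has any effect on the hang.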

Comment 9 Julien Isorce 2017-04-28 08:11:22 UTC

Hi Stephen, if you still have the same machine/setup, it may be worth trying the apitrace 1) that I mentioned here: https://bugs.freedesktop.org/show_bug.cgi?id=100712#c7 . I wonder if it is the same issue.

Regarding "dmesg_after_hard_hang.txt" and "kernel_log_during_gdm_restart.txt": all of these identical backtraces occur because ring 0 (GFX) is still stuck after the GPU reset, so acceleration is marked as disabled. Mesa then clears some buffers that were cached before the reset, which hits the WARN_ON in radeon_ttm_bo_destroy, see https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon_object.c?h=amd-staging-4.9#n72. I recently submitted a patch to change it to a WARN_ON_ONCE.

But the real problem is that ring 0 stalled in the first place. In my case, I found that setting R600_DEBUG=nowc works around the problem.

Also, I have not found a full dmesg log in the attachments, in particular the information generated at startup.

Comment 10 Marek Olšák 2017-04-28 18:46:11 UTC

This old commit should fix the lockups:
https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea

Comment 11 Timothy Arceri 2018-03-31 23:40:01 UTC

Assuming this was fixed as per comment 10 and closing. Please reopen if this is not the case.
