Bug 90378

Summary: [LLVM][bisected] GPU lockups in Left 4 Dead 2
Product: Mesa
Reporter: Daniel Scharrer <daniel>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal
Priority: medium
CC: daniel, tstellar
Version: git
Hardware: Other
OS: All
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=89228
Attachments:
    dmesg output
    patch to revert LLVM r233366 (fixes lockups)
    R600_DEBUG=ps,vs,gs output with r233365 (no lockups)
    R600_DEBUG=ps,vs,gs output with r233366 (lockups)

Description Daniel Scharrer 2015-05-09 01:24:59 UTC
Created attachment 115653 [details]
dmesg output

While playing L4D2 today I got a lot of GPU lockups.

While the lockups seem to happen randomly, they are fairly easy to reproduce in the third chapter (The Mall) of the Dead Center campaign. I recorded an apitrace while encountering 3 lockups and there seem to be at least a couple of lockups each time I retrace it:

 http://constexpr.org/tmp/L4D2-radeonsi.trace.xz (507 MiB)

At least the driver was able to successfully reset the GPU each time.

There also seem to be some infrequent rendering glitches.

Probably related to bug 89228, and possibly bug 90217, bug 90284 and/or bug 89954 (all reports of lockups with Source engine games).

GPU: Radeon HD 7950 (TAHITI)
Mesa 10.6.0-devel (git-3bdbc1e)
LLVM r236436
Linux 4.0.1-gentoo

The above logs and apitrace were recorded with unsafe-fp-math enabled (see bug 89069 comment 34) but the lockups also happen without it. I also noticed some VM fault messages in dmesg while running L4D2 without unsafe-fp-math.
Comment 1 Daniel Scharrer 2015-05-09 01:30:34 UTC
The game stdout with R600_DEBUG=ps,vs,gs was too large to attach, so here it is instead:

 http://constexpr.org/tmp/l4d.log (5 MiB)
Comment 2 Daniel Scharrer 2015-05-21 15:09:04 UTC
Created attachment 115951 [details] [review]
patch to revert LLVM r233366 (fixes lockups)

This seems to be a regression in llvm:
Mesa git + LLVM svn is bad
Mesa 10.5.5 + LLVM svn is bad
Mesa git + LLVM 3.6.0 is good (no lockups, no glitches)

With Mesa git, the lockups in the L4D2 apitrace linked above bisect to LLVM r233366:

commit 9217916725713c00f17cb64123e8dffdae843eb7
Author: Andrew Trick <atrick@apple.com>
Date:   Fri Mar 27 06:10:13 2015 +0000

    Complete the MachineScheduler fix made way back in r210390.
    
    "Fix the MachineScheduler's logic for updating ready times for in-order.
     Now the scheduler updates a node's ready time as soon as it is
     scheduled, before releasing dependent nodes."
    
    This fix was only made in one variant of the ScheduleDAGMI driver.
    Francois de Ferriere reported the issue in the other bit of code where
    it was also needed.
    I never got around to coming up with a test case, but it's an
    obvious fix that shouldn't be delayed any longer.
    I'll try to refactor this code a little better.
    
    I did verify performance on a wide variety of targets and saw no
    negative impact with this fix.
    
    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@233366 91177308-0d34-0410-b5e6-96231b3b80d8


I had to revert b8797a7 and a99a16a in Mesa for it to build against that LLVM revision.

Besides the arch-specific test files, r233366 only moves one line of code around. Reverting that on current LLVM (see attached patch) also fixes the lockups.
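
For reference, the bisection described above boils down to a binary search over LLVM revisions. The following is only a minimal sketch: `first_bad_revision` and `locks_up` are hypothetical names, and the predicate is a stand-in. A real check would have to build LLVM at the given revision, rebuild Mesa against it and replay the trace, probably several times since the lockups are not fully deterministic. The endpoints are revisions mentioned in this report (r231401 without lockups, r236436 with lockups).

```python
from typing import Callable


def first_bad_revision(good: int, bad: int, is_bad: Callable[[int], bool]) -> int:
    """Return the first revision in (good, bad] for which is_bad() is true.

    Assumes `good` is known good, `bad` is known bad, and that every
    revision after the first bad one is also bad.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return bad


# Stand-in predicate for illustration only (assumption): a real check would
# build LLVM at `revision`, rebuild Mesa and replay the apitrace.
def locks_up(revision: int) -> bool:
    return revision >= 233366


print(first_bad_revision(231401, 236436, locks_up))  # prints 233366
```

The search needs about 13 build-and-replay steps for this range, which is why a short, reproducible apitrace helps a lot here.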

As with bug #90510, R600_DEBUG=switch_on_eop gets rid of the glitches, and also prevents the crashes as well. Not sure if that means it could be a bug in Mesa or if that just hides the LLVM bug.
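
To compare runs with and without the workaround, the trace can be replayed with R600_DEBUG set per process. This is a rough sketch, assuming glretrace is on PATH and takes the trace file as its only argument; `retrace_env` and `replay` are hypothetical helper names, not part of apitrace or Mesa.

```python
import os
import subprocess
from typing import Dict, List, Optional


def retrace_env(r600_debug: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with R600_DEBUG set to `r600_debug`."""
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = r600_debug
    return env


def replay(trace: str, r600_debug: str) -> int:
    """Replay `trace` with glretrace and return its exit status."""
    cmd: List[str] = ["glretrace", trace]
    return subprocess.run(cmd, env=retrace_env(r600_debug)).returncode


# Example call (not executed here), using the decompressed trace from this report:
# replay("L4D2-radeonsi.trace", "switch_on_eop")
```

Setting the variable only in the child environment keeps the rest of the session unaffected, so the same shell can run both variants back to back.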

While bisecting for the lockup, I noticed the glitches were also introduced in LLVM after 3.6.0, but not by the same revision - f74b5c6 (r231401) has no lockups but does have glitches. I'll bisect that for bug #88561 as the glitches in the latest Talos apitrace there also seem to come from that commit range. (The GPU faults - bug #87278 - seem to have yet another cause, being present even with LLVM 3.6.0.)

NB: I also noticed that with compositing enabled in KWin, the system is not able to recover from the GPU lockups (and eventually does not even respond to SSH or SysRq). With compositing disabled the GPU is almost always reset successfully and the game / glretrace can continue as if nothing happened.
Comment 3 Daniel Scharrer 2015-05-21 15:13:15 UTC
Created attachment 115952 [details]
R600_DEBUG=ps,vs,gs output with r233365 (no lockups)
Comment 4 Daniel Scharrer 2015-05-21 15:16:00 UTC
Created attachment 115956 [details]
R600_DEBUG=ps,vs,gs output with r233366 (lockups)
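
In case it helps anyone comparing the two dumps: since r233366 changes instruction scheduling, the interesting differences should show up as reordered lines. Below is a minimal sketch using Python's difflib to list only the non-matching regions of two line lists. `changed_blocks` is a hypothetical helper name, and the file names in the comment are placeholders, not the actual attachment names.

```python
import difflib
from typing import List, Tuple


def changed_blocks(before: List[str], after: List[str]) -> List[Tuple[str, List[str], List[str]]]:
    """Return (tag, old_lines, new_lines) for every region that differs.

    `tag` is one of "replace", "delete" or "insert", as reported by
    difflib.SequenceMatcher.get_opcodes().
    """
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Example call (placeholder file names, not executed here):
# good = open("r233365.log").read().splitlines()
# bad = open("r233366.log").read().splitlines()
# for tag, old, new in changed_blocks(good, bad):
#     print(tag, old, new)
```

A moved instruction typically appears as a "delete" in one place and an "insert" in another, so the output still needs to be read per shader.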
Comment 5 Daniel Scharrer 2015-05-23 20:51:18 UTC
While the glitches come from an earlier revision than the GPU lockups, both are caused by the machine scheduler. Disabling the machine scheduler for SI fixes both the glitches and the GPU lockups. See bug 88978 for details.
Comment 6 Marek Olšák 2015-07-03 15:27:22 UTC
Hi Daniel,

Adding Tom in case he has an opinion on the LLVM issue.

(In reply to Daniel Scharrer from comment #2)
> As with bug #90510, R600_DEBUG=switch_on_eop gets rid of the glitches, and
> also prevents the crashes as well. Not sure if that means it could be a bug
> in Mesa or if that just hides the LLVM bug.

Does this mean switch_on_eop fixes this bug completely?
Comment 7 Daniel Scharrer 2015-07-03 19:03:25 UTC
Yes, afair R600_DEBUG=switch_on_eop fixed all issues with L4D2. I'll re-test with a more up to date build of LLVM and Mesa.

> Adding Tom in case he has an opinion on the LLVM issue.

Looks like he already had a look at some of the LLVM parts in bug #88978.
Comment 8 Daniel Scharrer 2015-07-05 01:11:27 UTC
I have confirmed that the issue is still the same with Mesa git-ff0a41b + LLVM r241381: L4D2 has glitches and lockups with unpatched LLVM and no glitches or lockups with unpatched LLVM and R600_DEBUG=switch_on_eop.

However other Source engine games (Counter-Strike: Global Offensive and Team Fortress 2) still have similar-looking glitches even with patched LLVM *and* R600_DEBUG=switch_on_eop. No idea if those are related though.
Comment 9 Daniel Scharrer 2015-07-29 07:18:50 UTC
Hi Marek,

Looks like http://lists.freedesktop.org/archives/mesa-dev/2015-July/089950.html does help. With latest LLVM + Mesa + that patch series, glitches and lockups seem to be gone. I got one lockup when replaying L4D2-radeonsi.trace after rebuilding Mesa, but could not reproduce it after a reboot or in game (tested in L4D2, CS:GO and The Talos Principle).
Comment 10 Daniel Scharrer 2015-08-01 14:20:32 UTC
With that patch series merged I can no longer reproduce the GPU lockups.
