93649 – [radeonsi] Graphics lockup while playing tf2

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	11.0
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Duplicates (1):	95308

Reported:	2016-01-10 05:11 UTC by Matthew Dawson
Modified:	2018-11-27 19:34 UTC
CC List:	5 users

Attachments
Kernel dmesg around the time of the lockup. (506.16 KB, text/plain) 2016-01-10 05:11 UTC, Matthew Dawson
Strace of Xorg up to X freezing (77.32 KB, text/plain) 2016-01-10 05:13 UTC, Matthew Dawson
Radeon blocked locks (6.89 KB, text/plain) 2016-01-10 07:09 UTC, Matthew Dawson
This helps avoid a complete crash when a lockup occurs. (8.14 KB, patch) 2016-01-24 05:30 UTC, Matthew Dawson
Second patch to fix system lockup after gpu reset (915 bytes, patch) 2016-01-26 05:18 UTC, Matthew Dawson
New avoid lockup patch (11.03 KB, patch) 2016-02-07 21:53 UTC, Matthew Dawson
stellaris run via steam: GALLIUM_DDEBUG="pipelined 2000" %command% (12.27 KB, application/gzip) 2016-09-10 14:38 UTC, pandiculationfinch
package update history that lead to a change in behaviour (4.32 KB, text/plain) 2016-11-02 20:18 UTC, pandiculationfinch

Description Matthew Dawson 2016-01-10 05:11:02 UTC

Created attachment 120925 [details]
Kernel dmesg around the time of the lockup.

After a period of time playing the latest version of TF2, my GPU locks up.  After the kernel tries to reset it, X becomes stuck and won't work.  The rest of the system is fine, however.  Sometimes the GPU will reset successfully and continue working, only to lock up later, eventually freezing X.

Hardware:
GPU: Gigabyte Radeon HD 7970 Ghz edition OC
CPU: AMD Phenom ii X6 1100T
MB: Asus Crosshair IV Formula

Software:
Mesa: 11.1.0
DRM: 2.4.65
LLVM: 3.7.0
X: 1.17.4
DDX: 7.6.1
Kernel: 4.3.3

I have a dmesg with debug turned on and a strace of X from around the time it crashes (attached).  I reduced the log file to the relevant bits, as they are quite large.  I'll retry with latest git, see if it helps anywhere.

Comment 1 Matthew Dawson 2016-01-10 05:13:28 UTC

Created attachment 120926 [details]
Strace of Xorg up to X freezing

FD 20 is the drm device node, and it freezes on ioctl 0xc020645d.
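
For readers who want to decode the ioctl number from the strace, here is a minimal sketch that splits it into the fields of the generic Linux ioctl encoding (number, type, size, direction). The bit layout is the asm-generic one used on x86-64. Reading number 0x5d as DRM_COMMAND_BASE (0x40) plus a driver-private offset of 0x1d is an interpretation; that offset appears to correspond to DRM_RADEON_GEM_CREATE, but treat the mapping as an assumption to verify against radeon_drm.h.

```python
# Decode a Linux ioctl request number (asm-generic layout, as on x86-64):
# bits 0-7 number, bits 8-15 type, bits 16-29 argument size, bits 30-31 direction.

def decode_ioctl(request):
    return {
        "nr": request & 0xFF,
        "type": chr((request >> 8) & 0xFF),
        "size": (request >> 16) & 0x3FFF,
        # 1 = userspace writes the argument, 2 = userspace reads it, 3 = both
        "direction": (request >> 30) & 0x3,
    }

DRM_COMMAND_BASE = 0x40  # start of the driver-private ioctl range in DRM

fields = decode_ioctl(0xC020645D)
driver_nr = fields["nr"] - DRM_COMMAND_BASE  # offset into the driver's own ioctl table

print(fields)
print("driver-private ioctl offset:", hex(driver_nr))
```

The type character 'd' is the DRM ioctl magic, which is consistent with FD 20 being the drm device node.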

Comment 2 Matthew Dawson 2016-01-10 07:09:32 UTC

Created attachment 120927 [details]
Radeon blocked locks

Since X seemed blocked on an ioctl, I managed to get a list of all the blocked locks. Most of the held locks belong to GUI-related programs that would be doing GL things, and they are all blocked on a lock, including one that is currently trying to reset my GPU.

I'm guessing there is a lock that is being grabbed twice, once when userspace makes an ioctl, and again during the reset.  I'll keep digging.


Also, I think this may be a duplicate of #90217, as both involve source games.  I'll leave this open for now, in case tf2 has a different trigger.

Comment 3 russianneuromancer 2016-01-19 18:20:07 UTC

There are other logs: https://github.com/ValveSoftware/Source-1-Games/issues/1943

Comment 4 Matthew Dawson 2016-01-24 05:30:15 UTC

Created attachment 121242 [details] [review]
This helps avoid a complete crash when a lockup occurs.

Note this doesn't solve this bug, it just helps manage it.

Comment 5 pc.jago1337 2016-01-26 05:03:59 UTC

Can confirm, I have either the same or a similar problem on my R9 390 (using radeon, with DPM disabled). It doesn't just crash X though, it completely locks up and I have to reboot to even use TTY. Happens after 10-20 mins of TF2.

Running Arch Linux with everything up to date but no AUR packages, will post specifics later.

Comment 6 Matthew Dawson 2016-01-26 05:18:46 UTC

Created attachment 121293 [details] [review]
Second patch to fix system lockup after gpu reset

This has already been accepted from the mailing list; I'm including it here for completeness.

If anyone is experiencing this issue, can you please try with all of these patches applied?  For now, X should die and restart without acceleration, but getting a dmesg out or restarting should be fine.

Comment 7 pc.jago1337 2016-01-26 10:09:36 UTC

CPU: FX 8350
GPU: R9 390
MB: Asrock 970 Extreme4

Software: 

Kernel: 4.3.3-3-ARCH x86_64
Mesa: 11.1.1
DRM: 2.43.0
LLVM: 3.7.0
X: 1.18.0


As mentioned above, I get the crash with TF2, but *NOT* CS:GO.

Comment 8 pc.jago1337 2016-01-26 16:07:51 UTC

Also, this could be a duplicate of bug #92912 - random lockups in TF2, all with radeon.

Comment 9 Matthew Dawson 2016-01-26 16:12:51 UTC

(In reply to pc.jago1337 from comment #8)
> Also, this could be a duplicate of bug #92912 - random lockups in TF2, all
> with radeon.

I was asked to file this bug separately.  Also, that bug covers R600, a different GPU architecture than GCN.

Comment 10 Rosco P. Coltrane 2016-02-07 20:17:13 UTC

Same problem here on a fedora 23

GPU: HD 7970
CPU: Intel Core i7 950

Mesa 11.1.0
DRM 2.43.0
LLVM 3.7.0
kernel: 4.3.4

The logs are filled with "ring stalled" and GPU lockup messages. I can send more logs if needed.

radeon 0000:02:00.0: ring 3 stalled for more than 10249msec
radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000001e5f1 last fence id 0x000000000001e5f2 on ring 3)
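
When comparing many such reports, it can help to pull the numbers out of these two message types mechanically. The sketch below parses only the two lines quoted above; the regular expressions are written against that exact wording, and the "pending" value (last fence id minus current fence id) is an interpretation of the message rather than something the log states.

```python
import re

STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)

def parse_stall(line):
    """Return (ring, stall_ms) for a 'ring N stalled' line, else None."""
    m = STALL_RE.search(line)
    return (int(m.group(1)), int(m.group(2))) if m else None

def parse_lockup(line):
    """Return the ring and fence ids for a 'GPU lockup' line, else None."""
    m = LOCKUP_RE.search(line)
    if not m:
        return None
    current, last = int(m.group(1), 16), int(m.group(2), 16)
    # "pending" is an assumed reading: fences emitted but not yet completed.
    return {"ring": int(m.group(3)), "current": current, "last": last,
            "pending": last - current}

log = [
    "radeon 0000:02:00.0: ring 3 stalled for more than 10249msec",
    "radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000001e5f1 "
    "last fence id 0x000000000001e5f2 on ring 3)",
]

for line in log:
    print(parse_stall(line) or parse_lockup(line))
```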
  
I've tried a different firmware (http://people.freedesktop.org/~agd5f/radeon_ucode/k/) which seemed to have helped other people with their own problem but it didn't help in my case.
  
Does it make sense to try rolling back to an older kernel?

Comment 11 Matthew Dawson 2016-02-07 21:53:22 UTC

Created attachment 121578 [details] [review]
New avoid lockup patch

Latest version as posted to dri-devel.  With these two patches, your system should no longer lock up forever.  It will freeze the game for a moment, and X may die for other reasons.

Now the underlying tf2 issue needs investigation.

Comment 12 Luca Osvaldo 2016-03-18 00:59:36 UTC

I can say that it also affects me, I'm using the AMDGPU drivers with powerplay enabled, using a custom linux4.5 kernel.
AMD r9 380 video card.

Comment 13 Matthew Dawson 2016-05-16 15:11:58 UTC

*** Bug 95308 has been marked as a duplicate of this bug. ***

Comment 14 Amarildo 2016-07-23 14:29:42 UTC

Any chance VALVe introduced this? They won't admit it. https://github.com/ValveSoftware/steam-for-linux/issues/4409

The patches attached here are present in Linux 4.6. I tested linux-git-4.7-rc7 with mesa-git-12.1 compiled against llvm-svn-3.9, and TF2 still crashes.

Setting every graphical option to Low doesn't help.

Comment 15 Nicolai Hähnle 2016-07-23 14:34:21 UTC

This is certainly a bug in our driver (unlike what was written on the Github tracker, a game *can* cause a hang e.g. by writing an infinite loop in a shader, but that seems exceedingly unlikely in the case of TF2). The problem with this particular bug is that it seems non-deterministic (i.e. not reliably reproducible), and that makes it hard to debug.

Comment 16 Amarildo 2016-07-23 14:50:21 UTC

So there's a chance it won't be fixed at all?

I was thinking about bisecting from version 3.16 (where I know it worked for me, on Debian Jessie) until ~4.1, but I don't have that kind of time right now.

Comment 17 Nicolai Hähnle 2016-07-23 15:45:13 UTC

Actually, if you could find a clear bisection result, that would be tremendously helpful and would probably lead to a fix.

However, with this kind of bug you need to be extremely sure about what you're doing when bisecting. For example, if you know that the hang typically occurs after 10 minutes, then you should play for at least one hour (perhaps even longer) with each kernel. Otherwise, you might have just gotten lucky, and the bisect result would be worse than useless.
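
To put a rough number on "gotten lucky", here is a small worked example. It assumes, purely for illustration, that the time until a hang follows an exponential distribution with a mean of 10 minutes; the real distribution for this bug is unknown. Under that assumption a bad kernel survives a 10-minute test about 37% of the time, but survives a one-hour test well under 1% of the time.

```python
import math

def prob_no_hang(play_minutes, mean_minutes_to_hang):
    """Chance that a *bad* build shows no hang during the test session.

    Assumes exponentially distributed time-to-hang (an illustrative model only).
    """
    return math.exp(-play_minutes / mean_minutes_to_hang)

for minutes in (10, 30, 60):
    chance = prob_no_hang(minutes, 10.0)
    print(f"{minutes:3d} min test: {chance:.2%} chance of wrongly marking a bad build good")
```

Each bisection step that wrongly marks a bad build as good sends the search in the wrong direction, which is why the longer sessions matter.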

Comment 18 Amarildo 2016-07-23 16:32:16 UTC

Yes, I would definitely test it for a long period, something like 16 hours hehehe.

However, I can't do any bisecting right now, I'm tremendously busy at the moment. Too bad there aren't many Linux players with this problem, otherwise someone would have figured this out already.

Cheers.

Comment 19 pandiculationfinch 2016-08-07 14:13:31 UTC

happens with stellaris as well.

Comment 20 Marek Olšák 2016-08-10 22:00:36 UTC

Does this fix it?

https://cgit.freedesktop.org/mesa/mesa/commit/?id=947e0614d091c260651e4f3d6209bd6bcc2cfa0d

In other words, does mesa/master work?

Comment 21 Matthew Dawson 2016-08-11 04:08:57 UTC

I can confirm latest git head (50b49d242d702e4728329cc59f87d929963e7c53) still causes lockups, though they seem to come much faster.

Also seems to have a regression regarding lighting, I'll see about bisecting that in a separate report.

LLVM: 3.8.0
DRM: 2.43.0
Linux: 4.6.3-gentoo

Comment 22 pandiculationfinch 2016-08-11 17:12:48 UTC

I'll test this weekend with stellaris and let you know.

Comment 23 pandiculationfinch 2016-08-11 21:30:53 UTC

sad to say it did not fix the issue for me. it ran longer than usual though prior to the crash. I suspect you nixed one issue but multiple are going on.

I'm happy to run any debugging/patches you wish to try.

Comment 24 Amarildo 2016-08-31 06:32:14 UTC

Didn't fix for me either, on Arch Linux.

Comment 25 Amarildo 2016-08-31 06:38:42 UTC

Marek, since you work for AMD, I wonder if you could get a few hints for the fix on Catalyst's sources?

Comment 26 Marek Olšák 2016-08-31 13:27:18 UTC

(In reply to AmarildoJr from comment #25)
> Marek, since you work for AMD, I wonder if you could get a few hints for the
> fix on Catalyst's sources?

It's not so simple. This is a bug somewhere in the Mesa driver such that looking at other drivers won't likely help.

Comment 27 Amarildo 2016-09-02 16:15:47 UTC

(In reply to Marek Olšák from comment #26)
> (In reply to AmarildoJr from comment #25)
> > Marek, since you work for AMD, I wonder if you could get a few hints for the
> > fix on Catalyst's sources?
> 
> It's not so simple. This is a bug somewhere in the Mesa driver such that
> looking at other drivers won't likely help.

This is a very weird issue. I think it may not be in Mesa, and here's why:

* On Debian Jessie with kernel 3.16 and Mesa 10.3, the problem doesn't happen;
* On the same Debian, but with mesa backported, the problem also doesn't happen;
* On the same Debian with Mesa backported and the Kernel backported, the problem still doesn't happen;
* On Arch Linux with Mesa downgraded to 10.3, the problem happens;
* On the same Arch Linux with Mesa and Kernel downgraded (Kernel to version 3.16 and even 3.10), the problem still happens;
* I'm not 100% sure I downgraded the Firmware on Arch, but I'll try today since I'm testing a few drivers in Linux;
* On vanilla Arch with Catalyst/FGLRX, the problem doesn't happen;

So I do think this issue is much bigger than everybody thinks and only happens with a certain combination of Mesa, Kernel, Firmware, and possibly libdrm, llvm, and other pieces of software as well.

What I really think is that VALVe should investigate this since this problem started happening after they introduced mandatory Texture Streaming.

Comment 28 Vedran Miletić 2016-09-02 16:17:28 UTC

(In reply to AmarildoJr from comment #27)
> (In reply to Marek Olšák from comment #26)
> > (In reply to AmarildoJr from comment #25)
> > > Marek, since you work for AMD, I wonder if you could get a few hints for the
> > > fix on Catalyst's sources?
> > 
> > It's not so simple. This is a bug somewhere in the Mesa driver such that
> > looking at other drivers won't likely help.
> 
> This is a very weird issue. I think it may not be in Mesa, and here's why:
> 
> * On Debian Jessie with kernel 3.16 and Mesa 10.3, the problem doesn't
> happen;
> * On the same Debian, but with mesa backported, the problem also doesn't
> happen;
> * On the same Debian with Mesa backported and the Kernel backported, the
> problem still doesn't happen;
> * On Arch Linux with Mesa downgraded to 10.3, the problem happens;
> * On the same Arch Linux with Mesa and Kernel downgraded (Kernel to version
> 3.16 and even 3.10), the problem still happens;
> * I'm not 100% sure I downgraded the Firmware on Arch, but I'll try today
> since I'm testing a few drivers in Linux;
> * On vanilla Arch with Catalyst/FGLRX, the problem doesn't happen;
> 
> So I do think this issue is much bigger than everybody thinks and only
> happens with a certain combination of Mesa, Kernel, Firmware, and possibly
> libdrm, llvm, and other pieces of software as well.
> 
> What I really think is that VALVe should investigate this since this problem
> started happening after they introduced mandatory Texture Streaming.

Is the elephant in the room in this case the LLVM version difference between the two setups?

Comment 29 Amarildo 2016-09-03 21:19:57 UTC

I just tested the oldest firmware available in the Arch Linux Archive, namely linux-firmware 20130725-1, and the crashes don't happen. This is with current Arch, not a single package is old and all packages are up-to-date according to the repos.

I'm hitting 10 to 30 FPS in-game, but at least the crashes don't happen which IMO is a very good sign of where the problem might be.

I'll report the firmware problem to AMD.

In the mean time, does anyone know how I can try running the firmware from Catalyst?

@Marek, where is the best place to report this?

Comment 30 Amarildo 2016-09-03 21:23:49 UTC

(In reply to Vedran Miletić from comment #28)
> Is the elephant in the room in this case the LLVM version difference between
> the two setups?

According to a Gentoo user who compiled llvm 3.5 and an older version of mesa against it, the problem still occurs.

Comment 31 Marek Olšák 2016-09-04 11:34:37 UTC

(In reply to AmarildoJr from comment #29)
> I just tested the oldest firmware available in the Arch Linux Archive,
> namely linux-firmware 20130725-1, and the crashes don't happen. This is with
> current Arch, not a single package is old and all packages are up-to-date
> according to the repos.
> 
> I'm hitting 10 to 30 FPS in-game, but at least the crashes don't happen
> which IMO is a very good sign of where the problem might be.
> 
> I'll report the firmware problem to AMD.
> 
> In the mean time, does anyone know how I can try running the firmware from
> Catalyst?
> 
> @Marek, where is the best place to report this?

So are we certain the hangs are caused by firmware? Bisecting the firmware would help a lot.

What's your GPU?
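
As a rough illustration of what bisecting the firmware would involve, the sketch below runs a binary search over an ordered list of firmware snapshots. Everything in it is hypothetical: the snapshot labels are placeholders, and `is_bad` stands in for a long manual play test. It also assumes the oldest snapshot is good, the newest is bad, and the behaviour changes exactly once, which may not hold for an intermittent hang.

```python
def first_bad(versions, is_bad):
    """Return the index of the first bad entry in an ordered list.

    Assumes versions[0] is known good, versions[-1] is known bad, and every
    entry after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical snapshot labels, oldest first.
snapshots = [f"snapshot-{i}" for i in range(8)]

# Stand-in for a manual test; here snapshot-5 and later are treated as bad.
def pretend_test(name):
    return int(name.split("-")[1]) >= 5

print(snapshots[first_bad(snapshots, pretend_test)])
```

With eight snapshots this needs only three test sessions, so each session can afford to be long.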

Comment 32 Rosco P. Coltrane 2016-09-04 14:01:51 UTC

I tested today 3 different firmwares on manjaro (HD7970) 

linux-firmware-20150527.3161bfa-1-any.pkg.tar.xz (chosen because it was a bit before the first bugs were reported with TF2)
  
This allowed me to play TF2 without bugs for ~30 min. Then I had the bug (screen freeze, sound loop) but the system recovered fine after 20 sec with no loss of performance. I still had a problem before and after the bug with the mouse pointer which wasn't visible at all time.

linux-firmware-20131013.7d0c7a8-1-any.pkg.tar.xz

This allowed me to play for a good hour, then: bug + recovery after 20 sec. At the fifth bug the screen simply hung, and TF2 and steam crashed (I had to ctrl+alt+f2). This one didn't have the mouse bug. This is the most stable TF2 experience I can get.

linux-firmware-20130725-1-any.pkg.tar.xz (earliest firmware available in the repo)
  
This one crashed after 2 seconds loading the first map.
  
The first two firmwares also seem to have fixed the same bug which was present in "Victor Vran" (same symptoms, screen freeze + sound loop).

Comment 33 pandiculationfinch 2016-09-04 14:13:41 UTC

Not certain, but assuming I ran the test correctly, I experienced a crash using the oldest linux firmware I had, linux-firmware-20140828. That leaves 13 months of time to bisect if linux-firmware 20130725-1 does indeed work. I'll see about installing the 20130725 version later; I have other stuff I need to do.

commands run to downgrade to linux-firmware-20140828:
sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20140828.13eb208-1-any.pkg.tar.xz
sudo pacman -S linux

after downgrade I had the following error on boot, so I'm assuming it worked:
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: radeon_vce: Can't load firmware "radeon/TAHITI_vce.bin"
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: failed VCE (-2) init.
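
The "-2" in these messages follows the usual kernel convention of returning a negated errno value. A quick way to look the code up is shown below; reading it this way is consistent with the older firmware package simply not containing radeon/TAHITI_vce.bin, though that is an inference from the log rather than something verified here.

```python
import errno
import os

kernel_error = -2       # value from the "failed with error -2" log lines
code = -kernel_error    # kernel code reports errors as negated errno values

# errno.errorcode maps the number to its symbolic name; os.strerror gives the text.
print(errno.errorcode[code], "-", os.strerror(code))
```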

other info:
Name            : llvm-libs
Version         : 3.8.1-1
Name            : linux
Version         : 4.7.2-1
Name            : mesa-git
Version         : 84594.98f734e-1

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD OLAND (DRM 2.45.0 / 4.7.2-1-ARCH, LLVM 4.0.0) (0x6610)
    Version: 12.1.0
    Accelerated: yes
    Video memory: 2048MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.3
    Max compat profile version: 3.0
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.1

I forget the exact card off the top of my head but here is the output of lspci, if you need more precise card information let me know how to get it from the cli =):
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]

Comment 34 pandiculationfinch 2016-09-04 14:15:03 UTC

I should note I was testing against stellaris.

Comment 35 pandiculationfinch 2016-09-04 15:23:11 UTC

The game froze again after ~20 minutes using the 20130725 version firmware, so if downgrading to 20130725 fixes TF2, this likely isn't the same issue as TF2.
game: stellaris

commands run to downgrade to linux-firmware-20130725:
sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20130725-1-any.pkg.tar.xz
sudo pacman -S linux

other info:
Name            : llvm-libs
Version         : 3.8.1-1
Name            : linux
Version         : 4.7.2-1
Name            : mesa-git
Version         : 84594.98f734e-1
Name            : linux-firmware
Version         : 20130725-1

lspci:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]

boot logs:
Sep 04 11:12:28 jambli kernel: [drm] initializing kernel modesetting (OLAND 0x1002:0x6610 0x174B:0xE269 0x00).
Sep 04 11:12:28 jambli kernel: [drm] register mmio base: 0xFDD80000
Sep 04 11:12:28 jambli kernel: [drm] register mmio size: 262144
Sep 04 11:12:28 jambli kernel: ATOM BIOS: C66201
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: GTT: 2048M 0x0000000080000000 - 0x00000000FFFFFFFF
Sep 04 11:12:28 jambli kernel: [drm] Detected VRAM RAM=2048M, BAR=256M
Sep 04 11:12:28 jambli kernel: [drm] RAM width 128bits DDR
Sep 04 11:12:28 jambli kernel: [TTM] Zone  kernel: Available graphics memory: 8209378 kiB
Sep 04 11:12:28 jambli kernel: [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
Sep 04 11:12:28 jambli kernel: [TTM] Initializing pool allocator
Sep 04 11:12:28 jambli kernel: [TTM] Initializing DMA pool allocator
Sep 04 11:12:28 jambli kernel: [drm] radeon: 2048M of VRAM memory ready
Sep 04 11:12:28 jambli kernel: [drm] radeon: 2048M of GTT memory ready.
Sep 04 11:12:28 jambli kernel: [drm] Loading oland Microcode
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_pfp.bin failed with error -2
Sep 04 11:12:28 jambli systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_me.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_ce.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_rlc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_mc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/OLAND_mc2.bin failed with error -2
Sep 04 11:12:28 jambli kernel: [drm] radeon/OLAND_mc.bin: 31452 bytes
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_smc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/OLAND_smc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: smc: error loading firmware "radeon/OLAND_smc.bin"
Sep 04 11:12:28 jambli kernel: [drm] Internal thermal controller with fan control
Sep 04 11:12:28 jambli kernel: [drm] radeon: power management initialized
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: radeon_vce: Can't load firmware "radeon/TAHITI_vce.bin"
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: failed VCE (-2) init.

Comment 36 Marek Olšák 2016-09-04 17:13:39 UTC

If you're testing Mesa git, would you please set GALLIUM_DDEBUG="pipelined 2000" and run TF2, wait until the GPU hangs and repeat. After it happens for the 3rd time, please zip and attach the contents of ~/ddebug_dumps/*. There should be 3 files.
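
For the "zip and attach" step, a minimal sketch follows. It assumes the dumps are plain files directly inside ~/ddebug_dumps, as the instructions above suggest; the function name and archive name are made up for this example.

```python
import os
import zipfile

def zip_dumps(dump_dir, archive_path):
    """Zip every regular file directly inside dump_dir; return the names added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(dump_dir)):
            path = os.path.join(dump_dir, name)
            if os.path.isfile(path):
                archive.write(path, arcname=name)
                added.append(name)
    return added

# Example call (not executed here):
#   zip_dumps(os.path.expanduser("~/ddebug_dumps"), "ddebug_dumps.zip")
```

Checking the returned list against the expected three files before attaching helps catch empty or missing dumps early.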

Though I've got a hunch that we're just running around in circles.

Comment 37 pandiculationfinch 2016-09-10 14:38:14 UTC

Created attachment 126454 [details]
stellaris run via steam: GALLIUM_DDEBUG="pipelined 2000" %command%

here are the dumps generated.

it seems like a hit or miss if anything was actually written into the files.
the computer completely locks up when it encounters the freeze in stellaris.

stellaris was even more unstable with the GALLIUM_DDEBUG, often failing to even start up.

Comment 38 Amarildo 2016-09-14 14:08:58 UTC

Does anyone have a little bit of free time to extract the files from "lib32-catalyst-libgl" into a system running "lib32-mesa-libgl" and see if that helps?

Comment 39 hofmann.zachary 2016-09-24 19:53:14 UTC

I'm also having this problem with Radeon R7 250 (radeonsi), Mesa 12.0.2, LLVM 3.8.1 and kernel version 4.6.0.

Comment 40 Amarildo 2016-10-02 04:52:47 UTC

If disabling DPM fixed the issue, shouldn't developers study its code a little bit? I'm 99.99% positive the issue is in there somewhere, even for AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).

Comment 41 hofmann.zachary 2016-10-02 17:14:24 UTC

(In reply to Amarildo from comment #40)
> If disabling DPM fixed the issue, shouldn't developers study its code a
> little bit? I'm 99.99% positive the issue is in there somewhere, even for
> AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).

Another user previously stated in the thread that they were experiencing the issues and had DPM disabled.

@Marek Olšák
Please let me know if there's anything I can do to help hunt this bug down.

Comment 42 Amarildo 2016-10-14 05:35:29 UTC

(In reply to hofmann.zachary from comment #41)
> (In reply to Amarildo from comment #40)
> > If disabling DPM fixed the issue, shouldn't developers study its code a
> > little bit? I'm 99.99% positive the issue is in there somewhere, even for
> > AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).
> 
> Another user previously stated in the thread that they were experiencing the
> issues and had DPM disabled.
> 
> @Marek Olšák
> Please let me know if there's anything I can do to help hunt this bug down.

But that's one user's word against at least 5. Do we even know if the user actually disabled DPM or has the capacity to do so? Because I'm sure me and others (like Gentoo users) did in fact disable DPM and the hang didn't happen. So I don't think our word is less valid just because *one* user claimed he/she disabled DPM and the hang still happened.

Comment 43 Amarildo 2016-10-20 04:48:07 UTC

Just tried Mesa-Git (13.1) with the AMDGPU driver on R9 270X. The crash happens here as well.

However, looking at journalctl I can see new errors from the AMDGPU driver, and a brief research tells me it could be some TF2 texturing problem.

The error: GPU fault detected: 147 0x000ac802

Similar bugs have been resolved already:

https://bugs.freedesktop.org/show_bug.cgi?id=87278
https://bugs.freedesktop.org/show_bug.cgi?id=84614

LLVM seems to be related too.

Comment 44 Rosco P. Coltrane 2016-10-21 19:01:19 UTC

I don't know if it can be of any help, but I've been playing "7 days to die" during the last weeks, regularly for the last days, and I didn't encounter any kind of bug.

Until yesterday evening when, to my great surprise, I had the same bug (freeze, sound loop), which totally crashed my machine once and only froze it (with a recovery after a few seconds) twice.
  
I checked that no update occurred on the game files, on the steam runtime and on my OS between the days when it worked flawlessly and yesterday when it crashed 3 times in 15 minutes.

So if it's not only related to files, could it be related to the hardware? Could it be a faulty card (HD7970), or maybe a mix between a faulty hardware and some software instruction?

Comment 45 Amarildo 2016-10-21 22:47:13 UTC

Faulty hardware doesn't make any sense, because:

- It only happens on Linux;
- It only happens with specific combinations of Mesa/LLVM/Kernel/Firmware/etc
- It doesn't happen with the proprietary drivers

Comment 46 hofmann.zachary 2016-10-22 01:34:19 UTC

(In reply to Amarildo from comment #45)
> Faulty hardware doesn't make any sense, because:
> 
> - It only happens on Linux;
> - It only happens with specific combinations of Mesa/LLVM/Kernel/Firmware/etc
> - It doesn't happen with the proprietary drivers

It's probably not the exact same crash, but FWIW I also got crashes with the proprietary driver and TF2 when I last tested it. I just don't want people to get their hopes up only to have them let down.

Comment 47 Amarildo 2016-10-22 10:48:42 UTC

In all honesty, this is one of the most interesting bugs I know. Among all the people that have it, there are variations in what causes it in the first place.

What works for me (Debian Jessie with Mesa/libc6 from Backports, for example) might still cause the crash for some people.

What I do know is that it's not caused by faulty hardware. It could be for some, but I seriously doubt it's the cause for 99.99% of people experiencing the issue.

Comment 48 Marek Olšák 2016-10-24 17:56:37 UTC

Does this fix the hangs?
https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a86b4a8c41048d58

It changes the HTILE (HyperZ) allocation function to r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang happens when TTM decides to move HTILE to a different location with an unaligned physical address (which is pretty random). The hardware tries to access the unaligned address and boom.
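
To make "unaligned physical address" concrete, the sketch below checks whether an address is a multiple of a required alignment and rounds it up when it is not. The alignment value and the addresses are placeholders chosen for illustration; the real HTILE alignment requirement is hardware-specific and is not stated in this thread.

```python
def is_aligned(address, alignment):
    """True if address is a multiple of alignment."""
    return address % alignment == 0

def align_up(address, alignment):
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment

# Placeholder value, NOT the real HTILE requirement.
HYPOTHETICAL_ALIGNMENT = 32 * 1024

for address in (0x100000, 0x101000):
    print(hex(address),
          "aligned" if is_aligned(address, HYPOTHETICAL_ALIGNMENT) else "unaligned",
          "->", hex(align_up(address, HYPOTHETICAL_ALIGNMENT)))
```

A buffer placed at an address that only satisfies a smaller alignment, such as the second address here, is the kind of placement described above as triggering the hang.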

Comment 49 Marek Olšák 2016-10-24 19:54:35 UTC

(In reply to Marek Olšák from comment #48)
> Does this fix the hangs?
> https://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=d4d9ec55c589156df4edc227a86b4a8c41048d58
> 
> It changes the HTILE (HyperZ) allocation function to
> r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs
> (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang
> happens when TTM decides to move HTILE to a different location with an
> unaligned physical address (which is pretty random). The hardware tries to
> access the unaligned address and boom.

Actually, I think that commit only affects Hawaii and Fiji. Other GPUs might be unaffected, which means the Tahiti hangs are due to a different bug.

Comment 50 Matthew Dawson 2016-10-24 20:10:33 UTC

(In reply to Marek Olšák from comment #49)
> (In reply to Marek Olšák from comment #48)
> > Does this fix the hangs?
> > https://cgit.freedesktop.org/mesa/mesa/commit/
> > ?id=d4d9ec55c589156df4edc227a86b4a8c41048d58
> > 
> > It changes the HTILE (HyperZ) allocation function to
> > r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs
> > (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang
> > happens when TTM decides to move HTILE to a different location with an
> > unaligned physical address (which is pretty random). The hardware tries to
> > access the unaligned address and boom.
> 
> Actually, I think that commit only affects Hawaii and Fiji. Other GPUs might
> be unaffected, which means the Tahiti hangs are due to a different bug.

I've previously tried disabling hyperz on Tahiti with no luck in sidestepping this bug, so I don't think this is the issue.

Could there be other buffers that need similar treatment that are being ignored?  Is there an easy way to test this locally?

Comment 51 Marek Olšák 2016-10-24 20:35:17 UTC

You can try this:

diff --git a/src/gallium/winsys/radeon/drm/radeon_drm_bo.c b/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
index a15d559..ab95bae 100644
--- a/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
+++ b/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
@@ -939,7 +939,7 @@ radeon_winsys_bo_create(struct radeon_winsys *rws,
     struct radeon_drm_winsys *ws = radeon_drm_winsys(rws);
     struct radeon_bo *bo;
     unsigned usage = 0, pb_cache_bucket;
-
+alignment *= 2;
     /* Only 32-bit sizes are supported. */
     if (size > UINT_MAX)
         return NULL;


It will only affect radeon, not amdgpu.

Comment 52 hofmann.zachary 2016-10-24 20:52:49 UTC

Unless the changed code works independently of the nohyperz option I don't think it will help, since disabling hyperz on verde doesn't help either.

Comment 53 smoki 2016-10-25 02:11:55 UTC

 It might be possible that the game fixes something, as I see there was a game update 3 days ago with the following mentioned in the changelog:

 "Improved several aspects of texture handling for OS X and Linux clients

    This should reduce the rate of "Out of memory" errors for players on high texture settings, especially on level change
    Players still encountering this error can reduce texture quality to medium or lower to greatly improve stability pending further improvements"

 http://store.steampowered.com/news/25022/

 Just wild guessing that this might change something, since game started to be unstable on radeonsi when streaming textures and reduction of mem was introduced last year.

Comment 54 Amarildo 2016-10-25 04:45:33 UTC

I remember disabling stream textures and still having the issue, as well as setting all graphic settings to minimal.

Can anyone confirm the status of this bug on Pitcairn + Mesa-git + amdgpu kernel driver?

Comment 55 Amarildo 2016-10-25 10:10:24 UTC

Seems that hang handling wasn't implemented at all for some GPUs: https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd?h=amd-staging-4.7&id=196cbffe7a4e23ad672b25a4226e53ea5479166c

I haven't yet tried playing TF2 with amd-staging-4.7 (though I have been using it for a few days). I'll try it this morning.

Comment 56 Amarildo 2016-10-25 11:26:13 UTC

Didn't work, hang is still there. I couldn't even go to tty2 this time.

amd-staging-4.7 compiled this morning
mesa-git
llvm-git

Comment 57 hofmann.zachary 2016-10-25 16:38:07 UTC

As smoki mentioned, many of the troubles started after Valve's texture streaming changes to TF2. They'd certainly know what changed in their code, but for someone like me they're impossible to get a hold of.

http://www.teamfortress.com/post.php?id=19733

Comment 58 pandiculationfinch 2016-11-02 20:18:59 UTC

Created attachment 127704 [details]
package update history that lead to a change in behaviour

Last night the freezes I've been having changed their behaviour. They used to just cause the system to completely freeze up. Now my system does an immediate shutdown.

This is interesting because I had just updated linux and mesa-git, so I potentially have a commit range in mesa/llvm that contains code related to the problem. I'm going to roll back my kernel/headers tonight and reboot to rule that out. If that doesn't cause the hang to reappear, I'll roll back mesa tomorrow, and then llvm.

In the meantime I've attached the package update history for the last few days in case that helps any of the developers.

Comment 59 pandiculationfinch 2016-11-02 22:37:17 UTC

Sigh, turns out something else must have caused the shutdowns; the game is back to just freezing the system today. =/

Comment 60 Rosco P. Coltrane 2016-11-14 19:06:37 UTC

Some people are reporting that they can reproduce the bug on Windows 7.

https://github.com/ValveSoftware/Source-1-Games/issues/1943#issuecomment-260154700

Are we absolutely sure that it is not a hardware problem?

Comment 61 hofmann.zachary 2016-11-15 20:20:10 UTC

I haven't seen anything to rule out it being a hardware problem, but Valve's overwhelming silence on the matter isn't exactly helpful.

Comment 62 pandiculationfinch 2016-11-23 03:34:46 UTC

I finally found the root cause for my problems.

Turns out my CPU was overheating. But I only stressed it enough when playing games, and nothing showed up in the logs about a shutdown due to heat. Once I resolved the overheating, all my games ran smoothly with no crashes. Apologies for the noise.

Wish I had found it sooner.

Comment 63 Cooper Blake 2016-11-25 03:02:11 UTC

I am also seeing my system completely crash after running Team Fortress 2 for typically 5-20 minutes.  In the last three occurrences, I've seen the following:

1. Freeze and system reboot within 10 seconds.  I did not see anything in the logs.
2. Successful playing for ~30 minutes without issue.
3. Freeze and sound loop.  The screen resets and sound loop changes every 10-20 seconds, which I believe is when the system is trying to reset the GPU.  However, it never succeeds, and the system becomes completely non-responsive.  The keyboard does not seem to accept input (num lock is frozen, can't switch to console).  The only thing I can do is a hard restart.  This scenario happens almost every time.

Output from journalctl looks like this:
Nov 24 21:26:42 fedora kernel: radeon 0000:01:00.0: ring 3 stalled for more than 10181msec
Nov 24 21:26:42 fedora kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000075bec last fence id 0x0000000000075bf7 on ring 3)

Backtrace starts like this:
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) Backtrace:
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 0: /usr/libexec/Xorg (OsLookupColor+0x139) [0x59f679]
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 1: /lib64/libc.so.6 (__restore_rt+0x0) [0x7f4ec08bf7df]
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 2: /lib64/libc.so.6 (__memcpy_sse2_unaligned+0x29) [0x7f4ec0927739]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 3: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_virtio_gpu+0x37401a) [0x7f4eb9d88e7a]
...
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 15: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0xa16e) [0x7f4ebafcfd3e]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 16: /usr/libexec/Xorg (DamageRegionAppend+0x618) [0x520ea8]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 17: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x11427) [0x7f4ebafde9e7]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 18: /usr/libexec/Xorg (AddTraps+0x56b1) [0x51c1d1]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 19: /usr/libexec/Xorg (SendErrorToClient+0x2df) [0x436e2f]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 20: /usr/libexec/Xorg (remove_fs_handlers+0x463) [0x43ae63]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 21: /lib64/libc.so.6 (__libc_start_main+0xf1) [0x7f4ec08ab731]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 22: /usr/libexec/Xorg (_start+0x29) [0x424d59]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 23: ? (?+0x29) [0x29]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE)
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) Bus error at address 0x7f4eb5af5008


I am running Fedora 24 with the latest updates:
Hardware:
CPU:  AMD Athlon II x3 450
GPU:  Sapphire / AMD Radeon R7 350 w/ 2GB GDDR5
GPU chipset:  Cape Verde

Kernel: 4.8.7-200.fc24.x86_64
Mesa:  12.0.3
LLVM:  3.8.0
DRM:  2.46.0
Driver:  radeonsi

I have played a couple other Valve games for several hours with no problems:  Portal, Portal 2, and Dota 2.

Comment 64 Amarildo 2016-12-08 18:40:39 UTC

Have any of you tried this? https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea

Comment 65 Marek Olšák 2016-12-08 21:33:20 UTC

(In reply to Amarildo from comment #27)
> What I really think is that VALVe should investigate this since this problem
> started happening after they introduced mandatory Texture Streaming.

If you are right about texture streaming, the cso commit might fix it.

Comment 66 Amarildo 2016-12-09 00:24:20 UTC

OH MY LORD

Been playing for 25 minutes so far, no hangs at all.

I'll test more!

Comment 67 Amarildo 2016-12-09 00:51:26 UTC

45 minutes, not a single crash. I believe it's fixed.

Comment 68 Amarildo 2016-12-09 02:22:09 UTC

Played 2 sessions of 1 hour each, no hangs at all.
To me, this is fixed.

"Thanks", I guess? 1 year is still better than nothing, AMD :P

Comment 69 Michel Dänzer 2016-12-09 02:31:37 UTC

FWIW, the fundamental problem caught by Marek (good catch!) was there for almost 9 years. It just might not have had quite as severe consequences with other drivers.

Comment 70 hofmann.zachary 2016-12-09 03:05:19 UTC

Well of course it needs more testing to be sure, but I'll probably be doing this soon.

Comment 71 Amarildo 2016-12-09 03:09:09 UTC

It would be really unfortunate if this didn't fix the issue for everybody.

Comment 72 null32 2016-12-09 03:23:27 UTC

RX470 here, I've been playing for more than 1 hour and no crash so far. Thank you!

Comment 73 hofmann.zachary 2016-12-09 20:13:33 UTC

One hour is not enough testing. I applied this patch to mesa 13.0.2 and the game still locks up.

Comment 74 Amarildo 2016-12-09 22:12:27 UTC

(In reply to hofmann.zachary from comment #73)
> One hour is not enough testing. I applied this patch to mesa 13.0.2 and the
> game still locks up.

I believe you need mesa-git and llvm-svn for it to work.

Comment 75 null32 2016-12-09 22:25:15 UTC

(In reply to hofmann.zachary from comment #73)
> One hour is not enough testing. I applied this patch to mesa 13.0.2 and the
> game still locks up.

Make sure you're using a patched version of the 32 bit libraries too. I managed to play almost 3 hours in a row in a full server and in different maps without issues at all.

These are the packages that I'm using:

* linux 4.8.12-2
* linux-firmware 20161005.9c71af9-1

* mesa-git 13.1.0_devel.87233.bd56de8-1
* lib32-mesa-git 13.1.0_devel.87233.bd56de8-1

* llvm-svn 4.0.0svn_r289147-1
* lib32-llvm-svn 4.0.0svn_r289117-1

Comment 76 Amarildo 2016-12-11 05:47:41 UTC

(In reply to null32 from comment #75)
> (In reply to hofmann.zachary from comment #73)
> > One hour is not enough testing. I applied this patch to mesa 13.0.2 and the
> > game still locks up.
> 
> Make sure you're using a patched version of the 32 bit libraries too. I
> managed to play almost 3 hours in a row in a full server and in different
> maps without issues at all.
> 
> These are the packages that I'm using:
> 
> * linux 4.8.12-2
> * linux-firmware 20161005.9c71af9-1
> 
> * mesa-git 13.1.0_devel.87233.bd56de8-1
> * lib32-mesa-git 13.1.0_devel.87233.bd56de8-1
> 
> * llvm-svn 4.0.0svn_r289147-1
> * lib32-llvm-svn 4.0.0svn_r289117-1

He confirmed it working :D

https://github.com/ValveSoftware/Source-1-Games/issues/1943#issuecomment-266251699

Comment 77 hofmann.zachary 2016-12-20 20:24:50 UTC

Oops, forgot to confirm the patch working here too. Yes, the game works without crashing now.

Comment 78 Marek Olšák 2016-12-22 17:48:55 UTC

Fixed by: https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea

Closing.

Comment 79 Timothy Arceri 2018-04-03 04:08:21 UTC

*** Bug 95308 has been marked as a duplicate of this bug. ***

Comment 80 Amarildo 2018-11-26 23:37:10 UTC

Uh oh. This bug may be back.

I'm back on Linux. First time playing for more than 30 mins (my little sister was playing) PC hangs.

Will test it to see whether it's this hellish bug or not.

Comment 81 Alex Deucher 2018-11-27 19:34:23 UTC

(In reply to Amarildo from comment #80)
> Uh oh. This bug may be back.
> 
> I'm back on Linux. First time playing for more than 30 mins (my little
> sister was playing) PC hangs.
> 
> Will test it to see whether it's this hellish bug or not.

Not likely to be the same issue if there is a hang.  Please file a new bug report.
