Created attachment 120925
Kernel dmesg around the time of the lockup.

After playing the latest version of TF2 for a while, my GPU locks up. After the kernel tries to reset it, X becomes stuck and stops working, although the rest of the system is fine. Sometimes the GPU resets successfully and keeps working, only to lock up again later, eventually freezing X.

Hardware:
GPU: Gigabyte Radeon HD 7970 GHz Edition OC
CPU: AMD Phenom II X6 1100T
MB: Asus Crosshair IV Formula

Software:
Mesa: 11.1.0
DRM: 2.4.65
LLVM: 3.7.0
X: 1.17.4
DDX: 7.6.1
Kernel: 4.3.3

I have attached a dmesg with debug output turned on and an strace of X from around the time it crashes. I trimmed the logs to the relevant parts, as they are quite large. I'll retry with the latest git to see if that helps.
Created attachment 120926
Strace of Xorg up to X freezing

FD 20 is the DRM device node, and X freezes on ioctl 0xc020645d.
Created attachment 120927
Radeon blocked locks

Since X seemed blocked on an ioctl, I managed to get a list of all the blocked locks. Most of the held locks belong to GUI programs that would be doing GL work, and they are all blocked on a lock, including the task that is currently trying to reset my GPU. My guess is that a lock is being taken twice: once when userspace makes an ioctl, and again during the reset. I'll keep digging.

Also, I think this may be a duplicate of #90217, as both involve Source games. I'll leave this open for now, in case TF2 has a different trigger.
There are other logs here: https://github.com/ValveSoftware/Source-1-Games/issues/1943
Created attachment 121242
This patch helps avoid a complete crash when a lockup occurs. Note that it doesn't fix this bug; it just helps manage it.
Can confirm: I have either the same or a similar problem on my R9 390 (using radeon, with DPM disabled). It doesn't just crash X, though; the whole system locks up and I have to reboot even to get to a TTY. It happens after 10-20 minutes of TF2. I'm running Arch Linux with everything up to date and no AUR packages; I will post specifics later.
Created attachment 121293
Second patch to fix system lockup after GPU reset

This has already been accepted from the mailing list; I'm including it here for completeness. If anyone is experiencing this issue, can you please try with all of these patches applied? For now, X should die and restart without acceleration, but getting a dmesg out or restarting should work fine.
CPU: FX 8350
GPU: R9 390
MB: ASRock 970 Extreme4

Software:
Kernel: 4.3.3-3-ARCH x86_64
Mesa: 11.1.1
DRM: 2.43.0
LLVM: 3.7.0
X: 1.18.0

As mentioned above, I get the crash with TF2 but *NOT* with CS:GO.
Also, this could be a duplicate of bug #92912 - random lockups in TF2, all with radeon.
(In reply to pc.jago1337 from comment #8)
> Also, this could be a duplicate of bug #92912 - random lockups in TF2, all
> with radeon.

I was asked to file this bug separately. Also, that bug covers R600, a different GPU family from GCN.
Same problem here on Fedora 23.

GPU: HD 7970
CPU: Intel Core i7 950
Mesa: 11.1.0
DRM: 2.43.0
LLVM: 3.7.0
Kernel: 4.3.4

The logs are filled with "ring stalled" and GPU lockup messages. I can send more logs if needed.

radeon 0000:02:00.0: ring 3 stalled for more than 10249msec
radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000001e5f1 last fence id 0x000000000001e5f2 on ring 3)

I've tried different firmware (http://people.freedesktop.org/~agd5f/radeon_ucode/k/), which seemed to help other people with their own problems, but it didn't help in my case. Does it make sense to try rolling back to an older kernel?
Created attachment 121578
New avoid lockup patch

Latest version as posted to dri-devel. With these two patches, your system should no longer lock up forever. It will freeze the game for a moment, and X may die for other reasons. Now the underlying TF2 issue needs investigation.
I can confirm that it also affects me. I'm using the AMDGPU driver with PowerPlay enabled on a custom Linux 4.5 kernel, with an AMD R9 380 video card.
*** Bug 95308 has been marked as a duplicate of this bug. ***
Any chance VALVe introduced this? They won't admit it: https://github.com/ValveSoftware/steam-for-linux/issues/4409

The patches attached here are present in Linux 4.6. I tested linux-git-4.7-rc7 with mesa-git-12.1 compiled against llvm-svn-3.9, and TF2 still crashes. Setting every graphical option to Low doesn't help.
This is certainly a bug in our driver (unlike what was written on the Github tracker, a game *can* cause a hang e.g. by writing an infinite loop in a shader, but that seems exceedingly unlikely in the case of TF2). The problem with this particular bug is that it seems non-deterministic (i.e. not reliably reproducible), and that makes it hard to debug.
So there's a chance it won't be fixed at all? I was thinking about bisecting from kernel 3.16 (where I know it worked for me, on Debian Jessie) up to ~4.1, but I don't have that kind of time right now.
Actually, if you could find a clear bisection result, that would be tremendously helpful and would probably lead to a fix. However, with this kind of bug you need to be extremely sure about what you're doing when bisecting. For example, if you know that the hang typically occurs after 10 minutes, then you should play for at least one hour (perhaps even longer) with each kernel. Otherwise, you might have just gotten lucky, and the bisect result would be worse than useless.
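As a rough illustration of why each kernel needs a session much longer than the typical time to hang, suppose (purely as a modelling assumption, not something measured in this thread) that hangs arrive at random at a constant rate, so the time to the first hang is exponentially distributed with mean $\mu = 10$ minutes. The probability that a bad kernel survives a test session of length $t$, and gets wrongly marked good, is then:

$$
P(\text{no hang within } t) = e^{-t/\mu}, \qquad
e^{-10/10} \approx 0.37, \qquad
e^{-60/10} = e^{-6} \approx 0.0025 .
$$

So a 10-minute session mislabels a bad kernel about a third of the time, while a one-hour session does so about 0.25% of the time. Over $n$ "good" verdicts in a bisection, the chance that at least one is wrong is at most $n\,e^{-t/\mu}$; for an illustrative $n = 15$ one-hour sessions that is about $0.037$.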
Yes, I would definitely test each step for a long period, something like 16 hours, hehehe. However, I can't do any bisecting right now; I'm tremendously busy at the moment. Too bad there aren't many Linux players with this problem, otherwise someone would have figured this out already. Cheers.
This happens with Stellaris as well.
Does this fix it? https://cgit.freedesktop.org/mesa/mesa/commit/?id=947e0614d091c260651e4f3d6209bd6bcc2cfa0d In other words, does mesa/master work?
I can confirm that the latest git head (50b49d242d702e4728329cc59f87d929963e7c53) still causes lockups, though they seem to come much faster. It also seems to have a lighting regression; I'll see about bisecting that in a separate report.

LLVM: 3.8.0
DRM: 2.43.0
Linux: 4.6.3-gentoo
I'll test this weekend with Stellaris and let you know.
Sad to say, it did not fix the issue for me, although it ran longer than usual before the crash. I suspect you fixed one issue but multiple are going on. I'm happy to run any debugging or patches you wish to try.
Didn't fix for me either, on Arch Linux.
Marek, since you work for AMD, I wonder if you could get a few hints for the fix on Catalyst's sources?
(In reply to AmarildoJr from comment #25)
> Marek, since you work for AMD, I wonder if you could get a few hints for the
> fix on Catalyst's sources?

It's not so simple. This is a bug somewhere in the Mesa driver, so looking at other drivers likely won't help.
(In reply to Marek Olšák from comment #26)
> (In reply to AmarildoJr from comment #25)
> > Marek, since you work for AMD, I wonder if you could get a few hints for the
> > fix on Catalyst's sources?
>
> It's not so simple. This is a bug somewhere in the Mesa driver such that
> looking at other drivers won't likely help.

This is a very weird issue. I think it may not be in Mesa, and here's why:

* On Debian Jessie with kernel 3.16 and Mesa 10.3, the problem doesn't happen;
* On the same Debian, but with Mesa backported, the problem also doesn't happen;
* On the same Debian with both Mesa and the kernel backported, the problem still doesn't happen;
* On Arch Linux with Mesa downgraded to 10.3, the problem happens;
* On the same Arch Linux with Mesa and the kernel downgraded (the kernel to 3.16 and even 3.10), the problem still happens;
* I'm not 100% sure I downgraded the firmware on Arch, but I'll try today since I'm testing a few drivers on Linux;
* On vanilla Arch with Catalyst/fglrx, the problem doesn't happen.

So I do think this issue is much bigger than everybody thinks and only happens with a certain combination of Mesa, kernel, firmware, and possibly libdrm, LLVM and other pieces of software as well.

What I really think is that VALVe should investigate this, since this problem started happening after they introduced mandatory texture streaming.
(In reply to AmarildoJr from comment #27)
> (In reply to Marek Olšák from comment #26)
> > (In reply to AmarildoJr from comment #25)
> > > Marek, since you work for AMD, I wonder if you could get a few hints for the
> > > fix on Catalyst's sources?
> >
> > It's not so simple. This is a bug somewhere in the Mesa driver such that
> > looking at other drivers won't likely help.
>
> This is a very weird issue. I think it may not be in Mesa, and here's why:
>
> * On Debian Jessie with kernel 3.16 and Mesa 10.3, the problem doesn't
> happen;
> * On the same Debian, but with mesa backported, the problem also doesn't
> happen;
> * On the same Debian with Mesa backported and the Kernel backported, the
> problem still doesn't happen;
> * On Arch Linux with Mesa downgraded to 10.3, the problem happens;
> * On the same Arch Linux with Mesa and Kernel downgraded (Kernel to version
> 3.16 and even 3.10), the problem still happens;
> * I'm not 100% sure I downgraded the Firmware on Arch, but I'll try today
> since I'm testing a few drivers in Linux;
> * On vanilla Arch with Catalyst/FGLRX, the problem doesn't happen;
>
> So I do think this issue is much bigger than everybody thinks and only
> happens with a certain combination of Mesa, Kernel, Firmware, and possibly
> libdrm, llvm, and other pieces of software as well.
>
> What I really think is that VALVe should investigate this since this problem
> started happening after they introduced mandatory Texture Streaming.

Is the elephant in the room in this case the LLVM version difference between the two setups?
I just tested the oldest firmware available in the Arch Linux Archive, namely linux-firmware 20130725-1, and the crashes don't happen. This is with current Arch: apart from the firmware, not a single package is old, and all packages are up to date according to the repos.

I'm only getting 10 to 30 FPS in-game, but at least the crashes don't happen, which IMO is a very good sign of where the problem might be.

I'll report the firmware problem to AMD.

In the meantime, does anyone know how I can try running the firmware from Catalyst?

@Marek, where is the best place to report this?
(In reply to Vedran Miletić from comment #28)
> Is the elephant in the room in this case the LLVM version difference between
> the two setups?

According to a Gentoo user who compiled LLVM 3.5 and an older version of Mesa against it, the problem still occurs.
(In reply to AmarildoJr from comment #29)
> I just tested the oldest firmware available in the Arch Linux Archive,
> namely linux-firmware 20130725-1, and the crashes don't happen. This is with
> current Arch, not a single package is old and all packages are up-to-date
> according to the repos.
>
> I'm hitting 10 to 30 FPS in-game, but at least the crashes don't happen
> which IMO is a very good sign of where the problem might be.
>
> I'll report the firmware problem to AMD.
>
> In the mean time, does anyone know how I can try running the firmware from
> Catalyst?
>
> @Marek, where is the best place to report this?

So are we certain the hangs are caused by firmware? Bisecting the firmware would help a lot. What's your GPU?
Today I tested 3 different firmware versions on Manjaro (HD 7970):

linux-firmware-20150527.3161bfa-1-any.pkg.tar.xz (chosen because it predates the first TF2 bug reports by a little)
This allowed me to play TF2 without bugs for ~30 min. Then I hit the bug (screen freeze, sound loop), but the system recovered fine after 20 sec with no loss of performance. Before and after the bug I still had a problem with the mouse pointer, which wasn't always visible.

linux-firmware-20131013.7d0c7a8-1-any.pkg.tar.xz
This allowed me to play for a good hour, then: bug + recovery after 20 sec. At the fifth occurrence the screen simply hung, and TF2 and Steam crashed (I had to press Ctrl+Alt+F2). This one didn't have the mouse bug. This is the most stable TF2 experience I can get.

linux-firmware-20130725-1-any.pkg.tar.xz (the earliest firmware available in the repo)
This one crashed after 2 seconds of loading the first map.

The first two firmware versions also seem to have fixed the same bug in "Victor Vran" (same symptoms: screen freeze + sound loop).
Not certain, but assuming I ran the test correctly, I experienced a crash using the oldest linux-firmware I had, linux-firmware-20140828. That leaves 13 months to bisect if linux-firmware 20130725-1 does indeed work. I'll try installing the 20130725 version later; I have other stuff I need to do.

Commands run to downgrade to linux-firmware-20140828:

sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20140828.13eb208-1-any.pkg.tar.xz
sudo pacman -S linux

After the downgrade I had the following errors on boot, so I'm assuming it worked:

Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: radeon_vce: Can't load firmware "radeon/TAHITI_vce.bin"
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: failed VCE (-2) init.

Other info:

Name : llvm-libs
Version : 3.8.1-1
Name : linux
Version : 4.7.2-1
Name : mesa-git
Version : 84594.98f734e-1

Extended renderer info (GLX_MESA_query_renderer):
Vendor: X.Org (0x1002)
Device: AMD OLAND (DRM 2.45.0 / 4.7.2-1-ARCH, LLVM 4.0.0) (0x6610)
Version: 12.1.0
Accelerated: yes
Video memory: 2048MB
Unified memory: no
Preferred profile: core (0x1)
Max core profile version: 4.3
Max compat profile version: 3.0
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.1

I forget the exact card off the top of my head, but here is the output of lspci. If you need more precise card information, let me know how to get it from the CLI =):

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
I should note that I was testing with Stellaris.
The game froze again after ~20 minutes using the 20130725 firmware. So if downgrading to 20130725 fixes TF2, the Stellaris freeze likely isn't the same issue as TF2's.

Game: Stellaris

Commands run to downgrade to linux-firmware-20130725:

sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20130725-1-any.pkg.tar.xz
sudo pacman -S linux

Other info:

Name : llvm-libs
Version : 3.8.1-1
Name : linux
Version : 4.7.2-1
Name : mesa-git
Version : 84594.98f734e-1
Name : linux-firmware
Version : 20130725-1

lspci:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]

Boot logs:

Sep 04 11:12:28 jambli kernel: [drm] initializing kernel modesetting (OLAND 0x1002:0x6610 0x174B:0xE269 0x00).
Sep 04 11:12:28 jambli kernel: [drm] register mmio base: 0xFDD80000
Sep 04 11:12:28 jambli kernel: [drm] register mmio size: 262144
Sep 04 11:12:28 jambli kernel: ATOM BIOS: C66201
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: GTT: 2048M 0x0000000080000000 - 0x00000000FFFFFFFF
Sep 04 11:12:28 jambli kernel: [drm] Detected VRAM RAM=2048M, BAR=256M
Sep 04 11:12:28 jambli kernel: [drm] RAM width 128bits DDR
Sep 04 11:12:28 jambli kernel: [TTM] Zone kernel: Available graphics memory: 8209378 kiB
Sep 04 11:12:28 jambli kernel: [TTM] Zone dma32: Available graphics memory: 2097152 kiB
Sep 04 11:12:28 jambli kernel: [TTM] Initializing pool allocator
Sep 04 11:12:28 jambli kernel: [TTM] Initializing DMA pool allocator
Sep 04 11:12:28 jambli kernel: [drm] radeon: 2048M of VRAM memory ready
Sep 04 11:12:28 jambli kernel: [drm] radeon: 2048M of GTT memory ready.
Sep 04 11:12:28 jambli kernel: [drm] Loading oland Microcode
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_pfp.bin failed with error -2
Sep 04 11:12:28 jambli systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_me.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_ce.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_rlc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_mc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/OLAND_mc2.bin failed with error -2
Sep 04 11:12:28 jambli kernel: [drm] radeon/OLAND_mc.bin: 31452 bytes
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_smc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/OLAND_smc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: smc: error loading firmware "radeon/OLAND_smc.bin"
Sep 04 11:12:28 jambli kernel: [drm] Internal thermal controller with fan control
Sep 04 11:12:28 jambli kernel: [drm] radeon: power management initialized
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: radeon_vce: Can't load firmware "radeon/TAHITI_vce.bin"
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: failed VCE (-2) init.
If you're testing Mesa git, would you please set GALLIUM_DDEBUG="pipelined 2000" and run TF2, wait until the GPU hangs and repeat. After it happens for the 3rd time, please zip and attach the contents of ~/ddebug_dumps/*. There should be 3 files. Though I've got a hunch that we're just running around in circles.
Created attachment 126454
Stellaris run via Steam with: GALLIUM_DDEBUG="pipelined 2000" %command%

Here are the generated dumps. It seems hit or miss whether anything was actually written into the files. The computer completely locks up when it encounters the freeze in Stellaris, and Stellaris was even more unstable with GALLIUM_DDEBUG set, often failing even to start up.
Does anyone have a little bit of free time to extract the files from "lib32-catalyst-libgl" into a system running "lib32-mesa-libgl" and see if that helps?
I'm also having this problem with Radeon R7 250 (radeonsi), Mesa 12.0.2, LLVM 3.8.1 and kernel version 4.6.0.
If disabling DPM fixed the issue, shouldn't developers study its code a little? I'm 99.99% positive the issue is in there somewhere, even for AMDGPU (since the RadeonSI and AMDGPU drivers share a lot of code).
(In reply to Amarildo from comment #40)
> If disabling DPM fixed the issue, shouldn't developers study it's code a
> little bit? I'm 99.99% positive the issue is in there somewhere, even for
> AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).

Another user previously stated in this thread that they were experiencing the issue with DPM disabled.

@Marek Olšák
Please let me know if there's anything I can do to help hunt this bug down.
(In reply to hofmann.zachary from comment #41)
> (In reply to Amarildo from comment #40)
> > If disabling DPM fixed the issue, shouldn't developers study it's code a
> > little bit? I'm 99.99% positive the issue is in there somewhere, even for
> > AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).
>
> Another user previously stated in the thread that they were experiencing the
> issues and had DPM disabled.
>
> @Marek Olšák
> Please let me know if there's anything I can do to help hunt this bug down.

But that's one user's word against at least 5. Do we even know whether that user actually disabled DPM, or was able to? I'm sure that I and others (like Gentoo users) did in fact disable DPM and the hang didn't happen. So I don't think our word is less valid just because *one* user claimed he/she disabled DPM and the hang still happened.
Just tried Mesa-Git (13.1) with the AMDGPU driver on an R9 270X. The crash happens here as well. However, in journalctl I can see new errors from the AMDGPU driver, and a bit of research suggests it could be a TF2 texturing problem. The error:

GPU fault detected: 147 0x000ac802

Similar bugs have already been resolved:
https://bugs.freedesktop.org/show_bug.cgi?id=87278
https://bugs.freedesktop.org/show_bug.cgi?id=84614

LLVM seems to be related too.
I don't know if it's of any help, but I've been playing "7 Days to Die" over the last few weeks, regularly over the last few days, without encountering any kind of bug. Then yesterday evening, to my great surprise, I had the same bug (freeze, sound loop), which totally crashed my machine once and only froze it (recovering after a few seconds) twice. I checked that nothing was updated (game files, Steam runtime, OS) between the days when it worked flawlessly and yesterday, when it crashed 3 times in 15 minutes. So if it's not only related to files, could it be related to the hardware? Could it be a faulty card (HD 7970), or maybe a mix of faulty hardware and some software instruction?
Faulty hardware doesn't make any sense, because:

- It only happens on Linux;
- It only happens with specific combinations of Mesa/LLVM/Kernel/Firmware/etc.;
- It doesn't happen with the proprietary drivers.
(In reply to Amarildo from comment #45)
> Faulty hardware doesn't make any sense, because:
>
> - It only happens on Linux;
> - It only happens with specific combinations of Mesa/LLVM/Kernel/Firmware/etc
> - It doesn't happen with the proprietary drivers

It's probably not the exact same crash, but FWIW, I also got crashes with the proprietary driver and TF2 when I last tested it. I just don't want people to get their hopes up only to be let down.
In all honesty, this is one of the most interesting bugs I know of. Among all the people who have it, what triggers it in the first place varies. What works for me (Debian Jessie with Mesa/libc6 from Backports, for example) might still cause the crash for some people. What I do know is that it's not caused by faulty hardware. It could be for some, but I seriously doubt it's the cause for 99.99% of the people experiencing the issue.
Does this fix the hangs? https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a86b4a8c41048d58 It changes the HTILE (HyperZ) allocation function to r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang happens when TTM decides to move HTILE to a different location with an unaligned physical address (which is pretty random). The hardware tries to access the unaligned address and boom.
(In reply to Marek Olšák from comment #48)
> Does this fix the hangs?
> https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a86b4a8c41048d58
>
> It changes the HTILE (HyperZ) allocation function to
> r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs
> (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang
> happens when TTM decides to move HTILE to a different location with an
> unaligned physical address (which is pretty random). The hardware tries to
> access the unaligned address and boom.

Actually, I think that commit only affects Hawaii and Fiji. Other GPUs might be unaffected, which means the Tahiti hangs are due to a different bug.
(In reply to Marek Olšák from comment #49)
> (In reply to Marek Olšák from comment #48)
> > Does this fix the hangs?
> > https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a86b4a8c41048d58
> >
> > It changes the HTILE (HyperZ) allocation function to
> > r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs
> > (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang
> > happens when TTM decides to move HTILE to a different location with an
> > unaligned physical address (which is pretty random). The hardware tries to
> > access the unaligned address and boom.
>
> Actually, I think that commit only affects Hawaii and Fiji. Other GPUs might
> be unaffected, which means the Tahiti hangs are due to a different bug.

I've previously tried disabling HyperZ on Tahiti with no luck in sidestepping this bug, so I don't think this is the issue. Could there be other buffers that need similar treatment but are being ignored? Is there an easy way to test this locally?
You can try this:

```diff
diff --git a/src/gallium/winsys/radeon/drm/radeon_drm_bo.c b/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
index a15d559..ab95bae 100644
--- a/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
+++ b/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
@@ -939,7 +939,7 @@ radeon_winsys_bo_create(struct radeon_winsys *rws,
     struct radeon_drm_winsys *ws = radeon_drm_winsys(rws);
     struct radeon_bo *bo;
     unsigned usage = 0, pb_cache_bucket;
-
+alignment *= 2;
     /* Only 32-bit sizes are supported. */
     if (size > UINT_MAX)
         return NULL;
```

It will only affect radeon, not amdgpu.
Unless the changed code works independently of the nohyperz option, I don't think it will help, since disabling HyperZ on Verde doesn't help either.
It's possible the game itself fixes something: I see there was a game update 3 days ago with the following in the changelog:

"Improved several aspects of texture handling for OS X and Linux clients
This should reduce the rate of "Out of memory" errors for players on high texture settings, especially on level change
Players still encountering this error can reduce texture quality to medium or lower to greatly improve stability pending further improvements"

http://store.steampowered.com/news/25022/

This is just a wild guess that it might change something, since the game started to be unstable on radeonsi when texture streaming and reduced memory usage were introduced last year.
I remember disabling texture streaming and still having the issue, as well as setting all graphics settings to minimum. Can anyone confirm the status of this bug on Pitcairn + Mesa-git + amdgpu kernel driver?
It seems hang handling wasn't implemented at all for some GPUs: https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd?h=amd-staging-4.7&id=196cbffe7a4e23ad672b25a4226e53ea5479166c

I haven't yet tried playing TF2 with amd-staging-4.7 (though I have been using it for a few days). I'll try it this morning.
Didn't work; the hang is still there. I couldn't even switch to tty2 this time.

amd-staging-4.7 compiled this morning
mesa-git
llvm-git
As smoki mentioned, many of the troubles started after Valve's texture streaming changes to TF2. They'd certainly know what changed in their code, but for someone like me they're impossible to get a hold of. http://www.teamfortress.com/post.php?id=19733
Created attachment 127704
Package update history that led to a change in behaviour

Last night the freezes I've been having changed their behaviour. They used to just cause the system to completely freeze up; now my system does an immediate shutdown. This is interesting because I had just updated linux and mesa-git, so I potentially have a commit range in Mesa/LLVM containing code related to the problem. I'm going to roll back my kernel/headers tonight and reboot to rule that out. If that doesn't cause the hang to reappear, I'll roll back Mesa tomorrow, and then LLVM. In the meantime, I've attached the package update history for the last few days in case it helps any of the developers.
Sigh, it turns out something else must have caused the shutdowns; the game is back to just freezing the system today. =/
Some people are reporting that they can reproduce the bug on Windows 7: https://github.com/ValveSoftware/Source-1-Games/issues/1943#issuecomment-260154700

Are we absolutely sure that it is not a hardware problem?
I haven't seen anything to rule out it being a hardware problem, but Valve's overwhelming silence on the matter isn't exactly helpful.
I finally found the root cause of my problems: my CPU was overheating. It was only stressed enough to overheat when playing games, and nothing showed up in the logs about a shutdown due to heat. Once I resolved the overheating, all my games ran smoothly with no crashes. Apologies for the noise; I wish I had found it sooner.
I am also seeing my system completely crash after running Team Fortress 2 for typically 5-20 minutes. In the last three occurrences, I've seen the following:

1. Freeze and system reboot within 10 seconds. I did not see anything in the logs.
2. Successful play for ~30 minutes without issue.
3. Freeze and sound loop. The screen resets and the sound loop changes every 10-20 seconds, which I believe is when the system is trying to reset the GPU. However, it never succeeds, and the system becomes completely unresponsive. The keyboard does not seem to accept input (Num Lock is frozen, and I can't switch to a console). The only thing I can do is a hard restart. This scenario happens almost every time.

Output from journalctl looks like this:

Nov 24 21:26:42 fedora kernel: radeon 0000:01:00.0: ring 3 stalled for more than 10181msec
Nov 24 21:26:42 fedora kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000075bec last fence id 0x0000000000075bf7 on ring 3)

The backtrace starts like this:

Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) Backtrace:
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 0: /usr/libexec/Xorg (OsLookupColor+0x139) [0x59f679]
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 1: /lib64/libc.so.6 (__restore_rt+0x0) [0x7f4ec08bf7df]
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 2: /lib64/libc.so.6 (__memcpy_sse2_unaligned+0x29) [0x7f4ec0927739]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 3: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_virtio_gpu+0x37401a) [0x7f4eb9d88e7a]
...
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 15: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0xa16e) [0x7f4ebafcfd3e]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 16: /usr/libexec/Xorg (DamageRegionAppend+0x618) [0x520ea8]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 17: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x11427) [0x7f4ebafde9e7]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 18: /usr/libexec/Xorg (AddTraps+0x56b1) [0x51c1d1]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 19: /usr/libexec/Xorg (SendErrorToClient+0x2df) [0x436e2f]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 20: /usr/libexec/Xorg (remove_fs_handlers+0x463) [0x43ae63]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 21: /lib64/libc.so.6 (__libc_start_main+0xf1) [0x7f4ec08ab731]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 22: /usr/libexec/Xorg (_start+0x29) [0x424d59]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 23: ? (?+0x29) [0x29]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE)
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) Bus error at address 0x7f4eb5af5008

I am running Fedora 24 with the latest updates:

Hardware:
CPU: AMD Athlon II X3 450
GPU: Sapphire / AMD Radeon R7 350 w/ 2GB GDDR5
GPU chipset: Cape Verde

Kernel: 4.8.7-200.fc24.x86_64
Mesa: 12.0.3
LLVM: 3.8.0
DRM: 2.46.0
Driver: radeonsi

I have played a couple of other Valve games for several hours with no problems: Portal, Portal 2, and Dota 2.
Have any of you tried this? https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea
(In reply to Amarildo from comment #27)
> What I really think is that VALVe should investigate this since this problem
> started happening after they introduced mandatory Texture Streaming.

If you are right about texture streaming, the cso commit might fix it.
OH MY LORD Been playing for 25 minutes so far, no hangs at all. I'll test more!
45 minutes, not a single crash. I believe it's fixed.
Played 2 sessions of 1 hour each, no hangs at all. To me, this is fixed. "Thanks", I guess? 1 year is still better than nothing, AMD :P
FWIW, the fundamental problem caught by Marek (good catch!) was there for almost 9 years. It just might not have had quite as severe consequences with other drivers.
Well of course it needs more testing to be sure, but I'll probably be doing this soon.
It would be really unfortunate if this didn't fix the issue for everybody.
RX470 here, I've been playing for more than 1 hour and no crash so far. Thank you!
One hour is not enough testing. I applied this patch to mesa 13.0.2 and the game still locks up.
(In reply to hofmann.zachary from comment #73)
> One hour is not enough testing. I applied this patch to mesa 13.0.2 and the
> game still locks up.

I believe you need mesa-git and llvm-svn for it to work.
(In reply to hofmann.zachary from comment #73)
> One hour is not enough testing. I applied this patch to mesa 13.0.2 and the
> game still locks up.

Make sure you're using a patched version of the 32-bit libraries too. I managed to play almost 3 hours in a row on a full server and on different maps without any issues at all.

These are the packages that I'm using:

* linux 4.8.12-2
* linux-firmware 20161005.9c71af9-1

* mesa-git 13.1.0_devel.87233.bd56de8-1
* lib32-mesa-git 13.1.0_devel.87233.bd56de8-1

* llvm-svn 4.0.0svn_r289147-1
* lib32-llvm-svn 4.0.0svn_r289117-1
(In reply to null32 from comment #75)
> (In reply to hofmann.zachary from comment #73)
> > One hour is not enough testing. I applied this patch to mesa 13.0.2 and the
> > game still locks up.
>
> Make sure you're using a patched version of the 32 bit libraries too. I
> managed to play almost 3 hours in a row in a full server and in different
> maps without issues at all.
>
> These are the packages that I'm using:
>
> * linux 4.8.12-2
> * linux-firmware 20161005.9c71af9-1
>
> * mesa-git 13.1.0_devel.87233.bd56de8-1
> * lib32-mesa-git 13.1.0_devel.87233.bd56de8-1
>
> * llvm-svn 4.0.0svn_r289147-1
> * lib32-llvm-svn 4.0.0svn_r289117-1

He confirmed it working :D
https://github.com/ValveSoftware/Source-1-Games/issues/1943#issuecomment-266251699
Oops, forgot to confirm the patch working here too. Yes, the game works without crashing now.
Fixed by: https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da478c4f370be98dea Closing.
Uh oh. This bug may be back.

I'm back on Linux. The first time the game was played for more than 30 minutes (my little sister was playing), the PC hung.

I'll test to see whether it's this hellish bug or not.
(In reply to Amarildo from comment #80)
> Uh oh. This bug may be back.
>
> I'm back on Linux. First time playing for more than 30 mins (my little
> sister was playing) PC hangs.
>
> Will test it to see whether it's this hellish bug or not.

It's not likely to be the same issue if there is a hang. Please file a new bug report.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.