I am experiencing periodic system freezes which, to my knowledge, are caused by a GPU lockup. The freezes are always triggered by 3D rendering and occur across a multitude of game engines. The crash is highly probabilistic: depending on the program, it can take anywhere between one minute and one hour to trigger. The image instantly freezes in place, audio stops working and every form of input dies (including the NumLock / CapsLock keyboard LEDs); the machine is entirely bricked until powered off and back on.
The problem is oddly similar to an issue I experienced a year ago, which had the exact same behavior and was also caused by 3D. That issue went away after every system component received major updates; however, it seems to have come back sometime during the last few months. I will link its report here as it may still contain useful information:
My OS is Linux openSUSE Tumbleweed x64: Kernel 4.15.7, Xorg X11 Server 1.19.6, Mesa 18.0.0, xf86-video-ati 17.10.0, xf86-video-amdgpu 18.0.0. My motherboard is a Gigabyte GA-X58A-UD7 (rev 1.0). My video card is a Radeon R7 370 (Gigabyte) (rev 1.0), Pitcairn GPU (Southern Islands family), GCN 1.0, RadeonSI.
Created attachment 137948
Created attachment 137949
journalctl --since "1 day ago"
This encompasses multiple crashes caused today by Blender 2.79.
Created attachment 137950
Created attachment 137951
Created attachment 137952
An important note: The exact same freeze happens with both the "radeon" and "amdgpu" drivers: Using either module makes absolutely no difference.
I've been testing this crash using Xonotic during the past two days, since it's a game I have a lot of experience customizing. What I found is pretty interesting and should be a good start in shedding light on this bug.
Initially the system freeze occurred after somewhere between 10 and 40 minutes. Upon changing a few cvars, I seem to have almost entirely gotten rid of it: After nearly 5 hours of continuous testing, only one lockup has taken place! Below are the cvar overrides I added to my autoexec.cfg for the test: At least one of them had an influence... I'm still working on pinning down which, and that will take several more days because the freeze is so random.
r_batch_multidraw 0 // old: 1
r_batch_dynamicbuffer 0 // old: 1
r_depthfirst 0 // old: 2
gl_vbo 0 // old: 3
gl_vbo_dynamicindex 0 // old: 1
gl_vbo_dynamicvertex 0 // old: 1
r_glsl_skeletal 0 // old: 1
vid_samples 1 // old: 4
gl_texture_anisotropy 0 // old: 16
I know the issue has something to do with triangles or vertices: The crash seems more frequent when there are a lot of players or objects present, indicating that an increased surface count may be a contributor. I've suspected mesh data stored on the video card to be the culprit, especially shared data with multiple objects using one instance of a mesh from video memory. This is why my bet is currently on gl_vbo (Vertex Buffer Objects / GL_ARB_vertex_buffer_object) being the variable that made a difference... again I still got a lockup even without it, so if anything it just heavily mitigated the crash.
This belief is reinforced by my previous experience in Blender: The only scene causing the GPU lockup is one where several high-poly objects share common mesh data, and the crash always occurred upon me adding a Subdivision Surface to just one of them (increasing its polygon count). It's been confirmed that as of Blender 2.77 (I have 2.79) VBO is indeed enabled in the 3D viewport. Note that I was also using the untextured viewport, thus I doubt textures play a role.
Lastly I ruled out the possibility of overheating having anything to do with it: During the first 3 hours in which I got no lockup, the temperature in my room was above 26°C. When I did get that one lockup later at night, the temperature of my room had long dropped to 23°C. The stress on the GPU was the same at all times, absolutely no settings were changed including the map.
Testing is still well underway. There's nothing conclusive yet, but I should definitely share a piece of information early on.
To my surprise, it would appear the culprit may be either Anti-Aliasing or Anisotropic Filtering. I decided to re-enable their cvars first in Xonotic since I honestly suspected them the least... the moment I did that all hell broke loose again: In 30 minutes I had two system lockups! Then I disabled them once more, and could play a 40 minute match with no problem.
I have no idea which of the two it could be, but I should be getting there in the following days. I'm slowly re-enabling the other cvars first to rule them out, then I'll see whether AA or texture filtering is behind the crashes.
And we have a verdict. The influential factor is by far the anti-aliasing, at least in the case of Xonotic. The other cvars I previously mentioned have absolutely no effect on the frequency of this freeze.
Today I enabled the feature again and tried playing another match: I instantly got two lockups, one after 8 minutes and the other after only 20 seconds! I then disabled it and let the bots play again while I was away: This time the machine froze after more than 2 whole hours of experiencing no issues.
I find it interesting how the probability of the freeze seems to scale with the number of samples: If I use 4x AA ("vid_samples 4"), I get a crash roughly every 30 minutes... if I disable AA ("vid_samples 1"), I get a crash less than once per 2 hours... 30 minutes * 4x = 2 hours. Maybe this is just me seeing patterns but I thought I should suggest the idea.
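The proportionality idea above (30 minutes * 4 = 2 hours) can be sanity-checked with a small calculation. This is only a sketch: it assumes the freezes arrive at a constant average rate (exponentially distributed waiting times), which nothing in this report confirms, and the helper name p_no_freeze is made up for illustration.

```shell
# Probability of getting through t minutes without a freeze, assuming
# freezes arrive at a constant average rate with the given mean interval
# (both in minutes). The exponential model is an assumption.
p_no_freeze() {
    LC_ALL=C awk -v t="$1" -v mean="$2" \
        'BEGIN { printf "%.2f\n", exp(-t / mean) }'
}

p_no_freeze 40 30     # a quiet 40 minute match if the mean is 30 minutes
p_no_freeze 40 120    # a quiet 40 minute match if the mean is 120 minutes
```

Under that assumption a single quiet 40 minute match is not rare even at the 30 minute mean, so several long runs per setting are needed before drawing a conclusion from the timings.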
I'd like to hear some thoughts from the developers or experienced users at this point. Can we close in on the source of this GPU lockup, knowing that Anti Aliasing greatly affects its frequency in Darkplaces engine? Are there any open bugs about AA related X11 crashes I should check out? What else can I test, ideally still under Xonotic where I have the best test case prepared?
Created attachment 138438
Screenshot of the Blender window glitching
I should add another detail to the discussion. I know this may be a separate issue which might have nothing to do with the crash, but at the same time I wouldn't be surprised if it does: Glitched graphics often indicate something going wrong with the display, such as corrupt textures in video memory, which may ultimately lead to just such a lockup.
On occasion, certain programs (namely Firefox and Blender) glitch out and draw broken rectangles all over the window. Some of those glitches are just boxes of random colors, others contain pieces of past images (for instance I saw patterns from my lock screen background). Sometimes they quickly disappear on their own, at other times I have to restart the program as it becomes illegible and unusable. If I move anything the squares flicker all over the place. The glitches continue even after I disable desktop effects, thus KDE compositing should have nothing to do with it.
Attached is a screenshot of the glitch happening in Blender, showing its window covered in the corrupt squares. I'm curious what your opinion is. Again I know this may be an unrelated issue, but I'm wondering whether it indicates some video storage corruption that's also leading up to the lockups.
I'm still struggling to debug this. The more I see the more my jaw drops.
First of all, the rule that disabling anti-aliasing decreases the frequency of the freeze (see the comments above) was just patched out: AA no longer has any effect either, it always freezes between 0 and 30 minutes now.
I ran the following new tests in Xonotic, none of which had any influence:
- Running with the following environment variable set: R600_DEBUG=checkir,precompile,nooptvariant,nodpbb,nodfsm,nodma,nowc,nooutoforder,nohyperz,norbplus,no2d,notiling,nodcc,nodccclear,nodccfb,nodccmsaa
- Disabling all shaders, even turning off OpenGL 2.0 support entirely.
- Resetting the entire BIOS to its failsafe defaults, making sure that neither overclocking nor any other settings are involved.
- Running under both an X11 and Wayland session (Plasma). In Wayland it crashes instantly so it's even worse.
- Verified that this occurs on both the "radeon" and "amdgpu" modules, meaning the video driver makes no difference either.
It's clear to me at this point that this is the work of a professional: The code causing the crash is carefully maintained and injected into my system. If this was just a bug, at least one of the countless things I tried would have affected it somehow, it's impossible for a randomly occurring bug to survive so many different settings and environments... the issue instead is adaptive, so that the moment I find and disable one implementation another is activated within minutes to keep the crashes going. I imagine the objective is to block the user from finding a solution and ultimately censor them from using specific programs. I find it unbelievable that someone out there is actively doing this.
Please help me get to the bottom of this: The crash clearly acts by simulating some sort of bug, so there must be a vulnerability deep in the system which hidden code is exploiting. I offered a lot of test data on this report: If the developers read this, please let me know what to try next!
Created attachment 138483
Output of: watch cat /sys/kernel/debug/dri/0/amdgpu_pm_info
I decided to turn my attention to the last logical thing I can imagine: DPM (Dynamic Power Management) and the clocks on my video card. The kernel added support for realtime tuning of the frequencies a while ago, so I was pondering if the default setup may have led to excess overclocking.
I left a console to watch the file /sys/kernel/debug/dri/0/amdgpu_pm_info which I understand contains the video card frequencies. The maximum "power level" I seem to reach is 4, at sclk 101500 and mclk 140000. I'm attaching the peak output of this file here.
My video card is supposed to run at 1015 MHz (core clock) + 5600 MHz (memory clock). I don't fully understand how those numbers translate to frequencies, but from what I heard that represents the MHz * 100. If such is the case, my GPU clock is just right whereas my VRAM is actually under-clocked to a quarter of its default frequency! Can anyone confirm this so at least the hypothesis of bad clocks is out of the way?
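The conversion asked about above can be sketched as follows. It assumes that amdgpu_pm_info reports sclk and mclk in units of 10 kHz (which matches the "MHz * 100" description), and that the advertised 5600 MHz memory figure is the effective GDDR5 data rate, commonly quoted as four times the actual memory clock. Both points are assumptions to be confirmed by the developers, and the function names are invented for this example.

```shell
# Convert a raw clock value in 10 kHz units to MHz.
clk_to_mhz() {
    echo $(( $1 / 100 ))
}

# Effective GDDR5 data rate, assuming four transfers per memory clock.
gddr5_effective_mhz() {
    echo $(( $1 / 100 * 4 ))
}

echo "sclk: $(clk_to_mhz 101500) MHz"
echo "mclk: $(clk_to_mhz 140000) MHz, effective $(gddr5_effective_mhz 140000) MHz"
```

If those assumptions hold, the reported mclk of 140000 corresponds to the card's nominal memory speed rather than to an under-clock.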
I may try testing with the kernel parameters "radeon.dpm=0 amdgpu.dpm=0" later: I tried doing so briefly but the performance is too horrible to play a game, so I'll instead leave a bot match running in spectator mode while I'm away.
(In reply to MirceaKitsune from comment #10)
> Created attachment 138438 [details]
> Screenshot of the Blender window glitching
> I should add another detail to the discussion. I know this may be a separate
> issue which might have nothing to do with the crash, but at the same time I
> wouldn't be surprised if it does: Glitched graphics often indicate something
> going wrong with the display, such as corrupt textures in video memory,
> which may ultimately lead to just such a lockup.
> On occasion, certain programs (namely Firefox and Blender) glitch out and
> draw broken rectangles all over the window. Some of those glitches are just
> boxes of random colors, others contain pieces of past images (for instance I
> saw patterns from my lock screen background). Sometimes they quickly
> disappear on their own, at other times I have to restart the program as it
> becomes illegible and unusable. If I move anything the squares flicker all
> over the place. The glitches continue even after I disable desktop effects,
> thus KDE compositing should have nothing to do with it.
> Attached is a screenshot of the glitch happening in Blender, showing its
> window covered in the corrupt squares. I'm curious what your opinion is.
> Again I know this may be an unrelated issue, but I'm wondering whether it
> indicates some video storage corruption that's also leading up to the
I think you should try running your hardware under Windows.
You might also want to check if you still have warranty on the card
and act quickly if it expires soon.
I do suspect that it is a common hardware fault that happens with most video cards over time. I also had it on my very old HD5670, but with some help I did manage to salvage it, for now. My first symptoms were problems with RAM, that could be "workaround"-ed by lowering the memory frequency to half of the nominal frequency.
The problem is in the micro BGA (Ball Grid Array). The GPU chip is the size of a fingernail and is placed on a "pad" that is usually about a square inch in size. The chip and the pad are connected with microBGA. The pad has a normal BGA on its other side that is soldered to the GPU card. With thermal expansion and contraction the solder of the micro pad fractures and starts to misbehave.
It's common to point out that lead-free solder is not as reliable under temperature changes.
You might have heard about solutions like baking the card or re-balling the BGA. I do not recommend trying these. Baking the card might damage other components on it (capacitors, everything plastic). Re-balling replaces the solder of the normal BGA, but it is very expensive manual labor that does not even fix the BGA which causes the problems.
All these "solutions" work because they also heat the small GPU chip and melt the microBGA solder.
If you are sure that this is your problem, you can find somebody who knows how to use a hot air soldering station and heat just the small GPU chip at 200-250°C for about 2-5 minutes. These are the temperatures and duration used for manufacturing the card, so they should be safe.
(In reply to iive from comment #13)
Like I said, I don't currently believe this is a hardware defect: My video card isn't even 3 years old. It's a card from Gigabyte, which makes high quality products. The temperature of the GPU is within bounds at all times (45°C to 70°C). Its GPU clock seems to be at the right frequency, whereas the memory clock appears to be at a quarter of the supported frequency, so it already is under-clocked and should be more stable! Also, why does only 3D ever produce the freeze, even simple scenes that stress neither the GPU nor the VRAM... whereas 2D never does it even when it's intensive (e.g. games, desktop compositing)?
If people believe hardware hasn't been ruled out, please suggest a GPU stressing tool for Linux (I use openSUSE Tumbleweed) which you believe is adequate for this situation. I no longer have Windows and can't redo my entire setup by installing another OS, this is my main desktop on which I do my work and activities.
I still think this is related to a driver or kernel vulnerability of some sort. Please let me know which logs I can post or what else I can monitor to confirm this and see exactly where and how it's happening.
Today I've run two tests to ensure that frequencies and DPM are not a factor.
- Setting the DPM profile to low by running the following commands as root:
echo battery > /sys/class/drm/card0/device/power_dpm_state
echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level
- Booting my system with the following Kernel parameters to disable DPM:
Just like with everything else, they made absolutely no difference: Xonotic froze the machine after only 8 minutes of running each time. The settings are applied and visible by checking /sys/kernel/debug/dri/0/amdgpu_pm_info, and are even reflected in the performance which was reduced from 60 FPS to below 30 FPS.
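For anyone repeating this test, a small helper along these lines could confirm that the DPM settings were actually applied before starting a run. It is a sketch only: it reads the two sysfs files written by the commands above, the card0 path is taken from those commands and may differ on other systems, and the function name show_dpm is made up.

```shell
# Print the current DPM state and forced performance level for a card.
# The device directory defaults to the card0 path used in this report.
show_dpm() {
    dpm_dev="${1:-/sys/class/drm/card0/device}"
    for dpm_file in power_dpm_state power_dpm_force_performance_level; do
        if [ -r "$dpm_dev/$dpm_file" ]; then
            printf '%s=%s\n' "$dpm_file" "$(cat "$dpm_dev/$dpm_file")"
        else
            printf '%s=unavailable\n' "$dpm_file"
        fi
    done
}

show_dpm
```

Running it before and after the two echo commands should show the values change; "unavailable" means the file does not exist or is not readable on that system.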
This is NOT a hardware failure: The freezes occur identically even if both the core (GPU) and memory (VRAM) clocks are under-clocked to very safe frequencies. The key must be something in the Linux firmware for this card.
I have moved on to testing the various kernel parameters available for my driver and card. As was pointed out by malcolmlewis on the openSUSE forums, they can be listed with the following commands:
modinfo amdgpu
systool -vm amdgpu
I tested nearly half of them today, almost none made any difference. There were however a few settings that appeared to influence the frequency of the freeze. The most notable one of all seems to be the following:
With no parameters changed, the freeze now occurs roughly once per 30 minutes in Xonotic. With that move rate limited to 4MB/s however, I seemingly reduced it to roughly once per 90 minutes! The FPS will constantly drop and recover, but that makes sense as this setting explicitly limits the buffer migration rate.
I may test other variables in the days to come, but for now I'm hoping this offers at least some clue to get things started. My feeling is that the video card may be slowly loaded with information until something fills up, or perhaps some events throw too much data in at once and it reaches a bottleneck?
I have a very important preliminary result: Today I tested the last amdgpu parameters on the list, and seem to have found a set that greatly mitigates the problem. Those parameters have given me up to 144 minutes before experiencing the freeze, a huge record compared to the previous 90 minutes! They are:
By default, all 4 of those settings are set to 0 by the system. Setting them to 16 has, at least during one test case, reduced the problem to 1/5 of its previous frequency. The descriptions of the variables are:
parm: prim_buf_per_se:the size of Primitive Buffer per Shader Engine (default depending on gfx) (int)
parm: pos_buf_per_se:the size of Position Buffer per Shader Engine (default depending on gfx) (int)
parm: cntl_sb_buf_per_se:the size of Control Sideband per Shader Engine (default depending on gfx) (int)
parm: param_buf_per_se:the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx) (int)
I will continue trying different values and seeing how tweaking them changes the issue. Please let me know what you think.
(In reply to MirceaKitsune from comment #17)
> I will continue trying different values and seeing how tweaking them changes
> the issue. Please let me know what you think.
Those parameters are not used on your chip.
(In reply to Alex Deucher from comment #18)
> Those parameters are not used on your chip.
That would be quite something, since after setting them I've clearly seen an enormous difference. I will investigate further in the upcoming days.
(In reply to MirceaKitsune from comment #16)
> I have moved on to testing the various kernel parameters available for my
> driver and card. As was pointed out by malcolmlewis on the openSUSE forums,
> they can be listed with the following commands:
> modinfo amdgpu
> systool -vm amdgpu
> I tested nearly half of them today, almost none made any difference. There
> were however a few settings that appeared to influence the frequency of the
> freeze. The most notable one of all seems to be the following:
> With no parameters changed, the freeze now occurs roughly once per 30
> minutes in Xonotic. With that move rate limited to 4MB/s however, I
> seemingly reduced it to only 90 minutes! The FPS will constantly drop and
> recover, but that makes sense as this setting explicitly limits the buffer
> migration rate.
> I may test other variables in the days to come, but for now I'm hoping this
> offers at least some clue to get things started. My feeling is that the
> video card may be slowly loaded with information until something fills up,
> or perhaps some events throw too much data in at once and it reaches a
You are making progress.
I just want to give you a few tips.
1. You are always using 3D acceleration. The glamor driver that Xorg uses for 2D (DDX) acceleration draws using EGL and shaders. If you have a compositing manager (KDE has one), it might put even more load on it.
You might try "AccelMethod" "None" in xorg.conf, just to check if it makes any difference. I hope that won't disable OpenGL entirely...
2. My video card is also from Gigabyte. I had it replaced once, because in the first month my initial card (same model) had major issues, like not starting up at boot after a few hours of gameplay.
3. On my chip failure, the affected pins were the ones controlling the internal video RAM. If you have chip problems, it might affect other pins first, like the PCIe ones. So a hardware problem is not ruled out.
4. The PCIe standard allows using fewer parallel lanes for data transfer. If broken pins are suspected, moving to a 4x slot might alleviate the issue.
BTW, I see that the card is on PCI_ID #3.00.1, is it in the first slot? Usually the first slot is 16x and has extra electric power.
5. If you suspect issue with filled RAM, you might try environment variable "GALLIUM_HUD" it has some GTT displays.
6. In that manner of thinking. Make sure that kernel option for CMA is disabled... that's been causing me problems every time I enable it. You might also have IOMMU enabled, try disabling it, just for tests.
Keep digging and good luck.
(In reply to iive from comment #20)
That's some amazing feedback, thank you very much! I'll definitely try those out, but I have a few questions about a few of these points:
1. My OS (openSUSE Tumbleweed) doesn't have an xorg.conf file. I instead have an /etc/X11/xorg.conf.d directory with the following files in it:
5. So before running the program from a console, I do "export GALLIUM_HUD=1" or "GALLIUM_HUD=1;./my_program"? I don't suspect filled RAM as the game's process doesn't seem to leak memory... I do however suspect filled VRAM.
6. What is the Kernel parameter for CMA please? It doesn't seem to be an amdgpu setting so I assume it's separate.
(In reply to MirceaKitsune from comment #21)
> (In reply to iive from comment #20)
> That's some amazing feedback, thank you very much! I'll definitely try those
> out, but I have a few questions about a few of these points:
> 1. My OS (openSUSE Tumbleweed) doesn't have an xorg.conf file. I instead
> have an /etc/X11/xorg.conf.d directory with the following files in it:
It goes in the Section "Device" of the video driver.
You can see all options with `man amdgpu` or
`man radeon`, etc.
> 5. So before running the program from a console, I do "export GALLIUM_HUD=1"
> or "GALLIUM_HUD=1;./my_program"? I don't suspect filled RAM as the game's
> process doesn't seem to leak memory... I do however suspect filled VRAM.
Do `GALLIUM_HUD=help glxgears` it would print all available graphs and the syntax.
Here is what I used last time:
> 6. What is the Kernel parameter for CMA please? It doesn't seem to be an
> amdgpu setting so I assume it's separate.
You can look in linux-source/Documentation/admin-guide/kernel-parameters.txt
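On the question of "export GALLIUM_HUD=..." versus "GALLIUM_HUD=...;./my_program": the difference is plain shell behavior and can be demonstrated without Mesa. The sketch below uses a stand-in variable named DEMO_HUD (a made-up name) and a child shell in place of the real program; GALLIUM_HUD itself should behave the same way.

```shell
# DEMO_HUD is a stand-in name so the demo runs without Mesa installed.
unset DEMO_HUD

# 1) Assignment prefixed to the command: visible to that command only.
DEMO_HUD=fps sh -c 'echo "prefix:    ${DEMO_HUD:-unset}"'

# 2) Assignment followed by ';': a plain shell variable, not exported,
#    so the child process does not see it.
DEMO_HUD=fps; sh -c 'echo "semicolon: ${DEMO_HUD:-unset}"'

# 3) export: visible to every command started afterwards from this shell.
export DEMO_HUD=fps
sh -c 'echo "export:    ${DEMO_HUD:-unset}"'
```

So the form with the semicolon would not reach the program, while either the prefix form or export would. The value also has to be a graph list taken from the `GALLIUM_HUD=help` output rather than "1".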
I just finished running the GALLIUM_HUD test, and will be taking a look at the other options next. It was more difficult to test now since the freeze occurs almost instantly after Xonotic loads the map (only a few seconds).
I managed to take two pictures which I'll attach below: One is the last screenshot I managed to take within the system a few seconds before it froze. The other is a photo of my screen after the freeze has taken place, obviously taken with my phone camera as the computer itself was bricked.
A footnote I will add, even if I don't know whether people will even believe me: I had to take those screenshots several times because they kept getting deleted. Whenever I booted the machine back, every screenshot of Xonotic with this HUD was corrupted and turned into a 0 byte file... even ones that were quickly moved to other directories precisely to avoid this, and were also taken by an external process! Other files on my drive are fine, it's only those screenshots... thankfully one survived and it shows all of the graphs and parameters right before the crash. I'm legitimately creeped out, as I didn't expect a potential attack program to contain software capable of identifying and deleting evidence of testing, which is the only explanation I can find for what I just saw. I'll do the next tests carefully as I don't know what else may happen to my computer.
Created attachment 138798
GALLIUM_HUD pre crash screenshot
Created attachment 138799
GALLIUM_HUD post crash photo
I performed the next test suggested to me, by changing /etc/X11/xorg.conf.d/50-device.conf to the following content:
Section "Device"
  Identifier "Default Device"
  Option "AccelMethod" "None"
EndSection
The frequency of the crash was reduced from a matter of seconds to 45 minutes, but a freeze still occurred after that time.
Losing recently written files is unfortunately way too common, despite all filesystems using journaling.
It might help if you call `sync` after writing the file.
If you have a kernel with magic SysRq enabled, after a crash you could hold "Alt+PrintScrn" and then press (one by one) "s" to sync, "u" to unmount and "b" to reboot.
All info about it could be found in:
Since hangs now happen within a minute of starting gameplay, does that mean that the "workarounds" that you reported previously don't help anymore?
A few ideas to test.
1. Try disabling gallium threads. They are a recent feature and it seems they've been working a lot in your graphs.
Check also /etc/drirc , ~/.drirc etc...
2. I'm not quite sure what is the difference between num_shaders_created and num_compilations, but at the crash there are 2 shaders created and 0 compiled.
This reminds me that you might want to turn off the shader cache. This might introduce some stuttering during gameplay.
3. Your framerate is limited to 60fps. It's synced to your monitor vertical refresh. Try
and see if you can control it from the game.
See what happens when you disable it. (Might make things much worse, much faster.)
4. Generally it is not a good idea to test hangs with real gameplay. It is too random. It would be ideal if you could record an apitrace that reproduces the hang reliably.
Obviously it might not be possible to do that recording on the system that hangs. (The trace could be lost at reboot, or the commands that cause the hang might not even be written).
If you have another machine or video card, that works reliably, try recording gameplay of a single level. Then do the test replaying it. Would it play entirely, would it hang, would it hang at the same place?
Can you trigger a hang with `glxgears`?
5. You might find something else to test here (e.g. disable DRI3?):
Just finished the last test from yesterday's recommendations. It appears I cannot boot with iommu=off as that disables all USB devices, so I can't use a keyboard and mouse and cannot do anything. I tried the closest working equivalent I could find, which still froze after 15 minutes from bootup:
cma=0 iommu=soft intel_iommu=off amd_iommu=off
(In reply to iive from comment #27)
Thanks again, I'll be moving to these tests next and posting the results here.
For the first time ever, I might finally have some very good news on this issue! It will take several more days to confirm, then possibly another month to pinpoint the exact option responsible. However it's possible I may have found something that finally gets rid of the crash.
The issue appears to go away when playing Xonotic with these parameters:
export LIBGL_DEBUG=true LIBGL_NO_DRAWARRAYS=true LIBGL_DRI3_DISABLE=true MESA_DEBUG=true MESA_NO_ASM=true MESA_NO_MMX=true MESA_NO_3DNOW=true MESA_NO_SSE=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0
I additionally disabled the cvar "r_shadows 2" which I forgot I had on for a while now, as it enabled a shadowing system that might have itself been the culprit.
With these two changes, I was able to clock up to 120 minutes of continuous gameplay last night, followed by an outstanding 200 minutes today! That's over 2 and 3 hours respectively with no system freeze whatsoever. I need to repeat this test several times to be 100% sure there's not still some obscure chance of it happening, but in any case there is definitely a major difference visible.
Today's testing reveals an important detail I had missed: There are likely multiple different crashes taking place... or at most one crash but triggered by several unrelated occurrences. There is no one option or central point.
In the case of Xonotic: Using shadows (r_shadows 2) was by far the primary factor... even without that however, a crash may still occur after roughly 3 hours of a match running. The MESA variables I mentioned are likely the source of the second much rarer crash (seen after 2+ hours).
I don't know how I'm going to get to the bottom of this and find the exact parameters involved: I can't leave 4 hour matches running every single day, and even then I'd have to test a combination several days in total. I'll continue slowly testing, but at this rate expect it to take many months.
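One way to cut down the number of long test runs is to bisect the variable list instead of toggling variables one at a time. The sketch below is only an illustration: the helper names are made up, the variables shown are a subset of the export line above, and the approach assumes a single variable is responsible, which may not hold if several crashes are involved.

```shell
# Print the first half (rounded up) of the arguments, one per line.
first_half() {
    half=$(( ($# + 1) / 2 ))
    idx=0
    for item in "$@"; do
        if [ "$idx" -lt "$half" ]; then printf '%s\n' "$item"; fi
        idx=$(( idx + 1 ))
    done
}

# Print the remaining arguments, one per line.
second_half() {
    half=$(( ($# + 1) / 2 ))
    idx=0
    for item in "$@"; do
        if [ "$idx" -ge "$half" ]; then printf '%s\n' "$item"; fi
        idx=$(( idx + 1 ))
    done
}

# A few of the variables from the export line, used as an example.
vars="MESA_NO_ASM=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true LIBGL_DRI3_DISABLE=true"
echo "Next run, set only:"
first_half $vars
echo "If the freeze comes back, keep bisecting among:"
second_half $vars
```

With 14 variables this needs about four rounds rather than fourteen, although each round still has to run long enough to be meaningful.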
(In reply to MirceaKitsune from comment #29)
> For the first time ever, I might finally have some very good news on this
> issue! It will take several more days to confirm, then possibly another
> month to pinpoint the exact option responsible. However it's possible I may
> have found something that finally gets rid of the crash.
> The issue appears to go away when playing Xonotic with those parameters:
> export LIBGL_DEBUG=true LIBGL_NO_DRAWARRAYS=true LIBGL_DRI3_DISABLE=true
> MESA_DEBUG=true MESA_NO_ASM=true MESA_NO_MMX=true MESA_NO_3DNOW=true
> MESA_NO_SSE=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true
> MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0
> I additionally disabled the cvar "r_shadows 2" which I forgot I had on for a
> while now, as it enabled a shadowing system that might have itself been the
> With these two changes, I was able to clock up to 120 minutes of continuous
> gameplay last night, followed by an outstanding 200 minutes today! That's
> over 2 respectively 3 hours with no system freeze whatsoever. I need to
> repeat this test several times to be 100% sure there's not still some
> obscure chance of it happening, but in any case there is definitely a major
> difference visible.
"MESA_NO_ASM=true" supersedes the other "MESA_NO_MMX=true MESA_NO_3DNOW=true MESA_NO_SSE=true", so you don't need to make combinations with all of them.
Also I don't see you testing `export mesa_glthread=false`. Race conditions are one of the hardest bugs to catch and reproduce.
If you think that 'r_shadow' could quickly and "reliably" trigger a hang, then I would ask you to focus on it first.
1. Read about sysrq and make sure you have it enabled in the kernel and that it works. Make sure you have text console, as it might need it.
2. Enable back "r_shadows 2"
3. Use apitrace to capture a hang, while playing the game.
4. Try to reboot gracefully, using sysrq to sync and reboot, or get in text console and restart.
5. Test if the recorded trace could reproduce the crash reliably.
If the trace seems complete and it cannot reproduce the bug, then maybe it does capture everything, but the bug is not a simple infinite loop in the shader. (These seem to be a common cause of hangs.)
If the bug can be reliably reproduced, it will be fixed.
(In reply to iive from comment #31)
Sounds a lot more complicated, but I'm gladly willing to try it as long as there's no risk of anything permanently breaking my system.
The main problem is that I wasn't able to get the SysRq keys working in openSUSE Tumbleweed when I tried to enable the "REISUB" keys. I could really use clear instructions on how to enable and test them in openSUSE... ideally during runtime without having to make any permanent system changes.
I need to remember how apitrace works, it's been a while since I used it. I also remember it generated a really huge file, and the longer you run the program the bigger it gets... if the freeze doesn't happen within a few seconds I may have a 1+ GB trace, and I'm not sure where I can share that with the devs online.
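A possible runtime check, offered as a sketch rather than openSUSE-specific instructions: the kernel exposes the SysRq setting in /proc/sys/kernel/sysrq, where 0 disables everything, 1 enables everything, and larger values are a bitmask (16 for sync, 32 for remount read-only, 128 for reboot, per the kernel's SysRq documentation). The function name sysrq_allows is invented for this example.

```shell
# Succeed if the given SysRq bitmask permits the function whose bit
# value is passed as the second argument.
sysrq_allows() {
    sysrq_mask="$1"
    sysrq_bit="$2"
    if [ "$sysrq_mask" -eq 1 ]; then
        return 0    # 1 means every SysRq function is enabled
    fi
    [ $(( sysrq_mask & sysrq_bit )) -ne 0 ]
}

# Report the functions needed for the sync / remount / reboot sequence.
if [ -r /proc/sys/kernel/sysrq ]; then
    current="$(cat /proc/sys/kernel/sysrq)"
    for entry in 16:sync 32:remount-read-only 128:reboot; do
        bit="${entry%%:*}"
        name="${entry#*:}"
        if sysrq_allows "$current" "$bit"; then
            echo "$name: allowed"
        else
            echo "$name: blocked"
        fi
    done
fi
```

Writing 1 to that file as root (or running `sysctl kernel.sysrq=1`) changes the setting only until the next reboot, so nothing permanent is modified.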
One thing to note: I have two computers at home, with mine being the crashy one and my mother's being an old and slow but stable machine. I can use SSH to connect in between them from bash. The problem is that the moment my machine freezes, its SSH connection instantly dies on the other PC as well... therefore I'm not sure how helpful this option is.
The "r_shadows 2" option in Xonotic clearly makes a difference: Without it the crash only occurs after 3 hours... with it it's anywhere between a few seconds and at most 45 minutes. Definitely my best test case so far.
This doesn't sound good.
The sshd dying indicates that the kernel or the CPU has hung. If there is a GPU shader hang, this doesn't happen right away: the kernel usually waits 10 seconds before attempting to reset the GPU and then panics.
1. When the system hangs, do you see LEDs on the keyboard flashing?
When the kernel panics, this is how it signals it. You might need to wait for 10 seconds or a minute...
2. It seems that openSUSE disables SysRq. Google told me that
"You can enable it in YaST->Security and Users->Security Center and Hardening..."
Alternatively you should be able to enable it by executing this as root:
echo 1 > /proc/sys/kernel/sysrq
Check if it works with "Alt+PrtScr+h", it should display a help message in `dmesg`.
3. After you have SysRq working, try to reproduce the crash (without apitrace).
This is to check whether SysRq works at all during the hang and, if it does, to hopefully get a kernel panic message in the log.
4. If you cannot get crash messages in the logs/journal, then you might need to use a `serial console` or `netconsole`.
The serial console is the best option, if both computers have their own serial ports and you happen to have a serial cable to connect them.
Otherwise you might try the network console logger, which sends UDP packets to the second computer.
Setting these up might be tricky, as they might not even be compiled into the stock kernel. So if you need detailed instructions, at least check whether they are present as modules or built into the kernel.
zgrep CONFIG_NETCONSOLE /proc/config.gz
zgrep SERIAL_8250_CONSOLE /proc/config.gz
5. Disable vsync and run `glxgears` for hours. Leave it to work through the night or something.
I just want to know if your computer hangs with that simple 3D.
Let me be clear.
I want to see the crash messages for only 2 reasons:
- To see that there is a kernel crash.
- To see if the crash is in the graphics stack.
Since the `sshd` stops working, it might be a network-card crash. (Multiplayer games, using network...)
If the machine just hangs, without an actual kernel crash... then it might be a hardware problem, and not necessarily the graphics card: it might also be the motherboard, CPU, PSU, RAM, etc...
(In reply to iive from comment #33)
Ahhh... you've reminded me of a detail that I have in fact noticed but forgot to mention: After the machine freezes and becomes completely unresponsive, some keyboard leds will indeed turn off after roughly 10 seconds. I only noticed this because I currently have a backlit keyboard that has the lighting controlled by the Scroll Lock LED... I saw that a few seconds after the crash, the keyboard lighting always turns itself off.
I cannot connect the computers with a serial cable: I think the motherboards are too modern to have a serial port, and they're at a far distance in opposite rooms. Both computers are connected to the same home router via LAN cables though, and can communicate through local IP... so net console sounds like a good idea, but I've never heard of it before so I'll have to look this up.
Your sysrq suggestion seems to have worked! I first did this:
echo 1 > /proc/sys/kernel/sysrq
Then I pressed 'Alt + PrintScreen + H'. Now dmesg shows me:
[265102.938475] sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)
I assume that after the crash, I should first use it to test REISUB?
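As a side note on the SysRq setting used above: writing 1 enables every SysRq function until the next reboot, and `sysctl kernel.sysrq=1` should have the same runtime-only effect. Other values are a bitmask, so a distribution default can allow some keys and block others. The following is a minimal sketch, assuming the bit meanings from the kernel's SysRq documentation; the function name `sysrq_allows_reisub` is made up for illustration.

```shell
# Decide whether a kernel.sysrq value permits the R-E-I-S-U-B keys.
# Bit meanings (per the kernel's sysrq documentation):
#   4 = keyboard control (r), 16 = sync (s), 32 = remount read-only (u),
#   64 = signalling of processes (e, i), 128 = reboot/poweroff (b).
# The exact value 1 means "all functions enabled".
sysrq_allows_reisub() {
    value=$1
    needed=$((4 + 16 + 32 + 64 + 128))   # 244
    if [ "$value" -eq 1 ] || [ $((value & needed)) -eq "$needed" ]; then
        echo yes
    else
        echo no
    fi
}

# Example (reading the current value needs no root):
# sysrq_allows_reisub "$(cat /proc/sys/kernel/sysrq)"
```

If this prints "no", only part of the REISUB sequence would be honoured even on a healthy system, which is worth ruling out before testing it during a freeze.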
And here is the output of the kernel features you said to check. If they're not there, I'm out of luck on this one, as I don't know how to compile my own kernel and can't risk breaking my machine with dangerous tests.
mircea@linux-qz0r:~> zgrep CONFIG_NETCONSOLE /proc/config.gz
mircea@linux-qz0r:~> zgrep SERIAL_8250_CONSOLE /proc/config.gz
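The result of such a config check can be made more explicit with a small helper. This is only a sketch: the function name `kconfig_state` is hypothetical, and it assumes the usual kernel config format where `=y` means built-in and `=m` means module.

```shell
# Report how a kernel config symbol is set: built-in, module, or not set.
# Reads the config text on stdin, so it works with any source of the config.
kconfig_state() {
    symbol=$1
    line=$(grep -E "^${symbol}=" | head -n 1 || true)
    case "$line" in
        *=y) echo "built-in" ;;
        *=m) echo "module" ;;
        "")  echo "not set" ;;
        *)   echo "${line#*=}" ;;   # any other value, printed as-is
    esac
}

# Example (assumes /proc/config.gz exists on the running kernel):
# zcat /proc/config.gz | kconfig_state CONFIG_NETCONSOLE
# zcat /proc/config.gz | kconfig_state CONFIG_SERIAL_8250_CONSOLE
```

On kernels that do not expose /proc/config.gz, the same text is often available as /boot/config-$(uname -r) and can be fed in with a plain redirect.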
Lastly I'll try glxgears without vsync for a few hours in the next days: I have to leave my computer locked while I'm away from home or sleeping, but can leave it on for roughly 3 hours of the day while I'm around but AFK. I should note that I tried running Xonotic without vsync, and that seems to have made absolutely no difference. Also this likely isn't network related, I always test in a local match with bots and not online multiplayer.
Enable SysRq, start Xonotic, set r_shadows 2, play until it crashes, use SysRq to sync, unmount, reboot.
After reboot, check if the crash has been captured by syslog/journald.
If there is nothing, then you'd have to use `netconsole`. openSUSE has it compiled as a module, so the description that involves `insmod` or `modprobe` applies to you.
If you have the crash in the logs, then it is more likely that the apitrace file will remain whole after the hang and reboot.
(In reply to iive from comment #35)
I will be busy tomorrow, and also wanted to look into how to do the other more complicated tests. I'll be trying this out sometime in the next days.
I have just finished performing the first new test.
First of all I must say I'm utterly amazed at the variability of this issue: Last week when I played Xonotic with "r_shadows 2", the freeze occurred in just a few seconds or minutes at most... today after a few openSUSE Tumbleweed snapshots, I was able to play for over 60 minutes even with this option enabled! It's clear that package updates are causing the lockup to vary unpredictably.
In any case, I can confirm the SysRq keys also stop functioning after the freeze: I tried REISUB for a few minutes, but there was no form of response.
I also kept both NumLock and CapsLock enabled during the game to better see how they behave in a crash. 5 seconds after the freeze, they both turned off... that is the very last noticeable activity of the system.
Nonetheless I will attach the logs you suggested below, just in case they still captured something. Please let me know exactly what you believe I should try next, in as much detail as possible since I'm unfamiliar with the other tests you hinted at in our last conversation (eg: apitrace).
Created attachment 138950 [details]
Created attachment 138951 [details]
journalctl --since yesterday
It seems like SysRq actually worked, I see it in the logs.
The crash happened around Apr 20 15:54. Unfortunately the kernel error/panic message is missing in both logs. My distro doesn't have systemd, so I cannot tell you what the magic journalctl options are to get these messages out. On my system, these usually go in /var/log/syslog. See if you can find something more useful from about that time.
As for the apitrace, it should work out of the box.
First test it with:
apitrace trace glxgears
It should create a glxgears.trace. If you run it again, it will create glxgears.1.trace, etc.
It's important to make sure that you are using a 64-bit apitrace with 64-bit applications.
apitrace trace ./xonotic-linux64-glx
Should do the trick.
First, start a game match and exit right away. Then try to replay the result with:
apitrace replay xonotic-linux64-glx.trace
This is to make sure that tracing is working properly.
Then just make sure you have enough free space. Enable vsync to limit the frames per second. Using smaller textures should also help with the trace size (textures are loaded at level start, so playing a longer match should help too).
After you record a crash and reboot with SysRq, see if replaying the resulting trace file causes a hang at its end.
You can compress the trace with `xz -9e xonotic*.trace` .
Created attachment 139052 [details]
apitrace trace xonotic-sdl
I have attempted the apitrace test as instructed. Something interesting happens: Whenever I run Xonotic through the apitrace command, it always crashes a few seconds after the match starts. By crashes I don't mean the GPU crash, but the process closing and sending me back to the desktop. This never happens when running Xonotic normally, only when running it through apitrace... I tried it several times to be extra sure of this.
I'm attaching the apitrace it successfully recorded, which just barely fit under 200 MB thanks to xz compression.
Created attachment 139053 [details]
Output of: apitrace trace xonotic-sdl
Here is the console output generated by the Xonotic process when it crashes to the desktop through apitrace.
Created attachment 139054 [details]
Output of: apitrace replay xonotic-sdl.trace
And here's the console output of Xonotic crashing to the desktop when replaying the recorded apitrace it crashed with.
You haven't looked for the kernel panic message in the logs.
I'm still waiting for it.
As for the trace: I downloaded and traced Xonotic before writing the instructions. I had no issues using the EXACT commands I have given you.
Why are you asking for instructions if you do not follow them?
You are using the SDL instead of the GLX and that might be what causes issues.
Please remove the failed traces. If the trace cannot reliably reproduce the hang then it is useless.
Let me say it again. If you cannot hang your computer using the recorded trace, we have no use for it. And it must hang at the exact same place.
(In reply to iive from comment #44)
I'm doing my best to debug this as well as possible, but there are a lot of points and it's hard to pay attention to everything. I thought the crashing to the desktop in the trace I posted might still hold some information, especially as it seemed to complain about some OpenGL related issues.
I'll try the GLX version as well and let you know if that works. If not it means I can't test using apitrace, because Xonotic crashes for some reason I can't explain when I run it on that.
And I don't yet know where to extract the kernel logs from on my distribution. I don't have a /var/log/syslog so it must be somewhere else. I'll try looking that up as well and will post them once I find them.
I found out what's causing the apitrace crash: It no longer happens when I use fresh settings, therefore something in my config was breaking it. Upon further inspection it seems to be one of the visual effects. I'll try again with different settings, and see if I can find a config that still triggers the GPU lockup without also making apitrace crash to the desktop.
I also looked up where the kernel output should be located on openSUSE. I'm told that it's /var/log/messages which indeed exists on my system. Next time I experience the freeze, I will post that file here as instructed.
(In reply to MirceaKitsune from comment #46)
You can report the apitrace crash to the apitrace issue tracker. Include everything needed to replicate it (i.e. Xonotic version, options, commands).
You have already posted a /var/log/messages with a crash here, and it doesn't have any kernel oops/panics. There is no point in posting another.
See if there are any other /var/log/* files that might contain these.
Here is how a shader hang looks in my logs (that bug was reported and fixed):
kernel: [19746.660911] radeon 0000:01:00.0: ring 0 stalled for more than 10030msec
kernel: [19746.660915] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000039874d last fence id 0x0000000000398759 on ring 0)
kernel: [19746.844799] radeon 0000:01:00.0: couldn't schedule ib
kernel: [19746.844837] [drm:radeon_uvd_suspend [radeon]] *ERROR* Error destroying UVD (-22)!
kernel: [19748.260945] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait timed out.
kernel: [19748.260965] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-110).
kernel: [19748.438730] radeon 0000:01:00.0: couldn't schedule ib
kernel: [19748.438761] [drm:radeon_uvd_suspend [radeon]] *ERROR* Error destroying UVD (-22)!
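To check a log for messages like the ones above, a simple filter is enough. The sketch below is an illustration only: the function name `scan_gpu_hang` is made up, the first three patterns are taken from the radeon lines quoted above, and the "Kernel panic" and "Oops:" patterns are an assumption about how generic kernel crashes are usually logged.

```shell
# Print only the log lines (from stdin) that look like a GPU hang or a
# kernel crash. Prints nothing, and still succeeds, if there are none.
scan_gpu_hang() {
    grep -E 'GPU lockup|ring [0-9]+ stalled|fence wait timed out|Kernel panic|Oops:' || true
}

# Examples:
# scan_gpu_hang < /var/log/messages
# journalctl -k -b -1 | scan_gpu_hang   # kernel messages of the previous boot
```

The journalctl variant only helps if the journal is stored persistently; otherwise the previous boot is not available after a reset.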
Created attachment 139071 [details]
I managed to trigger the GPU freeze while running Xonotic through apitrace. Upon restarting the machine, I still found the resulting xonotic-glx.trace file on my drive. Unfortunately the trace seems to end several seconds before the crash, despite my attempts to restart the system using the REISUB SysRq keys. A warning in the console also indicates this when playing back the trace:
warning: unexpected end of file while reading trace
Do you have any advice on how I can make sure the trace captures the moment of the freeze, rather than stopping several seconds before it happens? Would it perhaps be possible to use SSH or some other tool to create or store the recorded trace on my other computer over the LAN connection?
Meanwhile I'm attaching the output of /var/log/messages which I understand is the name of /var/log/syslog for my distribution. Please let me know if this is the correct kernel log you mentioned.
Sorry about that: I posted my last message before reading your last one, and didn't notice your mention about /var/log/messages being obsolete. Just mentioning so you don't think I ignored what you said.
I'll look deeper in /var/log for anything useful. Here's what that directory contains in case the list is of any use:
linux-qz0r:/var/log # ls ./ -1
If this `messages` file is from the failed apitrace crash recording, then maybe you should try again.
In the previous file I could see that SysRq had been used. Since the first command you use also kills all programs, including the logging, there are no more logs from the session.
I do not see anything from SysRq in the new `messages` file. So one possibility is that you forgot to enable SysRq and it just hung. Another possibility is that you need to do things with certain timing.
Here is how "transactions" work in Linux.
The program (apitrace) can make many writes and they could all be cached in RAM, without being sent to disk. Even if they are written to disk, they might not be committed to the file until `flush()` or `close()` is called on the file.
That is, "in theory", the file should remain unchanged until it is flushed, even when old content is overwritten.
SysRq+r takes over the keyboard, so press that first.
SysRq+e sends the TERM signal to all programs. It is very likely that apitrace can handle that signal and close all files, thus committing them to disk. Give it a few seconds to finish. Count to 5 or wait until the HDD stops working.
SysRq+i sends the KILL signal to all programs. This is forcible termination and might eliminate programs that are still handling the TERM. In your case I would ask you not to use that.
SysRq+s syncs all kernel-buffered writes to disk. Wait for the HDD to stop before pressing the next key combination.
SysRq+u unmounts all filesystems (remounting them read-only). Same as before, wait for the HDD to stop before pressing the next key.
So basically, watch the HDD LED on the PC case and wait for it to stop before pressing the next key.
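The key order described above can be rehearsed from a shell while the system is still healthy. This is a minimal sketch under stated assumptions: the function name and the `SYSRQ_APPLY` / `SYSRQ_DELAY` variables are invented for illustration, and it relies on /proc/sysrq-trigger, which accepts the same letters as the keyboard combination when written to as root. It deliberately skips "i", as suggested above.

```shell
# Print the gentle shutdown sequence in order. Nothing is sent to the kernel
# unless SYSRQ_APPLY=1 is set; in that case each letter is written to
# /proc/sysrq-trigger as root, with a pause so disk activity can settle.
# WARNING: with SYSRQ_APPLY=1 the final step really reboots the machine.
sysrq_gentle_reboot() {
    for step in "r:unraw keyboard" "e:SIGTERM to all tasks" \
                "s:sync" "u:remount read-only" "b:reboot"; do
        key=${step%%:*}
        desc=${step#*:}
        echo "$key $desc"
        if [ "${SYSRQ_APPLY:-0}" = "1" ]; then
            echo "$key" > /proc/sysrq-trigger
            sleep "${SYSRQ_DELAY:-5}"
        fi
    done
}
```

This cannot help once the kernel itself has stopped responding; it is only useful for confirming that each step behaves as expected beforehand.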
Keep in mind that when you kill all programs, systemd remains, as it runs as init (PID 1), and it will try to restart everything again. So disabling some services you don't need might be a good idea. I think I saw the Apache web server in the previous log.
Good Luck and try again.
Created attachment 139103 [details]
Output of: systemctl | grep running
(In reply to iive from comment #50)
I've performed many more tests during the past two hours, getting nearly a dozen freezes in the process. I tried with both the glx and sdl versions of Xonotic, and even ran the "Alt + SysRq + RESUB" combination at different rates (instantly as well as 1 minute in between each press). Before each test I made sure the SysRq keys were working, by using "Alt + SysRq + H" then checking that the help message appears at the end of the "dmesg" output.
In all cases the trace file never catches the crash: Either I find a zero byte file when I reboot, or it ends several seconds before the crash.
I couldn't find any obviously useless systemctl services that I can shut down (such as Apache). In case there is anything dangerous or that I could disable in there, I'm attaching the output of "systemctl | grep running".
I'm clearly going to need a different approach to recording this trace: apitrace must be dying the moment the lockup occurs, so it never finishes writing the complete trace file. I found some info on how a trace can be played back from a remote machine, but not how to record it from one. What should I do next?
(In reply to MirceaKitsune from comment #51)
I'm running out of ideas.
I just want to make sure that the `apitrace` you are using is recent enough.
The last release, apitrace-7.1, is almost 3 years old and is missing many fixes. For example, there is a two-year-old commit that calls "localWrite.flush()" on "_exit".
(In reply to iive from comment #52)
Ah... I do indeed have apitrace 7.1-3.89. I don't know if a newer version exists on https://software.opensuse.org which is currently down. I need to head off to bed in a minute, but I'll check whether a new version compiled for my OS exists tomorrow (if anyone knows please share a link). I'm sorry this is taking so much effort, and I'm sure the cause of this freeze can't elude us forever.
I have some very interesting results from today: As instructed, I used the latest version of apitrace. I cloned it straight from its GitHub repository and compiled it myself, then ran Xonotic through it.
Same thing: The trace always ends several seconds before the moment of the freeze and prints an "end of file" warning in the console.
Then I decided to do something different: I ran Blender 3D through apitrace, loading up the scene that triggered this same lockup last time. I used various features and went into several modes which I remembered were responsible. Eventually I got the exact same freeze as I do with Xonotic. I rebooted and played back the Blender trace. Same story as with Xonotic: It cuts a few seconds earlier and complains about EoF.
But there's a bizarre twist this time: When playing back the trace generated by Blender, my system will freeze at various points during the replay! Sometimes it freezes early, sometimes it freezes late, at other times I can replay the whole trace without getting a freeze at all.
This is very peculiar: The crash must be occurring beyond what apitrace is even capturing, likely something deep in the kernel or renderer which is only triggered when the conditions are just right. What do you make of this?
(In reply to MirceaKitsune from comment #54)
> But there's a bizarre twist this time: When playing back the trace generated
> by Blender, my system will freeze at various points during the replay!
> Sometimes it freezes early, sometimes it freezes late, at other times I can
> replay the whole trace without getting a freeze at all.
> This is very peculiar: The crash must be occurring beyond what apitrace is
> even capturing, likely something deep in the kernel or renderer which is
> only triggered when the conditions are just right. What do you make of this?
Well, this makes a hardware issue a lot more probable.
Still, it is good that you have a trace that can trigger crashes.
Having a trace that issues the same OpenGL commands eliminates a lot of variables.
From now on, you should use only this trace for your tests.
But first, you should try and setup `netconsole`.
I haven't used it myself so I can't give you any hints.
Still, the documentation looks detailed. AFAIR you have it as a module.
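For orientation, a netconsole setup usually comes down to one module parameter. The following is a rough sketch, assuming the parameter format from the kernel's netconsole documentation (`[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`); the function name, IP addresses, and interface name are placeholders, not values from this report.

```shell
# Build the netconsole module parameter.
# Arguments: source IP, network device, target IP, optional target MAC,
# optional UDP port (defaults to 6666). An empty MAC means broadcast.
netconsole_param() {
    src_ip=$1; dev=$2; tgt_ip=$3; tgt_mac=${4:-}; port=${5:-6666}
    echo "netconsole=${port}@${src_ip}/${dev},${port}@${tgt_ip}/${tgt_mac}"
}

# Example, as root on the machine that freezes (placeholder addresses):
# modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20)"
# dmesg -n 8    # raise the console log level so all messages are sent
#
# On the receiving machine, something like "nc -u -l 6666" should print the
# messages; the exact netcat flags differ between netcat variants.
```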
After you have it working, you can resume your experiments with environment variables. And keep an eye on the kernel messages when a crash happens.
A few hints. If `MESA_NO_ASM=true` is set, then the others (MESA_NO_MMX=true ; MESA_NO_3DNOW=true ; MESA_NO_SSE=true) have no effect.
And don't forget to test `export mesa_glthread=false` too.
Also try `export RADEON_THREAD=false` with the above.
Threading and concurrency just increase the random variables.
Your hope is to find something that always works, or some error that is always present before a crash.
You should also seriously consider testing the card on another OS or computer.
If that Blender trace hangs on Windows, it definitely is not a software issue.
I've performed the netconsole test today. After over an hour of learning how it works, I set it up and could confirm that system messages are properly received by netcat on the other computer. Unfortunately, as expected, no messages get sent at the time of the freeze: Even the netconsole kernel module dies immediately.
The MESA parameters I mentioned don't seem to affect the freeze produced by the Blender trace either. For reference, my testing string was:
export LIBGL_DEBUG=true LIBGL_NO_DRAWARRAYS=true LIBGL_DRI3_DISABLE=true MESA_DEBUG=true MESA_NO_ASM=true MESA_NO_MMX=true MESA_NO_3DNOW=true MESA_NO_SSE=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0
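Setting every variable at once shows whether the whole group matters, but not which single setting does. One way to narrow it down is to replay the same trace once per variable. This is a sketch only: the function name `replay_matrix` and the trace file name `blender.trace` are placeholders, and the variables in the example are the ones suggested earlier in this thread.

```shell
# Print one "apitrace replay" command per setting, so that each environment
# variable is tested in isolation against the same trace file.
replay_matrix() {
    trace=$1
    shift
    for setting in "$@"; do
        echo "env $setting apitrace replay $trace"
    done
}

# Example: list the runs, then execute them one by one and note which freeze.
# replay_matrix blender.trace mesa_glthread=false RADEON_THREAD=false \
#     R600_DEBUG=nodma MESA_NO_ASM=true
```

Since the freeze is probabilistic, each line would need to be run several times before a setting can be considered to have any effect.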
I retain my conviction that this is nothing hardware related. Mainly because the freeze doesn't seem to be affected by VRAM fill nor GPU stress, but by specific renderer features regardless of the complexity of the scene. I see no way in which for instance, a Blender scene with a load of high-poly objects won't ever trigger a hardware failure, but a Blender scene with a few low-poly objects can do so within seconds if some obscure conditions are just right.
If anyone can suggest anything else, please do. This is the weirdest and most difficult test I've ever had to perform on a computer to debug a crash, mainly due to the way in which absolutely nothing seems to work. There's definitely a way to catch it... I just don't know what that is.
(In reply to MirceaKitsune from comment #56)
> I've performed the netconsole test today. After over an hour of learning how
> it works, I set it up and could confirm that system messages are properly
> received by netcat on the other computer. Unfortunately, as expected, no
> messages get sent at the time of the freeze: Even the netconsole kernel
> module dies immediately.
When the system hangs, is SysRq still operational?
That is, if you have netconsole working and press SysRq+h, it should show the help and send that text over the network.
If you press SysRq+b it should reboot.
I want to confirm that netconsole indeed stops working, but SysRq is still working.
There is another method for capturing panic messages. It involves reserving a portion of memory and loading a second kernel there, which is started in the event of a panic.
Actually there was even a method for storing kernel panics in the non-volatile memory of the UEFI BIOS... (That might be a bit risky).
However at this point I am not convinced that you are even getting any kernel panic.
It is very strange that the system hangs without the kernel issuing a panic. And it is even stranger that the GPU is causing such a hang.
You see, the GPU for the most part is working on its own, so if the GPU hangs, it should not affect the CPU operation. The radeon/amdgpu drivers could detect GPU hang and they should complain. I've shown you how they do that for me.
This points us again in the direction of hardware. I do remember that you had some success with `amdgpu.moverate=4` . So the issue might be around DMA and PCIE...
For now, try `export R600_DEBUG=nodma` .
This environment variable has remained with this name, despite the fact that it now works on much newer drivers than R600. You can see all supported options with `R600_DEBUG=help glxgears` .
Also, you've overclocked before, so maybe some options have remained. See if your BIOS/UEFI has something equivalent to "safe defaults"...
(In reply to iive from comment #57)
I believe I've already tried with nodma under Xonotic, as I previously attempted playing with the following options and still got the crash:
I've also checked out the BIOS, no overclocking settings are responsible. The freeze happened even with the failsafe defaults of my BIOS in use.
And during the netconsole test, I did enable and use the SysRq keys (RESUB)... nothing got printed to the other machine. SysRq doesn't seem to do anything after the freeze: Nothing happens if I press them, including the HDD led which never flashes again, indicating that even the drive is never used any more.
Loading a second kernel into memory sounds complicated and much more dangerous. At this stage I also doubt even that would work, as the machine literally acts as if the whole system (CPU / RAM) is bricked and jammed upon crashing.
I'm not sure if this helps, but here is the darkplaces engine file responsible for drawing shadows. Remember that when disabling shadows, the frequency of the freeze is reduced from 0 - 30 minutes to 60 - 240 minutes in Xonotic. Maybe if someone more experienced takes a look at what renderer functions get enabled when shadows are turned on, they might notice what could be speeding up the freeze?
Remember that one or a combination of the following two cvars triggers it. You can search that file and see where those settings are used and what they enable.
(In reply to MirceaKitsune from comment #58)
> (In reply to iive from comment #57)
> And during the netconsole test, I did enable and use the SysRq keys
> (RESUB)... nothing got printed to the other machine. SysRq doesn't seem to
> do anything after the freeze: Nothing happens if I press them, including the
> HDD led which never flashes again indicating that even the drive is never
> used any more.
After the hang, does SysRq+B reboot the machine?
(In reply to iive from comment #60)
No, it never does. SysRq + R - E - S - U - B (pressed in slow order after one another) does not reboot, nor make the hard drive led flash, nor have any other noticeable effects once the freeze has occurred.
You don't even get a kernel panic, the machine just freezes.
Just to confirm that, add "panic=30" to the kernel command line options in GRUB. Then freeze the computer again.
If the kernel panics, then it should reboot after 30 seconds.
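Before running the test it is worth confirming that the option actually reached the kernel. The helper below is a minimal sketch with a made-up name, `panic_timeout_from_cmdline`; it only parses a command line string and does not change anything.

```shell
# Extract the value of "panic=<seconds>" from a kernel command line read on
# stdin. Prints nothing if the option is absent.
panic_timeout_from_cmdline() {
    tr ' ' '\n' | sed -n 's/^panic=\([0-9][0-9]*\)$/\1/p' | tail -n 1
}

# Example, after rebooting with the new option:
# panic_timeout_from_cmdline < /proc/cmdline
```

For a single session, `sysctl kernel.panic=30` as root should set the same timeout at runtime, without editing the boot configuration.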
Do you have a temperature reading for the Mother Board chipset? Can you make sure it doesn't overheat or something during gameplay?
Use ssh to log into your computer and run `watch sensors`. You will have the last readings when the computer hangs.
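If watching the terminal is impractical, the readings can also be appended to a file on the second computer, so the last sample survives the freeze. This is a sketch under assumptions: the function name `log_readings`, the log file name, and the host name `crashy-host` are all placeholders.

```shell
# Run a command repeatedly and append its timestamped output to a log file.
# Arguments: log file, number of samples, seconds between samples, command...
# sync is called after every sample so the data reaches the disk promptly.
log_readings() {
    logfile=$1; count=$2; interval=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        { date '+%F %T'; "$@"; } >> "$logfile" 2>&1
        sync
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example, run on the stable machine (placeholder host name):
# log_readings sensors.log 100000 1 ssh crashy-host sensors
```

The end of the log then shows the last temperatures reported before the SSH connection stopped answering.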
I think you should try to compile a vanilla kernel and enable every debug option that you can. (You can use the SUSE kernel /proc/config.gz as template).
(In reply to iive from comment #62)
I booted my machine with the kernel parameter panic=30 as instructed. I then waited for over two minutes to see if there's any sign of movement. Nothing happened: The machine never reboots on its own after the freeze, I still need to press the reset button on the computer case to restart it.
I also noticed another detail worth noting: The keyboard NumLock led only seems to turn off after I press a key on the keyboard post freeze. So let's say the machine just crashed: I can wait a whole minute and the led is still on... then I press Control or Shift or any other key, and after roughly 3 seconds, the led then turns off. This is always the last noticeable response from the PC.
Lastly I have something important to mention: Someone just replied to my thread about this crash on the openSUSE forum, and confirmed they're getting the same issue! They even posted a screenshot showing the exact same graphical garbage I'm noticing in various applications (colorful little squares littering the screen). This might be the first time someone else can confirm the problem, which is very exciting news if the person will provide us with more info.
I've been having a very different problem with AMD cards. But I have reason to think that the problem could vary from one processor/chipset to another.
My problems disappear using 'MESA_EXTENSION_OVERRIDE=-GL_ARB_buffer_storage', can you try that?
(In reply to H4nN1baL from comment #64)
Just tested with "export MESA_EXTENSION_OVERRIDE=-GL_ARB_buffer_storage". It did not affect the freeze triggered by playing back the Blender trace.
Also, to answer iive's last point which I forgot in the previous response: I have temperature monitors on my desktop, and one of the Xonotic freezes happened just a second after I alt-tab switched back from checking it. All fans and temperatures were perfectly fine during that test: The CPU was around the typical 48°C, whereas the GPU never exceeded 68°C itself. Temperatures are most surely not an issue.
Okay, thanks for your reply. Then our problems are unrelated.
Even so, let me share some tips. Disable any BIOS setting related to "GART" and "PCIE Spread Spectrum" (PCIe overclock).
And keep in mind that GPUs also come with a BIOS; in some cases it really needs to be updated.
(In reply to MirceaKitsune from comment #63)
> (In reply to iive from comment #62)
> I booted my machine with the kernel parameter panic=30 as instructed. I then
> waited for over two minutes to see if there's any sign of movement. Nothing
> happened: The machine never reboots on its own after the freeze, I still
> need to press the reset button on the computer case to restart it.
As I suspected.
It just hangs.
> I also noticed another detail worth noting: The keyboard NumLock led only
> seems to turn off after I press a key on the keyboard post freeze. So let's
> say the machine just crashed: I can wait a whole minute and the led is still
> on... then I press Control or Shift or any other key, and after roughly 3
> seconds, the led then turns off. This is always the last noticeable response
> from the PC.
Are you using USB keyboard plugged into USB port?
Your motherboard does have PS/2 ports, see if you can use the one for keyboard. (Sometimes keyboards come with small dongle that lets you plug USB keyboard into PS/2 port).
> Lastly I have something important to mention: Someone just replied to my
> thread about this crash on the openSUSE forum, and confirmed they're getting
> the same issue! They even posted a screenshot showing the exact same
> graphical garbage I'm noticing in various applications (colorful little
> squares littering the screen). This might be the first time someone else can
> confirm the problem, which is very exciting news if the person will provide
> us with more info.
At the very least they do manage to get errors from the kernel driver before the crash. You haven't seen that kind of error, have you?
Still, you can share the blender trace with them. See if it causes hang for them too. See if it hangs at the same place...
> Also, to answer iive's last point which I forgot in the previous response: I
> have temperature monitors on my desktop, and one of the Xonotic freezes
> happened just a second after I alt-tab switched back from checking it. All
> fans and temperatures were perfectly fine during that test: The CPU was
> around the typical 48°C, whereas the GPU never exceeded 68°C itself.
> Temperatures are most surely not an issue.
Chipset temperature is different than CPU and GPU.
The motherboard has a huge chip that connects the CPU with the RAM and the PCIE slots. The GB website says it has a temperature sensor on the North Bridge, and there seems to be a huge passive cooler with heat pipes on it, one of them leading to a radiator on top of the MB, probably to use the PSU intake for cooling.
`sensors` should display everything that is available, even if the readings are not labeled correctly.
GB like to increase the voltage a bit on their hardware, so it is more stable when overclocked. This however means it also runs a bit hotter.
(And you need to dust off the heat sinks once or twice per year. Unless you have special dust filters on the PC case intake vents.)
Created attachment 139283 [details]
Output of: watch --interval 0.1 sensors
(In reply to H4nN1baL from comment #66)
My BIOS offers no options regarding GART and Spread Spectrum as far as I recall. Here is my exact motherboard model, in case anyone has extra info on what its available BIOS settings mean which I may have missed.
(In reply to iive from comment #67)
That is indeed a difference: Everyone else reporting those crashes seems to be able to record a Kernel panic, but in my case the system records nothing as it fundamentally stops functioning at that very moment.
I logged into SSH from my other machine and ran "watch --interval 0.1 sensors". Attached are my readings at most 0.1 seconds before the freeze, which seem to capture every relevant voltage and temperature available. Obviously there are no fans in slots 2 and 4 which is why they show 0 RPM.
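A minimal sketch of how such readings could be captured to a file instead of being read off an SSH terminal, assuming lm_sensors is installed and `sensors` is on the PATH. The function name `log_command_output` and the log file name are hypothetical. Each sample is flushed and synced to disk, so the last reading before a freeze has a better chance of surviving the hard reset.

```python
import os
import subprocess
import time


def log_command_output(cmd, log_path, samples, interval=0.1):
    """Run cmd repeatedly and append each output to log_path, forcing it to disk."""
    with open(log_path, "a", encoding="utf-8") as log:
        for _ in range(samples):
            result = subprocess.run(cmd, capture_output=True, text=True)
            log.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')} ---\n")
            log.write(result.stdout)
            # Flush Python's buffer, then ask the kernel to write the data out,
            # so the newest sample is not left in a cache when the machine hangs.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)


# Example call (assumes `sensors` exists on this machine):
# log_command_output(["sensors"], "sensors.log", samples=36000, interval=0.1)
```

Running this on the affected machine itself (rather than over SSH) avoids losing the final samples to network latency, at the cost of some extra disk writes.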
I'm really out of ideas...
Could you try using only the radeon kernel driver, just blacklist amdgpu one.
See if the blender trace hangs and netconsole still doesn't give any warnings.
See if you can completely disable iommu, when using radeon.ko.
I've asked you at least 3 times to test "export mesa_glthread=false", but you never included it in your list of things you've tried.
Same for `export RADEON_THREAD=false`.
I haven't asked you, but add `MESA_DEBUG=flush` to the things to test.
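A small sketch of one way to apply these three variables to a single program without exporting them for the whole session. The helper names `build_env` and `run_with_env` are hypothetical, and the Blender command in the comment is only an illustration; the variable names and values are the ones given above.

```python
import os
import subprocess


# Variables named in this thread, with the values suggested there.
DEBUG_ENV = {
    "mesa_glthread": "false",
    "RADEON_THREAD": "false",
    "MESA_DEBUG": "flush",
}


def build_env(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def run_with_env(cmd, overrides):
    """Run cmd with the overrides set only for that child process."""
    return subprocess.run(cmd, env=build_env(overrides)).returncode


# Example call (assumes `blender` is on the PATH):
# run_with_env(["blender"], DEBUG_ENV)
```

Setting the variables per process makes it easy to test one variable at a time by passing a smaller dictionary.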
Now, if you have run out of things to test, you can try a prolonged experiment that might not even bring a usable result.
If we had a case that hung reliably, one thing to do would be to locate the exact operation that causes the hang.
So, you start `qapitrace` with the blender trace.
You then do a binary search for the frame that causes the hang. This is done with "Lookup State" at a frame number, which replays the trace up to that frame. You start with the full range, let's say [0 - 10000], and pick the frame in the middle of that range, in this case frame #5000. If it hangs during replay, you use [0 - 5000] as the new interval; if it doesn't hang, you use the other half, [5000 - 10000] (because the cause of the hang must be there). Then you pick the middle of the new interval and repeat the experiment.
(e.g. [0 - 2500]; [1250 - 2500]; [1250 - 1875].)
Once you locate the exact frame that could cause the first hang, you can do the binary search, but this time on the draw operations inside that frame. It can help if you set:
Now, since crashing for you is kind of random, you might try to disable all threaded options (all options from above) and run the same lookup a dozen times. If it crashes even once, then it crashes.
Also, be sure to write down the current range, so as not to lose it at reboot.
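The bookkeeping for this search can be sketched as follows. This is only an illustration of the halving and of saving the range to disk; it does not drive `qapitrace`, and the function names and the JSON state file are hypothetical. Here `lo` is the last frame known to replay cleanly and `hi` is the first frame known (or assumed) to hang.

```python
import json
import os


def next_probe(lo, hi):
    """Return the frame to replay next, or None once the range is down to one frame."""
    if hi - lo <= 1:
        return None
    return (lo + hi) // 2


def narrow(lo, hi, probe, hung):
    """Shrink the range after replaying the trace up to `probe`."""
    # A hang means the culprit is at or before the probe;
    # a clean replay means it must be after the probe.
    return (lo, probe) if hung else (probe, hi)


def save_state(path, lo, hi):
    """Persist the range and force it to disk so it survives a hard reset."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"lo": lo, "hi": hi}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_state(path, default):
    """Load the saved range, or return the default (lo, hi) if none exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return state["lo"], state["hi"]
    except FileNotFoundError:
        return default
```

Because the hang is random, `hung` should only be set to false after the same probe has replayed cleanly many times (a dozen, as suggested above); a single hang is enough to set it to true.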
I also strongly encourage you to at least try some other distribution, something you can start from a live CD or similar. Or build your own vanilla kernel.
(In reply to iive from comment #69)
Just tried mesa_glthread=false RADEON_THREAD=false MESA_DEBUG=flush and got the same results. The others seem a lot more complex: I might try them later, but currently I'm very busy and it's difficult to organize myself accordingly. I wish I had a better way of helping to understand this issue, as I really need to get it fixed, sadly I feel stuck myself at the moment.
The Mesa 18.1.0 update, which was supposed to fix several GPU crashes, seems to have managed to expand this freeze instead: I now get it even when playing simple 3D games with low-poly models and low-res textures, such as MegaGlest.
At this moment the issue is at a point where it may have real life implications: I may be constrained to buy a new video card just to stop this, and if I do that I literally won't have money to eat for a month. As I make my living from animation and game development, it's either that or this issue can be solved. I know it has to be software related, but in some mind boggling way every way to see what's doing it gets covered up and no kernel or MESA parameters make it go away.
Can someone please ask other developers and people experienced with the video drivers to subscribe to this and post their ideas? iive helped me with a lot of advice, but somehow whatever is doing it managed to dodge everything even he could think of. Perhaps someone else has some new suggestions?
(In reply to MirceaKitsune from comment #71)
> The Mesa 18.1.0 update, which was supposed to fix several GPU crashes, seems
> to have managed to expand this freeze instead: I now get it even when
> playing simple 3D games with low-poly models and low-res textures, such as
> MegaGlest.
Can you confirm this?
Does reverting to an older Mesa release "fix" the new issues?
And/or reverting to an older kernel?
Slow deterioration of the situation is consistent with hardware problems. That might not be so bad, because it means it could be fixed relatively easily.
BTW are you using suspend to RAM? My card had worse symptoms after resume, even if it has been suspended for seconds. Suspend still provides +5V on PCIE, so the card might still be partially powered, but not cooled.
This reminds me of something we haven't tested - ASPM.
Try kernel parameter "pcie_aspm=off"
Disabling it might lead to more power consumption by the card, even when idle. But it might improve stability.
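A quick sketch for confirming that a boot parameter such as this one actually reached the running kernel, by reading `/proc/cmdline`. The helper names are hypothetical, and the parser is deliberately simple: it assumes no quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; flags without '=' map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Example check on the running system:
# print(parse_cmdline(read_cmdline()).get("pcie_aspm"))
```

The same check works for the other parameters tried in this report, such as `panic=30`.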
(In reply to iive from comment #72)
I've thought about testing an older version of Mesa too. Especially since, from what I can vaguely remember, certain system instabilities were introduced roughly two years ago (autumn of 2016) when I switched from Mesa 13 to 17. I doubt that's related after so long but figured I'd still mention.
The only issue is that I'm not sure how far I can downgrade my Mesa version without it asking for old dependencies, potentially rendering my system unusable due to library conflicts. On the other hand, I remember there was once a way to run games against a custom version of Mesa, by separately compiling a .so library and using an environment variable to point to it.
Is it possible to download a Mesa 13.x library from any repository? And what was the environment variable to point an executable to it when running a game?
I will try booting with pcie_aspm=off next and let you know how it goes.
pcie_aspm=off makes no difference. In addition, I tried booting back to the radeon module (instead of amdgpu) and disabling the SI scheduler: This seems to have slightly mitigated the problem in some cases (eg: Blender) but made no difference in others (eg: Xonotic).
As for suspending to RAM: I haven't used Standby mode in ages. I never suspend my computer to RAM, so this could not be an issue.
(In reply to MirceaKitsune from comment #73)
> (In reply to iive from comment #72)
> I've thought about testing an older version of Mesa too. Especially since,
> from what I can vaguely remember, certain system instabilities were
> introduced roughly two years ago (autumn of 2016) when I switched from Mesa
> 13 to 17. I doubt that's related after so long but figured I'd still mention.
> The only issue is that I'm not sure how far I can downgrade my Mesa version
> without it asking for old dependencies, potentially rendering my system
> unusable due to library conflicts. On the other hand, I remember there was
> once a way to run games against a custom version of Mesa, by separately
> compiling a .so library and using an environment variable to point to it.
> Is it possible to download a Mesa 13.x library from any repository? And what
> was the environment variable to point an executable to it when running a
> game?
(In reply to MirceaKitsune from comment #74)
> pcie_aspm=off makes no difference. In addition, I tried booting back to the
> radeon module (instead of amdgpu) and disabling the SI scheduler: This seems
> to have slightly mitigated the problem in some cases (eg: Blender) but made
> no difference in others (eg: Xonotic).
> As for suspending to RAM: I haven't used Standby mode in ages. I never
> suspend my computer to RAM, so this could not be an issue.
No easy solutions...
If you are sure that mesa 13.x works for you, then you must try it, again.
I can't help you with packages, but you should be able to download the old packages manually and install them manually too.
It might be a PITA, as it seems that OpenSUSE splits Mesa into multiple packages, like mesa, mesa-drm, mesa-libva, mesa-libgl1, mesa-libd3d...
Most packages should be forward compatible, so you don't have to downgrade stuff like libdrm.
However the tricky moment here is LLVM. Most likely only Mesa depends on LLVM, so you have to downgrade both at the same time (and nothing else).
Theoretically it might still be possible to compile your own mesa-13.x from source, if there are still some issues with the other dependencies. LLVM might be the tricky part here; you might need a matching older version.
Are there live CDs with OpenSUSE? Something you can start without installing it?
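For a self-compiled Mesa installed outside the system directories, the variables usually used to point a single program at it are `LD_LIBRARY_PATH` (for libGL itself) and `LIBGL_DRIVERS_PATH` (for the DRI driver modules), at least for Mesa releases of that era. A minimal sketch, assuming a hypothetical install directory such as `/opt/mesa-13/lib64`; the helper name `custom_mesa_env` is also hypothetical.

```python
import os


def custom_mesa_env(libdir, base=None):
    """Return an environment that makes a program load Mesa from libdir."""
    env = dict(os.environ if base is None else base)
    old = env.get("LD_LIBRARY_PATH")
    # Put the custom libGL first so the dynamic loader prefers it over the system copy.
    env["LD_LIBRARY_PATH"] = libdir if not old else libdir + os.pathsep + old
    # Mesa searches this directory for DRI driver modules such as radeonsi_dri.so.
    env["LIBGL_DRIVERS_PATH"] = os.path.join(libdir, "dri")
    return env


# Example (hypothetical paths; pass the result as env= to subprocess.run):
# env = custom_mesa_env("/opt/mesa-13/lib64")
```

This leaves the system Mesa packages untouched, so only the program started with this environment uses the older build.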
(In reply to iive from comment #75)
That's what I feared too: I know Mesa depends on a lot of other libraries (including LLVM) and you can't mix old and new versions between them. This is my primary desktop on which I do all my activities, so I can't risk breaking it nor downgrade the whole OS to an ancient version.
A live DVD would solve this however. Unfortunately I don't know how far I can still find those for openSUSE, nor what the last openSUSE release was that came with Mesa 13 instead of 17. Does anyone else have this information?
As a side note, I should mention that I'm now in the process of trying to obtain a new video card: This couldn't be investigated in several months and I can't wait much longer. Once I get a new card, I might not be able to continue this test any more. I may still ask a friend to try my video card in Windows, just so we at least know whether this was a case of bad hardware or whether the devs should still be on the lookout for an obscure driver bug.
I had the same problem with Xubuntu 17.10 and maybe 18.04 (can't remember).
The GPU would hang when watching videos with mpv or even in Firefox. When I tested gnome-shell, this would also sporadically hang.
What solved it for me was switching from Radeonsi to Amdgpu.
(add radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 to your grub.cfg kernel boot parameters)
Now I've upgraded to 18.10 and decided to give radeonsi another try (mainly because VLC refuses to deinterlace mpeg2 under amdgpu for some reason), and it has worked without problems for about 2 weeks of daily use.
MSI R7 370 4G
Gigabyte GA-970A-DS3P FX
This will likely be my last update on the situation. A few months ago I got a new video card and replaced my R7 370 with it. Since then I've never once experienced this type of crash again, on either the "radeon" or "amdgpu" module. The old card is now on my mother's computer... since she doesn't play games it's working well for her, there hasn't been a GPU crash on there either.
I still believe this was a driver or firmware bug, not a damaged video card: a damaged card would not fail only in specific 3D games, independent of the card's GPU or VRAM load. But we'll likely never know. Closing this as I'm no longer able or willing to keep testing it.