Approximately once per day, the system experiences a randomly recurring crash: all windows and buttons freeze in place and become unusable, and only the mouse cursor can still be moved. After a few seconds, the monitor turns itself off and on several times (standby mode). Finally I either find myself back at the login manager, or the monitor stays off permanently... which of the two happens is completely random. If I manage to kill X11 in time (Control + Alt + Backspace) and try logging back in, desktop effects will sometimes not work and report no error as to why, whereas at other times the image just freezes again and I'm forced to restart anyway.
Today I discovered an important detail: the crash is not limited to KDE desktop effects, unlike most crashes of this sort and contrary to what I initially suspected. I had compositing disabled for several hours (Alt + Shift + F12), yet the exact same crash occurred! The freeze is rare and probabilistic, so it's impossible to reproduce on demand, but the trigger seems to be some sort of window activity, such as a system tray notification or a window popping up... starting up certain games may also cause an immediate system freeze.
My OS is openSUSE Tumbleweed (Linux) with the KDE Plasma 5 desktop. I use the free video drivers and default system packages of this distribution. My card is a Radeon R7 370 (GCN 1.0), running on RadeonSI and the radeon kernel module (no amdgpu support yet). The issue has survived several kernel updates, which suggests it's not in the Linux kernel itself. Current versions of relevant packages (monitored at http://tumbleweed.boombatower.com ):
Here's a link to the same report on the openSUSE bug tracker: http://bugzilla.opensuse.org/show_bug.cgi?id=1028575
Also I'd appreciate it if someone could please look at this relatively soon. My report there is already two weeks old, and so far there's been no official response. While I understand the developers are busy with other problems, this is a major issue which gets my system shut down at random intervals. Since such a crash may occur at any time, including during a package update or while other processes are handling data, it leaves my system at risk of data corruption.
Created attachment 130357 [details]
Created attachment 130359 [details]
Created attachment 130360 [details]
Created attachment 130362 [details]
Created attachment 130363 [details]
Created attachment 130364 [details]
A note regarding the logs I just posted: The last crash of this sort (with desktop effects disabled) occurred around 18:00 21-03-2017 on my local time. Look around that hour, for logs that have timestamps available.
Looks like GPU lockups, which are most likely due to an issue in Mesa/LLVM/kernel. Reassigning to Mesa for now.
I'm happy to announce that since xorg-x11-server 1.19.3 and / or xf86-video-ati 7.9.0 the issue appears to have been remedied. I now have over 5 days of uptime, with not a single GPU freeze caused by the desktop. If by chance the problem returns, I'll definitely post an update and let everyone know. It would be very appreciated if in the future, the Xorg team and driver developers could please consider more in-depth testing for GPU lockups, to prevent this sort of thing from repeating.
I imagine this issue can be marked as resolved. I will reopen it if I see the problem happening again. Feel free to otherwise reopen it if anyone believes there's still something to investigate.
To my stupefaction, the exact same issue has been re-introduced and is happening again. After another openSUSE Tumbleweed snapshot and several package updates, I believe approximately 2 days ago, the freeze occurs once more. I can however confirm that for approximately 8 days before that, the problem went away as I had one week of uninterrupted uptime. Can someone please analyze what changes fit this time pattern?
(In reply to MirceaKitsune from comment #11)
> I can however confirm that for approximately 8 days before that, the
> problem went away as I had one week of uninterrupted uptime. Can someone
> please analyze what changes fit this time pattern?
Not without more information about what changed on your system when the problem apparently stopped and started again (or alternatively, a magic crystal ball ;).
(In reply to Michel Dänzer from comment #12)
Like I said, I use openSUSE Tumbleweed and its latest official system packages. openSUSE has a lot of useful web tools, so I assume a package history should exist somewhere? If not, can anyone tell me how to dump my package installation history over the last month from zypper, so I can post a new log?
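For anyone wanting to do the same, here is a minimal sketch of how the install history could be pulled out with Python. It assumes zypper's history log lives at /var/log/zypp/history and uses pipe-separated fields in the order date, action, package name, version; both are assumptions about the local setup, not something confirmed in this thread. The function name parse_zypp_history and the sample lines (package names and versions included) are made up for illustration.

```python
from datetime import datetime

# Made-up sample lines in the assumed "date|action|name|version|..." layout.
SAMPLE_HISTORY = [
    "# comment lines start with a hash and are skipped",
    "2017-03-20 09:15:02|install|example-driver|1.0.0-1.1|x86_64|root@host|repo|abc|",
    "2017-03-21 18:40:10|remove |example-tool|2.3.4-1.1|x86_64|root@host|",
    "2017-04-30 08:00:00|install|example-lib|4.0.0-1.1|x86_64|root@host|repo|def|",
]


def parse_zypp_history(lines, since, until):
    """Return (timestamp, action, package, version) tuples within [since, until]."""
    events = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) < 4:
            continue
        action = fields[1].strip()  # tolerate padded action names
        if action not in ("install", "remove"):
            continue
        try:
            stamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a dated entry, skip it
        if since <= stamp <= until:
            events.append((stamp, action, fields[2], fields[3]))
    return events


if __name__ == "__main__":
    march = parse_zypp_history(
        SAMPLE_HISTORY, datetime(2017, 3, 1), datetime(2017, 3, 31, 23, 59, 59)
    )
    for stamp, action, name, version in march:
        print(stamp.isoformat(sep=" "), action, name, version)
```

On a real system the lines would come from the history file itself, for example `open("/var/log/zypp/history")`, which may need root permissions to read.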
This is the upstream bug tracker. For openSUSE related help, talk to openSUSE folks.
(In reply to Michel Dänzer from comment #14)
Yes, but it's due to an issue in one of the system components. I'm talking about it there for the packaging related aspect, but figured this is helpful for debugging the issue itself.
KDE is a slow and buggy desktop, and it does not even have freely configurable menus like the Xfce Whisker menu. Test with Xfce and lightdm.
Very important note: While investigating a completely unrelated bug, I remembered that KWin was configured to use egl over glx on my machine. I believe glx is the old render architecture, whereas egl is the new renderer which uses OpenGL and is visibly a lot faster.
Considering that desktop activity was always the cause of the freezes in some form, I have a strong suspicion that this might have something to do with the GPU freezes as well! It's a very likely candidate because egl involves experimental OpenGL rendering, and since it's not enabled by default that would also explain why few other people are able to reproduce the GPU lockups. I have just now switched back to glx and therefore haven't had the time to confirm this, but I'm willing to bet it just might make the problem go away... I will immediately post an update if and when I'm proven wrong obviously.
If anyone else wants to test this theory and help in reproducing the system freeze, consider switching to egl rendering. Obviously this means your machine might start freezing as well, so only do this if there's no risk of data loss or major annoyance. The switch is done by opening ~/.config/kwinrc in a text editor and changing:
Note that if you're using an Aurorae theme, you might experience the bug I mentioned above, which involves KWin no longer rendering window decorations. Here are the reports for that in case anyone is curious:
Nope, still happens in glx as well. Switching away from egl rendering does not make the freezes go away.
llvm 4.0.0 is now in openSUSE Tumbleweed: I have performed a 'zypper dup', installed it, and restarted. Now it's time to see if this really makes the freeze go away.
Please allow me to keep this issue open for another month to make sure it's truly gone: If through some stretch of imagination I see the problem again, I'll immediately post a reply here and let everyone know! I need to take at least several days to have no doubt it has been solved, due to how far its probability seems to have ranged. Thank you.
The same freeze happened again today, after a two week period of not seeing the problem any more. This is reaching the point where it's becoming outright ridiculous: I've used openSUSE for years, and have never seen such a thing spanning over such a long time period.
Unfortunately I can't risk breaking my system by installing custom versions of llvm or the system wide Mesa. I can only hope the developers know what to bisect to find out where this fault lies, based on the logs I've attached here. Let me know if there is more info I could somehow help with.
Once again, I'm dealing with at least one system crash per day. The latest one happens even after upgrading to the 4.11.0 Kernel, meaning the error was ported to it as well.
Why are the developers so incapable this time? This has been happening for nearly 3 darn months! The problem has been fixed twice, and each time it's returned after over 2 weeks. Can someone please explain what the heck is going on here? At this stage, it feels like someone is actively developing and updating this freeze against the latest system components... I don't even see how it could survive through this many package updates by chance alone, it is ridiculous.
And please don't lecture me on how this is free software, and I should only be complaining if I was actually paying the developers: There is a limit beyond which an important piece of software, be it a free Linux distribution or component, can break and stay broken. To literally be unable to keep a system running without the image suddenly freezing and the monitor shutting down every single day, for over a quarter of a year... that goes far beyond that limit.
I'm sorry for the outburst, but at this stage I think something needed to be said. I did not expect something like this to get dragged so far, and that I'd be unable to keep my system running for months. I'm going to bump the severity of this issue again, in hopes that someone can please take a look at it so I can run my system normally again! Thank you.
(In reply to MirceaKitsune from comment #21)
> Why are the developers so incapable this time? This has been happening for
> nearly 3 darn months! The problem has been fixed twice, and each time it's
> returned after over 2 weeks. Can someone please explain what the heck is
> going on here? At this stage, it feels like someone is actively developing
> and updating this freeze against the latest system components... I don't
> even see how it could survive through this many package updates by chance
> alone, it is ridiculous.
You may be surprised to hear that the code is not an explicit 'if (rand()) hang_gpu();' call.
> And please don't lecture me on how this is free software, and I should only
> be complaining if I was actually paying the developers:
Regardless, insulting people is rarely the best way to get them to do things. Everyone has more problems to solve than hours in the day, and the guy calling people 'incapable' does not tend to work his way to the top of the list.
> I'm sorry for the outburst, but at this stage I think something needed to be
No, it really did not. I understand your frustration, and I'm sorry to hear it, though as the Bugzilla footer notes, the freedesktop.org Code of Conduct applies here:
If you cannot keep your behaviour civil in future, your access to this bug tracker will be revoked. Thanks for your understanding, and I do hope your bug gets resolved.
(In reply to Daniel Stone from comment #22)
> (In reply to MirceaKitsune from comment #21)
> You may be surprised to hear that the code is not an explicit 'if (rand())
> hang_gpu();' call.
> Regardless, insulting people is rarely the best way to get them to do
> things. Everyone has more problems to solve than hours in the day, and the
> guy calling people 'incapable' does not tend to work his way to the top of
> the list.
> No, it really did not. I understand your frustration, and I'm sorry to hear
> it, though as the Bugzilla footer notes, the freedesktop.org Code of Conduct
> applies here:
> If you cannot keep your behaviour civil in future, your access to this bug
> tracker will be revoked. Thanks for your understanding, and I do hope your
> bug gets resolved.
Alright. Can't argue with that, and I'll keep your words in mind... also I wasn't looking to insult anyone per se. Please understand this sort of thing is something I've never dealt with in the nearly 5 years I've used Linux, and I understand neither how it happens nor why nothing is being done: not only is it the most severe type of issue, and one that has been there for months, but it's literally coming and going on a weekly schedule. I'd care less if it was a minor issue, but I literally can't perform daily activities properly because my monitor simply shuts itself down randomly for no reason! I don't have Windows and wouldn't want to use it again, nor can I downgrade to a version as old as before the time I can assume the issue started... what am I to do?
I'll try to be more calm, but I believe this sort of thing needs to have some solution. There are millions of people using these drivers, I don't think it's a normal situation that they can't use their machine safely and no one even knows what's triggering it... if this happened in Windows it would probably make headlines.
Anyway I don't want to take this off-topic. I'll wait for more answers and post new updates as I see changes. Sorry again for earlier.
More info after the latest crash: in ~/.config/kwinrc I tried setting GLCore=false and GLColorCorrection=false, however neither of these seems to affect the problem, although I found them suspicious. I wonder if other such settings, such as vsync (tearing prevention), could make a difference... there are many combinations to try, and I'm not sure which ones also affect the system when compositing is off.
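To make it easier to compare such settings between machines, here is a rough Python sketch that lists the keys of one group in a kwinrc-style file. It assumes the two keys mentioned above live in a group called [Compositing], which is my assumption and not stated in this report. The function name read_kwin_group and the sample file contents are hypothetical.

```python
import configparser

# Made-up sample contents; only GLCore and GLColorCorrection come from this report.
SAMPLE_KWINRC = """\
[Compositing]
GLCore=false
GLColorCorrection=false

[Windows]
FocusPolicy=ClickToFocus
"""


def read_kwin_group(text, group="Compositing"):
    """Return the key/value pairs of one group from kwinrc-style text."""
    parser = configparser.ConfigParser(
        interpolation=None,      # values may contain '%' or '$' characters
        strict=False,            # tolerate repeated groups or keys
        allow_no_value=True,     # tolerate keys without '='
        delimiters=("=",),
    )
    parser.optionxform = str     # keep key names case-sensitive
    # A dummy header lets keys that appear before the first group parse too.
    parser.read_string("[__root__]\n" + text)
    if not parser.has_section(group):
        return {}
    return dict(parser.items(group))


if __name__ == "__main__":
    for key, value in read_kwin_group(SAMPLE_KWINRC).items():
        print(f"{key} = {value}")
```

For the real file, the text could be read from `~/.config/kwinrc` (expanded with `os.path.expanduser`) and passed to the same function.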
When the system doesn't completely freeze, the behavior of the problem can be very strange at times; Last night after the desktop froze, I quickly hit Control + Alt + Backspace several times to kill X11... the system went into a console, which started flashing continuously together with the HDD led. I couldn't do anything on my keyboard and mouse... but once I pressed the power button it stopped and the machine shut itself down quickly and cleanly, meaning it still received the power off signal and managed to recover from that.
Can you narrow down what component update caused the issue?
(In reply to Alex Deucher from comment #25)
> Can you narrow down what component update caused the issue?
openSUSE Tumbleweed upgraded from Mesa 17.0.5 to 17.1.0 yesterday. I'm still in the process of seeing whether this affects the issue and how... so far no freezes, and over 1 day of uptime which should at least mean it's rarer.
I do not believe it's the Kernel: The same freeze started with Kernel 4.9.0, persisted in 4.10.0, and even after 4.11.0 it happened again. A kernel issue would have been impossible not to notice during months of development.
Other components aren't updated as frequently or easy for me to track, and with the issue happening once every day there's no way to test it on demand. As such I don't currently have more info on what the source could possibly be.
In weekly news, it appears a recent openSUSE Tumbleweed snapshot (which among other changes upgraded Mesa 17.0.5 to 17.1.0) made the issue a lot rarer for the time being: Until this snapshot it was even happening twice a day, hence why it was starting to drive me nuts... now I only seem to get this freeze every 2-3 days of uptime, which is so much more bearable! I worry whether the next snapshot will make the issue more frequent again rather than better... clearly it's one of the important system packages that's messing with it, but I still have no idea which and how.
After another two weeks of absence, the issue was apparently reimplemented on top of Kernel 4.11.3 + Mesa 17.1.1 + Plasma 5.10.0, likely sometime during the last few days. The behavior is once again identical, with alt-tab switching or desktop effects causing everything but the mouse pointer to freeze, then after 10 seconds the monitor shuts down. Other unrelated GPU crashes (such as those caused by some games) behave by the classic model, where the entire system simply freezes in place at once... that's a very different result from this freeze, and likely confirms this is a different type of crash.
At this point I have almost no doubt this is an attack that's being deliberately programmed, and manually reimplemented on top of new drivers once it gets fixed. The cycle seems to be that a kernel or driver update resolves the issue, then the creators of the crash require about two weeks to patch it and reimplement the exact same functionality. This is the 4th time the story repeats.
I tried steering away from this possibility until I was sure, as I didn't want it triggering any unnecessary arguments... if this is an attack then investigating it as such might help in finding its source more quickly. There's simply no way something this precise could happen by itself for nearly half a year, always coming back with the exact same effects after a period of absence... all despite radical changes to nearly every driver and system component, which would have no doubt altered the behavior of the initial problem in some form. Therefore I hope everyone can see why I'm now going with this theory and greatly considering the option of malicious intent.
I have no idea how the virus (?) could be updated on my computer, as it's likely not through the package update system directly. However I suspect it's using a constant series of vulnerabilities in one or more system components, which should be fixed by the developers if they exist. I would appreciate any ideas on both how the malicious code might be inserted into the computer, as well as finding the vulnerability within radeon / Mesa / X11 / etc that it exploits. Please let me know what your thoughts are!
An update: The latest reimplementation of the freeze appears to be worse than all others. My system can now be taken down after only 4 hours of uptime! This is a huge difference from all previous versions of the crash, which required that the system had at least been running for a day before it could be crashed.
Created attachment 132030 [details]
Photo of the corrupt image on the screen
I have discovered some very important details today. Everyone following up on the report, please see this comment!
Recently I realized that a useful test would be to switch to a virtual console once I notice the crash, in order to see how the system behaves there. A few minutes ago another freeze took place, so I instantly hit Control + Alt + F1 to go to a console. What I noticed was pretty remarkable and sheds light on a few aspects:
I could keep typing in the console for nearly 10 seconds, but after that the exact same behavior still took place (monitor turned itself on and off two times then the image froze). This time however I was able to toggle the NumLock led a minute after the crash, while also seeing the HDD led still working; That means this is not (always) a total system freeze such as a Kernel panic... instead it appears to be the image output corrupting and staying that way, freezing only specific components with it (I was still unable to issue a blind reboot command for instance). To put everything into an approximate timeline, this is what happened:
00 seconds in: The crash occurs.
02 seconds in: I notice and instantly hit Control + Alt + F1.
05 seconds in: I'm taken to a console where everything works fine: I see the blinking cursor, can write my login and password, etc.
12 seconds in: Suddenly the monitor turns off and back on several times, then the image remains frozen in place.
This time however, the screen did not remain turned off or black. Instead it stayed stuck in a state showing corrupt lines and rectangles of random colors. I took a photo of my screen with my smartphone, which I attached to this issue.
Created attachment 132448 [details]
Screenshot of "top"
Lots of important new information on this freeze, which was of course ported to the latest openSUSE Tumbleweed system packages and still works:
First and foremost, the problem does not happen in every session, and this is not always influenced by updates! During an interval in which I installed absolutely no relevant package changes, the following has happened: The freeze occurred after about just 8 hours of uptime... after that I restarted the machine, but then I had 4 days of uptime with no freeze! This leads me to believe that certain applications or system actions prepare the system with a "time bomb", which then causes switching between windows or desktops to produce the freeze... however I have no way to know what mines the system and what doesn't yet, as I use too many applications at once to figure out which might be responsible.
Anyway another crash happened today. Once more I quickly hit Control + Alt + F1 to switch to a virtual console. This caused the image to become corrupted on the monitor, however the system remained responsive and didn't actually freeze. So I went to my mother's computer and logged in via SSH, which indeed still worked. I was able to issue a reboot command, which caused the image to briefly unfreeze as the monitor turned on and off a few more times... I could see a few KDE error messages about applications crashing, before the system actually went ahead and rebooted successfully! However this is only possible if I switch to a console quickly enough when noticing the freeze start to happen; if not, the whole machine freezes and not even SSH responds from other devices!
While I was in SSH, I decided to run "top" and take a screenshot of my processes (while the computer was frozen and with corrupt image stuck on the screen). I can't tell if anything is out of the ordinary such as a memory leak, but I'm attaching a screenshot of it here.
Thought I'd also post another detail that might be useful, I'm not sure how much it relates to the freeze but better be safe than sorry; I have the following two environment variables added to my ~/.profile file, which basically tell Mesa to post errors to a log file:
There's one reoccurring line which keeps getting printed in there. It's added periodically with no side effects, but I imagine it could still have some relation to the trigger of the freeze:
Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)
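In case it helps anyone checking their own logs, this is a small Python sketch that counts how often each such message appears in a Mesa log. The line format is taken from the message quoted above; the function name tally_mesa_errors is hypothetical, and which log file to feed it is up to the reader since the path is not given here.

```python
import re
from collections import Counter

# Matches lines like the one quoted above:
#   Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)
MESA_ERROR = re.compile(r"^Mesa: User error: (GL_[A-Z_]+) in (\w+)\((.*)\)\s*$")


def tally_mesa_errors(lines):
    """Count Mesa user errors, keyed by (GL error code, GL function name)."""
    counts = Counter()
    for line in lines:
        match = MESA_ERROR.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    log = [
        "Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)",
        "some unrelated line",
        "Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)",
    ]
    for (code, func), count in tally_mesa_errors(log).most_common():
        print(f"{count:5d}  {code}  {func}")
```

Comparing the counts from a session that froze against one that did not might show whether the message frequency has any relation to the freeze.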
After months of careful testing and experimentation, I have discovered what seems to be the primary trigger of this freeze at last. It's not what triggers it per se, but what "rigs" the system and causes it to crash within the course of the next hours... the actual trigger is alt-tab switching between windows, or certain desktop effects playing.
The freeze is mined into the system when you disable and re-enable KDE desktop compositing. If I hit Alt + Shift + F12 to turn off desktop effects, then hit the key combo to turn them back on... there is a great chance that within a few hours the crash occurs. If I don't toggle compositing on the run and just leave it enabled after the system has started, I seem to be fine... this only happens if I turn it off and back on during runtime. It's uncertain whether anything else mines the system, but this is almost always what seems to do it for me.
Notice: I use OpenGL 3.1 for desktop compositing. I remember selecting OpenGL 2.0 long ago, but that still caused the freeze at that time. I can't use Xrender on a daily basis as many effects don't work with it. No other compositor options seem to affect the problem either.
It would be highly appreciated if at least after this information, the developers and maintainers could finally look at this issue! It has taken me months to confirm this as a cause, and I really hope this information (alongside dozens of comments and logs I have posted) can finally be put to use.
Today I discovered that even when not toggling desktop effects at runtime, the freeze can still be mined into the system. I got a crash after 1 day of uptime, no toggling of desktop compositing required.
I find it remarkable how the cause of the crash appears to have immediately changed after me making the comment above yesterday; I tested my theory that desktop effects are the root for 2 months, yet the moment I publish my observations the behavior changes in less than a day. This further makes me concerned that someone might be deliberately programming this crash using vulnerabilities in system components, solely for how strange this coincidence is. I'm still waiting for the developers to help investigate this further whatever the case, as I cannot find any explanation at this point.
Created attachment 133243 [details]
To rule out the possibility of a hardware issue, I ran two Memtest86 5.01 sessions from a Clonezilla bootable CD. The first was in the day for 5 hours, the second was during the night for over 10 hours: The program only registered 3 passes in total, but it did not find any errors. I'll attach a picture just in case any useful information is printed there.
Created attachment 133254 [details]
Output of "dmesg -w"
This is perhaps the most important piece of information I managed to gather on the problem thus far. If you have a technical understanding of this data, please take a look at the log and let us know what it says!
I was able to run an SSH session on my computer from another machine. In it I left the command "dmesg -w" running. I toggled desktop effects last night to provoke a crash today, which happened as expected and allowed me to conduct the test. This is basically what dmesg is seeing in realtime as the system is crashing.
I can't make sense of the information, but it definitely looks descriptive. Although the computer seemed completely frozen locally, the output continued flowing on the other machine printing new information every few seconds. I had to wait in order to catch some of the red lines in the console.
I briefly discussed the above log (output of "dmesg -w") on IRC with someone who seemed to have an understanding of the issue. They pointed out something important which I thought to highlight:
The problem appears to start from 'radeon_vm_bo_invalidate' and is most likely a GPU locking bug. Looking at the stack trace I can see it, alongside explicit mentions of spin lock / CPU soft lockup / stall on CPU. I've also noticed a potentially important message, which although marked as a warning seems to point to a line of source code from the radeon driver:
[58857.640890] WARNING: CPU: 3 PID: 2549 at ../drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0xec/0xf0 [radeon]
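To pick lines like this out of a long log, something along these lines could work. It is only a sketch in Python: the regular expression is modelled on the single WARNING line quoted above, so other kernels may format the line differently, and the function name parse_kernel_warning is made up.

```python
import re

# Modelled on the one WARNING line quoted above; other formats will not match.
WARNING_RE = re.compile(
    r"^\[\s*(?P<uptime>\d+\.\d+)\] WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_kernel_warning(text):
    """Return the fields of a kernel WARNING line as a dict, or None."""
    match = WARNING_RE.match(text.strip())
    if not match:
        return None
    info = match.groupdict()
    info["uptime"] = float(info["uptime"])
    for key in ("cpu", "pid", "line"):
        info[key] = int(info[key])
    return info


if __name__ == "__main__":
    sample = (
        "[58857.640890] WARNING: CPU: 3 PID: 2549 at "
        "../drivers/gpu/drm/radeon/radeon_object.c:84 "
        "radeon_ttm_bo_destroy+0xec/0xf0 [radeon]"
    )
    info = parse_kernel_warning(sample)
    print(info["file"], info["line"], info["func"], info["module"])
```

The source file and line number it extracts are what a developer would look up in the kernel tree for the matching kernel version.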
Created attachment 133368 [details]
Output of "dmesg -w" (full)
Full output of "dmesg -w", recorded by running "dmesg -w > filename.txt". The previous one was incomplete as it was subject to console line limitations, cutting off the moment when the crash actually occurs. I left the command running in a virtual console. This time the crash didn't shut down the monitor after switching to it (Control + Alt + F1), so I was able to cleanly stop dmesg and then issue a normal reboot. I waited there for about 5 minutes before doing so, to give dmesg time to record as much information as possible. The crash appears to start at the following lines:
[112873.658950] radeon 0000:03:00.0: ring 4 stalled for more than 10024msec
[112873.658953] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000072f6bd last fence id 0x000000000072f6c1 on ring 4)
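For those comparing logs, here is a hedged Python sketch that scans dmesg output for the two message types shown above and reports the ring number and the numerical gap between the two fence ids. The patterns are copied from these two lines only; the function name find_gpu_lockups and the "fence_gap" field are my own, and I make no claim about what the driver intends the gap to mean.

```python
import re

TIME_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-f]+) "
    r"last fence id (0x[0-9a-f]+) on ring (\d+)\)"
)


def find_gpu_lockups(lines):
    """Collect ring stall and GPU lockup events from dmesg lines."""
    events = []
    for line in lines:
        stamp = TIME_RE.match(line)
        uptime = float(stamp.group(1)) if stamp else None
        stall = STALL_RE.search(line)
        if stall:
            events.append({
                "kind": "stall",
                "uptime": uptime,
                "ring": int(stall.group(1)),
                "msec": int(stall.group(2)),
            })
            continue
        lockup = LOCKUP_RE.search(line)
        if lockup:
            current = int(lockup.group(1), 16)
            last = int(lockup.group(2), 16)
            events.append({
                "kind": "lockup",
                "uptime": uptime,
                "ring": int(lockup.group(3)),
                "current_fence": current,
                "last_fence": last,
                "fence_gap": last - current,  # plain difference of the two ids
            })
    return events


if __name__ == "__main__":
    sample = [
        "[112873.658950] radeon 0000:03:00.0: ring 4 stalled for more than 10024msec",
        "[112873.658953] radeon 0000:03:00.0: GPU lockup (current fence id "
        "0x000000000072f6bd last fence id 0x000000000072f6c1 on ring 4)",
    ]
    for event in find_gpu_lockups(sample):
        print(event)
```

Running it over a full capture (for instance the file written by "dmesg -w > filename.txt") would list every stall and lockup with its uptime, which makes it easier to see where the first one occurs.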
I randomly decided to google parts of my dmesg output. I was surprised to discover that someone else has reported a very similar issue, which looks like it might have the same root as mine!
The dmesg output they provided almost perfectly matches my last log, and they also have a RadeonSI card which further narrows down the problem. The main difference is that they experience this with Unreal Engine 4 Editor, whereas for me the trigger is the Plasma desktop.
That report seems to contain a fair amount of logs, so hopefully bringing it and this together can help produce a solution at long last.
(In reply to MirceaKitsune from comment #39)
> I randomly decided to google parts of my dmesg output. I was surprised to
> discover that someone else has reported a very similar issue, which looks
> like it might have the same root as mine!
> The dmesg output their provided almost perfectly matches my last log, and
> they also have a RadeonSI card which further narrows down the problem. The
> main difference is that they experience this with Unreal Engine 4 Editor,
> whereas for me the trigger is the Plasma desktop.
> That report seems to contain a fair amount of logs, so hopefully bringing it
> and this together can help produce a solution at long last.
All of these bug reports are GPU lockups. Whether they are the same root cause or not remains to be seen.
I have important new information. After yet more weeks of testing, I seem to have found both of the common triggers for this issue. The crash happens a few hours after either of the following actions is performed:
1 - Desktop effects are toggled at runtime. Pressing Alt + Shift + F12 twice to turn compositing off then back on will mine the system with this crash.
2 - I insert my USB stick or external drive into a USB port, mount it and access it in Dolphin, then unmount and remove it. A few hours after I've inserted / removed my drive, the freeze occurs. I suspect this has to do with the device notifier popping up in the system tray, asking what action to perform on the device or telling me the device is safe to unplug.
I'm not sure if the themes I'm using might have any relevancy. Considering this is a graphics problem, I figured I'd share this info as well so others can test them if they wish. I'm using the Plasma / KWin theme Freeze with the default Breeze icons / cursor / widget style:
Furthermore, I suspect I now know what the culprit component is. It's very likely that the problem lies within Mesa itself, and was introduced in the switch between 13.0 and 17.0.
This was confirmed by the bug report I linked previously, which I strongly believe is related to the issue I'm experiencing here: Another person there was able to verify that their crash happens with Mesa 17 but not 13. Looking at the dates, I realize that I started experiencing this problem precisely when openSUSE Tumbleweed upgraded from Mesa 13.0 to 17.0: Mesa 17 landed in early March 2017, it was a few days later that the issues began, which I then reported the following week (08 March 2017). See my comment in the other bug for more info on this:
I can also seemingly confirm that the issue only affects RadeonSI cards but not R600: My laptop has a Mobility Radeon HD 5470 card (R600) whereas my desktop has a Radeon R7 370 card (RadeonSI). I've been away for two weeks and have been using my laptop exclusively during this time, which has the exact same OS and configuration as my desktop. I was able to perform every task I do on my desktop from my laptop, including the triggers I described above... I have never experienced this freeze with the laptop.
I'm sorry for having taken so long to get back to this issue: I needed to be sure that what I'm mentioning is correct, and it took months of verification before I could be certain the issue is gone for good.
The problem has finally gone away. It has not happened once during 3 months, in which I was able to achieve well over a week of uptime! It disappeared after I made the following 3 changes on my system:
- Modifying my system GTK theme.
- Disabling KMix at startup.
- Uninstalling IBus.
I'm convinced the culprit here was IBus... more specifically its system tray icon. That icon has caused odd glitches in the past, such as making random menus pop up or crashing. It was likely also causing a graphical glitch that introduced this infinite GPU loop. As such the ingredients you should need are:
- A GCN 1.0 RadeonSI AMD card, running on the "radeon" driver.
- A KDE (Plasma 5) Linux OS.
- The IBus input system, with the option to show the system tray icon.
If others can reproduce this, please comment on the issue and let us know! If the problem does not return, I will mostly just be watching this bug from now on; I don't plan on spending days to do more odd tests... especially after receiving nearly no support from the FreeDesktop crew for almost a year, despite giving them a ton of data and despite how major this issue was.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1262.