I recently installed a new Radeon R9 380 and use the amdgpu driver. After some random time my desktop suddenly freezes and no input (mouse, keyboard, ACPI events) is possible. I can still log in via ssh from another box and run some commands, but I do not see any error messages. dmesg, .xsession-errors, Xorg.0.log and journalctl are all clean.
I can perform either one of the following steps:
(A) Try to shut down via the ssh session. I am kicked out of the ssh session (obviously), but the box is not powered off. Somewhere during the shutdown process the box gets stuck. All the time I still see the frozen desktop. The only option is to forcefully power off the PC by pressing the power button for 3 seconds.
(B) Do not try to shut down completely, but initiate a "systemctl rescue". All X11-related stuff is terminated (according to ps -elf), but I still see the frozen desktop. "lsmod" still reports that amdgpu is used by approx. 35 processes. "rmmod amdgpu" kills the machine entirely; after that even my SSH session is stuck, i.e. the shell never returns to its prompt. Any attempt to log in via SSH a second time fails. The only option is to forcefully power off.
After reboot: no error messages anywhere.
Sounds like a GPU hang, which is most likely caused by Mesa or LLVM. With the environment variable GALLIUM_DDEBUG="pipelined 2000" set for the compositor or Xorg process, the radeonsi driver might detect the hang and dump some information about it in a file in ~/ddebug_dumps/ . Please attach that file here.
The failure to recover cleanly from the problem is a kernel issue. You can try setting amdgpu.lockup_timeout=2000 to make the amdgpu driver detect the hang and try to reset the GPU, but it doesn't work reliably in general yet.
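For reference, a minimal sketch of one way to make the lockup_timeout setting persistent, assuming amdgpu is loaded as a module and that the distribution reads options from /etc/modprobe.d. The function names and the file name amdgpu-lockup.conf are made up for illustration. If the module is loaded from an initramfs, the initramfs probably has to be rebuilt, or amdgpu.lockup_timeout=2000 has to be put on the kernel command line instead.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d options file for amdgpu.
# $1 = timeout value (assumed to be milliseconds), $2 = target directory
# (defaults to /etc/modprobe.d; pass a scratch directory to try it out).
set_amdgpu_lockup_timeout() {
    timeout_ms="$1"
    conf_dir="${2:-/etc/modprobe.d}"
    case "$timeout_ms" in
        ''|*[!0-9]*)
            echo "timeout must be a non-negative integer" >&2
            return 1
            ;;
    esac
    mkdir -p "$conf_dir" || return 1
    printf 'options amdgpu lockup_timeout=%s\n' "$timeout_ms" \
        > "$conf_dir/amdgpu-lockup.conf"
}

# Print the value the loaded module reports, if the parameter is
# exposed under /sys (this may differ between kernel versions).
show_amdgpu_lockup_timeout() {
    param="${1:-/sys/module/amdgpu/parameters/lockup_timeout}"
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "parameter not readable: $param" >&2
        return 1
    fi
}
```

Run as root, e.g. `set_amdgpu_lockup_timeout 2000`, then reboot and check the result with `show_amdgpu_lockup_timeout`.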
@Michel: How and where do I set the environment variable GALLIUM_DDEBUG="pipelined 2000" such that it is passed to the execution environment of the Xorg process or compositor? I use systemd as my init system and the active service file is sddm.service. Presumably, I need to modify some unit files but offhand I do not have an idea which one.
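One possible approach, sketched under assumptions: systemd reads drop-in files from /etc/systemd/system/<unit>.d/, so a drop-in for sddm.service can set the variable without touching the packaged unit file. This assumes that sddm passes its environment on to the X server it starts, which I have not verified. The function name and the file name gallium-ddebug.conf are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: create a systemd drop-in that sets GALLIUM_DDEBUG
# for sddm.service. $1 = drop-in directory (defaults to the system path;
# pass a scratch directory to try it out without root).
write_ddebug_dropin() {
    dropin_dir="${1:-/etc/systemd/system/sddm.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # The whole assignment is quoted because the value contains a space.
    cat > "$dropin_dir/gallium-ddebug.conf" <<'EOF'
[Service]
Environment="GALLIUM_DDEBUG=pipelined 2000"
EOF
}
```

After running `write_ddebug_dropin` as root, `systemctl daemon-reload` followed by a restart of sddm.service should pick up the change. `systemctl show sddm.service -p Environment` shows what systemd will pass to the service.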
Created attachment 128252 [details]
Dump due to GALLIUM_DDEBUG="pipelined 2000"
Created attachment 128253 [details]
ps -elf -q 458
Please note that the process state is "D".
I could obtain the requested dump :) I hope it helps.
Some words are in order: Once I knew that the guilty process was /usr/bin/X and I knew its PID, I also tried to run "gcore <PID>" and to attach gdb to it. Both failed. Otherwise I would also have provided a backtrace of all threads, as I compiled all packages with "-g -ggdb". (I am a Gentoo user.) The process is stuck in state "D". "kill -KILL <PID>" did not work either.
A second note (I know it is selfish and off-topic, because you do a great job): I bought this new graphics card because I was being pestered by an Nvidia graphics card, a buggy nouveau driver and a lot of crashes due to an unstable OpenGL. (I know Nvidia is to blame for the situation, not the maintainers of nouveau.) After 18 months of hoping that the situation might improve, I finally decided to spend money on a new graphics card by AMD. I thought I would eventually get a working PC. Now it seems I stepped "out of the frying pan into the fire". Right now I still have the chance to withdraw from my investment and give the AMD graphics card back to the dealer. Should I do that? Or may I hope for a fix soon?
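Since a process in state "D" (uninterruptible sleep) cannot be stopped by gdb or killed, the kernel side may be the only place left to look. A minimal sketch, assuming a procps-style ps; the function names are made up for illustration:

```shell
#!/bin/sh
# Keep the header line and every row whose STAT column starts with "D".
# Reads "ps -eo pid,stat,..." style output on stdin.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List all processes currently in uninterruptible sleep, together with
# the kernel function they are waiting in (wchan), where available.
list_dstate() {
    ps -eo pid,stat,wchan:32,comm | filter_dstate
}

# Follow-up steps to run manually as root for a stuck PID:
#   cat /proc/<PID>/stack          # kernel stack; needs CONFIG_STACKTRACE
#   echo w > /proc/sysrq-trigger   # dump blocked tasks to the kernel log,
#                                  # if sysrq is enabled
```

The output of `list_dstate` and of /proc/<PID>/stack could then be attached alongside the ddebug dump.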
Please attach the output of glxinfo.
Created attachment 128261 [details]
Created attachment 128305 [details]
Here is a new dump from another crash.
Created attachment 128329 [details]
Created attachment 128330 [details]
dmesg with amdgpu.lockup_timeout=2500 after 3rd crash
See log entries starting at 3308 sec.
Is there anything I can do to push this one forward? There are some events that trigger the crash with high probability:
- autocompletion of a URL in Firefox
- opening a context menu in LibreOffice Writer
- scrolling source code in PhpStorm
Unfortunately, with this bug my PC is nearly unusable for daily work.
Created attachment 128728 [details]
4th dump 2017-01-03
Is anybody working on this? Is there anything I can do to help push this one forward?
I still see this error and I can trigger it almost reliably.
Note that my bug report https://bugs.freedesktop.org/show_bug.cgi?id=102322 might be about the same symptom, but with a different GPU architecture and a bleeding-edge kernel. I reported it against the "amdgpu" driver (not Mesa) because amdgpu produces the only logged error messages, and a bug in Mesa would not explain why my whole system crashes (and not just X11).