Bug 98874

Summary:	amdgpu: [drm:amdgpu_job_timedout] ERROR ring gfx timeout, [drm] IP block:5 is hang
Product:	Mesa	Reporter:	Matthias Nagel <matthias.h.nagel>
Component:	Drivers/Gallium/radeonsi	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:	Default DRI bug account <dri-devel>
Severity:	normal
Priority:	medium	CC:	devurandom, jb5sgc1n.nya, johan.gardhage, keramidasceid, samuel, vedran
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	Dump due to GALLIUM_DDEBUG="pipelined 2000"; ps -elf -q 458; glxinfo; 2nd dump; 3rd dump; dmesg with amdgpu.lockup_timeout=2500 after 3rd crash; 4th dump 2017-01-03

Description Matthias Nagel 2016-11-27 17:15:18 UTC

I recently installed a new Radeon R9 380 and use the amdgpu driver. After some random time my desktop suddenly freezes and no input (mouse, keyboard, ACPI events) is possible. I can still log in via ssh from another box and run some commands, but I do not see any error messages. dmesg, .xsession-errors, Xorg.0.log and journalctl are all clean.

I can perform either one of the following steps:

(A) Try to shut down via the ssh session. I am kicked off the ssh session (obviously), but the box is not powered off. Somewhere during the shutdown process the box gets stuck. All the time I still see the frozen desktop. The only option is to forcefully power off the PC by pressing the power button for 3 sec.

(B) Do not try to shut down completely, but initiate a "systemctl rescue". All X11-related stuff is terminated (according to ps -elf), but I still see the frozen desktop. "lsmod" still reports that amdgpu is used by approx. 35 processes. "rmmod amdgpu" kills the machine entirely: after that even my SSH session is stuck, i.e. the shell never returns to its prompt. Any attempt to log in via SSH a second time fails. The only option is to forcefully power off.

After reboot: No error messages anywhere

Comment 1 Michel Dänzer 2016-11-28 07:22:22 UTC

Sounds like a GPU hang, which is most likely caused by Mesa or LLVM. With the environment variable GALLIUM_DDEBUG="pipelined 2000" set for the compositor or Xorg process, the radeonsi driver might detect the hang and dump some information about it in a file in ~/ddebug_dumps/ . Please attach that file here.

The failure to recover cleanly from the problem is a kernel issue. You can try setting amdgpu.lockup_timeout=2000 to make the amdgpu driver detect the hang and try to reset the GPU, but it doesn't work reliably in general yet.
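For trying the variable on a single test program before touching the display manager setup, something like the following might do. This is only a minimal sketch: the helper names `ddebug_env` and `list_dumps` are made up for illustration, and only the variable name, its value and the ~/ddebug_dumps/ directory are taken from the text above.

```python
import os
from pathlib import Path


def ddebug_env(base=None, value="pipelined 2000"):
    """Return a copy of an environment with GALLIUM_DDEBUG set.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["GALLIUM_DDEBUG"] = value
    return env


def list_dumps(dump_dir=None):
    """Return all files below the dump directory, oldest first.

    Defaults to ~/ddebug_dumps; returns an empty list if the directory
    does not exist (for example because no hang was detected yet).
    """
    root = Path(dump_dir) if dump_dir is not None else Path.home() / "ddebug_dumps"
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)
```

A test program could then be started with `subprocess.Popen(["glxgears"], env=ddebug_env())` (glxgears is just an example client), and `list_dumps()` checked after a freeze. For Xorg or the compositor the variable has to end up in that process's own environment, which depends on how the session is started.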

Comment 2 Matthias Nagel 2016-11-28 08:36:28 UTC

@Michel: How and where do I set the environment variable GALLIUM_DDEBUG="pipelined 2000" such that it is passed to the execution environment of the Xorg process or compositor? I use systemd as my init system and the active service file is sddm.service. Presumably, I need to modify some unit files but offhand I do not have an idea which one.

Comment 3 Matthias Nagel 2016-11-28 19:20:41 UTC

Created attachment 128252 [details]
Dump due to GALLIUM_DDEBUG="pipelined 2000"

Comment 4 Matthias Nagel 2016-11-28 19:22:23 UTC

Created attachment 128253 [details]
ps -elf -q 458

Please note that the process state is "D".
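For reference, state "D" is uninterruptible sleep: the process is blocked inside the kernel and does not act on signals until it leaves that state. The same state letter that ps prints can also be read from /proc/<PID>/stat. A minimal sketch follows; the function names and the state table are my own, and the sample line in the usage note is made up.

```python
from pathlib import Path

# Short descriptions for the most common state letters shown by ps.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "Z": "zombie",
    "T": "stopped",
}


def parse_state(stat_line):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ")", so split after the last ")" instead of
    splitting the whole line on whitespace.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid):
    """Read and parse the state of a live process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())
```

For a made-up line such as `"458 (X) D 1 458 458 0"`, `parse_state` returns `"D"`; on the affected box `process_state(458)` would read the real file.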

Comment 5 Matthias Nagel 2016-11-28 19:34:12 UTC

I could obtain the requested dump :) I hope it helps.

Some words are in order: Once I knew that the guilty process was /usr/bin/X and I knew its PID, I also tried to get a "gcore <PID>" or to attach gdb to it. Both failed. Otherwise I would also have provided you with a backtrace of all threads, as I compiled all packages with "-g -ggdb". (I am a Gentoo user.) The process is stuck in state "D". "kill -KILL <PID>" did not work either.

A second note (I know it is selfish, because you do a great job, and off-topic): I bought this new graphics card because I was being pestered by an Nvidia graphics card, a buggy nouveau driver and a lot of crashes due to an unstable OpenGL. (I know Nvidia is to blame for the situation, not the maintainers of nouveau.) After 18 months of hoping that the situation might improve, I finally decided to spend money on a new graphics card by AMD. I thought I would eventually get a working PC. Now it seems I stepped "out of the frying pan into the fire". Right now I still have the chance to withdraw from my investment and give the AMD graphics card back to the dealer. Should I do that? Or may I hope for a fix soon?

Comment 6 Michel Dänzer 2016-11-29 02:08:21 UTC

Please attach the output of glxinfo.

Comment 7 Matthias Nagel 2016-11-29 07:13:10 UTC

Created attachment 128261 [details]
glxinfo

Comment 8 Matthias Nagel 2016-12-01 20:38:27 UTC

Created attachment 128305 [details]
2nd dump

Here is a new dump from another crash.

Comment 9 Matthias Nagel 2016-12-04 12:12:28 UTC

Created attachment 128329 [details]
3rd dump

Comment 10 Matthias Nagel 2016-12-04 12:14:04 UTC

Created attachment 128330 [details]
dmesg with amdgpu.lockup_timeout=2500 after 3rd crash

See log entries starting at 3308 sec.
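To pull the relevant entries out of a long dmesg attachment, a small filter along these lines might help. It is a sketch under the assumption that the log uses the usual "[ seconds.micros] message" prefix; the function name `hang_lines` and the default search strings are my own choice and not anything the driver defines.

```python
import re

# Matches the default dmesg prefix, e.g. "[ 3308.123456] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def hang_lines(dmesg_text, since=0.0, needles=("amdgpu", "drm")):
    """Return (timestamp, message) pairs at or after `since` seconds
    whose message contains at least one of the `needles`.

    Lines without a timestamp prefix are skipped.
    """
    hits = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        timestamp = float(match.group(1))
        message = match.group(2)
        if timestamp >= since and any(n in message for n in needles):
            hits.append((timestamp, message))
    return hits
```

With the attached log saved to a file, `hang_lines(open("dmesg.txt").read(), since=3308)` would list the amdgpu and drm messages from the point mentioned above onwards (the file name is only an example).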

Comment 11 Matthias Nagel 2016-12-04 12:17:22 UTC

Is there anything I can do to push this one forward? There are some events that trigger the crash with high probability:
- autocompletion of URL in Firefox
- open context menu in Libre Writer
- scrolling source code in PhpStorm
Unfortunately, with this bug my PC is nearly unusable for daily work.

Comment 12 Matthias Nagel 2017-01-03 16:01:24 UTC

Created attachment 128728 [details]
4th dump 2017-01-03

Is anybody working on this? Is there anything I can do to help push this one forward?

I still see this error and I can nearly reliably trigger it.

Comment 13 dwagner 2017-08-20 22:55:31 UTC

Note that my bug report https://bugs.freedesktop.org/show_bug.cgi?id=102322 might be about the same symptom, but with a different GPU architecture and a bleeding-edge new kernel. I wanted to report it against the "amdgpu" driver (not Mesa), because amdgpu produces the only logged error messages, and a bug in Mesa would not explain why my system crashes totally (and not just X11).

Comment 14 GitLab Migration User 2019-09-25 17:55:29 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1241.
