Bug 104817 - [Raven][GALLIUM_DDEBUG] system crashes/freezes randomly every few minutes/hours
Status: RESOLVED WORKSFORME
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: All Linux (All)
Importance: medium critical
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2018-01-27 23:04 UTC by Marcus Husar
Modified: 2018-08-17 11:34 UTC

Attachments
GALLIUM_DDEBUG: folder ddebug_dumps with multiple dumps (20.83 KB, application/x-xz)
2018-01-27 23:04 UTC, Marcus Husar
kernel: [drm:amdgpu_job_timedout [amdgpu]] (798 bytes, text/plain)
2018-01-27 23:06 UTC, Marcus Husar
kernel: amdgpu [gfxhub] VMC page fault (1) (3.88 KB, text/plain)
2018-01-27 23:08 UTC, Marcus Husar
kernel: amdgpu [gfxhub] VMC page fault (2) (2.91 KB, text/plain)
2018-01-27 23:09 UTC, Marcus Husar

Description Marcus Husar 2018-01-27 23:04:24 UTC
Created attachment 137000
GALLIUM_DDEBUG: folder ddebug_dumps with multiple dumps

OpenGL renderer string: AMD RAVEN (DRM 3.23.0 / 4.16.0-2.fc27.x86_64, LLVM 6.0.0)

My system is an Acer SF315-41 (Ryzen Mobile 5 2500U) with Fedora 27, Kernel 4.16-drm-next (based on 4.15-rc8), LLVM 6.0.0-rc1, Mesa 18.0.0-rc2.

I can reproduce these crashes on every combination I have tried, from kernel-4.15-rcX/mesa-17.3/llvm5 up to kernel-4.16-drm-next/mesa-18-rc2/llvm6-rc1. They mostly appear while watching videos (Firefox/Totem), switching tabs in Firefox, resizing windows (gnome-shell), or gaming.

With the kernel parameter amdgpu.lockup_timeout=2000 and the Mesa environment variable GALLIUM_DDEBUG=2000 I was able to gather lots of dumps within a few minutes (see attachment). As you can see in the dumps, the GPU lockup sometimes results in a CPU lockup (kernel Bluetooth deadlock) as a result of gnome-shell freezing completely. I can also reproduce the amdgpu crashes with a USB mouse and Bluetooth disabled.
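
Note on the two settings above: amdgpu.lockup_timeout is a kernel module parameter, while GALLIUM_DDEBUG is read by Mesa from the environment of the process being debugged. The following is a minimal sketch of launching a program with that variable set. The helper names ddebug_env and run_with_ddebug are made up for illustration, and reading the numeric value as a hang timeout in milliseconds follows the Mesa documentation rather than anything stated in this report.

```python
import os
import subprocess


def ddebug_env(timeout_ms, base=None):
    """Return a copy of the environment with GALLIUM_DDEBUG set.

    Assumption: Mesa treats a numeric GALLIUM_DDEBUG value as a hang
    timeout in milliseconds (2000 is the value used in this report).
    """
    env = dict(os.environ if base is None else base)
    env["GALLIUM_DDEBUG"] = str(timeout_ms)
    return env


def run_with_ddebug(command, timeout_ms=2000):
    """Run command (a list of argv strings) with GALLIUM_DDEBUG set.

    Returns the exit code of the child process.
    """
    return subprocess.run(command, env=ddebug_env(timeout_ms)).returncode


# Hypothetical usage: run_with_ddebug(["firefox"])
```

According to the Mesa documentation the dumps are written to a ddebug_dumps folder in the home directory, which matches the name of the attached folder.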

Only rarely do I find kernel errors in the log files after a crash. I’ll attach the few I found in the last two weeks.
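
Because these kernel errors show up only occasionally, it can help to search the logs for them automatically. Below is a small sketch that counts log lines matching the two signatures named in the attachment titles (amdgpu_job_timedout and VMC page fault). The function name count_gpu_errors and the dictionary keys are illustrative assumptions, and the patterns are taken only from those titles, not from a complete list of amdgpu error messages.

```python
import re

# Signatures taken from the attachment titles in this report.
SIGNATURES = {
    "job_timeout": re.compile(r"amdgpu_job_timedout"),
    "vmc_page_fault": re.compile(r"VMC page fault"),
}


def count_gpu_errors(log_text):
    """Count kernel-log lines that match each known signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The input can be any saved kernel log text, for example the output of dmesg or journalctl -k written to a file and read back in.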
Comment 1 Marcus Husar 2018-01-27 23:06:30 UTC
Created attachment 137001
kernel: [drm:amdgpu_job_timedout [amdgpu]]
Comment 2 Marcus Husar 2018-01-27 23:08:38 UTC
Created attachment 137002
kernel: amdgpu [gfxhub] VMC page fault (1)
Comment 3 Marcus Husar 2018-01-27 23:09:17 UTC
Created attachment 137003
kernel: amdgpu [gfxhub] VMC page fault (2)
Comment 4 Bráulio Barros de Oliveira 2018-04-13 22:27:16 UTC
Same here with AMD 2500U on a HP Envy x360, details at:
- https://bugzilla.redhat.com/show_bug.cgi?id=1562530
- https://lists.freedesktop.org/archives/amd-gfx/2018-March/020580.html
Comment 5 Justin Mitzel 2018-05-19 04:41:45 UTC
I am also having this problem on a Ryzen 2500U with kernel 4.16-drm-next. I get many hangs that can only be fixed by rebooting.
Comment 6 Justin Mitzel 2018-05-19 04:42:56 UTC
That said, it seems very likely that this is a kernel driver issue.
Comment 7 James Le Cuirot 2018-07-13 19:58:54 UTC
The OP also filed a kernel bug about this, but that report is missing the crucial information about how he was able to debug it! Glad I found this one.

https://bugzilla.kernel.org/show_bug.cgi?id=199653
Comment 8 Marcus Husar 2018-08-06 10:32:19 UTC
It seems to me that this is in fact a CPU-related problem. I haven’t had any problems since July 25, and my system is now pretty stable. What fixed it for me was adding idle=nomwait to the kernel command line in GRUB.

Please try adding idle=nomwait to your kernel command line. I think this bug can be closed.
Comment 9 James Le Cuirot 2018-08-06 11:07:28 UTC
I added idle=nomwait recently and that has fixed it for me too. I thought I had already tried this, though I’m not sure. Perhaps there were two issues and the other one has since been fixed.
Comment 10 Marcus Husar 2018-08-17 11:34:50 UTC
See comment #8. Kernel parameter idle=nomwait fixed this bug for me. It seems to be a CPU related problem.
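
To confirm that the workaround is active after a reboot, one can check the command line the running kernel was booted with. This is a minimal sketch assuming a Linux system that exposes /proc/cmdline. The helper names read_cmdline and has_kernel_param are hypothetical.

```python
def read_cmdline(path="/proc/cmdline"):
    """Return the command line of the running kernel as a string."""
    with open(path) as f:
        return f.read().strip()


def has_kernel_param(cmdline, param):
    """Return True if param appears as a whole token on the command line."""
    return param in cmdline.split()


# Hypothetical usage on the affected machine:
#   has_kernel_param(read_cmdline(), "idle=nomwait")
```

How the parameter gets onto the command line depends on the distribution. On many GRUB-based systems it is added to GRUB_CMDLINE_LINUX in /etc/default/grub, after which the GRUB configuration is regenerated, but that detail is an assumption about the setup rather than something stated in this report.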

