Bug 104639 - kernel 4.15-rc8 reboots randomly with RX 560
Summary: kernel 4.15-rc8 reboots randomly with RX 560
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium blocker
Assignee: Default DRI bug account
QA Contact:
Depends on:
Reported: 2018-01-15 13:23 UTC by Peter Alm
Modified: 2019-11-19 08:29 UTC

dmesg (61.74 KB, text/plain)
2018-01-15 13:23 UTC, Peter Alm

Description Peter Alm 2018-01-15 13:23:16 UTC
Created attachment 136726

My workstation, running Arch Linux with an RX 560 GPU, randomly reboots after a few minutes of use with the new amdgpu version in kernel 4.15. It seems to happen mostly during a full-screen redraw (such as maximizing a window). It happens regardless of desktop environment (I've tested GNOME on both Wayland and Xorg, as well as KDE). If it's left at the GDM login screen, it doesn't seem to reboot.

Kernel versions 4.14 and older seem to be rock solid.

I've tried the following things to work around it, but nothing seems to make any difference:
 * Setting amdgpu.dc to both 1 and 0
 * Disabling DPM by setting amdgpu.dpm to 0
 * Upgrading Mesa from 17.3.2 to latest git master
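
When toggling options like these, it can help to confirm which values a given boot actually used. The following is a minimal sketch, assuming the options were passed on the kernel command line in the amdgpu.<name>=<value> form (settings made through a modprobe configuration file would not show up here). The function names amdgpu_params and read_cmdline are hypothetical helpers, not part of any existing tool.

```python
def amdgpu_params(cmdline):
    """Return the amdgpu.<name>=<value> options found on a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()


# Typical use on the affected machine:
#   print(amdgpu_params(read_cmdline()))
```

While the module is loaded, the effective values should also be readable under /sys/module/amdgpu/parameters/, which covers options set by other means.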
Comment 1 Michel Dänzer 2018-01-15 14:56:16 UTC
Can you try bisecting? Make sure to test each commit for plenty of time before marking it as good.
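
The advice about test time matters because bisection halves the search range on every verdict, so a single wrong "good" points the search at the wrong half. The sketch below is a simplified model of that search on a linear list of commits, not how git bisect is implemented (real history is a graph); first_bad and the integer "commits" are illustrative assumptions.

```python
def first_bad(commits, is_bad):
    """Binary-search an oldest-first commit list for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]


commits = list(range(8))  # stand-ins for commit hashes; 5 introduces the bug

# Every commit tested long enough to show the reboot: the culprit is found.
reliable = first_bad(commits, lambda c: c >= 5)

# Commit 5 tested too briefly and wrongly marked good: the search blames 6.
too_short = first_bad(commits, lambda c: c >= 5 and c != 5)

print(reliable, too_short)  # prints "5 6"
```

For an intermittent failure like a random reboot, a "bad" verdict is certain once the crash is seen, while a "good" verdict is only as reliable as the time spent testing.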
Comment 2 Peter Alm 2018-01-16 12:47:59 UTC
Hi Michel,

After bisecting, the offending commit seems to be:

commit 648bc3574716400acc06f99915815f80d9563783
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Jul 6 09:59:43 2017 +0200

    drm/ttm: add transparent huge page support for DMA allocations v2
    Try to allocate huge pages when it makes sense.
    v2: fix comment and use ifdef
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

After reverting that commit, 4.15-rc8 seems to be working fine.
Comment 3 Christian König 2018-01-16 13:05:50 UTC
Does your kernel also have the following patch?

commit f4c809914a7c3e4a59cf543da6c2a15d0f75ee38
Author: Christian König <christian.koenig@amd.com>
Date:   Mon Oct 9 14:34:13 2017 +0200

    drm/ttm: don't use compound pages for now
    We need to figure out first how to correctly map them into the CPU page tables.
    bug: https://bugs.freedesktop.org/show_bug.cgi?id=103138
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Comment 4 Christian König 2018-01-16 13:09:12 UTC
Sorry, I just saw that you wrote you are using 4.15-rc8, which should already include the patch.

No idea what's going wrong here. Could you by any chance add a serial/network console and grab the last logs before the reboot?
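
For the network console option, the kernel's netconsole module sends kernel messages as UDP datagrams to a second machine, using port 6666 unless configured otherwise. A minimal sketch of a receiver for that second machine follows, assuming plain UDP text and a hypothetical log file name. The helper names open_listener and capture are illustrative. Each datagram is flushed to disk immediately so the last lines before a reboot are not lost in a buffer.

```python
import os
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is the default netconsole target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, log_path, max_packets=None):
    """Append each received datagram to log_path, syncing to disk every time.

    Runs until interrupted unless max_packets is given.
    """
    received = 0
    with open(log_path, "ab") as log:
        while max_packets is None or received < max_packets:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()
            os.fsync(log.fileno())
            received += 1
    return received


# Typical use on the receiving machine (stop with Ctrl+C):
#   capture(open_listener(), "netconsole.log")
```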
Comment 5 Peter Alm 2018-01-16 13:30:01 UTC

These are the last messages from a network console. All of them are from before the crash:

[   90.254437] [drm] {3840x2160, 4000x2222@533250Khz}
[   91.026831] [drm] U24E590: [Block 0]
[   91.028144] [drm] U24E590: [Block 1]
[   91.029479] [drm] dc_link_detect: manufacturer_id = 2D4C, product_id = CD3, serial_number = 304D5844, manufacture_week = 50, manufacture_year = 26, display_name = U24E590, speaker_flag = 1, audio_mode_count = 1
[   91.032187] [drm] dc_link_detect: mode number = 0, format_code = 1, channel_count = 1, sample_rate = 7, sample_size = 7
[   91.033565] [drm] {3840x2160, 4000x2222@533250Khz}
[   91.049360] [drm] {3840x2160, 4400x2250@594000Khz}
[  102.531453] input: Surface Mouse as /devices/virtual/misc/uhid/0005:045E:0919.0006/input/input21
[  102.531670] hid-generic 0005:045E:0919.0006: input,hidraw5: BLUETOOTH HID v1.10 Mouse [Surface Mouse] on 00:1A:7D:DA:71:15
[  102.537616] mousedev: PS/2 mouse device common for all mice
[  106.699544] fuse init (API version 7.26)
[  106.974151] usb 1-3.3: 1:1: cannot get freq at ep 0x81
[  107.942126] rfkill: input handler disabled
[  108.240100] ISO 9660 Extensions: RRIP_1991A

I don't have a serial dongle here, so I can't use a serial console. At one point I ran the console on the integrated Intel GPU (I usually have it disabled in BIOS) while using Xorg on the AMD GPU, but there were no messages there either.

If you come up with any ideas or patches I'm happy to try them out.
Comment 7 Martin Peres 2019-11-19 08:29:03 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/298