110413 – GPU crash and failed reset leading to deadlock on Polaris 22 XL [Radeon RX Vega M GL]

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Default DRI bug account

Reported:	2019-04-12 14:44 UTC by Rémi Verschelde
Modified:	2019-11-19 09:19 UTC
CC List:	2 users

Attachments
lspci -vvv output for HP Spectre 360x (15.79 KB, text/x-log) 2019-04-12 14:44 UTC, Rémi Verschelde
dmesg output after GPU crash in game StarCrawlers with kernel 5.0.7-desktop from Mageia 7 (114.64 KB, text/x-log) 2019-04-12 14:45 UTC, Rémi Verschelde
dmesg output after GPU crash in game For The King with kernel 5.0-rc1 built from amd-staging-drm-next (124.09 KB, text/x-log) 2019-04-12 14:47 UTC, Rémi Verschelde
/proc/config.gz from Mageia's kernel 5.0.7-desktop, used for custom amd-staging-drm-next build (210.67 KB, text/plain) 2019-04-12 14:48 UTC, Rémi Verschelde
dmesg output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7 (102.40 KB, text/x-log) 2019-04-13 13:29 UTC, Rémi Verschelde
journalctl -b output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7 (332.33 KB, text/x-log) 2019-04-13 13:30 UTC, Rémi Verschelde
Xorg.0.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7 (34.28 KB, text/x-log) 2019-04-13 13:32 UTC, Rémi Verschelde
Xorg.1.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7 (2.96 KB, text/x-log) 2019-04-13 13:32 UTC, Rémi Verschelde
journalctl -b0 output on kernel 5.3.0-rc2 from ubuntu mainline repository, with a system with rx 540 gpu (1.10 MB, text/plain) 2019-08-01 18:44 UTC, Utku Helvacı (tuxutku)

Description Rémi Verschelde 2019-04-12 14:44:42 UTC

Created attachment 143950 [details]
lspci -vvv output for HP Spectre 360x

My HP Spectre x360 laptop, bought in March 2019, comes with Kaby Lake G HD Graphics 630 and a discrete AMD Radeon RX Vega M GL GPU.

I only enable the Radeon GPU when needed to play graphics-intensive games with `DRI_PRIME=1`, and so far I have experienced a lot of GPU deadlocks with the following symptoms:
- Temperatures rise and the CPUs are throttled. The framerate drops when this happens.
- Later on, GPU faults are reported in dmesg, the game's rendering freezes (but music continues playing). I am still able to alt+tab back to desktop or open a terminal, but the game's process can't be killed. If I'm monitoring temperatures, lm_sensors always reports a bogus 511°C temperature for the AMD dGPU at this point, before breaking.
- Any subsequent attempt at using the AMD GPU will cause a system deadlock, and I need to force shutdown with the power button.
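
The bogus temperature reading described above could be caught by a small monitoring script. This is a minimal sketch, assuming the amdgpu sensor is exposed through the standard hwmon sysfs interface (`/sys/class/hwmon/hwmon*/name` and `temp1_input`, in millidegrees Celsius). The 511°C threshold is taken from the symptom above, and the function names are hypothetical.

```python
from pathlib import Path

# Value lm_sensors reported right before the GPU broke (from the report above).
BOGUS_TEMP_C = 511


def read_amdgpu_temps(hwmon_root="/sys/class/hwmon"):
    """Return the temperatures (in degrees Celsius) of all amdgpu hwmon sensors."""
    temps = []
    for hwmon in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = hwmon / "name"
        temp_file = hwmon / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        try:
            if name_file.read_text().strip() != "amdgpu":
                continue
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable, e.g. the dGPU is powered down
        temps.append(millidegrees / 1000)
    return temps


def looks_bogus(temp_c):
    """True if the reading is at or above the implausible 511 C value."""
    return temp_c >= BOGUS_TEMP_C


if __name__ == "__main__":
    for temp in read_amdgpu_temps():
        note = " (bogus reading, GPU may have hung)" if looks_bogus(temp) else ""
        print(f"amdgpu: {temp:.1f} C{note}")
```

Run in a loop while a game is active, this would show whether the reading climbs gradually before the hang or jumps straight to 511°C.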

My testing so far has covered:
- Unity3D games like For The King or StarCrawlers. The crash happens mid-game, not in a strictly reproducible manner, but seems related to CPU temperature/throttling.
  * I could also reproduce the crash with SuperTuxKart, not in-game but when alt-tabbing back to desktop.
  * I have not yet been able to reproduce the crash with glmark2. With For The King, I can reliably get a crash within 1 to 10 minutes in-game when playing with "High" or "Dream" graphics quality.
- Kernel 5.0.x (up to 5.0.7) from Mageia 7 (Cauldron), e.g. 5.0.7-desktop-4.mga7.
  * I also tried `git://people.freedesktop.org/~agd5f/linux -b amd-staging-drm-next` at b07c394a327fc9e435ee03288584c111fa73d963, but I still got the same symptoms. dmesg output was in part different though, more spammy.
  * Following discussions in bug 109692, I tried the patches provided by Andrey Grodzovsky in bug 109692 comment 34, but they did not solve the issue for me.
- Mesa 19.0.0 to 19.0.2 built against LLVM 7.0.1.
- Suspecting the CPU temperature/throttling as a trigger, I'm using https://github.com/kitsunyan/intel-undervolt to undervolt the CPU Cache by -100 mV and set the CPU limit temperature to 80°C instead of 100°C. This has helped with throttling issues I had during code compilation, but made no visible difference to my GPU crashes as far as I can tell. I can disable this undervolting when doing tests if required.

I found various bug reports which might well be duplicates, but I'm opening my own to avoid hijacking discussions on what may or may not be the same root cause: bug 109461, bug 109466, bug 109692 (I installed Shadow of the Tomb Raider but haven't checked if I can reproduce this one's symptoms yet), bug 109819.

I am attaching some relevant logs about the system and the bug. Please ask for anything else you may need.

Comment 1 Rémi Verschelde 2019-04-12 14:45:46 UTC

Created attachment 143951 [details]
dmesg output after GPU crash in game StarCrawlers with kernel 5.0.7-desktop from Mageia 7

Comment 2 Rémi Verschelde 2019-04-12 14:47:38 UTC

Created attachment 143952 [details]
dmesg output after GPU crash in game For The King with kernel 5.0-rc1 built from amd-staging-drm-next

Built with the same .config as Mageia's 5.0.7-desktop kernel, see next attachment.

Comment 3 Rémi Verschelde 2019-04-12 14:48:57 UTC

Created attachment 143953 [details]
/proc/config.gz from Mageia's kernel 5.0.7-desktop, used for custom amd-staging-drm-next build

Comment 4 Rémi Verschelde 2019-04-12 14:52:44 UTC

Pasting some relevant output from attachment 143951 so that its keywords can be found by Bugzilla searches.

```
[  325.087186] mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)
[  325.087187] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
[  325.087188] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087189] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087224] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087225] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087226] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087226] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087227] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.087228] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[  325.089212] mce: CPU7: Core temperature/speed normal
[  325.089213] mce: CPU0: Package temperature/speed normal
[  325.089214] mce: CPU3: Core temperature/speed normal
[  325.089214] mce: CPU4: Package temperature/speed normal
[  325.089215] mce: CPU7: Package temperature/speed normal
[  325.089215] mce: CPU3: Package temperature/speed normal
[  325.089248] mce: CPU6: Package temperature/speed normal
[  325.089248] mce: CPU5: Package temperature/speed normal
[  325.089249] mce: CPU2: Package temperature/speed normal
[  325.089250] mce: CPU1: Package temperature/speed normal
[  565.312183] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0040d508 for process  pid 0 thread  pid 0
[  565.312194] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00169208
[  565.312200] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312209] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 1479176, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  565.312219] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00405508 for process  pid 0 thread  pid 0
[  565.312224] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
[  565.312229] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312236] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  565.312244] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00485508 for process  pid 0 thread  pid 0
[  565.312248] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
[  565.312252] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312258] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)

<snip>

[  565.312378] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00785508 for process  pid 0 thread  pid 0
[  565.312383] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0xFFFFFFFF
[  565.312387] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0xFFFFFFFF
[  565.312393] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page 4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  575.625913] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=117668, emitted seq=117670
[  575.625950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162
[  575.625953] amdgpu 0000:01:00.0: GPU reset begin!
[  575.626419] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.626420] amdgpu: [powerplay] 
                failed to send message 281 ret is 65535 
[  575.636259] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
[  575.651311] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651312] amdgpu: [powerplay] 
                failed to send message 133 ret is 65535 
[  575.651316] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651316] amdgpu: [powerplay] 
                failed to send message 310 ret is 65535 
[  575.651317] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651317] amdgpu: [powerplay] 
                failed to send message 5e ret is 65535 

<snip>

[  575.651340] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651341] amdgpu: [powerplay] 
                failed to send message 84 ret is 65535 
[  575.651341] amdgpu: [powerplay] Failed to force to switch arbf0!
[  575.651342] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[  575.651360] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[  575.769673] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  575.769740] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  575.888355] cp is busy, skip halt cp
[  576.007183] rlc is busy, skip halt rlc
[  576.008188] amdgpu 0000:01:00.0: GPU pci config reset
[  576.126260] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* ASIC reset failed with err r, -22 for drm dev, 0000:01:00.0
[  576.127736] Asynchronous wait on fence drm_sched:gfx:1ca87 timed out (hint:submit_notify+0x0/0x58 [i915])
[  576.127768] Asynchronous wait on fence drm_sched:gfx:1ca82 timed out (hint:submit_notify+0x0/0x58 [i915])
[  576.127788] Asynchronous wait on fence i915:Xorg[3673]/0:6455 timed out (hint:intel_atomic_commit_ready+0x0/0x4c [i915])
[  581.126683] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[  581.126734] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D654 (len 62, WS 0, PS 0) @ 0xD670
[  581.126754] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC42B
[  581.126755] [drm] asic atom init failed!
[  581.126765] amdgpu 0000:01:00.0: GPU reset(2) failed
[  581.126766] amdgpu 0000:01:00.0: GPU reset end with ret = -22
[  581.126777] [drm] Skip scheduling IBs!
[  581.126782] [drm] Skip scheduling IBs!
[  581.126784] [drm] Skip scheduling IBs!
[  581.126785] [drm] Skip scheduling IBs!
[  581.126786] [drm] Skip scheduling IBs!
[  581.126787] [drm] Skip scheduling IBs!
[  581.126789] [drm] Skip scheduling IBs!
[  581.126790] [drm] Skip scheduling IBs!
[  581.126791] [drm] Skip scheduling IBs!
[  591.487678] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=117670, emitted seq=117670
[  591.487716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162
[  591.487719] amdgpu 0000:01:00.0: GPU reset begin!
[  591.488418] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  591.488419] amdgpu: [powerplay] 
                failed to send message 281 ret is 65535 
[  591.488495] WARNING: CPU: 2 PID: 666 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:788 dm_suspend+0x4e/0x60 [amdgpu]
[  591.488496] Modules linked in: cmac rfcomm ccm msr ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat nf_nat_ipv6 ip6table_raw nf_log_ipv6 ip6table_filter ip6_tables xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle iptable_nat nf_nat_ipv4 xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_sane nf_conntrack_netlink nfnetlink nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda nf_nat nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter af_packet bnep binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat dm_mirror dm_region_hash dm_log dm_mod snd_hda_codec_hdmi arc4 joydev
[  591.488509]  intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm hid_sensor_incl_3d hid_sensor_gyro_3d hid_sensor_magn_3d hid_sensor_rotation hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio irqbypass hid_multitouch crc32_pclmul crc32c_intel ghash_clmulni_intel spi_pxa2xx_platform 8250_dw iwlmvm hid_sensor_hub aesni_intel iTCO_wdt iTCO_vendor_support mac80211 snd_hda_codec_realtek hid_generic aes_x86_64 input_leds tpm_crb crypto_simd cryptd snd_hda_codec_generic glue_helper ledtrig_audio intel_cstate psmouse intel_uncore iwlwifi snd_hda_intel thermal snd_hda_codec uvcvideo btusb snd_hda_core btbcm videobuf2_vmalloc btrtl videobuf2_memops videobuf2_v4l2 btintel videobuf2_common cfg80211 snd_hwdep videodev snd_pcm bluetooth media snd_timer intel_rapl_perf pinctrl_sunrisepoint ucsi_acpi typec_ucsi usbhid typec tpm_tis pinctrl_intel intel_wmi_thunderbolt snd tpm_tis_core hp_wmi soundcore tpm wmi_bmof idma64 ecdh_generic
[  591.488521]  int3400_thermal battery virt_dma button acpi_thermal_rel rtsx_pci_ms intel_vbtn i2c_i801 acpi_pad hp_wireless ac rfkill sparse_keymap int3403_thermal memstick mei_me mei intel_lpss_pci intel_pch_thermal intel_lpss processor_thermal_device intel_ishtp_hid int340x_thermal_zone intel_soc_dts_iosf evdev nvram sch_fq_codel efivarfs ip_tables x_tables ipv6 crc_ccitt autofs4 amdgpu xhci_pci rtsx_pci_sdmmc xhci_hcd mmc_block mmc_core usbcore serio_raw chash amd_iommu_v2 rtsx_pci gpu_sched intel_ish_ipc ttm intel_ishtp usb_common i915 i2c_hid hid i2c_algo_bit drm_kms_helper wmi video drm
[  591.488549] CPU: 2 PID: 666 Comm: kworker/2:2 Not tainted 5.0.7-desktop-4.mga7 #1
[  591.488550] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB, BIOS F.24 11/06/2018
[  591.488552] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  591.488627] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[  591.488627] Code: 00 48 89 83 70 cb 00 00 e8 af fc ff ff 48 89 df e8 67 75 00 00 48 8b bb 60 b3 00 00 be 08 00 00 00 e8 16 8f 0a 00 31 c0 5b c3 <0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[  591.488628] RSP: 0018:ffffb50201f97d20 EFLAGS: 00010282
[  591.488629] RAX: ffffffffc08a3e00 RBX: ffff93f4a35c0000 RCX: 0000000000000012
[  591.488629] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff93f4a35c0000
[  591.488629] RBP: ffff93f4a35ccb98 R08: 0000000000000492 R09: 0000000000000004
[  591.488630] R10: 0000000000000000 R11: 0000000000000001 R12: ffff93f4a35c0000
[  591.488630] R13: ffffffffc09e25a0 R14: 0000000000000000 R15: ffff93f4a35c3498
[  591.488631] FS:  0000000000000000(0000) GS:ffff93f4b1c80000(0000) knlGS:0000000000000000
[  591.488631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  591.488632] CR2: 00007f8c18a40a38 CR3: 000000033220e002 CR4: 00000000003606e0
[  591.488632] Call Trace:
[  591.488676]  amdgpu_device_ip_suspend_phase1+0x94/0xc0 [amdgpu]
[  591.488721]  amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[  591.488796]  amdgpu_device_pre_asic_reset+0x9e/0x260 [amdgpu]
[  591.488817]  amdgpu_device_gpu_recover+0x87/0x7e0 [amdgpu]
[  591.488828]  ? drm_err+0x72/0x90 [drm]
[  591.488882]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[  591.488884]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[  591.488887]  process_one_work+0x200/0x400
[  591.488888]  worker_thread+0x2d/0x3d0
[  591.488889]  ? process_one_work+0x400/0x400
[  591.488891]  kthread+0x112/0x130
[  591.488892]  ? kthread_create_on_node+0x60/0x60
[  591.488894]  ret_from_fork+0x35/0x40
[  591.488895] ---[ end trace 356c1ae357df635c ]---
[  591.499325] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
```
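
One pattern worth noting in the excerpt above: once the faults start, the registers read back as all ones (`0xFFFFFFFF`). The decoded values in the `VM fault` lines (`0xff`, `vmid 15`, `511`) and the powerplay `ret is 65535` are each the largest value that fits in a bitfield of some width. The bogus 511°C temperature from the description is numerically the same as an all-ones 9-bit value. The sketch below only checks that arithmetic. The field widths are inferred from the printed values, not taken from the driver source, and this report does not establish what causes the all-ones reads.

```python
def all_ones(width_bits):
    """Largest unsigned value that fits in a field of the given width."""
    return (1 << width_bits) - 1


# (label from the dmesg excerpt, printed value, ASSUMED field width in bits)
OBSERVED = [
    ("VM_CONTEXT1_PROTECTION_FAULT_STATUS", 0xFFFFFFFF, 32),
    ("fault page number", 4294967295, 32),
    ("first field in 'VM fault (0xff, ...)'", 0xFF, 8),
    ("vmid", 15, 4),
    ("trailing '(511)' in the VM fault line", 511, 9),
    ("powerplay 'ret is 65535'", 65535, 16),
]


def saturated_fields(observed=OBSERVED):
    """Return the labels whose value equals the all-ones value for its width."""
    return [label for label, value, width in observed if value == all_ones(width)]


if __name__ == "__main__":
    for label in saturated_fields():
        print(f"all ones: {label}")
```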

Comment 5 Rémi Verschelde 2019-04-13 13:27:28 UTC

Tried another game (Northgard) today, same issue. I manually enabled the performance CPU governor, but it didn't prevent the GPU crash which happened ~10 min in game.

I'm attaching the dmesg, journalctl and Xorg.1.log taken right after the crash (before deadlock).

Comment 6 Rémi Verschelde 2019-04-13 13:29:00 UTC

Created attachment 143958 [details]
dmesg output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7

Worth noting: unlike For The King and StarCrawlers, Northgard is not a Unity3D game. It uses the Haxe/Heaps engine.

Comment 7 Rémi Verschelde 2019-04-13 13:30:15 UTC

Created attachment 143959 [details]
journalctl -b output after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7

As can be seen in these logs, I'm running Plasma 5/KWin. Some messages from plasmashell regarding temperature/sensors are likely due to the widget I use to monitor the CPU and GPU temperatures in the taskbar.

Comment 8 Rémi Verschelde 2019-04-13 13:32:00 UTC

Created attachment 143960 [details]
Xorg.0.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7

This seems to only cover the startup of the computer. The rest of the log seems to be in Xorg.1.log, I guess DRI_PRIME=1 does that.

Comment 9 Rémi Verschelde 2019-04-13 13:32:16 UTC

Created attachment 143961 [details]
Xorg.1.log after GPU crash in game Northgard with kernel 5.0.7-desktop from Mageia 7

Comment 10 Alex Behling 2019-07-28 15:41:10 UTC

In my experience this seems to be a thermal problem. I have the exact same hardware configuration, running the latest Arch Linux kernel.

$ uname -a
Linux lexnote 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019 x86_64 GNU/Linux

If I leave the system with the default PM settings (whether the profile is Performance or Balance doesn't matter), sooner or later I get lockups in any game or application with a higher GPU load.


EXAMPLE DMESG OUTPUT:

[Do Jul 25 23:33:45 2019] amdgpu 0000:01:00.0: GPU pci config reset
[Do Jul 25 23:33:53 2019] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[Do Jul 25 23:33:53 2019] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[Do Jul 25 23:33:53 2019] [drm] schedsdma0 is not ready, skipping
[Do Jul 25 23:33:53 2019] [drm] schedsdma1 is not ready, skipping
[Do Jul 25 23:33:59 2019] WARNING: CPU: 1 PID: 20969 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:891 dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Modules linked in: msr fuse 8021q garp mrp stp llc ccm snd_hda_codec_hdmi hid_sensor_gyro_3d hid_sensor_accel_3d hid_sensor_magn_3d hid_sensor_rotation hid_sensor_incl_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_hub intel_ishtp_loader intel_ishtp_hid arc4 iwlmvm mousedev cdc_ether usbnet r8152 xpad ff_memless joydev mii mac80211 uvcvideo btusb videobuf2_vmalloc hid_logitech_hidpp videobuf2_memops btrtl btbcm nls_iso8859_1 videobuf2_v4l2 btintel nls_cp437 videobuf2_common bluetooth vfat fat videodev media spi_pxa2xx_platform ecdh_generic iTCO_wdt 8250_dw hid_multitouch ecc mei_hdcp iTCO_vendor_support iwlwifi intel_rapl hp_wmi x86_pkg_temp_thermal wmi_bmof intel_powerclamp intel_wmi_thunderbolt coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm psmouse input_leds snd_hda_intel cfg80211 irqbypass intel_cstate snd_hda_codec snd_hda_core intel_uncore snd_hwdep intel_rapl_perf snd_pcm
[Do Jul 25 23:33:59 2019]  rtsx_pci_ms memstick snd_timer pcspkr mei_me intel_ish_ipc processor_thermal_device snd idma64 int3403_thermal ucsi_acpi i2c_i801 soundcore typec_ucsi rfkill intel_lpss_pci mei tpm_crb int340x_thermal_zone intel_pch_thermal intel_ishtp intel_soc_dts_iosf intel_lpss i2c_hid typec wmi tpm_tis tpm_tis_core tpm rng_core intel_vbtn battery sparse_keymap hp_wireless evdev mac_hid int3400_thermal acpi_thermal_rel ac pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 algif_skcipher af_alg hid_logitech_dj hid_generic usbhid hid dm_crypt crct10dif_pclmul crc32_pclmul dm_mod crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 ahci libahci aesni_intel libata aes_x86_64 crypto_simd cryptd xhci_pci glue_helper scsi_mod xhci_hcd rtsx_pci i8042 serio amdgpu amd_iommu_v2 gpu_sched ttm i915 intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[Do Jul 25 23:33:59 2019]  agpgart
[Do Jul 25 23:33:59 2019] CPU: 1 PID: 20969 Comm: kworker/1:2 Tainted: G           OE     5.2.1-arch1-1-ARCH #1
[Do Jul 25 23:33:59 2019] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB, BIOS F.24 11/06/2018
[Do Jul 25 23:33:59 2019] Workqueue: pm pm_runtime_work
[Do Jul 25 23:33:59 2019] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Code: 00 48 89 83 70 e9 00 00 e8 9f fc ff ff 48 89 df e8 97 83 00 00 48 8b bb 70 cf 00 00 be 08 00 00 00 e8 b6 9a 08 00 31 c0 5b c3 <0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[Do Jul 25 23:33:59 2019] RSP: 0018:ffffb286869cfcb8 EFLAGS: 00010282
[Do Jul 25 23:33:59 2019] RAX: ffffffffc0675ed0 RBX: ffffa1b9e0d30000 RCX: ffffffffc073e980
[Do Jul 25 23:33:59 2019] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] RBP: ffffa1b9e0d3e998 R08: 0000000000000001 R09: 0000000000000018
[Do Jul 25 23:33:59 2019] R10: fefefefefefefeff R11: 0000000000000000 R12: ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa1b9ebc8bd80
[Do Jul 25 23:33:59 2019] FS:  0000000000000000(0000) GS:ffffa1b9eea40000(0000) knlGS:0000000000000000
[Do Jul 25 23:33:59 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Do Jul 25 23:33:59 2019] CR2: 00007f1d9c02b000 CR3: 0000000469058002 CR4: 00000000003606e0
[Do Jul 25 23:33:59 2019] Call Trace:
[Do Jul 25 23:33:59 2019]  amdgpu_device_ip_suspend_phase1+0x8e/0xc0 [amdgpu]
[Do Jul 25 23:33:59 2019]  amdgpu_device_suspend+0x234/0x390 [amdgpu]
[Do Jul 25 23:33:59 2019]  amdgpu_pmops_runtime_suspend+0x41/0xb0 [amdgpu]
[Do Jul 25 23:33:59 2019]  pci_pm_runtime_suspend+0x5b/0x150
[Do Jul 25 23:33:59 2019]  ? __switch_to_asm+0x40/0x70
[Do Jul 25 23:33:59 2019]  vga_switcheroo_runtime_suspend+0x25/0xb0
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  __rpm_callback+0x7b/0x130
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  rpm_callback+0x2a/0x90
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  rpm_suspend+0x136/0x610
[Do Jul 25 23:33:59 2019]  pm_runtime_work+0x94/0xa0
[Do Jul 25 23:33:59 2019]  process_one_work+0x1d1/0x3e0
[Do Jul 25 23:33:59 2019]  worker_thread+0x4a/0x3d0
[Do Jul 25 23:33:59 2019]  kthread+0xfd/0x130
[Do Jul 25 23:33:59 2019]  ? process_one_work+0x3e0/0x3e0
[Do Jul 25 23:33:59 2019]  ? kthread_park+0x90/0x90
[Do Jul 25 23:33:59 2019]  ret_from_fork+0x35/0x40
[Do Jul 25 23:33:59 2019] ---[ end trace 69a711ec632dab70 ]---
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable SCLK DPM when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable voltage DPM when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Failed to force to switch arbf0!
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[Do Jul 25 23:33:59 2019] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -22
[Do Jul 25 23:34:00 2019] cp is busy, skip halt cp
[Do Jul 25 23:34:00 2019] rlc is busy, skip halt rlc
[Do Jul 25 23:34:00 2019] amdgpu 0000:01:00.0: GPU pci config reset
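
The negative numbers in the `failed -110` and `failed -22` messages in the logs in this thread are kernel error codes. Below is a small sketch for pulling the failing IP blocks out of a saved dmesg and naming those codes. The regular expression is written against the lines quoted here, the two errno names are the standard Linux values, and the function name is hypothetical.

```python
import re

# Standard Linux errno values for the two codes seen in this thread.
ERRNO_NAMES = {22: "EINVAL", 110: "ETIMEDOUT"}

# Matches e.g. "*ERROR* suspend of IP block <vce_v3_0> failed -110".
IP_BLOCK_RE = re.compile(
    r"\*ERROR\* (suspend|resume) of IP block <([^>]+)> failed (-\d+)"
)


def failed_ip_blocks(dmesg_text):
    """Return (action, block, errno_name) for each failed IP block line."""
    results = []
    for match in IP_BLOCK_RE.finditer(dmesg_text):
        action, block = match.group(1), match.group(2)
        code = -int(match.group(3))
        results.append((action, block, ERRNO_NAMES.get(code, f"errno {code}")))
    return results


if __name__ == "__main__":
    sample = (
        "[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* "
        "resume of IP block <gfx_v8_0> failed -110\n"
        "[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* "
        "suspend of IP block <powerplay> failed -22\n"
    )
    for action, block, name in failed_ip_blocks(sample):
        print(f"{action} of {block}: {name}")
```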

If I reduce the maximum CPU frequency to 2.2 GHz, keeping the temperatures of the CPU cores and GPU just below 60 degrees Celsius, the problem no longer occurs.

$ sudo cpupower frequency-set -u 2.2GHz
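
To confirm that the cap took effect, the per-CPU limit can be read back from sysfs. This is a minimal sketch, assuming the standard cpufreq interface (`/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq`, values in kHz). The helper names are hypothetical and the 2.2 GHz default simply mirrors the command above.

```python
from pathlib import Path


def read_max_freqs_khz(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its scaling_max_freq in kHz."""
    freqs = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_max_freq"
    for freq_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = freq_file.parent.parent.name
        freqs[cpu_name] = int(freq_file.read_text().strip())
    return freqs


def cpus_above_cap(freqs_khz, cap_ghz=2.2):
    """Return the CPUs whose maximum frequency is still above the cap."""
    cap_khz = round(cap_ghz * 1_000_000)
    return [cpu for cpu, khz in freqs_khz.items() if khz > cap_khz]


if __name__ == "__main__":
    offenders = cpus_above_cap(read_max_freqs_khz())
    if offenders:
        print("cap not applied on:", ", ".join(offenders))
    else:
        print("all CPUs are at or below the cap")
```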

Comment 11 Utku Helvacı (tuxutku) 2019-08-01 18:44:41 UTC

Created attachment 144926 [details]
journalctl -b0 output on kernel 5.3.0-rc2 from ubuntu mainline repository, with a system with rx 540 gpu

Kernel 5.3.0-rc1 worked fine and had just fixed a long-standing regression on the RX 540 GPU. Updating to 5.3.0-rc2 causes the GPU to be disabled after launching a single application with it: the GPU works fine until the application is closed, after which DRI_PRIME=1 no longer works.

Comment 12 Utku Helvacı (tuxutku) 2019-08-07 18:56:45 UTC

As it turns out, this is not a bug in the kernel but in the ACO compiler for AMD GPUs, so it is irrelevant here.

Comment 13 Martin Peres 2019-11-19 09:19:20 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/747.
