Bug 111921 - GPU crash on VegaM (amdgpu: The CS has been rejected)
Summary: GPU crash on VegaM (amdgpu: The CS has been rejected)
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set major
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-08 07:57 UTC by Rémi Verschelde
Modified: 2019-10-09 14:56 UTC
CC List: 2 users



Attachments
dmesg output after GPU crash with "amdgpu: The CS has been rejected" (88.86 KB, text/x-log)
2019-10-08 07:57 UTC, Rémi Verschelde
dmesg output after GPU crash with "amdgpu: The CS has been rejected" (74.93 KB, text/x-log)
2019-10-09 10:43 UTC, Rémi Verschelde
dmesg output running kernel 5.1.20 and resuming from screensaver, no bug (71.64 KB, text/x-log)
2019-10-09 10:44 UTC, Rémi Verschelde

Description Rémi Verschelde 2019-10-08 07:57:53 UTC
Created attachment 145680
dmesg output after GPU crash with "amdgpu: The CS has been rejected"

Might be related to bug 111860.

In my case, the GPU crashes or fails to resume when I use the Godot Engine FOSS application: https://github.com/godotengine/godot

The application works fine for a time, but eventually it will freeze and this gets printed to the terminal:

amdgpu: The CS has been rejected, see dmesg for more information (-2).
amdgpu: The CS has been rejected, see dmesg for more information (-19).

(attaching dmesg)
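
For anyone triaging: the numbers in parentheses look like negated Linux errno values, which is the usual convention for a failed ioctl. A minimal sketch for decoding them, assuming that interpretation holds; the lookup table is hardcoded with the Linux values and the helper name `decode_cs_rejections` is made up for illustration:

```python
import re

# Linux errno values, hardcoded so the sketch does not depend on the host OS.
# Assumption: the number printed by the driver is a negated errno.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    19: ("ENODEV", "No such device"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

CS_REJECTED = re.compile(r"The CS has been rejected.*\((-?\d+)\)")


def decode_cs_rejections(text):
    """Return (code, name, description) for each 'CS has been rejected' line."""
    decoded = []
    for line in text.splitlines():
        match = CS_REJECTED.search(line)
        if match:
            code = abs(int(match.group(1)))
            name, desc = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
            decoded.append((-code, name, desc))
    return decoded


terminal_output = """\
amdgpu: The CS has been rejected, see dmesg for more information (-2).
amdgpu: The CS has been rejected, see dmesg for more information (-19).
"""

for code, name, desc in decode_cs_rejections(terminal_output):
    print(f"{code}: {name} ({desc})")
```

Read this way, -19 (ENODEV) would mean the kernel no longer considers the device usable.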

At this point, I have to kill the application, and reboot if I want to use the GPU again.

This seems to happen mainly when alt-tabbing between Godot and the desktop or terminal (both of which run on the Intel HD 630 IGP), so it might be an issue with context switching?

I don't have precise steps to reproduce yet, apart from running Godot (debug build from the git master branch) alongside other applications; it eventually crashes within 5-10 minutes.

I *think* the bug started to happen when I upgraded to kernel 5.2.x (now running 5.3.2, still having the bug). That's what bug 111860 claims too, so I'll attempt running 5.1.20 for a while to see if the bug still happens.

System info:

$ inxi
CPU: Quad Core Intel Core i7-8705G (-MT MCP-) speed/min/max: 1347/800/4100 MHz Kernel: 5.3.2-desktop-1.mga7 x86_64 Up: 2h 44m 
Mem: 3451.1/15767.7 MiB (21.9%) Storage: 953.87 GiB (58.3% used) Procs: 241 Shell: bash 4.4.23 inxi: 3.0.33 
$ inxi -G
Graphics:  Device-1: Intel HD Graphics 630 driver: i915 v: kernel 
           Device-2: Advanced Micro Devices [AMD/ATI] Polaris 22 XL [Radeon RX Vega M GL] driver: amdgpu v: kernel 
           Display: x11 server: Mageia X.org 1.20.4 driver: amdgpu,intel FAILED: ati unloaded: fbdev,modesetting,vesa tty: N/A 
           OpenGL: renderer: Mesa DRI Intel HD Graphics 630 (Kaby Lake GT2) v: 4.5 Mesa 19.1.7
Comment 1 Rémi Verschelde 2019-10-08 08:00:07 UTC
Pasting relevant part of `dmesg` log:
```
[ 7813.339782] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 7813.454656] [drm] UVD and UVD ENC initialized successfully.
[ 7813.565585] [drm] VCE initialized successfully.
[ 7836.109655] amdgpu 0000:01:00.0: GPU pci config reset
[ 7852.253685] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 7852.479940] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 7852.479971] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[ 7852.480000] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[ 7852.497697] [drm] schedsdma0 is not ready, skipping
[ 7852.497697] [drm] schedsdma1 is not ready, skipping
[ 7852.508213] Move buffer fallback to memcpy unavailable
[ 7852.508264] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
[ 7852.508377] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 7852.508379] #PF: supervisor read access in kernel mode
[ 7852.508379] #PF: error_code(0x0000) - not-present page
[ 7852.508380] PGD 800000030ef3d067 P4D 800000030ef3d067 PUD 30e4e3067 PMD 0 
[ 7852.508383] Oops: 0000 [#1] SMP PTI
[ 7852.508384] CPU: 0 PID: 30196 Comm: godot.x11.:cs0 Tainted: G        W  O      5.3.2-desktop-1.mga7 #1
[ 7852.508385] Hardware name: HP HP Spectre x360 Convertible/83BB, BIOS F.30 03/07/2019
[ 7852.508433] RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x110 [amdgpu]
[ 7852.508434] Code: 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 47 08 48 8b aa 88 01 00 00 4c 8b a8 80 00 00 00 48 8b 80 c8 00 00 00 <4c> 8b 70 08 8b 45 08 4d 8d 7e 88 85 c0 0f 84 1c 0e 1f 00 49 8b 46
[ 7852.508435] RSP: 0018:ffffb5d70ec939e0 EFLAGS: 00010246
[ 7852.508436] RAX: 0000000000000000 RBX: ffffb5d70ec93a28 RCX: 0000000000000800
[ 7852.508437] RDX: ffff988d83c97c00 RSI: ffff988c8e5ae9b8 RDI: ffffb5d70ec93a28
[ 7852.508438] RBP: ffff988d83c97df8 R08: 0000000000001000 R09: 0000000000000011
[ 7852.508438] R10: 0000000000000600 R11: 000000000000000d R12: ffff988c8e5ae9b8
[ 7852.508439] R13: ffff988d922cc000 R14: 00000000000005ff R15: 0000000000000071
[ 7852.508440] FS:  00007f2cf7602700(0000) GS:ffff988e31c00000(0000) knlGS:0000000000000000
[ 7852.508441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7852.508441] CR2: 0000000000000008 CR3: 0000000308aca001 CR4: 00000000003606f0
[ 7852.508442] Call Trace:
[ 7852.508482]  amdgpu_vm_bo_update_mapping+0xcd/0xe0 [amdgpu]
[ 7852.508518]  amdgpu_vm_bo_update+0x336/0x730 [amdgpu]
[ 7852.508552]  amdgpu_cs_ioctl+0x1324/0x1a40 [amdgpu]
[ 7852.508555]  ? __switch_to_asm+0x34/0x70
[ 7852.508591]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[ 7852.508600]  drm_ioctl_kernel+0xac/0xf0 [drm]
[ 7852.508608]  drm_ioctl+0x201/0x3a0 [drm]
[ 7852.508640]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[ 7852.508643]  ? do_futex+0xca/0xb70
[ 7852.508674]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 7852.508677]  do_vfs_ioctl+0xa4/0x630
[ 7852.508678]  ? __x64_sys_futex+0x13c/0x180
[ 7852.508680]  ksys_ioctl+0x60/0x90
[ 7852.508681]  __x64_sys_ioctl+0x16/0x20
[ 7852.508683]  do_syscall_64+0x69/0x1d0
[ 7852.508684]  ? prepare_exit_to_usermode+0x4c/0xb0
[ 7852.508686]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 7852.508687] RIP: 0033:0x7f2d0c21f2b7
[ 7852.508688] Code: 0f 1f 00 64 48 8b 14 25 00 00 00 00 48 8b 05 d0 8b 0c 00 c7 04 02 26 00 00 00 48 c7 c0 ff ff ff ff c3 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 8b 0c 00 f7 d8 64 89 01 48
[ 7852.508689] RSP: 002b:00007f2cf7601af8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 7852.508690] RAX: ffffffffffffffda RBX: 00007f2cf7601bf8 RCX: 00007f2d0c21f2b7
[ 7852.508691] RDX: 00007f2cf7601b60 RSI: 00000000c0186444 RDI: 0000000000000006
[ 7852.508691] RBP: 00007f2cf7601b60 R08: 00007f2cf7601c50 R09: 0000000000000020
[ 7852.508692] R10: 00007f2cf7601c50 R11: 0000000000000246 R12: 00000000c0186444
[ 7852.508693] R13: 0000000000000006 R14: 00000000073bb160 R15: 00000000073bb1e8
[ 7852.508694] Modules linked in: cmac rfcomm msr ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat ip6table_raw nf_log_ipv6 ip6table_filter ip6_tables xt_recent ipt_IFWLOG ipt_psd xt_set ip_set_hash_ip ip_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle iptable_nat xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_nat nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack nf_defrag_ipv4 iptable_filter ccm af_packet bnep vboxnetadp(O) vboxnetflt(O) vboxdrv(O) fuse nls_iso8859_1 nls_cp437 vfat fat dm_mirror dm_region_hash dm_log dm_mod btusb uvcvideo btbcm btrtl btintel videobuf2_vmalloc
[ 7852.508713]  videobuf2_memops bluetooth videobuf2_v4l2 videobuf2_common videodev mc usbhid ecdh_generic ecc snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek coretemp iwlmvm snd_hda_codec_generic ledtrig_audio kvm_intel joydev snd_hda_intel mac80211 kvm libarc4 snd_hda_codec irqbypass crc32_pclmul snd_hda_core iwlwifi crc32c_intel spi_pxa2xx_platform dw_dmac dw_dmac_core hid_multitouch 8250_dw ghash_clmulni_intel hid_sensor_magn_3d hid_sensor_gyro_3d hid_sensor_incl_3d aesni_intel hid_sensor_rotation hid_sensor_accel_3d iTCO_wdt iTCO_vendor_support hid_sensor_trigger industrialio_triggered_buffer aes_x86_64 tpm_crb kfifo_buf crypto_simd snd_hwdep cryptd hid_sensor_iio_common industrialio cfg80211 glue_helper snd_pcm intel_cstate intel_uncore mei_hdcp snd_timer hp_wmi ucsi_acpi tpm_tis typec_ucsi tpm_tis_core snd hid_sensor_hub psmouse intel_rapl_msr intel_rapl_perf wmi_bmof soundcore intel_wmi_thunderbolt rfkill i2c_i801 rtsx_pci_ms input_leds hid_generic
[ 7852.508731]  memstick pinctrl_sunrisepoint idma64 int3403_thermal thermal typec battery pinctrl_intel virt_dma tpm hp_wireless intel_vbtn sparse_keymap int3400_thermal acpi_pad acpi_thermal_rel ac intel_ishtp_loader button mei_me mei intel_lpss_pci intel_pch_thermal intel_ishtp_hid processor_thermal_device intel_lpss cros_ec_ishtp intel_rapl_common int340x_thermal_zone cros_ec_core intel_soc_dts_iosf evdev sch_fq_codel nvram binfmt_misc efivarfs ip_tables x_tables ipv6 crc_ccitt nf_defrag_ipv6 autofs4 amdgpu xhci_pci xhci_hcd rtsx_pci_sdmmc mmc_block mmc_core usbcore rtsx_pci amd_iommu_v2 serio_raw gpu_sched intel_ish_ipc intel_ishtp ttm usb_common i915 i2c_hid hid i2c_algo_bit wmi drm_kms_helper video drm
[ 7852.508746] CR2: 0000000000000008
[ 7852.508748] ---[ end trace 4ad3d7dd37eb10d6 ]---
[ 7852.508787] RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x110 [amdgpu]
[ 7852.508788] Code: 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 47 08 48 8b aa 88 01 00 00 4c 8b a8 80 00 00 00 48 8b 80 c8 00 00 00 <4c> 8b 70 08 8b 45 08 4d 8d 7e 88 85 c0 0f 84 1c 0e 1f 00 49 8b 46
[ 7852.508789] RSP: 0018:ffffb5d70ec939e0 EFLAGS: 00010246
[ 7852.508790] RAX: 0000000000000000 RBX: ffffb5d70ec93a28 RCX: 0000000000000800
[ 7852.508790] RDX: ffff988d83c97c00 RSI: ffff988c8e5ae9b8 RDI: ffffb5d70ec93a28
[ 7852.508791] RBP: ffff988d83c97df8 R08: 0000000000001000 R09: 0000000000000011
[ 7852.508792] R10: 0000000000000600 R11: 000000000000000d R12: ffff988c8e5ae9b8
[ 7852.508792] R13: ffff988d922cc000 R14: 00000000000005ff R15: 0000000000000071
[ 7852.508793] FS:  00007f2cf7602700(0000) GS:ffff988e31c00000(0000) knlGS:0000000000000000
[ 7852.508794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7852.508795] CR2: 0000000000000008 CR3: 0000000308aca001 CR4: 00000000003606f0
```
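
To make the excerpt easier to scan, here is a rough sketch that pulls out the `*ERROR*` return codes and the kernel function where the oops occurred. The regexes are only tuned to the lines pasted above, and the names (`summarize`, `ERROR_CODE`, `RIP_LINE`) are hypothetical, not part of any existing tool:

```python
import re

# Trailing return code on a DRM "*ERROR*" line, e.g. "(-110)", "-110" or "-19!".
ERROR_CODE = re.compile(r"\*ERROR\*.*?(-\d+)\)?[.!]?$")

# Kernel-mode RIP line (code segment 0010): function+offset/size [module].
RIP_LINE = re.compile(
    r"RIP: 0010:(?P<func>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+) "
    r"\[(?P<module>\w+)\]"
)


def summarize(log):
    """Return (list of error codes, first kernel crash site or None)."""
    codes = []
    crash_site = None
    for line in log.splitlines():
        err = ERROR_CODE.search(line)
        if err:
            codes.append(int(err.group(1)))
        rip = RIP_LINE.search(line)
        if rip and crash_site is None:
            crash_site = (rip["module"], rip["func"], int(rip["off"], 16))
    return codes, crash_site


dmesg_excerpt = """\
[ 7852.479940] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 7852.479971] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[ 7852.480000] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[ 7852.508264] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
[ 7852.508433] RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x110 [amdgpu]
[ 7852.508687] RIP: 0033:0x7f2d0c21f2b7
"""

codes, crash_site = summarize(dmesg_excerpt)
print("error codes:", codes)
print("crash site:", crash_site)
```

The sequence it surfaces matches the log: three -110 failures during resume, then -19 on the next command submission, then the NULL pointer dereference in `amdgpu_vm_sdma_commit`.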
Comment 2 Andrey Grodzovsky 2019-10-08 15:37:09 UTC
Hey, I noticed a lot of 'amdgpu 0000:01:00.0: GPU pci config reset' there. Since I see no command submission timeout errors, it looks like you manually tried to reset the GPU multiple times, and one of those attempts failed, after which the errors you described appeared. Is this correct?
Comment 3 Rémi Verschelde 2019-10-08 15:51:11 UTC
I don't reset the GPU manually, no. I'm not sure why this happens, but I've had such output in dmesg for as long as I can remember (since I got this laptop in March).

For reference, I've been using kernel 5.1.20 and have not experienced this crash. I'm not yet sure that's enough to call it a regression, though; I will test more in the coming days.
Comment 4 Andrey Grodzovsky 2019-10-08 16:45:16 UTC
What happens if you disable GPU reset by booting the kernel with amdgpu.gpu_recovery=0?
Comment 5 Rémi Verschelde 2019-10-09 10:43:03 UTC
(In reply to Andrey Grodzovsky from comment #4)
> What happens if you disable GPU reset by booting the kernel with
> amdgpu.gpu_recovery=0?

Good point, I forgot to mention that I added `amdgpu.dc=0 amdgpu.gpu_recovery=1` in an attempt to work around this issue just before reproducing it again. So I can confirm that I could reproduce this issue both without any amdgpu kernel parameters and with the above two.

I have now done some more testing with kernel 5.3.2 and `amdgpu.gpu_recovery=0` (also removing `amdgpu.dc=0`). Initially I could not trigger the bug, but I hit it when I let the desktop environment (KDE) start its screensaver while Godot was running on the AMD GPU. Once I resumed from the screensaver, the GPU crashed (note: I did not trigger suspend-to-RAM, the laptop was still powered).

The dmesg output is attached.

To compare, I did another test with kernel 5.1.20 (using `amdgpu.dc=0 amdgpu.gpu_recovery=1`), letting the screensaver kick in with Godot running on the AMD GPU, and it resumed without crashing. I am also attaching that dmesg output for comparison.
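
Since the tests above differ only in their amdgpu boot parameters, it helps to record which ones were actually in effect for each run. A small sketch that extracts `amdgpu.*` options from a kernel command line; on a running system the string can be read from `/proc/cmdline`. The example command line is made up for illustration (only the two amdgpu options come from this report), and `amdgpu_params` is a hypothetical helper:

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> options from a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


# Hypothetical command line; only the amdgpu.* options are taken from the report.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro amdgpu.dc=0 amdgpu.gpu_recovery=1 quiet"
print(amdgpu_params(example))
```

Pasting that output next to each dmesg attachment would make it unambiguous which configuration produced which log.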
Comment 6 Rémi Verschelde 2019-10-09 10:43:39 UTC
Created attachment 145681
dmesg output after GPU crash with "amdgpu: The CS has been rejected"

The terminal output matching this crash was:

amdgpu: The CS has been rejected, see dmesg for more information (-19).
amdgpu: The CS has been rejected, see dmesg for more information (-19).
Comment 7 Rémi Verschelde 2019-10-09 10:44:11 UTC
Created attachment 145682
dmesg output running kernel 5.1.20 and resuming from screensaver, no bug
Comment 8 Rémi Verschelde 2019-10-09 10:45:07 UTC
(In reply to Andrey Grodzovsky from comment #2)
> Hey, I noticed a lot of 'amdgpu 0000:01:00.0: GPU pci config reset' there.

These actually happen every time I change the focus between an application running on the AMD GPU (with `DRI_PRIME=1`) and another application (e.g. desktop environment, firefox, terminal) running on the Intel HD 630 IGP (`DRI_PRIME=0`, default).
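
To check how often these resets occur and whether they line up with focus changes, the timestamps can be pulled out of dmesg. A minimal sketch, assuming the default `[seconds.microseconds]` timestamp prefix; `reset_times` is a hypothetical helper and the sample lines are copied from the excerpt in comment 1:

```python
import re

RESET = re.compile(r"^\[\s*(\d+\.\d+)\] amdgpu [0-9a-f:.]+: GPU pci config reset$")


def reset_times(log):
    """Return the dmesg timestamps (in seconds) of each 'GPU pci config reset'."""
    times = []
    for line in log.splitlines():
        match = RESET.match(line)
        if match:
            times.append(float(match.group(1)))
    return times


excerpt = """\
[ 7813.565585] [drm] VCE initialized successfully.
[ 7836.109655] amdgpu 0000:01:00.0: GPU pci config reset
[ 7852.253685] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
"""

times = reset_times(excerpt)
print(len(times), "reset(s) at", times)
```

Running it over a full dmesg dump and comparing the timestamps with when the window focus changed would confirm or rule out the correlation.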
Comment 9 Andrey Grodzovsky 2019-10-09 14:18:31 UTC
(In reply to Rémi Verschelde from comment #8)
> (In reply to Andrey Grodzovsky from comment #2)
> > Hey, I noticed a lot of 'amdgpu 0000:01:00.0: GPU pci config reset' there.
> 
> These actually happen every time I change the focus between an application
> running on the AMD GPU (with `DRI_PRIME=1`) and another application (e.g.
> desktop environment, firefox, terminal) running on the Intel HD 630 IGP
> (`DRI_PRIME=0`, default).

So I guess the problem only happens when you run in DRI PRIME mode, with different apps rendering off of different GPUs?
Comment 10 Rémi Verschelde 2019-10-09 14:56:15 UTC
(In reply to Andrey Grodzovsky from comment #9)
> 
> So I guess the problem only happens when you run in DRI PRIME mode, with
> different apps rendering off of different GPUs?

Probably, but to my knowledge that's the only way the AMD discrete GPU can be used on such a hybrid graphics laptop.
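
For reproducing this setup, a sketch of how an application can be started on one GPU or the other by setting `DRI_PRIME` in its environment. The helper names `prime_env` and `launch` and the `./my_app` path are hypothetical; only the `DRI_PRIME=1` / `DRI_PRIME=0` values come from this report:

```python
import os
import subprocess


def prime_env(use_discrete, base=None):
    """Copy an environment and select the GPU via DRI_PRIME (1 = discrete, 0 = default)."""
    env = dict(os.environ if base is None else base)
    env["DRI_PRIME"] = "1" if use_discrete else "0"
    return env


def launch(command, use_discrete):
    """Start `command` (a list of arguments) with DRI_PRIME set accordingly."""
    return subprocess.Popen(command, env=prime_env(use_discrete))


# Example (hypothetical binary path), left commented out so nothing is spawned:
# proc = launch(["./my_app"], use_discrete=True)

print(prime_env(True, base={})["DRI_PRIME"])
```

Starting the same binary once with each setting is a quick way to check whether the crash only shows up when it renders on the discrete GPU.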

