Created attachment 145421 [details] Syslog data

Since kernel 5.2.x, resuming from suspend randomly crashes amdgpu/drm and leaves a black screen. After logging in via SSH, rebooting or halting the host does not complete, so a hard reset becomes unavoidable. Currently using 5.3.0-rc8 with an RX 480 and a single display. Stable 5.2.x kernels don't resolve the issue. The host does not last a week without a forced reboot. Please see the attached syslog trace. Thanks.
Created attachment 145422 [details] Xorg log
I can confirm this bug. Same symptoms here with 5.2.11, two screens, and this GPU:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)

Sep 29 08:26:39 phlegethon kernel: PM: suspend exit
Sep 29 08:26:40 phlegethon kernel: [drm] schedsdma0 is not ready, skipping
Sep 29 08:26:58 phlegethon kernel: [drm] schedsdma1 is not ready, skipping
Sep 29 08:26:58 phlegethon kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Sep 29 08:26:58 phlegethon kernel: #PF: supervisor read access in kernel mode
Sep 29 08:26:58 phlegethon kernel: #PF: error_code(0x0000) - not-present page
Sep 29 08:26:59 phlegethon kernel: PGD 6032ef067 P4D 6032ef067 PUD 603c09067 PMD 0
Sep 29 08:26:59 phlegethon kernel: Oops: 0000 [#1] SMP NOPTI
Sep 29 08:26:59 phlegethon kernel: CPU: 0 PID: 1429 Comm: X Tainted: G W 5.2.11 #1-NixOS
Sep 29 08:26:59 phlegethon kernel: Hardware name: Gigabyte Technology Co., Ltd. GA-78LMT-USB3 6.0/GA-78LMT-USB3 6.0, BIOS F2 11/25/2014
Sep 29 08:26:59 phlegethon kernel: RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x140 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: Code: 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 47 08 48 8b aa a8 01 00 00 4c 8b a8 80 00 00 00 48 8b 80 c8 00 00 00 <4c> 8b 70 08 8b 45 08 85 c0 4d 8d 7e 88 0f 84 c2 00 00 00 49 8b 46
Sep 29 08:26:59 phlegethon kernel: RSP: 0018:ffff99bb838ffad8 EFLAGS: 00010246
Sep 29 08:26:59 phlegethon kernel: RAX: 0000000000000000 RBX: ffff99bb838ffb20 RCX: 0000000000103c00
Sep 29 08:26:59 phlegethon kernel: RDX: ffff89abd64a4000 RSI: ffff99bb838ffba8 RDI: ffff99bb838ffb20
Sep 29 08:26:59 phlegethon kernel: RBP: ffff89abd64a4210 R08: ffff99bb838ffa6c R09: ffff99bb838ffa70
Sep 29 08:26:59 phlegethon kernel: R10: 0000000000103804 R11: 0000000000000021 R12: ffff99bb838ffba8
Sep 29 08:26:59 phlegethon kernel: R13: ffff89ac48fc6000 R14: 0000000000103803 R15: 0000000000000000
Sep 29 08:26:59 phlegethon kernel: FS: 00007fd089d18e40(0000) GS:ffff89ac67a00000(0000) knlGS:0000000000000000
Sep 29 08:26:59 phlegethon kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 29 08:26:59 phlegethon kernel: CR2: 0000000000000008 CR3: 0000000603afa000 CR4: 00000000000006f0
Sep 29 08:26:59 phlegethon kernel: Call Trace:
Sep 29 08:26:59 phlegethon kernel: amdgpu_vm_bo_update_mapping+0xcd/0xe0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: amdgpu_vm_clear_freed+0xbe/0x190 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: amdgpu_gem_va_ioctl+0x488/0x4f0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: ? drm_ioctl_kernel+0xac/0xf0 [drm]
Sep 29 08:26:59 phlegethon kernel: drm_ioctl_kernel+0xac/0xf0 [drm]
Sep 29 08:26:59 phlegethon kernel: ? sock_write_iter+0x8f/0xf0
Sep 29 08:26:59 phlegethon kernel: drm_ioctl+0x2e6/0x3a0 [drm]
Sep 29 08:26:59 phlegethon kernel: ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: ? do_iter_write+0xe2/0x190
Sep 29 08:26:59 phlegethon kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: do_vfs_ioctl+0xa4/0x630
Sep 29 08:26:59 phlegethon kernel: ksys_ioctl+0x70/0x80
Sep 29 08:26:59 phlegethon kernel: __x64_sys_ioctl+0x16/0x20
Sep 29 08:26:59 phlegethon kernel: do_syscall_64+0x4e/0x130
Sep 29 08:26:59 phlegethon kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 29 08:26:59 phlegethon kernel: RIP: 0033:0x7fd08a3c4b57
Sep 29 08:26:59 phlegethon kernel: Code: 00 00 00 48 8b 05 29 53 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f9 52 0c 00 f7 d8 64 89 01 48
Sep 29 08:26:59 phlegethon kernel: RSP: 002b:00007fff6a6738b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 29 08:26:59 phlegethon kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd08a3c4b57
Sep 29 08:26:59 phlegethon kernel: RDX: 00007fff6a673900 RSI: 00000000c0286448 RDI: 0000000000000012
Sep 29 08:26:59 phlegethon kernel: RBP: 00007fff6a673900 R08: 0000000103400000 R09: 000000000000000e
Sep 29 08:26:59 phlegethon kernel: R10: 0000000000000026 R11: 0000000000000246 R12: 00000000c0286448
Sep 29 08:26:59 phlegethon kernel: R13: 0000000000000012 R14: 0000000000000002 R15: 00000000043d74a0
Sep 29 08:26:59 phlegethon kernel: Modules linked in: fuse bridge stp llc cfg80211 msr rfkill 8021q ext4 crc16 mbcache jbd2 amdgpu wmi_bmof ppdev gpu_sched edac_core ttm drm_kms_helper snd_hda_codec_realtek k10temp snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi drm snd_hda_intel joydev r8169 ata_generic snd_hda_codec sp5100_tco uas agpgart watchdog i2c_algo_bit pata_acpi evdev fb_sys_fops syscopyarea i2c_piix4 mousedev sysfillrect realtek sysimgblt mac_hid snd_hda_core backlight libphy i2c_core parport_pc snd_hwdep parport wmi button pcc_cpufreq acpi_cpufreq iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables iptable_filter snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore atkbd libps2 serio loop cpufreq_ondemand edac_mce_amd kvm irqbypass ip_tables x_tables ipv6 crc_ccitt autofs4 crypto_simd cryptd glue_helper input_leds led_class
Sep 29 08:26:59 phlegethon kernel: usb_storage sd_mod xhci_pci pata_atiixp ahci xhci_hcd ohci_pci libahci libata ehci_pci ohci_hcd scsi_mod ehci_hcd rtc_cmos aes_x86_64 serpent_generic btrfs zstd_decompress zstd_compress libcrc32c crc32c_generic xor raid6_pq dm_crypt dm_mod usbhid usbcore usb_common hid_generic hid_microsoft ff_memless hid
Sep 29 08:26:59 phlegethon kernel: CR2: 0000000000000008
Sep 29 08:26:59 phlegethon kernel: ---[ end trace b0432b776c251e2d ]---
Sep 29 08:26:59 phlegethon kernel: RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x140 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: Code: 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 47 08 48 8b aa a8 01 00 00 4c 8b a8 80 00 00 00 48 8b 80 c8 00 00 00 <4c> 8b 70 08 8b 45 08 85 c0 4d 8d 7e 88 0f 84 c2 00 00 00 49 8b 46
Sep 29 08:26:59 phlegethon kernel: RSP: 0018:ffff99bb838ffad8 EFLAGS: 00010246
Sep 29 08:26:59 phlegethon kernel: RAX: 0000000000000000 RBX: ffff99bb838ffb20 RCX: 0000000000103c00
Sep 29 08:26:59 phlegethon kernel: RDX: ffff89abd64a4000 RSI: ffff99bb838ffba8 RDI: ffff99bb838ffb20
Sep 29 08:26:59 phlegethon kernel: RBP: ffff89abd64a4210 R08: ffff99bb838ffa6c R09: ffff99bb838ffa70
Sep 29 08:26:59 phlegethon kernel: R10: 0000000000103804 R11: 0000000000000021 R12: ffff99bb838ffba8
Sep 29 08:26:59 phlegethon kernel: R13: ffff89ac48fc6000 R14: 0000000000103803 R15: 0000000000000000
Sep 29 08:26:59 phlegethon kernel: FS: 00007fd089d18e40(0000) GS:ffff89ac67a00000(0000) knlGS:0000000000000000
Sep 29 08:26:59 phlegethon kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 29 08:26:59 phlegethon kernel: CR2: 0000000000000008 CR3: 0000000603afa000 CR4: 00000000000006f0
Sep 29 08:30:10 phlegethon kernel: sysrq: Keyboard mode set to system default
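For anyone who wants to check whether their own logs show the same crash, here is a minimal sketch that looks for the signature from the trace above: a "BUG: kernel NULL pointer dereference" message followed by a kernel-mode RIP line. The function name `find_oops` and the regular expression are illustrative assumptions, not part of any existing tool, and the pattern only covers the RIP format seen in this trace.

```python
import re

# Signature strings taken from the trace above. The helper name and the
# regular expression are hypothetical, written only for this illustration.
NULL_DEREF = "BUG: kernel NULL pointer dereference"
RIP_RE = re.compile(r"RIP: 0010:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def find_oops(log_text):
    """Return (function, module) of the first kernel-mode RIP that follows
    a NULL pointer dereference message, or None if there is no match."""
    start = log_text.find(NULL_DEREF)
    if start < 0:
        return None
    # Search only the text after the BUG line; user-space RIP lines
    # (segment 0033) do not match the pattern.
    match = RIP_RE.search(log_text, start)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Feeding it the saved text of a kernel log (for example a file exported from the journal) should report `amdgpu_vm_sdma_commit` in module `amdgpu` if the crash matches this one.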
I can also confirm this bug. It doesn't happen on 4.20.16-200.fc29.x86_64, which is the last fc29 kernel. All 5.x series fc30 kernels seem to be affected.
Can you bisect?
(In reply to Alex Deucher from comment #4)
> Can you bisect?

Bisecting would be of great help, but I doubt it's feasible in practice. The bug happens after an undetermined number of suspend/resume cycles. It would take weeks or months to isolate the offending patch. By that time, the result may well have become invalid due to other changes.
I agree with SET that bisecting would not be feasible due to the rarity of the bug. For reference, the affected box just now crashed after a week of no issue with suspending and resuming approx. twice a day.
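To put rough numbers on why bisecting looks impractical, here is a back-of-the-envelope sketch. It assumes that git bisect needs about log2(N) builds for N candidate commits, and that each build has to run for some number of days before it can be called good. The commit count and the days per verdict used in the example are hypothetical placeholders, not measured values.

```python
def bisect_steps(n_commits):
    """Approximate number of builds a bisection needs: ceil(log2(n_commits))."""
    if n_commits < 1:
        raise ValueError("n_commits must be at least 1")
    return (n_commits - 1).bit_length()


def bisect_days(n_commits, days_per_verdict):
    """Total days if every build must run days_per_verdict days before a verdict."""
    return bisect_steps(n_commits) * days_per_verdict


# Hypothetical inputs: 10,000 candidate commits, one week per build,
# since the crash can take about a week to show up.
print(bisect_steps(10_000), "builds,", bisect_days(10_000, 7), "days")
```

With those placeholder inputs the estimate is 14 builds and 98 days, and a single wrong "good" verdict on a build that simply had not crashed yet would send the bisection down the wrong path.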
I began bisecting yesterday and discovered another bug that happens on suspend, which makes it hard to determine the good/bad status of a build with regard to _this_ bug in a timely manner. I am therefore aborting any bisection attempts. Maybe a new crowd of people will run into this when upgrading their Ubuntu LTS systems :(
There is a corresponding bug report on the Gentoo users forum: https://forums.gentoo.org/viewtopic-p-8375988.html#8375988
Potential fix (and kernel Bugzilla bug): https://bugzilla.kernel.org/show_bug.cgi?id=204241
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/909.