Bug 111729

Summary: RX480 : random NULL pointer dereference on resume from suspend
Product: DRI
Reporter: SET <nmset>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: NEW
QA Contact:
Severity: major
Priority: not set
CC: me, nmset, phg, transmit
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Syslog data (flags: none)
Xorg log (flags: none)

Description SET 2019-09-18 07:41:46 UTC
Created attachment 145421 [details]
Syslog data

Since kernel 5.2.x, resuming from suspend randomly crashes amdgpu/drm and leaves a black screen. I can still log in via SSH, but rebooting or halting the host does not complete, so a hard reset becomes unavoidable. I am currently using 5.3.0-rc8 and an RX 480 with a single display. Stable 5.2.x kernels don't resolve the issue. The host does not last a week without a forced reboot.

Please see the attached syslog trace. Thanks.
Comment 1 SET 2019-09-18 07:42:38 UTC
Created attachment 145422 [details]
Xorg log
Comment 2 phg 2019-09-29 06:54:50 UTC
I can confirm this bug.

Same symptoms here with 5.2.11, two screens, and a
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)


Sep 29 08:26:39 phlegethon kernel: PM: suspend exit
Sep 29 08:26:40 phlegethon kernel: [drm] schedsdma0 is not ready, skipping
Sep 29 08:26:58 phlegethon kernel: [drm] schedsdma1 is not ready, skipping
Sep 29 08:26:58 phlegethon kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Sep 29 08:26:58 phlegethon kernel: #PF: supervisor read access in kernel mode
Sep 29 08:26:58 phlegethon kernel: #PF: error_code(0x0000) - not-present page
Sep 29 08:26:59 phlegethon kernel: PGD 6032ef067 P4D 6032ef067 PUD 603c09067 PMD 0 
Sep 29 08:26:59 phlegethon kernel: Oops: 0000 [#1] SMP NOPTI
Sep 29 08:26:59 phlegethon kernel: CPU: 0 PID: 1429 Comm: X Tainted: G        W         5.2.11 #1-NixOS
Sep 29 08:26:59 phlegethon kernel: Hardware name: Gigabyte Technology Co., Ltd. GA-78LMT-USB3 6.0/GA-78LMT-USB3 6.0, BIOS F2 11/25/2014
Sep 29 08:26:59 phlegethon kernel: RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x140 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: Code: 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 47 08 48 8b aa a8 01 00 00 4c 8b a8 80 00 00 00 48 8b 80 c8 00 00 00 <4c> 8b 70 08 8b 45 08 85 c0 4d 8d 7e 88 0f 84 c2 00 00 00 49 8b 46
Sep 29 08:26:59 phlegethon kernel: RSP: 0018:ffff99bb838ffad8 EFLAGS: 00010246
Sep 29 08:26:59 phlegethon kernel: RAX: 0000000000000000 RBX: ffff99bb838ffb20 RCX: 0000000000103c00
Sep 29 08:26:59 phlegethon kernel: RDX: ffff89abd64a4000 RSI: ffff99bb838ffba8 RDI: ffff99bb838ffb20
Sep 29 08:26:59 phlegethon kernel: RBP: ffff89abd64a4210 R08: ffff99bb838ffa6c R09: ffff99bb838ffa70
Sep 29 08:26:59 phlegethon kernel: R10: 0000000000103804 R11: 0000000000000021 R12: ffff99bb838ffba8
Sep 29 08:26:59 phlegethon kernel: R13: ffff89ac48fc6000 R14: 0000000000103803 R15: 0000000000000000
Sep 29 08:26:59 phlegethon kernel: FS:  00007fd089d18e40(0000) GS:ffff89ac67a00000(0000) knlGS:0000000000000000
Sep 29 08:26:59 phlegethon kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 29 08:26:59 phlegethon kernel: CR2: 0000000000000008 CR3: 0000000603afa000 CR4: 00000000000006f0
Sep 29 08:26:59 phlegethon kernel: Call Trace:
Sep 29 08:26:59 phlegethon kernel:  amdgpu_vm_bo_update_mapping+0xcd/0xe0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel:  amdgpu_vm_clear_freed+0xbe/0x190 [amdgpu]
Sep 29 08:26:59 phlegethon kernel:  amdgpu_gem_va_ioctl+0x488/0x4f0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel:  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel:  ? drm_ioctl_kernel+0xac/0xf0 [drm]
Sep 29 08:26:59 phlegethon kernel:  drm_ioctl_kernel+0xac/0xf0 [drm]
Sep 29 08:26:59 phlegethon kernel:  ? sock_write_iter+0x8f/0xf0
Sep 29 08:26:59 phlegethon kernel:  drm_ioctl+0x2e6/0x3a0 [drm]
Sep 29 08:26:59 phlegethon kernel:  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu]
Sep 29 08:26:59 phlegethon kernel:  ? do_iter_write+0xe2/0x190
Sep 29 08:26:59 phlegethon kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Sep 29 08:26:59 phlegethon kernel:  do_vfs_ioctl+0xa4/0x630
Sep 29 08:26:59 phlegethon kernel:  ksys_ioctl+0x70/0x80
Sep 29 08:26:59 phlegethon kernel:  __x64_sys_ioctl+0x16/0x20
Sep 29 08:26:59 phlegethon kernel:  do_syscall_64+0x4e/0x130
Sep 29 08:26:59 phlegethon kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 29 08:26:59 phlegethon kernel: RIP: 0033:0x7fd08a3c4b57
Sep 29 08:26:59 phlegethon kernel: Code: 00 00 00 48 8b 05 29 53 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f9 52 0c 00 f7 d8 64 89 01 48
Sep 29 08:26:59 phlegethon kernel: RSP: 002b:00007fff6a6738b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 29 08:26:59 phlegethon kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd08a3c4b57
Sep 29 08:26:59 phlegethon kernel: RDX: 00007fff6a673900 RSI: 00000000c0286448 RDI: 0000000000000012
Sep 29 08:26:59 phlegethon kernel: RBP: 00007fff6a673900 R08: 0000000103400000 R09: 000000000000000e
Sep 29 08:26:59 phlegethon kernel: R10: 0000000000000026 R11: 0000000000000246 R12: 00000000c0286448
Sep 29 08:26:59 phlegethon kernel: R13: 0000000000000012 R14: 0000000000000002 R15: 00000000043d74a0
Sep 29 08:26:59 phlegethon kernel: Modules linked in: fuse bridge stp llc cfg80211 msr rfkill 8021q ext4 crc16 mbcache jbd2 amdgpu wmi_bmof ppdev gpu_sched edac_core ttm drm_kms_helper snd_hda_codec_realtek k10temp snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi drm snd_hda_intel joydev r8169 ata_generic snd_hda_codec sp5100_tco uas agpgart watchdog i2c_algo_bit pata_acpi evdev fb_sys_fops syscopyarea i2c_piix4 mousedev sysfillrect realtek sysimgblt mac_hid snd_hda_core backlight libphy i2c_core parport_pc snd_hwdep parport wmi button pcc_cpufreq acpi_cpufreq iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip6table_filter ip6_tables iptable_filter snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore atkbd libps2 serio loop cpufreq_ondemand edac_mce_amd kvm irqbypass ip_tables x_tables ipv6 crc_ccitt autofs4 crypto_simd cryptd glue_helper input_leds led_class
Sep 29 08:26:59 phlegethon kernel:  usb_storage sd_mod xhci_pci pata_atiixp ahci xhci_hcd ohci_pci libahci libata ehci_pci ohci_hcd scsi_mod ehci_hcd rtc_cmos aes_x86_64 serpent_generic btrfs zstd_decompress zstd_compress libcrc32c crc32c_generic xor raid6_pq dm_crypt dm_mod usbhid usbcore usb_common hid_generic hid_microsoft ff_memless hid
Sep 29 08:26:59 phlegethon kernel: CR2: 0000000000000008
Sep 29 08:26:59 phlegethon kernel: ---[ end trace b0432b776c251e2d ]---
Sep 29 08:26:59 phlegethon kernel: RIP: 0010:amdgpu_vm_sdma_commit+0x46/0x140 [amdgpu]
Sep 29 08:26:59 phlegethon kernel: Code: 18 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 47 08 48 8b aa a8 01 00 00 4c 8b a8 80 00 00 00 48 8b 80 c8 00 00 00 <4c> 8b 70 08 8b 45 08 85 c0 4d 8d 7e 88 0f 84 c2 00 00 00 49 8b 46
Sep 29 08:26:59 phlegethon kernel: RSP: 0018:ffff99bb838ffad8 EFLAGS: 00010246
Sep 29 08:26:59 phlegethon kernel: RAX: 0000000000000000 RBX: ffff99bb838ffb20 RCX: 0000000000103c00
Sep 29 08:26:59 phlegethon kernel: RDX: ffff89abd64a4000 RSI: ffff99bb838ffba8 RDI: ffff99bb838ffb20
Sep 29 08:26:59 phlegethon kernel: RBP: ffff89abd64a4210 R08: ffff99bb838ffa6c R09: ffff99bb838ffa70
Sep 29 08:26:59 phlegethon kernel: R10: 0000000000103804 R11: 0000000000000021 R12: ffff99bb838ffba8
Sep 29 08:26:59 phlegethon kernel: R13: ffff89ac48fc6000 R14: 0000000000103803 R15: 0000000000000000
Sep 29 08:26:59 phlegethon kernel: FS:  00007fd089d18e40(0000) GS:ffff89ac67a00000(0000) knlGS:0000000000000000
Sep 29 08:26:59 phlegethon kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 29 08:26:59 phlegethon kernel: CR2: 0000000000000008 CR3: 0000000603afa000 CR4: 00000000000006f0
Sep 29 08:30:10 phlegethon kernel: sysrq: Keyboard mode set to system default
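
For anyone comparing their own logs against the trace above, here is a minimal sketch of a scanner that picks out the fault address and the faulting kernel symbol. The function name `find_oops` and the regular expressions are illustrative assumptions, written only against the two line formats visible in this log; other kernel versions may print these lines differently. In the trace above, RAX is 0 and CR2 is 0x8, which is consistent with a read at offset 8 from a NULL pointer.

```python
import re

# Patterns written against the log lines quoted in this report (assumption:
# other kernels may format these lines differently).
NULL_DEREF = re.compile(
    r"BUG: kernel NULL pointer dereference, address: ([0-9a-f]+)")
# Kernel-mode RIP lines use code segment 0010; user-mode ones (0033) are skipped.
KERNEL_RIP = re.compile(
    r"RIP: 0010:(\w+)\+0x([0-9a-f]+)/0x[0-9a-f]+(?: \[(\w+)\])?")


def find_oops(lines):
    """Return (fault_address, symbol, offset, module) for the first
    NULL pointer dereference found in `lines`, or None if there is none."""
    address = None
    for line in lines:
        if address is None:
            match = NULL_DEREF.search(line)
            if match:
                address = int(match.group(1), 16)
            continue
        match = KERNEL_RIP.search(line)
        if match:
            symbol, offset, module = match.group(1), match.group(2), match.group(3)
            return address, symbol, int(offset, 16), module
    return None
```

Feeding it the lines of a saved syslog (for example `find_oops(open(path))`, where `path` is wherever the log was saved) should return the same symbol for reports that hit the same crash site.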
Comment 3 me 2019-10-03 14:14:57 UTC
I can also confirm this bug. It doesn't happen on 4.20.16-200.fc29.x86_64, which is the last fc29 kernel. All 5.x-series fc30 kernels seem to be affected.
Comment 4 Alex Deucher 2019-10-03 14:33:21 UTC
Can you bisect?
Comment 5 SET 2019-10-03 18:19:01 UTC
(In reply to Alex Deucher from comment #4)
> Can you bisect?

Bisecting would be of great help, but I doubt it's feasible in practice. The bug happens after an undetermined number of suspend/resume cycles, so it would take weeks or months to isolate the offending commit. By that time, the result may well be invalidated by other changes.
Comment 6 phg 2019-10-04 11:23:43 UTC
I agree with SET that bisecting would not be feasible due to
the rarity of the bug. For reference, the affected box just
now crashed after a week without issues, suspending and
resuming approx. twice a day.
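
To put a rough number on that, here is a back-of-the-envelope sketch. `git bisect` needs roughly log2(N) builds to narrow N candidate commits down to one. The helper names and the commit count of 40,000 in the usage note are hypothetical placeholders, not the real number of commits between the good and bad kernels; the seven days per verdict is taken from the week-long crash-free run described above.

```python
import math


def bisect_steps(commit_count):
    """Approximate number of kernels to build and test when bisecting
    `commit_count` candidate commits (roughly log2 of the count)."""
    if commit_count <= 1:
        return 0
    return math.ceil(math.log2(commit_count))


def bisect_days(commit_count, days_per_verdict):
    """Total wall-clock days if every step needs `days_per_verdict` days
    of normal use before a kernel can be called good."""
    return bisect_steps(commit_count) * days_per_verdict
```

With the placeholder count, `bisect_days(40000, 7)` gives 112 days, and that assumes no step is ever misjudged as good. Limiting the bisection to the driver directory (`git bisect start -- drivers/gpu/drm/amd`) would reduce the count, but only helps if the regression actually lives under that path, which is not known here.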
Comment 7 me 2019-10-04 13:34:45 UTC
I began bisecting yesterday and discovered another bug that happens on suspend, which makes it hard to determine the good/bad status of a build with regard to _this_ bug in a timely manner.
Hence I am aborting any bisection attempts.

Maybe a new crowd of people runs into this when upgrading their Ubuntu LTS systems :(
Comment 8 me 2019-10-04 18:29:05 UTC
There is a corresponding bug report on the Gentoo users forum:

https://forums.gentoo.org/viewtopic-p-8375988.html#8375988
Comment 9 me 2019-10-07 10:14:30 UTC
Potential fix (and kernel Bugzilla bug): https://bugzilla.kernel.org/show_bug.cgi?id=204241
