Bug 111633 - amdgpu driver crash with kernel NULL pointer dereference
Summary: amdgpu driver crash with kernel NULL pointer dereference
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set not set
Assignee: Default DRI bug account
Reported: 2019-09-10 19:03 UTC by vakevk+freedesktopbugzilla
Modified: 2019-11-19 09:51 UTC (History)
Description vakevk+freedesktopbugzilla 2019-09-10 19:03:17 UTC
I am running Arch Linux: Linux arch 5.2.13-arch1-1-ARCH #1 SMP PREEMPT Fri Sep 6 17:52:33 UTC 2019 x86_64 GNU/Linux

I am running Wayland via sway.

My GPU is a Radeon RX Vega 64.

While in my sway session, the image on my screen froze, but audio from a video continued to play. I was able to ssh in from a different machine and found this message with dmesg:

BUG: kernel NULL pointer dereference, address: 0000000000000360
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 1 PID: 12766 Comm: kworker/u16:0 Not tainted 5.2.11-arch1-1-ARCH #1
Hardware name: ASUS All Series/Z87-PLUS, BIOS 2103 08/15/2014
Workqueue: events_unbound commit_work [drm_kms_helper]
RIP: 0010:dc_stream_retain+0x5/0x20 [amdgpu]
<Code and registers omitted. I can post them if they are important and someone reassures me that they don't contain sensitive information, since they look like a memory dump.>
Call Trace:
 dc_resource_state_copy_construct+0xa0/0xf0 [amdgpu]
 dc_commit_updates_for_stream+0xa63/0xc20 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0xabe/0x19a0 [amdgpu]
 ? commit_tail+0x3c/0x70 [drm_kms_helper]
 commit_tail+0x3c/0x70 [drm_kms_helper]
 process_one_work+0x1d1/0x3e0
 worker_thread+0x4a/0x3d0
 kthread+0xfb/0x130
 ? process_one_work+0x3e0/0x3e0
 ? kthread_park+0x80/0x80
 ret_from_fork+0x35/0x40
Modules linked in: snd_seq_dummy snd_seq tun nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_tables_set cfg80211 nf_tables nfnetlink 8021q garp mrp stp llc intel_rapl nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic fuse ledtrig_audio ofpart snd_hda_codec_hdmi cmdlinepart btusb intel_spi_platform snd_hda_intel btrtl x86_pkg_temp_thermal intel_spi btbcm intel_powerclamp spi_nor btintel eeepc_wmi snd_usb_audio coretemp snd_hda_codec uvcvideo asus_wmi bluetooth snd_usbmidi_lib iTCO_wdt kvm_intel snd_hda_core videobuf2_vmalloc mei_hdcp mtd iTCO_vendor_support mxm_wmi wmi_bmof sparse_keymap snd_hwdep snd_rawmidi snd_seq_device videobuf2_memops snd_pcm videobuf2_v4l2 snd_timer videobuf2_common snd videodev kvm irqbypass input_leds ecdh_generic intel_cstate mousedev rfkill intel_uncore mei_me joydev cdc_acm media ecc e1000e intel_rapl_perf mei soundcore pcc_cpufreq i2c_i801 lpc_ich pcspkr wmi evdev mac_hid ip_tables x_tables ext4
 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid hid dm_crypt dm_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ahci libahci aesni_intel libata aes_x86_64 xhci_pci crypto_simd cryptd glue_helper xhci_hcd scsi_mod ehci_pci ehci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart
CR2: 0000000000000360
---[ end trace 08eaa2e1d713ba4d ]---
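
For reference, the `#PF: error_code(0x0002)` line can be decoded by hand. The following is a minimal sketch, assuming the standard x86 page-fault error code layout for the low five bits; the function name `decode_pf_error_code` and the table `PF_BITS` are made up for illustration and are not part of any kernel tool.

```python
# Low bits of the x86 page-fault error code: (mask, meaning if set, meaning if clear).
# A meaning of None is simply not reported.
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel (supervisor) mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code):
    """Return a human-readable description of each decoded bit."""
    described = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            described.append(text)
    return described


print(decode_pf_error_code(0x0002))
# ['not-present page', 'write access', 'kernel (supervisor) mode']
```

This matches the two `#PF:` lines in the log above: a write from kernel mode to a page that is not mapped, which is what a write through a NULL-based pointer looks like.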

At this point I tried killing the sway process but did not succeed, even with `kill -9`. Not even `sudo reboot` completed, although it did kill the ssh session. I had to hard reset the machine.

Potentially related: for roughly a week I have been experiencing intermittent screen freezes that resolve themselves after about 10 seconds, with messages like these in dmesg:

[drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
[drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
Comment 1 vakevk+freedesktopbugzilla 2019-09-19 20:31:22 UTC
Another one, different stacktrace.

BUG: kernel NULL pointer dereference, address: 0000000000000360
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 4 PID: 28005 Comm: kworker/u16:1 Not tainted 5.2.14-arch1-1-ARCH #1
Hardware name: ASUS All Series/Z87-PLUS, BIOS 2103 08/15/2014
Workqueue: events_unbound commit_work [drm_kms_helper]
RIP: 0010:dc_stream_retain+0x5/0x20 [amdgpu]
Call Trace:
dc_resource_state_copy_construct+0xa0/0xf0 [amdgpu]
dc_commit_updates_for_stream+0xa63/0xc20 [amdgpu]
amdgpu_dm_atomic_commit_tail+0xabe/0x19a0 [amdgpu]
? commit_tail+0x3c/0x70 [drm_kms_helper]
commit_tail+0x3c/0x70 [drm_kms_helper]
process_one_work+0x1d1/0x3e0
worker_thread+0x4a/0x3d0
kthread+0xfb/0x130
? process_one_work+0x3e0/0x3e0
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
Modules linked in: tun nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_tables_set cfg80211 8021q garp mrp nf_tables stp llc nfnetlink intel_rapl ofpart nls_iso8859_1 nls_cp437 cmdlinepart vfat intel_spi_platform fat fuse intel_spi mei_hdcp spi_nor iTCO_wdt x86_pkg_temp_thermal mtd iTCO_vendor_support intel_powerclamp uvcvideo coretemp videobuf2_vmalloc kvm_intel btusb snd_hda_codec_realtek videobuf2_memops btrtl btbcm snd_hda_codec_generic videobuf2_v4l2 btintel ledtrig_audio snd_hda_codec_hdmi videobuf2_common bluetooth eeepc_wmi kvm snd_usb_audio snd_hda_intel videodev asus_wmi snd_hda_codec sparse_keymap wmi_bmof snd_usbmidi_lib mxm_wmi irqbypass snd_hda_core snd_rawmidi snd_hwdep snd_seq_device intel_cstate ecdh_generic snd_pcm intel_uncore mei_me i2c_i801 snd_timer intel_rapl_perf rfkill snd pcspkr media cdc_acm pcc_cpufreq mousedev ecc joydev e1000e input_leds soundcore mei lpc_ich evdev mac_hid wmi ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
hid_generic usbhid hid dm_crypt dm_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ahci libahci aesni_intel libata aes_x86_64 crypto_simd xhci_pci cryptd scsi_mod glue_helper xhci_hcd ehci_pci ehci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart
CR2: 0000000000000360
---[ end trace 3b3265e8a1ad7f82 ]---
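
To compare oopses like the two posted here frame by frame, the call trace can be pulled out with a small script. This is only a sketch: `parse_call_trace` and `FRAME_RE` are hypothetical names, and the pattern assumes frames appear as `func+0xOFF/0xSIZE [module]` with no dmesg timestamp prefix, as in the logs above.

```python
import re

# One frame, e.g. " ? commit_tail+0x3c/0x70 [drm_kms_helper]".
# A leading "?" marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<func>[\w.]+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_call_trace(oops_text):
    """Return (function, offset, module) for each reliable frame after 'Call Trace:'."""
    frames = []
    in_trace = False
    for line in oops_text.splitlines():
        if line.strip() == "Call Trace:":
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line)
        if match is None:
            break  # first non-frame line (e.g. "Modules linked in:") ends the trace
        if match.group("unreliable"):
            continue  # skip "?" frames
        frames.append(
            (match.group("func"), int(match.group("off"), 16), match.group("module"))
        )
    return frames


# Shortened sample built from lines of the log above.
sample = """\
RIP: 0010:dc_stream_retain+0x5/0x20 [amdgpu]
Call Trace:
 dc_resource_state_copy_construct+0xa0/0xf0 [amdgpu]
 ? commit_tail+0x3c/0x70 [drm_kms_helper]
 commit_tail+0x3c/0x70 [drm_kms_helper]
 process_one_work+0x1d1/0x3e0
Modules linked in: tun
"""

for func, offset, module in parse_call_trace(sample):
    print(f"{func}+{offset:#x} [{module or 'built-in'}]")
```

Running it on each oops and comparing the resulting lists shows which frames or offsets differ between the two crashes.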
Comment 2 Martin Peres 2019-11-19 09:51:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/904

