Bug 111881

Summary: [kernel 5.4-rc4][amdgpu][CIK]: FW bug: No PASID in KFD interrupt
Product: DRI
Reporter: erhard_f
Component: DRM/amdkfd
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=111021
Whiteboard:
i915 platform: i915 features:
Attachments (description: flags):
dmesg (kernel 5.4-rc1): none
kernel.config (5.4-rc1): none
dmesg (kernel 5.4-rc4): none
kernel.config (5.4-rc4): none
rocminfo output (ROC 2.9): none

Description erhard_f 2019-10-02 11:03:26 UTC
Created attachment 145612 [details]
dmesg (kernel 5.4-rc1)

Card is a Sapphire Radeon R9 290 Tri-X running on a Supermicro H8SGL (Opteron 6380) with Gentoo Linux. OpenCL driver is ROCm 2.8.0.

clinfo segfaults, and the kernel logs a warning:

[...]
Okt 02 12:47:51 yea kernel: clinfo[1138]: segfault at 1000 ip 00007f78d4f52971 sp 00007ffd81ab7170 error 6 in libhsa-runtime64.so.1.1.9[7f78d4f34000+c7000]
Okt 02 12:47:51 yea kernel: Code: ff ff ff 48 8b 85 58 ff ff ff 48 8b 80 b8 03 00 00 48 8b 95 78 ff ff ff 48 c1 e2 03 48 01 c2 48 8b 85 68 ff ff ff 48 8b 40 18 <48> 89 02 c6 45 b0 01 bb 00 00 00 00 0f b6 45 b0 83 f0 01 84 c0 74
Okt 02 12:47:59 yea kernel: Evicting PASID 32770 queues
Okt 02 12:47:59 yea kernel: ------------[ cut here ]------------
Okt 02 12:47:59 yea kernel: FW bug: No PASID in KFD interrupt
Okt 02 12:47:59 yea kernel: WARNING: CPU: 5 PID: 0 at drivers/gpu/drm/amd/amdgpu/../amdkfd/cik_event_interrupt.c:70 cik_event_interrupt_isr+0x223/0x230 [amdgpu]
Okt 02 12:47:59 yea kernel: Modules linked in: fuse dm_crypt nhpoly1305_sse2 nhpoly1305 chacha_x86_64 chacha_generic adiantum poly1305_generic algif_skcipher amd64_edac_mod crct10dif_pclmul crc32_generic crc32_pclmul dm_mod joydev input_leds ghash_generic gf128mul gcm hid_generic usbhid hid xts ext4 crc16 mbcache ctr jbd2 ath5k led_class amdgpu cbc mac80211 ath ohci_pci ecb evdev cfg80211 gpu_sched ehci_pci ohci_hcd snd_oxygen i2c_algo_bit ehci_hcd fam15h_power snd_oxygen_lib aesni_intel ttm snd_mpu401_uart sr_mod glue_helper rfkill snd_rawmidi usbcore crypto_simd k10temp libarc4 cdrom cryptd drm_kms_helper snd_hda_codec_hdmi hwmon snd_seq_device i2c_piix4 usb_common cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt snd_hda_intel fb_sys_fops cfbcopyarea snd_intel_nhlt fb snd_hda_codec font snd_hwdep fbdev snd_hda_core drm e1000e snd_pcm snd_timer snd drm_panel_orientation_quirks backlight soundcore button acpi_cpufreq processor lzo zstd sg zram zsmalloc
Okt 02 12:47:59 yea kernel: CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.4.0-rc1 #1
Okt 02 12:47:59 yea kernel: Hardware name: Supermicro H8SGL/H8SGL, BIOS 3.5b       03/18/2016
Okt 02 12:47:59 yea kernel: RIP: 0010:cik_event_interrupt_isr+0x223/0x230 [amdgpu]
Okt 02 12:47:59 yea kernel: Code: ff 0f b6 05 53 15 49 00 84 c0 74 07 31 c0 e9 b0 fe ff ff 48 c7 c7 c0 b2 88 c1 88 44 24 08 c6 05 36 15 49 00 01 e8 81 0f a5 f8 <0f> 0b 0f b6 44 24 08 e9 8d fe ff ff 90 48 b8 00 00 00 00 00 fc ff
Okt 02 12:47:59 yea kernel: RSP: 0018:ffff8883e7888c08 EFLAGS: 00010086
Okt 02 12:47:59 yea kernel: RAX: 0000000000000000 RBX: ffff8883cc044b48 RCX: ffffffffba10693f
Okt 02 12:47:59 yea kernel: RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8883e5704f80
Okt 02 12:47:59 yea kernel: RBP: ffff8883e7888c40 R08: fffffbfff76d3d31 R09: fffffbfff76d3d31
Okt 02 12:47:59 yea kernel: R10: fffffbfff76d3d30 R11: ffffffffbb69e983 R12: 0000000000000008
Okt 02 12:47:59 yea kernel: R13: 00000000000000b5 R14: 0000000000000023 R15: 0000000000000000
Okt 02 12:47:59 yea kernel: FS:  0000000000000000(0000) GS:ffff8883e7880000(0000) knlGS:0000000000000000
Okt 02 12:47:59 yea kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Okt 02 12:47:59 yea kernel: CR2: 00007fea9066f000 CR3: 00000007f52c2000 CR4: 00000000000406e0
Okt 02 12:47:59 yea kernel: Call Trace:
Okt 02 12:47:59 yea kernel:  <IRQ>
Okt 02 12:47:59 yea kernel:  kgd2kfd_interrupt+0x151/0x1a0 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? kgd2kfd_resume+0xa0/0xa0 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? check_flags.part.41+0x82/0x210
Okt 02 12:47:59 yea kernel:  ? amdgpu_fence_process+0x95/0x1b0 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? amdgpu_irq_dispatch+0x184/0x390 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? gfx_v7_0_eop_irq+0xba/0x100 [amdgpu]
Okt 02 12:47:59 yea kernel:  amdgpu_irq_dispatch+0x1c6/0x390 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? amdgpu_irq_add_id+0x160/0x160 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? lock_downgrade+0x390/0x390
Okt 02 12:47:59 yea kernel:  amdgpu_ih_process+0xf4/0x1d0 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? amdgpu_irq_disable_all+0x1b0/0x1b0 [amdgpu]
Okt 02 12:47:59 yea kernel:  amdgpu_irq_handler+0x20/0x60 [amdgpu]
Okt 02 12:47:59 yea kernel:  ? amdgpu_irq_disable_all+0x1b0/0x1b0 [amdgpu]
Okt 02 12:47:59 yea kernel:  __handle_irq_event_percpu+0x72/0x390
Okt 02 12:47:59 yea kernel:  handle_irq_event_percpu+0x6a/0xe0
Okt 02 12:47:59 yea kernel:  ? __handle_irq_event_percpu+0x390/0x390
Okt 02 12:47:59 yea kernel:  ? rwlock_bug.part.2+0x50/0x50
Okt 02 12:47:59 yea kernel:  ? do_raw_spin_unlock+0x9d/0x130
Okt 02 12:47:59 yea kernel:  handle_irq_event+0x4f/0x7e
Okt 02 12:47:59 yea kernel:  handle_edge_irq+0x100/0x2d0
Okt 02 12:47:59 yea kernel:  do_IRQ+0x72/0x160
Okt 02 12:47:59 yea kernel:  common_interrupt+0xf/0xf
Okt 02 12:47:59 yea kernel:  </IRQ>
Okt 02 12:47:59 yea kernel: RIP: 0010:cpuidle_enter_state+0xcd/0x640
Okt 02 12:47:59 yea kernel: Code: 00 31 ff e8 a5 86 80 ff 80 7c 24 10 00 74 12 9c 58 f6 c4 02 0f 85 42 05 00 00 31 ff e8 cc 5e 89 ff e8 f7 be 8f ff fb 45 85 e4 <0f> 88 fb 03 00 00 4d 63 ec 4f 8d 74 6d 00 49 c1 e6 05 4a 8d 7c 33
Okt 02 12:47:59 yea kernel: RSP: 0018:ffff8883e571fd98 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdd
Okt 02 12:47:59 yea kernel: RAX: 0000000000000000 RBX: ffffffffc0316680 RCX: ffffffffba1067e0
Okt 02 12:47:59 yea kernel: RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8883e5704fb4
Okt 02 12:47:59 yea kernel: RBP: ffff888812779028 R08: 0000000000000002 R09: 0000000000000000
Okt 02 12:47:59 yea kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
Okt 02 12:47:59 yea kernel: R13: 0000000000000002 R14: ffffffffc0316740 R15: ffffffffc0316780
Okt 02 12:47:59 yea kernel:  ? lockdep_hardirqs_on+0x190/0x280
Okt 02 12:47:59 yea kernel:  ? cpuidle_enter_state+0xc9/0x640
Okt 02 12:47:59 yea kernel:  cpuidle_enter+0x37/0x60
Okt 02 12:47:59 yea kernel:  do_idle+0x2e7/0x380
Okt 02 12:47:59 yea kernel:  ? arch_cpu_idle_exit+0x40/0x40
Okt 02 12:47:59 yea kernel:  ? schedule_idle+0x41/0x50
Okt 02 12:47:59 yea kernel:  cpu_startup_entry+0x14/0x20
Okt 02 12:47:59 yea kernel:  start_secondary+0x1fd/0x240
Okt 02 12:47:59 yea kernel:  ? set_cpu_sibling_map+0xbc0/0xbc0
Okt 02 12:47:59 yea kernel:  secondary_startup_64+0xa4/0xb0
Okt 02 12:47:59 yea kernel: irq event stamp: 450550
Okt 02 12:47:59 yea kernel: hardirqs last  enabled at (450547): [<ffffffffba8c30b9>] cpuidle_enter_state+0xc9/0x640
Okt 02 12:47:59 yea kernel: hardirqs last disabled at (450548): [<ffffffffba00276a>] trace_hardirqs_off_thunk+0x1a/0x20
Okt 02 12:47:59 yea kernel: softirqs last  enabled at (450550): [<ffffffffba07b210>] irq_enter+0x70/0x80
Okt 02 12:47:59 yea kernel: softirqs last disabled at (450549): [<ffffffffba07b1f5>] irq_enter+0x55/0x80
Okt 02 12:47:59 yea kernel: ---[ end trace 5951fa91933dcafd ]---
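
The segfault line at the top of the log can be decoded by hand. The following is a minimal sketch, assuming the usual x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode) and that the bracketed pair is the mapping's start address and size. The helper name `decode_segfault` is made up for illustration; the input line is copied from the log above with the journal prefix dropped.

```python
import re

# Segfault line copied from the dmesg excerpt above (journal prefix dropped).
LINE = ("clinfo[1138]: segfault at 1000 ip 00007f78d4f52971 "
        "sp 00007ffd81ab7170 error 6 in "
        "libhsa-runtime64.so.1.1.9[7f78d4f34000+c7000]")

# All numeric fields in this message are hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) in "
    r"(?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into its fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_mapping": ip - base,
        # Assumed x86 page-fault error-code bits.
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = decode_segfault(LINE)
print(info["object"], hex(info["offset_in_mapping"]))
```

Under these assumptions, error 6 reads as a user-mode write to a page that is not mapped (address 0x1000), and the faulting instruction sits 0x1e971 bytes into the libhsa-runtime64.so.1.1.9 mapping. Turning that offset into a symbol would need the library's debug information, which is not part of this report.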
Comment 1 erhard_f 2019-10-02 11:05:01 UTC
Created attachment 145613 [details]
kernel.config (5.4-rc1)
Comment 2 erhard_f 2019-10-03 22:02:22 UTC
Forgot for a moment about the GitLab Tracker...

Moved over there: https://gitlab.freedesktop.org/mesa/mesa/issues/1881
Comment 3 erhard_f 2019-10-26 10:58:36 UTC
After re-reading https://bugs.freedesktop.org/enter_bug.cgi, I think the appropriate place for the bug is here after all, as 'amdkfd' is explicitly mentioned in the stack trace.

Re-opening with current data.
Comment 4 erhard_f 2019-10-26 10:59:09 UTC
Created attachment 145818 [details]
dmesg (kernel 5.4-rc4)
Comment 5 erhard_f 2019-10-26 11:01:13 UTC
Created attachment 145819 [details]
kernel.config (5.4-rc4)
Comment 6 erhard_f 2019-10-26 11:04:14 UTC
Created attachment 145820 [details]
rocminfo output (ROC 2.9)

Interestingly, clinfo (2.2.18.04.06) triggers this warning, whereas rocminfo (ROC 2.9) gives proper output.
Comment 7 Martin Peres 2019-11-19 07:54:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/7.