Bug 105244

Summary: NULL dereference during startup of Cape Verde with AMDGPU and GPU passthrough
Product: DRI
Reporter: Bas Nieuwenhuizen <bas>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: elproducto1
Version: unspecified
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:

Attachments:
Description     Flags
dmesg           none
possible fix    none
possible fix    none

Description Bas Nieuwenhuizen 2018-02-25 22:01:58 UTC
Created attachment 137596 [details]
dmesg

device:
00:02.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde XT [Radeon HD 7770/8760 / R7 250X] [1002:683d]
00:0a.0 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series] [1002:aab0]

uname -a:
Linux localhost.localdomain 4.15.4-300.fc27.x86_64 #1 SMP Mon Feb 19 23:31:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linu

kernel params include radeon.si_support=0 amdgpu.si_support=1

The host also contains a Tonga, passed through to another VM.

Trace (full dmesg also attached):

[    1.486955] BUG: unable to handle kernel NULL pointer dereference at 000000000000003c
[    1.486971] IP: drm_pcie_get_speed_cap_mask+0x35/0xe0 [drm]
[    1.486972] PGD 0 P4D 0 
[    1.486975] Oops: 0000 [#1] SMP PTI
[    1.486977] Modules linked in: amdkfd amd_iommu_v2 amdgpu(+) virtio_console virtio_net virtio_blk crc32c_intel chash i2c_algo_bit drm_kms_helper ttm serio_raw drm ata_generic qemu_fw_cfg virtio_pci pata_acpi virtio_rng virtio_ring virtio
[    1.486987] CPU: 0 PID: 324 Comm: systemd-udevd Not tainted 4.15.4-300.fc27.x86_64 #1
[    1.486989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[    1.486996] RIP: 0010:drm_pcie_get_speed_cap_mask+0x35/0xe0 [drm]
[    1.486998] RSP: 0018:ffffa663c089b908 EFLAGS: 00010286
[    1.486999] RAX: ffff8f38ba5a6800 RBX: ffff8f38b3890000 RCX: 0000000000000000
[    1.487003] RDX: 0000000000000000 RSI: ffffa663c089b998 RDI: ffff8f38b385c000
[    1.487004] RBP: 0000000000000000 R08: ffffc715c4ce2600 R09: 0000000000040000
[    1.487006] R10: 0000000000140000 R11: 0000000000000000 R12: 0000000000000003
[    1.487007] R13: ffff8f38b2e0a9c8 R14: ffff8f38b2e00000 R15: 0000000000000000
[    1.487009] FS:  00007f1bcd4a91c0(0000) GS:ffff8f38bfc00000(0000) knlGS:0000000000000000
[    1.487011] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.487012] CR2: 000000000000003c CR3: 0000000135546004 CR4: 00000000001606f0
[    1.487016] Call Trace:
[    1.487065]  si_dpm_sw_init+0x330/0x15d0 [amdgpu]
[    1.487070]  ? request_threaded_irq+0xad/0x160
[    1.487074]  ? printk+0x52/0x6e
[    1.487101]  amdgpu_device_init+0xcb4/0x15e0 [amdgpu]
[    1.487105]  ? kmalloc_order+0x14/0x40
[    1.487130]  amdgpu_driver_load_kms+0x86/0x2d0 [amdgpu]
[    1.487155]  drm_dev_register+0x132/0x1c0 [drm]
[    1.487180]  amdgpu_pci_probe+0x10a/0x140 [amdgpu]
[    1.487184]  local_pci_probe+0x42/0xa0
[    1.487190]  ? pci_assign_irq+0x27/0x130
[    1.487192]  pci_device_probe+0x141/0x1b0
[    1.487196]  driver_probe_device+0x315/0x480
[    1.487198]  __driver_attach+0xa0/0xe0
[    1.487201]  ? driver_probe_device+0x480/0x480
[    1.487203]  bus_for_each_dev+0x6b/0xb0
[    1.487205]  bus_add_driver+0x1c2/0x260
[    1.487207]  ? 0xffffffffc07b6000
[    1.487209]  driver_register+0x57/0xc0
[    1.487211]  ? 0xffffffffc07b6000
[    1.487214]  do_one_initcall+0x4e/0x190
[    1.487218]  ? _cond_resched+0x15/0x40
[    1.487220]  ? kmem_cache_alloc_trace+0xac/0x1b0
[    1.487223]  ? do_init_module+0x22/0x201
[    1.487226]  do_init_module+0x5b/0x201
[    1.487228]  load_module+0x26b1/0x2b60
[    1.487231]  ? SYSC_init_module+0x160/0x190
[    1.487233]  ? _cond_resched+0x15/0x40
[    1.487235]  SYSC_init_module+0x160/0x190
[    1.487238]  do_syscall_64+0x75/0x180
[    1.487240]  entry_SYSCALL_64_after_hwframe+0x21/0x86
[    1.487243] RIP: 0033:0x7f1bccda71da
[    1.487244] RSP: 002b:00007ffec1e4f598 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    1.487246] RAX: ffffffffffffffda RBX: 0000555c28533860 RCX: 00007f1bccda71da
[    1.487248] RDX: 0000555c28531780 RSI: 00000000005748f3 RDI: 0000555c28de0b10
[    1.487250] RBP: 0000555c28531780 R08: 0000000000000005 R09: 00007ffec1e4dd23
[    1.487251] R10: 0000000000000005 R11: 0000000000000246 R12: 0000555c28de0b10
[    1.487253] R13: 0000555c285317b0 R14: 0000000000020000 R15: 0000000000000000
[    1.487255] Code: 10 c7 06 00 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8b 87 c0 01 00 00 48 85 c0 74 18 48 8b 40 10 48 8b 68 38 <0f> b7 45 3c 66 3d 06 11 74 06 66 3d 66 11 75 1c b8 ea ff ff ff 
[    1.487281] RIP: drm_pcie_get_speed_cap_mask+0x35/0xe0 [drm] RSP: ffffa663c089b908
[    1.487283] CR2: 000000000000003c
[    1.487296] ---[ end trace 81fa2514df506ee9 ]---
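
The trace above can be cross-checked by hand. The sketch below is only an illustration, not a tool anyone in this report used: the helper names are made up, and the excerpt is three lines copied from the trace with the timestamps dropped and the Code: line shortened to its last bytes. It relies on the standard x86-64 encoding, where `0f b7 45 3c` is `movzwl 0x3c(%rbp),%eax`. With RBP equal to zero, the load address comes out as 0x3c, which matches CR2 and suggests that the pointer held in RBP was NULL.

```python
import re

# Excerpt of the oops above: timestamps removed, Code: line shortened.
OOPS_EXCERPT = """\
BUG: unable to handle kernel NULL pointer dereference at 000000000000003c
RBP: 0000000000000000 R08: ffffc715c4ce2600 R09: 0000000000040000
Code: 48 8b 40 10 48 8b 68 38 <0f> b7 45 3c 66 3d 06 11
"""


def fault_address(text):
    """Return the address reported on the 'dereference at' line."""
    match = re.search(r"dereference at ([0-9a-f]+)", text)
    return int(match.group(1), 16)


def register(text, name):
    """Return the value of a 64-bit register printed in the oops."""
    match = re.search(name + r": ([0-9a-f]{16})", text)
    return int(match.group(1), 16)


def faulting_bytes(text):
    """Return the code bytes starting at the one marked with <..>."""
    code = re.search(r"Code: (.*)", text).group(1)
    tail = code[code.index("<"):]
    return [int(token.strip("<>"), 16) for token in tail.split()]


insn = faulting_bytes(OOPS_EXCERPT)
# 0f b7 /r is MOVZX r32, r/m16; ModRM byte 0x45 selects [rbp + disp8].
assert insn[:3] == [0x0F, 0xB7, 0x45]
disp8 = insn[3]
effective = register(OOPS_EXCERPT, "RBP") + disp8
print(hex(effective), hex(fault_address(OOPS_EXCERPT)))
```

Both printed values should be the same address, the one the kernel reported in CR2.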

Comment 1 Alex Deucher 2018-02-26 16:11:42 UTC
Created attachment 137609 [details] [review]
possible fix

This patch should fix it.

Comment 2 Alex Deucher 2018-02-27 14:13:37 UTC
Created attachment 137644 [details] [review]
possible fix

Fix includes.

Comment 3 Bas Nieuwenhuizen 2018-02-28 22:49:54 UTC
Sorry for the delay. I can confirm this fixes the NULL dereference.

pp_dpm_mclk and pp_dpm_sclk look empty to me. I'm not sure whether that is just because they are not hooked up yet for SI, but since I don't need DPM and AMDGPU now boots with the default config to a usable state, I'd consider this fixed.

(Leaving this open because I don't know whether the patch has landed yet; feel free to close it when you push it.)

Comment 4 Alex Deucher 2018-03-01 02:03:33 UTC
SI still uses the legacy dpm code rather than powerplay so it doesn't expose all the same options as newer chips.  SI also has an older smu implementation so it has a more limited feature set compared to CI and VI.

Comment 5 mercuriete 2018-03-04 12:12:00 UTC
Sorry for the noise:

I have a NULL dereference on a Cape Verde, but not during boot.

My bug is https://bugs.freedesktop.org/show_bug.cgi?id=102553

Can somebody check that bug?

It blocks me from switching from radeon to amdgpu.

Thank you very much.

Comment 6 Elproducto 2018-03-05 13:28:24 UTC
I can confirm I have the same issue with my GPU passed through to a VM. I'm not sure how to test the possible fix. I have never applied a patch before, but I found a few pointers online. I did the following, but my card still shows the same issue. If possible, please review and let me know how to apply the proposed fix to my system.

git clone git://anongit.freedesktop.org/drm/drm-amd
cd drm-amd
nano 0001-drm-amdgpu-used-cached-pcie-gen-info-for-SI-v2.patch
(paste the content of the possible fix into the file created above)
git am --signoff < 0001-drm-amdgpu-used-cached-pcie-gen-info-for-SI-v2.patch
make defconfig
make
make install
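
One thing worth checking with the steps above, offered as an assumption rather than a diagnosis: `make defconfig` produces a generic configuration, which may not enable the amdgpu driver or its SI support, and in that case the patched code would never be built. The sketch below reads a kernel `.config` and reports which of the named options are off. `option_states` and the sample fragment are hypothetical and written for this illustration; `CONFIG_DRM_AMDGPU` and `CONFIG_DRM_AMDGPU_SI` are the kernel's option names for the driver and for SI parts such as Cape Verde.

```python
def option_states(config_text, names):
    """Map each CONFIG_ name to its value, or 'n' if unset or absent."""
    states = {name: "n" for name in names}
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines are comments, so they stay 'n'.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key in states:
            states[key] = value
    return states


# Hypothetical fragment for illustration, not taken from the reporter's build.
SAMPLE = """\
CONFIG_DRM=y
CONFIG_DRM_AMDGPU=m
# CONFIG_DRM_AMDGPU_SI is not set
"""

NEEDED = ["CONFIG_DRM_AMDGPU", "CONFIG_DRM_AMDGPU_SI"]
states = option_states(SAMPLE, NEEDED)
missing = [name for name in NEEDED if states[name] == "n"]
print(missing)
```

Against a real build tree, the same function can be fed the file contents, for example `option_states(open(".config").read(), NEEDED)`.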
