Bug 108585

Summary:

*ERROR* hw_init of IP block <gfx_v8_0> failed -22

Product:

DRI

Reporter:

Dan Horák <dan>

Component:

DRM/AMDgpu

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED NOTABUG

QA Contact:

Severity:

normal

Priority:

medium

CC:

bcrocker, joel.stan

Version:

unspecified

Hardware:

PowerPC

OS:

Linux (All)

See Also:

https://bugs.freedesktop.org/show_bug.cgi?id=108754

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
full dmesg output	none
possible fix	none
more involved fix	none

Description Dan Horák 2018-10-29 10:29:13 UTC

Created attachment 142253 [details]
full dmesg output

amdgpu driver fails to initialize Radeon WX4100 PRO on my Talos Power9 system with kernel 4.19 (GA). There is no such problem with 4.19-rc8 (and earlier).

...
[    2.421393] [drm] amdgpu kernel modesetting enabled.
[    2.421512] amdgpu 0000:01:00.0: enabling device (0540 -> 0542)
[    2.421732] [drm] initializing kernel modesetting (POLARIS11 0x1002:0x67E3 0x1002:0x0B0D 0x00).
[    2.421776] [drm] register mmio base: 0x00000000
[    2.421781] [drm] register mmio size: 262144
[    2.421787] [drm] PCI I/O BAR is not found.
[    2.421798] [drm] add ip block number 0 <vi_common>
[    2.421801] [drm] add ip block number 1 <gmc_v8_0>
[    2.421805] [drm] add ip block number 2 <tonga_ih>
[    2.421808] [drm] add ip block number 3 <powerplay>
[    2.421811] [drm] add ip block number 4 <dce_v11_0>
[    2.421814] [drm] add ip block number 5 <gfx_v8_0>
[    2.421818] [drm] add ip block number 6 <sdma_v3_0>
[    2.421821] [drm] add ip block number 7 <uvd_v6_0>
[    2.421824] [drm] add ip block number 8 <vce_v3_0>
[    2.421837] [drm] UVD is enabled in VM mode
[    2.421840] [drm] UVD ENC is enabled in VM mode
[    2.421845] [drm] VCE enabled in VM mode
[    2.609475] md/raid1:md127: active with 2 out of 2 mirrors
[    2.625800] md127: detected capacity change from 0 to 481708474368
[    2.627770] md/raid1:md126: active with 2 out of 2 mirrors
[    2.643643] md126: detected capacity change from 0 to 1072693248
[    2.769520] usb 1-4: new high-speed USB device number 4 using xhci_hcd
[    2.769550] ATOM BIOS: 113-D0150600-103
[    2.769747] [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    2.769846] pci 0000:01     : [PE# 00] pseudo-bypass sizes: tracker 32800 bitmap 8192 TCEs 65536
[    2.769851] pci 0000:01     : [PE# 00] TCE tables configured for pseudo-bypass
[    2.769903] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x6000010000000-0x60000101fffff 64bit pref]
[    2.769907] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x6000000000000-0x600000fffffff 64bit pref]
[    2.769939] pci 0000:00:00.0: BAR 15: releasing [mem 0x6000000000000-0x6003fbff0ffff 64bit pref]
[    2.769956] pci 0000:00:00.0: BAR 15: assigned [mem 0x6000000000000-0x600017fffffff 64bit pref]
[    2.769961] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x6000000000000-0x60000ffffffff 64bit pref]
[    2.769972] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x6000100000000-0x60001001fffff 64bit pref]
[    2.770004] pci 0000:00:00.0: PCI bridge to [bus 01]
[    2.770009] pci 0000:00:00.0:   bridge window [mem 0x600c000000000-0x600c07fefffff]
[    2.770015] pci 0000:00:00.0:   bridge window [mem 0x6000000000000-0x6003fbff0ffff 64bit pref]
[    2.770066] amdgpu 0000:01:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[    2.770069] amdgpu 0000:01:00.0: GART: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[    2.770075] [drm] Detected VRAM RAM=4096M, BAR=4096M
[    2.770077] [drm] RAM width 128bits GDDR5
[    2.770162] [TTM] Zone  kernel: Available graphics memory: 32717248 kiB
[    2.770165] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[    2.770166] [TTM] Initializing pool allocator
[    2.771771] [drm] amdgpu: 4096M of VRAM memory ready
[    2.771774] [drm] amdgpu: 4096M of GTT memory ready.
[    2.771790] [drm] GART: num cpu pages 4096, num gpu pages 65536
[    2.771839] [drm] PCIE GART of 256M enabled (table at 0x000000F4008D0000).
[    2.771911] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    2.771913] [drm] Driver supports precise vblank timestamp query.
[    2.772311] [drm] AMDGPU Display Connectors
[    2.772313] [drm] Connector 0:
[    2.772315] [drm]   DP-1
[    2.772316] [drm]   HPD5
[    2.772318] [drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
[    2.772320] [drm]   Encoders:
[    2.772322] [drm]     DFP1: INTERNAL_UNIPHY1
[    2.772323] [drm] Connector 1:
[    2.772325] [drm]   DP-2
[    2.772326] [drm]   HPD4
[    2.772328] [drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
[    2.772330] [drm]   Encoders:
[    2.772332] [drm]     DFP2: INTERNAL_UNIPHY1
[    2.772333] [drm] Connector 2:
[    2.772335] [drm]   DP-3
[    2.772336] [drm]   HPD3
[    2.772338] [drm]   DDC: 0x4870 0x4870 0x4871 0x4871 0x4872 0x4872 0x4873 0x4873
[    2.772340] [drm]   Encoders:
[    2.772341] [drm]     DFP3: INTERNAL_UNIPHY
[    2.772343] [drm] Connector 3:
[    2.772345] [drm]   DP-4
[    2.772346] [drm]   HPD2
[    2.772348] [drm]   DDC: 0x4874 0x4874 0x4875 0x4875 0x4876 0x4876 0x4877 0x4877
[    2.772350] [drm]   Encoders:
[    2.772351] [drm]     DFP4: INTERNAL_UNIPHY
[    2.772477] [drm] Chained IB support enabled!
[    2.773607] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[    2.775588] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[    2.780989] amdgpu: [powerplay] dpm has been enabled
[    2.990665] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xCAFEDEAD)
[    2.990695] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -22
[    2.990698] amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
[    2.990701] amdgpu 0000:01:00.0: Fatal error during GPU init
[    2.990703] [drm] amdgpu: finishing device.
[    3.833155] ------------[ cut here ]------------
[    3.833157] Memory manager not clean during takedown.
[    3.833188] WARNING: CPU: 0 PID: 338 at drivers/gpu/drm/drm_mm.c:950 drm_mm_takedown+0x3c/0x60 [drm]
[    3.833191] Modules linked in: raid1 amdgpu(+) mfd_core chash i2c_algo_bit gpu_sched drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_vpmsum tg3 aacraid drm_panel_orientation_quirks i2c_core
[    3.833204] CPU: 0 PID: 338 Comm: kworker/0:2 Not tainted 4.19.0-1.fc30.op.1.ppc64le #1
[    3.833210] Workqueue: events work_for_cpu_fn
[    3.833213] NIP:  c00800000cdfea14 LR: c00800000cdfea10 CTR: c0000000006ff6e0
[    3.833215] REGS: c0000007f85d74d0 TRAP: 0700   Not tainted  (4.19.0-1.fc30.op.1.ppc64le)
[    3.833217] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 44002222  XER: 20040000
[    3.833224] CFAR: c000000000119b04 IRQMASK: 0 
               GPR00: c00800000cdfea10 c0000007f85d7750 c00800000ce6f200 0000000000000029 
               GPR04: 0000000000000001 0000000000000399 ffffffffffffffff 0000000000000000 
               GPR08: 0000000000000007 0000000000000007 0000000000000001 0769077207750764 
               GPR12: 0000000000002000 c000000001820000 c000000000148e68 c0002006d7c997c0 
               GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
               GPR20: 0000000000000000 0000000000000000 fffffffffffffef7 c0000007e02b3060 
               GPR24: c0000007e02b3080 c0000007e02b3088 c0000007e02b3078 0000000000000000 
               GPR28: 0000000000000000 c0000007e02b2980 c0000007e02b29a0 c0000007f9f38300 
[    3.833258] NIP [c00800000cdfea14] drm_mm_takedown+0x3c/0x60 [drm]
[    3.833265] LR [c00800000cdfea10] drm_mm_takedown+0x38/0x60 [drm]
[    3.833267] Call Trace:
[    3.833275] [c0000007f85d7750] [c00800000cdfea10] drm_mm_takedown+0x38/0x60 [drm] (unreliable)
[    3.833307] [c0000007f85d77b0] [c00800000dba9058] amdgpu_vram_mgr_fini+0x40/0xb0 [amdgpu]
[    3.833313] [c0000007f85d77e0] [c00800000cf15904] ttm_bo_clean_mm+0x10c/0x1a0 [ttm]
[    3.833341] [c0000007f85d7860] [c00800000db7a35c] amdgpu_ttm_fini+0x94/0x180 [amdgpu]
[    3.833370] [c0000007f85d78e0] [c00800000db7d1f8] amdgpu_bo_fini+0x20/0x40 [amdgpu]
[    3.833404] [c0000007f85d7900] [c00800000dc0ce50] gmc_v8_0_sw_fini+0x58/0x98 [amdgpu]
[    3.833440] [c0000007f85d7930] [c00800000dd55718] amdgpu_device_fini+0x3c4/0x628 [amdgpu]
[    3.833469] [c0000007f85d79e0] [c00800000db67b04] amdgpu_driver_unload_kms+0x6c/0x100 [amdgpu]
[    3.833496] [c0000007f85d7a10] [c00800000db67d84] amdgpu_driver_load_kms+0x1ec/0x280 [amdgpu]
[    3.833504] [c0000007f85d7a90] [c00800000cdfa830] drm_dev_register+0x1a8/0x270 [drm]
[    3.833533] [c0000007f85d7b30] [c00800000db60708] amdgpu_pci_probe+0x160/0x290 [amdgpu]
[    3.833537] [c0000007f85d7bc0] [c0000000006d7ddc] local_pci_probe+0x6c/0x140
[    3.833541] [c0000007f85d7c50] [c00000000013ad48] work_for_cpu_fn+0x38/0x60
[    3.833543] [c0000007f85d7c80] [c00000000013f880] process_one_work+0x250/0x500
[    3.833546] [c0000007f85d7d20] [c00000000013fda0] worker_thread+0x270/0x5b0
[    3.833550] [c0000007f85d7dc0] [c00000000014900c] kthread+0x1ac/0x1c0
[    3.833553] [c0000007f85d7e30] [c00000000000bdd4] ret_from_kernel_thread+0x5c/0x68
[    3.833556] Instruction dump:
[    3.833558] 60000000 e9230038 38630038 7fa34800 4d9e0020 7c0802a6 f8010010 f821ffa1 
[    3.833563] 3c620000 e8638540 4803297d e8410018 <0fe00000> 38210060 e8010010 7c0803a6 
[    3.833569] ---[ end trace 6009d10b516b7f29 ]---
[    3.833576] [TTM] Finalizing pool allocator
[    3.833612] [TTM] Zone  kernel: Used memory at exit: 6 kiB
[    3.833617] [TTM] Zone   dma32: Used memory at exit: 6 kiB
[    3.833620] [drm] amdgpu: ttm finalized
[    3.833902] amdgpu: probe of 0000:01:00.0 failed with error -22
...

Both 4.19-rc8 and 4.19 kernels use the same firmware from linux-firmware-20180815-86.gitf1b95fe5.fc28 package.

Comment 1 Michel Dänzer 2018-10-29 11:04:09 UTC

There were no amdgpu driver changes between rc8 and final... Are you sure this is 100% reproducible with the latter and not reproducible with the former? If so, can you bisect?

Comment 2 Dan Horák 2018-10-29 12:58:45 UTC

(In reply to Michel Dänzer from comment #1)
> There were no amdgpu driver changes between rc8 and final... Are you sure
> this is 100% reproducible with the latter and not reproducible with the
> former? If so, can you bisect?

till now 100% reproduceable, will try bisecting the kernel sources and also will look what else might have changed

for the record
- you can find the kernels at https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/builds/
- amdgpu.dc=0 or 1 makes no difference (0 is my default value, see bug 107049)

Comment 3 Dan Horák 2018-10-29 16:08:31 UTC

Ha, so it's the firmware stored in the initrds, what is different (lsinitrd lied). And the latest polaris11* ones provoke the crash. When I manually replaced them with the ones from the rc8 initrd, I've successfully booted into the 4.19 GA kernel.

Comment 4 Joel 2018-10-30 03:39:57 UTC

I see a similar backtrace on 4.19.0-11706-g11743c56785c (Linus' tree mid-merge window).

My system has a "fiji" card. The first kernel is 4.19 (upstream release), and the second kernel where the backtrace occurs is with 4.19+.

The second kernel is kexec'd from the first.

If I don't load amdgpu in the first kernel, the second kernel works. So there is something missed in the shutdown path.

Comment 5 Alex Deucher 2018-10-30 14:20:17 UTC

(In reply to Joel from comment #4)
> I see a similar backtrace on 4.19.0-11706-g11743c56785c (Linus' tree
> mid-merge window).
> 
> My system has a "fiji" card. The first kernel is 4.19 (upstream release),
> and the second kernel where the backtrace occurs is with 4.19+.
> 
> The second kernel is kexec'd from the first.

Please file your own bug.  Kexec is not likely to work and should be tracked separately.

Comment 6 Benjamin Herrenschmidt 2018-10-30 23:17:33 UTC

They may or may not be related ... Alex, kexec is how we boot these machines, there's a Linux kernel in flash that runs a Linux based bootloader.

Until recently however, that didn't have an amdgpu driver. This might have changed.

Dan... did you do some firmware changes here ? Could it have to do with the versions differences between petitboot and the final kernel ?

Alex, whether we track it here or separately, we probably need to look into kexec support. It's not just us, there's a bit of momentum around kexec based bootloaders (google's on it too) as was seen at the recent OSFC (firmware conf).

A workaround in the meantime (for kexec problems) could be to hot reset the card during the transition I suppose.

Comment 7 Joel 2018-10-31 00:44:06 UTC

(In reply to Benjamin Herrenschmidt from comment #6)
> Dan... did you do some firmware changes here ? Could it have to do with the
> versions differences between petitboot and the final kernel ?

FWIW, Talos II machines use a fork of op-build that include the amdgpu driver in petitboot. (They also appear to be stuck on 4.16).

I was experimenting with the same.

Comment 8 Dan Horák 2018-10-31 07:40:25 UTC

I should have mentioned I'm kexec-ing too. It's from 4.15.9 (in skiroot) to Fedora kernels 4.16, 4.17, 4.18 and now 4.19 during the time. It worked fine until the recent amdgpu firmware update. The skiroot kernel uses amdgpu firmware from ~June.

Comment 9 Christian König 2018-10-31 08:09:43 UTC

(In reply to Benjamin Herrenschmidt from comment #6)
> They may or may not be related ... Alex, kexec is how we boot these
> machines, there's a Linux kernel in flash that runs a Linux based bootloader.

Yeah, you guys should have noted that because that combination is known to not work correctly.

The problem is that some parts of the hardware are explicitly designed in a way which only allows loading one firmware after an ASIC reset. So as long as kexec doesn't makes a full PCIe level ASIC reset the second driver load is intended to fail.

We have the same problem with virtualization and used to have a workaround in KVM which triggers the ASIC reset with a PCIe config space write. Alex should know the details.

Only solution I can see is to either use the same workaround as the KVM guys or use the same firmware for both the loader and the final kernel.

Comment 10 Christian König 2018-10-31 08:12:00 UTC

*** Bug 108607 has been marked as a duplicate of this bug. ***

Comment 11 Dan Horák 2018-10-31 08:35:34 UTC

Thanks for the info, I've documented that in the Talos wiki under https://wiki.raptorcs.com/wiki/Troubleshooting/GPU#AMDGPU_driver_crashes_after_firmware_update

Comment 12 Benjamin Herrenschmidt 2018-10-31 09:31:49 UTC

We have no control on what firmware is loaded by the target distro so the right thing is going to reset the adapter.

We'll probably need to add something to the amdgpu shutdown() path to force an adapter reset.

Do you have details of what specific PCIe config space write you use ? FLR ?

Comment 13 Christian König 2018-10-31 10:13:53 UTC

(In reply to Benjamin Herrenschmidt from comment #12)
> We'll probably need to add something to the amdgpu shutdown() path to force
> an adapter reset.

If that would be possible we would have already done that.

The problem is that you do a full ASIC reset. So not only the GPU is affected, but also bridges, sound codecs etc... If any of those parts have a driver loaded while you do the reset you usually crash the system.

Additional to that AFAIK this doesn't work on APUs. Because there the GPU is part of the CPU and so you would need to to reset both.

How about stopping to use amdgpu in the boot loader? For just displaying a splash screen vesafb or efifb should do fine as well.

> Do you have details of what specific PCIe config space write you use ? FLR ?

Alex knows the details of that, but a FLR alone doesn't work AFAIK.

Comment 14 Alex Deucher 2018-10-31 15:13:27 UTC

Created attachment 142303 [details] [review]
possible fix

(In reply to Benjamin Herrenschmidt from comment #12)
> We have no control on what firmware is loaded by the target distro so the
> right thing is going to reset the adapter.
> 
> We'll probably need to add something to the amdgpu shutdown() path to force
> an adapter reset.

Does the attached patch help?  I'd been hesitant to add reset to the shutdown path because it adds latency to the regular shutdown path and users complain when that slows down.

> 
> Do you have details of what specific PCIe config space write you use ? FLR ?

The reset sequence is asic specific.  older parts just happened to use PCI config space to trigger a GPU reset via an AMD specific sequence.  Newer GPUs reset via the PSP.  FLR is only available on SR-IOV capable skus so it's not a general solution.

Comment 15 Alex Deucher 2018-11-01 05:14:16 UTC

Created attachment 142316 [details]
more involved fix

These patches attempt to reset the GPU on init if the GPU was already running from a previous load of the driver.  Compile tested only at the moment.

Comment 16 Dan Horák 2018-11-01 09:49:56 UTC

Reset on init sounds better to me as the loader kernel (in kexec case) is more difficult to update than the host kernel.

And for the record - after updating the skiroot kernel firmware version to the latest there is no problem/crash.

Comment 17 Dan Horák 2018-11-01 12:37:57 UTC

Fedora/ppc64le users can find a pre-built kernel with the patchset at https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/build/817728/

Comment 18 Alex Deucher 2018-12-14 20:46:44 UTC

(In reply to Dan Horák from comment #17)
> Fedora/ppc64le users can find a pre-built kernel with the patchset at
> https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/build/817728/

Should these patches go upstream?  Can you confirm they fix your issues?

Comment 19 Dan Horák 2019-01-08 13:43:22 UTC

They (= https://src.fedoraproject.org/fork/sharkcz/rpms/kernel/blob/talos/f/ppc64-talos-amdgpu-reset.patch) can go upstream.

I have a 4.20 kernel on the host with recent firmware for polaris11, the skiroot boot environment (4.15 IIRC) uses polaris11 firmware files from 20181015. The system boots OK in the skiroot kernel -> kexec -> host kernel sequence.

The diff between the firmware sets is
--- a-old	2019-01-08 14:34:02.300578731 +0100
+++ a-new	2019-01-08 14:34:14.519024951 +0100
@@ -1,19 +1,21 @@
 5dc1006ce1896d997232b9165f2ce8ebad52a63c  polaris11_ce.bin
 193b0baaabf2a0a9e37a201a251013f26b0b70ee  polaris11_ce_2.bin
+8ee2e5db95fe589e8292642e638a15d1b8291bcb  polaris11_k_mc.bin
 cf2b8cd1d7f723f0edb6f17123711d6fa21ef379  polaris11_k_smc.bin
+a4ab9c0484cc9957112ab803d88e5b967c412c01  polaris11_k2_smc.bin
 f6551d45c0b652955009560bea1694c5ca86c1af  polaris11_mc.bin
 a3bfc83f5f52978365428e8756ed165655984a3b  polaris11_me.bin
 eb17beac2b09e25cfdc4662afc353f51a1c23272  polaris11_mec.bin
-c960ba31f13806c889d076bbe0b796281fdb0075  polaris11_mec_2.bin
+4e2290c5e030d6211168802fbe60db53b7c076c5  polaris11_mec_2.bin
 eb17beac2b09e25cfdc4662afc353f51a1c23272  polaris11_mec2.bin
-a13f4c7ce6e30ed930c7a5756196b24114247691  polaris11_mec2_2.bin
+f7956bba6312950db2de12336f79ccb28201593a  polaris11_mec2_2.bin
 a7fb9fab4529707592ce3dd449cd27fbf415fb94  polaris11_me_2.bin
 6377d75775fbe5353c8397ff03d230be0f4d6bcf  polaris11_pfp.bin
-af8c1b94170ada698379d1dab33924003ee525c0  polaris11_pfp_2.bin
+01eda59a1f159889d9a0ea2a9744eae4e09eaa1c  polaris11_pfp_2.bin
 38c512c82fe4773f33e53ba1ada414ddbe9b9e09  polaris11_rlc.bin
 82d8fcf56ac3051981b9e70199b115ed9d46995f  polaris11_sdma.bin
 8b21e98cb7e0ab000d131543c13a3ed95aa6687a  polaris11_sdma1.bin
 e01ac87abb011582d1da84eda9444353de082d11  polaris11_smc.bin
-6b804243472b5653ba449106426a0da1c46a9d84  polaris11_smc_sk.bin
+f8680ef51f84df00b388d9c230a28d150836ce08  polaris11_smc_sk.bin
 85a2f70f1f3b63e02a1bfbaba73a1729cee2104e  polaris11_uvd.bin
 a9abead599bb8497f38d587f682726a00bc067d2  polaris11_vce.bin

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.