Bug 108754

Summary: hard crash of amdgpu in 4.20-rc
Product: DRI Reporter: Dan Horák <dan>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: bcrocker
Version: unspecified   
Hardware: PowerPC   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=108585
Whiteboard:
Attachments:
Description Flags
full dmesg output none

Description Dan Horák 2018-11-15 12:32:48 UTC
Created attachment 142474
full dmesg output

I'm seeing hard crashes (taking down the whole system) in the amdgpu driver with 4.20-rc kernels, starting around rc1. This is on a POWER9 Talos system with a Radeon Pro WX4100.

After running "modprobe amdgpu" on a system booted with "modprobe.blacklist=amdgpu", I got the following and the system stopped responding:
...
lis 15 12:40:56 talos.danny.cz kernel: [drm] amdgpu kernel modesetting enabled.
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: enabling device (0540 -> 0542)
lis 15 12:40:56 talos.danny.cz kernel: [drm] initializing kernel modesetting (POLARIS11 0x1002:0x67E3 0x1002:0x0B0D 0x00).
lis 15 12:40:56 talos.danny.cz kernel: [drm] register mmio base: 0x00000000
lis 15 12:40:56 talos.danny.cz kernel: [drm] register mmio size: 262144
lis 15 12:40:56 talos.danny.cz kernel: [drm] PCI I/O BAR is not found.
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 0 <vi_common>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 1 <gmc_v8_0>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 2 <tonga_ih>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 3 <gfx_v8_0>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 4 <sdma_v3_0>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 5 <powerplay>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 6 <dm>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 7 <uvd_v6_0>
lis 15 12:40:56 talos.danny.cz kernel: [drm] add ip block number 8 <vce_v3_0>
lis 15 12:40:56 talos.danny.cz kernel: [drm] UVD is enabled in VM mode
lis 15 12:40:56 talos.danny.cz kernel: [drm] UVD ENC is enabled in VM mode
lis 15 12:40:56 talos.danny.cz kernel: [drm] VCE enabled in VM mode
lis 15 12:40:56 talos.danny.cz kernel: ATOM BIOS: 113-D0150600-103
lis 15 12:40:56 talos.danny.cz kernel: [drm] vm size is 256 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
lis 15 12:40:56 talos.danny.cz kernel: amdgpu: No suitable DMA available
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x6000010000000-0x60000101fffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x6000000000000-0x600000fffffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: pci 0000:00:00.0: BAR 15: releasing [mem 0x6000000000000-0x6003fbff0ffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: pci 0000:00:00.0: BAR 15: assigned [mem 0x6000000000000-0x600017fffffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x6000000000000-0x60000ffffffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x6000100000000-0x60001001fffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: pci 0000:00:00.0: PCI bridge to [bus 01]
lis 15 12:40:56 talos.danny.cz kernel: pci 0000:00:00.0:   bridge window [mem 0x600c000000000-0x600c07fefffff]
lis 15 12:40:56 talos.danny.cz kernel: pci 0000:00:00.0:   bridge window [mem 0x6000000000000-0x6003fbff0ffff 64bit pref]
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
lis 15 12:40:56 talos.danny.cz kernel: amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
lis 15 12:40:56 talos.danny.cz kernel: [drm] Detected VRAM RAM=4096M, BAR=4096M
lis 15 12:40:56 talos.danny.cz kernel: [drm] RAM width 128bits GDDR5
lis 15 12:40:56 talos.danny.cz kernel: [TTM] Zone  kernel: Available graphics memory: 33386016 kiB
lis 15 12:40:56 talos.danny.cz kernel: [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
lis 15 12:40:56 talos.danny.cz kernel: [TTM] Initializing pool allocator
lis 15 12:40:56 talos.danny.cz kernel: [drm] amdgpu: 4096M of VRAM memory ready
lis 15 12:40:56 talos.danny.cz kernel: [drm] amdgpu: 4096M of GTT memory ready.
lis 15 12:40:56 talos.danny.cz kernel: [drm] GART: num cpu pages 4096, num gpu pages 65536
lis 15 12:40:56 talos.danny.cz kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F4008D0000).
lis 15 12:40:56 talos.danny.cz kernel: [drm] Chained IB support enabled!
lis 15 12:40:56 talos.danny.cz kernel: [drm] Found UVD firmware Version: 1.130 Family ID: 16
lis 15 12:40:56 talos.danny.cz kernel: [drm] Found VCE firmware Version: 53.26 Binary ID: 3
lis 15 12:40:56 talos.danny.cz kernel: amdgpu: [powerplay] dpm has been enabled
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB: values for Engine clock
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         214000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         517000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         845000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         1049000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         1099000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         1136000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         1175000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         1201000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB: Validation clocks:
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:    engine_max_clock: 0
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:    memory_max_clock: 0
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:    level           : 8
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB: reducing engine clock level from 8 to 0
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB: values for Memory clock
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         300000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:         1500000
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB: Validation clocks:
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:    engine_max_clock: 0
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:    memory_max_clock: 0
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB:    level           : 8
lis 15 12:40:56 talos.danny.cz kernel: [drm] DM_PPLIB: reducing memory clock level from 2 to 0
lis 15 12:40:56 talos.danny.cz kernel: [drm] Display Core initialized with v3.1.68!
lis 15 12:40:56 talos.danny.cz kernel: [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
lis 15 12:40:56 talos.danny.cz kernel: [drm] Driver supports precise vblank timestamp query.
lis 15 12:40:56 talos.danny.cz kernel: [drm] UVD and UVD ENC initialized successfully.
lis 15 12:40:58 talos.danny.cz kernel: [drm] VCE initialized successfully.
lis 15 12:40:58 talos.danny.cz kernel: [drm] Cannot find any crtc or sizes
lis 15 12:40:58 talos.danny.cz kernel: Unable to handle kernel paging request for data at address 0xc000001369cefffc
lis 15 12:40:58 talos.danny.cz kernel: Faulting instruction address: 0xc008000011b8be54
lis 15 12:40:58 talos.danny.cz kernel: Oops: Kernel access of bad area, sig: 11 [#1]
lis 15 12:40:58 talos.danny.cz kernel: LE SMP NR_CPUS=1024 NUMA PowerNV
lis 15 12:40:58 talos.danny.cz kernel: Modules linked in: amdgpu(+) mfd_core chash gpu_sched i2c_algo_bit ttm drm_kms_helper drm drm_panel_orientation_quirks fb_sys_fops syscopyarea sysfillrect sysimgblt xt_CHECKSUM ipt_MASQUERADE tun kvm_hv kvm devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc dm_crypt snd_hda_codec_realtek snd_hda_codec_generic at24 snd_hda_codec_hdmi snd_hda_intel regmap_i2c snd_hda_codec ipmi_powernv ipmi_devintf i2c_opal snd_hda_core i2c_core snd_hwdep snd_seq vmx_crypto snd_seq_device snd_pcm ses enclosure ipmi_msghandler snd_timer scsi_transport_sas snd ofpart powernv_flash mtd rtc_opal opal_prd crct10dif_vpmsum soundcore raid1 aacraid tg3 crc32c_vpmsum
lis 15 12:40:58 talos.danny.cz kernel: CPU: 0 PID: 338 Comm: kworker/0:2 Not tainted 4.20.0-rc2+ #1
lis 15 12:40:58 talos.danny.cz kernel: Workqueue: events work_for_cpu_fn
lis 15 12:40:58 talos.danny.cz kernel: NIP:  c008000011b8be54 LR: c008000011b7885c CTR: c008000011b8bd68
lis 15 12:40:58 talos.danny.cz kernel: REGS: c0000007f84533c0 TRAP: 0300   Not tainted  (4.20.0-rc2+)
lis 15 12:40:58 talos.danny.cz kernel: MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 84002482  XER: 20040000
lis 15 12:40:58 talos.danny.cz kernel: CFAR: c008000011b8c6fc DAR: c000001369cefffc DSISR: 42000000 IRQMASK: 0 
                                       GPR00: c008000011b7885c c0000007f8453648 c008000011d69e00 c0000007f74bf67c 
                                       GPR04: 000000000001d524 00000000000249f0 c0000007f8453758 0000000020130307 
                                       GPR08: c000001369cefff4 c000000769cf0000 0000000000000001 0000000002100800 
                                       GPR12: c008000011b8bd68 c0000000018b0000 c000000000151e88 c0000007fe1f8340 
                                       GPR16: 0000000000000000 0000000000000000 0000000000000000 c0000007f87d30c0 
                                       GPR20: c0000007f87d30c8 c0000007f87d30b8 c0000007f87d30d8 c0000007f87d30e0 
                                       GPR24: c0000007f87d30d0 c0000007f87dc528 0000000000000000 0000000000000001 
                                       GPR28: c000000769cf0000 c0000007f8453710 c0000007f74b2340 c000200721935c00 
lis 15 12:40:58 talos.danny.cz kernel: NIP [c008000011b8be54] smu7_set_power_state_tasks+0xec/0xab0 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: LR [c008000011b7885c] phm_set_power_state+0x64/0xc0 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: Call Trace:
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453648] [c008000011b4ee7c] amdgpu_cgs_write_ind_register+0x84/0x170 [amdgpu] (unreliable)
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f84536e8] [c008000011b7885c] phm_set_power_state+0x64/0xc0 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453728] [c008000011ba0d48] psm_adjust_power_state_dynamic+0x130/0x270 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453788] [c008000011b764f0] hwmgr_handle_task+0x58/0x178 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f84537c8] [c008000011bae29c] pp_late_init+0xa4/0x1f0 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453868] [c008000011a318d8] amdgpu_device_ip_late_init+0x90/0x1b0 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f84538f8] [c008000011a34cb8] amdgpu_device_init+0x1590/0x18e0 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453a08] [c008000011a3823c] amdgpu_driver_load_kms+0xb4/0x330 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453a88] [c008000010ccae30] drm_dev_register+0x1b8/0x280 [drm]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453b28] [c008000011a306bc] amdgpu_pci_probe+0x114/0x200 [amdgpu]
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453bb8] [c00000000070024c] local_pci_probe+0x6c/0x140
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453c48] [c000000000143b88] work_for_cpu_fn+0x38/0x60
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453c78] [c000000000148c40] process_one_work+0x250/0x500
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453d18] [c000000000149160] worker_thread+0x270/0x5b0
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453db8] [c00000000015202c] kthread+0x1ac/0x1c0
lis 15 12:40:58 talos.danny.cz kernel: [c0000007f8453e28] [c00000000000bdd0] ret_from_kernel_thread+0x5c/0x6c
lis 15 12:40:58 talos.danny.cz kernel: Instruction dump:
lis 15 12:40:58 talos.danny.cz kernel: 7d485378 7f872000 419e0464 39480001 38c6000c 794a0020 4200ffe4 1d08000c 
lis 15 12:40:58 talos.danny.cz kernel: 81490d3c 614a0001 7d094214 91490d3c <90880008> 81490064 2faa0000 419e0880 
lis 15 12:40:58 talos.danny.cz kernel: ---[ end trace d5e132cd328da1c7 ]---
lis 15 12:40:58 talos.danny.cz kernel:
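
As a cross-check on the trace above, here is a minimal sketch that decodes the instruction word marked with angle brackets in the instruction dump and recomputes the address it touched. The register values are copied from the oops; the helper name decode_d_form is made up for illustration, and the field layout assumes the standard PowerPC D-form encoding (primary opcode 36 being stw). It is a reading aid, not a diagnosis of the underlying bug.

```python
# Values copied from the oops above.
INSN = 0x90880008                      # word marked <...> in the instruction dump
GPR = {4: 0x000000000001D524,          # GPR04
       8: 0xC000001369CEFFF4}          # GPR08
DAR = 0xC000001369CEFFFC               # faulting data address reported by the kernel

# Only the mnemonic needed here; assumption: D-form primary opcode 36 is stw.
MNEMONICS = {36: "stw"}


def decode_d_form(insn):
    """Split a 32-bit PowerPC D-form word into (opcode, rs, ra, displacement)."""
    opcode = (insn >> 26) & 0x3F
    rs = (insn >> 21) & 0x1F
    ra = (insn >> 16) & 0x1F
    d = insn & 0xFFFF
    if d & 0x8000:                     # sign-extend the 16-bit displacement
        d -= 0x10000
    return opcode, rs, ra, d


opcode, rs, ra, d = decode_d_form(INSN)
# Effective address of the store: base register plus displacement, 64-bit wrap.
ea = (GPR[ra] + d) & 0xFFFFFFFFFFFFFFFF

print(f"{MNEMONICS.get(opcode, '?')} r{rs}, {d}(r{ra}) -> EA {ea:#x}")
print("matches DAR:", ea == DAR)
```

Under those assumptions the word decodes to stw r4, 8(r8), and GPR08 + 8 equals the reported DAR, so the faulting access is a 32-bit store through the pointer held in r8.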
Comment 1 Dan Horák 2018-12-14 08:22:26 UTC
Not a problem anymore with 4.20.0-0.rc6.git0.1.fc30.op.1.ppc64le (contains the reset fix from https://bugs.freedesktop.org/show_bug.cgi?id=108585#c15)
Comment 2 Alex Deucher 2018-12-14 20:46:55 UTC
(In reply to Dan Horák from comment #1)
> Not a problem anymore with 4.20.0-0.rc6.git0.1.fc30.op.1.ppc64le (contains
> the reset fix from https://bugs.freedesktop.org/show_bug.cgi?id=108585#c15)

Should these patches go upstream?  Can you confirm they fix your issues?
