108781 – 4.19 Regression - Hawaii (R9 390) boot failure - Invalid PCC GPIO / invalid powerlevel state / Fatal error during GPU init

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	Default DRI bug account

Reported:	2018-11-17 21:25 UTC by jamespharvey20
Modified:	2019-11-19 09:04 UTC
CC List:	12 users

Attachments
dmesg (journalctl) of failure on 4.19.2.arch1-1 (159.47 KB, text/plain) 2018-11-17 21:25 UTC, jamespharvey20
dmesg (journalctl) of failure on 4.19.arch1-1 (175.09 KB, text/plain) 2018-11-17 21:26 UTC, jamespharvey20
dmesg (journalctl) of working on 4.18.16.arch1-1 (151.43 KB, text/plain) 2018-11-17 21:26 UTC, jamespharvey20
journalctl of c91b007ed137, which gets to a tty (200.49 KB, text/plain) 2018-11-22 00:15 UTC, jamespharvey20
journalctl of 0d9988910989, which gets to a black screen (195.49 KB, text/plain) 2018-11-22 00:15 UTC, jamespharvey20
journalctl of git master 7c98a4261827, with patch 259364, which gets to a black screen (189.75 KB, text/plain) 2018-11-23 22:43 UTC, jamespharvey20
Patch to workaround FATAL issue on VCE2.0 ringtest initialization . (678 bytes, patch) 2018-12-02 06:34 UTC, alex.vl
Full dmesg && lspci (failure) on linux-4.19.6 + pcifix (https://patchwork.freedesktop.org/patch/259364/) (24.88 KB, application/gzip) 2018-12-03 13:26 UTC, alex.vl

Description jamespharvey20 2018-11-17 21:25:58 UTC

Created attachment 142499
dmesg (journalctl) of failure on 4.19.2.arch1-1

arch 4.18.16.arch1-1 works, using kernel parameters:

 radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1

Upgraded to 4.19.2.arch1-1, and started getting this failure.  Going back to 4.19.arch1-1 still gives this failure.

Full dmesg (journalctl) output is attached for 4.19.2.arch1-1 (failing), 4.19.arch1-1 (failing), and 4.18.16.arch1-1 (working).  The pertinent part of the failure is included below so that it is searchable.

This failure occurs when booting to a tty, so no X logs are involved.  (On 4.18.16.arch1-1 you may notice a [drm:generic_reg_wait [amdgpu]] error and backtrace.  That has been happening forever, but the system works and it doesn't cause a noticeable problem.)

-----

# lspci -v
...
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] (rev 80) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. Hawaii PRO [Radeon R9 290/390]
        Flags: bus master, fast devsel, latency 0, IRQ 75, NUMA node 0
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=8M]
        I/O ports at 8000 [size=256]
        Memory at dfe00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] Resizable BAR <?>
        Capabilities: [270] Secondary PCI Express <?>
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Kernel driver in use: amdgpu
        Kernel modules: radeon, amdgpu

-----

[drm] Invalid PCC GPIO: 13!
        ui class: none
        internal class: boot
        caps:
        uvd    vclk: 0 dclk: 0
                power level 0    sclk: 30000 mclk: 15000 pcie gen: 3 pcie lanes: 16
        status: c r b
        ui class: performance
        internal class: none
        caps:
        uvd    vclk: 0 dclk: 0
                power level 0    sclk: 30000 mclk: 15000 pcie gen: 3 pcie lanes: 16
                power level 1    sclk: 105000 mclk: 150000 pcie gen: 3 pcie lanes: 16
        status:
[drm] amdgpu: dpm initialized
[drm] Found UVD firmware Version: 1.64 Family ID: 9
[drm] Found VCE firmware Version: 50.10 Binary ID: 2
[drm] PCIE gen 3 link speeds already enabled
[drm:dm_pp_get_static_clocks [amdgpu]] *ERROR* DM_PPLIB: invalid powerlevel state: 0!
[drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[drm] Display Core initialized with v3.1.59!
[drm] DM_MST: Differing MST start on aconnector: 00000000d3bd29d7 [id: 55]
[drm] DM_MST: Differing MST start on aconnector: 000000004b0d56b6 [id: 57]
[drm] DM_MST: Differing MST start on aconnector: 0000000058d5a853 [id: 59]
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
[drm] UVD initialized successfully.
[drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
[drm:amdgpu_device_init.cold.14 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: Fatal error during GPU init
[drm] amdgpu: finishing device.
------------[ cut here ]------------
Memory manager not clean during takedown.
WARNING: CPU: 0 PID: 670 at drivers/gpu/drm/drm_mm.c:950 drm_mm_takedown+0x1f/0x30 [drm]
Modules linked in: amdkfd amd_iommu_v2 amdgpu(+) intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i>
 x_tables sr_mod cdrom btrfs xor sd_mod dm_thin_pool dm_persistent_data raid6_pq dm_bio_prison dm_bufio libcrc32c crc32c_gener>
CPU: 0 PID: 670 Comm: kworker/0:4 Not tainted 4.19.0-arch1-1-ARCH #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EP2C602, BIOS P1.90 04/12/2018
Workqueue: events work_for_cpu_fn
RIP: 0010:drm_mm_takedown+0x1f/0x30 [drm]
Code: 0d d0 cb 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 47 38 48 83 c7 38 48 39 c7 75 01 c3 48 c7 c7 08 b1 1b c1 e8 5b 10 >
RSP: 0018:ffff91764827bd08 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8e5a1b613200 RCX: 0000000000000000
RDX: 0000000000000007 RSI: ffffffff8de9d696 RDI: 00000000ffffffff
RBP: ffff8e5a0ca729a0 R08: 0000000000000001 R09: 00000000000005aa
R10: 0000000000000004 R11: 0000000000000000 R12: ffff8e5a1b6132e8
R13: 0000000000000000 R14: 0000000000000170 R15: ffff8e5a0c69e650
FS:  0000000000000000(0000) GS:ffff8e5a1f800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4f26530480 CR3: 00000001f0a0a006 CR4: 00000000000606f0
Call Trace:
 amdgpu_vram_mgr_fini+0x27/0x50 [amdgpu]
 ttm_bo_clean_mm+0xa9/0xb0 [ttm]
 amdgpu_ttm_fini+0x71/0x100 [amdgpu]
 amdgpu_bo_fini+0xe/0x30 [amdgpu]
 gmc_v7_0_sw_fini+0x32/0x60 [amdgpu]
 amdgpu_device_fini+0x2cc/0x4aa [amdgpu]
 amdgpu_driver_unload_kms+0x42/0x90 [amdgpu]
 amdgpu_driver_load_kms+0x168/0x2c0 [amdgpu]
 drm_dev_register+0x109/0x140 [drm]
 amdgpu_pci_probe+0x13c/0x1c0 [amdgpu]
 ? _raw_spin_unlock_irqrestore+0x20/0x40
 local_pci_probe+0x41/0x90
 work_for_cpu_fn+0x16/0x20
 process_one_work+0x1eb/0x410
 worker_thread+0x218/0x3d0
 ? process_one_work+0x410/0x410
 kthread+0x112/0x130
 ? kthread_park+0x80/0x80
 ret_from_fork+0x35/0x40
---[ end trace 3cf1bcf02bf4fe1a ]---
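
For anyone who wants to check their own logs against the excerpt above, here is a minimal Python sketch that pulls out the matching lines. The marker strings are copied from this report. The function name and the sample text are only illustrative and are not part of any existing tool.

```python
# Substrings taken from the failing log in this report.
MARKERS = (
    "Invalid PCC GPIO",
    "invalid powerlevel state",
    "ring 12 test failed",
    "hw_init of IP block",
    "amdgpu_device_ip_init failed",
    "Fatal error during GPU init",
)


def find_amdgpu_failures(log_text):
    """Return (marker, line) pairs for every log line containing a marker."""
    hits = []
    for line in log_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                hits.append((marker, line.strip()))
                break  # report each line only once
    return hits


# Illustrative sample assembled from lines quoted in this report.
SAMPLE = """\
[drm] Invalid PCC GPIO: 13!
[drm] amdgpu: dpm initialized
[drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
amdgpu 0000:03:00.0: Fatal error during GPU init
"""

if __name__ == "__main__":
    for marker, line in find_amdgpu_failures(SAMPLE):
        print(f"{marker!r}: {line}")
```

The same function could be fed the saved output of dmesg or journalctl read from a file.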

Comment 1 jamespharvey20 2018-11-17 21:26:24 UTC

Created attachment 142500
dmesg (journalctl) of failure on 4.19.arch1-1

Comment 2 jamespharvey20 2018-11-17 21:26:49 UTC

Created attachment 142501
dmesg (journalctl) of working on 4.18.16.arch1-1

Comment 3 jamespharvey20 2018-11-17 21:34:08 UTC

I should add that the screen goes black and the system is unresponsive after this.

Comment 4 Alex Deucher 2018-11-19 15:01:05 UTC

Possibly the same issue as bug 108704.  Does the patch there help?

Comment 5 mike 2018-11-19 17:25:08 UTC

I can add that I also hit this on an R9 290 reference card:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] (prog-if 00 [VGA controller])
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b00
        Flags: fast devsel, IRQ 16
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=8M]
        I/O ports at e000 [size=256]
        Memory at f7e00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [270] Secondary PCI Express <?>
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Kernel modules: radeon, amdgpu


with Fedora's:

kernel-4.19.2-300.fc29.x86_64


with options:

$ cat /etc/modprobe.d/amdgpu.conf
blacklist radeon
options amdgpu cik_support=1
options amdgpu dpm=1
options amdgpu dc=0
options amdgpu pcie_gen2=0


Dmesg:


kern  :err   : [Mon Nov 19 12:11:42 2018] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
kern  :err   : [Mon Nov 19 12:11:42 2018] [drm:amdgpu_device_init.cold.28 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
kern  :err   : [Mon Nov 19 12:11:42 2018] amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
kern  :err   : [Mon Nov 19 12:11:42 2018] amdgpu 0000:01:00.0: Fatal error during GPU init
kern  :info  : [Mon Nov 19 12:11:42 2018] [drm] amdgpu: finishing device.
kern  :warn  : [Mon Nov 19 12:11:42 2018] ------------[ cut here ]------------
kern  :warn  : [Mon Nov 19 12:11:42 2018] Memory manager not clean during takedown.
kern  :warn  : [Mon Nov 19 12:11:42 2018] WARNING: CPU: 1 PID: 380 at drivers/gpu/drm/drm_mm.c:950 drm_mm_takedown+0x1f/0x30 [drm]
kern  :warn  : [Mon Nov 19 12:11:42 2018] Modules linked in: btrfs libcrc32c xor amdkfd zstd_decompress zstd_compress amd_iommu_v2 xxhash amdgpu(+) raid6_pq chash gpu_sched i2c_algo_bit drm_kms_helper ttm crc32c_intel drm e1000e serio_raw uas usb_storage bfq lz4 lz4_compress
kern  :warn  : [Mon Nov 19 12:11:42 2018] CPU: 1 PID: 380 Comm: systemd-udevd Not tainted 4.19.2-300.fc29.x86_64 #1
kern  :warn  : [Mon Nov 19 12:11:42 2018] Hardware name: System manufacturer System Product Name/P8P67 PRO REV 3.1, BIOS 3602 11/01/2012
kern  :warn  : [Mon Nov 19 12:11:42 2018] RIP: 0010:drm_mm_takedown+0x1f/0x30 [drm]
kern  :warn  : [Mon Nov 19 12:11:42 2018] Code: f6 c3 48 8d 41 c0 eb bb 0f 1f 00 66 66 66 66 90 48 8b 47 38 48 83 c7 38 48 39 c7 75 01 c3 48 c7 c7 a0 88 4e c0 e8 6b 2d c0 fb <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 66 66 66 66 90
kern  :warn  : [Mon Nov 19 12:11:42 2018] RSP: 0018:ffffbced41e5f9e8 EFLAGS: 00010282
kern  :warn  : [Mon Nov 19 12:11:42 2018] RAX: 0000000000000000 RBX: ffff948185a21d00 RCX: 0000000000000006
kern  :warn  : [Mon Nov 19 12:11:42 2018] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff94818ea96860
kern  :warn  : [Mon Nov 19 12:11:42 2018] RBP: ffff9481836429a0 R08: 000000000000003c R09: 0000000000000003
kern  :warn  : [Mon Nov 19 12:11:42 2018] R10: 0000000000000000 R11: 0000000000000001 R12: ffff948183642980
kern  :warn  : [Mon Nov 19 12:11:42 2018] R13: 0000000000000000 R14: 0000000000000170 R15: ffff9481860b9e30
kern  :warn  : [Mon Nov 19 12:11:42 2018] FS:  00007f283bcf8940(0000) GS:ffff94818ea80000(0000) knlGS:0000000000000000
kern  :warn  : [Mon Nov 19 12:11:42 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [Mon Nov 19 12:11:42 2018] CR2: 0000559d1d60abd8 CR3: 0000000404f18004 CR4: 00000000000606e0
kern  :warn  : [Mon Nov 19 12:11:42 2018] Call Trace:
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_vram_mgr_fini+0x22/0x40 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ttm_bo_clean_mm+0xa2/0xb0 [ttm]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_ttm_fini+0x71/0x100 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_bo_fini+0xe/0x30 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  gmc_v7_0_sw_fini+0x32/0x60 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_device_fini+0x2cc/0x487 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_driver_unload_kms+0x42/0x90 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_driver_load_kms+0x146/0x2c0 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  drm_dev_register+0x109/0x140 [drm]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  amdgpu_pci_probe+0x13c/0x1c0 [amdgpu]
kern  :warn  : [Mon Nov 19 12:11:42 2018]  local_pci_probe+0x41/0x90
kern  :warn  : [Mon Nov 19 12:11:42 2018]  pci_device_probe+0x188/0x1a0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  really_probe+0x235/0x3a0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  driver_probe_device+0xb3/0xf0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  __driver_attach+0xdd/0x110
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? driver_probe_device+0xf0/0xf0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  bus_for_each_dev+0x76/0xc0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? klist_add_tail+0x3b/0x60
kern  :warn  : [Mon Nov 19 12:11:42 2018]  bus_add_driver+0x152/0x230
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? 0xffffffffc090d000
kern  :warn  : [Mon Nov 19 12:11:42 2018]  driver_register+0x6b/0xb0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? 0xffffffffc090d000
kern  :warn  : [Mon Nov 19 12:11:42 2018]  do_one_initcall+0x46/0x1c3
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? _cond_resched+0x15/0x30
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? kmem_cache_alloc_trace+0x15f/0x1e0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  do_init_module+0x5a/0x210
kern  :warn  : [Mon Nov 19 12:11:42 2018]  load_module+0x206d/0x22d0
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? __switch_to_asm+0x40/0x70
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? __switch_to_asm+0x34/0x70
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? __switch_to_asm+0x40/0x70
kern  :warn  : [Mon Nov 19 12:11:42 2018]  ? __do_sys_init_module+0x13d/0x180
kern  :warn  : [Mon Nov 19 12:11:42 2018]  __do_sys_init_module+0x13d/0x180
kern  :warn  : [Mon Nov 19 12:11:42 2018]  do_syscall_64+0x5b/0x160
kern  :warn  : [Mon Nov 19 12:11:42 2018]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kern  :warn  : [Mon Nov 19 12:11:42 2018] RIP: 0033:0x7f283c9b2fde
kern  :warn  : [Mon Nov 19 12:11:42 2018] Code: 48 8b 0d ad 1e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7a 1e 0c 00 f7 d8 64 89 01 48
kern  :warn  : [Mon Nov 19 12:11:42 2018] RSP: 002b:00007fff3b08bb08 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
kern  :warn  : [Mon Nov 19 12:11:42 2018] RAX: ffffffffffffffda RBX: 0000559d1d6382c0 RCX: 00007f283c9b2fde
kern  :warn  : [Mon Nov 19 12:11:42 2018] RDX: 0000559d1d619d90 RSI: 0000000000607e0e RDI: 0000559d1def3430
kern  :warn  : [Mon Nov 19 12:11:42 2018] RBP: 0000559d1d619d90 R08: 0000000000000007 R09: 0000000000000006
kern  :warn  : [Mon Nov 19 12:11:42 2018] R10: 0000559d1d607010 R11: 0000000000000246 R12: 0000559d1def3430
kern  :warn  : [Mon Nov 19 12:11:42 2018] R13: 0000559d1d63a970 R14: 0000000000020000 R15: 0000000000000000
kern  :warn  : [Mon Nov 19 12:11:42 2018] ---[ end trace 0596c9d7ae3ce46b ]---

Comment 6 jamespharvey20 2018-11-20 02:26:51 UTC

Alex Deucher, unfortunately the patch on bug 108704 has no effect.

Comment 7 Linnea S 2018-11-20 12:50:50 UTC

I'm also hitting this bug with MSI R9 390 and kernel 4.19.2-300.fc29.x86_64. It works with 4.18.17-300.fc29.x86_64. Boot options:
radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dc=0 amdgpu.dpm=1

Please let me know if there's something I can try to help troubleshoot this issue.

Backtrace:
------------[ cut here ]------------
Memory manager not clean during takedown.
WARNING: CPU: 0 PID: 437 at drivers/gpu/drm/drm_mm.c:950 drm_mm_takedown+0x1f/0x30 [drm]
Modules linked in: amdkfd amd_iommu_v2 amdgpu(+) chash gpu_sched radeon drm_kms_helper ttm drm igb dca i2c_algo_bit nvme crc32c_intel nvme_core pinctrl_amd
CPU: 0 PID: 437 Comm: systemd-udevd Not tainted 4.19.2-300.fc29.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 4018 07/12/2018
RIP: 0010:drm_mm_takedown+0x1f/0x30 [drm]
Code: f6 c3 48 8d 41 c0 eb bb 0f 1f 00 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 c7 75 01 c3 48 c7 c7 a0 88 2b c0 e8 6b 2d e3 f0 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00
RSP: 0018:ffffa5c881fdf9e8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff931b04cd3900 RCX: 0000000000000006
RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff931b0ea16860
RBP: ffff931b013e29a0 R08: 000000000000003c R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000001 R12: ffff931b013e2980
R13: 0000000000000000 R14: 0000000000000170 R15: ffff931b02157b30
FS:  00007f79bad29940(0000) GS:ffff931b0ea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bb8c6fc840 CR3: 000000040397a000 CR4: 00000000003406f0
Call Trace:
 amdgpu_vram_mgr_fini+0x22/0x40 [amdgpu]
 ttm_bo_clean_mm+0xa2/0xb0 [ttm]
 amdgpu_ttm_fini+0x71/0x100 [amdgpu]
 amdgpu_bo_fini+0xe/0x30 [amdgpu]
 gmc_v7_0_sw_fini+0x32/0x60 [amdgpu]
 amdgpu_device_fini+0x2cc/0x487 [amdgpu]
 amdgpu_driver_unload_kms+0x42/0x90 [amdgpu]
 amdgpu_driver_load_kms+0x146/0x2c0 [amdgpu]
 drm_dev_register+0x109/0x140 [drm]
 amdgpu_pci_probe+0x13c/0x1c0 [amdgpu]
 local_pci_probe+0x41/0x90
 pci_device_probe+0x188/0x1a0
 really_probe+0x235/0x3a0
 driver_probe_device+0xb3/0xf0
 __driver_attach+0xdd/0x110
 ? driver_probe_device+0xf0/0xf0
 bus_for_each_dev+0x76/0xc0
 ? klist_add_tail+0x3b/0x60
 bus_add_driver+0x152/0x230
 ? 0xffffffffc087f000
 driver_register+0x6b/0xb0
 ? 0xffffffffc087f000
 do_one_initcall+0x46/0x1c3
 ? _cond_resched+0x15/0x30
 ? kmem_cache_alloc_trace+0x15f/0x1e0
 do_init_module+0x5a/0x210
 load_module+0x206d/0x22d0
 ? __do_sys_init_module+0x13d/0x180
 __do_sys_init_module+0x13d/0x180
 do_syscall_64+0x5b/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f79bb9e3fde
Code: 48 8b 0d ad 1e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7a 1e 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffea6d7de98 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 00005646e3c2ac10 RCX: 00007f79bb9e3fde
RDX: 00005646e3c09890 RSI: 0000000000607e0e RDI: 00005646e47a4640
RBP: 00005646e3c09890 R08: 0000000000000007 R09: 0000000000000006
R10: 00005646e3bf7010 R11: 0000000000000246 R12: 00005646e47a4640
R13: 00005646e3c0bac0 R14: 0000000000020000 R15: 0000000000000000
---[ end trace cfe6e347a1906090 ]---

Comment 8 Alex Deucher 2018-11-20 14:35:38 UTC

Can you bisect?

Comment 9 jamespharvey20 2018-11-20 20:20:35 UTC

I have been bisecting.  It's really fun that there are hundreds, maybe around a thousand, commits in the bisection range that won't compile.  See 4eaf317a, which explains that the commits between 39b4cbad and itself are broken.

Cherry-picking 4eaf317a has applied cleanly on the next 2 bisection steps.  Now at 560 revisions, 9 steps.
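
An alternative to cherry-picking the build fix at every step is to let `git bisect run` skip commits that cannot be built. This is not what was done in this bisect. `git bisect run` treats exit status 0 as good, 125 as "cannot be tested, skip", and any other status from 1 to 127 as bad. The Python sketch below shows only that exit-code decision. The function name, its arguments, and the choice of failure marker are assumptions for illustration, and the boot log would still have to be collected by hand or by some test harness.

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit statuses understood by `git bisect run`


def bisect_verdict(build_ok, boot_log, bad_marker="Fatal error during GPU init"):
    """Map one bisection step to an exit status for `git bisect run`.

    build_ok   -- False when the commit does not compile
    boot_log   -- kernel log text captured after booting the commit
    bad_marker -- substring whose presence marks the commit as bad (assumed)
    """
    if not build_ok:
        return SKIP  # untestable commit: git will pick a nearby one instead
    if bad_marker in boot_log:
        return BAD
    return GOOD
```

A wrapper script would end with `sys.exit(bisect_verdict(...))` so that git sees the status.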

Comment 10 jamespharvey20 2018-11-21 00:06:54 UTC

Bisecting results are below.

This is an Asus model STRIX-R9390-DC3OC-8GD5-GAMING.

I don't think it's relevant, other than explaining why someone was looking at this code, but this commit immediately follows a bunch of the vkms commits, and the crash to a black screen happens right when amdgpu attempts to switch to kms automatically.

0d99889109892396a8164bf6dd178e36d3fe3166 is the first bad commit
commit 0d99889109892396a8164bf6dd178e36d3fe3166
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Jun 28 16:13:07 2018 +0300

    drm/fb-helper: Eliminate the .best_encoder() usage

    Instead of using the .best_encoder() hook to figure out whether a given
    connector+crtc combo will work, let's instead do what userspace does and
    just iterate over all the encoders for the connector, and then check
    each crtc against each encoder's possible_crtcs bitmask.

    v2: Avoid oopsing on NULL encoders (Daniel)
        s/connector_crtc_ok/connector_has_possible_crtc/

    Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
    Cc: Harry Wentland <harry.wentland@amd.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180628131315.14156-2-ville.syrjala@linux.intel.com

Comment 11 Daniel Vetter 2018-11-21 07:52:55 UTC

From the logs:

Nov 17 03:24:04 newKvm kernel: [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
Nov 17 03:24:04 newKvm kernel: [drm:amdgpu_device_init.cold.14 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
Nov 17 03:24:04 newKvm kernel: amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
Nov 17 03:24:04 newKvm kernel: amdgpu 0000:03:00.0: Fatal error during GPU init

That's nowhere near the modeset code, and even further away from the fbdev code changed by the commit the bisect landed on. I suspect the bisect got derailed somewhere among all the skipped commits. Please double-check whether the parent of the first bad commit works (and perhaps also double-check the bad commit again). I recommend compiling with CONFIG_LOCALVERSION_AUTO and checking with uname -r that you are running the right git version, to make really sure.
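
The uname -r check can be scripted. With CONFIG_LOCALVERSION_AUTO the kernel release string typically ends in a git-describe-style suffix containing `-g<abbreviated hash>`. The Python sketch below compares that suffix against the commit being tested. The function name and the example release strings are made up for illustration, and the exact suffix format is an assumption that may vary between kernel versions.

```python
import re


def release_matches_commit(release, commit):
    """True if the -g<hash> suffix in a kernel release string names `commit`."""
    match = re.search(r"-g([0-9a-f]{7,40})", release)
    if match is None:
        return False  # no git hash in the release string
    built = match.group(1)
    # The suffix is an abbreviated hash, so compare on the shorter length.
    n = min(len(built), len(commit))
    return built[:n] == commit[:n].lower()
```

On the machine under test, the first argument could come from `platform.release()`, which reports the same string as uname -r on Linux.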

Comment 12 jamespharvey20 2018-11-21 09:48:46 UTC

Daniel, based on your questions and looking back, I am definitely second-guessing my last "bisect good" entry.  I think I may have been hasty, thinking I was done with a frustrating bisect.  If that's what happened, the first bad commit is probably going to be 854502fa0a38.

In about 10 hours I'll be able to redo the final bisect steps, double-check that the bad commit fails and its parent works, and post journalctl output showing the kernel version, to be absolutely sure.

Comment 13 jamespharvey20 2018-11-22 00:13:17 UTC

In all seriousness, can the AMD devs please tell me exactly which make and model video card the devs use?  As long as it's something that has 3+ DisplayPorts, and can display 5 monitors using chaining, I'd honestly rather have to buy that and be done with all this, and sell mine on eBay saying "windows only".

The symptom I see, and others are seeing, is that 4.18.16 boots to a tty just fine, and 4.19 goes to a black screen when I'd expect it to automatically use kms to go to a higher resolution.

Bisecting between 4.18.16 and 4.19 unfortunately runs across multiple other amdgpu bugs that make this a tangled mess of spaghetti.  Bisecting with "do I get to see a tty on my monitor" as the deciding factor for good/bad consistently ends with 0d998891 as bad and its parent c91b007e as good.  I've confirmed this by booting each of them a bunch of times.  See line 3 of the newly attached journalctl logs, which includes the automatic kernel version string confirming this.

I really hope I'm wrong about this, but I don't think I've found the bug that makes my screen go black in 4.19.  I'm saying this because the journalctl differences that show what's wrong with 0d998891 do not show up in 4.19.  I think the 0d998891 bug was fixed by a later commit, and I haven't yet reached the bug I really care about in 4.19.  The prospect of continuing to bisect thousands of other commits, with the multiple amdgpu bugs discussed below in between these versions, plus who knows how many other bugs that pop up and get fixed along the way, infuriates me.

This isn't just complaining about bisecting.  What in the world am I supposed to use as the deciding factor for "good" vs "bad"?  For commits more recent than 0d998891, the screen is going to be black a lot of the time, but I can't use that because I'm hunting for the "other black screen" bug.  There are so many errors in the 4.19 journalctl that, since I couldn't go by whether the screen is on, I'd be comparing tons of journalctl logs, maybe going by "amdgpu_device_ip_init failed".  But what if that isn't the deciding factor?

I think all of this is why you were saying you don't think 0d998891 is the problem, because the 4.18.16 vs 4.19 original journalctl's I attached are showing a bug from somewhere else.

With there being multiple bugs that pop up and back out, I honestly think AMD needs to revert all changes between 4.18.16 and 4.19, and only re-add them once it has actually tested the commits with its own products.  The cards being discussed here are not unusual or old.  I don't mind doing a bisect for an open source project once in a while, but I think having to get this deep is going too far.  With this being a company writing code for its own product, rather than something like a filesystem bug, I don't feel like this depth of bug hunting should be on me.

If I'm wrong and 0d998891 is truly the source of the problem, and for some reason the 4.19 journalctl just doesn't show the errors at the bottom of this comment, then let me apologize and retract most of my rant here.  But with its journalctl errors disappearing somewhere between it and 4.19, I don't feel like I'm wrong.



In my last comment, I thought it was at least possible that I had the wrong commit at the very end, because I couldn't help but notice that the parent/good commit and the ones before it concern vkms.  With the worst symptom being a black screen at the kms stage, it seemed to make sense that vkms was somehow turning my system into a headless system, making the screen black.  But that's *NOT* what's happening.  The parent/good commit has vkms=n.  Although Arch 4.19 has vkms=m, I've been using Arch's 4.18 config, which doesn't even have vkms, so it winds up using the default of =n.  (Furthermore, I've tested Arch 4.19 as it is but with vkms=n, and I still get a black screen.)

-----

Issue 1

We have to start somewhere, and the biggest issue to me right now is obviously the screen going black preventing a tty.

Interestingly, using the 0d998891 (bad) commit, the system does boot and I can ssh in.  Just all the screens are black.

Like I explained above, I don't know if this turns out to be the cause of the 4.19 black screen.

-----

Issue 2

[drm] Invalid PCC GPIO: 13!

This error is a red herring as far as the usable screen / black screen issue goes.  It appears in both 0d998891 (bad) and its parent c91b007e (good), so it comes from an earlier commit.  I have no idea whether it's harmful, but even with it, booting c91b007e (good) to a tty works.  Another bisect towards older commits would be needed to find what causes it.

-----

Issue 3 - Maybe an issue 4 or 5 in here too?

[drm:dm_pp_get_static_clocks [amdgpu]] *ERROR* DM_PPLIB: invalid powerlevel state: 0!
...
[drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
[drm:amdgpu_device_init.cold.14 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: Fatal error during GPU init
(stacktrace)

The rest of the errors in my original attachment, such as the ones briefly shown just above this paragraph, don't show in my good or bad commit.  So, another bisect towards newer commits would be needed to find what causes these.  Is this a single commit that introduces all of these errors?  Could there be multiple commits causing all of this?  Who knows.





-----

Deeper on issue 1, regarding this bad commit

I'm vimdiff'ing the new attached journalctl's with ":%s/Nov 21 ..:..:.. //g".  These are interesting (to me) differences:


archlinux kernel:   Magic number: 10:966:801
archlinux kernel: acpi PNP0F03:00: hash matches
===good above becomes bad below - probably pseudo-random noise but not sure so including===
archlinux kernel:   Magic number: 10:413:850
archlinux kernel:  index2: hash matches
(line repeats 32 times, the number of cores I have)
archlinux kernel: processor cpu14: hash matches


Then at :1625(good) and :1663(bad) we see what changes between the good and bad commits, regarding drm/fbcon.

[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] DM_MST: added connector: (____ptrval____) [id: 76] [master: (____ptrval____)]
[drm] fb mappable at 0xC05BC000
[drm] vram apper at 0xC0000000
[drm] size 14745600
[drm] fb depth is 24
[drm]    pitch is 10240
fbcon: amdgpudrmfb (fb0) is primary device
switching from power state:
        ui class: performance
        internal class: none
        caps:
        uvd    vclk: 0 dclk: 0
                power level 0    sclk: 76600 mclk: 150000 pcie gen: 3 pcie lanes: 16
                power level 1    sclk: 105000 mclk: 150000 pcie gen: 3 pcie lanes: 16
        status: c
switching to power state:
        ui class: performance
        internal class: none
        caps:
        uvd    vclk: 0 dclk: 0
                power level 0    sclk: 30000 mclk: 15000 pcie gen: 3 pcie lanes: 16
                power level 1    sclk: 105000 mclk: 150000 pcie gen: 3 pcie lanes: 16
        status: r
[drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 241500
===good above becomes bad below===
[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] amdgpu_dm_irq_schedule_work FAILED src 8
[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] DM_MST: added connector: (____ptrval____) [id: 76] [master: (____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] amdgpu_dm_irq_schedule_work FAILED src 12
[drm] DM_MST: added connector: (____ptrval____) [id: 143] [master: (____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] DM_MST: added connector: (____ptrval____) [id: 220] [master: (____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] DM_MST: added connector: (____ptrval____) [id: 183] [master: (____ptrval____)]
[drm] DM_MST: added connector: (____ptrval____) [id: 236] [master: (____ptrval____)]
[drm] Cannot find any crtc or sizes
[drm] DM_MST: added connector: (____ptrval____) [id: 266] [master: (____ptrval____)]
[drm] Cannot find any crtc or sizes
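
The timestamp-stripping comparison used above (the vim substitute followed by vimdiff) can be approximated in Python. This is only a sketch: it reuses the same "Nov 21 ..:..:.. " pattern, the function names are made up, and the timestamps in any sample input are invented rather than taken from the attachments.

```python
import difflib
import re

# Same pattern as the vim substitute ":%s/Nov 21 ..:..:.. //g".
TIMESTAMP = re.compile(r"Nov 21 ..:..:.. ")


def strip_timestamps(text):
    """Remove the journalctl date/time prefix from every line."""
    return [TIMESTAMP.sub("", line) for line in text.splitlines()]


def diff_logs(good_text, bad_text):
    """Unified diff of two logs after their timestamps are removed."""
    return list(difflib.unified_diff(
        strip_timestamps(good_text),
        strip_timestamps(bad_text),
        fromfile="good",
        tofile="bad",
        lineterm="",
    ))
```

Printing the result with `"\n".join(diff_logs(good, bad))` gives a diff in which only real content changes show up, not the differing boot times.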


My original comment gave kernel parameters relating to radeon/amd.  The journalctls had it all.  At first, I worried that abbreviating what I said in the comment might have thrown things off for the devs, because the "bad" commit has to do with fb, and I do use some fbcon kernel parameters.  But, trying my "bad" commit and even Arch 4.19 without the fbcon kernel parameters still leads to a black screen.  It's in the journalctls, but my full kernel line is:

initrd=intel-ucode.img initrd=initramfs-linux.img root=/dev/lvm/arch rw consoleblank=0 fbcon=scrollback:128k fbcon=rotate:3 intel_iommu=on radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1
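
To make the driver-related options on a line like the one above easier to compare between reports, here is a minimal Python sketch that groups the `module.option=value` tokens by module. The helper name `module_params` is made up for illustration; it is not part of any kernel tooling.

```python
from collections import defaultdict

def module_params(cmdline):
    """Group 'module.option=value' tokens of a kernel command line by module."""
    grouped = defaultdict(dict)
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        module, dot, option = key.partition(".")
        # Only tokens shaped like module.option=value are kept; plain options
        # such as 'rw' or 'root=...' have no dot in the key and are skipped.
        if sep and dot and module and option:
            grouped[module][option] = value
    return dict(grouped)

CMDLINE = (
    "initrd=intel-ucode.img initrd=initramfs-linux.img root=/dev/lvm/arch rw "
    "consoleblank=0 fbcon=scrollback:128k fbcon=rotate:3 intel_iommu=on "
    "radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1"
)

params = module_params(CMDLINE)
print(params["amdgpu"])
```

Note that the shape test is purely syntactic, so any other dotted key on the command line would be grouped the same way.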

Comment 14 jamespharvey20 2018-11-22 00:15:11 UTC

Created attachment 142559 [details]
journalctl of c91b007ed137, which gets to a tty

Comment 15 jamespharvey20 2018-11-22 00:15:56 UTC

Created attachment 142560 [details]
journalctl of 0d9988910989, which gets to a black screen

Comment 16 jamespharvey20 2018-11-22 00:48:36 UTC

I do want to acknowledge that I don't know that all of the problematic commits between 4.18.16 and 4.19 were written or even signed off by AMD.  I still put most of that on AMD, because I just don't get why AMD doesn't at least have a bunch of systems lined up with different cards, each running linux master, seeing when things break.  (Hey, system 37's screen is black again.)

Maybe that's completely unfair of me.  And, maybe it's more card-specific than chipset-specific, contrary to what I'm thinking, and testing wouldn't be practical due to the number of combinations.  As a user, it's infuriating to keep running into problems with my AMD video cards, whether it be in amdgpu or mesa.  Especially after the 2.5 year old bug #91880 was fixed 3 months ago with the kernel parameters I'm using.

Comment 17 jamespharvey20 2018-11-22 01:38:32 UTC

I also really want to make clear that none of my frustration is toward a specific developer.  My code doesn't always work either.  I'm not frustrated with whoever wrote whatever commits are causing this.  I'm frustrated with AMD as a whole for not having what seems to be an adequate way of testing, before something is sent to Linus to be released.

Comment 18 freedesktop 2018-11-22 17:01:07 UTC

I just hit this as well with 4.19 on Fedora and a R9 390X - Grub shows fine, then no video output after that (monitor goes into power save), and boot doesn't seem to continue (no disk activity etc.)

I removed amdgpu.dpm=1 from my kernel params and was able to boot with 4.19.x - I noticed that everyone who mentions kernel params on this bug has this param present too - try without it?

This is a regression vs. 4.18.x - with that kernel series I was able to boot with amdgpu.dpm=1 without issue.

Comment 19 Jim Haddad 2018-11-22 18:56:05 UTC

This happened to me 7 days ago when Fedora replaced kernel-4.18.18-300.fc29 with kernel-4.19.2-300.fc29.  Also on kernel-4.19.3-300.fc29 from yesterday.  On a different hard drive I tried rawhide and kernel-4.20.0-0.rc3.git1.1.fc30.  Same thing.  I have also been using radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1 because it crashes without them.  Removing amdgpu.dpm=1 didn't fix this.  Removing all of these didn't fix this.  I have a Sapphire R9 290.  Fedora says downgrading the kernel isn't supported but downgrading to kernel-4.18.18-300.fc29 seems to work.

Comment 20 jamespharvey20 2018-11-23 09:42:17 UTC

freedesktop, I'm glad to hear removing "amdgpu.dpm=1" allows you to boot 4.19.  Unfortunately, I've tried that, and it made no difference.  Even if it did allow booting, that would bring back the 2.5 year old stability bug 91880, unless that's been fixed in 4.19 to no longer need this and all the other kernel parameters.  I've also tried github.com/torvalds/linux master with and without this kernel parameter and the others, and with and without the patch on bug 108704, and nothing has worked so far.

Comment 21 Shecks 2018-11-23 14:42:43 UTC

I'd like to report that I have been experiencing the same issue. After upgrading (Fedora 29) from kernel 4.18.18 to 4.19.2 my PC would no longer boot.

I have an MSI R9 390 GPU and have been using the following kernel parameters successfully with previous kernels:

amdgpu.dc=1 radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1

After upgrading to kernel 4.19.2 my PC will boot to GRUB but then, as others have posted, the screen goes black (just before the Plymouth loader is displayed and the switch from text to graphics mode) and then the monitor goes into standby as no video output is detected.

I have tried booting without the rhgb kernel parameter but this does not solve the issue, however as suggested by freedesktop above, I removed the amdgpu.dpm=1 kernel parameter and can now successfully boot the 4.19.2 kernel.

With previous builds I found that my PC was not stable without the amdgpu.dpm=1 parameter: the GPU would crash as soon as I did anything graphics-intensive (I presume this was due to the GPU power state changing). So I have run the Unigine Heaven benchmark on 4.19.2 without the parameter, and so far it seems to be stable. (Previously, on kernels without amdgpu.dpm=1, the GPU would crash either before Unigine Heaven had been started or shortly afterward.)

I hope this helps pinpoint the issue and if anyone can suggest other tests I can run to get more information please let me know.

Comment 22 Alex Deucher 2018-11-23 21:46:17 UTC

As noted in comment 4, please try this patch:
https://patchwork.freedesktop.org/patch/259364/

Comment 23 jamespharvey20 2018-11-23 22:42:36 UTC

Don't know if comment 22 re patch 259364 was directed toward me or not. If me, see comment 6 & 20, where I tried it. It's also applied to the journalctl I'm about to upload, and the text of this comment.

After so many people here and elsewhere saying removing "amdgpu.dpm=1" fixes the problem for them, I retried it. I still get a black screen. But, on further analysis, removing it might untangle the spaghetti.

Leaving off "amdgpu.dpm=1" shows the bug in commit 0d998891 is still in git master(7c98a42.) Attached is a journalctl from git master(7c98a42) - with the bug 108704 patch applied. (I'm not sure my setup needs that, just applying it to reduce potential bugs.)

* On 4.18.16, using dpm=1, as required to workaround bug 91880, works.
* On 4.19, using dpm=1, I get black screens with errors as shown in my original post.
* On 4.19, leaving dpm=1 out, I still get a black screen, but the errors given match the errors from commit 0d998891 as shown in comment 13, toward the bottom under "Deeper on issue 1, regarding this bad commit". They're the "amdgpu_dm_irq_schedule_work FAILED src / Cannot find any crtc or sizes" errors, previously attached as "journalctl of 0d9988910989, which gets to a black screen".
* Commit 0d998891 is what introduces the "irq_schedule_work / crtc or sizes" error. Its parent c91b007e doesn't have them and works fine.

So, 4.19 breaks "amdgpu.dpm=1" as others have shown, and removing the parameter bypasses that bug. The commit that breaks dpm must be somewhere after 0d998891, because otherwise that commit shouldn't be showing me the errors that are hidden on 4.19 and git master.

And, 4.19 also breaks without dpm for myself and whoever else that 0d998891 breaks.

With 4.19, the code causing the bug from dpm executes first in the auto-kms stage, and if that kernel parameter isn't given, the code causing the bug from 0d998891 executes after that.

Because the dpm bug is in a more recent commit than 0d998891, it appeared to me that the bug in 0d998891 was fixed in some more recent commit. But, that was only because running 4.19 or git master runs into the dpm bug, which hides the 0d998891 bug because execution never gets that far.
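
The masking described above can be sketched as a small journal classifier in Python. This is only an illustration under assumptions: the function name `classify_journal` is made up, the dpm signature strings are taken from this bug's summary (the exact wording in a real journal may differ), and the crtc string is taken from the excerpt in comment 13. "amdgpu_dm_irq_schedule_work FAILED" is deliberately not used as a signature, because it also appears in the good commit's log.

```python
def classify_journal(text):
    """Return which of the two failure patterns discussed here a journal shows."""
    lowered = text.lower()
    # The dpm failure aborts GPU init, so it is checked first: when it is
    # present, the later fbcon/crtc messages are never reached.
    if "fatal error during gpu init" in lowered or "invalid powerlevel state" in lowered:
        return "dpm"
    # Signature of the behaviour introduced by commit 0d998891.
    if "cannot find any crtc or sizes" in lowered:
        return "0d998891"
    return "none"

bad_commit_excerpt = """\
[drm] amdgpu_dm_irq_schedule_work FAILED src 10
[drm] DM_MST: added connector: (____ptrval____) [id: 76] [master: (____ptrval____)]
[drm] Cannot find any crtc or sizes
"""

print(classify_journal(bad_commit_excerpt))
```

A result of "dpm" says nothing about whether the 0d998891 problem is also present; that can only be judged from a boot that gets past GPU init.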

Since some others are not running into the 0d998891 bug, notably Shecks' MSI R9 390, I'm thinking my specific Asus STRIX-R9390-DC3OC-8GD5-GAMING and some others implement something differently or wrongly. I don't even know enough about what video card manufacturers add on top of the chipset to know if that's possible, but it's the only thing that makes sense to me. Unless it's an interaction between the R9 390 and running a server Xeon board, running 5 DisplayPort monitors through 3 DisplayPorts having 2 chained, or something else specific to me and some others but not Shecks.

I don't plan on bisecting to determine which commit breaks dpm. I'll only be doing it if 0d998891 is fixed, and leaving the kernel parameter off brings back the stability problems in bug 91880 -- and if the vega 64 card I'm getting Wednesday doesn't run perfectly -- and if no one else does it.

Comment 24 jamespharvey20 2018-11-23 22:43:48 UTC

Created attachment 142603 [details]
journalctl of git master 7c98a4261827, with patch 259364, which gets to a black screen

Comment 25 Alex Deucher 2018-11-25 19:08:59 UTC

(In reply to jamespharvey20 from comment #23)
> Don't know if comment 22 re patch 259364 was directed toward me or not.  If
> me, see comment 6 & 20, where I tried it.  It's also applied to the
> journalctl I'm about to upload, and the text of this comment.

It was directed at the others that have posted on this bug.  That patch fixed various issues on a number of cards including hawaii for other users.  It seems there may be several issues at play here.

Comment 26 Linnea S 2018-11-27 17:11:04 UTC

The patch from bug 108704 doesn't help for me, the system still boots to a black screen. The card is an MSI Radeon R9 390 Gaming 8G running on (patched) kernel 4.19.4-300.fc29.x86_64. The system boots if I leave out dpm=1, but I haven't tested to see if it's stable.

Comment 27 freedesktop 2018-11-28 11:28:43 UTC

OK there seems to be something screwy going on with the amdgpu.dpm option - if that's set to 0, it still results in a blank screen. It has to be removed entirely from the kernel command line for boot to work for me on 4.19.

I'm not sure if that's a different bug or not, but can anyone who had success with removing the option, also test with amdgpu.dpm=0 to see if it still results in a blank screen?
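
For anyone running that comparison, a small Python sketch can produce the three command lines to test (parameter absent, `amdgpu.dpm=0`, `amdgpu.dpm=1`) from an existing one. The helper name `dpm_variants` is hypothetical, and the sketch only handles the `amdgpu.dpm=<value>` form.

```python
def dpm_variants(cmdline):
    """Return the command line with amdgpu.dpm absent, set to 0, and set to 1."""
    # Drop any existing amdgpu.dpm=<value> token, keep everything else in order.
    base = [t for t in cmdline.split() if not t.startswith("amdgpu.dpm=")]
    return {
        "absent": " ".join(base),
        "dpm=0": " ".join(base + ["amdgpu.dpm=0"]),
        "dpm=1": " ".join(base + ["amdgpu.dpm=1"]),
    }

current = "radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.dpm=1 amdgpu.dc=1"
for label, line in dpm_variants(current).items():
    print(f"{label}: {line}")
```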

Also, are FD BZ email notifications broken for anyone else?

Comment 28 Linnea S 2018-11-28 11:45:09 UTC

(In reply to freedesktop from comment #27)
> OK there seems to be something screwy going on with the amdgpu.dpm option -
> if that's set to 0, it still results in a blank screen. It has to be removed
> entirely from the kernel command line for boot to work for me on 4.19.
> 
> I'm not sure if that's a different bug or not, but can anyone who had
> success with removing the option, also test with amdgpu.dpm=0 to see if it
> still results in a blank screen?
> 
> Also, are FD BZ email notifications broken for anyone else?

I'm seeing the same thing. amdgpu.dpm=0 causes a blank screen, leaving it out enables the system to boot.

Comment 29 pascalp 2018-11-28 14:46:50 UTC

For me it is the other way round:
With my 290x I used to boot fine under 4.18 without amdgpu.dpm set at all. After updating to 4.19 and being unable to boot I read that amdgpu.dpm defaults to "enabled" so I explicitly had to add amdgpu.dpm=0 to be able to boot again. amdgpu.dc does not matter in this context - both 0 and 1 work as long as dpm is disabled. This is on an unpatched kernel (4.19.4 currently).

Comment 30 alex.vl 2018-12-02 06:34:46 UTC

Created attachment 142687 [details] [review]
Patch to workaround FATAL issue on VCE2.0 ringtest initialization .

Comment 31 alex.vl 2018-12-02 06:44:04 UTC

Comment on attachment 142687 [details] [review]
Patch to workaround FATAL issue on VCE2.0 ringtest initialization .

Note: It's just a workaround.

I see 2 problems:
 - An issue with HW initialization (why does the VCE2.0 ring test not pass?)
 - A driver cleanup issue (the memory manager is not clean during takedown)

Comment 32 Shecks 2018-12-02 15:53:47 UTC

(In reply to freedesktop from comment #27)
> OK there seems to be something screwy going on with the amdgpu.dpm option -
> if that's set to 0, it still results in a blank screen. It has to be removed
> entirely from the kernel command line for boot to work for me on 4.19.
> 
> I'm not sure if that's a different bug or not, but can anyone who had
> success with removing the option, also test with amdgpu.dpm=0 to see if it
> still results in a blank screen?
> 
> Also, are FD BZ email notifications broken for anyone else?

I can confirm that attempting to boot with either amdgpu.dpm=0 or amdgpu.dpm=1 results in the same black screen issue on my PC

I am now running Fedora 29 with kernel 4.19.5 (MSI R9 390 8GB Gaming GPU) and the only way to boot is by completely removing the amdgpu.dpm kernel parameter.

Comment 33 Alex Deucher 2018-12-03 03:07:07 UTC

(In reply to alex.vl from comment #31)
> Comment on attachment 142687 [details] [review]
> Patch to workaround FATAL issue on VCE2.0 ringtest initialization .
> 
> Note: Its just workaround .
> 
> I see 2 problems :
>  - Issue with HW initialization. ( why VCE2.0 ringtest not pass ) 
>  - Driver-cleanup issue : cause ( Memory manager not clean during takedown.)

Likely a duplicate of bug 108608.  Have you tried the patch there (also mentioned in comment 22)?

Comment 34 alex.vl 2018-12-03 13:16:02 UTC

(In reply to Alex Deucher from comment #33)
> (In reply to alex.vl from comment #31)
> > Comment on attachment 142687 [details] [review]
> > Patch to workaround FATAL issue on VCE2.0 ringtest initialization .
> > 
> > Note: Its just workaround .
> > 
> > I see 2 problems :
> >  - Issue with HW initialization. ( why VCE2.0 ringtest not pass ) 
> >  - Driver-cleanup issue : cause ( Memory manager not clean during takedown.)
> 
> Likely a duplicate of bug 108608.  Have you tried the patch there (also
> mentioned in comment 22)?

FYI: no success after applying the patch (https://patchwork.freedesktop.org/patch/259364/). This doesn't seem to be bug 108608, but something else.

Notes on my case (HW config: CPU: FX-9590; MB: M5A99Fx pro r2.0; GPU: R9 290 4G), kernel linux-4.19.6:
     - Boots normally until the amdgpu module loads
     - Then "black screen"
     - After ~5..10 sec it falls back to VESA mode.

Comment 35 alex.vl 2018-12-03 13:26:42 UTC

Created attachment 142701 [details]
Full dmesg && lspci  (failure) on linux-4.19.6 + pcifix (https://patchwork.freedesktop.org/patch/259364/)

Comment 36 DanglingPointerException 2018-12-30 22:55:51 UTC

I was getting similar unusable issues as everyone with 4.19.x with my R9-290X so I decided to wait for 4.20

I installed 4.20.0 to see if the problem has been resolved but it has changed to a different problem; it only renders one monitor and cannot detect the monitor types.
Only one DVI and one HDMI port works, all the rest don't and show black screens.
Using the only two ports that actually output a signal forces the two monitors to mirror.  
Going into display settings to identify monitors shows only 1 monitor hardware detected and I am unable to identify monitors.
Rebooting doesn't fix problem.
Cold boot doesn't fix problem.
These problems do NOT exist with Linux Kernel 4.18.20.  It just worked.

I would really appreciate a resolution to this with future Kernels 4.20.x!  I'm switching back to Linux Kernel 4.18.20.

# Kernel and command line
-----------
kernel: Linux version 4.20.0-042000-generic (kernel@tangerine) (gcc version 8.2.0 (Ubuntu 8.2.0-12ubuntu1)) #201812232030 SMP Mon Dec 24 01:32:58 UTC 2018
kernel: Command line: BOOT_IMAGE=/vmlinuz-4.20.0-042000-generic root=/dev/mapper/ubuntu--vg-root ro quiet splash radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1 amdgpu.dc=1


# "lspci -v" for Linux Kernel 4.20.0
-----------
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT / Grenada XT [Radeon R9 290X/390X] (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT / Grenada XT [Radeon R9 290X/390X]
	Flags: fast devsel, IRQ 16
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at ef800000 (64-bit, prefetchable) [size=8M]
	I/O ports at ae00 [size=256]
	Memory at fb980000 (32-bit, non-prefetchable) [size=256K]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [270] #19
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Kernel modules: radeon, amdgpu

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii HDMI Audio [Radeon R9 290/290X / 390/390X]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii HDMI Audio [Radeon R9 290/290X / 390/390X]
	Flags: bus master, fast devsel, latency 0, IRQ 32
	Memory at fb9fc000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel


# "journalctl -b | grep drm" output for last boot Kernel 4.20 only one display working. Other display forced mirroring. Other GPU ports not working.
---------------------------------------
[drm] amdgpu kernel modesetting enabled.
[drm] initializing kernel modesetting (HAWAII 0x1002:0x67B0 0x1002:0x0B00 0x00).
[drm] register mmio base: 0xFB980000
[drm] register mmio size: 262144
[drm] add ip block number 0 <cik_common>
[drm] add ip block number 1 <gmc_v7_0>
[drm] add ip block number 2 <cik_ih>
[drm] add ip block number 3 <gfx_v7_0>
[drm] add ip block number 4 <cik_sdma>
[drm] add ip block number 5 <powerplay>
[drm] add ip block number 6 <dm>
[drm] add ip block number 7 <uvd_v4_2>
[drm] add ip block number 8 <vce_v2_0>
[drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[drm:gmc_v7_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[drm:amdgpu_device_init.cold.31 [amdgpu]] *ERROR* sw_init of IP block <gmc_v7_0> failed -2
[drm] amdgpu: finishing device.


# "journalctl | grep drm" output for last SUCCESSFUL boot of linux Kernel 4.18.20. All displays working. Everything worked ok. Here for reference
---------------------------------------
[drm] amdgpu kernel modesetting enabled.
fb: switching to amdgpudrmfb from VESA VGA
[drm] initializing kernel modesetting (HAWAII 0x1002:0x67B0 0x1002:0x0B00 0x00).
[drm] register mmio base: 0xFB980000
[drm] register mmio size: 262144
[drm] probing gen 2 caps for device 8086:151 = 261ac83/e
[drm] probing mlw for device 8086:151 = 261ac83
[drm] add ip block number 0 <cik_common>
[drm] add ip block number 1 <gmc_v7_0>
[drm] add ip block number 2 <cik_ih>
[drm] add ip block number 3 <ci_dpm>
[drm] add ip block number 4 <dm>
[drm] add ip block number 5 <gfx_v7_0>
[drm] add ip block number 6 <cik_sdma>
[drm] add ip block number 7 <uvd_v4_2>
[drm] add ip block number 8 <vce_v2_0>
[drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[drm] Detected VRAM RAM=4096M, BAR=256M
[drm] RAM width 512bits GDDR5
[drm] amdgpu: 4096M of VRAM memory ready
[drm] amdgpu: 4096M of GTT memory ready.
[drm] GART: num cpu pages 262144, num gpu pages 262144
[drm] PCIE GART of 1024M enabled (table at 0x000000F4007E9000).
[drm] Internal thermal controller with fan control
[drm] Invalid PCC GPIO: 13!
[drm] amdgpu: dpm initialized
[drm] Found UVD firmware Version: 1.64 Family ID: 9
[drm] Found VCE firmware Version: 50.10 Binary ID: 2
[drm] PCIE gen 3 link speeds already enabled
[drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[drm] Display Core initialized with v3.1.44!
[drm] SADs count is: -524, don't need to read it
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
[drm] UVD initialized successfully.
[drm] VCE initialized successfully.
[drm] fb mappable at 0xD0BD0000
[drm] vram apper at 0xD0000000
[drm] size 8294400
[drm] fb depth is 24
[drm]    pitch is 7680
fbcon: amdgpudrmfb (fb0) is primary device
[drm] dce_get_required_clocks_state: clocks unsupported disp_clk 681000 pix_clk 148500
amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
[drm] Initialized amdgpu 3.26.0 20150101 for 0000:01:00.0 on minor 0
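
One difference between the two logs above that is easy to miss by eye is the list of IP blocks. A minimal Python sketch (the names `IP_BLOCK` and `ip_blocks` are made up for illustration) extracts the "add ip block number" lines from both excerpts and reports what differs. It only compares the logs and says nothing about the cause of the failure.

```python
import re

# Matches lines such as: [drm] add ip block number 5 <powerplay>
IP_BLOCK = re.compile(r"add ip block number (\d+) <(\w+)>")

def ip_blocks(log):
    """Return the IP block names in the order the driver registered them."""
    return [name for _, name in IP_BLOCK.findall(log)]

LOG_4_20 = """\
[drm] add ip block number 0 <cik_common>
[drm] add ip block number 1 <gmc_v7_0>
[drm] add ip block number 2 <cik_ih>
[drm] add ip block number 3 <gfx_v7_0>
[drm] add ip block number 4 <cik_sdma>
[drm] add ip block number 5 <powerplay>
[drm] add ip block number 6 <dm>
[drm] add ip block number 7 <uvd_v4_2>
[drm] add ip block number 8 <vce_v2_0>
"""

LOG_4_18 = """\
[drm] add ip block number 0 <cik_common>
[drm] add ip block number 1 <gmc_v7_0>
[drm] add ip block number 2 <cik_ih>
[drm] add ip block number 3 <ci_dpm>
[drm] add ip block number 4 <dm>
[drm] add ip block number 5 <gfx_v7_0>
[drm] add ip block number 6 <cik_sdma>
[drm] add ip block number 7 <uvd_v4_2>
[drm] add ip block number 8 <vce_v2_0>
"""

new, old = ip_blocks(LOG_4_20), ip_blocks(LOG_4_18)
print("only in 4.20.0: ", sorted(set(new) - set(old)))
print("only in 4.18.20:", sorted(set(old) - set(new)))
```

On these excerpts the 4.20.0 log lists <powerplay> where the 4.18.20 log lists <ci_dpm>, and the registration order differs as well.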

Comment 37 Alex Deucher 2018-12-31 15:08:01 UTC

(In reply to DanglingPointerException from comment #36)
> [drm:gmc_v7_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
> [drm:amdgpu_device_init.cold.31 [amdgpu]] *ERROR* sw_init of IP block
> <gmc_v7_0> failed -2
> [drm] amdgpu: finishing device.

The driver is not able to find the firmware when it loads.  Please make sure the initrd contains the firmware if you are using one.  Please note that when using amdgpu, the firmware must be in /lib/firmware/amdgpu rather than /lib/firmware/radeon.
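
The "failed -2" in the quoted log is a negated errno value, which is why it points at a missing file. A short Python sketch, assuming it runs on Linux where Python's `errno` numbering matches the kernel's, decodes such a line. The names `SW_INIT_FAIL` and `decode_sw_init_failure` are hypothetical.

```python
import errno
import os
import re

# Matches e.g.: *ERROR* sw_init of IP block <gmc_v7_0> failed -2
SW_INIT_FAIL = re.compile(r"sw_init of IP block <(\w+)> failed (-?\d+)")

def decode_sw_init_failure(line):
    """Return (ip_block, errno_name, description) for a sw_init failure line."""
    match = SW_INIT_FAIL.search(line)
    if match is None:
        return None
    block = match.group(1)
    code = abs(int(match.group(2)))  # the kernel reports errors as negative errno
    return block, errno.errorcode.get(code, "unknown"), os.strerror(code)

line = ("[drm:amdgpu_device_init.cold.31 [amdgpu]] *ERROR* "
        "sw_init of IP block <gmc_v7_0> failed -2")
print(decode_sw_init_failure(line))
```

For the line above, the block is gmc_v7_0 and the code decodes to ENOENT, consistent with the firmware file not being found.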

Comment 38 DanglingPointerException 2019-01-05 13:28:47 UTC

(In reply to Alex Deucher from comment #37)
> (In reply to DanglingPointerException from comment #36)
> > [drm:gmc_v7_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
> > [drm:amdgpu_device_init.cold.31 [amdgpu]] *ERROR* sw_init of IP block
> > <gmc_v7_0> failed -2
> > [drm] amdgpu: finishing device.
> 
> The driver is not able to find the firmware when it loads.  Please make sure
> the initrd contains the firmware if you are using one.  Please note that
> when using amdgpu, the firmware must be in /lib/firmware/amdgpu rather than
> /lib/firmware/radeon.

Thanks for the tip/hint on how to go about starting to sort it out. I have SOLVED the problem and the R9-290X now fully works with Mesa 18.3 and Linux Kernel 4.20.0

[SOLVED]
My Solution for those wishing to migrate to 4.20.0 with R9-290/X

1) Removed amdgpu.dpm=x completely from the Linux command line and updated grub. Neither '0' nor '1' works: the system will NOT boot, NOT even to a tty
2) copied /lib/firmware/radeon/* to /lib/firmware/amdgpu/
3) backed-up all contents of /lib/firmware/radeon/*
4) deleted /lib/firmware/radeon/
5) ensured initrd for 4.20.0 was in the /boot location
6) sudo update-initramfs -u
7) confirm contents of functioning/working kernel via "~$ lsinitramfs /boot/initrd.img-<YOUR-KERNEL>-generic | grep hawaii" It needs to still point to the /lib/firmware/radeon even though we have deleted it.
8) confirm contents of new kernel that isn't functioning. For me the kernel was "~$ lsinitramfs /boot/initrd.img-4.20.0-042000-generic | grep hawaii"  It should only contain /lib/firmware/amdgpu/*
9) Restore /lib/firmware/radeon/* from backup. This is so you can recover to the previous kernel version if necessary.
10) Restart/Reboot
11) [OPTIONAL-IMPORTANT] If all is working well (it is for me), then to avoid conflicts with future kernels, delete /lib/firmware/radeon, THEN delete all previous kernels prior to the new kernel that is now functioning.  If you do NOT do this AND install a new kernel AND then run the command update-initramfs, then you will have duplicate paths in the initrd for the future kernel.  I'm not sure what happens in that case; I'm not testing to find out as I haven't got time for it.
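
The checks in steps 7 and 8 can be sketched in Python as a helper that counts hawaii firmware files per directory in an `lsinitramfs` listing. This is a sketch under assumptions: `firmware_dirs` is a made-up name, and the sample listing and its file names are illustrative, not copied from a real initrd.

```python
def firmware_dirs(listing, chip="hawaii"):
    """Count firmware files for `chip` per directory in an lsinitramfs listing."""
    counts = {}
    for path in listing.splitlines():
        directory, _, filename = path.strip().rpartition("/")
        # Keep only files named after the chip that live under a firmware directory.
        if filename.startswith(chip) and "firmware" in directory:
            counts[directory] = counts.get(directory, 0) + 1
    return counts

# Illustrative listing only; real output comes from
# "lsinitramfs /boot/initrd.img-<YOUR-KERNEL>-generic".
sample = """\
lib/firmware/amdgpu/hawaii_mc.bin
lib/firmware/amdgpu/hawaii_smc.bin
lib/modules/4.20.0-042000-generic/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
"""

print(firmware_dirs(sample))
```

If the result for the new kernel's initrd contains a radeon firmware directory, or no amdgpu firmware directory at all, that matches the situation the steps above are meant to correct.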

Comment 39 i.kalvachev 2019-01-11 00:04:32 UTC

Just as a test, if you still have an issue, would you try with "iommu=soft"?

e.g. I see in jamespharvey20's log that he has intel_iommu enabled.

Comment 40 Martin Peres 2019-11-19 09:04:57 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/612.
