Bug 110822

Summary: [Bisected] Booting with kernel version 5.1.0 or higher on RX 580 hangs
Product: DRI
Reporter: Gobinda Joy <gobinda.joy>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: blocker
Priority: medium
CC: gobinda.joy, johan.gardhage
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description: Linux version 5.1.6-350.vanilla.knurd.1.fc30.x86_64; Flags: none
Description: Linux version 5.1.0-0.rc3.git2.1.fc31.x86_64; Flags: none
Description: Linux version 5.1.0-0.rc3.git3.1.fc31.x86_64; Flags: none

Description Gobinda Joy 2019-06-03 09:37:22 UTC
Created attachment 144420 [details]
Linux version 5.1.6-350.vanilla.knurd.1.fc30.x86_64

My hardware is as follows:
CPU: i7 3770 at stock clock
Motherboard: Gigabyte G1.Sniper 3 (latest available BIOS)
RAM: 24 GB DDR3 at 1600 MHz
GPU: RX 580 8GB (Sapphire, latest VBIOS)

The problem occurs with kernel 5.1.0 or higher (currently 5.1.6): the display hangs when the amdgpu driver loads. I'm unable to determine whether booting continues or hangs as well. Disk activity stops after a couple of seconds, and it is not possible to switch TTYs.
Ctrl+Alt+Del is unresponsive as well.

This problem goes away when amdgpu.dpm=0 is used, but in that case dynamic power scaling is not available, the GPU is stuck at its low clock, and graphics performance is abysmal. The GPU temperature and fan speed utilities don't work either.

Here is the excerpt of the problematic log lines:

Jun 02 09:54:05 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:06 kernel: amdgpu: [powerplay] 
                         failed to send message 15b ret is 65535 
Jun 02 09:54:06 kernel: hrtimer: interrupt took 287743313 ns
Jun 02 09:54:06 kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Jun 02 09:54:06 kernel: clocksource:                       'hpet' wd_now: 628dd7b wd_last: 5fef431 mask: ffffffff
Jun 02 09:54:06 kernel: clocksource:                       'tsc' cs_now: 254aa24747 cs_last: 25104a5bfd mask: ffffffffffffffff
Jun 02 09:54:06 kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         failed to send message 148 ret is 65535 
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         failed to send message 145 ret is 65535 
Jun 02 09:54:08 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:08 kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jun 02 09:54:08 kernel: sched_clock: Marking unstable (8791691311, 362291)<-(8817904668, -25851212)
Jun 02 09:54:08 kernel: amdgpu: [powerplay] 
                         failed to send message 146 ret is 65535 
Jun 02 09:54:08 kernel: hid-generic 0003:09DA:FC7C.0003: input,hidraw2: USB HID v1.11 Mouse [COMPANY USB Device] on usb-0000:00:1a.0-1.5.3/input0
Jun 02 09:54:09 kernel: hid-generic 0003:09DA:FC7C.0004: hiddev97,hidraw3: USB HID v1.11 Device [COMPANY USB Device] on usb-0000:00:1a.0-1.5.3/input1
Jun 02 09:54:11 kernel: clocksource: Switched to clocksource hpet
Jun 02 09:54:13 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:13 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:17 kernel: [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0
Jun 02 09:54:17 kernel: EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
Jun 02 09:54:20 kernel: amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
Jun 02 09:54:21 kernel: [drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test failed (-110).
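
For anyone triaging a similar journal, the excerpt above can be condensed with a short script. The following is a minimal Python sketch, not part of the original report: the function name summarize_powerplay_failures and the sample text are illustrative assumptions, and the message IDs (15b, 148, 260, ...) are treated as opaque strings rather than interpreted. For reference, the "ret is 65535" value equals 0xFFFF, and the -110 in the IB test errors is the negated Linux errno ETIMEDOUT.

```python
import re
from collections import Counter

# Matches the continuation line of a powerplay failure, e.g.
# "failed to send message 15b ret is 65535".
FAILED_SEND = re.compile(
    r"failed to send message\s+([0-9a-fA-F]+)\s+ret is\s+(\d+)"
)


def summarize_powerplay_failures(log_text):
    """Count 'failed to send message' entries.

    Returns a Counter keyed by (message_id, return_value).
    """
    return Counter(
        (msg_id.lower(), int(ret))
        for msg_id, ret in FAILED_SEND.findall(log_text)
    )


# A few lines copied from the excerpt above, used only as sample input.
SAMPLE = """\
Jun 02 09:54:06 kernel: amdgpu: [powerplay]
                         failed to send message 15b ret is 65535
Jun 02 09:54:07 kernel: amdgpu: [powerplay]
                         failed to send message 148 ret is 65535
Jun 02 09:54:13 kernel: amdgpu: [powerplay]
                         failed to send message 260 ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay]
                         failed to send message 260 ret is 65535
"""

if __name__ == "__main__":
    summary = summarize_powerplay_failures(SAMPLE)
    for (msg_id, ret), count in summary.most_common():
        print(f"message {msg_id}: {count} failure(s), ret={ret} (0x{ret:X})")
```

Feeding it the full output of journalctl -k instead of SAMPLE would show which message IDs fail most often, which may help when comparing a working and a failing boot.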

Any help is appreciated. Also let me know if I can help in any way.
Comment 1 Gobinda Joy 2019-06-03 09:42:03 UTC
I've tested kernel versions from 5.1.0 up to the latest git; all show similar problems.

With the 5.2 git versions, booting with amdgpu.dpm=0 on the kernel command line produces the following error:
kernel: [drm] amdgpu kernel modesetting enabled.
kernel: CRAT table not found
kernel: Virtual CRAT table created for CPU
kernel: Parsing CRAT table with 1 nodes
kernel: Creating topology SYSFS entries
kernel: Topology: Add CPU node
kernel: Finished initializing topology
kernel: amdgpu 0000:04:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
kernel: amdgpu 0000:04:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
kernel: amdgpu 0000:04:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf7800000 -> 0xf783ffff
kernel: checking generic (e0000000 300000) vs hw (e0000000 10000000)
kernel: fb0: switching to amdgpudrmfb from EFI VGA
kernel: Console: switching to colour dummy device 80x25
kernel: amdgpu 0000:04:00.0: vgaarb: deactivate vga console
kernel: [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE387 0xE7).
kernel: [drm] register mmio base: 0xF7800000
kernel: [drm] register mmio size: 262144
kernel: [drm] add ip block number 0 <vi_common>
kernel: [drm] add ip block number 1 <gmc_v8_0>
kernel: [drm] add ip block number 2 <tonga_ih>
kernel: [drm] add ip block number 3 <gfx_v8_0>
kernel: [drm] add ip block number 4 <sdma_v3_0>
kernel: [drm] add ip block number 5 <powerplay>
kernel: [drm] add ip block number 6 <dm>
kernel: [drm] add ip block number 7 <uvd_v6_0>
kernel: [drm] add ip block number 8 <vce_v3_0>
kernel: kfd kfd: skipped device 1002:67df, PCI rejects atomics
kernel: [drm] UVD is enabled in VM mode
kernel: [drm] UVD ENC is enabled in VM mode
kernel: [drm] VCE enabled in VM mode
kernel: resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
kernel: caller pci_map_rom+0x6a/0x17d mapping multiple BARs
kernel: amdgpu 0000:04:00.0: No more image in the PCI ROM
kernel: ATOM BIOS: 113-1E3870U-O45
kernel: [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
kernel: [drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
kernel: amdgpu 0000:04:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
kernel: amdgpu 0000:04:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
kernel: [drm] Detected VRAM RAM=8192M, BAR=256M
kernel: [drm] RAM width 256bits GDDR5
kernel: [TTM] Zone  kernel: Available graphics memory: 12350340 KiB
kernel: [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
kernel: [TTM] Initializing pool allocator
kernel: [TTM] Initializing DMA pool allocator
kernel: [drm] amdgpu: 8192M of VRAM memory ready
kernel: [drm] amdgpu: 8192M of GTT memory ready.
kernel: [drm] GART: num cpu pages 65536, num gpu pages 65536
kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
kernel: [drm] Chained IB support enabled!
kernel: [drm] Found UVD firmware Version: 1.130 Family ID: 16
kernel: [drm] Found VCE firmware Version: 53.26 Binary ID: 3
kernel: BUG: unable to handle page fault for address: ffffa5bd8394f650
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 606549067 P4D 606549067 PUD 0 
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 6 PID: 461 Comm: systemd-udevd Not tainted 5.2.0-0.rc1.git1.1.vanilla.knurd.1.fc30.x86_64 #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./G1.Sniper 3, BIOS F8k 04/29/2013
kernel: RIP: 0010:bw_calcs_data_update_from_pplib.isra.0+0x378/0x4d0 [amdgpu]
kernel: Code: 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 7d 00 4c 89 f2 be 02 00 00 00 e8 26 bf f9 ff 8b 04 24 4c 8b 23 be e8 03 00 00 83 e8 01 <8b> 7c 84 04 e8 6f 4d fb ff be e8 03 00 00 49 89 44 24 60 8b 04 24
kernel: RSP: 0018:ffffa5b98394f650 EFLAGS: 00010297
kernel: RAX: 00000000ffffffff RBX: ffff928b34cb92d8 RCX: 0000000000000000
kernel: RDX: ffffa5b98394f58c RSI: 00000000000003e8 RDI: ffff928b39c12800
kernel: RBP: ffff928b34cb9208 R08: 0000000000000020 R09: 000000032a000000
kernel: R10: 00000003ce000000 R11: 0000001770000000 R12: ffff928b3ac0b300
kernel: R13: ffffa5b98394f76c R14: ffffa5b98394f650 R15: ffffffffc0839d60
kernel: FS:  00007f1133ad1940(0000) GS:ffff928b46b80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffa5bd8394f650 CR3: 00000005faf54004 CR4: 00000000001606e0
kernel: Call Trace:
kernel:  dce112_create_resource_pool+0x6de/0x700 [amdgpu]
kernel:  dc_create_resource_pool+0x16c/0x220 [amdgpu]
kernel:  ? dal_gpio_service_create+0x92/0x110 [amdgpu]
kernel:  dc_create+0x219/0x620 [amdgpu]
kernel:  ? amdgpu_cgs_create_device+0x23/0x50 [amdgpu]
kernel:  amdgpu_dm_init+0xeb/0x160 [amdgpu]
kernel:  dm_hw_init+0xe/0x20 [amdgpu]
kernel:  amdgpu_device_init.cold+0x128d/0x161f [amdgpu]
kernel:  ? kmalloc_order+0x14/0x30
kernel:  amdgpu_driver_load_kms+0x88/0x270 [amdgpu]
kernel:  drm_dev_register+0x111/0x150 [drm]
kernel:  amdgpu_pci_probe+0xbd/0x120 [amdgpu]
kernel:  ? __pm_runtime_resume+0x58/0x80
kernel:  local_pci_probe+0x42/0x80
kernel:  pci_device_probe+0x115/0x190
kernel:  really_probe+0xf0/0x390
kernel:  driver_probe_device+0xb6/0x100
kernel:  device_driver_attach+0x53/0x60
kernel:  __driver_attach+0x8a/0x150
kernel:  ? device_driver_attach+0x60/0x60
kernel:  bus_for_each_dev+0x78/0xc0
kernel:  bus_add_driver+0x14a/0x1e0
kernel:  driver_register+0x6c/0xb0
kernel:  ? 0xffffffffc09b9000
kernel:  do_one_initcall+0x46/0x1f4
kernel:  ? _cond_resched+0x15/0x30
kernel:  ? kmem_cache_alloc_trace+0x154/0x1c0
kernel:  ? do_init_module+0x23/0x230
kernel:  do_init_module+0x5c/0x230
kernel:  load_module+0x22eb/0x28e0
kernel:  ? __do_sys_init_module+0x16e/0x1a0
kernel:  __do_sys_init_module+0x16e/0x1a0
kernel:  do_syscall_64+0x5b/0x180
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: RIP: 0033:0x7f1134ad1bae
kernel: Code: 48 8b 0d dd 42 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d aa 42 0c 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffe9cb83118 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
kernel: RAX: ffffffffffffffda RBX: 0000563b364ce650 RCX: 00007f1134ad1bae
kernel: RDX: 0000563b364b50a0 RSI: 00000000006dfa2e RDI: 0000563b36d998b0
kernel: RBP: 0000563b36d998b0 R08: 0000563b364ba730 R09: 0000000000000001
kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000563b364b50a0
kernel: R13: 0000000000000006 R14: 0000563b364c9fa0 R15: 0000000000000000
kernel: Modules linked in: amdgpu(+) amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper crc32c_intel serio_raw drm e1000e(+) alx mdio video wmi vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
kernel: CR2: ffffa5bd8394f650
kernel: ---[ end trace e14f412d43dd70ae ]---
kernel: RIP: 0010:bw_calcs_data_update_from_pplib.isra.0+0x378/0x4d0 [amdgpu]
kernel: Code: 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 7d 00 4c 89 f2 be 02 00 00 00 e8 26 bf f9 ff 8b 04 24 4c 8b 23 be e8 03 00 00 83 e8 01 <8b> 7c 84 04 e8 6f 4d fb ff be e8 03 00 00 49 89 44 24 60 8b 04 24
kernel: RSP: 0018:ffffa5b98394f650 EFLAGS: 00010297
kernel: RAX: 00000000ffffffff RBX: ffff928b34cb92d8 RCX: 0000000000000000
kernel: RDX: ffffa5b98394f58c RSI: 00000000000003e8 RDI: ffff928b39c12800
kernel: RBP: ffff928b34cb9208 R08: 0000000000000020 R09: 000000032a000000
kernel: R10: 00000003ce000000 R11: 0000001770000000 R12: ffff928b3ac0b300
kernel: R13: ffffa5b98394f76c R14: ffffa5b98394f650 R15: ffffffffc0839d60
kernel: FS:  00007f1133ad1940(0000) GS:ffff928b46b80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffa5bd8394f650 CR3: 00000005faf54004 CR4: 00000000001606e0
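
The faulting address in the oops above can be cross-checked by hand. The sketch below is an editorial illustration in Python, not output from any kernel tool. It assumes that the marked bytes "<8b> 7c 84 04" in the Code: line decode to "mov edi, [rsp + rax*4 + 4]" (standard x86-64 ModRM/SIB decoding) and that the preceding "83 e8 01" is "sub eax, 1". With RAX = 0xffffffff, which is what a 32-bit value of 0 becomes after subtracting 1, the effective address lands 16 GiB above RSP and equals CR2. That would be consistent with an array being indexed by "count - 1" while the count is zero, though nothing in this thread confirms the cause.

```python
MASK64 = (1 << 64) - 1


def effective_address(base, index, scale, disp):
    """Compute base + index*scale + disp with 64-bit wraparound."""
    return (base + index * scale + disp) & MASK64


# Register values copied from the oops.
RSP = 0xFFFFA5B98394F650
RAX = 0x00000000FFFFFFFF  # 32-bit result of 0 - 1, zero-extended to 64 bits
CR2 = 0xFFFFA5BD8394F650  # faulting address reported by the kernel

# Assumed addressing mode: [rsp + rax*4 + 4]
fault = effective_address(RSP, RAX, scale=4, disp=4)

if __name__ == "__main__":
    print("computed address:", hex(fault))
    print("matches CR2:", fault == CR2)
    print("offset from RSP:", hex(fault - RSP))
```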
Comment 2 Matt Coffin 2019-06-03 22:03:08 UTC
I'm experiencing the same problem, but with an XFX RX 590 FATBOY.

Strangely enough, I am also running crappy, kinda slow DDR3 RAM at 1600 MHz.

I've read so much on this issue today that I don't know where I read it, but I read somewhere that crappy RAM can cause these kinds of hangs with these cards.

Is there some issue with running the GPU mclk faster than your system's RAM?

My issue does not happen at boot; it only appears once the GPU is under load. Occasionally (seemingly non-deterministically) I see the powerplay errors on startup, but amdgpu continues to load, seemingly OK, until I put it under heavy load.

Do you have a mobo/cpu/ram setup with faster memory that you might be able to test with? I don't know too much about amdgpu internals but given what I read it seems odd that we are both running what would be considered "slow" RAM these days.

My specs:

* Kernel 5.1.3-arch2-1-ARCH
* LLVM 8.0.0
* Mesa 19.0.4
* Card XFX Radeon RX 590 OC+ FATBOY
* CPU: i7 990x extreme edition

Interestingly, I had to set up fancontrol to get the fans to spin up past their minimum values at all, but I suspect that could be due to the "stealth" RX 590 VBIOS being in use. Unfortunately, I don't have access to a Windows machine to flash the "performance" BIOS onto my card, or even to check which version is currently in use.

Let me know if I can help out with any logs/debugging information.
Comment 3 Gobinda Joy 2019-06-04 06:29:27 UTC
Thanks for the reply.

By today's standards my whole system spec is slow, to be honest.

But slow RAM shouldn't cause these problems with powerplay, as I can run the card fine under load or at idle with kernel version 5.0.17. It seems to me that the new commits from AMD for kernel version 5.1 are to blame, but I can't say for certain.

Interestingly enough, if I use amdgpu.dpm=0 to disable dynamic power management, kernel versions up to 5.1.6 work just fine, apart from the poor performance under graphics load.

I was curious, so I checked the commit log on the amd-staging-drm-next branch, and I see some reverts of powerplay-related commits.

Link: https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

I am currently waiting for this to get merged into the master branch.
Comment 4 Alex Deucher 2019-06-04 06:44:18 UTC
(In reply to Gobinda Joy from comment #3)
> I was curious so checked out the commit log on amd-staging-drm-next branch.
> And I see some reverts of powerplay related commits.
> 
> Link: https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> 
> I am currently waiting for this to getting merged with master branch.

Those were just reverts of changes that were accidentally committed just before.  Can you bisect?
Comment 5 Gobinda Joy 2019-06-04 08:34:55 UTC
(In reply to Alex Deucher from comment #4)
> (In reply to Gobinda Joy from comment #3)
> > I was curious so checked out the commit log on amd-staging-drm-next branch.
> > And I see some reverts of powerplay related commits.
> > 
> > Link: https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> > 
> > I am currently waiting for this to getting merged with master branch.
> 
> Those were just reverts of changes that were accidentally committed just
> before.  Can you bisect?
Please help or instruct me on how to do it. I can compile a kernel and use git, but I have never used the bisect command.
Comment 6 Sylvain BERTRAND 2019-06-04 13:35:24 UTC
Bisecting is quite common in the git world. You'll find tons of tutorials on the
web; you're in for a little bit of reading.
Just don't forget to run "git reset --hard" before calling "git bisect good|bad".
(I performed a bisection on Linux just yesterday.)
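
For readers who have not used it, git bisect performs a binary search over the commit history between a known-good and a known-bad revision. The Python sketch below only simulates that search on a made-up list of commit labels, to show why roughly log2(N) test builds are enough. The function find_first_bad and the commit labels are hypothetical, and the sketch assumes that every commit after the first bad one is also bad. It does not call git.

```python
def find_first_bad(commits, is_bad):
    """Return (first_bad_commit, tests_run).

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. is_bad(commit) stands in for building and
    booting that kernel, then reporting the result.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # corresponds to "git bisect bad"
        else:
            good = mid  # corresponds to "git bisect good"
    return commits[bad], tests


if __name__ == "__main__":
    # Hypothetical history of 1000 commits; the regression enters at c0700.
    history = [f"c{i:04d}" for i in range(1000)]
    culprit, tests = find_first_bad(history, lambda c: c >= "c0700")
    print(culprit, "found after", tests, "tests")
```

In a real run, each loop iteration corresponds to building and booting the kernel that git checks out, then marking it with "git bisect good" or "git bisect bad".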
Comment 7 Gobinda Joy 2019-06-04 18:54:21 UTC
(In reply to Sylvain BERTRAND from comment #6)
> bisect is quite common in the git world. You'll find tons of tutorials on the
> web, namely you're good for a little bit of reading.
> Just don't forget to "git reset --hard" before calling "git bisect good|bad".
> (just performed a bisection on linux yesterday).

I figured out how to bisect, but the problem is building the RPM packages to install on my Fedora system. The guide in the wiki is almost 6 years old, and the Python script for building the kernel/modules RPMs doesn't work.

However, I found pre-built kernel packages at https://koji.fedoraproject.org/koji/packageinfo?buildStart=0&packageID=8&buildOrder=-completion_time&tagOrder=name&tagStart=50#buildlist

I've tested the kernels from there. They appear to be built from snapshots of the mainline tree.

I started from the working version 5.0.0 and tested each kernel until the problem first occurred at version 5.1.0-0.rc3.git3.1.fc31.x86_64

I can bisect the kernel at that tag but can't build the RPM packages; I'll have to research further, I guess.

Attached are the logs from the problematic version and the immediately preceding working version.
Comment 8 Gobinda Joy 2019-06-04 18:55:13 UTC
Created attachment 144448 [details]
Linux version 5.1.0-0.rc3.git2.1.fc31.x86_64

Last working version.
Comment 9 Gobinda Joy 2019-06-04 18:56:06 UTC
(In reply to Gobinda Joy from comment #8)
> Created attachment 144448 [details]
> Linux version 5.1.0-0.rc3.git3.1.fc31.x86_64
> 
> Last working version.

Sorry, the version is Linux version 5.1.0-0.rc3.git2.1.fc31.x86_64
Comment 10 Gobinda Joy 2019-06-04 18:57:32 UTC
Created attachment 144450 [details]
Linux version 5.1.0-0.rc3.git3.1.fc31.x86_64

First occurrence of the bug
Comment 11 Gobinda Joy 2019-06-05 09:51:11 UTC
Bisect
ad51c46eec739c18be24178a30b47801b10e0357 is the first bad commit
commit ad51c46eec739c18be24178a30b47801b10e0357
Author: Chengming Gui <Jack.Gui@amd.com>
Date:   Thu Mar 21 13:26:28 2019 +0800

    drm/amd/amdgpu: fix PCIe dpm feature issue (v3)
    
    use pcie_bandwidth_available to get real link state
    to update pcie table.
    
    v2: fix incorrect initialized return value
    v3: expand the fetching method about the link width to all asics.
    
    Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 673bf23a6c2c2e461c6ff36e9bcca9d35c958881 605de484316e97aaa83273d5a08340e873ff851e M      drivers

This is the merge that brought in that commit, along with the others, and introduced the bug:
ea2cec24c8d429ee6f99040e4eb6c7ad627fe777
 Merge tag 'drm-fixes-2019-04-05' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "Pretty quiet week, just some amdgpu and i915 fixes.

  i915:
   - deadlock fix
   - gvt fixes

  amdgpu:
   - PCIE dpm feature fix
   - Powerplay fixes"

* tag 'drm-fixes-2019-04-05' of git://anongit.freedesktop.org/drm/drm:
  drm/i915/gvt: Fix kerneldoc typo for intel_vgpu_emulate_hotplug
  drm/i915/gvt: Correct the calculation of plane size
  drm/amdgpu: remove unnecessary rlc reset function on gfx9
  drm/i915: Always backoff after a drm_modeset_lock() deadlock
  drm/i915/gvt: do not let pin count of shadow mm go negative
  drm/i915/gvt: do not deliver a workload if its creation fails
  drm/amd/display: VBIOS can't be light up HDMI when restart system
  drm/amd/powerplay: fix possible hang with 3+ 4K monitors
  drm/amd/powerplay: correct data type to avoid overflow
  drm/amd/powerplay: add ECC feature bit
  drm/amd/amdgpu: fix PCIe dpm feature issue (v3)

Hope it helps.
Comment 12 Gobinda Joy 2019-06-05 11:47:33 UTC
Actually, after reverting commit ad51c46eec739c18be24178a30b47801b10e0357

    drm/amd/amdgpu: fix PCIe dpm feature issue (v3)
    
    use pcie_bandwidth_available to get real link state
    to update pcie table.
    
    v2: fix incorrect initialized return value
    v3: expand the fetching method about the link width to all asics.
    
    Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

I can boot the latest git snapshot (5.2.0-rc3) without any problem. It seems to me that this is the problematic commit.
Comment 13 b6khqjqov4 2019-06-10 11:09:18 UTC
Your bug sounds like mine, which appeared after I bought a used RX580 and added it to my system. Since then I have had random full system freezes (I think only while using Firefox or the internal Steam Chromium browser) and desktop hangs where Cinnamon would crash and nothing except the mouse pointer would move. There were no errors in the logs at first, but then I found this after the system hung instead of freezing hard:

$ journalctl -p3:

<pre>
...
Jun 09 06:45:34 test systemd-coredump[1383]: Process 1328 (Web Content) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1332:
                                             #0  0x00007f04ecd3ed36 n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
                                             #1  0x0000000000000000 n/a (n/a)
Jun 09 06:45:35 test systemd-coredump[1374]: Process 1106 (firefox.real) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1124:
                                             #0  0x00007fd89155036f raise (libpthread.so.0)
                                             #1  0x00007fd88b6b3a5f n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
Jun 09 06:45:35 test systemd-coredump[1384]: Process 1162 (Web Content) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1164:
                                             #0  0x00007ffb32c6ad36 n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
                                             #1  0x0000000000000000 n/a (n/a)
Jun 09 06:45:38 test systemd-coredump[1385]: Process 1237 (Web Content) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1241:
                                             #0  0x00007f9aa7f3ed36 n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
                                             #1  0x0000000000000000 n/a (n/a)
Jun 09 06:47:31 test systemd-coredump[1640]: Process 1536 (Web Content) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1536:
                                             #0  0x00007f830d3ee3e7 n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
Jun 09 06:47:32 test systemd-coredump[1650]: Process 1603 (Web Content) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1606:
                                             #0  0x00007f8f129ebd36 n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
                                             #1  0x0000000000000000 n/a (n/a)
Jun 09 06:47:32 test systemd-coredump[1639]: Process 1410 (firefox.real) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1410:
                                             #0  0x00007fe94ed5f36f raise (libpthread.so.0)
                                             #1  0x00007fe948ec2a5f n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
Jun 09 06:47:32 test systemd-coredump[1649]: Process 1467 (Web Content) of user 1000 dumped core.
                                             
                                             Stack trace of thread 1469:
                                             #0  0x00007fb2790c2d36 n/a (/home/test/tor-browser_en-US/Browser/libxul.so)
                                             #1  0x0000000000000000 n/a (n/a)
...
</pre>

My system:
- MSI B450 Tomahawk
- Athlon 200GE (yes, this CPU will be upgraded to a Ryzen 3000 one of course ;))
- RX 580 4G Nitro+
- G.Skill Aegis DIMM Kit 16GB, DDR4-3000, CL16-18-18-38 (F4-3000C16D-16GISB)

My fix: two things I remember doing that may have fixed it:
1.) I disabled integrated graphics in the BIOS.
2.) I installed amd-ucode.

No hangs/freezes or anything since my fix. I'll report back here should I encounter a crash/freeze/hang again.
Comment 14 b6khqjqov4 2019-06-10 11:19:18 UTC
(In reply to b6khqjqov4 from comment #13)

Addition:
Arch (through Antergos) with all latest updates:
- $ uname: 5.1.7-arch1-1-ARCH
- $ glxinfo | grep version:
server glx version string: 1.4
client glx version string: 1.4
GLX version: 1.4
    Max core profile version: 4.5
    Max compat profile version: 4.5
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.0.6
OpenGL core profile shading language version string: 4.50
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.0.6
OpenGL shading language version string: 4.50
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 19.0.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
    GL_EXT_shader_implicit_conversions, GL_EXT_shader_integer_mix,
Comment 15 Gobinda Joy 2019-06-10 14:08:42 UTC
(In reply to b6khqjqov4 from comment #14)

I already have updated Intel microcode installed from https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files, which is the latest I can find.

Also, there are no more BIOS updates for my motherboard. You will notice that I have bisected this bug to a specific commit.

That is puzzling to me, because the specified commit adds a call to pcie_bandwidth_available(), which replaces the value already obtained by a call to pcie_get_width_cap() just before it. What is the point of using two functions to get a result from the same source? Also, why is this related to powerplay (DPM)?

Also, after reverting the commit I have no issues whatsoever with my GPU: no hangs or freezes in games or in normal desktop use.
Comment 16 Gobinda Joy 2019-06-10 16:22:00 UTC
(In reply to b6khqjqov4 from comment #14)

This doesn't seem like the same bug. In my case the whole boot process hangs; check the attached log files. In your case you can boot the system, but the problem arises later, maybe when you load the GPU. Not the same bug.
Comment 17 b6khqjqov4 2019-06-11 14:44:13 UTC
(In reply to Gobinda Joy from comment #16)
> This doesn't seems like the same bug. For instance, in my case the whole
> boot process hangs check the attached log files. In your case you can boot
> the system but problem arise when you load the GPU maybe. Not the same bug.

You could still try disabling the integrated graphics in the BIOS. AFAICR I could always log in to the Cinnamon desktop, and only then would I get a hard freeze, anywhere from minutes to several hours later, not during the boot process itself, so maybe it is indeed not the same bug. The RX 590 needed additional commits to be supported in Linux, so maybe these are similar but different issues.
(I assume you can boot without the RX 590, using your CPU's integrated graphics, without any issues with the problematic commit, i.e. it really is an RX 590 issue. Well, it must be, I guess, since it's in AMDgpu and you tested it.)

Because of the Athlon 200GE, my 3000 MHz RAM is running at a slower 2133 MHz, but at that safe speed the RAM should not be the reason for the hangs/freezes; the proof is that you fixed yours by bisecting and I fixed my freezes with the method I mentioned.

I don't know anything about the calls to those functions, and I hope your bug report is on the devs' radar, especially now that you have found the problematic commit ad51c46e.

PS: Fortunately I still have had no freezes/any issues since my fixes (5.1.8-arch1-1-ARCH now).
Comment 18 Gobinda Joy 2019-06-12 04:35:19 UTC
(In reply to b6khqjqov4 from comment #17)
> (In reply to Gobinda Joy from comment #16)
> > This doesn't seems like the same bug. For instance, in my case the whole
> > boot process hangs check the attached log files. In your case you can boot
> > the system but problem arise when you load the GPU maybe. Not the same bug.
> 
> You could still try disabling the integrated graphics in the BIOS. AFAICR I
> could always log in into the Cinnamon desktop and only then I would get a
> hard freeze within minutes up to several hours, not within the boot process
> itself, so maybe indeed not the same bug. The RX 590 needed additional
> commits to be supported in Linux, so maybe similar but different issues.
> (I assume you can you boot without the RX 590 using your CPU's integrated
> graphics without any issues with the problematic commit/it really is a RX590
> issue, well it must be I guess since it's in AMDgpu and you tested it.)
> 
> Because of the Athlon 200GE my 3000 MHz RAM is also running at slower
> 2133MHz, but even at the, in fact, safe 2133, that should not be the reason
> for the hangs/freezes, as the proof is that you fixed it by bisecting and I
> fixed my freezes with my mentioned method. 
> 
> I don't know anything about the calls of those functions and I hope your bug
> report is on the devs' radar, especially that you found out the problematic
> commit ad51c46e, now.
> 
> PS: Fortunately I still have had no freezes/any issues since my fixes
> (5.1.8-arch1-1-ARCH now).

I'm using a discrete GPU, so obviously the integrated GPU is disabled. I'm also using VT-d to pass through a LAN card and a sound card to my Windows VM. I don't use that for gaming, though; I only use Fedora/Wine for gaming. I need the Windows VM for some work-related stuff and sometimes for music, as the sound card driver is superior on Windows.

My 24 GB of DDR3 RAM is running at 1600 MHz (10-10-10-26, maybe; I haven't checked in a while). And you are right, slow RAM shouldn't be the reason for freezes or hangs. Apart from that, the previous kernel was perfect for me. Since your card is an RX 590, you do need the new commits/kernel to support it. I was happy with kernel 5.0.17 until Fedora decided to push the 5.1+ kernel through an update.

I am not sure about those function calls myself. But as I read the source, they traverse the PCIe tree looking for the minimum-bandwidth bottleneck or limiter, and use that to set the maximum bandwidth for the device in question.
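
To make that tree walk concrete, here is a minimal Python sketch of the idea, not the kernel code itself. The `PciNode` class, the `bandwidth_available()` helper and the three-device topology are all hypothetical, made up for illustration; only the per-lane rates (8b/10b encoding up to 5 GT/s, 128b/130b at 8 GT/s) are standard PCIe figures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Usable Mb/s per lane after encoding overhead
# (8b/10b up to 5 GT/s, 128b/130b at 8 GT/s).
MBPS_PER_LANE = {
    "2.5GT/s": 2500 * 8 // 10,
    "5.0GT/s": 5000 * 8 // 10,
    "8.0GT/s": 8000 * 128 // 130,
}


@dataclass
class PciNode:
    """Hypothetical stand-in for one device or bridge in the PCIe tree."""
    name: str
    cur_speed: str                      # currently negotiated link speed
    cur_width: int                      # currently negotiated lane count
    upstream: Optional["PciNode"] = None


def bandwidth_available(dev: PciNode) -> Tuple[int, str]:
    """Walk from dev up to the root and return (min Mb/s, limiting node name)."""
    min_bw = None
    limiting = ""
    node = dev
    while node is not None:
        bw = node.cur_width * MBPS_PER_LANE[node.cur_speed]
        if min_bw is None or bw < min_bw:
            min_bw, limiting = bw, node.name
        node = node.upstream
    return min_bw, limiting


# Made-up topology: a GPU behind a switch port that only trained to 5.0 GT/s.
root = PciNode("root-port", "8.0GT/s", 16)
switch = PciNode("switch-port", "5.0GT/s", 16, upstream=root)
gpu = PciNode("gpu", "8.0GT/s", 16, upstream=switch)
print(bandwidth_available(gpu))
```

In this made-up topology the switch port is the bottleneck, so it is the one reported as the limiter even though the GPU link itself is faster.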

What I don't get is why they are using two calls to get the bandwidth reading. Since both functions walk the PCIe tree, what's the point? Also, it seems the call to pcie_bandwidth_available() is causing the freezes/hangs on my system, so that counts for something.

I hope the devs noticed this too, as they asked me to bisect this. If they don't, I did all this for nothing. Sorry, not for nothing, as I can now run the latest kernel with that commit reverted.
Comment 19 Alex Deucher 2019-06-12 18:30:44 UTC
(In reply to Gobinda Joy from comment #18)
> 
> What I don't get is why they are using 2 calls to get the bandwidth reading.
> Since both function walking the PCIe tree what's the point. Also it seems
> like the call to pcie_bandwidth_available() function is casing the
> freeze/hangs in my system. So that's counts for something.
> 

Can you try a drm-next kernel?  This code was ultimately cleaned up in this patch:
https://cgit.freedesktop.org/drm/drm/commit/?id=dbaa922b5706b1aff4572c280e15bbea2d04afe6
I don't know why pcie_bandwidth_available() is causing problems for you; it's just standard PCIe stuff.
Comment 20 Gobinda Joy 2019-06-13 07:44:41 UTC
(In reply to Alex Deucher from comment #19)
> (In reply to Gobinda Joy from comment #18)
> > 
> > What I don't get is why they are using 2 calls to get the bandwidth reading.
> > Since both function walking the PCIe tree what's the point. Also it seems
> > like the call to pcie_bandwidth_available() function is casing the
> > freeze/hangs in my system. So that's counts for something.
> > 
> 
> Can you try a drm-next kernel?  This code was ultimately cleaned in this
> patch:
> https://cgit.freedesktop.org/drm/drm/commit/
> ?id=dbaa922b5706b1aff4572c280e15bbea2d04afe6
> I don't know why pcie_bandwidth_available() is causing problems for you,
> it's just standard PCIE stuff.

Yes, I have tried the drm-next kernel and also tried that patch on the current 5.2.0-rc4, with the same result: a boot hang. But this time I couldn't even get any log.

As far as I understand this, the difference between the two functions seems to be that one reads the link capability (PCI_EXP_LNKCAP) while the other tries to read the link status (PCI_EXP_LNKSTA), and that causes the problem.
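
For reference, both registers carry a speed field in bits 3:0 and a width field in bits 9:4; the capability register describes what the link could do, the status register what it actually negotiated. The Python sketch below decodes those two fields. The `decode_link()` helper and the sample register values are hypothetical, chosen only for illustration, and the sketch says nothing about why reading the status register would hang.

```python
# Offsets of the two registers inside the PCI Express capability structure.
PCI_EXP_LNKCAP = 0x0C   # Link Capabilities (32-bit)
PCI_EXP_LNKSTA = 0x12   # Link Status (16-bit)

# Both registers keep speed in bits 3:0 and width in bits 9:4.
SPEED_MASK = 0x000F
WIDTH_MASK = 0x03F0
WIDTH_SHIFT = 4

SPEEDS = {1: "2.5 GT/s", 2: "5.0 GT/s", 3: "8.0 GT/s", 4: "16.0 GT/s"}


def decode_link(reg_value):
    """Return (speed string, lane count) from a LNKCAP or LNKSTA value."""
    speed = SPEEDS.get(reg_value & SPEED_MASK, "unknown")
    width = (reg_value & WIDTH_MASK) >> WIDTH_SHIFT
    return speed, width


# Made-up sample values: a slot capable of Gen3 x16 whose link
# is currently running at Gen1 x8.
lnkcap = 0x0103
lnksta = 0x0081
print("capability:", decode_link(lnkcap))
print("status:    ", decode_link(lnksta))
```

On a running system the real values can be compared in the LnkCap and LnkSta lines of `lspci -vv`.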

It could be that an older UEFI BIOS like mine doesn't initialize the device properly before the link status gets accessed, because newer boards don't have this problem.

It could also be that my board has a PLX chip between the CPU and the PCIe slots, so no slot is connected directly to the CPU.

The PLX chip is used to provide two x16 Gen3 PCIe slots and two x8 Gen3 PCIe slots. If all four slots are populated, the first two slots are downgraded to x8 Gen3, as the third and fourth slots share their bandwidth.

If the older method works fine for the newer cards too, is there a reason to use pcie_bandwidth_available() at all?

I'm way out of my league here, so don't be offended; I'm just curious.
Comment 21 Gobinda Joy 2019-06-28 09:44:19 UTC
The latest drm-next (drm-next-5.3-2019-06-27) kernel still has this bug.
Comment 22 Martin Peres 2019-11-19 09:29:46 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/807.