104825 – [amdgpu] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) when unbinding

Bug 104825 - [amdgpu] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) when unbinding

Summary: [amdgpu] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=...

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu (show other bugs)
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-01-29 06:57 UTC by mlen
Modified:	2018-04-02 20:49 UTC (History)
CC List:	2 users (show)

See Also:
i915 platform:
i915 features:

Attachments
DAL warnings during driver unbind (14.77 KB, text/plain) 2018-02-27 18:10 UTC, Andrey Grodzovsky	no flags	Details
[PATCH] drm/amd/display: Use atomic crtc_disable for DC on shutdown (1.19 KB, patch) 2018-02-27 19:01 UTC, Harry Wentland	no flags	Details \| Splinter Review
View All

Description mlen 2018-01-29 06:57:03 UTC

I use two amdgpu rx480 cards. During boot one of them is rebound to vfio-pci driver using the following script:

pci_ids=("0000:03:00.0" "0000:03:00.1")

for id in "${pci_ids[@]}"; do
  vendor="$(cat "/sys/bus/pci/devices/$id/vendor")"
  device="$(cat "/sys/bus/pci/devices/$id/device")"

  if [ -e "/sys/bus/pci/devices/$id/driver/unbind" ]; then
    echo "$id" >"/sys/bus/pci/devices/$id/driver/unbind"
  fi

  echo "$vendor $device" >/sys/bus/pci/drivers/vfio-pci/new_id
done

Starting from Linux 4.15 with amdgpu DC enabled (I wanted to use it for HDMI audio), unbind operation causes general protection failure:

[   68.011473] [drm] amdgpu: finishing device.
[   68.377945] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   68.575193] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   68.770107] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   68.971775] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   69.164265] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   69.350089] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   69.538302] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   69.729260] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   69.729733] general protection fault: 0000 [#1] PREEMPT SMP PTI
[   69.730901] Modules linked in:
[   69.731936] CPU: 2 PID: 3934 Comm: openrc-run.sh Not tainted 4.15.0-gentoo #2
[   69.733009] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017
[   69.734240] RIP: 0010:dm_read_reg_func.isra.0+0x3/0xc
[   69.735314] RSP: 0018:ffffa80d8bd4fc40 EFLAGS: 00010282
[   69.736353] RAX: ccea607dac10c354 RBX: ffff95af35bfca80 RCX: 0000000180200008
[   69.737408] RDX: 0000000180200009 RSI: 0000000000005c02 RDI: ffff95af362b91c0
[   69.738494] RBP: ffff95af35b75c90 R08: 0000000000000001 R09: ffffffffb30fced2
[   69.739561] R10: ffff95af2f770a40 R11: 00000000ffffff80 R12: ffff95af35f30100
[   69.740638] R13: 0000000000000000 R14: ffff95af358a3c20 R15: ffff95bf2d5f9c20
[   69.742007] FS:  00007fe7530af740(0000) GS:ffff95af3da00000(0000) knlGS:0000000000000000
[   69.743083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   69.744134] CR2: 000000c420870008 CR3: 0000001027426003 CR4: 00000000003606e0
[   69.745190] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   69.746246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   69.747320] Call Trace:
[   69.748359]  destroy+0x21/0x9c
[   69.749395]  dal_i2caux_destruct+0x6f/0xab
[   69.750459]  destroy+0x15/0x27
[   69.751495]  dal_i2caux_destroy+0x26/0x2f
[   69.752537]  destruct+0x86/0xfd
[   69.753571]  dc_destroy+0x11/0x22
[   69.754626]  dm_hw_fini+0x1e/0x22
[   69.755632]  amdgpu_fini+0xf3/0x2d6
[   69.756616]  amdgpu_device_fini+0x5c/0x158
[   69.757597]  amdgpu_driver_unload_kms+0x6b/0x7e
[   69.758601]  drm_dev_unregister+0x4c/0xc6
[   69.759584]  amdgpu_pci_remove+0x19/0x37
[   69.760576]  pci_device_remove+0x3b/0x8b
[   69.761563]  device_release_driver_internal+0x125/0x1f9
[   69.762580]  unbind_store+0x60/0x90
[   69.763568]  kernfs_fop_write+0x111/0x159
[   69.764557]  __vfs_write+0x33/0xd7
[   69.765543]  ? preempt_count_sub+0x8b/0x94
[   69.766552]  ? __sb_start_write+0xc0/0x180
[   69.767525]  vfs_write+0xa5/0xe2
[   69.768490]  SyS_write+0x5f/0xa3
[   69.769439]  do_syscall_64+0x72/0x81
[   69.770419]  entry_SYSCALL64_slow_path+0x25/0x25
[   69.771373] RIP: 0033:0x7fe7529a1408
[   69.772324] RSP: 002b:00007ffe1885c200 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   69.773300] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fe7529a1408
[   69.774290] RDX: 000000000000000d RSI: 0000558994641890 RDI: 0000000000000001
[   69.775290] RBP: 0000558994641890 R08: 000000000000000a R09: 00005589946475f0
[   69.776260] R10: 000000000000009b R11: 0000000000000246 R12: 000000000000000d
[   69.777248] R13: 0000000000000001 R14: 00007fe752c6e740 R15: 000000000000000d
[   69.778244] Code: f7 74 24 04 89 43 5c 48 8b 4c 24 08 65 48 33 0c 25 28 00 00 00 4c 89 e0 74 05 e8 39 58 9e ff 48 83 c4 10 5b 5d 41 5c c3 48 8b 07 <48> 8b 40 30 e9 03 4a 70 00 0f 1f 44 00 00 48 8b 47 30 8b 70 04 
[   69.779406] RIP: dm_read_reg_func.isra.0+0x3/0xc RSP: ffffa80d8bd4fc40
[   69.780491] ---[ end trace 6aa4681ba3a43ec3 ]---
[   71.815503] [drm:amdgpu_fill_buffer] *ERROR* Trying to clear memory with ring turned off.
[   71.899258] amdgpu 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[   71.899263] amdgpu 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[   72.217918] [drm:amdgpu_fill_buffer] *ERROR* Trying to clear memory with ring turned off.

Comment 1 Harry Wentland 2018-01-29 15:47:15 UTC

This patch https://patchwork.freedesktop.org/patch/198719/ should fix it, but there could be some other issues as well.

amd-staging-drm-next has fixes for a whole bunch of driver unload issues, including what you're seeing. It's hosted at https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

Can you try the patch and/or amd-staging-drm-next?

Comment 2 mlen 2018-01-30 08:00:14 UTC

I'll try running amd-staging-drm-next later today

Comment 3 mlen 2018-01-31 07:21:14 UTC

I tested amd-staging-drm-next with HEAD at f1367d12f5fabb04789c7772594887434c8d9e8b. This time the unbind succeeded, but there are still some errors logged and kernel reports locking problem in amdgpu:

[   77.098923] [drm] amdgpu: finishing device.
[   77.458614] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   77.481247] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[   77.653815] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   77.845085] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   77.855055] IPv6: ADDRCONF(NETDEV_CHANGE): virbr10: link becomes ready
[   78.036695] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   78.233244] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   78.425058] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   78.616635] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   78.808323] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
[   78.810659] amdgpu 0000:03:00.0: 00000000a667dd57 unpin not necessary
[   78.810672] amdgpu 0000:03:00.0: 00000000a7594a2b unpin not necessary

[   78.811733] =====================================
[   78.813109] WARNING: bad unlock balance detected!
[   78.813947] 4.15.0-rc4+ #2 Not tainted
[   78.814835] -------------------------------------
[   78.815731] openrc-run.sh/3931 is trying to release lock (&(&mgr->lock)->rlock) at:
[   78.816646] [<000000006fd39549>] amdgpu_gtt_mgr_fini+0x22/0x37
[   78.817531] but there are no more locks to release!
[   78.818446] 
               other info that might help us debug this:
[   78.820208] 5 locks held by openrc-run.sh/3931:
[   78.821127]  #0:  (sb_writers#6){....}, at: [<00000000322e5044>] vfs_write+0x87/0xe2
[   78.822051]  #1:  (&of->mutex){....}, at: [<00000000660270c4>] kernfs_fop_write+0xca/0x156
[   78.823007]  #2:  (kn->count#211){....}, at: [<000000000634dafb>] kernfs_fop_write+0xd2/0x156
[   78.823936]  #3:  (&dev->mutex){....}, at: [<00000000c386f49f>] unbind_store+0x58/0x90
[   78.824912]  #4:  (&dev->mutex){....}, at: [<00000000eefcc37f>] device_release_driver_internal+0x2f/0x1f3
[   78.825861] 
               stack backtrace:
[   78.827764] CPU: 7 PID: 3931 Comm: openrc-run.sh Not tainted 4.15.0-rc4+ #2
[   78.828747] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017
[   78.829718] Call Trace:
[   78.830717]  dump_stack+0x67/0x8e
[   78.831689]  ? amdgpu_gtt_mgr_fini+0x22/0x37
[   78.832687]  print_unlock_imbalance_bug+0xcc/0xd3
[   78.833657]  lock_release+0x134/0x267
[   78.834646]  ? _raw_spin_unlock+0x2e/0x40
[   78.835605]  _raw_spin_unlock+0x1c/0x40
[   78.836586]  amdgpu_gtt_mgr_fini+0x22/0x37
[   78.837549]  ttm_bo_clean_mm+0x79/0xab
[   78.838544]  amdgpu_ttm_fini+0x75/0x11c
[   78.839507]  amdgpu_bo_fini+0xe/0x2d
[   78.840495]  gmc_v8_0_sw_fini+0x2e/0x49
[   78.841454]  amdgpu_device_ip_fini+0x21f/0x2d3
[   78.842439]  amdgpu_device_fini+0x4c/0x125
[   78.843394]  amdgpu_driver_unload_kms+0x63/0x76
[   78.844373]  drm_dev_unregister+0x49/0xc3
[   78.845318]  amdgpu_pci_remove+0x19/0x37
[   78.846244]  pci_device_remove+0x36/0x86
[   78.847190]  device_release_driver_internal+0x122/0x1f3
[   78.848120]  unbind_store+0x60/0x90
[   78.849069]  kernfs_fop_write+0x10e/0x156
[   78.849997]  __vfs_write+0x31/0xcc
[   78.850937]  ? preempt_count_sub+0x8b/0x94
[   78.851871]  ? __sb_start_write+0xc0/0x180
[   78.852828]  vfs_write+0xa5/0xe2
[   78.853755]  SyS_write+0x5f/0xa3
[   78.854708]  do_syscall_64+0x6c/0x7b
[   78.855630]  entry_SYSCALL64_slow_path+0x25/0x25
[   78.856583] RIP: 0033:0x7fc5804b6408
[   78.857511] RSP: 002b:00007ffd95228060 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   78.858484] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc5804b6408
[   78.859438] RDX: 000000000000000d RSI: 000055b229e9e890 RDI: 0000000000000001
[   78.860419] RBP: 000055b229e9e890 R08: 000000000000000a R09: 000055b229ea45f0
[   78.861379] R10: 000000000000009b R11: 0000000000000246 R12: 000000000000000d
[   78.862364] R13: 0000000000000001 R14: 00007fc580783740 R15: 000000000000000d
[   78.863411] [drm] amdgpu: ttm finalized

Comment 4 Harry Wentland 2018-02-27 00:09:29 UTC

Good to hear DC issues are gone. Not sure about the unlock balance myself.

Comment 5 Andrey Grodzovsky 2018-02-27 10:55:58 UTC

Hi, can you please provide full dmesg for the unbind sequence ? 
So the issue happens when you unbind the card to from the host driver and before binding it to VFIO ?

Andrey

Comment 6 mlen 2018-02-27 11:48:27 UTC

Currently I don't have access to that machine, I should be able to test next week. I'll extend the script to be more verbose and writing some delimiters to /dev/kmsg

Comment 7 Andrey Grodzovsky 2018-02-27 14:51:03 UTC

On Carrizo unbinding  the driver will indeed show [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disabled failed (scratch(0xC040)=0xCAFEDEAD), I don't have LOCKDEP related kflags enabled in kernel for some reason so that probably why I don't see the locking imbalance warning, will rebuild and check.

What card model are you using ?

Andrey

Comment 8 mlen 2018-02-27 16:17:54 UTC

I'm using two RX 480 cards, but I'm rebinding only one of them

Comment 9 Andrey Grodzovsky 2018-02-27 18:10:47 UTC

Created attachment 137652 [details]
DAL warnings during driver unbind

Harry, with DAL enabled I observe warnings and unbind can't complete - 
to reproduce under root do 

cd /sys/bus/pci/drivers/amdgpu/
echo 'pci_id' > unbind where pci_id is seen when doing ls -l on this folder and seeing soft link like this 
0000:2a:00.0 -> ../../../../devices/pci0000:00/0000:00:03.1/0000:2a:00.0/

Comment 10 Andrey Grodzovsky 2018-02-27 18:12:46 UTC

P.S I am using CZ or Ellsmire with tip of amd-staging-drm-next. Going to disable DAL for now to debug the KCQ ring failure. 

P.S After enabling lockdep i still don't see any locking related warnings.

Comment 11 Harry Wentland 2018-02-27 19:01:26 UTC

Created attachment 137662 [details] [review]
[PATCH] drm/amd/display: Use atomic crtc_disable for DC on shutdown

Andrey, can you see if this fixes the warning for you?

Comment 12 Andrey Grodzovsky 2018-02-27 19:17:54 UTC

(In reply to Harry Wentland from comment #11)
> Created attachment 137662 [details] [review] [review]
> [PATCH] drm/amd/display: Use atomic crtc_disable for DC on shutdown
> 
> Andrey, can you see if this fixes the warning for you?

Get use after free now 

[   82.400097 <    0.000387>] BUG: KASAN: use-after-free in amdgpu_dm_set_pflip_irq_state+0x3d/0xa0 [amdgpu]
[   82.400185 <    0.000088>] Read of size 4 at addr ffff88008f53ee94 by task bash/1178

[   82.400302 <    0.000117>] CPU: 0 PID: 1178 Comm: bash Tainted: G        W  OE    4.16.0-rc1.main+ #14
[   82.400308 <    0.000006>] Hardware name: AMD Gardenia/Gardenia, BIOS RGA1101C 07/20/2015
[   82.400312 <    0.000004>] Call Trace:
[   82.400329 <    0.000017>]  dump_stack+0x5c/0x78
[   82.400342 <    0.000013>]  print_address_description+0xd1/0x270
[   82.400618 <    0.000276>]  ? amdgpu_dm_set_pflip_irq_state+0x3d/0xa0 [amdgpu]
[   82.400627 <    0.000009>]  kasan_report+0x260/0x360
[   82.400913 <    0.000286>]  amdgpu_dm_set_pflip_irq_state+0x3d/0xa0 [amdgpu]
[   82.401189 <    0.000276>]  amdgpu_irq_disable_all+0x111/0x190 [amdgpu]
[   82.401452 <    0.000263>]  amdgpu_device_ip_fini+0x1b7/0x610 [amdgpu]
[   82.401718 <    0.000266>]  amdgpu_device_fini+0xa1/0x320 [amdgpu]
[   82.401973 <    0.000255>]  amdgpu_driver_unload_kms+0x6a/0xd0 [amdgpu]
[   82.402059 <    0.000086>]  drm_dev_unregister+0x79/0x180 [drm]
[   82.402315 <    0.000256>]  amdgpu_pci_remove+0x2a/0x60 [amdgpu]
[   82.402331 <    0.000016>]  pci_device_remove+0x5b/0x100
[   82.402347 <    0.000016>]  device_release_driver_internal+0x1da/0x300
[   82.402363 <    0.000016>]  unbind_store+0x143/0x190
[   82.402376 <    0.000013>]  ? sysfs_file_ops+0xa0/0xa0
[   82.402386 <    0.000010>]  kernfs_fop_write+0x186/0x220
[   82.402402 <    0.000016>]  __vfs_write+0xb9/0x2e0
[   82.402412 <    0.000010>]  ? locks_remove_posix+0x87/0x220
[   82.402421 <    0.000009>]  ? kernel_read+0xa0/0xa0
[   82.402430 <    0.000009>]  ? find_held_lock+0xfb/0x130
[   82.402441 <    0.000011>]  ? __lock_acquire.isra.30+0x414/0xb00
[   82.402465 <    0.000024>]  ? vfs_write+0x227/0x250
[   82.402485 <    0.000020>]  ? __sb_start_write+0xc3/0x1a0
[   82.402492 <    0.000007>]  ? vfs_write+0x227/0x250
[   82.402506 <    0.000014>]  vfs_write+0xe6/0x250
[   82.402522 <    0.000016>]  SyS_write+0xa1/0x120
[   82.402532 <    0.000010>]  ? SyS_read+0x120/0x120
[   82.402547 <    0.000015>]  ? vtime_user_exit+0xc8/0xe0
[   82.402558 <    0.000011>]  ? SyS_read+0x120/0x120
[   82.402570 <    0.000012>]  do_syscall_64+0xf0/0x270
[   82.402588 <    0.000018>]  entry_SYSCALL_64_after_hwframe+0x21/0x86
[   82.402597 <    0.000009>] RIP: 0033:0x7fd050bc32c0
[   82.402603 <    0.000006>] RSP: 002b:00007ffc06b7f3b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   82.402615 <    0.000012>] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fd050bc32c0
[   82.402620 <    0.000005>] RDX: 000000000000000d RSI: 0000000001a6e408 RDI: 0000000000000001
[   82.402626 <    0.000006>] RBP: 0000000001a6e408 R08: 00007fd050e92780 R09: 00007fd0514d9700
[   82.402632 <    0.000006>] R10: 000000000000000c R11: 0000000000000246 R12: 000000000000000d
[   82.402637 <    0.000005>] R13: 0000000000000001 R14: 00007fd050e91620 R15: 0000000000000000

[   82.402711 <    0.000074>] Allocated by task 1084:
[   82.402771 <    0.000060>]  kasan_kmalloc+0xa6/0xd0
[   82.402780 <    0.000009>]  kmem_cache_alloc_trace+0x13a/0x270
[   82.403079 <    0.000299>]  dm_hw_init+0x898/0x1660 [amdgpu]
[   82.403338 <    0.000259>]  amdgpu_device_init+0x1a97/0x2100 [amdgpu]
[   82.403596 <    0.000258>]  amdgpu_driver_load_kms+0xa8/0x3a0 [amdgpu]
[   82.403673 <    0.000077>]  drm_dev_register+0x1d5/0x2f0 [drm]
[   82.403931 <    0.000258>]  amdgpu_pci_probe+0x1bf/0x290 [amdgpu]
[   82.403941 <    0.000010>]  local_pci_probe+0x74/0xe0
[   82.403951 <    0.000010>]  pci_device_probe+0x1dc/0x2d0
[   82.403970 <    0.000019>]  driver_probe_device+0x40e/0x6b0
[   82.403977 <    0.000007>]  __driver_attach+0x11d/0x130
[   82.403984 <    0.000007>]  bus_for_each_dev+0xd8/0x140
[   82.403990 <    0.000006>]  bus_add_driver+0x31d/0x3a0
[   82.403998 <    0.000008>]  driver_register+0xc6/0x170
[   82.404006 <    0.000008>]  do_one_initcall+0x82/0x1d0
[   82.404012 <    0.000006>]  do_init_module+0xe7/0x333
[   82.404020 <    0.000008>]  load_module+0x41b3/0x4c40
[   82.404028 <    0.000008>]  SYSC_finit_module+0x14d/0x180
[   82.404036 <    0.000008>]  do_syscall_64+0xf0/0x270
[   82.404044 <    0.000008>]  entry_SYSCALL_64_after_hwframe+0x21/0x86

[   82.404095 <    0.000051>] Freed by task 1178:
[   82.404152 <    0.000057>]  __kasan_slab_free+0x124/0x170
[   82.404159 <    0.000007>]  kfree+0xd4/0x200
[   82.404239 <    0.000080>]  drm_mode_config_cleanup+0x241/0x450 [drm]
[   82.404536 <    0.000297>]  amdgpu_dm_fini+0x29/0xb0 [amdgpu]
[   82.404834 <    0.000298>]  dm_hw_fini+0x1e/0x30 [amdgpu]
[   82.405091 <    0.000257>]  amdgpu_device_ip_fini+0x157/0x610 [amdgpu]
[   82.405349 <    0.000258>]  amdgpu_device_fini+0xa1/0x320 [amdgpu]
[   82.405607 <    0.000258>]  amdgpu_driver_unload_kms+0x6a/0xd0 [amdgpu]
[   82.405684 <    0.000077>]  drm_dev_unregister+0x79/0x180 [drm]
[   82.405941 <    0.000257>]  amdgpu_pci_remove+0x2a/0x60 [amdgpu]
[   82.405949 <    0.000008>]  pci_device_remove+0x5b/0x100
[   82.405957 <    0.000008>]  device_release_driver_internal+0x1da/0x300
[   82.405963 <    0.000006>]  unbind_store+0x143/0x190
[   82.405971 <    0.000008>]  kernfs_fop_write+0x186/0x220
[   82.405978 <    0.000007>]  __vfs_write+0xb9/0x2e0
[   82.405985 <    0.000007>]  vfs_write+0xe6/0x250
[   82.405991 <    0.000006>]  SyS_write+0xa1/0x120
[   82.405998 <    0.000007>]  do_syscall_64+0xf0/0x270
[   82.406007 <    0.000009>]  entry_SYSCALL_64_after_hwframe+0x21/0x86

[   82.406057 <    0.000050>] The buggy address belongs to the object at ffff88008f53e600
                               which belongs to the cache kmalloc-4096 of size 4096
[   82.406163 <    0.000106>] The buggy address is located 2196 bytes inside of
                               4096-byte region [ffff88008f53e600, ffff88008f53f600)
[   82.406262 <    0.000099>] The buggy address belongs to the page:
[   82.406326 <    0.000064>] page:ffffea00023d4e00 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
[   82.406424 <    0.000098>] flags: 0x1ffff0000008100(slab|head)
[   82.406488 <    0.000064>] raw: 01ffff0000008100 0000000000000000 0000000000000000 0000000100070007
[   82.406571 <    0.000083>] raw: dead000000000100 dead000000000200 ffff880102802600 0000000000000000
[   82.406649 <    0.000078>] page dumped because: kasan: bad access detected

[   82.406754 <    0.000105>] Memory state around the buggy address:
[   82.406816 <    0.000062>]  ffff88008f53ed80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   82.406893 <    0.000077>]  ffff88008f53ee00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   82.406968 <    0.000075>] >ffff88008f53ee80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   82.407036 <    0.000068>]                          ^
[   82.407087 <    0.000051>]  ffff88008f53ef00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   82.407157 <    0.000070>]  ffff88008f53ef80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   82.407226 <    0.000069>] ==================================================================

Comment 13 Harry Wentland 2018-03-27 18:02:41 UTC

This should be fixed in drm-next-4.17-wip and amd-staging-drm-next. Can someone test if this resolves this ticket satisfactorily?

Both branches can be found on https://cgit.freedesktop.org/~agd5f/linux

Comment 14 Andrey Grodzovsky 2018-03-27 20:20:57 UTC

(In reply to Harry Wentland from comment #13)
> This should be fixed in drm-next-4.17-wip and amd-staging-drm-next. Can
> someone test if this resolves this ticket satisfactorily?
> 
> Both branches can be found on https://cgit.freedesktop.org/~agd5f/linux

Yes, DAL issues are gone, also gone KIQ ring error on unbind. 
Still there is a KIQ ring error on rebind which I am investigating now + there is SDMA ring IB test failure which I will get to after KIQ is resloved.

Andrey

Comment 15 Andrey Grodzovsky 2018-03-29 16:07:42 UTC

(In reply to mlen from comment #3)
> I tested amd-staging-drm-next with HEAD at
> f1367d12f5fabb04789c7772594887434c8d9e8b. This time the unbind succeeded,
> but there are still some errors logged and kernel reports locking problem in
> amdgpu:
> 
> [   77.098923] [drm] amdgpu: finishing device.
> [   77.458614] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   77.481247] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
> [   77.653815] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   77.845085] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   77.855055] IPv6: ADDRCONF(NETDEV_CHANGE): virbr10: link becomes ready
> [   78.036695] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   78.233244] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   78.425058] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   78.616635] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)
> [   78.808323] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed
> (scratch(0xC040)=0x00000000)

Can you retest with latest https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.17-wip to see if KCQ related errors are gone ?

Comment 16 mlen 2018-03-29 19:55:41 UTC

I'll test it on saturday, I don't have access to that machine at the moment

Comment 17 mlen 2018-04-01 09:51:16 UTC

No issues on drm-next-4.17-wip, unbinding works without any lockdep issues.

Do you plan to backport these changes to 4.16 or maybe even 4.15?


For the record, suspend fails on drm-next-4.17-wip. This could be possibly related, but I don't know how to debug this and most likely this is out of scope of this bug.

[   96.222095] PM: suspend entry (deep)
[   96.222099] PM: Syncing filesystems ... done.
[   96.230020] INFO: trying to register non-static key.
[   96.230024] the code is fine but needs lockdep annotation.
[   96.230026] turning off the locking correctness validator.
[   96.230029] CPU: 39 PID: 4506 Comm: pm-suspend Not tainted 4.16.0-rc7+ #2
[   96.230031] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017
[   96.230033] Call Trace:
[   96.230040]  dump_stack+0x46/0x59
[   96.230046]  register_lock_class+0x192/0x361
[   96.230050]  ? cycles_2_ns+0x55/0x75
[   96.230054]  __lock_acquire.isra.30+0x97/0x595
[   96.230057]  lock_acquire+0x105/0x12e
[   96.230060]  ? devres_for_each_res+0x41/0xc4
[   96.230064]  ? __fw_entry_found+0x3b/0x3b
[   96.230068]  _raw_spin_lock_irqsave+0x3d/0x74
[   96.230071]  ? devres_for_each_res+0x41/0xc4
[   96.230073]  ? kref_get+0xa/0xa
[   96.230075]  ? alloc_fw_cache_entry+0x4e/0x4e
[   96.230077]  devres_for_each_res+0x41/0xc4
[   96.230096]  dev_cache_fw_image+0x59/0x11d
[   96.230098]  ? fw_pm_notify+0xd1/0xd1
[   96.230102]  dpm_for_each_dev+0x41/0x58
[   96.230104]  fw_pm_notify+0xac/0xd1
[   96.230108]  notifier_call_chain+0x39/0x5a
[   96.230127]  __blocking_notifier_call_chain+0x4e/0x65
[   96.230130]  __pm_notifier_call_chain+0x1b/0x2f
[   96.230133]  pm_suspend+0x15b/0x2c1
[   96.230135]  state_store+0x4b/0x7e
[   96.230140]  kernfs_fop_write+0x114/0x15c
[   96.230145]  __vfs_write+0x33/0xd7
[   96.230148]  ? __sb_start_write+0x94/0x180
[   96.230150]  ? __sb_start_write+0xc0/0x180
[   96.230153]  vfs_write+0xa5/0xe2
[   96.230156]  SyS_write+0x5f/0xa3
[   96.230160]  do_syscall_64+0x79/0x88
[   96.230164]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   96.230166] RIP: 0033:0x7fac7a914468
[   96.230168] RSP: 002b:00007ffea25554e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   96.230171] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fac7a914468
[   96.230172] RDX: 0000000000000004 RSI: 0000558679eb4500 RDI: 0000000000000001
[   96.230174] RBP: 0000558679eb4500 R08: 000000000000000a R09: 0000558679eeb6b0
[   96.230175] R10: 00007fac7a9a5b20 R11: 0000000000000246 R12: 0000000000000004
[   96.230177] R13: 0000000000000001 R14: 00007fac7abe1740 R15: 0000000000000004
[   96.230183] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[   96.230189] IP: devres_for_each_res+0x59/0xc4
[   96.230191] PGD 0 P4D 0 
[   96.230196] Oops: 0000 [#1] PREEMPT SMP PTI
[   96.230199] Modules linked in:
[   96.230203] CPU: 39 PID: 4506 Comm: pm-suspend Not tainted 4.16.0-rc7+ #2
[   96.230205] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017
[   96.230208] RIP: 0010:devres_for_each_res+0x59/0xc4
[   96.230210] RSP: 0018:ffffa63689fabc70 EFLAGS: 00010086
[   96.230214] RAX: 0000000000000000 RBX: ffff9fd574d4ad98 RCX: ffff9fd574d4b1f0
[   96.230216] RDX: ffff9fd576ea0918 RSI: ffff9fd574d4b1c0 RDI: ffff9fd576ea0918
[   96.230218] RBP: ffffffffa8951b3a R08: 0000000000000000 R09: 0000000000000000
[   96.230221] R10: ffff9fd576ea0000 R11: ffffffffab22aa07 R12: ffffffffa8951f44
[   96.230223] R13: dead000000000100 R14: ffffffffa8951c08 R15: ffff9fd574d4b1a8
[   96.230226] FS:  00007fac7b021740(0000) GS:ffff9fe57f400000(0000) knlGS:0000000000000000
[   96.230229] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   96.230231] CR2: 0000000000000008 CR3: 000000202a4a8002 CR4: 00000000003606e0
[   96.230233] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   96.230236] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   96.230238] Call Trace:
[   96.230243]  dev_cache_fw_image+0x59/0x11d
[   96.230247]  ? fw_pm_notify+0xd1/0xd1
[   96.230250]  dpm_for_each_dev+0x41/0x58
[   96.230253]  fw_pm_notify+0xac/0xd1
[   96.230256]  notifier_call_chain+0x39/0x5a
[   96.230261]  __blocking_notifier_call_chain+0x4e/0x65
[   96.230264]  __pm_notifier_call_chain+0x1b/0x2f
[   96.230267]  pm_suspend+0x15b/0x2c1
[   96.230271]  state_store+0x4b/0x7e
[   96.230275]  kernfs_fop_write+0x114/0x15c
[   96.230279]  __vfs_write+0x33/0xd7
[   96.230284]  ? __sb_start_write+0x94/0x180
[   96.230286]  ? __sb_start_write+0xc0/0x180
[   96.230290]  vfs_write+0xa5/0xe2
[   96.230295]  SyS_write+0x5f/0xa3
[   96.230299]  do_syscall_64+0x79/0x88
[   96.230303]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   96.230305] RIP: 0033:0x7fac7a914468
[   96.230308] RSP: 002b:00007ffea25554e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   96.230311] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fac7a914468
[   96.230313] RDX: 0000000000000004 RSI: 0000558679eb4500 RDI: 0000000000000001
[   96.230316] RBP: 0000558679eb4500 R08: 000000000000000a R09: 0000558679eeb6b0
[   96.230318] R10: 00007fac7a9a5b20 R11: 0000000000000246 R12: 0000000000000004
[   96.230320] R13: 0000000000000001 R14: 00007fac7abe1740 R15: 0000000000000004
[   96.230326] Code: 48 83 ec 28 48 89 4c 24 18 4c 89 4c 24 20 e8 ca 7d 4b 00 48 8d 8b 58 04 00 00 48 89 44 24 08 48 8b 83 60 04 00 00 48 89 4c 24 10 <4c> 8b 68 08 48 3b 44 24 10 74 44 4c 3b 70 10 75 35 48 83 c0 28 
[   96.230390] RIP: devres_for_each_res+0x59/0xc4 RSP: ffffa63689fabc70
[   96.230392] CR2: 0000000000000008
[   96.230395] ---[ end trace 43f33fa700a0efa9 ]---
[   96.233392] note: pm-suspend[4506] exited with preempt_count 1

Comment 18 Harry Wentland 2018-04-02 20:49:31 UTC

Thanks for your bug report, and constant testing and feedback. Since the original issue is fixed in drm-next-4.17-wip I'll mark this resolved. If I'm jumping the gun here and you still notice problems feel free to reopen.

The suspend/resume issue you mention seems to be something else and might be best tracked with a new bug report, possibly on kernel.org as I don't see an indication that this is graphics driver related.

As for back-porting the changes, AFAIK this bug report uncovered quite a few issues which were all fixed with separate commits. I don't think all of them have been pulled into 4.16.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.