Bug 100399

Summary: Kernel invalid opcode on unbinding amdgpu
Product: DRI Reporter: nospam
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: e.yunak, jimijames.bove
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:

Description nospam 2017-03-26 03:03:51 UTC
I'm not sure where the best place to post this report is, so let me know if there is a better place than here.

I have an RX 480 GPU that I use with amdgpu on Linux 4.11.0-rc3+ (compiled with the Ubuntu 4.8.0 lowlatency config). Everything seemingly works fine until I try to unbind amdgpu from the device. This also happened with Linux 4.10.0-rc3+.

I've reproduced this regardless of whether the amdgpu device is the primary or secondary display device, and whether X is active or not.

Observe:
$ lspci | grep AMD
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
$ echo 01:00.0 | sudo tee /sys/bus/pci/devices/01:00.0/driver/unbind
Segmentation Fault

At this point, the system becomes unstable and some system calls seem to just hang (I'm not sure which exactly, but sudo and ps a break). Trying to shut down the system also hangs.
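
Before writing to a driver's unbind file, it can help to confirm which driver actually owns the device. The following is a minimal sketch, not part of the original report: current_driver is a hypothetical helper name, and SYSFS_ROOT is an assumed override that exists only so the function can be pointed at a test tree instead of /sys.

```shell
#!/bin/bash
# Hypothetical helper: print the name of the driver a PCI device is bound to,
# or "none" if no driver is bound.
# SYSFS_ROOT is an assumption for testing; it defaults to the real /sys.
current_driver() {
    local dev="$1"
    local root="${SYSFS_ROOT:-/sys}"
    local link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The driver entry is a symlink into /sys/bus/pci/drivers/<name>.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

A call such as current_driver 0000:01:00.0 would be expected to print amdgpu while the card is bound. On most systems sysfs lists devices under the domain-qualified address (0000:01:00.0 rather than 01:00.0).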

dmesg output:
[   86.993436] ------------[ cut here ]------------
[   86.993439] kernel BUG at drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c:6930!
[   86.993442] invalid opcode: 0000 [#1] PREEMPT SMP
[   86.993443] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter nf_nat_h323 nf_conntrack_h323 nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c ip_tables x_tables bnep bridge stp llc binfmt_misc dm_snapshot dm_bufio nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf input_leds serio_raw joydev snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi
[   86.993488]  mei_me snd_hda_intel mei snd_hda_codec snd_hda_core intel_pch_thermal snd_hwdep snd_pcm snd_timer snd soundcore hci_uart btbcm btqca btintel bluetooth intel_lpss_acpi intel_lpss shpchp acpi_als acpi_pad mac_hid kfifo_buf tpm_infineon industrialio kvm_intel kvm irqbypass it87 hwmon_vid parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq hid_generic usbhid mxm_wmi amdkfd amd_iommu_v2 i915 amdgpu ttm drm_kms_helper igb e1000e syscopyarea sysfillrect dca psmouse nvme sysimgblt ptp fb_sys_fops nvme_core firewire_ohci pps_core i2c_algo_bit drm ahci firewire_core crc_itu_t libahci wmi video pinctrl_sunrisepoint i2c_hid pinctrl_intel hid fjes
[   86.993519] CPU: 5 PID: 2955 Comm: tee Not tainted 4.11.0-rc3+ #1
[   86.993521] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F4 10/21/2015
[   86.993523] task: ffff8ee839f4d880 task.stack: ffffacb00624c000
[   86.993539] RIP: 0010:gfx_v8_0_kiq_set_interrupt_state+0xce/0xe0 [amdgpu]
[   86.993541] RSP: 0018:ffffacb00624fb68 EFLAGS: 00010046
[   86.993543] RAX: 0000000000000000 RBX: ffff8ee855f6b2d8 RCX: 0000000000000000
[   86.993545] RDX: 0000000000000000 RSI: ffff8ee855f6c750 RDI: ffff8ee855f68000
[   86.993546] RBP: ffffacb00624fba8 R08: 000000000001e640 R09: ffffffffc039bcb9
[   86.993548] R10: fffff1f06155f200 R11: 0000000000000000 R12: ffff8ee855f68000
[   86.993550] R13: ffff8ee855f6b548 R14: ffff8ee855f6c750 R15: 0000000000000000
[   86.993552] FS:  00007f1260269700(0000) GS:ffff8ee881d40000(0000) knlGS:0000000000000000
[   86.993555] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   86.993556] CR2: 000055651d517908 CR3: 0000000831b78000 CR4: 00000000003406e0
[   86.993558] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   86.993560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   86.993562] Call Trace:
[   86.993572]  ? amdgpu_irq_disable_all+0x89/0xe0 [amdgpu]
[   86.993582]  amdgpu_irq_uninstall+0x17/0x20 [amdgpu]
[   86.993589]  drm_irq_uninstall+0x8e/0x170 [drm]
[   86.993598]  amdgpu_irq_fini+0x83/0xc0 [amdgpu]
[   86.993606]  tonga_ih_sw_fini+0x12/0x30 [amdgpu]
[   86.993613]  amdgpu_fini+0x2c5/0x490 [amdgpu]
[   86.993620]  amdgpu_device_fini+0x53/0x160 [amdgpu]
[   86.993626]  amdgpu_driver_unload_kms+0x4f/0xa0 [amdgpu]
[   86.993632]  drm_dev_unregister+0x3c/0xe0 [drm]
[   86.993637]  drm_put_dev+0x36/0x70 [drm]
[   86.993643]  amdgpu_pci_remove+0x15/0x20 [amdgpu]
[   86.993646]  pci_device_remove+0x39/0xc0
[   86.993649]  device_release_driver_internal+0x155/0x210
[   86.993651]  device_release_driver+0x12/0x20
[   86.993653]  unbind_store+0x10d/0x160
[   86.993655]  drv_attr_store+0x25/0x30
[   86.993657]  sysfs_kf_write+0x37/0x40
[   86.993659]  kernfs_fop_write+0x120/0x1a0
[   86.993662]  __vfs_write+0x37/0x160
[   86.993665]  ? apparmor_file_permission+0x1a/0x20
[   86.993667]  ? security_file_permission+0x3b/0xc0
[   86.993669]  vfs_write+0xb8/0x1b0
[   86.993672]  SyS_write+0x55/0xc0
[   86.993674]  entry_SYSCALL_64_fastpath+0x1e/0xad
[   86.993676] RIP: 0033:0x7f125fd9f6e0
[   86.993678] RSP: 002b:00007ffe60a95358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   86.993681] RAX: ffffffffffffffda RBX: 000000000126e090 RCX: 00007f125fd9f6e0
[   86.993682] RDX: 000000000000000d RSI: 00007ffe60a95400 RDI: 0000000000000003
[   86.993684] RBP: 0000000000000000 R08: 000000000126e520 R09: 0000000000000000
[   86.993686] R10: 0000000000000837 R11: 0000000000000246 R12: 0000000000000000
[   86.993688] R13: 000000000000002d R14: 000000000126f590 R15: 000000000126e090
[   86.993690] Code: ff 25 ff ff ff df 31 c9 be b4 30 00 00 89 c2 48 89 df e8 86 9a fb ff 31 d2 44 89 e6 48 89 df e8 e9 96 fb ff 25 ff ff ff df eb b6 <0f> 0b 0f 0b 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 
[   86.993716] RIP: gfx_v8_0_kiq_set_interrupt_state+0xce/0xe0 [amdgpu] RSP: ffffacb00624fb68
[   86.993719] ---[ end trace 36bcf8facd6b3d68 ]---
[   86.993722] note: tee[2955] exited with preempt_count 1
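
For triage, the two most useful pieces of the trace above are the BUG location and the kernel function named in the RIP lines. The following is a rough sketch of how those lines could be pulled out of a longer log; oops_summary is a hypothetical helper name, and the pattern assumes the log format shown in this report.

```shell
#!/bin/bash
# Hypothetical helper: read a kernel log on stdin and print only the
# "kernel BUG at" line and the RIP lines that name a kernel symbol
# (symbol+offset/size). User-space RIP lines carry no "+0x" and are skipped.
oops_summary() {
    grep -E 'kernel BUG at |RIP: .*\+0x[0-9a-f]+/'
}
```

Typical usage would be dmesg | oops_summary, or feeding it a saved log file on stdin.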
Comment 1 jimijames.bove 2017-06-10 19:29:11 UTC
Many other people and I have been having this issue as well, and I only recently learned that freedesktop.org, NOT kernel.org, is the proper place to report it. Here's my bug report, which has been ignored for almost a year and hopefully has extra information: https://bugzilla.kernel.org/show_bug.cgi?id=150731
Comment 2 Luke A. Guest 2017-07-01 15:14:40 UTC
I can confirm that the OS completely hangs when unbinding R9 380 (Tonga Pro) with X running. Works fine with X off.
Thought I'd add my post from the linked thread, so I can be updated.
------------------

I have amdgpu and vfio-pci both in kernel, and I used the following to unbind it.

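#!/bin/bash
# Unbind each given PCI device from its current driver and hand it to vfio-pci.
for dev in "$@"; do
        # Read the IDs from sysfs; new_id expects "vendor device".
        vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
        device=$(cat "/sys/bus/pci/devices/$dev/device")
        # Only unbind if a driver is currently bound to the device.
        if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
                echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
        fi
        echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done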

lspci -nnk shows:

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1)
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 380 Nitro 4G D5 [174b:e308]
        Kernel driver in use: vfio-pci
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380] [1002:aad8]
        Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 285/380 HDMI Audio [174b:aad8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
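
The same check that lspci -nnk performs above can be approximated directly from sysfs, which is handy inside scripts. The following is a minimal sketch: slot_drivers is a hypothetical helper name, and SYSFS_ROOT is an assumed override used only for testing against a fake tree.

```shell
#!/bin/bash
# Hypothetical helper: list every function of a PCI slot together with the
# driver it is bound to ("none" if unbound).
# SYSFS_ROOT is an assumption for testing; it defaults to the real /sys.
slot_drivers() {
    local slot="$1"                      # e.g. 0000:03:00 (no function suffix)
    local root="${SYSFS_ROOT:-/sys}"
    local path dev drv
    for path in "$root/bus/pci/devices/$slot".*; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$path" ] || continue
        dev=$(basename "$path")
        if [ -L "$path/driver" ]; then
            drv=$(basename "$(readlink "$path/driver")")
        else
            drv="none"
        fi
        echo "$dev $drv"
    done
}
```

For the card shown above, slot_drivers 0000:03:00 would be expected to print one line for the VGA function and one for the HDMI audio function, each followed by vfio-pci.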
Comment 3 Michel Dänzer 2017-07-03 03:32:55 UTC
FWIW, I don't think unbinding is supposed to be possible while Xorg (or anything else) is using the GPU. Sounds like there's something missing somewhere to prevent that.
Comment 4 jimijames.bove 2017-07-05 06:45:30 UTC
(In reply to Michel Dänzer from comment #3)
> FWIW, I don't think unbinding is supposed to be possible while Xorg (or
> anything else) is using the GPU. Sounds like there's something missing
> somewhere to prevent that.

Before I switched to AMD, I was passing an NVidia GPU (GTX 660) into my virtual machine, and I could unbind and rebind it between nouveau and vfio-pci as much as I wanted. No trouble at all. Even while X was running, once DRI3 support came. I switched to AMD expecting the same functionality. Thankfully, not having said functionality isn't the end of the world, but having to reboot my computer every time I want to play a game in Windows right after playing a game in Linux is exactly the kind of pain that I spent a summer setting up the VM to avoid.
Comment 5 Michel Dänzer 2017-07-05 07:25:01 UTC
(In reply to jimijames.bove from comment #4)
> Before I switched to AMD, I was passing an NVidia GPU (GTX 660) into my
> virtual machine, and I could unbind and rebind it between nouveau and
> vfio-pci as much as I wanted. No trouble at all. Even while X was running,
> once DRI3 support came.

You can still do that with DRI3, you just have to prevent Xorg from using the secondary GPU, e.g. via

Section "ServerFlags"
       Option  "AutoAddGPU" "off"
EndSection

in /etc/X11/xorg.conf.
Comment 6 jimijames.bove 2017-07-05 07:56:14 UTC
(In reply to Michel Dänzer from comment #5)
> (In reply to jimijames.bove from comment #4)
> > Before I switched to AMD, I was passing an NVidia GPU (GTX 660) into my
> > virtual machine, and I could unbind and rebind it between nouveau and
> > vfio-pci as much as I wanted. No trouble at all. Even while X was running,
> > once DRI3 support came.
> 
> You can still do that with DRI3, you just have to prevent Xorg from using
> the secondary GPU, e.g. via
> 
> Section "ServerFlags"
>        Option  "AutoAddGPU" "off"
> EndSection
> 
> in /etc/X11/xorg.conf.

Well, sort of. That option is what allows me to bind the card to amdgpu without X crashing (even though I've been told in the past that I shouldn't need that option for that functionality), but this bug--not being able to UNbind it from amdgpu--does not go away with that option.
Comment 7 jimijames.bove 2017-07-05 07:57:23 UTC
(In reply to jimijames.bove from comment #6)
> Well, sort of. That option is what allows me to bind the card to amdgpu
> without X crashing (even though I've been told in the past that I shouldn't
> need that option for that functionality), but this bug--not being able to
> UNbind it from amdgpu--does not go away with that option.

Actually, sorry, I just remembered, I *don't* need that option anymore to bind it to amdgpu while X is running. That did get fixed. But back then and also right now, it still doesn't fix this bug.
Comment 8 Michel Dänzer 2017-07-05 08:09:41 UTC
Make sure nothing else (e.g. gdm in Wayland mode) is using the GPU you're trying to unbind either. Something like

 sudo lsof /dev/dri/*

shows which process is using which GPU device(s).
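
To interpret the lsof output, it helps to know which /dev/dri/cardN node belongs to which PCI device. The following is a minimal sketch that reads the mapping from sysfs: drm_cards is a hypothetical helper name, and SYSFS_ROOT is an assumed override used only for testing against a fake tree.

```shell
#!/bin/bash
# Hypothetical helper: print "cardN <pci-address>" for each DRM card node.
# SYSFS_ROOT is an assumption for testing; it defaults to the real /sys.
drm_cards() {
    local root="${SYSFS_ROOT:-/sys}"
    local card name devpath
    for card in "$root"/class/drm/card[0-9]*; do
        name=$(basename "$card")
        # Skip connector entries such as card0-HDMI-A-1.
        case "$name" in *-*) continue ;; esac
        # Skip unmatched globs and entries without a parent device.
        [ -e "$card/device" ] || continue
        # The "device" symlink resolves to the parent PCI device directory.
        devpath=$(readlink -f "$card/device")
        echo "$name $(basename "$devpath")"
    done
}
```

If the GPU in question shows up as, say, card1, then the lsof lines of interest are the ones ending in /dev/dri/card1.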
Comment 9 jimijames.bove 2017-07-05 08:14:50 UTC
(In reply to Michel Dänzer from comment #8)
> Make sure nothing else (e.g. gdm in Wayland mode) is using the GPU you're
> trying to unbind either. Something like
> 
>  sudo lsof /dev/dri/*
> 
> shows which process is using which GPU device(s).

I did that way back when I first discovered this bug. I'll do it again just to make sure when I'm back with the computer that has the virtual machine and AMD GPU in a couple weeks.
Comment 10 jimijames.bove 2017-07-19 20:04:47 UTC
OK, that's weird. Running sudo lsof /dev/dri/* doesn't get me any info about the AMD card at all. I ran it at boot, when it's bound to vfio-pci (I set it up to be that way at boot), then I ran it after unbinding it from that and binding it to amdgpu, and then I ran it after attempting (and failing due to this bug) to unbind it from amdgpu. All 3 times, I got these lines, which refer to my NVidia GT 740 (card0), and absolutely no lines that have anything to do with Xorg or any other video card:

COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF  NODE NAME
Xorg       597 root  mem    CHR   226,0          14346 /dev/dri/card0
Xorg       597 root   14u   CHR   226,0      0t0 14346 /dev/dri/card0
Xorg       597 root   16u   CHR   226,0      0t0 14346 /dev/dri/card0
Xorg       597 root   17u   CHR   226,0      0t0 14346 /dev/dri/card0
Comment 11 Martin Peres 2019-11-19 08:14:50 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/149.
