Bug 61529

Summary: [r600g][kms][ATI RV710] r600_pcie_gart_tlb_flush+0xf5/0x110
Product: DRI
Reporter: Thaddaeus Tintenfisch <thad.fisch>
Component: DRM/Radeon
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=49531
https://launchpad.net/bugs/1170917
Attachments:
  Xorg log file (flags: none)
  drm/radeon: Don't flush the GART TLB if rdev->gart.ptr == NULL (flags: none)
  dmesg snippet (flags: none)

Description Thaddaeus Tintenfisch 2013-02-26 20:25:42 UTC
Created attachment 75598
Xorg log file

I am not able to log out of or shut down my system, a laptop with hybrid graphics, without triggering a hard lockup. However, this only happens if the dedicated AMD GPU is powered off by vgaswitcheroo. It might also be related to PRIME support being enabled.

The latest version of xserver-xorg-video-radeon for Ubuntu 13.04 is installed (1:7.1.0-0ubuntu1).

$ lshw -C display
  *-display               
       description: VGA compatible controller
       product: RV710 [Mobility Radeon HD 4300 Series]
       vendor: Advanced Micro Devices [AMD] nee ATI
       physical id: 0
       bus info: pci@0000:01:00.0
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
       configuration: driver=radeon latency=0
       resources: irq:46 memory:d0000000-dfffffff ioport:3000(size=256) memory:f4400000-f440ffff memory:f4420000-f443ffff
  *-display
       description: Display controller
       product: Mobile 4 Series Chipset Integrated Graphics Controller
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 07
       width: 64 bits
       clock: 33MHz
       capabilities: msi pm bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:45 memory:f0000000-f03fffff memory:e0000000-efffffff ioport:4110(size=8)


Here is syslog output of the bug:
-------------------------------------------------------------------------
[  142.230685] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  142.230819] IP: [<ffffffffa01f1ba5>] r600_pcie_gart_tlb_flush+0xf5/0x110 [radeon]
[  142.230977] PGD 0 
[  142.231014] Oops: 0000 [#1] SMP 
[  142.231075] Modules linked in: dm_crypt(F) kvm_intel kvm acer_wmi sparse_keymap snd_hda_codec_realtek xt_hl(F) ip6t_rt(F) snd_hda_intel snd_hda_codec snd_hwdep(F) nf_conntrack_ipv6(F) nf_defrag_ipv6(F) ipt_REJECT(F) microcode(F) xt_LOG(F) snd_pcm(F) xt_limit(F) xt_tcpudp(F) snd_page_alloc(F) xt_addrtype(F) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) arc4(F) psmouse(F) nf_conntrack_ipv4(F) serio_raw(F) nf_defrag_ipv4(F) xt_state(F) iwldvm snd_seq(F) mac80211 ip6table_filter(F) snd_seq_device(F) ip6_tables(F) snd_timer(F) iwlwifi lpc_ich nf_conntrack_netbios_ns(F) nf_conntrack_broadcast(F) nf_nat_ftp(F) nf_nat(F) nf_conntrack_ftp(F) nf_conntrack(F) iptable_filter(F) cfg80211 ip_tables(F) joydev(F) x_tables(F) snd(F) soundcore(F) mac_hid binfmt_misc(F) coretemp lp(F) parport(F) hid_generic usbhid hid radeon i915 i2c_algo_bit ttm drm_kms_helper wmi r8169 ahci(F) drm libahci(F) video(F)
[  142.232175] CPU 0 
[  142.232175] Pid: 1135, comm: Xorg Tainted: GF            3.8.0-7-generic #15-Ubuntu Acer TravelMate 8471/TravelMate 8471
[  142.232175] RIP: 0010:[<ffffffffa01f1ba5>]  [<ffffffffa01f1ba5>] r600_pcie_gart_tlb_flush+0xf5/0x110 [radeon]
[  142.232175] RSP: 0018:ffff88013752bc28  EFLAGS: 00010282
[  142.232175] RAX: ffffc900047a2f34 RBX: 0000000000000000 RCX: 0000000000000000
[  142.232175] RDX: 0000000000000000 RSI: 0000000000002f34 RDI: ffff8801359d6000
[  142.232175] RBP: ffff88013752bc38 R08: 0000000000000000 R09: 0000000000000000
[  142.232175] R10: ffffea0004d3de00 R11: ffffffffa001a448 R12: ffff8801359d6000
[  142.232175] R13: 0000000000000225 R14: 0000000000000225 R15: ffffffffa025d560
[  142.232175] FS:  00007fa8bea89940(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  142.232175] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  142.232175] CR2: 0000000000000000 CR3: 0000000137882000 CR4: 00000000000407f0
[  142.232175] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  142.232175] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  142.232175] Process Xorg (pid: 1135, threadinfo ffff88013752a000, task ffff88013750c5c0)
[  142.232175] Stack:
[  142.232175]  ffff8801359d6000 0000000000000225 ffff88013752bc68 ffffffffa01c65d7
[  142.232175]  ffff880134f73080 0000000000000002 ffff8801359d69c0 ffff8801397e2848
[  142.232175]  ffff88013752bc78 ffffffffa01c3baa ffff88013752bc90 ffffffffa0099087
[  142.232175] Call Trace:
[  142.232175]  [<ffffffffa01c65d7>] radeon_gart_unbind+0xa7/0xe0 [radeon]
[  142.232175]  [<ffffffffa01c3baa>] radeon_ttm_backend_unbind+0x1a/0x20 [radeon]
[  142.232175]  [<ffffffffa0099087>] ttm_tt_unbind+0x27/0x40 [ttm]
[  142.232175]  [<ffffffffa0099693>] ttm_bo_cleanup_memtype_use+0x33/0x90 [ttm]
[  142.232175]  [<ffffffffa009a930>] ttm_bo_release+0x210/0x280 [ttm]
[  142.232175]  [<ffffffffa009a9d1>] ttm_bo_unref+0x31/0x40 [ttm]
[  142.232175]  [<ffffffffa01c5407>] radeon_bo_unref+0x47/0x80 [radeon]
[  142.232175]  [<ffffffffa01d7cf9>] radeon_gem_object_free+0x39/0x40 [radeon]
[  142.232175]  [<ffffffffa0010aba>] drm_gem_object_free+0x2a/0x30 [drm]
[  142.232175]  [<ffffffffa00111e8>] drm_gem_handle_delete+0xf8/0x130 [drm]
[  142.232175]  [<ffffffffa0011648>] drm_gem_close_ioctl+0x28/0x30 [drm]
[  142.232175]  [<ffffffffa000f559>] drm_ioctl+0x4e9/0x5b0 [drm]
[  142.232175]  [<ffffffffa0011620>] ? drm_gem_destroy+0x60/0x60 [drm]
[  142.232175]  [<ffffffff8115c14b>] ? unmap_region+0xdb/0x120
[  142.232175]  [<ffffffff8115c453>] ? remove_vma+0x63/0x70
[  142.232175]  [<ffffffff811a5059>] do_vfs_ioctl+0x99/0x570
[  142.232175]  [<ffffffff8115e488>] ? do_munmap+0x328/0x410
[  142.232175]  [<ffffffff811a55c1>] sys_ioctl+0x91/0xb0
[  142.232175]  [<ffffffff816cc5dd>] system_call_fastpath+0x1a/0x1f
[  142.232175] Code: 00 c1 e8 04 83 f8 02 74 29 85 c0 74 c9 5b 41 5c 5d c3 0f 1f 40 00 31 c9 31 d2 be 34 2f 00 00 48 8b 9f 90 03 00 00 e8 5b f1 fe ff <8b> 03 e9 42 ff ff ff 48 c7 c7 10 b3 24 a0 31 c0 e8 a0 5c 4c e1 
[  142.232175] RIP  [<ffffffffa01f1ba5>] r600_pcie_gart_tlb_flush+0xf5/0x110 [radeon]
[  142.232175]  RSP <ffff88013752bc28>
[  142.232175] CR2: 0000000000000000
[  142.294959] ---[ end trace aabd94dad6d98857 ]---
-------------------------------------------------------------------------
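
The trace locates the fault by symbol, offset and function size: r600_pcie_gart_tlb_flush+0xf5/0x110 means the faulting instruction sits 0xf5 bytes into a function that is 0x110 bytes long. A minimal sketch of pulling those three fields out of such a token is shown here. The names parse_oops_location and struct oops_location are hypothetical, made up for this illustration, and are not part of any kernel tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one "symbol+0xOFF/0xSIZE" token from an oops. */
struct oops_location {
    char symbol[64];        /* function name, e.g. r600_pcie_gart_tlb_flush */
    unsigned long offset;   /* byte offset of the faulting instruction */
    unsigned long size;     /* total size of the function in bytes */
};

/*
 * Parse a token such as "r600_pcie_gart_tlb_flush+0xf5/0x110 [radeon]".
 * Returns 0 on success and -1 if the text does not match that shape.
 */
int parse_oops_location(const char *text, struct oops_location *out)
{
    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    /* %63[^+] copies everything up to the '+', leaving room for the NUL. */
    if (sscanf(text, "%63[^+]+0x%lx/0x%lx",
               out->symbol, &out->offset, &out->size) != 3)
        return -1;

    return 0;
}
```

With an unstripped copy of the module, the offset can then be matched against a disassembly of the function to find the instruction that faulted.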
Comment 1 Thaddaeus Tintenfisch 2015-06-27 23:12:22 UTC
A commit in 3.17-rc6 is causing this kernel panic to occur when switching off the dedicated GPU for the first time after booting the system.
This is still reproducible with 4.1-rcX and radeon.runpm=0 (plus radeon.dpm=0).
Comment 2 Michel Dänzer 2015-07-02 06:49:21 UTC
Created attachment 116867
drm/radeon: Don't flush the GART TLB if rdev->gart.ptr == NULL

Does this patch fix the problem?
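
For readers without the attachment, the idea named in the patch title can be sketched in user-space C as follows. This is only an illustration, not the attached patch: struct mock_device, mock_tlb_flush and mock_gart_unbind are hypothetical stand-ins for the driver's types and functions. The assumption that the flush reads through the page table mapping is inferred from the oops in the description, where RBX and CR2 are both zero at the faulting load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the real radeon structures. */
struct mock_gart {
    volatile uint32_t *ptr;   /* CPU mapping of the GART table, NULL when unmapped */
};

struct mock_device {
    struct mock_gart gart;
    unsigned flush_count;     /* counts flushes so the behaviour can be observed */
};

/* Assumed shape of the flush: it reads back through the table mapping. */
static void mock_tlb_flush(struct mock_device *dev)
{
    assert(dev->gart.ptr != NULL);   /* a NULL mapping here is the reported crash */
    (void)*dev->gart.ptr;            /* read through the mapping */
    dev->flush_count++;
}

/*
 * Unbind path with the guard from the patch title: skip the flush when
 * the table is not mapped. Returns 1 if a flush was issued, 0 if skipped.
 */
int mock_gart_unbind(struct mock_device *dev)
{
    if (dev->gart.ptr == NULL)
        return 0;

    mock_tlb_flush(dev);
    return 1;
}
```

The guard sits in the caller rather than in the flush itself, so the flush can keep assuming a valid mapping.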
Comment 3 Thaddaeus Tintenfisch 2015-07-02 12:04:39 UTC
I will test your patch as soon as possible. Meanwhile, I just finished bisecting the kernel with the result:

b440bde74f043c8ec31081cb59c9a53ade954701 is the first bad commit
Comment 4 Thaddaeus Tintenfisch 2015-07-02 17:13:13 UTC
Created attachment 116881
dmesg snippet

I have applied radeon-gart_tlb_flush-NULL.diff to git master and it does fix the hard lockup.
However, powering off the dedicated GPU still triggers some kernel panics (see attached log file).
Comment 5 Alex Deucher 2015-07-02 17:58:08 UTC
You are probably seeing this bug (bug is in the pci hotplug system):
https://bugzilla.kernel.org/show_bug.cgi?id=61891
See if the latest patch there helps.
Comment 6 Thaddaeus Tintenfisch 2015-07-02 19:59:28 UTC
Alex, do you mean the patch from comment #83?
The previous one from comment #78 was already applied -> 0824965140fff1bf640a987dc790d1594a8e0699.
Comment 7 Alex Deucher 2015-07-02 20:18:09 UTC
(In reply to Thaddaeus Tintenfisch from comment #6)
> Alex, do you mean the patch from comment #83?
> The previous one from comment #78 was already applied ->
> 0824965140fff1bf640a987dc790d1594a8e0699.

I didn't realize it had already been applied.
Comment 8 Michel Dänzer 2015-07-03 01:14:55 UTC
The warnings (not panics) would need to be tracked in separate reports, but at least the first one is harmless.
Comment 9 Thaddaeus Tintenfisch 2015-07-03 07:31:05 UTC
The first two warnings are pretty much identical (3 warnings in total).
Should I create two new reports which reference the patch from this report?

The main issue here is that the vgaswitcheroo interface is no longer available after powering off the dGPU (/sys/kernel/debug/vgaswitcheroo/switch is gone).

From the log:

[   60.462677] vga_switcheroo: disabled
Comment 10 Alex Deucher 2015-07-03 19:06:58 UTC
(In reply to Thaddaeus Tintenfisch from comment #9)
> The first two warnings are pretty much identical (3 warnings in total).
> Should I create two new reports which reference the patch from this report?
> 
> The main issue here is that the vgaswitcheroo interface is no longer
> available after powering off the dGPU
> (/sys/kernel/debug/vgaswitcheroo/switch is gone).
> 
> From the log:
> 
> [   60.462677] vga_switcheroo: disabled

It's still an acpi hotplug bug:
[   60.454402]  [<ffffffffc022c295>] radeon_pci_remove+0x15/0x20 [radeon]
[   60.454407]  [<ffffffff813e470f>] pci_device_remove+0x3f/0xc0
[   60.454414]  [<ffffffff814e6a86>] __device_release_driver+0x96/0x130
[   60.454418]  [<ffffffff814e6b43>] device_release_driver+0x23/0x30
[   60.454423]  [<ffffffff813df012>] pci_stop_bus_device+0x92/0xa0
[   60.454427]  [<ffffffff813df136>] pci_stop_and_remove_bus_device+0x16/0x30
[   60.454432]  [<ffffffff813fbf23>] disable_slot+0x53/0xa0
[   60.454436]  [<ffffffff813fc642>] acpiphp_check_bridge.part.8+0xd2/0xf0
[   60.454440]  [<ffffffff813fcec2>] acpiphp_hotplug_notify+0xd2/0x220
[   60.454445]  [<ffffffff813fcdf0>] ? acpiphp_post_dock_fixup+0xc0/0xc0
[   60.454450]  [<ffffffff81429fb7>] acpi_device_hotplug+0x3b0/0x3f8
[   60.454454]  [<ffffffff814233ae>] acpi_hotplug_work_fn+0x1f/0x2b

The acpiphp driver is trying to remove the driver after switcheroo has turned it off. It should not be kicking in and removing the driver. It looks like there is some other broken case in the acpiphp code. I'd suggest filing a new acpiphp bug on bugzilla.kernel.org and referencing this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=61891
or adding a comment to that bug that there are still cases that are broken.
Comment 11 Thaddaeus Tintenfisch 2015-07-06 11:43:40 UTC
I have added a comment to the linked report. Also, the patch from comment #2 for this bug can be forwarded then.

Thanks.
Comment 12 Thaddaeus Tintenfisch 2015-07-20 08:55:08 UTC
Is anything missing? Does the patch need more testing?
Comment 13 Thaddaeus Tintenfisch 2015-07-20 14:05:37 UTC
Oh, 4.2-rc3 includes the patch. Thanks.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=233709d2cd6bbaaeda0aeb8d11f6ca7f98563b39
