Bug 88927

Summary: [Acer Aspire 4820TG] Kernel: trying to unbind memory from uninitialized GART !; EIP is at radeon_gart_unbind+0xca/0xe0 [radeon]()
Product: xorg Reporter: tiagdtd-lava
Component: Driver/Radeon Assignee: xf86-video-ati maintainers <xorg-driver-ati>
Status: RESOLVED NOTOURBUG QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: medium CC: christopher.m.penalver
Version: 7.7 (2012.06)   
Hardware: All   
OS: Linux (All)   
See Also: https://launchpad.net/bugs/1414349
Whiteboard:
Attachments:
Description Flags
System information (dmesg, lspci, ...) none

Description tiagdtd-lava 2015-02-02 22:04:28 UTC
Created attachment 113078
System information (dmesg, lspci, ...)

I have some problems with the new kernel/radeon drivers with my Radeon HD 5650 in my Acer Aspire 4820TG laptop.

The X server starts, but the GUI repeatedly freezes for a few seconds. It happens regardless of which applications are running.
The kernel repeatedly reports warnings with stack traces.
I think these stack traces point to NULL pointer dereferences.


Kernel Version:
Linux version 3.19.0-031900rc7-generic (kernel@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201502020035 SMP Mon Feb 2 05:36:49 UTC 2015

First stack trace:
[   28.702100] WARNING: CPU: 3 PID: 155 at /home/kernel/COD/linux/drivers/gpu/drm/radeon/radeon_gart.c:246 radeon_gart_unbind+0xca/0xe0 [radeon]()
[   28.702101] trying to unbind memory from uninitialized GART !
[   28.702102] Modules linked in: ctr ccm arc4 brcmsmac cordic brcmutil b43 mac80211 cfg80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ssb intel_powerclamp coretemp snd_hda_intel kvm_intel snd_hda_controller rfcomm snd_hda_codec acer_wmi bnep snd_hwdep kvm sparse_keymap crct10dif_pclmul snd_pcm crc32_pclmul ghash_clmulni_intel snd_seq_midi uvcvideo snd_seq_midi_event videobuf2_vmalloc videobuf2_memops aesni_intel snd_rawmidi videobuf2_core aes_x86_64 snd_seq binfmt_misc v4l2_common btusb lrw gf128mul videodev glue_helper ablk_helper media bluetooth cryptd snd_seq_device snd_timer joydev serio_raw snd intel_ips mei_me soundcore mei lpc_ich bcma mac_hid parport_pc ppdev lp parport amdkfd amd_iommu_v2 radeon i915 psmouse i2c_algo_bit ttm drm_kms_helper ahci drm libahci atl1c wmi video
[   28.702134] CPU: 3 PID: 155 Comm: kworker/u16:6 Not tainted 3.19.0-031900rc7-generic #201502020035
[   28.702135] Hardware name: Acer /JM41_CP, BIOS V1.25 03/16/2011
[   28.702142] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   28.702143]  00000000000000f6 ffff8800354db6f8 ffffffff817c47b0 0000000000000007
[   28.702145]  ffff8800354db748 ffff8800354db738 ffffffff81076df7 ffffffff81f12d95
[   28.702146]  ffff880000074000 ffff8800354db948 ffff880000074728 ffff880035708540
[   28.702148] Call Trace:
[   28.702156]  [<ffffffff817c47b0>] dump_stack+0x45/0x57
[   28.702160]  [<ffffffff81076df7>] warn_slowpath_common+0x97/0xe0
[   28.702162]  [<ffffffff81076ef6>] warn_slowpath_fmt+0x46/0x50
[   28.702167]  [<ffffffff814aa774>] ? vt_console_print+0x2d4/0x3b0
[   28.702177]  [<ffffffffc030a5aa>] radeon_gart_unbind+0xca/0xe0 [radeon]
[   28.702187]  [<ffffffffc0306c32>] radeon_ttm_backend_unbind+0x22/0x40 [radeon]
[   28.702193]  [<ffffffffc0138177>] ttm_tt_unbind+0x27/0x40 [ttm]
[   28.702216]  [<ffffffffc013cb68>] ttm_bo_move_ttm+0xf8/0x130 [ttm]
[   28.702220]  [<ffffffffc013a326>] ttm_bo_handle_move_mem+0x636/0x6f0 [ttm]
[   28.702226]  [<ffffffff810cc888>] ? console_unlock+0x18/0x30
[   28.702227]  [<ffffffff810cd1ae>] ? vprintk_emit+0x25e/0x510
[   28.702232]  [<ffffffffc013a522>] ttm_bo_evict+0x142/0x200 [ttm]
[   28.702237]  [<ffffffffc013a76b>] ttm_mem_evict_first+0x18b/0x1f0 [ttm]
[   28.702242]  [<ffffffffc01404b3>] ? ttm_bo_man_takedown+0x53/0x80 [ttm]
[   28.702247]  [<ffffffffc013a841>] ttm_bo_force_list_clean+0x71/0xc0 [ttm]
[   28.702256]  [<ffffffffc013a937>] ttm_bo_clean_mm+0x47/0x90 [ttm]
[   28.702272]  [<ffffffffc03086f5>] radeon_ttm_fini+0xf5/0x1c0 [radeon]
[   28.702288]  [<ffffffffc0309476>] radeon_bo_fini+0x16/0x30 [radeon]
[   28.702311]  [<ffffffffc0365e6f>] evergreen_fini+0xaf/0xe0 [radeon]
[   28.702322]  [<ffffffffc02ec46f>] radeon_device_fini+0x3f/0x140 [radeon]
[   28.702331]  [<ffffffffc02ee969>] radeon_driver_unload_kms+0x59/0x80 [radeon]
[   28.702344]  [<ffffffffc008e52d>] drm_dev_unregister+0x2d/0xc0 [drm]
[   28.702352]  [<ffffffffc008eb07>] drm_put_dev+0x27/0x70 [drm]
[   28.702360]  [<ffffffffc02ea215>] radeon_pci_remove+0x15/0x20 [radeon]
[   28.702363]  [<ffffffff813f0fa6>] pci_device_remove+0x46/0xc0
[   28.702367]  [<ffffffff814f6b9f>] __device_release_driver+0x7f/0xf0
[   28.702369]  [<ffffffff814f6c3c>] device_release_driver+0x2c/0x40
[   28.702372]  [<ffffffff813ead8c>] pci_stop_bus_device+0x9c/0xb0
[   28.702373]  [<ffffffff813ead33>] pci_stop_bus_device+0x43/0xb0
[   28.702375]  [<ffffffff813eaf46>] pci_stop_and_remove_bus_device+0x16/0x30
[   28.702378]  [<ffffffff8140a387>] disable_slot+0x57/0xa0
[   28.702380]  [<ffffffff8140ab58>] acpiphp_check_bridge.part.11+0xe8/0x100
[   28.702382]  [<ffffffff8140b588>] acpiphp_check_host_bridge+0x88/0xb0
[   28.702384]  [<ffffffff8143c6aa>] acpi_pci_root_scan_dependent+0xe/0x12
[   28.702387]  [<ffffffff814392bb>] acpi_scan_bus_check+0x48/0xad
[   28.702389]  [<ffffffff814393f3>] acpi_generic_hotplug_event+0x2e/0x87
[   28.702391]  [<ffffffff814394a1>] acpi_device_hotplug+0x55/0xcd
[   28.702393]  [<ffffffff814329ce>] acpi_hotplug_work_fn+0x20/0x2d
[   28.702397]  [<ffffffff8108f6dd>] process_one_work+0x14d/0x460
[   28.702406]  [<ffffffff810900bb>] worker_thread+0x11b/0x3f0
[   28.702415]  [<ffffffff8108ffa0>] ? create_worker+0x1e0/0x1e0
[   28.702421]  [<ffffffff81095cc9>] kthread+0xc9/0xe0
[   28.702423]  [<ffffffff81095c00>] ? flush_kthread_worker+0x90/0x90
[   28.702427]  [<ffffffff817d1a3c>] ret_from_fork+0x7c/0xb0
[   28.702428]  [<ffffffff81095c00>] ? flush_kthread_worker+0x90/0x90
[   28.702430] ---[ end trace 951e27bb90f8cfd4 ]---


I added detailed information in the attachment. I can try to provide more information if needed.
Comment 1 Alex Deucher 2015-02-02 22:13:43 UTC
Looks like the acpiphp driver is broken again.

https://bugzilla.kernel.org/show_bug.cgi?id=61891
Comment 2 Alex Deucher 2015-02-03 15:36:28 UTC
Someone familiar with the acpiphp driver needs to fix it again so that it does not try to unload the radeon driver when the dGPU is powered off. This was fixed last year, but it appears to have been broken again.
Comment 3 tiagdtd-lava 2015-02-08 19:32:32 UTC
I don't know much about the kernel or the acpiphp driver, but I tried to dig around in the kernel to fix this bug.

I made a bugfix which works for me.
Somebody who knows about the system should have a look at this solution, because I'm just guessing.

As far as I can tell everything works great now.

The problem seems to be that slot_no_hotplug() in acpiphp_glue.c doesn't go deep enough to find the ignore_hotplug flag of the radeon device.

I looked at the function that removes PCI devices and wrote a patch that iterates through all devices in the same way to check the flags.

This works great for my laptop and everything is stable now.

This is the patch I came up with:

--- drivers/pci/hotplug/acpiphp_glue.c.orig	2015-02-08 19:30:53.630214885 +0100
+++ drivers/pci/hotplug/acpiphp_glue.c	2015-02-08 19:30:25.534214491 +0100
@@ -559,15 +559,36 @@ static void disable_slot(struct acpiphp_
 	slot->flags &= (~SLOT_ENABLED);
 }
 
+static bool device_no_hotplug(struct pci_dev *dev)
+{
+	struct pci_bus *bus = dev->subordinate;
+	struct pci_dev *child;
+
+	if (!bus) {
+		return dev->ignore_hotplug;
+	}
+
+	list_for_each_entry(child, &bus->devices, bus_list) {
+		if (device_no_hotplug(child)) {
+			return true;
+		}
+	}
+
+	return false;
+}
+
+
 static bool slot_no_hotplug(struct acpiphp_slot *slot)
 {
 	struct pci_bus *bus = slot->bus;
 	struct pci_dev *dev;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
-		if (PCI_SLOT(dev->devfn) == slot->device && dev->ignore_hotplug)
-			return true;
+		if (PCI_SLOT(dev->devfn) == slot->device) {
+			return device_no_hotplug(dev);
+		}
 	}
+
 	return false;
 }
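
To illustrate what "deep enough" means here, this is a minimal userspace sketch, not kernel code. It assumes the device in the hotplug slot is a bridge and the radeon GPU sits on the bus behind it, which the two nested pci_stop_bus_device calls in the trace suggest but do not prove. The types model_dev and model_bus and all function names are hypothetical stand-ins for struct pci_dev and struct pci_bus.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's struct pci_dev and
 * struct pci_bus; only the fields this check needs are modelled. */
struct model_bus;

struct model_dev {
    bool ignore_hotplug;            /* set by a driver such as radeon */
    struct model_bus *subordinate;  /* non-NULL when the device is a bridge */
    struct model_dev *next;         /* next device on the same bus */
};

struct model_bus {
    struct model_dev *devices;      /* head of the device list on this bus */
};

/* Behaviour before the patch: only the slot device's own flag is read. */
static bool shallow_no_hotplug(const struct model_dev *dev)
{
    return dev->ignore_hotplug;
}

/* Behaviour of the patch: for a bridge, search every device behind it.
 * As in the patch, the bridge's own flag is not consulted. */
static bool deep_no_hotplug(const struct model_dev *dev)
{
    const struct model_dev *child;

    if (!dev->subordinate)
        return dev->ignore_hotplug;

    for (child = dev->subordinate->devices; child; child = child->next) {
        if (deep_no_hotplug(child))
            return true;
    }
    return false;
}

/* Assumed topology: a bridge in the slot with one flagged GPU behind it. */
static bool example_result(bool deep)
{
    struct model_dev gpu = { .ignore_hotplug = true };
    struct model_bus secondary = { .devices = &gpu };
    struct model_dev bridge = { .subordinate = &secondary };

    return deep ? deep_no_hotplug(&bridge) : shallow_no_hotplug(&bridge);
}
```

With this topology, example_result(false) misses the flag on the GPU, while example_result(true) finds it, so only the recursive check would keep acpiphp from disabling the slot.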
Comment 4 Lorenzo S. 2015-04-25 14:09:42 UTC
I have the same problem as the OP (tiagdtd-lava) on Xubuntu 15.04 on an Acer Aspire 4820TG with a Radeon HD 5650. The GUI freezes every 6-10 seconds for a whole second. Every time it freezes, the fan speeds up and these lines are printed in kern.log: http://pastebin.com/raw.php?i=Rd63ZvHh

Kernel version is:
Linux TIMELINEX 3.19.0-15-generic #15-Ubuntu SMP Thu Apr 16 23:32:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Previously, Xubuntu 14.04.2 LTS was installed on the same laptop with no problems regarding X, Radeon or switchable graphics. Graphics switching was completely automatic (DynOn/DynOff) in 14.04.2, but in 15.04 it is not.

I haven't tested the kernel patches because I'm not experienced with recompiling kernels (as you might guess from the fact that I'm a Xubuntu user). But I tried some kernel parameters, and these are the results:

1) pci=noacpi OR noacpi

No more freezes, but the external USB mouse (and maybe other devices) stopped working entirely :(

2) radeon.runpm=0

As suggested here https://bugzilla.kernel.org/show_bug.cgi?id=79701 and here https://bugzilla.kernel.org/show_bug.cgi?id=72701, this solves the freezes and does not stop external devices from working. However, as reported here https://bugzilla.kernel.org/show_bug.cgi?id=72701 by klod, it reduces battery life and increases the internal temperature, because it does not turn off the discrete graphics:

# cat /sys/kernel/debug/vgaswitcheroo/switch
0:DIS-Audio: :Pwr:0000:01:00.1
1:IGD:+:Pwr:0000:00:02.0
2:DIS: :Pwr:0000:01:00.0
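
The "Pwr" field on the DIS line is what shows that the discrete GPU is still powered. A minimal sketch of reading that state from such lines follows. It assumes the colon-separated layout seen above (id, client name, active marker, power state, PCI address) and that any power state containing "Pwr" means powered; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of the switch file. Field names are illustrative. */
struct vgasw_entry {
    int id;
    char name[16];   /* e.g. "IGD", "DIS", "DIS-Audio" */
    bool active;     /* '+' marks the active client */
    char power[16];  /* e.g. "Pwr" as in the output above */
    char pci[32];    /* PCI address, e.g. "0000:01:00.0" */
};

/* Parse a line such as "2:DIS: :Pwr:0000:01:00.0"; true on success. */
static bool vgasw_parse(const char *line, struct vgasw_entry *e)
{
    char flag;

    /* %c does not skip whitespace, so it captures ' ' or '+' as printed. */
    if (sscanf(line, "%d:%15[^:]:%c:%15[^:]:%31s",
               &e->id, e->name, &flag, e->power, e->pci) != 5)
        return false;
    e->active = (flag == '+');
    return true;
}

/* True when the entry is the discrete GPU itself and it is powered.
 * Assumption: a power field containing "Pwr" means powered. */
static bool vgasw_dis_powered(const struct vgasw_entry *e)
{
    return strcmp(e->name, "DIS") == 0 && strstr(e->power, "Pwr") != NULL;
}

/* Convenience wrapper: parse one line and report the DIS power state. */
static bool line_shows_dis_powered(const char *line)
{
    struct vgasw_entry e;

    return vgasw_parse(line, &e) && vgasw_dis_powered(&e);
}
```

Applied to the three lines above, only the "2:DIS: :Pwr:0000:01:00.0" line is reported as a powered discrete GPU, which matches the observation that radeon.runpm=0 leaves it switched on.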

I tried to manually turn off the discrete graphics using:

# echo "DIGD" > /sys/kernel/debug/vgaswitcheroo/switch
# echo "OFF"  > /sys/kernel/debug/vgaswitcheroo/switch

But as soon as I run these commands, these lines are printed in kern.log: http://pastebin.com/raw.php?i=WwSgEsEt (and they repeat at regular intervals), the fan speeds up again and does not stop, the system becomes unstable, the network stops working and system shutdown hangs!

Is there a chance I can fix this, or is it better to downgrade to 14.04.2 LTS?
Comment 5 Alex Deucher 2015-04-25 19:07:03 UTC
Please try the latest patch on this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=61891
Comment 6 Lorenzo S. 2015-05-01 21:27:13 UTC
(In reply to Alex Deucher from comment #5)
> Please try the latest patch on this bug:
> https://bugzilla.kernel.org/show_bug.cgi?id=61891

Sorry for the delay, I just tried it and looks like it's working! Great!
Comment 7 Christopher M. Penalver 2016-02-25 07:39:28 UTC
As per https://bugs.freedesktop.org/show_bug.cgi?id=88927#c3 .
