Bug 36003 - [Radeon HD 5650 and 5470] Driver crash during recovery boot and in normal boot (Regression from 2.6.38-3 to -4)
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: All Linux (All)
Importance: high critical
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2011-04-05 14:29 UTC by Bryce Harrington
Modified: 2011-10-12 10:00 UTC

Attachments
dmesg - 2.6.38-7 (79.31 KB, text/plain)
2011-04-05 14:32 UTC, Bryce Harrington
dmesg - 2.6.38-8 (90.52 KB, text/plain)
2011-04-05 14:32 UTC, Bryce Harrington

Description Bryce Harrington 2011-04-05 14:29:05 UTC
Forwarding this bug from Ubuntu reporter afoglia:
http://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-ati/+bug/727620

[Problem]
Crash on HP Envy 14 with discrete ATI card and i5 integrated graphics.  Possibly due to a bad interaction between radeon and other framebuffer drivers.

[Original Description]
I'm running natty, and ever since the upgrade to 6.14.0 I've been unable to boot consistently.  After some discussion in the forums, I tried repeatedly to boot into recovery mode.  In most cases, I got a black screen.  One time though, when I was able to successfully increase the brightness, I saw some errors from the radeon module.  I took a photo (available at http://i.imgur.com/P0bQ0.jpg), and here's the stack and call trace, as best as I can read it:

Stack:
 ffff880149eb8000 ffff880149eb8000 0000000000000011 0000000000000911
 00000000fffffff4 ffff88014b6c7800 ffff88014b0f7b58 ffffffffa022aba0
 ffff8801460f7b58 ffff880149eb8000 0000000000000000 0000000000410028
Call Trace:
 [<ffffffffa022aba0>] evergreen_cp_resume+0x3a0/0x630 [radeon]
 [<ffffffffa022c8b7>] evergreen_startup+0x157/0x260 [radeon]
 [<ffffffffa01fe8a0>] ? r600_pcie_gart_init+0x60/0x70 [radeon]
 [<ffffffffa022dbec>] evergreen_init+0x1ac/0x2d0 [radeon]
 [<ffffffffa01a5a69>] radeon_device_init+0x409/0x490 [radeon]
 [<ffffffffa01a7142>] radeon_driver_load_kms+0xb2/0x1a0 [radeon]
 [<ffffffffa007fb2e>] drm_get_pci_dev+0x18e/0x300 [drm]
 [<ffffffff8115426f>] ? kmem_cache_alloc_trace+0xff/0x120
 [<ffffffffa023790e>] radeon_pci_probe+0xb2/0xba [radeon]
 [<ffffffff812fea7f>] local_pci_probe+0x5f/0xd0
 [<ffffffff81300369>] pci_device_probe+0x119/0x120
 [<ffffffff813b8eca>] ? driver_sysfs_add+0x7a/0xb0
 [<ffffffff813b8ff8>] really_probe+0x68/0x190
 [<ffffffff813b9305>] driver_probe_device+0x45/0x70
 [<ffffffff813b93db>] __driver_attach+0xab/0xb0
 [<ffffffff813b9330>] ? __driver_attach+0x0/0xb0
 [<ffffffff813b817e>] bus_for_each_dev+0x5e/0x90
 [<ffffffff813b8e4e>] driver_attach+0x1e/0x20
 [<ffffffff813b89b5>] bus_add_driver+0xc5/0x280
 [<ffffffffa0013000>] ? radeon_init+0x0/0x1000 [radeon]
 [<ffffffff813b9676>] driver_register+0x76/0x140
 [<ffffffffa0013000>] ? radeon_init+0x0/0x1000 [radeon]
 [<ffffffff812ff126>] __pci_register_driver+0x56/0xd0
 [<ffffffffa0080044>] drm_pci_init+0xe4/0xf0 [drm]
 [<ffffffff815bf36e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0013000>] ? radeon_init+0x0/0x1000 [radeon]
 [<ffffffffa0077688>] drm_init+0x58/0x70 [drm]
 [<ffffffffa00130c4>] radeon_init+0xc4/0x1000 [radeon]
 [<ffffffff81002195>] do_one_initcall+0x45/0x190
 [<ffffffff810a4573>] sys_init_module+0x103/0x260
 [<ffffffff8100c002>] system_call_fastpath+0x16/0x1b
Code: 00 45 8b 84 24 e4 0a 00 00 45 85 c0 0f 8e c7 09 00 00 41 8b 84 24 d4 0a 00 00 89 c2 83 c0 01 40 c1 e2 02 49 03 94 24 c8 0a 00 00 <c7> 02 00 44 05 c0 41 8b 94 24 e4 0a 00 00 41 23 84 24 f4 0a 00
RIP  [<ffffffffa0227ad7>] evergreen_cp_start+0x57/0xc80 [radeon]
 RSP <ffff88014b0f7af8>
CR2: ffffc90411ce1ffc
---[ end trace 37702c56f2e23247 ]---
udevd-work[94]: '/sbin/modprobe -bv pci:v00001002d000068C1sv0000103Csd00001436bc03sc00i00' unexpected exit with status 0x0009

There is also some register info dumped at the top of the screen visible in the photo, that I didn't bother to write, as I'd most certainly get something wrong.
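The modprobe line at the end of the trace identifies the device that failed to probe. As a rough sketch, the PCI modalias string can be split into its fields like this (the `decode_pci_modalias` helper is a name made up here, not part of any tool mentioned in this report):

```python
import re

# Field layout of a PCI modalias string: vendor, device, subsystem vendor,
# subsystem device, base class, sub class, programming interface.
MODALIAS_RE = re.compile(
    r"pci:v(?P<vendor>[0-9A-F]{8})d(?P<device>[0-9A-F]{8})"
    r"sv(?P<subvendor>[0-9A-F]{8})sd(?P<subdevice>[0-9A-F]{8})"
    r"bc(?P<base_class>[0-9A-F]{2})sc(?P<sub_class>[0-9A-F]{2})"
    r"i(?P<interface>[0-9A-F]{2})$"
)

def decode_pci_modalias(alias):
    """Return the modalias fields as integers (hypothetical helper)."""
    match = MODALIAS_RE.match(alias)
    if match is None:
        raise ValueError(f"not a PCI modalias: {alias!r}")
    return {name: int(value, 16) for name, value in match.groupdict().items()}

# The alias copied from the udevd-work line in the trace.
fields = decode_pci_modalias(
    "pci:v00001002d000068C1sv0000103Csd00001436bc03sc00i00"
)
for name, value in fields.items():
    print(f"{name:10} 0x{value:04X}")
```

Vendor 0x1002 is the ATI/AMD PCI vendor ID, subsystem vendor 0x103C is Hewlett-Packard, and base class 0x03 marks a display controller, which is consistent with the discrete Radeon in an HP laptop described above.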

I did six normal boots with 2.6.38-7.35, then realized there was an update, and booted that both normally and in recovery mode. Here's what I saw:

2.6.38-7.36 normal, 6 boots, 5 reached gdm login screen, 1 gdm started but hung before login window appeared (only one of the 5 successful boots showed the plymouth boot screen)
2.6.38-7.36 recovery mode, 5 boots, all hung with the monitor off, no plymouth, brightness key did nothing.
2.6.38-7.35 normal, 6 boots, 3 hung with monitor off, 3 reached gdm
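The boot counts above are small samples, so it can help to put them side by side as success rates. A minimal Python sketch, using only the numbers reported in this description (the `boots` table and `success_rate` helper are names chosen here for illustration):

```python
# (kernel, mode) -> (successful boots, total boots), taken from the report above.
boots = {
    ("2.6.38-7.36", "normal"): (5, 6),
    ("2.6.38-7.36", "recovery"): (0, 5),
    ("2.6.38-7.35", "normal"): (3, 6),
}

def success_rate(ok, total):
    """Fraction of boots that succeeded."""
    return ok / total

for (kernel, mode), (ok, total) in boots.items():
    print(f"{kernel} {mode:<8} {ok}/{total} ({success_rate(ok, total):.0%})")
```

With five or six boots per configuration the percentages are only indicative; they mainly show that the failure is intermittent rather than deterministic.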
Comment 1 Bryce Harrington 2011-04-05 14:31:45 UTC
A commenter (Johan Fornander) who believes they have the same (or similar) bug provided these logs:

"""
I have taken logs from a set of ubuntu kernels starting up using the requested kernel boot argument "drm.debug=0x0e":

2.6.38-7 ---> fb0: radeondrmfb frame buffer device
2.6.38-8 ---> fb0: inteldrmfb frame buffer device
2.6.39-rc1--> kernel oops (pointing to evergreen something), not caught in the logs. It seems to be at a different offset from the one reported above.

Please see the attached files containing dmesg and kern.log for each kernel. There are some other interesting things in the logs, like an invalid DSDT, that I will look into further.
"""
Comment 2 Bryce Harrington 2011-04-05 14:32:07 UTC
Created attachment 45317
dmesg - 2.6.38-7
Comment 3 Bryce Harrington 2011-04-05 14:32:30 UTC
Created attachment 45318
dmesg - 2.6.38-8
Comment 4 Bryce Harrington 2011-04-05 14:34:05 UTC
kern.log files are also available but too large for bugzilla.  They can be obtained from launchpad though if you're interested:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-ati/+bug/727620/+attachment/1985108/+files/3_sets_of_kernel_logs.tar.bz2
Comment 5 Alex Deucher 2011-04-05 14:38:34 UTC
This sounds like a problem with vga_switcheroo.  Does setting the video to the radeon/discrete in the bios configuration work ok?  Probably a duplicate of this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=30052
Comment 6 Bryce Harrington 2011-04-05 14:43:13 UTC
The reason we believed it to be a regression between -3 and -4 was due to this
testing by the original bug reporter:

"""
I tried Bryce's second suggestion of using old kernels. I have two previous
versions of 2.6.38 installed, 2.6.38-3-generic and 2.6.38-4-generic. I booted
each into recovery and normal mode 4 times, for a total of 16 boots. Here's the
number of times the boot was a success, where I either got to the recovery boot
menu or gdm, (regardless of whether the screen brightness had to be manually
increased from 0, or if the plymouth boot screen displayed).

2.6.38-4-generic, normal: 1 success, 3 failures
2.6.38-4-generic, recovery: 4 successes
2.6.38-3-generic, normal: 4 successes
2.6.38-3-generic, recovery: 3 successes, 1 failure

At no time did I see a stack trace like the one I posted, but I've only seen
that in recovery mode. (Would it be written somewhere persistent between boots?
It's not in /var/log/syslog.)
"""
Comment 7 Bryce Harrington 2011-04-05 14:44:06 UTC
Others who are reporting as having the same bug indicate that there is a bit of a race condition and sometimes the system loads with the radeon drm driver, sometimes with intel:

From Johan Fornander:
"""
I have a theory... Maybe there is a race condition between the intel and ati driver involved here? My notebook starts up in two seemingly random configurations, or three including the radeon crash:

1. X server is on VT8 -> unable to unload radeon module because it is in use (by some framebuffer I guess). I am also unable to switch to consoles VT1-7. If I use vga_switcheroo to switch to integrated gpu in this mode then radeon crashes.

2. X server is on VT7 -> I can unload the radeon module and use the consoles VT1-6. vga_switcheroo works and I can also use acpi calls to turn off the gpu.

3. The radeon driver crashes. Forcing reboot through REISUB.

This makes it difficult to control the temperature since I cannot know if the radeon module is in use or not (i.e. I might or might not be able to use the vga_switcheroo, or unload the module and use a specific acpi call to shut off the gpu).
"""
Comment 8 Bryce Harrington 2011-04-05 14:45:04 UTC
@alex, thanks for the pointer. I'll have them test the BIOS settings.
Comment 9 Bryce Harrington 2011-04-07 18:52:59 UTC
Response from the original reporter:

Bryce Harrington wrote in #25 "Upstream would like to see if setting the video to the radeon/discrete setting in the BIOS configuration makes it function properly."

Answer: Yes it does. I tried that a while ago already, but can't (don't want to) use that for regular running, because the radeon card makes the laptop battery run out too fast. I also read that there's a patched/hacked BIOS somewhere that allows switching off the radeon card via the BIOS, but if it can be solved with software I'd prefer that ;)



Chris Halse Rogers wrote in #22: "This does look a lot like some bad interaction between i915/radeon"

I agree - I did some more testing and placed an entry for the radeon module into /etc/initramfs-tools/modules, and voilà - no crash! (After booting into X I can't get back to the text console, but that's possibly another issue.)

What proves the "bad interaction" even more is: when also placing "i915" into /etc/initramfs-tools/modules BEFORE "radeon", booting isn't possible at all any more, but when placing i915 AFTER radeon, booting is possible, but I got a black screen (as in totally black, that is: no backlight) until X starts up.

I'll try to get logs of four different initrd-configurations (though I doubt I'll be able to record the one where the crash occurs already during running of initrd...)
Comment 10 Alex Deucher 2011-04-07 19:07:45 UTC
It's not a bad interaction between i915 and radeon per se.  The switcheroo code needs more work to switch properly on some systems, it seems.  There is a set of ACPI methods required to activate/deactivate the respective GPUs.  The drivers need to load and initialize active hw.  If the hw is not active when the driver loads, then the hw is not set up properly and it won't work.  There are probably some ordering issues in how the switcheroo ACPI methods are called.
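For testers who want to record which GPU vga_switcheroo considers active and powered, the state is exposed in /sys/kernel/debug/vgaswitcheroo/switch (debugfs must be mounted). The sketch below parses that text in Python. The line format is an assumption based on how the file commonly looks and it differs between kernel versions, so the regular expression accepts lines with or without the IGD/DIS tag; `parse_switcheroo` and the `SAMPLE` text are illustrative, not taken from the logs in this bug.

```python
import re

# Assumed line format: "<id>:[<IGD|DIS>:]<+ or space>:<power state>:<PCI address>".
# The IGD/DIS tag is optional because older kernels appear to omit it.
LINE_RE = re.compile(
    r"^(?P<id>\d+):(?:(?P<kind>[A-Za-z-]+):)?(?P<active>[+ ]):"
    r"(?P<power>Pwr|Off|DynPwr|DynOff):(?P<pci>[0-9a-fA-F:.]+)$"
)

def parse_switcheroo(text):
    """Parse the contents of the vgaswitcheroo 'switch' file (hypothetical helper)."""
    clients = []
    for line in text.splitlines():
        match = LINE_RE.match(line.rstrip())
        if match is None:
            continue  # skip anything that does not look like a client line
        clients.append({
            "id": int(match.group("id")),
            "kind": match.group("kind"),           # None when the tag is absent
            "active": match.group("active") == "+",
            "power": match.group("power"),
            "pci": match.group("pci"),
        })
    return clients

# Hypothetical sample: integrated GPU active, discrete GPU powered off.
SAMPLE = "0:IGD:+:Pwr:0000:00:02.0\n1:DIS: :Off:0000:01:00.0\n"
clients = parse_switcheroo(SAMPLE)
for client in clients:
    print(client)
```

Capturing this output on both good and bad boots would show whether the discrete GPU was reported as powered when the radeon probe ran.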
Comment 11 Johan Fornander 2011-04-19 04:34:18 UTC
(In reply to comment #10)
> It's not a bad interaction between i915 and radeon per se.  The switcheroo code
> needs more work to switch properly on some systems it seems.  There are a set
> acpi methods required to activate/deactivate the respective gpus.  The drivers
> need to load and initialize active hw.  If the hw is not active when the driver
> loads, then the hw is not set up properly and it won't work.  Probably some
> ordering issues in how the switcheroo acpi methods are called.

Is there anything I can do to help out with this bug? Do you need more logs or info on my hardware setup? I am one of the users reporting this bug against Ubuntu, using an HD5650 + Intel i5 graphics.
Comment 12 Seth Forshee 2011-05-19 09:17:59 UTC
(In reply to comment #10)
> It's not a bad interaction between i915 and radeon per se.  The switcheroo code
> needs more work to switch properly on some systems it seems.  There are a set
> acpi methods required to activate/deactivate the respective gpus.  The drivers
> need to load and initialize active hw.  If the hw is not active when the driver
> loads, then the hw is not set up properly and it won't work.  Probably some
> ordering issues in how the switcheroo acpi methods are called.

I've been staring at lots of logs from boots that do and don't trigger the oops, and I can't see that the interactions with the switcheroo code are at all related. The oops happens during the radeon probe, and by that point the atpx handler has been registered with vga_switcheroo and possibly radeon_atpx_init() has been called, depending on what else has happened at that point, but whether or not radeon_atpx_init() has been called doesn't correlate to whether or not the oops happens. Nor does it correlate to the order in which the i915 and radeon probes happen or anything else I can see from the logs. I'd love to hear more ideas about what's leading to the inconsistency of the power state at boot, because I'm coming up dry (unless it's just the state inherited from the BIOS or bootloader).

Does the driver just need to unconditionally enable power via the ATPX handler during the probe to ensure the probe can continue successfully?

And is it better to carry on this conversation here or on the kernel bugzilla linked to in comment #5?
Comment 13 Ap. Syvertsen 2011-05-26 02:52:21 UTC
(In reply to comment #12)

Okay, first of all I'm not a developer, but I've been poring over the logs posted at launchpad (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/727620) and this is what I notice:

The crash seems to happen during the loading of the microcode, Redwood for the HD5650, RV710 for the HD3650. On a normal boot the detected VRAM in kern.log is 512M, which is the amount of dedicated graphics memory, and the microcode loading happens as follows:
[   21.896251] [drm] Loading RV710 Microcode
[   21.909170] radeon 0000:01:00.0: WB enabled
[   21.955754] [drm] ring test succeeded in 1 usecs

On a bad boot, however, the detected VRAM shown in kern.log is 3584M, which is more than the actual amount of "HyperMemory" on these graphics cards. The loading of the microcode then crashes:
[   21.511875] [drm] Loading RV710 Microcode
[   21.722682] radeon 0000:01:00.0: Wait for MC idle timedout !
[   21.924181] radeon 0000:01:00.0: Wait for MC idle timedout !
[   21.927041] radeon 0000:01:00.0: WB enabled
[   21.973606] BUG: unable to handle kernel paging request at ffffc90405701ffc

The two "Wait for MC idle timedout !" messages and the amount of VRAM detected are very consistent with the crashing. On a bad boot, the following line is also displayed right before the detected VRAM is displayed for the first time (never on a good boot):
[   21.510435] radeon 0000:01:00.0: limiting VRAM

On my HD5650 with Redwood microcode, I also get a GPU softreset after the ATOM BIOS is loaded, right before the limiting VRAM event.

Is there any special order of loading modules, either related to the kernel or the driver, that would cause this chain of events and the discrepancy?

Feel free to ask if you need more info.
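The markers described in this comment can be checked mechanically across many kern.log files. A minimal sketch in Python, assuming the three message strings quoted above are sufficient to tell the two cases apart (`classify_boot` and the marker list are illustrative names, and the sample text is copied from the excerpts in this comment):

```python
# Substrings that, per the observations above, only show up on bad boots.
BAD_BOOT_MARKERS = (
    "limiting VRAM",
    "Wait for MC idle timedout",
    "unable to handle kernel paging request",
)

def classify_boot(log_text):
    """Return ("good" or "bad", per-marker hit counts) for one boot log."""
    hits = {marker: 0 for marker in BAD_BOOT_MARKERS}
    for line in log_text.splitlines():
        for marker in BAD_BOOT_MARKERS:
            if marker in line:
                hits[marker] += 1
    verdict = "bad" if any(hits.values()) else "good"
    return verdict, hits

GOOD_LOG = """\
[   21.896251] [drm] Loading RV710 Microcode
[   21.909170] radeon 0000:01:00.0: WB enabled
[   21.955754] [drm] ring test succeeded in 1 usecs
"""

BAD_LOG = """\
[   21.510435] radeon 0000:01:00.0: limiting VRAM
[   21.511875] [drm] Loading RV710 Microcode
[   21.722682] radeon 0000:01:00.0: Wait for MC idle timedout !
[   21.924181] radeon 0000:01:00.0: Wait for MC idle timedout !
[   21.927041] radeon 0000:01:00.0: WB enabled
[   21.973606] BUG: unable to handle kernel paging request at ffffc90405701ffc
"""

print(classify_boot(GOOD_LOG))
print(classify_boot(BAD_LOG))
```

Running something like this over each boot's log would make it easier to correlate bad boots with other events, such as module load order.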
Comment 14 Alex Deucher 2011-05-26 05:17:18 UTC
(In reply to comment #13)
> On a bad boot however the detected VRAM shown in kern.log is 3584M, which is
> more than the actual amount of "HyperMemory" these graphic cards. The loading
> of the Microcode then crashes:
> [   21.511875] [drm] Loading RV710 Microcode
> [   21.722682] radeon 0000:01:00.0: Wait for MC idle timedout !
> [   21.924181] radeon 0000:01:00.0: Wait for MC idle timedout !
> [   21.927041] radeon 0000:01:00.0: WB enabled
> [   21.973606] BUG: unable to handle kernel paging request at ffffc90405701ffc

It's not the ucode.  It looks like the driver is loading on a gfx card that is not powered up.  As such the driver is not able to initialize the card.  There's an acpi method that's required to power up the discrete card.
Comment 15 Ap. Syvertsen 2011-05-27 05:29:27 UTC
(In reply to comment #14)
> It's not the ucode.  It looks like the driver is loading on a gfx card that is
> not powered up.  As such the driver is not able to initialize the card. 
> There's an acpi method that's required to power up the discrete card.

Is this ACPI method supposed to be run by the kernel prior to the loading of the driver?
Would a failure in this ACPI method call cause the GPU softreset, the VRAM discrepancy and ultimately the failure during the loading of the driver?

I can't find any differences between the acpi-calls in the good and bad kern.log files but then again I'm not a linux developer and don't really know what to look for.
Comment 16 Alex Deucher 2011-05-27 06:56:57 UTC
(In reply to comment #15)

> Is this acpi method supposed to be run by the kernel prior to the loading of
> the driver?
> Would a fail in this acpi method call cause the GPU softreset, the discrepancy
> of the VRAM and ultimately the fail during the loading of the driver?
> 
> I can't find any differences between the acpi-calls in the good and bad
> kern.log files but then again I'm not a linux developer and don't really know
> what to look for.

You need to power up the GPU before the driver loads.  If the GPU is not powered on, you can't detect how much vram it has or initialize the card.
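One rough way for a tester to see whether the discrete GPU is responding before the driver loads is to look at its PCI configuration space: reads from a PCI function that does not respond typically return all ones, so a vendor ID of 0xFFFF suggests the device is not powered or not reachable. The sketch below is only a heuristic under that assumption, not the driver's own check; `looks_powered_off` is a hypothetical helper, and the sysfs path in the usage note assumes the 0000:01:00.0 address seen in the logs.

```python
import struct

def looks_powered_off(config_bytes):
    """Heuristic: True if the PCI vendor ID reads back as all ones (0xFFFF)."""
    if len(config_bytes) < 4:
        raise ValueError("need at least the first 4 bytes of config space")
    # Vendor ID and device ID are the first two little-endian 16-bit fields.
    vendor_id, _device_id = struct.unpack_from("<HH", config_bytes, 0)
    return vendor_id == 0xFFFF

# Illustrative inputs: an unresponsive device, and vendor 0x1002 / device 0x68C1.
print(looks_powered_off(b"\xff\xff\xff\xff"))
print(looks_powered_off(bytes.fromhex("0210c168")))
```

On a running system the bytes could be read with `open("/sys/bus/pci/devices/0000:01:00.0/config", "rb").read(4)`, ideally from the initramfs before the radeon module is loaded.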
Comment 17 Seth Forshee 2011-05-31 18:45:33 UTC
(In reply to comment #16)
> You need to power up the GPU before the driver loads.  If the GPU is not
> powered on, you can't detect how much vram it has or initialize the card.

I added a call to turn on GPU power via the ATPX method from radeon_register_atpx_handler(). Feedback from testers indicates that this isn't fixing the issues.

What's the right way to power on the GPU?
Comment 18 Jeremy Huddleston Sequoia 2011-10-09 17:01:39 UTC
It's been about 5 months since the last comment here.  Is there an update on this issue?
Comment 19 Seth Forshee 2011-10-11 12:16:28 UTC
Many people have reported this as being fixed starting with kernel version 3.0. On the launchpad bug we bisected it down to the following commit, although I'm not sure why this commit fixes the issue.

commit 3448a19da479b6bd1e28e2a2be9fa16c6a6feb39
Author: Dave Airlie <airlied@redhat.com>
Date:   Tue Jun 1 15:32:24 2010 +1000

    vgaarb: use bridges to control VGA routing where possible.
Comment 20 Jeremy Huddleston Sequoia 2011-10-12 10:00:48 UTC
Thanks.  Closing based on the above comment.

