Summary: | [Radeon HD 5650 and 5470] Driver crash during recovery boot and in normal boot (Regression from 2.6.38-3 to -4) | | |
---|---|---|---|
Product: | DRI | Reporter: | Bryce Harrington <bryce> |
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | critical | | |
Priority: | high | CC: | afoglia, hramrach, jeremyhu, mail, nexor, seth.forshee |
Version: | unspecified | Keywords: | regression |
Hardware: | All | | |
OS: | Linux (All) | | |

Description
Bryce Harrington
2011-04-05 14:29:05 UTC
A commenter (Johan Fornander) who believes they have the same (or similar) bug provided these logs:

"""
I have taken logs from a set of ubuntu kernels starting up using the requested kernel boot argument "drm.debug=0x0e":

2.6.38-7 ---> fb0: radeondrmfb frame buffer device
2.6.38-8 ---> fb0: inteldrmfb frame buffer device
2.6.39-rc1 --> kernel oops (pointing to evergreen something), not caught in the logs. It seems to be at a different offset than the one reported above.

Please see the attached files containing dmesg and kern.log for each kernel.

There are some other interesting things in the logs like invalid DSDT and stuff that I will look into further.
"""

Created attachment 45317 [details]
dmesg - 2.6.38-7

Created attachment 45318 [details]
dmesg - 2.6.38-8

kern.log files are also available but too large for bugzilla. They can be obtained from launchpad though if you're interested: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-ati/+bug/727620/+attachment/1985108/+files/3_sets_of_kernel_logs.tar.bz2

This sounds like a problem with vga_switcheroo. Does setting the video to the radeon/discrete in the bios configuration work ok?

Probably a duplicate of this bug: https://bugzilla.kernel.org/show_bug.cgi?id=30052

The reason we believed it to be a regression between -3 and -4 was due to this testing by the original bug reporter:

"""
I tried Bryce's second suggestion of using old kernels. I have two previous versions of 2.6.38 installed, 2.6.38-3-generic and 2.6.38-4-generic. I booted each into recovery and normal mode 4 times, for a total of 16 boots. Here's the number of times the boot was a success, where I either got to the recovery boot menu or gdm (regardless of whether the screen brightness had to be manually increased from 0, or if the plymouth boot screen displayed).

2.6.38-4-generic, normal: 1 success, 3 failures
2.6.38-4-generic, recovery: 4 successes
2.6.38-3-generic, normal: 4 successes
2.6.38-3-generic, recovery: 3 successes, 1 failure

At no time did I see a stack trace like the one I posted, but I've only seen that in recovery mode. (Would it be written somewhere persistent between boots? It's not in /var/log/syslog.)
"""

Others who are reporting as having the same bug indicate that there is a bit of a race condition and sometimes the system loads with the radeon drm driver, sometimes with intel:

From Johan Fornander:
"""
I have a theory... Maybe there is a race condition between the intel and ati driver involved here?

My notebook starts up in two seemingly random configurations, or three including the radeon crash:

1. X server is on VT8 -> unable to unload radeon module because it is in use (by some framebuffer I guess). I am also unable to switch to consoles VT1-7. If I use vga_switcheroo to switch to integrated gpu in this mode then radeon crashes.
2. X server is on VT7 -> I can unload the radeon module and use the consoles VT1-6. vga_switcheroo works and I can also use acpi calls to turn off the gpu.
3. The radeon driver crashes. Forcing reboot through RSEIUB.

This makes it difficult to control the temperature since I cannot know if the radeon module is in use or not (i.e. I might or might not be able to use the vga_switcheroo, or unload the module and use a specific acpi call to shut off the gpu).
"""

@alex, thanks for the pointer, I'll have them test bios settings.

Response from the original reporter:

Bryce Harrington wrote in #25 "Upstream would like to see if setting the video to the radeon/discrete setting in the BIOS configuration makes it function properly."

Answer: Yes it does. I tried that a while ago already, but can't (don't want to) use that for regular running, because the radeon card makes the laptop battery run out too fast. I read also that there's a patched/hacked BIOS somewhere that allows switching off the radeon card via BIOS, but if it can be solved with software I'd prefer that ;)

Chris Halse Rogers wrote in #22: "This does look a lot like some bad interaction between i915/radeon"

I agree - I did some more testing and placed an entry for the radeon module into /etc/initramfs-tools/modules, and voilà - no crash! (After booting into X I can't get back to the text console, but that's possibly another issue.)

What proves the "bad interaction" even more is: when also placing "i915" into /etc/initramfs-tools/modules BEFORE "radeon", booting isn't possible at all any more, but when placing i915 AFTER radeon, booting is possible, but I got a black screen (as in totally black, that is: no backlight) until X starts up.

I'll try to get logs of four different initrd configurations (though I doubt I'll be able to record the one where the crash occurs already during running of initrd...)
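
The ordering test described above depends on which of "radeon" and "i915" comes first in /etc/initramfs-tools/modules. A minimal sketch for checking that order in a copy of the file follows. The helper names are hypothetical, and the parsing assumes the usual layout of one module name per line with optional arguments and "#" comments.

```python
def listed_modules(modules_text):
    """Return module names in file order, skipping blank lines and # comments."""
    names = []
    for line in modules_text.splitlines():
        # Drop anything after a '#', then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            # The first token is the module name; any arguments follow it.
            names.append(line.split()[0])
    return names


def radeon_before_i915(modules_text):
    """True or False when both modules are listed, None when either is missing."""
    names = listed_modules(modules_text)
    if "radeon" not in names or "i915" not in names:
        return None
    return names.index("radeon") < names.index("i915")


if __name__ == "__main__":
    sample = "# illustrative file contents\nradeon\ni915\n"
    print(listed_modules(sample))
    print(radeon_before_i915(sample))
```

This only reports the listed order. It says nothing about whether a given order boots, which in this report varied from machine to machine.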

It's not a bad interaction between i915 and radeon per se. The switcheroo code needs more work to switch properly on some systems it seems. There is a set of acpi methods required to activate/deactivate the respective gpus. The drivers need to load and initialize active hw. If the hw is not active when the driver loads, then the hw is not set up properly and it won't work. Probably some ordering issues in how the switcheroo acpi methods are called.

(In reply to comment #10)
> It's not a bad interaction between i915 and radeon per se. The switcheroo code
> needs more work to switch properly on some systems it seems. There is a set of
> acpi methods required to activate/deactivate the respective gpus. The drivers
> need to load and initialize active hw. If the hw is not active when the driver
> loads, then the hw is not set up properly and it won't work. Probably some
> ordering issues in how the switcheroo acpi methods are called.

Is there anything I can do to help out with this bug? Do you need more logs or info on my hardware setup? I am one of the users reporting this bug against Ubuntu, using an HD5650 + Intel i5 graphics.

(In reply to comment #10)
> It's not a bad interaction between i915 and radeon per se. The switcheroo code
> needs more work to switch properly on some systems it seems. There is a set of
> acpi methods required to activate/deactivate the respective gpus. The drivers
> need to load and initialize active hw. If the hw is not active when the driver
> loads, then the hw is not set up properly and it won't work. Probably some
> ordering issues in how the switcheroo acpi methods are called.

I've been staring at lots of logs from boots that do and don't trigger the oops, and I can't see that the interactions with the switcheroo code are at all related. The oops happens during the radeon probe, and by that point the atpx handler has been registered with vga_switcheroo and possibly radeon_atpx_init() has been called, depending on what else has happened at that point. But whether or not radeon_atpx_init() has been called doesn't correlate with whether or not the oops happens. Nor does the oops correlate with the order in which the i915 and radeon probes happen or anything else I can see from the logs.

I'd love to hear more ideas about what's leading to the inconsistency of the power state at boot, because I'm coming up dry (unless it's just the state inherited from the BIOS or bootloader). Does the driver just need to unconditionally enable power via the ATPX handler during the probe to ensure the probe can continue successfully?

And is it better to carry on this conversation here or on the kernel bugzilla linked to in comment #5?

(In reply to comment #12)

Okay, first of all I'm not a developer but I've been poring over the logs posted at launchpad (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/727620) and this is what I notice:

The crash seems to happen during the loading of the microcode, Redwood for HD5650, RV710 for the HD3650.

On a normal boot the detected VRAM in kern.log is 512M, which is the amount of dedicated graphic memory, and the Microcode loading happens as follows:

[ 21.896251] [drm] Loading RV710 Microcode
[ 21.909170] radeon 0000:01:00.0: WB enabled
[ 21.955754] [drm] ring test succeeded in 1 usecs

On a bad boot however the detected VRAM shown in kern.log is 3584M, which is more than the actual amount of "HyperMemory" these graphics cards have. The loading of the Microcode then crashes:

[ 21.511875] [drm] Loading RV710 Microcode
[ 21.722682] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 21.924181] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 21.927041] radeon 0000:01:00.0: WB enabled
[ 21.973606] BUG: unable to handle kernel paging request at ffffc90405701ffc

The two "Wait for MC idle timedout !" messages and the amount of VRAM detected are very consistent with the crashing. On a bad boot, the following line is also displayed right before the detected VRAM is displayed for the first time (never on a good boot):

[ 21.510435] radeon 0000:01:00.0: limiting VRAM

On my HD5650 with Redwood Microcode, I also get a GPU softreset after the ATOM bios is loaded, right before the limiting VRAM event.

Is there any special order of loading modules, either related to the kernel or the driver, that would cause this chain of events and discrepancy?

Feel free to ask if you need more info.

(In reply to comment #13)
> On a bad boot however the detected VRAM shown in kern.log is 3584M, which is
> more than the actual amount of "HyperMemory" these graphics cards have. The
> loading of the Microcode then crashes:
> [ 21.511875] [drm] Loading RV710 Microcode
> [ 21.722682] radeon 0000:01:00.0: Wait for MC idle timedout !
> [ 21.924181] radeon 0000:01:00.0: Wait for MC idle timedout !
> [ 21.927041] radeon 0000:01:00.0: WB enabled
> [ 21.973606] BUG: unable to handle kernel paging request at ffffc90405701ffc

It's not the ucode. It looks like the driver is loading on a gfx card that is not powered up. As such the driver is not able to initialize the card. There's an acpi method that's required to power up the discrete card.

(In reply to comment #14)
> It's not the ucode. It looks like the driver is loading on a gfx card that is
> not powered up. As such the driver is not able to initialize the card.
> There's an acpi method that's required to power up the discrete card.

Is this acpi method supposed to be run by the kernel prior to the loading of the driver?
Would a failure in this acpi method call cause the GPU softreset, the discrepancy of the VRAM and ultimately the failure during the loading of the driver?

I can't find any differences between the acpi calls in the good and bad kern.log files but then again I'm not a linux developer and don't really know what to look for.
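
The good-boot and bad-boot signatures described in comment #13 can be checked mechanically when comparing many kern.log files. The sketch below is a minimal illustration, not a tool used in this report. The function names are hypothetical, the marker strings are copied from the excerpts quoted above, and a match only flags a log for closer reading.

```python
# Marker strings taken from the log excerpts quoted in this report.
BAD_MARKERS = (
    "limiting VRAM",
    "Wait for MC idle timedout",
    "BUG: unable to handle kernel paging request",
)

GOOD_SAMPLE = """\
[ 21.896251] [drm] Loading RV710 Microcode
[ 21.909170] radeon 0000:01:00.0: WB enabled
[ 21.955754] [drm] ring test succeeded in 1 usecs
"""

BAD_SAMPLE = """\
[ 21.510435] radeon 0000:01:00.0: limiting VRAM
[ 21.511875] [drm] Loading RV710 Microcode
[ 21.722682] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 21.924181] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 21.927041] radeon 0000:01:00.0: WB enabled
[ 21.973606] BUG: unable to handle kernel paging request at ffffc90405701ffc
"""


def bad_boot_markers(log_text):
    """Return the bad-boot markers found in log_text, in BAD_MARKERS order."""
    return [marker for marker in BAD_MARKERS if marker in log_text]


def looks_like_bad_boot(log_text):
    """True when at least one bad-boot marker is present."""
    return bool(bad_boot_markers(log_text))


if __name__ == "__main__":
    for name, text in (("good", GOOD_SAMPLE), ("bad", BAD_SAMPLE)):
        print(name, looks_like_bad_boot(text), bad_boot_markers(text))
```

To scan a real log, read the file into a string and pass it to `looks_like_bad_boot`.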

(In reply to comment #15)
> Is this acpi method supposed to be run by the kernel prior to the loading of
> the driver?
> Would a fail in this acpi method call cause the GPU softreset, the discrepancy
> of the VRAM and ultimately the fail during the loading of the driver?
>
> I can't find any differences between the acpi-calls in the good and bad
> kern.log files but then again I'm not a linux developer and don't really know
> what to look for.

You need to power up the GPU before the driver loads. If the GPU is not powered on, you can't detect how much vram it has or initialize the card.

(In reply to comment #16)
> You need to power up the GPU before the driver loads. If the GPU is not
> powered on, you can't detect how much vram it has or initialize the card.

I added a call to turn on GPU power via the ATPX method from radeon_register_atpx_handler(). Feedback from testers indicates that this isn't fixing the issues. What's the right way to power on the GPU?

It's been about 5 months now from the last comment here. Is there an update on this issue?

Many people have reported this as being fixed starting with kernel version 3.0. On the launchpad bug we bisected it down to the following commit, although I'm not sure why this commit fixes the issue.

commit 3448a19da479b6bd1e28e2a2be9fa16c6a6feb39
Author: Dave Airlie <airlied@redhat.com>
Date: Tue Jun 1 15:32:24 2010 +1000

    vgaarb: use bridges to control VGA routing where possible.

Thanks. Closing based on the above comment.