This bug originated in Ubuntu's Bugzilla (http://bugzilla.ubuntu.com/show_bug.cgi?id=14055) and has since been reproduced with CVS HEAD.

I'm running bleeding-edge Xorg (server only), drm, and Mesa on an Ubuntu Hoary Hedgehog system (X.org release 6.8.2) on a Dell 2350 with a PCI/DVI+VGA PowerColor Radeon 9250 card.

The Xorg server locks up hard, ignoring all console input. This happens in three different scenarios:

1. On startup (Xorg CVS only, always on first bootup)
2. While using a maximized vncviewer(1) window (observed twice, very haphazard)
3. While a GL screen saver was running (observed only once)

When this occurs, I can log in remotely and observe the server pegging the CPU. I have obtained backtraces and core dumps from the server, and found the following two places in the code where it locks up:

1. ioctl() call in (xf86drm.c)drmDMA(), for the vncviewer lockup
2. ioctl() call in (xf86drm.c)drmCommandNone(), for all other instances. The drmCommandIndex argument is always 4.

Two other potential clues:

1. Using the bleeding-edge Xorg/Mesa/drm: When I switch to a virtual terminal and shut down kdm, I get the following kernel messages:

    kernel: [drm:drm_ati_pcigart_cleanup] *ERROR* no scatter/gather memory!
    kernel: [drm:radeon_do_cleanup_cp] *ERROR* failed to cleanup PCI GART!

2. When I run "Xorg-cvs -probeonly" in single-user mode, I see the following output (after the standard boilerplate):

    (==) Log file: "/var/log/Xorg.0.log", Time: Tue Aug 30 18:48:39 2005
    (==) Using config file: "/etc/X11/xorg.conf"
    (WW) RADEON: No matching Device section for instance (BusID PCI:1:4:1) found
    (WW) ****INVALID IO ALLOCATION**** b: 0xc000 e: 0xc100 correcting

(The first warning is harmless, as far as I know; I've always seen it. The second one is worrisome.)

the warning in 2 is harmless. pcigart is known-broken, I think ...
Created attachment 3235: Kernel log debug output from CVS "drm" module

With a recent CVS pull of Xorg+Mesa+DRM, loading the "drm" kernel module with "debug=1" no longer scares the bug away. Here is a grep of my kernel log with all DRM-related messages, cut off at 400 lines (once it's clear that something borked up).

Okay. I've found something interesting. (The bug in question here is the 99%-CPU hang when the X server starts up, not the original one encountered during normal use. The two may or may not be related, but damn, the startup one was getting on my nerves.)

Some experimentation determined that the bug came into being sometime between the XORG-6_8_99_16 and XORG-6_8_99_900 branches. Trial and error narrowed that down to exactly this change in radeon_driver.c:

    revision 1.64
    date: 2005-07-29 19:45:14 +0000; author: daenzer; state: Exp; lines: +9 -1; commitid: 772742ea86db4567;
    * programs/Xserver/hw/xfree86/drivers/ati/radeon_driver.c: (RADEONSetFBLocation):
      bugzilla #3911 (https://bugs.freedesktop.org/show_bug.cgi?id=3911)
      attachment #3191 (http://bugs.freedesktop.org/attachment.cgi?id=3191)
      Disable bus mastering while updating MC_FB_LOCATION and friends to prevent the X server from hanging on startup every now and then under some circumstances. (ATI Technologies Inc.)

If I remove just the following two lines, at the top of RADEONSetFBLocation() in CVS-head...

    OUTREG (RADEON_BUS_CNTL, bus_cntl | RADEON_BUS_MASTER_DIS);
    RADEONWaitForIdleMMIO(pScrn);

the X server ceases to hang on startup. Reliably.

Something about disabling bus-mastering seems to make my machine unhappy :-( Ironic that the cause of this bug was a fix for the same problem... the only clue I can offer is the slightly oddball bus configuration on this Dell. (Intel AGP onboard video, no AGP slot, Radeon PCI card. There was a kernel bug, recently fixed [2.6.13?] which caused the machine to lock hard on inserting the intel-agp.ko module after the PCI card was installed.)

[Benjamin Herrenschmidt should be interested in this one, adding him and Jon Chaplick to CC list]

Note that problems 1, 2 and 3 from your original report are most likely separate problems. The symptoms you're describing are common symptoms of a GPU lockup, which can be due to any of a huge number of causes. So, I suggest sticking to problem 1 for this entry.

Does it also occur if you don't enable the DRI? I'm wondering if we shouldn't manipulate the location of the PCI GART instead of the AGP GART in that function when using the former...

ACK on focusing on problem 1; the other two are extremely difficult to reproduce anyway. (I was hoping the three might be related, but, perhaps not.) I prevented drm.ko/radeon.ko from loading, and sure enough, an unmodified CVS-head X server is able to start without hanging.
Well, disabling bus mastering when a PCI GART is set up is indeed a bad idea ... that might explain a similar lockup on launch that I've been experiencing here. We should at least disable the PCI GART before disabling BM. I'll have a look later this week.
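To make the ordering suggested above concrete, here is a minimal sketch against a mock register file. It is not the actual driver code and not the patch that was later attached; the register indices, bit positions, and the names outreg() and set_fb_location() are hypothetical stand-ins chosen for illustration only. It assumes the invariant is simply "never set the bus-master-disable bit while PCI GART translation is still enabled".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock register file. The indices below are array slots,
 * not real MMIO offsets, and the bit positions are illustrative. */
enum { REG_BUS_CNTL, REG_AIC_CNTL, REG_MC_FB_LOCATION, REG_COUNT };

#define BUS_MASTER_DIS       (1u << 6)  /* assumed: bus-master disable bit */
#define PCIGART_TRANSLATE_EN (1u << 0)  /* assumed: PCI GART enable bit */

static uint32_t regs[REG_COUNT];

/* Mock register write that enforces the ordering under discussion:
 * bus mastering may only be disabled once PCI GART translation is off. */
static void outreg(int reg, uint32_t val)
{
    if (reg == REG_BUS_CNTL && (val & BUS_MASTER_DIS))
        assert(!(regs[REG_AIC_CNTL] & PCIGART_TRANSLATE_EN));
    regs[reg] = val;
}

/* Update the framebuffer location with the GART quiesced first. */
static void set_fb_location(uint32_t fb_location)
{
    uint32_t bus_cntl = regs[REG_BUS_CNTL];
    uint32_t aic_cntl = regs[REG_AIC_CNTL];

    outreg(REG_AIC_CNTL, aic_cntl & ~PCIGART_TRANSLATE_EN); /* 1. GART off */
    outreg(REG_BUS_CNTL, bus_cntl | BUS_MASTER_DIS);        /* 2. BM off */
    outreg(REG_MC_FB_LOCATION, fb_location);                /* 3. update */
    outreg(REG_BUS_CNTL, bus_cntl);                         /* 4. restore BM */
    outreg(REG_AIC_CNTL, aic_cntl);                         /* 5. restore GART */
}
```

Swapping steps 1 and 2 in set_fb_location() trips the assertion in outreg(), which is the ordering problem described above. Whether the real hardware needs exactly this sequence is not something the mock can show.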
(In reply to comment #6)
> Well, disabling bus mastering when a PCI GART is setup is indeed a bad idea ...

Why, what kind of PCI GART transfer would you expect to happen during the execution of that function? My suspicion rather lies with the fact that it updates RADEON_MC_AGP_LOCATION but not the corresponding PCI GART register(s).

> that might explain a similar lockup on launch that I've been experiencing here.

I was actually thinking it might be the very same problem, that's why I added you to the CC list, do you think that's not the case?

Ben, can you attach your proposed fix for this for people to test?
https://bugs.freedesktop.org/attachment.cgi?id=3620 (Bug 3911)

Note that my fix might be breaking surface management under some circumstances, I have to rework that part (which is why I didn't commit it yet). I'll come up with a new patch this weekend, hopefully.

Ben committed his patch to xf86-video-ati CVS, so this might be fixed. Please verify.
Thanks a lot. I applied your modifications to "xorg-x11-6.8.2-37.FC4.49.2" on my FC4 box. Before, I could not boot into graphical mode unless I disabled "DRI" for this R100 QD based Radeon 7200 PCI card. The "X" server would hang and not even respond to the keyboard anymore. Now, everything works as expected. This is great news. I had posted the bug report for modular "Xorg" 7.0.0, where the issue is also present, but later reverted my system to FC4 and discovered that the RV100 bus master fix patch of the above FC4 update broke "DRI". For details, see: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180150
Apologies for the delay; I had to deal with the new modular build system for the first time. (Thank goodness for build.sh!) With a current CVS build of Xorg, I am no longer able to reproduce the original bug. The X server is starting up without a hitch through multiple reboots. Either the problem is fixed, or it's a lot scarcer than it used to be :-) (Resolving to FIXED; I trust everything is in order?)
Monolithic X still has the bug. But after battling with modular X (there is now a considerable chunk missing in my desk) I have now been running 7 xine instances at once (which used to bring down X in about 5 seconds) for approx. 30 minutes. BTW this is Debian Sid with modular CVS HEAD as of today.