Bugzilla – Bug 4324
X locks up (99% CPU) w/ drm; disabling bus mastering apparent cause
Last modified: 2006-03-12 06:02:22 UTC
This bug originated in Ubuntu's Bugzilla
(http://bugzilla.ubuntu.com/show_bug.cgi?id=14055) and has since been
reproduced with CVS HEAD.
I'm running bleeding-edge Xorg (server only), drm, and Mesa on an Ubuntu Hoary
Hedgehog system (X.org release 6.8.2) on a Dell 2350 with a PCI/DVI+VGA
PowerColor Radeon 9250 card.
The Xorg server locks up hard, ignoring all console input. This happens in
three different scenarios:
1. On startup (Xorg CVS only, always on first bootup)
2. While using a maximized vncviewer(1) window (observed twice, very haphazard)
3. While a GL screen saver was running (observed only once)
When this occurs, I can log in remotely and observe the server pegging the CPU.
I have obtained backtraces and core dumps from the server, and found the
following two places in the code where it locks up:
1. ioctl() call in (xf86drm.c)drmDMA(), for the vncviewer lockup
2. ioctl() call in (xf86drm.c)drmCommandNone(), for all other instances. The
drmCommandIndex argument is always 4.
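For reference, a small lookup of the radeon DRM command indices makes the second backtrace easier to read. The table below is a sketch from memory of radeon_drm.h in the drm tree, so verify the values against your own checkout; radeon_cmd_name() is a hypothetical helper, not part of libdrm. If the table is right, index 4 is DRM_RADEON_CP_IDLE, meaning the server is waiting for the command processor to go idle, which would be consistent with a wedged GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Command indices as recalled from radeon_drm.h (assumption: check your tree). */
struct radeon_cmd {
    int index;
    const char *name;
};

static const struct radeon_cmd radeon_cmds[] = {
    { 0x00, "DRM_RADEON_CP_INIT"  },
    { 0x01, "DRM_RADEON_CP_START" },
    { 0x02, "DRM_RADEON_CP_STOP"  },
    { 0x03, "DRM_RADEON_CP_RESET" },
    { 0x04, "DRM_RADEON_CP_IDLE"  },
    { 0x05, "DRM_RADEON_RESET"    },
};

/* Hypothetical helper: map a drmCommandIndex to a readable name. */
static const char *radeon_cmd_name(int index)
{
    size_t i;

    for (i = 0; i < sizeof(radeon_cmds) / sizeof(radeon_cmds[0]); i++) {
        if (radeon_cmds[i].index == index)
            return radeon_cmds[i].name;
    }
    return "unknown";
}
```

Calling radeon_cmd_name(4) on the index seen in the backtrace gives the name to search for in the kernel-side radeon code.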
Two other potential clues:
1. Using the bleeding-edge Xorg/Mesa/drm: When I switch to a virtual terminal
and shut down kdm, I get the following kernel messages:
kernel: [drm:drm_ati_pcigart_cleanup] *ERROR* no scatter/gather memory!
kernel: [drm:radeon_do_cleanup_cp] *ERROR* failed to cleanup PCI GART!
2. When I run "Xorg-cvs -probeonly" in single-user mode, I see the following
output (after the standard boilerplate):
(==) Log file: "/var/log/Xorg.0.log", Time: Tue Aug 30 18:48:39 2005
(==) Using config file: "/etc/X11/xorg.conf"
(WW) RADEON: No matching Device section for instance (BusID PCI:1:4:1) found
(WW) ****INVALID IO ALLOCATION**** b: 0xc000 e: 0xc100 correcting
(The first warning is harmless, as far as I know; I've always seen it. The
second one is worrisome.)
The warning in 2 is harmless. pcigart is known-broken, I think ...
Created attachment 3235
Kernel log debug output from CVS "drm" module
With a recent CVS pull of Xorg+Mesa+DRM, loading the "drm" kernel module with
"debug=1" no longer scares the bug away. Here is a grep of my kernel log with
all DRM-related messages, cut off at 400 lines (once it's clear that something
Okay. I've found something interesting.
(The bug in question here is the 99%-CPU hang when the X server starts up, not
the original one encountered during normal use. The two may or may not be
related, but damn, the startup one was getting on my nerves.)
Some experimentation determined that the bug came into being sometime between
the XORG-6_8_99_16 and XORG-6_8_99_900 branches. Trial and error narrowed that
down to exactly this change in radeon_driver.c:
date: 2005-07-29 19:45:14 +0000; author: daenzer; state: Exp; lines: +9 -1;
bugzilla #3911 (https://bugs.freedesktop.org/show_bug.cgi?id=3911)
attachment #3191 (http://bugs.freedesktop.org/attachment.cgi?id=3191)
Disable bus mastering while updating MC_FB_LOCATION and friends to
prevent the X server from hanging on startup every now and then
under some circumstances. (ATI Technologies Inc.)
If I remove just the following two lines, at the top of RADEONSetFBLocation()
OUTREG (RADEON_BUS_CNTL, bus_cntl | RADEON_BUS_MASTER_DIS);
the X server ceases to hang on startup. Reliably. Something about disabling
bus-mastering seems to make my machine unhappy :-(
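The pattern that commit introduced can be sketched against a mock register file. This is only an illustration: inreg()/outreg(), the register indices, and the MOCK_BUS_MASTER_DIS bit value are stand-ins for the driver's INREG/OUTREG macros and the radeon_reg.h definitions, and the restore step is my assumption about what brackets the update. No hardware is touched.

```c
#include <assert.h>
#include <stdint.h>

/* Mock register file; indices and bit value are placeholders, not real offsets. */
enum mock_reg { MOCK_BUS_CNTL, MOCK_MC_FB_LOCATION, MOCK_REG_COUNT };
#define MOCK_BUS_MASTER_DIS (1u << 6)

#define MOCK_LOG_MAX 16
struct mock_write {
    enum mock_reg reg;
    uint32_t value;
};

static uint32_t mock_regs[MOCK_REG_COUNT];
static struct mock_write mock_log[MOCK_LOG_MAX];
static int mock_log_len;

static uint32_t inreg(enum mock_reg reg)
{
    return mock_regs[reg];
}

/* Store the value and record the write so the ordering can be inspected. */
static void outreg(enum mock_reg reg, uint32_t value)
{
    mock_regs[reg] = value;
    if (mock_log_len < MOCK_LOG_MAX) {
        mock_log[mock_log_len].reg = reg;
        mock_log[mock_log_len].value = value;
        mock_log_len++;
    }
}

/* Sketch of the bracketed update; disable_bm = 0 models running without it. */
static void mock_set_fb_location(uint32_t fb_location, int disable_bm)
{
    uint32_t bus_cntl = inreg(MOCK_BUS_CNTL);

    if (disable_bm)
        outreg(MOCK_BUS_CNTL, bus_cntl | MOCK_BUS_MASTER_DIS);

    outreg(MOCK_MC_FB_LOCATION, fb_location);

    if (disable_bm)
        outreg(MOCK_BUS_CNTL, bus_cntl);  /* assumed restore of the saved value */
}
```

With disable_bm set, the log shows three writes (disable, update, restore); with it cleared, only the MC_FB_LOCATION write remains, which is the configuration that stopped hanging on this machine.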
Ironic that the cause of this bug was a fix for the same problem... the only
clue I can offer is the slightly oddball bus configuration on this Dell. (Intel
AGP onboard video, no AGP slot, Radeon PCI card. There was a kernel bug,
recently fixed [2.6.13?], which caused the machine to lock hard on inserting the
intel-agp.ko module after the PCI card was installed.)
[Benjamin Herrenschmidt should be interested in this one, adding him and Jon
Chaplick to CC list]
Note that problems 1, 2 and 3 from your original report are most likely separate
problems. The symptoms you're describing are common symptoms for a GPU lockup,
which can be due to any of a huge number of causes.
So, I suggest sticking to problem 1 for this entry. Does it also occur if you
don't enable the DRI? I'm wondering if we shouldn't manipulate the location of
the PCI GART instead of the AGP GART in that function when using the former...
ACK on focusing on problem 1; the other two are extremely difficult to
reproduce anyway. (I was hoping the three might be related, but, perhaps not.)
I prevented drm.ko/radeon.ko from loading, and sure enough, an unmodified
CVS-head X server is able to start without hanging.
Well, disabling bus mastering when a PCI GART is set up is indeed a bad idea ...
that might explain a similar lockup on launch that I've been experiencing here.
We should at least disable the PCI GART before disabling BM.
I'll have a look later this week.
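The ordering proposed in this comment (disable the PCI GART before disabling bus mastering) can be sketched as a small state model. All names are hypothetical and this is not the eventual patch; it only shows the invariant being argued for, namely that bus mastering is never turned off while PCI GART translation is still on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state discussed above. */
struct gpu_state {
    bool pcigart_enabled;
    bool bus_master_enabled;
};

/* The invariant: bus mastering may only be off while the PCI GART is off. */
static bool state_is_safe(const struct gpu_state *s)
{
    return s->bus_master_enabled || !s->pcigart_enabled;
}

/* Turn off the PCI GART first, then bus mastering; check after each step. */
static bool quiesce_gart_then_bm(struct gpu_state *s)
{
    s->pcigart_enabled = false;
    if (!state_is_safe(s))
        return false;
    s->bus_master_enabled = false;
    return state_is_safe(s);
}

/* The suspected order: bus mastering off with the GART still on. */
static bool quiesce_bm_only(struct gpu_state *s)
{
    s->bus_master_enabled = false;
    return state_is_safe(s);
}
```

Starting from a state with both enabled, quiesce_gart_then_bm() keeps the invariant at every step, while quiesce_bm_only() violates it immediately. Whether that violation is what actually hangs the hardware is exactly what is still under discussion here.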
(In reply to comment #6)
> Well, disabling bus mastering when a PCI GART is set up is indeed a bad idea ...
Why, what kind of PCI GART transfer would you expect to happen during the
execution of that function? My suspicion rather lies with the fact that it
updates RADEON_MC_AGP_LOCATION but not the corresponding PCI GART register(s).
> that might explain a similar lockup on launch that I've been experiencing here.
I was actually thinking it might be the very same problem; that's why I added
you to the CC list. Do you think that's not the case?
Ben, can you attach your proposed fix for this for people to test?
Note that my fix might break surface management under some circumstances;
I have to rework that part (which is why I haven't committed it yet). I'll
hopefully come up with a new patch this weekend.
Ben committed his patch to xf86-video-ati CVS, so this might be fixed. Please
Thanks a lot. I applied your modifications to "xorg-x11-6.8.2-37.FC4.49.2"
on my FC4 box. Before, I could not boot into graphical mode unless I
disabled "DRI" for this R100 QD based Radeon 7200 PCI card. The "X" server
would hang and not even respond to the keyboard anymore. Now, everything
works as expected. This is great news.
I had posted the bug report for modular "Xorg" 7.0.0 where the issue is
also present, but reverted my system later to FC4 and discovered that the
RV100 bus master fix patch of the above FC4 update broke "DRI". For details,
Apologies for the delay; I had to deal with the new modular build system for
the first time. (Thank goodness for build.sh!)
With a current CVS build of Xorg, I am no longer able to reproduce the original
bug. The X server is starting up without a hitch through multiple reboots.
Either the problem is fixed, or it's a lot scarcer than it used to be :-)
(Resolving to FIXED; I trust everything is in order?)
Monolithic X still has the bug.
But after battling with modular X (there is now a considerable chunk missing
from my desk), I have been running 7 xine instances at once (which previously
brought down X in about 5 seconds) for approx. 30 minutes.
BTW this is Debian Sid with modular CVS HEAD as of today.