Bug 4324

Summary: X locks up (99% CPU) w/ drm; disabling bus mastering apparent cause
Product: xorg
Component: Driver/Radeon
Reporter: Daniel Richard G. <skunk>
Assignee: Xorg Project Team <xorg-team>
Status: RESOLVED FIXED
QA Contact:
Severity: critical
Priority: high
CC: airlied, bd, benh, chaplick, michel, pierre42d
Version: git
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
Attachments:
  Kernel log debug output from CVS "drm" module (flags: none)

Description Daniel Richard G. 2005-08-31 10:44:49 UTC
This bug originated in Ubuntu's Bugzilla   
(http://bugzilla.ubuntu.com/show_bug.cgi?id=14055) and has since been   
reproduced with CVS HEAD.  
  
I'm running bleeding-edge Xorg (server only), drm, and Mesa on an Ubuntu Hoary  
Hedgehog system (X.org release 6.8.2) on a Dell 2350 with a PCI/DVI+VGA  
PowerColor Radeon 9250 card.  
  
The Xorg server locks up hard, ignoring all console input. This happens in  
three different scenarios:  
  
1. On startup (Xorg CVS only, always on first bootup)  
2. While using a maximized vncviewer(1) window (observed twice, very haphazard)  
3. While a GL screen saver was running (observed only once)  
  
When this occurs, I can login remotely, and observe the server pegging the CPU.  
I have obtained backtraces and core dumps from the server, and found the  
following two places in the code where it locks up:  
  
1. ioctl() call in (xf86drm.c)drmDMA(), for the vncviewer lockup  
2. ioctl() call in (xf86drm.c)drmCommandNone(), for all other instances. The  
drmCommandIndex argument is always 4.  
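
For readers decoding the second backtrace: drmCommandNone() in xf86drm.c adds DRM_COMMAND_BASE to the index it is given and issues the result as a no-argument ioctl. The sketch below shows that arithmetic and looks the index up in a small name table. The table and both helper functions are illustrative only, and the numbering is an assumption based on radeon_drm.h of this period, so it should be checked against the headers actually in use. Under that assumption, index 4 would be the "wait for CP idle" request.

```c
#include <stddef.h>
#include <string.h>

/* Assumed value, mirroring drm.h: first driver-private ioctl number. */
#define DRM_COMMAND_BASE 0x40

/* Assumed numbering of the first few radeon driver-private commands
 * (as in radeon_drm.h of this era). Verify against your own headers. */
static const char *const radeon_cmd_names[] = {
    "DRM_RADEON_CP_INIT",   /* 0 */
    "DRM_RADEON_CP_START",  /* 1 */
    "DRM_RADEON_CP_STOP",   /* 2 */
    "DRM_RADEON_CP_RESET",  /* 3 */
    "DRM_RADEON_CP_IDLE",   /* 4 */
    "DRM_RADEON_RESET",     /* 5 */
};

/* Hypothetical helper: the ioctl "nr" field that drmCommandNone()
 * would request for a given drmCommandIndex. */
unsigned int drm_command_nr(unsigned int drmCommandIndex)
{
    return DRM_COMMAND_BASE + drmCommandIndex;
}

/* Hypothetical helper: symbolic name for an index, if it is in the table. */
const char *radeon_cmd_name(unsigned int drmCommandIndex)
{
    size_t n = sizeof radeon_cmd_names / sizeof radeon_cmd_names[0];
    return drmCommandIndex < n ? radeon_cmd_names[drmCommandIndex]
                               : "(unknown)";
}
```

Calling radeon_cmd_name(4) and drm_command_nr(4) with this table gives the name and ioctl number to search for in the kernel-side radeon driver.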
  
Two other potential clues:  
  
1. Using the bleeding-edge Xorg/Mesa/drm: When I switch to a virtual terminal  
and shut down kdm, I get the following kernel messages:  
  
  kernel: [drm:drm_ati_pcigart_cleanup] *ERROR* no scatter/gather memory!  
  kernel: [drm:radeon_do_cleanup_cp] *ERROR* failed to cleanup PCI GART!  
  
2. When I run "Xorg-cvs -probeonly" in single-user mode, I see the following  
output (after the standard boilerplate):  
 
  (==) Log file: "/var/log/Xorg.0.log", Time: Tue Aug 30 18:48:39 2005  
  (==) Using config file: "/etc/X11/xorg.conf"  
  (WW) RADEON: No matching Device section for instance (BusID PCI:1:4:1) found  
  (WW) ****INVALID IO ALLOCATION**** b: 0xc000 e: 0xc100 correcting  
 
(The first warning is harmless, as far as I know; I've always seen it. The  
second one is worrisome.)
Comment 1 Daniel Stone 2005-09-06 21:43:25 UTC
The warning in 2 is harmless. pcigart is known-broken, I think ...
Comment 2 Daniel Richard G. 2005-09-12 10:09:22 UTC
Created attachment 3235
Kernel log debug output from CVS "drm" module

With a recent CVS pull of Xorg+Mesa+DRM, loading the "drm" kernel module with
"debug=1" no longer scares the bug away. Here is a grep of my kernel log with
all DRM-related messages, cut off at 400 lines (once it's clear that something
borked up).
Comment 3 Daniel Richard G. 2005-09-19 18:38:02 UTC
Okay. I've found something interesting. 
 
(The bug in question here is the 99%-CPU hang when the X server starts up, not 
the original one encountered during normal use. The two may or may not be 
related, but damn, the startup one was getting on my nerves.) 
 
Some experimentation determined that the bug came into being sometime between 
the XORG-6_8_99_16 and XORG-6_8_99_900 branches. Trial and error narrowed that 
down to exactly this change in radeon_driver.c: 
 
revision 1.64 
date: 2005-07-29 19:45:14 +0000;  author: daenzer;  state: Exp;  lines: +9 -1; 
commitid: 772742ea86db4567; 
        * programs/Xserver/hw/xfree86/drivers/ati/radeon_driver.c: 
        (RADEONSetFBLocation): 
        bugzilla #3911 (https://bugs.freedesktop.org/show_bug.cgi?id=3911) 
        attachment #3191 (http://bugs.freedesktop.org/attachment.cgi?id=3191) 
        Disable bus mastering while updating MC_FB_LOCATION and friends to 
        prevent the X server from hanging on startup every now and then 
        under some circumstances. (ATI Technologies Inc.) 
 
If I remove just the following two lines, at the top of RADEONSetFBLocation() 
in CVS-head... 
 
    OUTREG (RADEON_BUS_CNTL, bus_cntl | RADEON_BUS_MASTER_DIS); 
    RADEONWaitForIdleMMIO(pScrn); 
 
the X server ceases to hang on startup. Reliably. Something about disabling 
bus-mastering seems to make my machine unhappy :-( 
 
Ironic that the cause of this bug was a fix for the same problem... the only 
clue I can offer is the slightly oddball bus configuration on this Dell. (Intel 
AGP onboard video, no AGP slot, Radeon PCI card. There was a kernel bug, 
recently fixed [2.6.13?] which caused the machine to lock hard on inserting the 
intel-agp.ko module after the PCI card was installed.) 
Comment 4 Michel Dänzer 2005-09-20 10:57:20 UTC
[Benjamin Herrenschmidt should be interested in this one, adding him and Jon
Chaplick to CC list]

Note that problems 1, 2 and 3 from your original report are most likely separate
problems. The symptoms you're describing are common symptoms for a GPU lockup,
which can be due to any of a huge number of causes.

So, I suggest sticking to problem 1. for this entry. Does it also occur if you
don't enable the DRI? I'm wondering if we shouldn't manipulate the location of
the PCI GART instead of the AGP GART in that function when using the former...
Comment 5 Daniel Richard G. 2005-09-20 14:35:12 UTC
ACK on focusing on problem 1; the other two are extremely difficult to  
reproduce anyway. (I was hoping the three might be related, but, perhaps not.)  
  
I prevented drm.ko/radeon.ko from loading, and sure enough, an unmodified  
CVS-head X server is able to start without hanging.  
Comment 6 Benjamin Herrenschmidt 2005-09-20 15:53:28 UTC
Well, disabling bus mastering when a PCI GART is set up is indeed a bad idea ...
that might explain a similar lockup on launch that I've been experiencing here.
We should at least disable the PCI GART before disabling BM.

I'll have a look later this week.
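
The ordering suggested above can be sketched in C as follows. This is only an illustration against a simulated register file, not the patch that was later attached or committed. The register offsets and bit names are assumed to match radeon_reg.h, while fake_mmio, reg_read, reg_write, write_log and set_fb_location_sketch are hypothetical names introduced for the example. Restoring the saved values at the end is also an assumption. A real driver would additionally wait for the engine to go idle (RADEONWaitForIdleMMIO) after disabling bus mastering.

```c
#include <stdint.h>

/* Offsets and bits assumed from radeon_reg.h; treat as illustrative. */
#define RADEON_BUS_CNTL              0x0030
#define RADEON_BUS_MASTER_DIS        (1u << 6)
#define RADEON_MC_FB_LOCATION        0x0148
#define RADEON_AIC_CNTL              0x01d0
#define RADEON_PCIGART_TRANSLATE_EN  (1u << 0)

/* Hypothetical stand-in for the MMIO aperture, plus a log of the
 * registers written so the ordering can be inspected afterwards. */
#define LOG_MAX 16
static uint32_t fake_mmio[0x200 / 4];
static uint32_t write_log[LOG_MAX];
static int write_count;

uint32_t reg_read(uint32_t reg)
{
    return fake_mmio[reg / 4];
}

void reg_write(uint32_t reg, uint32_t val)
{
    fake_mmio[reg / 4] = val;
    if (write_count < LOG_MAX)
        write_log[write_count++] = reg;
}

/* Sketch: turn off PCI GART translation first, then disable bus
 * mastering, update the FB location, and restore both in reverse order. */
void set_fb_location_sketch(uint32_t mc_fb_location, int using_pci_gart)
{
    uint32_t bus_cntl = reg_read(RADEON_BUS_CNTL);
    uint32_t aic_cntl = reg_read(RADEON_AIC_CNTL);

    if (using_pci_gart)
        reg_write(RADEON_AIC_CNTL, aic_cntl & ~RADEON_PCIGART_TRANSLATE_EN);

    reg_write(RADEON_BUS_CNTL, bus_cntl | RADEON_BUS_MASTER_DIS);
    /* A real driver would wait for idle here before touching the MC. */
    reg_write(RADEON_MC_FB_LOCATION, mc_fb_location);

    reg_write(RADEON_BUS_CNTL, bus_cntl);      /* re-enable bus mastering */
    if (using_pci_gart)
        reg_write(RADEON_AIC_CNTL, aic_cntl);  /* re-enable the PCI GART */
}
```

With using_pci_gart set, the write log shows RADEON_AIC_CNTL touched before RADEON_BUS_CNTL, which is the point of the suggestion. Whether this ordering alone avoids the hang on real hardware is not established by the sketch.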
Comment 7 Michel Dänzer 2005-09-22 22:11:41 UTC
(In reply to comment #6)
> Well, disabling bus mastering when a PCI GART is setup is indeed a bad idea ...

Why, what kind of PCI GART transfer would you expect to happen during the
execution of that function? My suspicion rather lies with the fact that it
updates RADEON_MC_AGP_LOCATION but not the corresponding PCI GART register(s).

> that might explain a similar lockup on launch that I've been experiencing here.

I was actually thinking it might be the very same problem; that's why I added
you to the CC list. Do you think that's not the case?
Comment 8 Michel Dänzer 2005-11-03 23:56:11 UTC
Ben, can you attach your proposed fix for this for people to test?
Comment 9 Benjamin Herrenschmidt 2005-11-04 03:05:43 UTC
https://bugs.freedesktop.org/attachment.cgi?id=3620

(Bug 3911)

Note that my fix might be breaking surface management under some circumstances,
I have to rework that part (which is why I didn't commit it yet). I'll come up
with a new patch this weekend, hopefully.
Comment 10 Michel Dänzer 2006-02-20 21:42:58 UTC
Ben committed his patch to xf86-video-ati CVS, so this might be fixed. Please
verify.
Comment 11 Joachim Frieben 2006-02-21 00:53:34 UTC
Thanks a lot. I applied your modifications to "xorg-x11-6.8.2-37.FC4.49.2"
on my FC4 box. Before, I could not boot into graphical mode unless I
disabled "DRI" for this R100 QD based Radeon 7200 PCI card. The "X" server
would hang and not even respond to the keyboard anymore. Now, everything
works as expected. This is great news.
I had posted the bug report for modular "Xorg" 7.0.0 where the issue is
also present, but reverted my system later to FC4 and discovered that the
RV100 bus master fix patch of the above FC4 update broke "DRI". For details,
see:

  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=180150
Comment 12 Daniel Richard G. 2006-03-11 13:02:03 UTC
Apologies for the delay; I had to deal with the new modular build system for 
the first time. (Thank goodness for build.sh!) 
 
With a current CVS build of Xorg, I am no longer able to reproduce the original 
bug. The X server is starting up without a hitch through multiple reboots. 
Either the problem is fixed, or it's a lot scarcer than it used to be :-) 
 
(Resolving to FIXED; I trust everything is in order?) 
Comment 13 bd 2006-03-13 01:02:22 UTC
Monolithic X still has the bug.

But after battling with modular X (there is now a considerable chunk missing in
my desk) I have now been running 7 xine instances at once (which previously
brought down X in about 5 seconds) for approx. 30 minutes.

BTW this is Debian Sid with modular CVS HEAD as of today.
