Bug 3911 - Intermittent fail due to improper memory access, SERR generated when starting XWindow in Linux RH4
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/Radeon
Version: git
Hardware: x86 (IA32) Linux (All)
Importance: high normal
Assignee: Benjamin Herrenschmidt
Reported: 2005-07-30 03:53 UTC by jon chaplick
Modified: 2011-10-15 15:33 UTC (History)

Attachments
Rework setup of the memory map (12.72 KB, patch)
2005-10-24 16:54 UTC, Benjamin Herrenschmidt
X.org log (42.97 KB, text/plain)
2006-02-21 11:22 UTC, Diego Calleja
xorg.conf when Accelmethod = EXA (2.75 KB, text/plain)
2006-02-21 11:24 UTC, Diego Calleja
X.org log when accelmethod = EXA (41.65 KB, text/plain)
2006-02-21 11:52 UTC, Diego Calleja
log diff between accelmethod=exa (-) and without it (+) (5.32 KB, text/plain)
2006-02-21 11:55 UTC, Diego Calleja

Description jon chaplick 2005-07-30 03:53:25 UTC
Description of Failure Mode
===========================

When running Red Hat 4.0, the system occasionally hangs during X server
initialization.
The failure frequency ranges from an hour to a few hours, depending on platform
and configuration.

The failure is observed using an RV100 video card on a 32-bit system.

The RV100 performs a DMA Memory Read Line request at address 0x000Exxxx.
On the memory controller this address range is masked off; that is, PCI devices
do not have access to this space.
When an access to this space occurs, the MCH reports it as an unsupported
request and asserts SERR, which causes the system to hang.

In the non-failing case, the DMA access to this space does not occur.
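
To make the failing address range concrete, here is a minimal C sketch of the check this description implies. It assumes that "0x000Exxxx" means the 64 KB window 0x000E0000 through 0x000EFFFF; the macro and function names are hypothetical and do not come from the driver or from any chipset documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "0x000Exxxx" denotes the window 0x000E0000..0x000EFFFF. */
#define MASKED_WINDOW_START 0x000E0000u
#define MASKED_WINDOW_END   0x000EFFFFu /* inclusive */

/*
 * Returns true when a bus-master read of `len` bytes starting at `addr`
 * overlaps the masked window, i.e. the kind of access the report says
 * the MCH answers with an unsupported request and SERR.
 */
static bool dma_hits_masked_window(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    /* Compute the last byte in 64 bits so addr + len cannot wrap. */
    uint64_t last = (uint64_t)addr + len - 1;
    return addr <= MASKED_WINDOW_END && last >= MASKED_WINDOW_START;
}
```

A check like this would only be useful for reasoning about bus traces; it is not part of any fix discussed in this bug.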
                                                                               
Software Root Cause
=========================

There is a pending request on the RV100 hardware that is blocked waiting for a
valid FB_LOCATION. When the FB_LOCATION register is set, the pending request is
unblocked. Because the register is still being set at the moment the request is
unblocked, the request generates a bus cycle for a DMA transfer. This DMA
transfer targets an undefined region of memory and causes the system hang.

Software Fix
=========================

At the start of RADEONSetFBLocation we take preventive action and disable all
bus mastering.
Once all FB_LOCATION and related register writes are complete, we re-enable bus
mastering. This prevents a bus cycle from being generated while FB_LOCATION is
being updated.
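
The sequence described in the fix can be sketched as follows. This is not the attached patch: it is a self-contained simulation in which the register file is a plain array, and the register indices, the BUS_MASTER_DIS bit position, and all function names are placeholders chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register indices and bit position, for illustration only. */
enum { REG_BUS_CNTL, REG_MC_FB_LOCATION, REG_MC_AGP_LOCATION, REG_COUNT };
#define BUS_MASTER_DIS (1u << 6)
#define MAX_TRACE 8

/* Simulated register file that also records the order of writes. */
typedef struct {
    uint32_t reg[REG_COUNT];
    int      trace[MAX_TRACE];
    size_t   ntrace;
} fake_mmio;

static void reg_write(fake_mmio *m, int idx, uint32_t val)
{
    assert(m != NULL && idx >= 0 && idx < REG_COUNT);
    m->reg[idx] = val;
    if (m->ntrace < MAX_TRACE)
        m->trace[m->ntrace++] = idx;
}

static void set_fb_location_guarded(fake_mmio *m, uint32_t fb_loc,
                                    uint32_t agp_loc)
{
    uint32_t saved = m->reg[REG_BUS_CNTL];

    /* 1. Stop the chip from issuing bus-master cycles. */
    reg_write(m, REG_BUS_CNTL, saved | BUS_MASTER_DIS);

    /* 2. Update the memory map while no DMA can be generated. */
    reg_write(m, REG_MC_FB_LOCATION, fb_loc);
    reg_write(m, REG_MC_AGP_LOCATION, agp_loc);

    /* 3. Restore the previous bus-master setting. */
    reg_write(m, REG_BUS_CNTL, saved);
}
```

The sketch restores the saved BUS_CNTL value rather than unconditionally clearing the disable bit, so a caller that had bus mastering off beforehand keeps it off. Whether the real patch does the same is not stated in this report.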
Comment 1 Michel Dänzer 2005-07-30 04:36:47 UTC
Please attach the patch to this entry, Jon; I'm going to commit it to HEAD shortly.
Comment 2 jon chaplick 2005-07-30 04:38:55 UTC
Created attachment 3191 [details]
SSH key for Andrew Cowie
Comment 3 Michel Dänzer 2005-07-30 05:47:41 UTC
CVSROOT:        /cvs/xorg
Module name:    xc
Changes by:     daenzer@gabe.freedesktop.org    05/07/29 12:45:14

Log message:
        * programs/Xserver/hw/xfree86/drivers/ati/radeon_driver.c:
        (RADEONSetFBLocation):
        bugzilla #3911 (https://bugs.freedesktop.org/show_bug.cgi?id=3911)
        attachment #3191 [details] (http://bugs.freedesktop.org/attachment.cgi?id=3191)
        Disable bus mastering while updating MC_FB_LOCATION and friends to
        prevent the X server from hanging on startup every now and then
        under some circumstances. (ATI Technologies Inc.)

Modified files:
      ./:
        ChangeLog 
      xc/programs/Xserver/hw/xfree86/drivers/ati/:
        radeon_driver.c 
  
  Revision      Changes    Path
  1.1161        +10 -0     xc/ChangeLog
  1.64          +9 -1      xc/programs/Xserver/hw/xfree86/drivers/ati/radeon_driver.c
Comment 4 Michel Dänzer 2005-07-30 05:48:58 UTC
Comment on attachment 3191 [details]
SSH key for Andrew Cowie

Note that the patch was made against and tested on 6.8 in the first place.
Comment 5 Daniel Richard G. 2005-09-20 09:03:24 UTC
Gentlemen, please have a look at Bug #4324. It seems that the above fix may in  
fact trigger a similar hang-on-startup problem on other systems.  
Comment 6 Benjamin Herrenschmidt 2005-09-22 22:37:23 UTC
Ok, I'm not sure this fix is correct. I have a slightly different analysis of
the problem. I think there is no such thing as a memory request being "blocked
waiting for a valid MC_FB_LOCATION"; the value of that register is always
"valid" as far as the chip is concerned (unless your top is below your bottom,
but that should never happen unless your chip was really in bad shape in the
first place).

I think what happens is that you have a scanout in progress via a CRTC at the
point where you change MC_FB_LOCATION. At this point, the scanout continues from
the old address and thus generates bus master reads until you change
DISPLAY_BASE_ADDRESS (and all other registers that may be relevant, like
DISP2_BASE_* etc...)

Imho, the proper fix is to disable CRTCs, not disable bus mastering. In fact,
there is even a bit in the CRTC registers (and in one of the LVDS ones as well,
iirc) to prevent them from doing any memory access.
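
A minimal sketch of the ordering proposed in this comment, assuming a single CRTC and a simulated register file. The register indices, the CRTC_MEM_ACCESS_DIS bit, and the function names are all hypothetical; the comment does not name the actual bit, and a second CRTC (the DISP2_BASE_* registers) is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register indices and bit, for illustration only. */
enum { CRTC_CNTL, MC_FB_LOCATION, DISPLAY_BASE_ADDR, NUM_REGS };
#define CRTC_MEM_ACCESS_DIS (1u << 10)

typedef struct { uint32_t reg[NUM_REGS]; } sim_regs;

/* True while the simulated CRTC is allowed to fetch from memory. */
static int crtc_fetching(const sim_regs *r)
{
    return (r->reg[CRTC_CNTL] & CRTC_MEM_ACCESS_DIS) == 0;
}

static void move_framebuffer(sim_regs *r, uint32_t fb_loc, uint32_t base)
{
    assert(r != NULL);
    uint32_t saved = r->reg[CRTC_CNTL];

    /* 1. Stop scanout so the CRTC cannot read from the old address. */
    r->reg[CRTC_CNTL] = saved | CRTC_MEM_ACCESS_DIS;
    assert(!crtc_fetching(r));

    /* 2. Move the memory map, then point the display at the new base. */
    r->reg[MC_FB_LOCATION] = fb_loc;
    r->reg[DISPLAY_BASE_ADDR] = base;

    /* 3. Restore the CRTC control value only after both writes landed. */
    r->reg[CRTC_CNTL] = saved;
}
```

The point of the ordering is that the scanout engine never sees a new MC_FB_LOCATION paired with a stale display base address, and bus mastering as a whole is left untouched.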

Comment 7 Benjamin Herrenschmidt 2005-10-24 16:54:38 UTC
Created attachment 3620 [details] [review]
Rework setup of the memory map

This patch reworks how the memory map is initialized to do it as part of the
mode setting and to properly disable CRTCs before moving things around. We
don't touch BUS_CNTL anymore as this caused lockups with PCI GART. We might
want to add a bit more safety there in the future, like disabling the capture
engines too. In the UseFBDev case, it changes the behaviour as we no longer try
to move things around (we use whatever the fbdev driver had set up). This appears
to work on the few machines I tested so far. Please test for regressions as I
intend to commit before 7.0 is final.
Comment 8 Michel Dänzer 2005-10-25 01:24:55 UTC
(In reply to comment #7)
> Created an attachment (id=3620) [edit]
> Rework setup of the memory map

I like the approach of this patch. The only cosmetic comment I have is that

+    /* Default to existing values */
+    save->mc_fb_location = INREG(RADEON_MC_FB_LOCATION);
+    save->mc_agp_location = INREG(RADEON_MC_AGP_LOCATION);

can be removed in RADEONInitMemMapRegisters() because those fields are always
initialized later on in that function.

Adding Hui to the CC list, maybe he has other comments.
Comment 9 Diego Calleja 2005-10-25 05:18:06 UTC
This seems to fix a hang I had been having when loading the serverworks AGP
kernel module (not loading it made the system work fine). With this patch, I can
run with the serverworks agp module loaded (Radeon 9200 SE graphics card) and no
hangs (they were easily reproducible when enabling kompmgr and I can't reproduce
it anymore)
Comment 10 Diego Calleja 2005-10-25 05:26:08 UTC
This may be unrelated, but I see this after setting
           Option     "AGPMode" "2"
in my xorg.conf file:

agpgart: X tried to set rate=x12. Setting to AGP3 x8 mode.
agpgart: X requested AGPx8 but bridge not capable.
agpgart: Putting AGP V2 device at 0000:00:00.1 into 1x mode
agpgart: Putting AGP V2 device at 0000:01:00.0 into 1x mode


The Xorg.0.log file confirms I'm not configuring it wrong:
(**) RADEON(0): Option "AGPMode" "2"
(**) RADEON(0): Option "AGPFastWrite" "True"
(**) RADEON(0): Option "EnablePageFlip" "True"
(**) RADEON(0): Option "MonitorLayout" "CRT, CRT"
(**) RADEON(0): Option "RenderAccel" "True"
(**) RADEON(0): Option "AccelMethod" "EXA"

[...]

(**) RADEON(0): AGP 2x mode is configured
(**) RADEON(0): Enabling AGP Fast Write
Comment 11 Diego Calleja 2005-10-27 10:40:16 UTC
Ok, forget what I said about this patch killing the hangs I was having - they
still happen, they're just more difficult to reproduce :/
Comment 12 Benjamin Herrenschmidt 2005-10-27 15:19:32 UTC
Hrm.. I'm not sure those hangs are related to the bug I'm trying to fix then... 
Comment 13 Benjamin Herrenschmidt 2005-12-25 19:22:28 UTC
I have a new patch, but I'm still facing a few regressions. I'll attach it to
this bug once I've figured those out
Comment 14 Benjamin Herrenschmidt 2006-02-20 08:51:33 UTC
New patch is in upstream, please test
Comment 15 Diego Calleja 2006-02-21 00:05:32 UTC
This seems to work fine for me (Debian's 6.9 ati driver hangs my box after using
it for some minutes), except that apparently I can't enable EXA - if I do, the
X.org log says nothing about EXA (grep -i exa returns nothing), it uses XAA
and everything feels much slower than when I comment out the EXA line in
xorg.conf (I can't say if I did something wrong when I compiled it though :P)
Comment 16 Benjamin Herrenschmidt 2006-02-21 08:10:04 UTC
Can you attach your log and config files ?
Comment 17 Diego Calleja 2006-02-21 11:22:45 UTC
Created attachment 4697 [details]
X.org log 

Of course not! How dare you.... :P
Comment 18 Diego Calleja 2006-02-21 11:24:10 UTC
Created attachment 4698 [details]
xorg.conf when Accelmethod = EXA
Comment 19 Diego Calleja 2006-02-21 11:33:17 UTC
The startup log when I comment out accelmethod = EXA is *exactly* the same (no
diff) as when it's enabled, except that everything redraws much slower (i.e.
as if I were using the vesa driver or something)

Diff between accelmethod=exa and no accelmethod specified:

--- /var/log/Xorg.0.log.old	2006-02-21 01:29:15.000000000 +0100
+++ /var/log/Xorg.0.log	2006-02-21 01:29:26.000000000 +0100
@@ -12,7 +12,7 @@
 Markers: (--) probed, (**) from config file, (==) default setting,
 	(++) from command line, (!!) notice, (II) informational,
 	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
-(==) Log file: "/var/log/Xorg.0.log", Time: Tue Feb 21 01:29:10 2006
+(==) Log file: "/var/log/Xorg.0.log", Time: Tue Feb 21 01:29:23 2006
 (==) Using config file: "/root/xorg.conf"
 (==) ServerLayout "Default Layout"
 (**) |-->Screen "Default Screen" (0)
@@ -735,35 +735,30 @@
 (==) RADEON(0): Write-combining range (0xf0000000,0x8000000)
 (II) RADEON(0): BIOS HotKeys Disabled
 drmOpenDevice: node name is /dev/dri/card0
-drmOpenDevice: open result is -1, (No such device or address)
-drmOpenDevice: open result is -1, (No such device or address)
-drmOpenDevice: Open failed
+drmOpenDevice: open result is 8, (OK)
 drmOpenDevice: node name is /dev/dri/card0
-drmOpenDevice: open result is -1, (No such device or address)
-drmOpenDevice: open result is -1, (No such device or address)
-drmOpenDevice: Open failed
+drmOpenDevice: open result is 8, (OK)
 drmOpenByBusid: Searching for BusID pci:0000:01:00.0
 drmOpenDevice: node name is /dev/dri/card0
 drmOpenDevice: open result is 8, (OK)
 drmOpenByBusid: drmOpenMinor returns 8
 drmOpenByBusid: drmGetBusid reports pci:0000:01:00.0
-(II) RADEON(0): [drm] loaded kernel module for "radeon" driver
 (II) RADEON(0): [drm] DRM interface version 1.2
 (II) RADEON(0): [drm] created "radeon" driver at busid "pci:0000:01:00.0"
 (II) RADEON(0): [drm] added 8192 byte SAREA at 0xf091e000
-(II) RADEON(0): [drm] mapped SAREA 0xf091e000 to 0xa7f2e000
+(II) RADEON(0): [drm] mapped SAREA 0xf091e000 to 0xa7f53000
 (II) RADEON(0): [drm] framebuffer handle = 0xf0000000
 (II) RADEON(0): [drm] added 1 reserved context for kernel
 (II) RADEON(0): [agp] Mode 0x1f00021b [AGP 0x1166/0x0007; Card 0x1002/0x5960]
 (II) RADEON(0): [agp] 8192 kB allocated with handle 0x00000001
 (II) RADEON(0): [agp] ring handle = 0xd8000000
-(II) RADEON(0): [agp] Ring mapped at 0x9f872000
+(II) RADEON(0): [agp] Ring mapped at 0x9f897000
 (II) RADEON(0): [agp] ring read ptr handle = 0xd8101000
-(II) RADEON(0): [agp] Ring read ptr mapped at 0x9f871000
+(II) RADEON(0): [agp] Ring read ptr mapped at 0x9f896000
 (II) RADEON(0): [agp] vertex/indirect buffers handle = 0xd8102000
-(II) RADEON(0): [agp] Vertex/indirect buffers mapped at 0x9f671000
+(II) RADEON(0): [agp] Vertex/indirect buffers mapped at 0x9f696000
 (II) RADEON(0): [agp] GART texture map handle = 0xd8302000
-(II) RADEON(0): [agp] GART Texture map mapped at 0x9f191000
+(II) RADEON(0): [agp] GART Texture map mapped at 0x9f1b6000
 (II) RADEON(0): [drm] register handle = 0xfe6f0000
 (II) RADEON(0): [dri] Visual configs initialized
 (II) RADEON(0): Depth moves disabled by default
@@ -848,5 +843,5 @@
 Warning: font renderer for ".pmf" already registered at priority 0
 Could not init font path element /usr/lib/X11/fonts/Speedo, removing from list!
 (II) RADEON(0): [drm] removed 1 reserved context for kernel
-(II) RADEON(0): [drm] unmapping 8192 bytes of SAREA 0xf091e000 at 0xa7f2e000
+(II) RADEON(0): [drm] unmapping 8192 bytes of SAREA 0xf091e000 at 0xa7f53000
 FreeFontPath: FPE "/usr/lib/X11/fonts/misc" refcount is 2, should be 1; fixing.
Comment 20 Diego Calleja 2006-02-21 11:35:01 UTC
Forget this last comment; it was using the wrong config file
(/root/xorg.conf) because it was being run as root.
Comment 21 Diego Calleja 2006-02-21 11:52:56 UTC
Created attachment 4699 [details]
X.org log when accelmethod = EXA

My setup writes the log in /opt/var/log not in /var/log. Sorry for all the
noise and for my stupidity
Comment 22 Diego Calleja 2006-02-21 11:55:35 UTC
Created attachment 4700 [details]
log diff between accelmethod=exa (-) and without it (+)

This log really shows that exa is being used (but the slow-redraw problem
persists, of course)
Comment 23 Diego Calleja 2006-02-23 11:15:16 UTC
I've been running this for days with zero problems except for performance - it's
solid as a rock. I hope 6.9/7.0 is updated with this when the patch is ready;
it's definitely an improvement!
Comment 24 Diego Calleja 2006-03-17 23:50:07 UTC
It looks like current CVS makes my system hang again - it started working well
on 21-2 and I've been using CVS for a while, but the latest changes make it
behave in the same way as previously. Sadly I don't know what change started to
cause this again. I'll try to go back in time and do a manual bisection search -
I wish X.org used git :/
Comment 25 Benjamin Herrenschmidt 2006-03-18 09:19:23 UTC
By current CVS, do you mean actually current as of today or maybe a couple of
days ago ? I found another cause for these hangs and committed a fix yesterday...
Just make sure you are really testing the latest CVS.

If it still hangs, then yes, it would be useful to know what specific change
committed over the past few days is causing the hang.
Comment 26 Diego Calleja 2006-03-22 10:56:08 UTC
Hm, right now I'm using current CVS...but the problem doesn't seems to be the
CVS. Apparently, My box only hangs when I've the agp and radeon kernel modules
loaded (i've tried 2.6.16-git right now). When I remove them, everything seems
to work fine. I'm not doing anything which uses 3D, just a regular kde 3.5
session using firefox 1.5
Comment 27 Benjamin Herrenschmidt 2006-03-22 11:49:08 UTC
The version of the ddx driver in CVS does matter a lot. There are some very
subtle and incestuous interactions going on between the X side driver and the
kernel DRM around the way the memory map is set up.

I would expect the DDX that is currently in CVS HEAD or ati-1-0-branch to not
cause bogus bus master accesses any more; I took all sorts of precautions against
it. If it still happens, then I suppose there is some more weird voodoo going on
with the card and we'll need ATI to shed some light on the matter.

In the meantime, please test what happens when running current top of tree X ati
driver and the current DRM CVS kernel module and tell me. Then test downgrading
the kernel module to what is in 2.6.15 or 2.6.16 (doesn't matter).

 
Comment 28 Michel Dänzer 2006-03-22 20:07:25 UTC
(In reply to comment #26)
> Hm, right now I'm using current CVS...but the problem doesn't seems to be the
> CVS. Apparently, My box only hangs when I've the agp and radeon kernel modules
> loaded (i've tried 2.6.16-git right now).

Not that there are many possible causes for hangs; in this entry, we're only
interested in SERR conditions, which can probably only be diagnosed on server
type machines.
Comment 29 Michel Dänzer 2006-04-05 20:59:47 UTC
(In reply to comment #28)
> Not that there are many possible causes for hangs; [...]

Whoops, that was supposed to say 'Note that...'.
Comment 30 Diego Calleja 2006-04-06 05:38:22 UTC
(In reply to comment #27)
> I would expect the DDX that is currently in CVS HEAD or ati-1-0-branch to not

I'm using kernel 2.6.17 git and CVS HEAD from the ati driver, and i still get
hangs (which go away without loading the radeon kernel module).

(I know nothing about "SERR" conditions, I just get "hangs", sorry. For some
reason I reported my bug to this bug. is there other bug where I should take
this?)
Comment 31 Michel Dänzer 2006-04-06 18:22:36 UTC
(In reply to comment #30)
> I'm using kernel 2.6.17 git and CVS HEAD from the ati driver, and i still get
> hangs (which go away without loading the radeon kernel module).

I'm not 100% sure, but IIRC the problem described in this bug happens regardless
of whether the DRM is loaded.

> is there other bug where I should take this?)

Yes, e.g. bug 6271.
Comment 32 Timo Jyrinki 2007-02-22 03:06:32 UTC
The original problem was fixed, all the patches have been committed and people having other problems redirected to other bugs. Resolving as fixed.

