Bug 17075

Summary: Radeon + DRI on r300: X goes 99.9% CPU
Product: DRI
Reporter: Doron <doron.fediuck>
Component: General
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: critical
Priority: high
CC: dusanc
Version: XOrg git
Hardware: All
OS: Linux (All)
Attachments:
Description                                   Flags
Configuration file.                           none
X Log File.                                   none
X 1.5.0 log file                              none
X with Accel on, DRI off log file             none
gdb tracing with break point.                 none
gdb avoiding normal switch, using finish      none

Description Doron 2008-08-11 02:42:52 UTC
Hi,
I've just installed the latest versions of X on my machine,
and enabled DRI (I waited more than 2 years for it to work...).

Overview:
Everything works OK, until I switch to console.
When I try to come back to X (Fn+F7) X goes 99.9% CPU and keyboard
is disabled (mouse works). The only way out is hardware power-off
or ssh from another machine and reboot. 

I saw similar bug reports, but they were either for different drivers (nVidia) or
described problems that were not the same.

Steps to Reproduce: 
1. Enable DRI (set NoAccel to false).
2. startx
3. Move to console (Ctrl+Alt+F1).
4. Move to X (Ctrl+Alt+F7).

Actual Results:
X goes 99% CPU, keyboard is disabled (mouse works).
Networking and other services (e.g. mpd, for music) keep working, so SSH access is possible.
No specific issues in the log file.

Work-around:
1. Use NoAccel (disable DRI).

Build Date & Platform: 
doronf ~ # emerge --info
Portage 2.1.4.4 (default/linux/x86/2008.0/desktop, gcc-4.1.2, glibc-2.6.1-r0, 2.6.25-gentoo-r7 i686)
=================================================================
System uname: 2.6.25-gentoo-r7 i686 Intel(R) Pentium(R) M processor 2.13GHz
Timestamp of tree: Mon, 11 Aug 2008 01:45:01 +0000
app-shells/bash:     3.2_p33
dev-java/java-config: 1.3.7, 2.1.6
dev-lang/python:     2.4.4-r14, 2.5.2-r6
sys-apps/baselayout: 2.0.0
sys-apps/openrc:     0.2.5
sys-apps/sandbox:    1.2.18.1-r2
sys-devel/autoconf:  2.13, 2.61-r2
sys-devel/automake:  1.4_p6, 1.5, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.1
sys-devel/binutils:  2.18-r3
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool:   1.5.26
virtual/os-headers:  2.6.23-r3
ACCEPT_KEYWORDS="x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O3 -march=pentium-m -fomit-frame-pointer -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /var/lib/hsqldb"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/splash /etc/terminfo /etc/udev/rules.d"
CXXFLAGS="-O3 -march=pentium-m -fomit-frame-pointer -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="distlocks metadata-transfer parallel-fetch sandbox sfperms strict unmerge-orphans userfetch userpriv usersandbox webrsync-gpg"
GENTOO_MIRRORS="http://mirror.hamakor.org.il/pub/mirrors/gentoo/ http://de-mirror.org/distro/gentoo/ http://gentoo.ynet.sk/pub http://mirror.qubenet.net/mirror/gentoo/"
LANG="he_IL.utf8"
LDFLAGS="-Wl,-O1"
LINGUAS="en he"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage /usr/local/alon-barlev-portage /usr/local/ase-portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X aac acl acpi alsa arts berkdb bidi bluetooth branding bzip2 cairo cdr cli cracklib crypt cups curl dbus dri dv dvd dvdr dvdread eds emboss encode esd evo fam firefox fortran gdbm gif gphoto2 gpm gstreamer gtk hal hdaps iconv ipv6 isdnlog jpeg jpeg2k kde kdeenablefinal kerberos ldap libnotify logrotate mad midi mikmod mmx mp3 mpeg mudflap ncurses nls nptl nptlonly ogg opengl openmp pam pcre pdf perl png ppds pppd python qt3 qt3support qt4 quicktime readline reflection samba sdl session spell spl sse sse2 ssl startup-notification svg svga sysfs tcpd tiff truetype unicode usb v4l vorbis wifi win32codecs x86 xcomposite xinerama xml xorg xv zlib" ALSA_CARDS="intel8x0" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CAMERAS="canon" ELIBC="glibc" INPUT_DEVICES="mouse keyboard" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en he" USERLAND="GNU" VIDEO_CARDS="radeon"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

doronf ~ # X -version

X.Org X Server 1.4.2
Release Date: 11 June 2008
X Protocol Version 11, Revision 0
Build Operating System: Linux 2.6.25-gentoo-r7 i686
Current Operating System: Linux doronf 2.6.25-gentoo-r7 #5 PREEMPT Thu Jul 31 17:27:26 IDT 2008 i686
Build Date: 07 August 2008  09:47:57AM

        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Module Loader present

I'm attaching xorg.conf and a sample log file.
Comment 1 Doron 2008-08-11 02:45:27 UTC
Created attachment 18215 [details]
Configuration file. 

The relevant device for my layout is with Identifier "Alone".
Comment 2 Doron 2008-08-11 02:49:17 UTC
Created attachment 18216 [details]
X Log File.
Comment 3 Michel Dänzer 2008-08-11 03:01:19 UTC
Does it also happen without Option "DynamicClocks"?
Comment 4 Doron 2008-08-11 03:38:18 UTC
Hi Michel,
Thanks for the quick response.

Sadly, yes. I just commented out DynamicClocks
(default is off), and the same behavior occurred.
Comment 5 Doron 2008-09-06 13:15:55 UTC
Hi,
Almost a month has passed with no comments... Is there any chance of fixing this bug?
How can I help?

Thanks,
Doron.
Comment 6 Giacomo Perale 2008-09-06 13:56:40 UTC
(In reply to comment #5)
> Hi,
> Almost a month has passed with no comments... Is there any chance of fixing this bug?
> How can I help?
> 
> Thanks,
> Doron.
> 

Did you try with EXA instead of XAA? (Add 'Option "AccelMethod" "exa"' to xorg.conf)
Comment 7 Doron 2008-09-07 01:28:34 UTC
Thanks Giacomo,
But no change- same behavior.

Also I moved to the new xorg-server, and it remains the same.
I'll attach the latest version's log.
Version details:
doronf ~ # Xorg -version

X.Org X Server 1.5.0
Release Date:
X Protocol Version 11, Revision 0
Build Operating System: Linux 2.6.25-gentoo-r7 i686
Current Operating System: Linux doronf 2.6.25-gentoo-r7 #6 PREEMPT Fri Aug 15 19:13:18 IDT 2008 i686
Build Date: 07 September 2008  10:43:07AM

I hope someone can give me a hand here,
since this will help many r300 users.

Thanks,
Doron
Comment 8 Doron 2008-09-07 01:31:31 UTC
Created attachment 18719 [details]
X 1.5.0 log file
Comment 9 Michel Dänzer 2008-09-08 01:50:14 UTC
(In reply to comment #7)
> 
> I hope someone can give me a hand here,
> since this will help many r300 users.

How do you know that? This is the only report I've seen of a problem like this.

Does the problem also occur with the radeon kernel module from drm Git instead of from the Linux kernel?

BTW, you don't need to disable acceleration completely to disable the DRI, you can use Option "DRI" "off". Does that still avoid the problem?
Comment 10 Doron 2008-09-08 03:20:46 UTC
(In reply to comment #9)

Hi Michel,

> (In reply to comment #7)
> > 
> > I hope someone can give me a hand here,
> > since this will help many r300 users.
> 
> How do you know that? This is the only report I've seen of a problem like this.
> 
I have seen many users with this issue. Not all of them know how to pinpoint
the problem, so they open various issues in places like X mailing lists,
distro forums, etc. You can just google for it and see:
http://www.google.com/search?hl=en&q=X+hang+%2Bcpu+r300&btnG=Search

Since it looks like a busy loop (which causes X to go 100%), breaking
this loop will help others as well.

> Does the problem also occur with the radeon kernel module from drm Git instead
> of from the Linux kernel?
No. This is a company laptop and I can't risk harming it with bleeding-edge
sources. Sorry... if there's a "safe" branch I may give it a go,
but I haven't seen such a branch so far.

> 
> BTW, you don't need to disable acceleration completely to disable the DRI, you
> can use Option "DRI" "off". Does that still avoid the problem?
> 
I'll try that, thanks !

Doron
Comment 11 Michel Dänzer 2008-09-08 03:50:01 UTC
(In reply to comment #10)
> Since it looks like a busy loop (which causes X to go 100%), breaking
> this loop will help others as well.

Those are typical symptoms of a GPU lockup, which can be caused by any number of different things. The usual causes result in the hang after e.g. running certain 3D applications or on X server startup, it's rare on VT switches.

BTW, were the log files captured after reproducing the problem? If not, please attach one that was. Also, which version of xf86-video-ati is this? If it's older than 6.9.0 or at least 6.8.0, please try a newer one.

Comment 12 Doron 2008-09-08 04:21:41 UTC
(In reply to comment #11)
Dear Michel,
> Those are typical symptoms of a GPU lockup, which can be caused by any number
> of different things. The usual causes result in the hang after e.g. running
> certain 3D applications or on X server startup, it's rare on VT switches.
I'm sorry, from a programmer's point of view it behaves like it's in a loop.
I'm always ready to learn ;) 
I just want to add that I do not need to run a heavy-duty 3D application; I
only need to turn off the xdm service. Then, from the console, I run startx and
switch back to the console. When I try to switch back to X (VT 7), X hangs.
The same behavior occurs when I start xdm (KDE), log in and switch to VT 1:
X hangs the minute I try to switch back to it.

> 
> BTW, were the log files captured after reproducing the problem? If not, please
> attach one that was. Also, which version of xf86-video-ati is this? If it's
> older than 6.9.0 or at least 6.8.0, please try a newer one.
> 

All logs were captured while X was hanging. I have no choice when it hangs, so I
reboot the machine. These are the logs that X produces. I also checked the .xsession-errors file, but found nothing significant there.
As for xf86-video-ati, I'm using version 6.9.0.

Since I'm using Gentoo, I can recompile anything with the relevant USE flag.
So if you have a debug USE flag somewhere, I can turn it on. Just tell me how
to help.

Thanks again !
Doron.
Comment 13 Michel Dänzer 2008-09-08 05:10:42 UTC
(In reply to comment #12)
> I'm sorry, from a programmer's point of view it behaves like it's in a loop.

It is in a loop, waiting for the GPU to finish processing the commands emitted to it previously, but that never happens because the GPU is locked up. The loop is just a symptom of the actual problem.
Comment 14 Doron 2008-09-08 05:53:58 UTC
Hi Michel,
We may have some progress due to your DRI suggestion...
I commented out NoAccel and added DRI false. For some strange reason,
I had acceleration (window moving didn't flicker as I'm used to...).
The hang behavior also persisted, i.e. switching VTs caused X to
go to 99.9% CPU. I double-checked the X log. You can see that the DRI is off
and that a Mesa driver is being used for GLX.

So this leaves me confused: is Mesa causing this behavior?
I'm attaching the relevant log.

Thanks,
Doron
Comment 15 Doron 2008-09-08 05:55:44 UTC
Created attachment 18737 [details]
X with Accel on, DRI off log file
Comment 16 Michel Dänzer 2008-09-08 07:02:27 UTC
(In reply to comment #14)
> For some strange reason, I had acceleration (window moving didn't flicker as
> I'm used to...).

Nothing strange, you only disabled the DRI, not all acceleration.

> The hang behavior also persisted, i.e. switching VTs caused X to
> go to 99.9% CPU.

Hmm, then I'm not sure anymore this really is a GPU lockup... it would be interesting if when the X server is hanging, you could log in via ssh, attach gdb to the X server process and attach the output of 'bt full'. It'll be much more useful if the X server binaries have debugging symbols.
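A minimal sketch of that capture, assuming gdb is installed on the affected machine and the commands are run as root over ssh. The helper name dump_backtrace and the output path are hypothetical, and the GDB variable is an override hook added here only so the command line can be dry-run:

```shell
#!/bin/sh
# Hypothetical helper (not part of any X.Org tooling): attach gdb to a
# running process, print a full backtrace, then detach again.
# $1 = PID to attach to, $2 = file that receives the gdb output.
# GDB is an assumed override hook for dry runs; it defaults to "gdb".
dump_backtrace() {
    "${GDB:-gdb}" -batch -p "$1" -ex 'bt full' > "$2" 2>&1
}

# Example, run as root over ssh while the X server is hanging:
#   dump_backtrace "$(pidof X)" /tmp/x-backtrace.txt
```

The resulting file can then be attached to the bug report rather than pasted into a comment.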
Comment 17 Doron 2008-09-08 07:19:42 UTC
(In reply to comment #16)
> Hmm, then I'm not sure anymore this really is a GPU lockup... it would be
> interesting if when the X server is hanging, you could log in via ssh, attach
> gdb to the X server process and attach the output of 'bt full'. It'll be much
> more useful if the X server binaries have debugging symbols.
> 

Michel,
I'm not sure gdb will attach to the X process since it's consuming most of the
CPU... But I can give it a go.
As for the symbols, I found that xorg-server has a debug USE flag I can turn on and recompile. Are there any other binaries which should be recompiled in debug mode (e.g. xf86-video-ati)?
Comment 18 Michel Dänzer 2008-09-08 07:39:07 UTC
(In reply to comment #17)
> Any other binaries which should be recompiled in debug mode ? (ie-
> xf86-video-ati, etc.)

Yeah, xserver and xf86-video-ati for starters.
Comment 19 Doron 2008-09-08 09:11:55 UTC
OK, here's the gdb output and then my explanations:
====================================================
Continuing.

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0xb7a796c0 (LWP 7591)]
0xb7f26424 in __kernel_vsyscall ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0xb7f26424 in __kernel_vsyscall ()
Continuing.

Program received signal SIGINT, Interrupt.
0xb7f26424 in __kernel_vsyscall ()
#0  0xb7f26424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7b34fe9 in ioctl () from /lib/libc.so.6
No symbol table info available.
#2  0xb79d6a71 in drmCommandNone () from /usr/lib/libdrm.so.2
No symbol table info available.
#3  0x00006444 in ?? ()
No symbol table info available.
#4  0x00000000 in ?? ()
No symbol table info available
====================================================

Explanations:
1. I recompiled X and x11-drivers with the debug USE flag, but it looks like a lot of symbol information is still missing. I'm not sure which parts.
2. I tried following your scenario, i.e. attaching gdb to X when it hangs, but
as I expected, gdb fails to attach. So instead I attached gdb before the X hang
and then switched VTs to trigger the hang.
3. I had to use Ctrl+C in order to make gdb stop and give me a prompt.
4. I used the original configuration, i.e. with DRI enabled.

That's it.
If there's anything I can do better just let me know.

Doron.
Comment 20 Dushan Tcholich 2008-09-08 15:39:52 UTC
I'd like to report that I have had very similar problems.
I had them with kernels 2.6.25 and 2.6.26, but they happen randomly: when I use some 3D app (tremulous, flightgear) and quit it, after some time the system becomes unresponsive. Only the mouse moves and the keyboard is dead. When I log in to that machine through nxserver everything works fine, except that X consumes 100% CPU on a dual-core machine.
Only a reboot solves it.
It happened to me with both EXA and XAA.
The GPU is a Radeon X800AIW.
There's nothing in the log files.
I'm using:
X -version
X Window System Version 1.3.0

 dmesg|grep "drm"
[drm] Initialized drm 1.1.0 20060810
[drm] Initialized radeon 1.29.0 20080528 on minor 0
In xorg.conf:
	Identifier  "Card0"
	Driver      "radeon"
	VendorName  "ATI Technologies Inc"
	BoardName   "R430 [Radeon X800 XL] (PCIe)"
	BusID       "PCI:1:0:0"
	Option	    "EnablePageFlip"		"on"
	Option	    "ColorTiling"		"1"
#	Option	    "AccelMethod"		"EXA"
	Option	    "AccelDFS"			"1"

I had the same symptoms much earlier, in the days of the 2.6.23 kernel, but those freezes happened while playing 3D games (ET:RTCW), not afterwards.

Please ask if you need more info.
Have a nice day
Comment 21 Doron 2008-09-09 01:42:58 UTC
OK, better luck this time...
I read up on disabling stripping and on some compiler flags, and recompiled
X, DRM and xf86-video-ati (again...).
I also found out how to turn on debugging in the drm module.

So here it is, this is what gdb shows while X hangs:
(I caused the SIGINT with ctrl+c in order to see where X hangs...)
==============================================================================
Continuing.

Program received signal SIGINT, Interrupt.
0xb7f84424 in __kernel_vsyscall ()
#0  0xb7f84424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7b92fe9 in ioctl () from /lib/libc.so.6
No symbol table info available.
#2  0xb7a34af2 in drmCommandNone (fd=10, drmCommandIndex=4) at xf86drm.c:2247
No locals.
#3  0xb79d2534 in RADEONWaitForIdleCP (pScrn=0xa126fd0) at radeon_commonfuncs.c:697
        _ret = <value optimized out>
        ret = -16
        info = (RADEONInfoPtr) 0xa129f08
        i = 186
        __FUNCTION__ = "RADEONWaitForIdleCP"
#4  0xb7a162d3 in RADEONSyncCP (pScreen=0xa130048, marker=2462) at radeon_exa_funcs.c:80
        pScrn = (ScrnInfoPtr) 0xa126fd0
#5  0xb78ab803 in exaWaitSync (pScreen=0xa130048) at exa.c:1036
No locals.
#6  0xb78acaab in ExaDoPrepareAccess (pDrawable=0xa2b35f8, index=0) at exa.c:495
        pExaScr = (ExaScreenPrivPtr) 0xa12f008
        pPixmap = (PixmapPtr) 0xa17dff8
        offscreen = 1
#7  0xb78acbab in exaPrepareAccessReg (pDrawable=0xa2b35f8, index=0, pReg=0xa17e0cc) at exa.c:520
        pixmaps = {{as_dst = 1, as_src = 0, pPix = 0xa17dff8, pReg = 0xa17e0cc}}
#8  0xb78ad0a1 in exaImageGlyphBlt (pDrawable=0xa2b35f8, pGC=0xa32d320, x=2, y=947, nglyph=6,
    ppciInit=0xa326f18, pglyphBase=0x0) at exa_accel.c:895
        pPriv = (FbGCPrivPtr) 0xa329a44
        ppci = <value optimized out>
        pci = <value optimized out>
        pglyph = <value optimized out>
        gWidth = <value optimized out>
        gHeight = <value optimized out>
        opaque = <value optimized out>
        gx = <value optimized out>
        gy = <value optimized out>
        glyph = (void (*)(FbBits *, FbStride, int, FbStip *, FbBits, int,
    int)) 0xb78bdba0 <fbGlyph32>
        dst = <value optimized out>
        dstStride = <value optimized out>
        dstBpp = <value optimized out>
        dstXoff = <value optimized out>
        dstYoff = <value optimized out>
        depthMask = <value optimized out>
        pPixmap = (PixmapPtr) 0xa17dff8
        pending_damage = (RegionPtr) 0xa17e0cc
        xoff = 0
        yoff = 0
#9  0x0816e0e6 in damageText (pDrawable=0xa2b35f8, pGC=0xa32d320, x=2, y=13, count=6,
    chars=0xa2aaa78 "", fontEncoding=TwoD16Bit, textType=3) at damage.c:1466
        info = <value optimized out>
        i = 6
        n = 6
        w = 0
        imageblt = 1
#10 0x0816e18d in damageImageText16 (pDrawable=0xa2b35f8, pGC=0xa32d320, x=2, y=13, count=6,
    chars=0xa2aaa78) at damage.c:1547
        pGCPriv = (DamageGCPrivPtr) 0xa32d38c
        oldFuncs = (GCFuncs *) 0x81c99c0
#11 0x0808ba1b in doImageText (client=0xa2a4148, c=0xbf99d940) at dixfonts.c:1561
        err = <value optimized out>
        lgerr = <value optimized out>
        fpe = <value optimized out>
#12 0x0808bbb4 in ImageText (client=0xa2a4148, pDraw=0xa2b35f8, pGC=0x6444, nChars=6,
    data=0xa2aaa78 "", xorg=2, yorg=13, reqType=<value optimized out>, did=10485784)
    at dixfonts.c:1612
        local_closure = {client = 0xa2a4148, pDraw = 0xa2b35f8, pGC = 0xa32d320,
  nChars = 6 '\006', data = 0xa2aaa78 "", xorg = 2, yorg = 13, reqType = 77 'M',
  imageText = 0x816e0f0 <damageImageText16>, itemSize = 2, did = 10485784, slept = 0}
#13 0x08086803 in ProcImageText16 (client=0xa2a4148) at dispatch.c:2231
        err = -16
        pDraw = (DrawablePtr) 0x6444
        pGC = (GC *) 0x0
#14 0x08089144 in Dispatch () at dispatch.c:454
        result = <value optimized out>
        client = <value optimized out>
        nready = 0
        start_tick = 27860
#15 0x0806f98b in main (argc=9, argv=0xbf99db04, envp=Cannot access memory at address 0x644c
) at main.c:441
        pScreen = <value optimized out>
        i = 1
        error = 134673718
        xauthfile = <value optimized out>
        alwaysCheckForInput = {0, 1}
The program is running.  Quit anyway (and detach it)? (y or n) Detaching from program: /usr/bin/X, process 9236
==============================================================================

As for the drm kernel module, these are the debug messages I got;
the following sequence repeats many times:
==============================================================================
Sep  9 10:56:23 doronf [drm:drm_unlocked_ioctl] pid=9236, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Sep  9 10:56:23 doronf [drm:radeon_cp_idle]
Sep  9 10:56:23 doronf [drm:radeon_do_cp_idle]
Sep  9 10:56:23 doronf [drm:drm_unlocked_ioctl] ret = -16
Sep  9 10:56:23 doronf [drm:drm_unlocked_ioctl] pid=9236, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Sep  9 10:56:23 doronf [drm:radeon_cp_idle]
Sep  9 10:56:23 doronf [drm:radeon_do_cp_idle]
Sep  9 10:56:23 doronf [drm:drm_unlocked_ioctl] ret = -16
Sep  9 10:56:23 doronf [drm:drm_unlocked_ioctl] pid=9236, cmd=0x6444, nr=0x44, dev 0xe200, auth=1
Sep  9 10:56:23 doronf [drm:radeon_cp_idle]
==============================================================================
(Here I rebooted).
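
For reference, the ret = -16 in these messages is -EBUSY (errno 16 on Linux): each idle request comes back reporting that the GPU is still busy. A small sketch for counting these returns, assuming the kernel messages were saved to a file; the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count the drm debug lines in which the ioctl
# returned -16 (-EBUSY). $1 = saved kernel log file to scan.
count_busy_returns() {
    # -F treats the pattern as a fixed string, -c prints the match count.
    grep -cF 'drm_unlocked_ioctl] ret = -16' "$1"
}

# Example (the log path is an assumption and differs between systems):
#   count_busy_returns /var/log/messages
```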


How can we proceed from here ?
Thanks !
Doron
Comment 22 Michel Dänzer 2008-09-09 02:33:45 UTC
(In reply to comment #21)
> #3  0xb79d2534 in RADEONWaitForIdleCP (pScrn=0xa126fd0) at
> radeon_commonfuncs.c:697

Okay, so this does look like a GPU lockup after all. Can you get a backtrace with the DRI disabled as well? GPU lockups are unusual with the DRI disabled. One more idea though - does Option "RenderAccel" "off" avoid the problem?

BTW, as I requested, please attach backtraces (and generally larger bits of information) instead of cluttering up the comments with them.

Comment 23 Michel Dänzer 2008-09-09 02:41:36 UTC
(In reply to comment #20)
> I'd like to report that I had pretty similar problems.
> Had them with kernels 2.6.25 and .26, but they happen randomly, when I use some
> 3D app (tremulous, flightgear) and when I finish after some time [...]

Did you read comment #11 and comment #13? It's not the eventual symptoms that matter (they tend to be the same or similar) but what triggers them - VT switching from console back to X for this bug report.
Comment 24 Doron 2008-09-09 04:39:11 UTC
(In reply to comment #22)
> Okay, so this does look like a GPU lockup after all. Can you get a backtrace
> with the DRI disabled as well? GPU lockups are unusual with the DRI disabled.
> One more idea though - does Option "RenderAccel" "off" avoid the problem?
I tried several times to reproduce it with DRI off. The problem is much more severe, i.e. once X hangs, gdb hangs as well, including the ssh session...
At this stage I can only press the power button and wait... I need some help
to set a breakpoint just before the hang occurs. Can you give me a
function name or similar to use as a breakpoint before the VT switch?

As for RenderAccel, I'll give it a go in the next session (with the
breakpoint), since I want to cold boot as little as possible. I'm afraid I'll harm
the hard disk or something else...

> 
> BTW, as I requested, please attach backtraces (and generally larger bits of
> information) instead of cluttering up the comments with them.
> 
My apologies. Accepted starting now.
Comment 25 Michel Dänzer 2008-09-09 04:44:03 UTC
(In reply to comment #24)
> I tried several times repeating it with DRI off. The problem is much more
> intense. ie- Once X hangs, gdb hangs as well, including the ssh session...

Sounds like there may even be two separate problems with the DRI enabled or disabled.

> Can you give me a function name or similar to use as a break point before vt
> switch ?

RADEONEnterVT is the driver function called when switching from console to X.
Comment 26 Doron 2008-09-09 07:07:02 UTC
Hi Michel.
I tried RenderAccel off, but no real change.

So I commented out NoAccel and added DRI off.
I'm attaching the gdb log. I used RADEONEnterVT as a breakpoint.
It looks like I managed to capture the loop, but I may be wrong, since
it may be a loop over icons.

Can you have a look and see ?

Doron
Comment 27 Doron 2008-09-09 07:09:24 UTC
Created attachment 18782 [details]
gdb tracing with break point.
Comment 28 Michel Dänzer 2008-09-09 07:31:20 UTC
(In reply to comment #26)
> Can you have a look and see ?

I've never seen this kind of gdb output, and looking at it I can't help feeling like I'm looking for a needle in a haystack.

If you tell gdb to 'finish' at the RADEONEnterVT breakpoint with the DRI disabled, do you get a gdb prompt back or does it hang before that? If the former, the problem is somewhere outside that function.
Comment 29 Doron 2008-09-09 07:53:18 UTC
(In reply to comment #28)
> If you tell gdb to 'finish' at the RADEONEnterVT breakpoint with the DRI
> disabled, do you get a gdb prompt back or does it hang before that? If the
> former, the problem is somewhere outside that function.
> 
Well, I actually get this break twice: the first time when I switch to VT1 (always
OK), then when I try to switch back to VT7 (which hangs). I did what you
suggested, and on the 2nd break I got a gdb prompt. So I typed finish and got
the following:

Program received signal SIGUSR1, User defined signal 1.
0xb7f47424 in __kernel_vsyscall ()
Run till exit from #0  0xb7f47424 in __kernel_vsyscall ()
0xb7b5695d in select () from /lib/libc.so.6

At this point the screen was black and I got another prompt, so I quit gdb.
X returned (and I quickly shut it down to avoid hangs...)

What do you make of it ?
Comment 30 Michel Dänzer 2008-09-09 08:15:35 UTC
(In reply to comment #29)
> Program received signal SIGUSR1, User defined signal 1.
> 0xb7f47424 in __kernel_vsyscall ()
> Run till exit from #0  0xb7f47424 in __kernel_vsyscall ()
> 0xb7b5695d in select () from /lib/libc.so.6

That's not the RADEONEnterVT breakpoint but a SIGUSR1, which is part of any VT switch. You can tell gdb not to stop on SIGUSR1 using 'handle SIGUSR1 nostop'.
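As a rough sketch, the signal handling and the breakpoint can be put into a gdb command file so that they are set up the same way in every session. The helper name and the file path below are hypothetical; the gdb commands themselves are the ones mentioned in this thread:

```shell
#!/bin/sh
# Hypothetical helper: write a gdb command file that keeps gdb from
# stopping on the SIGUSR1 delivered during VT switches and breaks on
# RADEONEnterVT instead. $1 = path of the command file to create.
write_entervt_cmds() {
    cat > "$1" <<'EOF'
handle SIGUSR1 nostop
break RADEONEnterVT
continue
EOF
}

# Example (file name and location are arbitrary):
#   write_entervt_cmds /tmp/entervt.gdb
#   gdb -p "$(pidof X)" -x /tmp/entervt.gdb
```

Once the breakpoint is hit, 'finish' runs until RADEONEnterVT returns.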
Comment 31 Doron 2008-09-10 03:00:15 UTC
Created attachment 18801 [details]
gdb avoiding normal switch, using finish

Hi Michel,
OK, using the handle command I got to the breakpoint.
I used finish 7 times, and X hung!

Can you see something meaningful here ?

Doron.
Comment 32 Michel Dänzer 2008-09-10 03:10:43 UTC
(In reply to comment #31)
> Can you see something meaningful here ?

I'm afraid not - Dispatch() is the main protocol request processing function, so it could still be pretty much anything, we've only ruled out that the problem is triggered from RADEONEnterVT directly.

Does the problem also happen with a 'naked' X server (without any clients)?

Other than that I'm running out of ideas, maybe someone else can chime in...
Comment 33 Doron 2008-09-10 04:02:14 UTC
(In reply to comment #32)
> 
> Does the problem also happen with a 'naked' X server (without any clients)?
No. I can switch out of and into X with no issues.

> 
> Other than that I'm running out of ideas, maybe someone else can chime in...
> 

I can only hope...
Comment 34 Michel Dänzer 2008-09-10 07:22:12 UTC
(In reply to comment #33)
> > Does the problem also happen with a 'naked' X server (without any clients)?
> No. I can switch out of and into X with no issues.

Interesting - so the problem could be related to a specific acceleration primitive which isn't hit by the root weave.

With XAA, you can disable specific acceleration primitives using Option "XaaNo..." documented in the xorg.conf manpage. Basically, for each primitive listed after the log line

(II) RADEON(0): Using XFree86 Acceleration Architecture (XAA)

try the corresponding XaaNo... option and see if you can find a single such option which avoids the problem.
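
A small sketch for pulling that list of primitives out of the log. It assumes the primitives appear as indented lines directly below the XAA line, which is how the X server usually prints them but is not shown in this report; the function name and the log path are hypothetical as well:

```shell
#!/bin/sh
# Hypothetical helper: print the acceleration primitives listed under
# the XAA line of an X log. Assumes they are indented continuation
# lines; the list ends at the first line that is not indented.
# $1 = path to the X log file.
list_xaa_primitives() {
    awk '
        /Using XFree86 Acceleration Architecture \(XAA\)/ { inlist = 1; next }
        inlist && /^[ \t]/ { sub(/^[ \t]+/, ""); print; next }
        inlist { exit }
    ' "$1"
}

# Example:
#   list_xaa_primitives /var/log/Xorg.0.log
```

Each printed primitive would then be matched by hand against the XaaNo... options in the xorg.conf manpage.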
Comment 35 Doron 2008-09-14 01:53:27 UTC
Some good news !
Today I emerged the latest drm release, x11-base/x11-drm-20080710.
Until now I had been using x11-base/x11-drm-20071019. I'm very happy
to say the problem is gone. I want to test it for a day or so and
then I'll close this bug.

In the meantime, cross your fingers ;)

Doron.
Comment 36 Doron 2008-09-15 23:39:44 UTC
Hi Michel and others,

Everything is up and running. Even managed to run compiz-fusion,
sleep and resume. All fine. Here's a list of relevant configuration
for future reference:

The main packages' working versions:
x11-base/xorg-server-1.5.0
x11-base/x11-drm-20080710
x11-apps/mesa-progs-7.1
x11-libs/libdrm-2.3.1
x11-drivers/xf86-video-ati-6.9.0
x11-drivers/xf86-input-keyboard-1.3.1
x11-drivers/xf86-input-mouse-1.3.0

Device settings (in xorg.conf):
        Identifier  "Alone"
        Driver      "radeon"
        VendorName  "ATI Technologies Inc"
        BoardName   "M24GL [Mobility FireGL V3200] rev 128"
       #Option      "NoAccel"                   # [<bool>]
        Option      "MonitorLayout" "AUTO,NONE" # [<str>]
       #Option      "DynamicClocks" "on"        # [<bool>]
        Option      "AccelMethod" "EXA" #"XAA"
       #Option      "AccelDFS" "1"
        BusID       "PCI:1:0:0"

I hope others will benefit from this as well.

Thanks a lot for all the help !
Doron.
