Bug 16549

Summary: DRI Locking code broken on DEC Alpha with Radeon
Product: xorg Reporter: Matt Turner <mattst88>
Component: Driver/Radeon Assignee: Matt Turner <mattst88>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium    
Version: 7.4 (2008.09)   
Hardware: Alpha   
OS: Linux (All)   
Attachments:
Description | Flags
strace of glxgears | none
dmesg from a fresh boot and X startup | none
kernel config for 2.6.24-gentoo-r3 | none
xorg.conf | none
output of `dmesg` with drm debug before running glxgears | none
output of `dmesg` with drm debug after running glxgears | none
Xorg.0.log | none
output of `dmesg` with drm debug before running glxgears | none
output of `dmesg` with drm debug after running glxgears | none
output of `dmesg` using 9800 with 6.10.0 (shows out of memory errors) | none
Xorg.0.log using 9800 with 6.10.0 (shows where X stops loading) | none
Logs for Rage128+DRM | none
First try at fixing DRM_CAS on Alpha | none
alpha locking fix | none

Description Matt Turner 2008-06-27 17:34:23 UTC
Created attachment 17433 [details]
strace of glxgears

Running any 3D application on my DEC Alpha causes X to lock up solid. Quake3 triggers it after two or three seconds in the Q3 menu. glxgears triggers it sporadically, and more frequently if any part of the gears window is covered, for example by dragging the cursor over it or placing another window on top.

Specs:
Dual 833MHz EV68
Radeon 9100

xserver-1.4.2
mesa-7.0.3
libdrm-git 5d27fd94afaaf434c3a92af0075420b550055bfb (June 5, 2008)
x11-drm-git
xf86-video-ati-6.8.0 or 6.9.0.

Attached are the contents of glxgears.log as generated by 'strace glxgears &> glxgears.log'. Note: it is 9MB of text generated in only a few seconds, so I've bz2'd it to < 200 K. The log was originally 26MB, but I've trimmed 17MB of the exact same repeating line from the end of the file.
Comment 1 Matt Turner 2008-06-27 17:35:12 UTC
Created attachment 17434 [details]
dmesg from a fresh boot and X startup
Comment 2 Matt Turner 2008-06-27 17:36:00 UTC
Created attachment 17435 [details]
kernel config for 2.6.24-gentoo-r3
Comment 3 Alex Deucher 2008-06-28 09:50:47 UTC
Can you attach your xorg log and config as well?
Comment 4 Matt Turner 2008-06-28 10:42:54 UTC
Created attachment 17447 [details]
xorg.conf

I can't believe I forgot to attach this.
Comment 5 Matt Turner 2008-06-28 12:20:07 UTC
Created attachment 17448 [details]
output of `dmesg` with drm debug before running glxgears

This is the output of `dmesg` before glxgears has been run. I modprobed drm with `modprobe drm debug=1` and modprobed radeon before starting X.
Comment 6 Matt Turner 2008-06-28 12:21:38 UTC
Created attachment 17449 [details]
output of `dmesg` with drm debug after running glxgears

This is the output of `dmesg` after glxgears has been run. I modprobed drm
with `modprobe drm debug=1` and modprobed radeon before starting X.

Note, pid=3770 was glxgears during this run.
Comment 7 Matt Turner 2008-06-28 12:37:57 UTC
Created attachment 17450 [details]
Xorg.0.log

This is the log as it stands after glxgears has been run. Running glxgears adds nothing to the log.
Comment 8 Matt Turner 2009-01-23 16:01:27 UTC
I just tried this again on my UP1500 with the Radeon 9100, kernel 2.6.29_rc2 with DRM/radeon built as modules, xserver-1.5.3, libdrm-2.4.4, mesa-7.2, xf86-video-ati-6.10.

glxgears fails with -22.

The following is output to dmesg, but other than that it does not kill the system like before.

[drm:radeon_cp_cmdbuf] *ERROR* radeon_cp_cmdbuf called without lock held, held  0 owner fffffc00f24a1600 fffffc00f24a1600

What can I do to debug further?
Comment 9 Michel Dänzer 2009-01-24 01:32:51 UTC
(In reply to comment #8)
> 
> [drm:radeon_cp_cmdbuf] *ERROR* radeon_cp_cmdbuf called without lock held, held 
> 0 owner fffffc00f24a1600 fffffc00f24a1600

Looks like the DRM lock code may be broken on alpha. Try writing 1 to /sys/module/drm/parameters/debug before starting glxgears to get more DRM debugging output.

This may be a separate issue which just masks the one originally reported here though.
Comment 10 Matt Turner 2009-01-24 15:03:53 UTC
Created attachment 22217 [details]
output of `dmesg` with drm debug before running glxgears

In this, pid=3409 is /usr/bin/X as seen by `ps aux`

root      3409  0.4  0.2  21808  9232 tty7     Ss+  17:55   0:02 /usr/bin/X :0 vt7 -auth /etc/X11/xdm/authdir/authfiles/A:0-7ZT4Pa
Comment 11 Matt Turner 2009-01-24 15:04:59 UTC
Created attachment 22218 [details]
output of `dmesg` with drm debug after running glxgears

pid 3409 is /usr/bin/X. I can only assume pid 3454 is glxgears.
Comment 12 Matt Turner 2009-01-24 17:58:43 UTC
Using libdrm-2.3.0, mesa-6.5.2 and xserver-1.3 -> DRI works

Nothing in the locking code in libdrm/xf86drm.h (DRM_CAS etc) has changed from libdrm 2.3.0 to 2.4.4.

Other suggestions for where to look?
Comment 13 Michel Dänzer 2009-01-25 02:45:40 UTC
(In reply to comment #12)
> Using libdrm-2.3.0, mesa-6.5.2 and xserver-1.3 -> DRI works

What about the problem you reported originally?

> Nothing in the locking code in libdrm/xf86drm.h (DRM_CAS etc) has changed from
> libdrm 2.3.0 to 2.4.4.

Is that using the same kernel / DRM modules? If so, you should be able to isolate the userspace change which broke things for you.

Anyway, looking at the debugging output:

[drm:drm_lock] 3 (pid 3454) requests lock (0x00000001), flags = 0x00000000
[drm:drm_lock] 3 has lock
[drm:drm_ioctl] pid=3454, cmd=0x80206450, nr=0x50, dev 0xe200, auth=1
[drm:radeon_cp_cmdbuf] *ERROR* radeon_cp_cmdbuf called without lock held, held  0 owner fffffc00f2a57c00 fffffc00f2a57c00

The first two lines indicate that drm_lock_take() succeeds for glxgears' context, but then LOCK_TEST_WITH_RETURN() fails in radeon_cp_cmdbuf(). There does seem to be an inconsistency.
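
To make the inconsistency concrete, here is a minimal C sketch of the kind of check that seems to be failing. It is not the kernel's code: the bit values are assumed to mirror _DRM_LOCK_HELD and _DRM_LOCK_CONT from the DRM headers, and `struct lock_state`, `lock_is_held()`, `lock_context()` and `lock_test()` are hypothetical names used only for illustration. Reading the error line as "held bit, recorded owner, caller", a held value of 0 with two equal owner pointers would mean the owner was recorded but the held bit in the lock word is clear.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror _DRM_LOCK_HELD / _DRM_LOCK_CONT; illustrative only. */
#define LOCK_HELD_BIT 0x80000000U
#define LOCK_CONT_BIT 0x40000000U

/* Hypothetical stand-in for the shared lock word plus its recorded owner. */
struct lock_state {
    uint32_t word;      /* context number with the held/contended bits on top */
    const void *owner;  /* opaque token identifying the current lock owner */
};

/* True if the held bit is set in the lock word. */
static bool lock_is_held(uint32_t word)
{
    return (word & LOCK_HELD_BIT) != 0;
}

/* Context number stored in the low bits of the lock word. */
static uint32_t lock_context(uint32_t word)
{
    return word & ~(LOCK_HELD_BIT | LOCK_CONT_BIT);
}

/* Fails with -EINVAL unless the lock is held AND the caller is the owner. */
static int lock_test(const struct lock_state *s, const void *caller)
{
    assert(s != NULL);
    if (!lock_is_held(s->word) || s->owner != caller)
        return -EINVAL;
    return 0;
}
```

With a lock word of 3 (context 3, held bit clear) and a matching owner, `lock_test()` returns -EINVAL, which would be consistent with glxgears failing with -22.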
Comment 14 Matt Turner 2009-01-25 12:14:48 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > Using libdrm-2.3.0, mesa-6.5.2 and xserver-1.3 -> DRI works
> 
> What about the problem you reported originally?

I haven't tried to reproduce it again, but I would think it is a symptom of the locking bug.

> 
> > Nothing in the locking code in libdrm/xf86drm.h (DRM_CAS etc) has changed from
> > libdrm 2.3.0 to 2.4.4.
> 
> Is that using the same kernel / DRM modules? If so, you should be able to
> isolate the userspace change which broke things for you.

Yes, same kernel and DRM modules.

I have tested the following configurations. All fail the same way.

Holding constant:
kernel-2.6.29_rc2 (and DRM modules)
xorg-server-1.5.3-r1
mesa-7.2

Variations:
libdrm-2.4.4
xf86-video-ati-6.10.0

libdrm-2.4.4
xf86-video-ati-6.8.0-r1

libdrm-2.3.1
xf86-video-ati-6.10.0

libdrm-2.3.1
xf86-video-ati-6.7.197

Including so much code from libdrm statically into Mesa and X.Org server makes finding this bug difficult.
Comment 15 Matt Turner 2009-01-31 12:50:52 UTC
I retested with a Radeon 9800 AGP (forced into PCI mode) and can reproduce with the following versions (glxgears fails with drmRadeonCmdBuf: -22)

linux kernel-2.6.29_rc2
xserver-1.5.3
mesa-7.3
libdrm-2.4.4
xf86-video-ati-6.8.0 or xf86-video-ati-6.9.0.

Using 6.10.0, X, hal, and sshd all fail with out of memory errors. (I did not see this running the 9100 PCI card)
Comment 16 Matt Turner 2009-01-31 12:52:12 UTC
Created attachment 22419 [details]
output of `dmesg` using 9800 with 6.10.0 (shows out of memory errors)
Comment 17 Matt Turner 2009-01-31 12:52:53 UTC
Created attachment 22420 [details]
Xorg.0.log using 9800 with 6.10.0 (shows where X stops loading)
Comment 18 Tobias Klausmann 2009-02-02 13:29:59 UTC
Created attachment 22506 [details]
Logs for Rage128+DRM

The attached file contains the xorg log output and dmesg output (with and without debug=1 for drm) for my Rage 128.

01:05.0 VGA compatible controller: ATI Technologies Inc Rage 128 Pro Ultra TF (prog-if 00 [VGA controller])
	Subsystem: PC Partner Limited Device 7106
	Flags: bus master, stepping, 66MHz, medium devsel, latency 255, IRQ 5
	Memory at fc000000 (32-bit, prefetchable) [size=64M]
	I/O ports at 8000 [size=256]
	Memory at f9000000 (32-bit, non-prefetchable) [size=16K]
	Expansion ROM at fa000000 [disabled] [size=128K]
	Capabilities: [50] AGP version 2.0
	Capabilities: [5c] Power Management version 2

# uname -sr 
Linux 2.6.29-rc2
(+patches)

Xorg is v1.5.3
Comment 19 Matt Turner 2009-02-04 17:28:00 UTC
I retested xserver-1.3, mesa-6.5.2, libdrm-2.3 with 2.6.29_rc2. It fails the same way. Now I don't have any idea where the problem is.

Going to try an older kernel.
Comment 20 Matt Turner 2009-02-04 17:56:23 UTC
Tried with 2.6.26, same thing.

Also, I ran the 'lock' test program included with libdrm. Got the following output for both 2.6.29_rc2 and 2.6.26 kernels.

lt-lock: Unlocking unlocked lock succeeded: Invalid argument

I think we're finally on the right track.
Comment 21 Matt Turner 2009-02-08 18:02:19 UTC
Created attachment 22697 [details] [review]
First try at fixing DRM_CAS on Alpha

First try at fixing DRM_CAS on Alpha.

Before the patch, dmesg showed messages like

> [drm:radeon_cp_cmdbuf] *ERROR* radeon_cp_cmdbuf called without lock held, held 0 owner fffffc00f2a57c00 fffffc00f2a57c00

and glxgears would die with -22. With the patch, dmesg shows only

> [drm:radeon_cp_init] *ERROR* radeon_cp_init called without lock held, held  0 owner (null) fffffc00e8add980

and glxgears runs (albeit software rendered).

Looks like I've got a corner case to fix with the patch. Tobias, could you test the patch?

To do so, patch and build libdrm and rebuild mesa, xorg-server and your video driver.
Comment 22 Ivan Kokshaysky 2009-02-13 04:50:54 UTC
Created attachment 22898 [details] [review]
alpha locking fix

I see two problems with the alpha DRM_CAS() implementation:
- it doesn't retry on lock contention;
- the return value is stored on the stack (ouch...), which may
  produce "interesting" results depending on compiler and usage
  of this macro.

This patch makes DRM_CAS() on alpha behave the same way
as on x86 or powerpc. It is also better to define DRM_CAS_RESULT()
as "long", which eliminates an extra instruction for sign extension.

Matt says that the patch works for him.

Ivan.
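
For readers without the attachment, the retry behaviour described above can be sketched in portable C11. This is not the attached patch, which is alpha-specific: `cas_retry()`, `try_light_lock()`, `try_light_unlock()` and `HELD_FLAG` are hypothetical names, and the "0 on success, nonzero on failure" convention is an assumption about how DRM_CAS() reports its result. `atomic_compare_exchange_weak()` may fail spuriously, much as a store-conditional can, so the loop retries while the observed value still equals the expected one and reports failure only on a real mismatch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the DRM "lock held" bit; illustrative only. */
#define HELD_FLAG 0x80000000U

/*
 * Compare-and-swap with retry. Returns 0 if *lock was changed from
 * old_val to new_val, 1 if *lock held some other value. The result is
 * a long, following the suggestion above for DRM_CAS_RESULT().
 */
static long cas_retry(_Atomic uint32_t *lock, uint32_t old_val, uint32_t new_val)
{
    uint32_t expected = old_val;

    assert(lock != NULL);
    while (!atomic_compare_exchange_weak(lock, &expected, new_val)) {
        if (expected != old_val)
            return 1;   /* genuine mismatch: someone else owns the word */
        /* spurious failure: the value still matches, so try again */
    }
    return 0;
}

/* Hypothetical helper: take the lock for context ctx if it is free. */
static long try_light_lock(_Atomic uint32_t *lock, uint32_t ctx)
{
    return cas_retry(lock, ctx, HELD_FLAG | ctx);
}

/* Hypothetical helper: release the lock if ctx currently holds it. */
static long try_light_unlock(_Atomic uint32_t *lock, uint32_t ctx)
{
    return cas_retry(lock, HELD_FLAG | ctx, ctx);
}
```

In this sketch, unlocking a lock that is not held reports failure, which is the property the libdrm 'lock' test from comment 20 found violated.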
Comment 23 Adam Jackson 2009-02-23 12:55:42 UTC
Applied, thanks!
