Bug 9379

Summary:	drmCommandNone( fd, DRM_R128_CCE_IDLE ) - gives errno 22
Product:	Mesa	Reporter:	Miroslav Šustek <sustmidown>
Component:	Drivers/DRI/r128	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED WONTFIX	QA Contact:
Severity:	normal
Priority:	high	CC:	alexdeucher, jhansonxi, marauder, peter, whysogreen
Version:	git
Hardware:	x86 (IA32)
OS:	Linux (All)
Whiteboard:
Attachments:	patch to disable span locking; log showing two consecutive locking attempts; patch to fix various span locking issues; span locking patch; patch to add support for projective textures

Description Miroslav Šustek 2006-12-17 11:07:13 UTC

When I run a game (in this case Paintball2, a 3D first-person shooter
based on the Quake2 engine) and go to a place in the game that is very hard
to render (very low fps), the game quits with the error
"Error: Rage 128 timed out... exiting".
I found that this error comes from the r128_ioctl.c file, in the function:
void r128WaitForIdleLocked( r128ContextPtr rmesa )
So I made some changes to see what is wrong and edited the code a little:

do {
  i = 0;
  do {
    ret = drmCommandNone( fd, DRM_R128_CCE_IDLE );
+   fprintf( stderr, "ret=%d, i=%d, to=%d\n" , ret, i, to);
+   if (ret) DO_USLEEP( 100 );
- } while ( ret && errno == EBUSY && i++ < R128_IDLE_RETRY );
+ } while ( ret && i++ < R128_IDLE_RETRY );
} while ( ( ret == -EBUSY ) && ( to++ < R128_TIMEOUT ) );

if ( ret < 0 ) {
+ fprintf( stderr, "(ret=%d, i=%d, to=%d, errno=%d)\n" , ret, i, to, errno);
  drmCommandNone( fd, DRM_R128_CCE_RESET);

When I run the game and walk around the level looking at a wall, everything
is OK and it prints out:

    ret=0, i=0, to=0

(with every frame, I think)
So the IDLE call succeeds immediately (in the first iteration).

But when I look at some "hard-to-render" place (many polygons or something),
the game hangs and it prints out:

ret=-22, i=0, to=0
ret=-22, i=1, to=0
ret=-22, i=2, to=0
...
ret=-22, i=32, to=0
(ret=-22, i=33, to=0, errno=22)
Error: Rage 128 timed out... exiting

It seems that something prevents the IDLE call from ever succeeding again.
I also tried adding a line with DO_USLEEP(500000), but it didn't help.
Error number 22 is EINVAL (invalid argument), which is returned when ioctl's
"request" or "argp" arguments are not valid.
At most it should return error 16 (EBUSY)...
I'm not expert enough in Mesa's DRI and DRM code to find the problem.
(PS: I had this problem with some other games and with Blender (3D modeling),
while some games, e.g. Quake3Arena, don't trigger it.
Maybe it occurs only when the rendered scene is very full of polygons.
The win32 version of Paintball2 doesn't hang in these scenes, but it has a high spf
(seconds per frame).)

Thank you for helping.

Comment 1 Andrew Randrianasulu 2006-12-22 15:57:24 UTC

I can confirm this bug, Paintball2_alpha016 exit during loading first map.

In dmesg:

[drm] Initialized drm 1.1.0 20060810
ACPI: PCI Interrupt 0000:01:00.0[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
[drm] Initialized r128 2.5.0 20030725 on minor 0
agpgart: Found an AGP 1.0 compliant device at 0000:00:00.0.
agpgart: Putting AGP V2 device at 0000:00:00.0 into 2x mode
agpgart: Putting AGP V2 device at 0000:01:00.0 into 2x mode
[drm:r128_cce_idle] *ERROR* r128_cce_idle called without lock held, held  0 owner c947c5a0 c947c5a0

I'm not sure whether this is related to the kpovmodeler lockup: I can see the program window,
but after all UI elements are rendered, the whole X server freezes.

[drm:drm_lock_take] *ERROR* 3 holds heavyweight lock
SysRq : Terminate All Tasks

kde 3.5.5, X.org 7.1.1 , Mesa/drm from git (22/12/2006). (also with Mesa
6.5.2/kernel 2.6.17.14 drm)

Comment 2 Will L G 2007-01-28 18:52:50 UTC

AlphaPC 164SX/UP2000
XOrg-7.1
Mesa-6.5 and Mesa-6.5.2
Kernel-2.6.13.1 [alpha linux]
Radeon 7500 PCI

Have the same error with GLCORE/GLX enabled... GL screensavers generate this same error 100% of the time, along with ANY other applications using DRI/OpenGL:

Jan 28 20:46:18 jericho kernel: [drm:drm_lock_take] *ERROR* 3 holds heavyweight lock

Comment 3 Miroslav Šustek 2007-02-07 11:39:10 UTC

Created attachment 8627 [details] [review]
patch to disable span locking

I found a similar problem in the R300 driver, and I also found a patch which should solve it (but I don't know if it's a real solution).
So I rewrote this patch for r128 with a small change: span locking is disabled by default.
Set R128_ENABLE_SPAN_LOCKING if you want to temporarily disable this patch.

Comment 4 Miroslav Šustek 2007-02-07 12:30:43 UTC

Edit:
> Set R128_SPAN_ENABLE_LOCKING if you want to temporarily disable this patch.

Comment 5 Andrew Randrianasulu 2007-03-13 20:54:42 UTC

Patch from comment #3 works fine!

No more lockups with kpovmodeler. Paintball2 nearly hangs, but the system stays responsive: I can switch to another VT and kill the game. More or less, it works!


(today Mesa git + patch, and 2-3 days old drm git, kernel 2.6.19.7 X.org 7.2 xf86-video-ati from 22 feb 2007 git )

HW: p2-420Mhz/416 Mb RAM/
video-HW:
(--) R128(0): Chipset: "ATI Rage 128 Pro GL PF (AGP)" (ChipID = 0x5046)
(--) R128(0): VideoRAM: 16384 kByte (64-bit SDR SGRAM 1:1)

Comment 6 Michel Dänzer 2007-03-14 00:27:28 UTC

I'm afraid the patch is not acceptable in this form. If anything, it should default to the span locking being enabled, with the possibility of disabling it via an environment variable, as is done in the r300 driver. However, AFAIK the latter hasn't actually needed it in a while (certainly not in my experience), so it would be even better to identify and fix the actual cause of the hangs. It sounds like other parts of the driver may not handle the locking correctly.
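The default-on behaviour with an environment opt-out could be sketched roughly as follows. This is only an illustration: the variable name `R128_NO_SPAN_LOCKING` and the helper `span_locking_enabled()` are hypothetical and are not taken from the r128 or r300 sources.

```c
#include <stdlib.h>

/* Hypothetical variable name, chosen for illustration only. */
#define SPAN_LOCKING_DISABLE_ENV "R128_NO_SPAN_LOCKING"

/* Span locking stays enabled unless the user explicitly opts out by
 * setting the variable to a non-empty value. */
static int span_locking_enabled(void)
{
    const char *v = getenv(SPAN_LOCKING_DISABLE_ENV);
    int opted_out = (v != NULL && v[0] != '\0');
    return !opted_out;
}
```

With this shape the safe path is the default, and the workaround has to be requested on purpose for a single run.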

Comment 7 Peter Cordes 2007-03-19 18:06:04 UTC

On my core2duo running Ubuntu Edgy amd64, mesa and dri from git (as of March 2), dual seat i965+r128 (X -sharevts ...), I can reproduce the "r128_cce_idle called without lock held" error, followed by the program exiting, with armagetron on the r128 X server.

peter@tesla:~$ armagetron
 menu comes up.  start a game and...
Error: Rage 128 timed out... exiting
peter@tesla:~$ echo $?
255
 dmesg:
[1465378.259327] [drm:r128_cce_idle] *ERROR* r128_cce_idle called without lock held, held  0 owner ffff81003b14e180 ffff81003b14e180



peter@tesla:~$ LIBGL_DEBUG=verbose MESA_DEBUG=1 armagetron
libGL: XF86DRIGetClientDriverName: 4.1.0 r128 (screen 0)
libGL: OpenDriver: trying /usr/local/lib/dri/r128_dri.so
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 5, (OK)
drmOpenByBusid: Searching for BusID pci:0000:06:01.0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 5, (OK)
drmOpenByBusid: drmOpenMinor returns 5
drmOpenByBusid: drmGetBusid reports pci:0000:00:02.0
drmOpenDevice: node name is /dev/dri/card1
drmOpenDevice: open result is 5, (OK)
drmOpenByBusid: drmOpenMinor returns 5
drmOpenByBusid: drmGetBusid reports pci:0000:06:01.0
Mesa warning: couldn't open libtxc_dxtn.so, software DXTn compression/decompression unavailable
libGL error: 
Can't open configuration file /etc/drirc: No such file or directory.
Error: Rage 128 timed out... exiting
Segmentation fault
peter@tesla:~$ echo $?
139
 dmesg:
[1465931.617255] [drm:r128_cce_idle] *ERROR* r128_cce_idle called without lock held, held  0 owner ffff81002c9ce0c0 ffff81002c9ce0c0
[1465931.733266] armagetron.real[3733]: segfault at 00002ad9f4b427f9 rip 00002ad9f4b427f9 rsp 00007fffb8ab5948 error 14

 It didn't segfault the previous two times, without the debug env vars...
My armagetron settings include full mirroring of everything, dual textured plane, and a bunch of stuff that my i965 can handle just fine.  It's probably unplayably slow on r128.  If anyone tries to reproduce this, but can't, I can send my armagetron config file...

 BTW, https://bugs.freedesktop.org/show_bug.cgi?id=7994 mentions seeing the same kernel log message.

Comment 8 Alex Deucher 2007-03-26 19:45:01 UTC

does mesa commit 25f21b5331d27225b1f6b7aaf2c9bf3f32764d91 (see bug 7994) help?

Comment 9 Peter Cordes 2007-03-27 18:22:32 UTC

> does mesa commit 25f21b5331d27225b1f6b7aaf2c9bf3f32764d91 (see bug 7994) help?

 Unfortunately no, it doesn't.

armagetron still crashes.  In System Settings->Display Settings->Detail Settings, if I set "object textures" to "off", it runs fine when I start a game.  In game, if I hit escape and change the "object textures" setting, the program crashes right away with stderr of:
Error: Rage 128 timed out... exiting

kernel log:
[drm:r128_cce_idle] *ERROR* r128_cce_idle called without lock held, held  0 owner ffff810052cc0680 ffff810052cc0680

 Another problem that's still present (and was mentioned in this bug re: kpovmodeler) is the lockup of the whole server until the client program is killed.  This is https://bugs.freedesktop.org/show_bug.cgi?id=3493.  If you have access to any r128 hardware, progs/demos/engine is the best test case for it.  It locks up, with 100% repeatability, before it can even draw its window the first time (the  previous contents of that screen area are still visible inside the window frame).  And killall engine restores everything to normal.  The behaviour of the "engine" lockup is completely unchanged by that commit.  It still says:
[drm:drm_lock_take] *ERROR* 3 holds heavyweight lock

Comment 10 András Péteri 2007-09-05 01:37:48 UTC

Created attachment 11428 [details]
log showing two consecutive locking attempts

Comment 11 András Péteri 2007-09-05 01:42:11 UTC

(Sorry for the double post, I'm a first-time bugzilla user.)

re: [drm:drm_lock_take] *ERROR* 3 holds heavyweight lock

I have attached some heavily formatted log output, which shows drm debug messages at the time of the hang (the messages come from the kernel log, obtained using "modprobe drm debug=1"). You can see that after the cce_idle ioctl call, context 3 tries to lock again without unlocking first. I recompiled the Mesa DRI driver with the DEBUG_LOCKING flag set in r128_lock.h, and found out that this double locking behavior occurs in the depth buffer calls in r128_span.c. Basically

 /* These functions require locking */

+/*
 #undef HW_LOCK
 #undef HW_UNLOCK
 #define HW_LOCK()    LOCK_HARDWARE(R128_CONTEXT(ctx));
 #define HW_UNLOCK()  UNLOCK_HARDWARE(R128_CONTEXT(ctx));
+*/

 /* 16-bit depth buffer functions
  */

fixed the problem for me. Apparently, the lock is taken before these functions get called (AFAICR, r128SpanRenderStart() does the job, but I'm not sure).

re: *ERROR* r128_cce_idle called without lock held

drmP.h has a LOCK_TEST_WITH_RETURN macro, which returns -EINVAL along with this error message when the lock isn't held. Because of this, the retry loop in r128WaitForIdleLocked exits with ret == -EINVAL, so the "timed out" error message is quite misleading in this case. The real error is that r128WaitForIdleLocked got called without a lock held.
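To illustrate why the message is misleading, here is a minimal sketch of an idle wait that retries only on -EBUSY and reports other errors separately. It is not the driver's code: `idle_fn` stands in for the drmCommandNone( fd, DRM_R128_CCE_IDLE ) call, and `IDLE_RETRY`, `wait_for_idle()` and `describe_idle_result()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the idle ioctl: returns 0 when the engine
 * is idle, or a negative errno value on failure. */
typedef int (*idle_fn)(void);

enum { IDLE_RETRY = 32 };   /* assumed retry budget, not the driver's value */

/* Retry only while the engine reports "busy"; return any other result
 * immediately so the caller can report it accurately. */
static int wait_for_idle(idle_fn poll_idle)
{
    int ret = 0;

    assert(poll_idle != NULL);
    for (int i = 0; i < IDLE_RETRY; i++) {
        ret = poll_idle();
        if (ret != -EBUSY)
            break;          /* success, or an error that retrying cannot fix */
    }
    return ret;
}

/* Map the result to a message that names the actual failure. */
static const char *describe_idle_result(int ret)
{
    if (ret == 0)       return "idle";
    if (ret == -EBUSY)  return "timed out (engine still busy)";
    if (ret == -EINVAL) return "rejected (is the hardware lock held?)";
    return "failed (unexpected error)";
}

/* Example pollers used only to exercise the sketch. */
static int idle_now(void)      { return 0; }
static int always_busy(void)   { return -EBUSY; }
static int lock_not_held(void) { return -EINVAL; }
```

Under these assumptions, a missing lock shows up as "rejected" on the first call instead of being reported as a timeout.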

I modified DEBUG_LOCK in r128_lock.h to print all lock operations to the standard error output, and modified the code to call DEBUG_LOCK on unlocks too. r128WaitForIdleLocked was also replaced with a macro to show where it's called from. It turned out that sometimes r128SpanRenderFinish() is called without calling r128SpanRenderStart() first -- so no lock is held during the call to r128WaitForIdleLocked. I think this may be a problem with the software fallback. Armagetron exhibits this behavior, while gl-117 calls RenderStart and RenderFinish in pairs; lock operations also.
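The tracing approach described above (log every lock and unlock, and wrap the idle wait in a macro so the call site is visible) can be sketched like this. Everything here is a simplified assumption: `struct hw_lock`, `lock_op()`, `needs_lock()` and the `TRACE_*` macros are hypothetical and only model the bookkeeping, not the real DRM hardware lock.

```c
#include <stdio.h>

/* Hypothetical lock state; the real driver uses the DRM hardware lock. */
struct hw_lock {
    int held;     /* 1 while the lock is taken */
    int errors;   /* number of unbalanced operations seen so far */
};

/* Record a lock (take=1) or unlock (take=0) and flag unbalanced use. */
static void lock_op(struct hw_lock *l, int take, const char *file, int line)
{
    if (take && l->held) {
        fprintf(stderr, "%s:%d: lock taken twice\n", file, line);
        l->errors++;
    } else if (!take && !l->held) {
        fprintf(stderr, "%s:%d: unlock without lock\n", file, line);
        l->errors++;
    }
    l->held = take;
    fprintf(stderr, "%s:%d: %s\n", file, line, take ? "LOCK" : "UNLOCK");
}

/* Check that the lock is held before an operation that requires it. */
static int needs_lock(const struct hw_lock *l, const char *what,
                      const char *file, int line)
{
    if (!l->held) {
        fprintf(stderr, "%s:%d: %s called without lock held\n",
                file, line, what);
        return -1;
    }
    return 0;
}

/* The macros capture the caller's file and line, which a plain function
 * call would not reveal. */
#define TRACE_LOCK(l)          lock_op((l), 1, __FILE__, __LINE__)
#define TRACE_UNLOCK(l)        lock_op((l), 0, __FILE__, __LINE__)
#define TRACE_WAIT_FOR_IDLE(l) \
    needs_lock((l), "wait_for_idle", __FILE__, __LINE__)
```

A RenderFinish path that runs without a matching RenderStart would show up here as a "called without lock held" line that points at the offending call site.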

Using kernel 2.6.22 with drm and mesa from the freedesktop git.

Comment 12 András Péteri 2008-01-22 03:53:55 UTC

Created attachment 13855 [details] [review]
patch to fix various span locking issues

Here goes my proposed patch for fixing some locking issues with the r128.

I think that most of the problems are caused by some primitives which ended up in the software rasterizer without setting up the software part first. The defined HW_LOCK() and HW_UNLOCK() macros mentioned in comment #11 caused "double locking" ("*ERROR* 3 holds heavyweight lock"), if the correct fallback path was taken, and r128SpanRenderStart was called before reading/writing buffers. Removing these #defines as in the radeon source - leaving the job solely to ...SpanRenderStart and ...SpanRenderFinish - in turn left the software rasterizer without *any* locking for some primitives, which resulted in early application termination (because now eg. r128WaitForIdleLocked could be called without acquiring the locks -- hence the "*ERROR* r128_cce_idle called without lock held").

I added two new fallback flags based on error logs - after removing the #defines, blender crashed shortly after enabling GL_LINE_STIPPLE, and armagetron had some problems with GL_POLYGON_SMOOTH. (armagetron also has GL_LINE_SMOOTH, but it didn't crash when enabled, so I didn't add a separate case for it.)

Another issue is caused by projective textures (seen in armagetron). In r128RenderStart, when the routine determines from the vertex format (texture coordinate size) that software rendering should be enabled, it is already "too late" - some of the function pointer changes that _swsetup_Wakeup makes (particularly tnl->Driver.Render.Start, which is changed to _swsetup_RenderStart in _swsetup_Wakeup, which in turn would call r128SpanRenderStart to acquire proper locking) will only take effect (if at all) at the beginning of the next frame. So, once again, no locking would take place for that iteration. I added a check to see if R128_FALLBACK_PROJTEX was actually set in FALLBACK, and a call to run tnl->Driver.Render.Start if needed. This part is commented out, because I'm not sure if this is the right way to handle this; so, for now, projective textures are disabled by this patch.

Comment 13 András Péteri 2008-01-22 13:41:48 UTC

Comment on attachment 13855 [details] [review]
patch to fix various span locking issues

I tried to run more games after applying the patch, and some of them terminated with a segmentation fault and/or put multiple "r128_cce_* called without lock held" messages into the kernel log. So it's not working properly.

Comment 14 Jeff D. Hanson 2009-01-15 20:44:42 UTC

I believe I've encountered this same bug on Ubuntu with Armagetron Advanced and Chromium:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-r128/+bug/214442

Comment 15 András Péteri 2009-04-19 09:39:57 UTC

Created attachment 24949 [details] [review]
span locking patch

The previous patch left three calls to the software rasterizer without locking (these are called when both hardware accelerated and software rendered primitives are drawn), this one wraps them correctly.

Texture projection is still commented out, because not too many OpenGL applications have an option to disable it (if I remember correctly, armagetronad recently had this setting removed), and rendering this kind of primitive in software hits the frame rate quite hard. Since only the first and second texture coordinate components are left in place, leaving it this way (=unhandled) can cause rendering artifacts (some texcoord transformation magic could solve this, but it could only occur at the vertex level, and the texture would still be linearly interpolated pixel-wise).
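A small numeric sketch of why a vertex-level divide is not equivalent to a true projective lookup. It interpolates along a single edge, ignores perspective correction, and uses hypothetical helper names; it only shows that dividing by q before interpolation differs from dividing after it.

```c
/* Per-vertex divide: s/q is computed at each vertex, and the quotient
 * is then linearly interpolated (t runs from 0 to 1 along the edge). */
static double per_vertex_divide(double s0, double q0,
                                double s1, double q1, double t)
{
    double a = s0 / q0;
    double b = s1 / q1;
    return a + (b - a) * t;
}

/* Per-pixel divide: s and q are interpolated separately and divided at
 * each sample, which is what projective texturing requires. */
static double per_pixel_divide(double s0, double q0,
                               double s1, double q1, double t)
{
    double s = s0 + (s1 - s0) * t;
    double q = q0 + (q1 - q0) * t;
    return s / q;
}
```

For an edge from (s=0, q=1) to (s=2, q=2), both functions agree at the endpoints, but at the midpoint the first gives 0.5 while the second gives 1/1.5, roughly 0.667. That gap is the kind of artifact the vertex-level workaround cannot remove.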

The attached patch applies to git commit dc1153ce83041a397b1d1815db4133ce8c53eaa1 and mesa_7.4-0ubuntu3 source package from the ubuntu jaunty repository, but was tested only with the latter; to test, start a free flight in gl-117, leave the camera in first person mode and fly your plane into the Sun :)

Comment 16 András Péteri 2009-07-10 15:03:43 UTC

Created attachment 27576 [details] [review]
patch to add support for projective textures

Enables hardware-accelerated rendering of primitives that have vertices with 'q' texture coordinates used in projective texture mapping. It must be applied after attachment 24949 [details] [review].

The patch is generally based on the "ptex hack" found in the S3 Savage sources (which uses the standard vertex format, then shuffles values around to their correct position just before sending the vertices to the DMA buffer), modified to support two sets of texture coordinates; the vertex layout used by the r128, along with the additional vertex format flag needed, were obtained from earlier Mesa/DRI sources (eg. [1][2]). The performed calculations are essentially the same in both cases (savage & r128).

Both patches apply against the mesa_7.4-0ubuntu3.1 source package and git commit eb33c0ab8b3594f0b1d58534a13a26e3fb050cff, however only the former has been tested.

(ps. I'm unsure which vertex size the SAREA structure should store, but I also couldn't find any piece of code where that member gets referenced, so it may not matter either way.)

[1] http://www.koders.com/c/fid0D7829772F2499BEA37AC17768976CD64AD3459C.aspx?s=rhw2#L62

[2] http://www.koders.com/c/fidDF4ACA3257B4F85D79E4E734EA089EC50D1DBB7F.aspx#L79

Comment 17 András Péteri 2009-07-11 12:31:24 UTC

Comment on attachment 27576 [details] [review]
patch to add support for projective textures

Well, apparently it *does* matter, or something else is amiss, because after further testing I managed to hardlock the machine and got a few related segfaults in the process. I will try to look into this as time permits.

Comment 18 Adam Jackson 2009-08-24 12:25:29 UTC

Mass version move, cvs -> git

Comment 19 chemtech 2013-03-15 08:02:54 UTC

Miroslav Šustek,
Do you still experience this issue with newer drivers ?
Please check the status of your issue.

Comment 20 Miroslav Šustek 2013-03-15 13:27:00 UTC

(In reply to comment #19)
> Miroslav Šustek,
> Do you still experience this issue with newer drivers ?
> Please check the status of your issue.

I'm sorry, I no longer have the HW to test it.

Comment 21 chemtech 2013-03-26 07:55:50 UTC

Please check the status of your issue.
Or close this bug.

Comment 22 Miroslav Šustek 2013-03-28 08:21:14 UTC

Guys, thank you for all your work here. I also spent good times hacking r128 drivers.
Unfortunately, I gave the video card away four years ago, so I can't participate in this bug anymore.
Closing. *** drying nostalgic tear ***
