Bug 12902 - VBlank related server hang in indirect rendering mode
Summary: VBlank related server hang in indirect rendering mode
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/Radeon (show other bugs)
Version: git
Hardware: Other Linux (All)
: medium normal
Assignee: xf86-video-ati maintainers
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords: have-backtrace
Depends on:
Blocks:
 
Reported: 2007-10-23 15:28 UTC by Mathieu Bérard
Modified: 2008-10-07 02:24 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
glxgears running and killing backtrace (5.03 KB, text/plain)
2007-10-23 15:34 UTC, Mathieu Bérard
no flags Details
not responding server backtrace (1.21 KB, text/plain)
2007-10-23 15:56 UTC, Mathieu Bérard
no flags Details
workaround patch (947 bytes, patch)
2007-10-23 16:06 UTC, Mathieu Bérard
no flags Details | Splinter Review
fixed patch (934 bytes, patch)
2007-10-23 16:12 UTC, Mathieu Bérard
no flags Details | Splinter Review
Fix for interrupted vblank wait (959 bytes, patch)
2008-02-28 09:16 UTC, Michel Dänzer
no flags Details | Splinter Review

Description Mathieu Bérard 2007-10-23 15:28:30 UTC
Hi,
I'm running git head (24/10/2007) of drm,mesa and xorg with a r300 hardware

I have found a 100% reproductible hang condition:
- launch LIBGL_ALWAYS_INDIRECT=1 glxgears
- hide the glxgears window behind another one
The X server then stop responding

I have observed that the xorg server send enable VBlank interrupts command in response to various events such as 
- at least 1 3D apps is launched
- de-hiding a previously hidden 3D app window
and send disable VBlank interrupts command when
- exiting the last running 3D apps
- all the running 3D apps windows are hidden

The 3D apps seems to be unaware of that and continue to issue wait for VBlank commands whatever the interrupts status (enabled or disabled) is.
The problem is that in AIGLX mode it's the xorg server process itself which try to wait for VBlank on behalf of the 3D app.
I may then tries to wait for a disabled interrupts.

I have collected a collection of backtrace which shows that when stuck the X server process is in fact spinning in the kernel waiting for VBlank, adding some
debug printk's in radeon_irq.c shows that the corresponding DRM_WAIT_FOR calls are then always interrupted by the arrival of a signal, most often in the firt iteration of the DRM_WAIT_FOR loop.

In the backtrace which is obtained placing breakpoints in radeon DDX driver RADEONDRITransitionTo3d, RADEONDRITransitionTo2d and libdrm drmWaitVBlank one can see that the firts call to drmWaitVBlank is even issued before RADEONDRITransitionTo3d...

I also make a quick and dirty hack that workaround the hang: the idea here is to   temporarily enable VBlank interrupts whenever we are about to try to wait for it.
Comment 1 Mathieu Bérard 2007-10-23 15:34:38 UTC
Created attachment 12165 [details]
glxgears running and killing backtrace

Backtrace obtained when launching indirect rendering glxgears and the killing it
Comment 2 Mathieu Bérard 2007-10-23 15:56:40 UTC
Created attachment 12166 [details]
not responding server backtrace

Backtrace of the Xorg process that is not responding anymore after launching an indirect rendering glxgears and hiding its window behind another one
Comment 3 Mathieu Bérard 2007-10-23 16:06:09 UTC
Created attachment 12167 [details] [review]
workaround patch

dirty hack that sucessfully prevent the X server from hanging
Comment 4 Mathieu Bérard 2007-10-23 16:09:24 UTC
Comment on attachment 12167 [details] [review]
workaround patch

>diff --git a/shared-core/radeon_irq.c b/shared-core/radeon_irq.c
>index 1ece639..6875a62 100644
>--- a/shared-core/radeon_irq.c
>+++ b/shared-core/radeon_irq.c
>@@ -153,12 +153,21 @@ static int radeon_driver_vblank_do_wait(struct drm_device * dev,
> 	unsigned int cur_vblank;
> 	int ret = 0;
> 	int ack = 0;
>+	int fixup = 0;
>+	int saved_crtc;
> 	atomic_t *counter;
> 	if (!dev_priv) {
> 		DRM_ERROR("%s called with no initialization\n", __FUNCTION__);
> 		return -EINVAL;
> 	}
> 
>+	if (!(radeon_vblank_crtc_get(dev) == crtc)) {
>+		saved_crtc = crtc;
>+		fixup = 1;
>+		radeon_vblank_crtc_set(dev, crtc);
>+		DRM_INFO("Fixup mode !\n");
>+	}
>+	
> 	if (crtc == DRM_RADEON_VBLANK_CRTC1) {
> 		counter = &dev->vbl_received;
> 		ack |= RADEON_CRTC_VBLANK_STAT;
>@@ -182,6 +191,9 @@ static int radeon_driver_vblank_do_wait(struct drm_device * dev,
> 
> 	*sequence = cur_vblank;
> 
>+	if (fixup == 1)
>+		radeon_vblank_crtc_set(dev, saved_crtc);
>+
> 	return ret;
> }
>
Comment 5 Mathieu Bérard 2007-10-23 16:12:49 UTC
Created attachment 12168 [details] [review]
fixed patch

Previous patch contained a leftover "return 0" from one of my many experiments,
this one should be fixed
Comment 6 Michel Dänzer 2007-10-24 03:02:04 UTC
I think we should simply remove the radeon DDX code to disable the vblank interrupts, finish up the drm vblank-rework branch and merge it. Dave, what do you think?
Comment 7 Mathieu Bérard 2007-11-04 16:35:01 UTC
(In reply to comment #6)
I think that this commit is also very interesting:
http://gitweb.freedesktop.org/?p=mesa/drm.git;a=commit;h=c06808fb6521822238bca4574758f30246b71c2d

As it enforce a correct timeout when a signal is received for some ioctls. 
I think this is another aspect of the hang I observed: if the X server continuously 
receive SIGIO while sleeping in drmWaitVBlank then it may spin in the 
"while (ret && errno == EINTR)" loop forever. 
In fact I have tried to hack drmWaitVBlank to incorporate the same kind of fix 
as for the mentioned patch, and it successfully prevented the X server hang.
(I can provide a patch if desired)

As a sidenote, that modification add at least 2 calls of gettimeofday for each
ioctls, I don't know if those are considered hotpaths but that might be suboptimal.
I think that those can be avoided by handling this issue in the kernel: just
put the task in a uninterruptible sleep state, for exemple by creating an
uninterruptible version of DRM_WAIT_FOR and we can forgot this signal issue.
That may not be valid/possile for all the ioctls, but I think that can work at
leat for drmWaitVBlank.
Comment 8 Michel Dänzer 2007-11-05 02:30:45 UTC
(In reply to comment #7)
> I think this is another aspect of the hang I observed: if the X server
> continuously 
> receive SIGIO while sleeping in drmWaitVBlank then it may spin in the 
> "while (ret && errno == EINTR)" loop forever. 

No, I think that should be handled correctly, the vblank ioctl updates the arguments for this before returning to userspace.

> In fact I have tried to hack drmWaitVBlank to incorporate the same kind of fix 
> as for the mentioned patch, and it successfully prevented the X server hang.
> (I can provide a patch if desired)

Feel free.

> I think that those can be avoided by handling this issue in the kernel: just
> put the task in a uninterruptible sleep state, for exemple by creating an
> uninterruptible version of DRM_WAIT_FOR and we can forgot this signal issue.
> That may not be valid/possile for all the ioctls, but I think that can work at
> leat for drmWaitVBlank.

Handling SIGIO ASAP is critical for input processing, in particular for smooth mouse movement, so this would probably be a bad idea.
Comment 9 Michel Dänzer 2008-02-28 03:14:09 UTC
I think I may be seeing this now - happens sometimes (but not reliably, unfortunately) running compiz and exiting fullscreen games such as xmoto or after running e.g. glxgears when invoking a compiz effect that involves full buffer swaps, such as the alt-tab switcher.

I don't understand why DRM_WAIT_ON seems to constantly get interrupted by a signal before it hits the three second timeout. Also, I can't seem to reproduce this using the DRM from the 2.6.24 kernel, so it seems to be a DRM regression that I'm trying to bisect, but it's slow progress.

Is this in line with your observations, and/or have you made any new ones?
Comment 10 Mathieu Bérard 2008-02-28 05:43:08 UTC
Well, I think that there might be some mismatch between the ddx driver and
the drm code when the vblank-rework branch was merged.
ddx driver's RADEONDRITransitionTo2d/3d still enable/disable VBlank interrupts right now, If I understood correctly that's not correct anymore with vblank-rework. On the other hand drm from 2.6.24 kernel does not have vblank-rework.

Is there any way for the ddx to know if the drm layer has been vblank-reworked ?
Maybe some version minor number in the drm sould have been incremented ...

I don't know if that's directly related to this particular bug, just some random
thoughts... 

I will try to do some tests to see if I can still reproduce the bug.


 
Comment 11 Michel Dänzer 2008-02-28 05:59:52 UTC
(In reply to comment #10)
> I don't know if that's directly related to this particular bug, [...]

Apparently not, as I have already disabled that code in my xf86-video-ati tree. Sorry, I forgot to mention that before.

So far it's looking like this is indeed related to the vblank-rework changes though - or did you see the problem without them? - but I still don't understand how any DRM changes could have an impact on the return value of signal_pending()...
Comment 12 Mathieu Bérard 2008-02-28 06:44:47 UTC
(In reply to comment #11)
> [...] as I have already disabled that code in my xf86-video-ati tree.
> Sorry, I forgot to mention that before.
> 

Yes I did that too at one point to do some test after vblank-rework got merged.
But I've dropped that after a while.

> So far it's looking like this is indeed related to the vblank-rework changes
> though - or did you see the problem without them?

When I first reported that bug, it was without vblank-rework (and without compiz) so at one point yes.
I need to redo some test to be sure how it behave now with and without vblank-rework and/or compiz.

> but I still don't
> understand how any DRM changes could have an impact on the return value of
> signal_pending()...
> 

Neither do I. And I do not understand ethier why drmWaitVBlank clear the DRM_VBLANK_RELATIVE flag when retrying after being interrupted by a signal ?

Comment 13 Michel Dänzer 2008-02-28 09:08:07 UTC
(In reply to comment #12)
> Neither do I. And I do not understand ethier why drmWaitVBlank clear the
> DRM_VBLANK_RELATIVE flag when retrying after being interrupted by a signal ?

It's just a workaround for older DRMs that didn't do it yet. drm_wait_vblank() adds the current sequence number to the relative wait period and treats it the same way as an absolute wait then. So if the relative flag wasn't cleared, interrupted relative waits would add the current sequence number again and again.

Now that you mention this though, I notice there does seem to be a bug related to interrupted waits. I'll attach a patch as I can't test it right now, it'll be interesting whether it helps with this problem.
Comment 14 Michel Dänzer 2008-02-28 09:16:55 UTC
Created attachment 14651 [details] [review]
Fix for interrupted vblank wait
Comment 15 Michel Dänzer 2008-10-07 02:24:22 UTC
This should be fixed with current Git, please reopen if you're still seeing the same problem.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.