Bug 16118 - [965] hang waiting for vblank in LOCK_HARDWARE when staring compiz
Summary: [965] hang waiting for vblank in LOCK_HARDWARE when staring compiz
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Jesse Barnes
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-05-27 10:41 UTC by Ben Gamari
Modified: 2008-07-22 09:41 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Backtrace of frozen X server (2.19 KB, text/plain)
2008-05-27 10:41 UTC, Ben Gamari
no flags Details
Xorg startup log (56.47 KB, text/plain)
2008-05-27 21:17 UTC, Ben Gamari
no flags Details
GDB log of drm with vblank fixes (3.32 KB, text/plain)
2008-06-06 12:31 UTC, Ben Gamari
no flags Details
don't wait for non-existent page flips (591 bytes, patch)
2008-06-20 14:32 UTC, Jesse Barnes
no flags Details | Splinter Review
Enable page flip & buffer swap code on 965 (4.11 KB, patch)
2008-07-21 16:48 UTC, Jesse Barnes
no flags Details | Splinter Review

Description Ben Gamari 2008-05-27 10:41:01 UTC
Created attachment 16768 [details]
Backtrace of frozen X server

On mesa/xserver/drm git, starting compiz locks up the xserver in drmWaitVBlank. Can bring up a new xserver without reboot after remotely killing frozen server. Backtrace attached.
Comment 1 Gordon Jin 2008-05-27 18:56:38 UTC
What's the chipset model (915/965?) Please attach Xorg.0.log.
Comment 2 Ben Gamari 2008-05-27 20:04:29 UTC
It's a 965. Log will be coming soon.
Comment 3 Ben Gamari 2008-05-27 21:17:18 UTC
Created attachment 16777 [details]
Xorg startup log
Comment 4 Michel Dänzer 2008-05-28 00:22:57 UTC
Does it come up if you tell gdb to 'finish' ioctl() and then to 'return ret' from drmWaitVBlank()? If so, this is probably a DRM vblank-rework quirk. I'm working on some fixes for those.
Comment 5 Ben Gamari 2008-05-28 00:40:55 UTC
Yep, that looks like the issue. Comes out fine after that.
Comment 6 Ben Gamari 2008-05-28 00:58:25 UTC
Moreover, good news. The driver runs compiz for a while when the workaround in Comment #9 of Bug #14441 is applied to xf86-video-intel. Unfortunately, it did crash after a few minutes without leaving a stack trace. I'll look into it tomorrow.

(In reply to comment #5)
> Yep, that looks like the issue. Comes out fine after that.
> 

Comment 7 Jesse Barnes 2008-06-06 11:52:02 UTC
Ben, does this startup hang still occur if you use the latest DRM modules that include Michel's vblank fixes?
Comment 8 Ben Gamari 2008-06-06 12:31:44 UTC
Created attachment 16972 [details]
GDB log of drm with vblank fixes
Comment 9 Ben Gamari 2008-06-06 12:33:30 UTC
Yep, looks like the same bug occurs with 6905c7a29d2a3bc0e605a09b98ac02a4a50893d0. I just attached another gdb backtrace for good measure. Didn't look any different but I might have missed something.
Comment 10 Michel Dänzer 2008-06-09 02:42:32 UTC
xf86-video-intel may need to call the DRM_IOCTL_MODESET_CTL ioctl appropriately.
Comment 11 Gordon Jin 2008-06-19 00:43:29 UTC
Ben, how about using the latest mesa/xserver/drm git?
Comment 12 Ben Gamari 2008-06-19 08:20:02 UTC
Nope, just tried latest git mesa (cf29ab3ba075905cca), xserver (3c5a8200111a), and drm (00f549bd5f40) and no dice. Same symptoms so I would assume it's the same issue although I didn't have a chance to throw it on a remote debugger. I can do that later today to verify if necessary.


(In reply to comment #11)
> Ben, how about using the latest mesa/xserver/drm git?
> 

Comment 13 Jesse Barnes 2008-06-20 14:32:36 UTC
Created attachment 17261 [details] [review]
don't wait for non-existent page flips

Ben, can you give this Mesa patch a try?  It turns out the 965 Mesa driver was waiting for a bogus vblank event, because there's some 915 state we try to use that we shouldn't.
Comment 14 Ben Gamari 2008-06-21 13:11:36 UTC
Seems to work. I think that did it.

(In reply to comment #13)
> Created an attachment (id=17261) [details]
> don't wait for non-existent page flips
> 
> Ben, can you give this Mesa patch a try?  It turns out the 965 Mesa driver was
> waiting for a bogus vblank event, because there's some 915 state we try to use
> that we shouldn't.
> 

Comment 15 Jesse Barnes 2008-06-23 10:45:54 UTC
Ok, at least that works around the issue for you.  Still working with Michel on the correct fix though...
Comment 16 Steven Newbury 2008-07-07 17:17:23 UTC
(In reply to comment #15)
> Ok, at least that works around the issue for you.  Still working with Michel on
> the correct fix though...
> 

This is still not working for me, without the above patch any GL programs lock up with sync to vblank turned on.  With the patch I get no lock up but at when tested with glxgears shows a still image which flickers to black about once per second. If I resize the window you can see old buffer images until the gears resizes to fill the window and again you get a still image or a black window with an occasional flicker of the gears depending on the size of the window.  It's like the buffer isn't getting swapped.
Comment 17 Steven Newbury 2008-07-07 17:25:41 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > Ok, at least that works around the issue for you.  Still working with Michel on
> > the correct fix though...
> > 
> 
> This is still not working for me, without the above patch any GL programs lock
> up with sync to vblank turned on.  With the patch I get no lock up but at when
> tested with glxgears shows a still image which flickers to black about once per
> second. If I resize the window you can see old buffer images until the gears
> resizes to fill the window and again you get a still image or a black window
> with an occasional flicker of the gears depending on the size of the window. 
> It's like the buffer isn't getting swapped.
> 
I'm getting a lot of these in my syslog:
[drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0

Could it be waiting for an interrupt on the wrong pipe?
Comment 18 Jesse Barnes 2008-07-07 17:56:38 UTC
Could be, when I dug deeper I found quite a few other problems.  I'm going to try to get the fixes pushed this week.  In the meantime, you can check out the patches: i915-965-pageflip-support.patch, i830-page-flip-vblank.patch      mesa-965-pageflip.patch, from this thread on dri-devel:
http://sourceforge.net/mailarchive/message.php?msg_name=200806231647.14827.jbarnes%40virtuousgeek.org
(it's in the message from the 23rd).

Comment 19 Michel Dänzer 2008-07-08 02:28:12 UTC
I'm not sure any of those patches directly address this problem. The wait for vblank in question is related to scheduled buffer swaps, so I think either the code for supporting those needs to be implemented/completed/fixed for 965, or this wait (along with any existing related code) needs to be disabled for 965. (When I wrote this code for the i915 driver, it was still separate from i965)
Comment 20 Michel Dänzer 2008-07-08 02:30:41 UTC
Forgot to mention I removed myself from CC: as I read the xorg-team list.
Comment 21 Jesse Barnes 2008-07-08 08:31:03 UTC
Right, that's what those patches do.  The one attached to this bug effectively removes the wait for swap on 965 (though it could be done better), the others (on the mailing list) actually implement it for 965, which would be preferred.
Comment 22 Michel Dänzer 2008-07-08 09:01:13 UTC
'It' being page flipping? My point is that this bug isn't directly related to that but to scheduled buffer swaps. Unless I'm missing something, none of the patches discussed here address those.
Comment 23 Jesse Barnes 2008-07-08 09:49:00 UTC
Arg, now you're going to make me look at all this again...  :)  IIRC when I did the patchset to dri-devel I found that all of this code was interrelated; this particular bug is the result of LOCK_HARDWARE waiting on a 0 vblank count, since vbl_pending was never updated.

The render buffer vbl_pending fields that LOCK_HARDWARE depends on get updated in intelWindowMoved, but only if page flipping is active (pf_num_pages is nonzero), so it's an important field for page flipping.

It's also updated in intelScheduleSwap, but it looks like only if the DRM_I915_VBLANK_SWAP ioctl fails, and of course assuming we don't return from intelScheduleSwap early, but there's a problem there, since intelScheduleSwap wants to do LOCK_HARDWARE.

To fix plain buffer swapping, maybe we just need to make intelWindowMoved set the vbl_pending fields in the non page flipping case too.  So even though my patch set fixes the buffer swap case also, it leaves page flipping & buffer swap conflated, which is probably a bad thing...  But this is your code, I could very well be missing some other aspect of this behavior; it's pretty subtle and doesn't seem to be commented anywhere.
Comment 24 Michel Dänzer 2008-07-09 00:29:11 UTC
(In reply to comment #23)
> The render buffer vbl_pending fields that LOCK_HARDWARE depends on get updated
> in intelWindowMoved, but only if page flipping is active (pf_num_pages is
> nonzero), so it's an important field for page flipping.

Why do you say 'only if page flipping is active'? The only conditions I see are

   if (!intel->intelScreen->driScrnPriv->dri2.enabled &&
       intel->intelScreen->driScrnPriv->ddx_version.minor >= 7) {

and

      if (flags != dPriv->vblFlags && dPriv->vblFlags &&
          !(dPriv->vblFlags & VBLANK_FLAG_NO_IRQ)) {

One problem might be that the latter test isn't satisfied initially, so the field doesn't get initialized?

> It's also updated in intelScheduleSwap, but it looks like only if the
> DRM_I915_VBLANK_SWAP ioctl fails, 

The other way around - it's only necessary to wait for a swap when one has actually been scheduled.

> and of course assuming we don't return from intelScheduleSwap early, but
> there's a problem there, since intelScheduleSwap wants to do LOCK_HARDWARE.

Why is that a problem? Anyway, it only returns early if it's not possible/necessary to schedule a swap.

> So even though my patch set fixes the buffer swap case also, it leaves page
> flipping & buffer swap conflated, which is probably a bad thing...  

Yes, it merely avoids the problem because page flipping happens to be disabled. But the two concepts aren't directly related (other than flips can be scheduled as well).

> But this is your code, I could very well be missing some other aspect of this
> behavior; it's pretty subtle and doesn't seem to be commented anywhere.

Yeah, sorry about that. Hope this clears up some things, or I'm happy to answer more questions.
Comment 25 Jesse Barnes 2008-07-09 11:52:13 UTC
On Wednesday, July 9, 2008 12:29:12 am bugzilla-daemon@freedesktop.org wrote:
> > The render buffer vbl_pending fields that LOCK_HARDWARE depends on get
> > updated in intelWindowMoved, but only if page flipping is active
> > (pf_num_pages is nonzero), so it's an important field for page flipping.
>
> Why do you say 'only if page flipping is active'? The only conditions I see
> are
>
>    if (!intel->intelScreen->driScrnPriv->dri2.enabled &&
>        intel->intelScreen->driScrnPriv->ddx_version.minor >= 7) {
>
> and
>
>       if (flags != dPriv->vblFlags && dPriv->vblFlags &&
>           !(dPriv->vblFlags & VBLANK_FLAG_NO_IRQ)) {

Yeah, that's the general code, but the specific update for vbl_pending is:

         for (i = 0; i < intel_fb->pf_num_pages; i++) {
            if (intel_fb->color_rb[i])
               intel_fb->color_rb[i]->vbl_pending = intel_fb->vbl_waited;
         }

> One problem might be that the latter test isn't satisfied initially, so the
> field doesn't get initialized?

I think it's just that no one ever tested/implemented this stuff on 965, so 
while some of the shared stuff has been fixed up, most of it was really i915 
code (or something, I haven't gone through the history in any detail).

> > It's also updated in intelScheduleSwap, but it looks like only if the
> > DRM_I915_VBLANK_SWAP ioctl fails,
>
> The other way around - it's only necessary to wait for a swap when one has
> actually been scheduled.

Oh duh, right.

> > and of course assuming we don't return from intelScheduleSwap early, but
> > there's a problem there, since intelScheduleSwap wants to do
> > LOCK_HARDWARE.
>
> Why is that a problem? Anyway, it only returns early if it's not
> possible/necessary to schedule a swap.

I'll test again on my 965GM, iirc I wasn't even getting to the buffer swap 
schedule ioctl in some cases...

> > So even though my patch set fixes the buffer swap case also, it leaves
> > page flipping & buffer swap conflated, which is probably a bad thing...
>
> Yes, it merely avoids the problem because page flipping happens to be
> disabled. But the two concepts aren't directly related (other than flips
> can be scheduled as well).
>
> > But this is your code, I could very well be missing some other aspect of
> > this behavior; it's pretty subtle and doesn't seem to be commented
> > anywhere.
>
> Yeah, sorry about that. Hope this clears up some things, or I'm happy to
> answer more questions.

Yeah, thanks.  As I fix things on 965 I'll try to comment the various methods 
(or at least the fields involved), since eventually I'd like to just use 
scheduled swaps or page flips by default...

Thanks,
Jesse
Comment 26 Michel Dänzer 2008-07-10 00:21:22 UTC
(In reply to comment #25)
> [...] the specific update for vbl_pending is:
> 
>          for (i = 0; i < intel_fb->pf_num_pages; i++) {
>             if (intel_fb->color_rb[i])
>                intel_fb->color_rb[i]->vbl_pending = intel_fb->vbl_waited;
>          }

Ah. The intention is for pf_num_pages to be 2 (or 3 with triple buffering enabled) regardless of whether page flipping is enabled. See intelUpdatePageFlipping().


> > > It's also updated in intelScheduleSwap, but it looks like only if the
> > > DRM_I915_VBLANK_SWAP ioctl fails,
> >
> > The other way around - it's only necessary to wait for a swap when one has
> > actually been scheduled.
> 
> Oh duh, right.

That said, it might make sense to set vbl_pending = vbl_waited if intelScheduleSwap() returns GL_FALSE, to prevent them from diverging too much, eventually resulting in wait for vblank ioctl timeouts.


> > > and of course assuming we don't return from intelScheduleSwap early, but
> > > there's a problem there, since intelScheduleSwap wants to do
> > > LOCK_HARDWARE.
> >
> > Why is that a problem? Anyway, it only returns early if it's not
> > possible/necessary to schedule a swap.
> 
> I'll test again on my 965GM, iirc I wasn't even getting to the buffer swap 
> schedule ioctl in some cases...

Looking at intelScheduleSwap(), that should only happen if sync to vblank is effectively disabled.
Comment 27 Jesse Barnes 2008-07-21 16:48:15 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > [...] the specific update for vbl_pending is:
> > 
> >          for (i = 0; i < intel_fb->pf_num_pages; i++) {
> >             if (intel_fb->color_rb[i])
> >                intel_fb->color_rb[i]->vbl_pending = intel_fb->vbl_waited;
> >          }
> 
> Ah. The intention is for pf_num_pages to be 2 (or 3 with triple buffering
> enabled) regardless of whether page flipping is enabled. See
> intelUpdatePageFlipping().

Yeah, I think that's the real root cause here.  Somehow when this code was unified the actual initialization of pf_num_pages got lost, which means we never update vbl_pending which means we hang in LOCK_HARDWARE.

> > > > It's also updated in intelScheduleSwap, but it looks like only if the
> > > > DRM_I915_VBLANK_SWAP ioctl fails,
> > >
> > > The other way around - it's only necessary to wait for a swap when one has
> > > actually been scheduled.
> > 
> > Oh duh, right.
> 
> That said, it might make sense to set vbl_pending = vbl_waited if
> intelScheduleSwap() returns GL_FALSE, to prevent them from diverging too much,
> eventually resulting in wait for vblank ioctl timeouts.

Please see the attached patch.  Along with the recent fixes to the DRM, it makes my 965 happy again with vblank_mode=2 or 3.

If you think it makes sense I could also rename pf_num_pages to num_render_buffers or something and set it outside of intelUpdatePageFlipping() to split the buffer swap & page flip code a bit better...

Thanks,
Jesse
Comment 28 Jesse Barnes 2008-07-21 16:48:58 UTC
Created attachment 17810 [details] [review]
Enable page flip & buffer swap code on 965
Comment 29 Michel Dänzer 2008-07-21 23:58:14 UTC
(In reply to comment #27)
> 
> If you think it makes sense I could also rename pf_num_pages to
> num_render_buffers or something and set it outside of intelUpdatePageFlipping()
> to split the buffer swap & page flip code a bit better...

I don't think it matters what I think there. :)

The patch looks good, except for one thing:

>  	    driWaitForVBlank(dPriv, &missed_target);
> +	    dPriv->vblSeq = missed_target;

missed_target is a boolean which indicates whether the target sequence was missed. driWaitForVBlank() already updated dPriv->vblSeq to the current sequence number.
Comment 30 Jesse Barnes 2008-07-22 09:34:45 UTC
> --- Comment #29 from Michel Dänzer <michel@tungstengraphics.com> 
> 2008-07-21 23:58:14 PST --- (In reply to comment #27)
>
> > If you think it makes sense I could also rename pf_num_pages to
> > num_render_buffers or something and set it outside of
> > intelUpdatePageFlipping() to split the buffer swap & page flip code a bit
> > better...
>
> I don't think it matters what I think there. :)

Hey, I value your opinion! :)

> The patch looks good, except for one thing:
> >       driWaitForVBlank(dPriv, &missed_target);
> > +     dPriv->vblSeq = missed_target;
>
> missed_target is a boolean which indicates whether the target sequence was
> missed. driWaitForVBlank() already updated dPriv->vblSeq to the current
> sequence number.

Ah ok, I'll drop that bit then.  Thanks.

Comment 31 Jesse Barnes 2008-07-22 09:41:58 UTC
Fixed in 97988ccc46c0248177cd71937021ca8cc2a7452b.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.