Bug 86690 - glmark2-es2-wayland shortly freezes on some frames with egl_dri2 backend (Nouveau/GK20A)
Summary: glmark2-es2-wayland shortly freezes on some frames with egl_dri2 backend (Nouveau/GK20A)
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: git
Hardware: ARM Linux (All)
Importance: medium normal
Assignee: Alexandre Courbot
QA Contact: Alexandre Courbot
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-25 08:02 UTC by Alexandre Courbot
Modified: 2014-12-04 07:46 UTC
CC List: 4 users

See Also:


Attachments
Trace when running glmark2 with WAYLAND_DEBUG=client (532.04 KB, text/plain)
2014-11-25 08:14 UTC, Alexandre Courbot
Output of "strace -tt weston-simple-egl" (961.57 KB, text/plain)
2014-11-28 09:23 UTC, Alexandre Courbot

Description Alexandre Courbot 2014-11-25 08:02:09 UTC
When using the egl_dri2 driver, glmark2 will sometimes keep displaying the same frame for up to half a second before resuming normally. This also affects the reported frame rate, which drops dramatically with each occurrence.

This only seems to happen for applications that set eglSwapInterval to 0 in order to exceed the monitor refresh rate. With applications that do not set eglSwapInterval (like weston-simple-egl), or with the (recently removed) egl_gallium driver, which also does not allow more than 60 fps, the issue is not visible.
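
For reference, the uncapped mode is requested through the standard EGL call sketched below. This is a minimal illustration assuming an already initialized EGLDisplay, not glmark2's actual code:

#include <EGL/egl.h>

/* Ask the EGL implementation not to wait for vblank between swaps,
 * so eglSwapBuffers() returns as fast as frames can be produced. */
static void uncap_framerate(EGLDisplay dpy)
{
    eglSwapInterval(dpy, 0);
}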

Relevant comments from Pekka Paalanen when discussing this on the mailing list:

"I have a hunch (wl_buffer.release not delivered in time, and client
side EGL running out of available buffers), but confirming that would
require a Wayland protocol dump up to such hickup. You could try to get
one by setting the enviroment variable WAYLAND_DEBUG=client for
glmark2. It will be a flood, especially if glmark2 succeeds in running
at uncapped framerates. The trace will come to stderr, so you want to
redirect that to file. You need to find out where in the trace the
hickup happened. The timestamps are in milliseconds. I could then take
a look (will need the whole trace)."
Comment 1 Alexandre Courbot 2014-11-25 08:14:23 UTC
Created attachment 109990 [details]
Trace when running glmark2 with WAYLAND_DEBUG=client

Attached is the trace suggested by Pekka. The stuttering was visible almost immediately after the benchmark started, and occurred regularly until the end of the trace.
Comment 2 Pekka Paalanen 2014-11-25 12:45:38 UTC
(In reply to Alexandre Courbot from comment #1)
> Created attachment 109990 [details]
> Trace when running glmark2 with WAYLAND_DEBUG=client
> 
> Attached is the trace suggested by Pekka. The stuttering was visible almost
> immediately after the benchmark started, and occurred regularly until the
> end of the trace.

Ok, here's a piece of what I detected as delays > 50ms in the protocol dump (timestamp, skip in ms):
2007458 55
2007628 56
2008779 62
2008926 142
2009104 128
2009638 529
2009733 84
2012659 105
2012769 62
2012838 64

It looks quite random in both occurrence and length. I checked the first six, and the delay always happens between the damage and commit requests. Due to how libwayland-client prints these, these timestamps are from when Mesa calls the functions wl_surface_damage and wl_surface_commit.

In src/egl/drivers/dri2/platform_wayland.c (egl_dri2), in dri2_wl_swap_buffers_with_damage(), you can see these function calls. The things between them are a flush and an invalidate. One of these probably causes the stall.
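
For illustration, the request ordering looks roughly like the sketch below (names and structure simplified; the real dri2_wl_swap_buffers_with_damage() in Mesa also handles buffer management, frame callbacks, etc.):

#include <wayland-client.h>

/* Simplified sketch of the attach/damage -> flush -> commit sequence. */
static void swap_buffers_sketch(struct wl_surface *surface,
                                struct wl_buffer *back_buffer,
                                int width, int height)
{
    wl_surface_attach(surface, back_buffer, 0, 0);
    wl_surface_damage(surface, 0, 0, width, height);

    /* egl_dri2 flushes and invalidates the DRI drawable roughly here;
     * the protocol timestamps show the stall falls in this gap. */

    wl_surface_commit(surface);
}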

If we look at egl_gallium in src/gallium/state_trackers/egl/wayland/native_wayland.c, well, the code is different. Swapbuffers and present seem to be two different hooks.

I'm not familiar enough with this code to say what the actual difference is. Maybe egl_dri2 allows queueing many frames' worth of drawing to the GPU, and at some point the driver simply has to wait for it all to drain? And egl_gallium does something slightly different?

Oh right, you said egl_gallium didn't respect swapinterval=0 but stayed vsynced. That explains why the drawing never accumulates. Looks like wayland_surface_swap_buffers() always waits for the frame callback, which causes exactly this throttling to the vsync rate.
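
For reference, frame-callback throttling generally has the shape sketched below (a simplified illustration, not the actual egl_gallium code): the client requests a frame callback when committing and blocks until the compositor fires it, which caps it at the compositor's repaint rate.

#include <stdbool.h>
#include <stddef.h>
#include <wayland-client.h>

static bool frame_pending;

static void frame_done(void *data, struct wl_callback *cb, uint32_t time)
{
    frame_pending = false;
    wl_callback_destroy(cb);
}

static const struct wl_callback_listener frame_listener = { frame_done };

static void throttled_commit(struct wl_display *display,
                             struct wl_surface *surface)
{
    struct wl_callback *cb = wl_surface_frame(surface);
    wl_callback_add_listener(cb, &frame_listener, NULL);
    frame_pending = true;
    wl_surface_commit(surface);

    /* Block until the compositor signals the frame callback. */
    while (frame_pending && wl_display_dispatch(display) != -1)
        ;
}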

I believe this would be a case of queueing too much GPU work and eventually having to wait for it to drain. I'm not sure how or where you would start fixing it. Buffer re-use could also be a factor.

I am fairly sure this is not a Weston bug. It seems to swap between the two wl_buffers just fine.
Comment 3 Alexandre Courbot 2014-11-28 09:23:23 UTC
Created attachment 110159 [details]
Output of "strace -tt weston-simple-egl"
Comment 4 Alexandre Courbot 2014-11-28 09:23:58 UTC
As discussed on IRC, it appears that running "perf top" at the same time as any Weston EGL client triggers the same frozen-frame behavior in said client, without affecting Weston itself or other non-EGL clients. "perf top" reports a high usage (~30% CPU time) of _raw_spin_unlock_irq() *only* when the condition occurs.

Very strange. I have run strace on the EGL client under both conditions (a non-capped EGL program, and an EGL program while "perf top" is running), and noticed that these freezes correspond to long consecutive series of calls to sched_yield(), e.g.:

17:42:56.337425 sched_yield()           = 0
17:42:56.337830 sched_yield()           = 0
...
17:42:57.713679 sched_yield()           = 0
17:42:57.714122 sched_yield()           = 0

sched_yield() is never called in the case of a well-behaving client.

Grepping through Mesa, I found one loop in nouveau_fence_wait() where sched_yield() is called. Removing this call does not fix the issue, but the calls to sched_yield() are no longer visible in the trace, which tends to confirm this is where the misbehavior is happening.
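
For illustration, the loop has roughly the shape sketched below (invented names, not the actual Mesa code): the CPU keeps re-reading the screen's fence counter and yields between polls until the GPU has written a value at least equal to the fence's sequence number.

#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

struct fake_screen { volatile uint32_t sequence_ack; }; /* updated by the GPU */
struct fake_fence  { uint32_t sequence; struct fake_screen *screen; };

static bool fake_fence_wait(struct fake_fence *fence)
{
    /* Spin until the acknowledged sequence catches up with this fence;
     * the sched_yield() here is what floods the strace log above. */
    while ((int32_t)(fence->screen->sequence_ack - fence->sequence) < 0)
        sched_yield();
    return true;
}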

The series of sched_yield() calls are always between the two following ioctls:

17:42:56.266742 ioctl(6, 0xc0406481, 0xbeeec788) = 0
17:42:56.268952 sched_yield()           = 0
17:42:56.269466 sched_yield()           = 0
...
17:42:56.281481 sched_yield()           = 0
17:42:56.281848 sched_yield()           = 0
17:42:56.282210 ioctl(6, 0xc0406481, 0xbeeec750) = 0

With the calls to sched_yield() removed, the strace log shows large delays between these ioctls, hinting again that nouveau_fence_wait() is busy-waiting in that loop:

17:56:00.248145 ioctl(6, 0xc0406481, 0xbecf2768) = 0
17:56:01.155312 ioctl(6, 0xc0406481, 0xbecf2730) = 0

What is really remarkable is that this is true for both cases: a non-capped EGL client, or a (capped or non-capped) EGL client while "perf top" is running. The same ioctls are involved, in the same pattern.

While this seems to make it clear that we are waiting for fences, I still cannot imagine how "perf top" can have such a strong influence on this.

So it seems that in both cases we are waiting for fences. I have attached a full strace of weston-simple-egl in case this could be helpful.
Comment 5 Pekka Paalanen 2014-11-28 10:10:14 UTC
Maarten Lankhorst suspects this might have something to do with
http://lists.freedesktop.org/archives/dri-devel/2014-November/072367.html
Comment 6 Alexandre Courbot 2014-12-01 09:01:36 UTC
I have tried reverting the following patches as suggested on http://lists.freedesktop.org/archives/dri-devel/2014-November/072367.html:

809e9447: "drm/nouveau: use shared fences for readable objects"
055dffdf: "drm/nouveau: bump driver patchlevel to 1.2.1"
e3be4c23: "drm/nouveau: specify if interruptible wait is desired in nouveau_fence_sync"
15a996bb: "drm/nouveau: assign fence_chan->name correctly"

Doing so helped with another (unrelated) issue I was having, but sadly did not change anything about this problem or its symptoms. It looks like we are dealing with something completely different, even though it is somehow fence-related.

Also FWIW, running weston-simple-egl with the -b option (eglSwapInterval 0) leads to the same behavior as glmark2.

I will keep investigating where/why the fences are blocking.
Comment 7 Alexandre Courbot 2014-12-03 03:48:39 UTC
Some more input about this oddity.

We really are waiting for fences here. During the freezes I can see the EGL client looping in nouveau_fence_wait(), and printing the fence's sequence along with the highest fence signaled by the screen (screen->fence.sequence_ack) shows that sequence_ack < fence->sequence, indicating the GPU has not signaled this fence yet.

I am starting to suspect a CPU scheduling issue here. Another very interesting phenomenon happens if I disable all but one CPU core: the issue becomes immediately visible (even worse than when "perf top" is running) for all Wayland EGL clients that set eglSwapInterval to 0. Wayland itself (or any DRM program, for that matter) is still not affected; neither are Wayland clients that do not touch eglSwapInterval, *until* they exit, at which point nouveau_fence_wait() again loops endlessly on a fence. One way to release that fence is to trigger a draw event in another client; then the screen's sequence_ack finally increases and the client goes through and exits. Doing the same in a client that set eglSwapInterval(0) allows it to draw a few frames, and then it blocks again.

I suspect that some event in Wayland remains stuck somewhere, which leads the EGL client to wait for a fence that Wayland fails to release when expected. Here is a gdb backtrace of weston-simple-egl while waiting for the fence:

#0  0xb6d9298c in sched_yield ()
#1  0xb6a448ce in nouveau_fence_wait ()
#2  0xb6a45480 in nouveau_screen_fence_finish ()
#3  0xb69a351e in dri_flush ()
#4  0xb6fc399e in dri2_wl_swap_buffers_with_damage ()
#5  0xb6fc0f66 in dri2_swap_buffers_with_damage ()
#6  0xb6fb700c in eglSwapBuffersWithDamageEXT ()
#7  0x0000aa4a in redraw (data=0xbefff9e4, callback=0x0, time=0)
#8  0x0000b254 in main (argc=2, argv=0xbefffc64)

I guess the next step for me is to find how to get some output about Wayland events as they happen and see which one remains stuck.
Comment 8 Michel Dänzer 2014-12-03 06:28:48 UTC
(In reply to Alexandre Courbot from comment #7)
> I guess the next step for me is to find how to get some output about Wayland
> events as they happen and see which one remains stuck.

FWIW, I don't think these fences have anything to do with Wayland events. A Gallium fence is created as the result of a flush and should signal when the flushed operations have finished. It sounds like something in the nouveau driver/winsys/kernel code is preventing that from happening.
Comment 9 Alexandre Courbot 2014-12-03 06:50:45 UTC
(been reminded we should have a drink one of these days! ;))

Mmm, since this happens *only* for Wayland EGL programs (e.g. no problem with DRM), I was hoping to see unexpected things happening at the Wayland level.

But what seems even stranger is that I think I can see the fence-release pushbuffers being queued *and* run by the GPU from the kernel side. So it may also be that the screen's fence value is properly updated, but the new value is not seen by user-space because of some memory incoherency.
Comment 10 Alexandre Courbot 2014-12-03 09:41:28 UTC
After further confirmation (checking the GET/PUT pointers to make sure the pushbuffers are executed, and directly peeking at the value of the screen fence counter through PRAMIN), it appears that the fence value is indeed updated by the GPU as expected, but user-space is reading an old value. Seems like our old friend, the ARM memory coherency issue, is back. :(

Moving this issue to Nouveau, since this is clearly a driver problem, and taking ownership of it.
Comment 11 Alexandre Courbot 2014-12-04 07:23:17 UTC
Finally figured this out and was able to come up with a workaround.

It was indeed an ARM memory coherency issue. The page containing the fence counter is allocated by TTM using page_alloc. As a result, the returned page has a high chance of coming from the lowmem region, which is identity-mapped into the kernel's virtual address space with caching enabled.

The problem is that we then try to re-map this page uncached into user-space so Mesa can properly manage fences. On ARM, all memory mappings of a given page must share the same cache settings, or "unspecified behavior" can occur. And unspecified behavior we had: the user-space mapping would perform as expected most of the time, but under certain conditions (high GPU load, a single processor, or "perf top" running) it would behave as a cached mapping, returning an old value and thus making Mesa wait.
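
To illustrate the aliasing (a conceptual sketch only, not the actual TTM/nouveau code): the page already has a cacheable kernel lowmem mapping, yet the same physical page is handed to user-space with non-cached attributes, and ARM makes no guarantee about mixing the two.

#include <linux/mm.h>

/* Conceptual sketch: map an already lowmem-mapped (cacheable) page into
 * user-space with non-cached attributes -- two mappings of the same page
 * with different memory types, which is exactly the problematic case. */
static int map_fence_page_uncached(struct vm_area_struct *vma,
                                   struct page *page)
{
	unsigned long pfn = page_to_pfn(page);

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
			       vma->vm_page_prot);
}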

The temporary workaround is to force all GPU objects allocated with the NOUVEAU_GEM_DOMAIN_MAPPABLE flag to use the TTM-DMA allocator, which uses the DMA API instead of page_alloc and (supposedly) removes the identity mapping of the allocated pages, letting us re-map them as we want. The drawback is that this makes us more dependent on the DMA API, which we wanted to avoid relying on too much.
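
For comparison, the DMA API path amounts to something like the sketch below (a hedged illustration only; the device pointer, size and actual TTM-DMA plumbing are placeholders): dma_alloc_coherent() returns memory whose kernel mapping already has coherency-friendly attributes, so the matching user-space mapping no longer aliases a cached one.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Conceptual sketch: back the fence page with DMA-API memory instead of
 * a plain page_alloc page, so kernel and user-space mappings agree. */
static void *alloc_fence_backing(struct device *dev, dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(dev, PAGE_SIZE, dma_handle, GFP_KERNEL);
}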

I need to think of a proper fix, but I am marking this issue as resolved since the root cause is understood.
Comment 12 Alexandre Courbot 2014-12-04 07:46:37 UTC
(In reply to Alexandre Courbot from comment #11)
> The temporary workaround is to force all GPU objects allocated with the
> NOUVEAU_GEM_DOMAIN_MAPPABLE flag to use the TTM-DMA allocator, which uses
> the DMA API instead of page_alloc and (supposedly) removes the identity
> mapping of the allocated pages, letting us re-map them as we want.

This bit is actually not correct: the DMA API can remap its allocated pages to be CPU-coherent, allowing us to perform the same mapping for user-space, but in doing so it still limits which mappings we can perform (i.e. only coherent mappings).

