27139 – html5 ogg video playback in firefox unusable

Summary: html5 ogg video playback in firefox unusable

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/Radeon
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	xf86-video-ati maintainers
QA Contact:	Xorg Project Team

Reported:	2010-03-17 14:52 UTC by Török Edwin
Modified:	2012-04-12 15:04 UTC
CC List:	3 users

Attachments
Xorg.0.log and dmesg (142.89 KB, application/octet-stream) 2010-03-17 14:52 UTC, Török Edwin
Avoid CPU reads of VRAM (9.90 KB, patch) 2010-08-16 07:34 UTC, Karl Tomlinson
[PATCH 1/6] DownloadFromScreenCS: download via a scratch BO if pixmap domain is unknown (1.47 KB, patch) 2010-08-26 15:10 UTC, Karl Tomlinson
[PATCH 2/6] FinishAccess_CS: set bo_mapped to FALSE on unmap (953 bytes, patch) 2010-08-26 15:16 UTC, Karl Tomlinson
[PATCH 3/6] RADEONDownloadFromScreenCS: flush CS writes before mapping BO for read (1.84 KB, patch) 2010-08-26 15:24 UTC, Karl Tomlinson
[PATCH 3/6] RADEONDownloadFromScreenCS: flush CS writes before mapping BO for read (1.84 KB, patch) 2010-08-26 15:26 UTC, Karl Tomlinson
[PATCH 4/6] radeon: complete big endian UTS and DFS even when scratch allocation fails (8.22 KB, patch) 2010-08-26 15:28 UTC, Karl Tomlinson
[PATCH 5/6] radeon: complete UTS and DFS even when a scratch BO is not necessary (12.36 KB, patch) 2010-08-26 15:29 UTC, Karl Tomlinson
[PATCH 6/6] RADEONPrepareAccess_CS: fallback to DFS when pixmap is in VRAM (2.46 KB, patch) 2010-08-26 15:33 UTC, Karl Tomlinson

Description Török Edwin 2010-03-17 14:52:05 UTC

Tried to play back some html5 video (using the ogg file format), and it's unusable: I get maybe 1 FPS, or even lower, and the CPU is hogged at 100%.
This is on an Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz.

Oprofile shows:
 87534    72.0000  libpixman-1.so.0.16.4    libpixman-1.so.0.16.4    fetch_pixel_x8r8g8b8

suokko on #radeon captured these fallback messages:
      Composite fallback: op Over,
      src 0x9578268:s fmt XRGB8888 (1280x678),
      mask None,
      dst 0x94c3538:s fmt XRGB8888 (801x423),


Test pages with html5 video showing bad performance:
http://www.dailymotion.com/openvideodemo
http://videos.videoonwikipedia.org/

HTML5 video pages with good performance:
http://people.xiph.org/~maikmerten/demos/bigbuckbunny-videoonly.html

With FGLRX I remember I was getting good performance on the dailymotion html5 testpage!

Sadly with the OSS radeon/r600 driver that is not the case, flash playback is much faster than the html5 one :(

Xorg: X.Org X Server 1.7.5.902 (1.7.6 RC 2)
xf86-video-ati-6.12.192 - built from git
mesa - built from git master (but I don't think mesa version matters here, does it?)

Comment 1 Török Edwin 2010-03-17 14:52:30 UTC

Created attachment 34163
Xorg.0.log and dmesg

Comment 2 Török Edwin 2010-03-17 14:53:52 UTC

lspci output:
01:00.0 VGA compatible controller: ATI Technologies Inc RV730 PRO [Radeon HD 4650] (prog-if 00 [VGA controller])
        Subsystem: PC Partner Limited Device e930
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 33
        Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at e5000000 (64-bit, non-prefetchable) [size=64K]
        Region 4: I/O ports at 9000 [size=256]
        [virtual] Expansion ROM at e4000000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: radeon

And this is when using KMS, I didn't try without.
Compositing off, or compositing off in KDE4, doesn't matter, same low fps.

Comment 3 Bug 2010-03-17 15:01:45 UTC

IRC comments

A: standalone the vids work ok

B: yes. It is the scaling composite operation that is missing for 
                the video playback


The problem is the same on r300 btw, and with pixman-0.17.12.

Unscaled the videos play fine at negligible load (0.00 - 0.05)
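To put a number on the scaling in the fallback message above (source 1280x678, destination 801x423): Render transforms are expressed in 16.16 fixed point and map destination coordinates to source coordinates. The sketch below computes that per-axis scale. The function name `toy_scale_fixed` is made up for illustration, and nothing here is taken from Firefox or the driver.

```c
#include <stdint.h>

/* Source pixels per destination pixel, in 16.16 fixed point.
 * Hypothetical helper: 65536 means "unscaled". */
int32_t toy_scale_fixed(int src_size, int dst_size)
{
    return (int32_t)(((int64_t)src_size << 16) / dst_size);
}
```

For the reported sizes both axes come out at roughly 1.6 (a value above 65536), so the picture carries a transform and the composite is not the unscaled case that plays fine.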

Comment 4 Alex Deucher 2010-03-17 20:41:01 UTC

I think you are hitting this fallback:

    /* for REPEAT_NONE, Render semantics are that sampling outside the source                                    
     * picture results in alpha=0 pixels. We can implement this with a border color                              
     * *if* our source texture has an alpha channel, otherwise we need to fall                                   
     * back. If we're not transformed then we hope that upper layers have clipped                                
     * rendering to the bounds of the source drawable, in which case it doesn't                                  
     * matter. I have not, however, verified that the X server always does such                                  
     * clipping.                                                                                                 
     */
    /* FIXME R6xx */
    if (pPict->transform != 0 && repeatType == RepeatNone && PICT_FORMAT_A(pPict->format) == 0) {
        if (!(((op == PictOpSrc) || (op == PictOpClear)) && (PICT_FORMAT_A(pDstPict->format) == 0)))
            RADEON_FALLBACK(("REPEAT_NONE unsupported for transformed xRGB source\n"));
    }

commenting out that code should improve performance at the expense of render compliance.
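For anyone who wants to reason about which cases hit this check, here is a standalone model of the condition. It is only a sketch: the enums and `toy_needs_fallback` are hypothetical stand-ins for the Render constants and macros used in the driver, and the booleans replace the `transform` pointer and `PICT_FORMAT_A` tests.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the Render constants used by the driver. */
enum toy_op { TOY_OP_CLEAR, TOY_OP_SRC, TOY_OP_OVER };
enum toy_repeat { TOY_REPEAT_NONE, TOY_REPEAT_NORMAL, TOY_REPEAT_PAD };

/* Returns true when the quoted check would reject hardware acceleration. */
bool toy_needs_fallback(bool transformed, enum toy_repeat repeat,
                        bool src_has_alpha, enum toy_op op,
                        bool dst_has_alpha)
{
    if (transformed && repeat == TOY_REPEAT_NONE && !src_has_alpha) {
        /* Src and Clear onto an alpha-less destination are exempt. */
        bool exempt = (op == TOY_OP_SRC || op == TOY_OP_CLEAR) && !dst_has_alpha;
        return !exempt;
    }
    return false;
}
```

The reported composite (Over, scaled XRGB8888 source, XRGB8888 destination) corresponds to `toy_needs_fallback(true, TOY_REPEAT_NONE, false, TOY_OP_OVER, false)`, which is true. The same call with `TOY_REPEAT_PAD` or `TOY_OP_SRC` is false.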

Comment 5 Török Edwin 2010-03-18 01:26:55 UTC

(In reply to comment #4)
> I think you are hitting this fallback:
> 
>     /* for REPEAT_NONE, Render semantics are that sampling outside the source
>      * picture results in alpha=0 pixels. We can implement this with a border color
>      * *if* our source texture has an alpha channel, otherwise we need to fall
>      * back. If we're not transformed then we hope that upper layers have clipped
>      * rendering to the bounds of the source drawable, in which case it doesn't
>      * matter. I have not, however, verified that the X server always does such
>      * clipping.
>      */
>     /* FIXME R6xx */
>     if (pPict->transform != 0 && repeatType == RepeatNone && PICT_FORMAT_A(pPict->format) == 0) {
>         if (!(((op == PictOpSrc) || (op == PictOpClear)) && (PICT_FORMAT_A(pDstPict->format) == 0)))
>             RADEON_FALLBACK(("REPEAT_NONE unsupported for transformed xRGB source\n"));
>     }
> 
> commenting out that code should improve performance at the expense of render compliance.

Yes, after putting an #if 0 around that if statement, this performance bug is fixed,
and the video plays nicely.
Oprofile shows this, so no more fallbacks!
samples  %        image name               app name                 symbol name
25791    38.1028  libxul.so                libxul.so                /usr/lib/xulrunner-1.9.1/libxul.so
7004     10.3475  libmozjs.so.2d           libmozjs.so.2d           /usr/lib/libmozjs.so.2d
2556      3.7761  libcairo.so.2.10800.10   libcairo.so.2.10800.10   /usr/lib/libcairo.so.2.10800.10

Comment 6 Siarhei Siamashka 2010-03-18 01:31:27 UTC

Maybe it makes sense trying to patch firefox not to use REPEAT_NONE if it is bad for hardware acceleration in general?

Comment 7 Siarhei Siamashka 2010-03-18 01:46:11 UTC

(In reply to comment #3)
> The problem is the same on r300 btw, and with pixman-0.17.12.

At least 'fetch_pixel_x8r8g8b8' should not be used by pixman-0.17.12, otherwise it is missing some optimized software scaling path there too.

Versions 0.17.x of pixman should have some optimizations for scaling with both bilinear (maybe it gets just 3x faster, so in practice it may be still too slow) and nearest (this should have a really huge boost) filters. They *should* be better than 0.16.x for scaling performance.

Comment 8 Siarhei Siamashka 2010-03-18 02:13:07 UTC

Another question (sorry for continuous spamming). If the source image does not have alpha channel, isn't OVER operation just equivalent to SRC?

And if the operation is SRC, then it should not fallback according to the quoted fragment of code?

Comment 9 Pauli 2010-03-18 03:01:13 UTC

> --- Comment #7 from Siarhei Siamashka <siarhei.siamashka@gmail.com>  2010-03-18 01:46:11 PST ---
> (In reply to comment #3)
>> The problem is the same on r300 btw, and with pixman-0.17.12.
>
> At least 'fetch_pixel_x8r8g8b8' should not be used by pixman-0.17.12, otherwise
> it is missing some optimized software scaling path there too.
>
> Versions 0.17.x of pixman should have some optimizations for scaling with both
> bilinear (maybe it gets just 3x faster, so in practice it may be still too
> slow) and nearest (this should have a really huge boost) filters. They
> *should* be better than 0.16.x for scaling performance.
>
>

I don't think that scaling should be that slow with CPU. It looks so
slow that I suspect that pixman is accessing VRAM directly.

Comment 10 Török Edwin 2010-03-18 03:25:29 UTC

Without the X driver patch, and with latest pixman I get this:

samples  %        image name               app name                 symbol name
45054    94.6075  libpixman-1.so.0.17.12   libpixman-1.so.0.17.12   bits_image_fetch_bilinear_no_repeat_8888
1112      2.3351  libxul.so                libxul.so                /usr/lib/xulrunner-1.9.1/libxul.so


9669 21.4224 :   3ec6c:       or     0x4(%rsi,%r8,4),%r9d
 23512 52.0926 :   3ec71:       mov    0x48(%rsp),%r8
               :   3ec76:       or     (%r8,%r15,4),%r13d
   954  2.1137 :   3ec7a:       mov    0x54(%rsp),%r8d
               :   3ec7f:       or     0x4(%r14,%r15,4),%r8d
 10768 23.8573 :   3ec84:       mov    %r10,%r14

             :    /* Alpha and Blue */
               :    tl64 = tl & 0xff0000ff;
  9675 21.4357 :    tr64 = tr & 0xff0000ff;
 23517 52.1037 :    bl64 = bl & 0xff0000ff;
   954  2.1137 :    br64 = br & 0xff0000ff;

 10817 23.9659 :    f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
    35  0.0775 :    r |= ((f >> 16) & 0x000000ff00000000ull) | (f & 0xff000000ull);
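As a reading aid for the annotated source above: the hot lines are the four neighbouring source pixels (tl, tr, bl, br) being loaded and weighted. Below is a single-channel sketch of the same weighting. It assumes 8-bit fractional weights, and `toy_bilinear_channel` is an illustrative name rather than a pixman function.

```c
#include <stdint.h>

/* Blend four 8-bit samples with weights that sum to 256 * 256.
 * distx and disty are the fractional sample position, 0..255
 * (an assumed 8-bit weight precision). */
uint8_t toy_bilinear_channel(uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br,
                             unsigned distx, unsigned disty)
{
    unsigned idistx = 256 - distx;
    unsigned idisty = 256 - disty;
    uint32_t f = tl * (idistx * idisty)   /* distixiy */
               + tr * (distx * idisty)    /* distxiy  */
               + bl * (idistx * disty)    /* distixy  */
               + br * (distx * disty);    /* distxy   */
    return (uint8_t)(f >> 16);
}
```

The arithmetic itself is cheap. Every output pixel needs four source reads, though, which fits the suspicion in comment 9 that the time goes into reading the source memory rather than into the math.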

Comment 11 Michel Dänzer 2010-03-18 03:38:04 UTC

(In reply to comment #6)
> Maybe it makes sense trying to patch firefox not to use REPEAT_NONE if it is
> bad for hardware acceleration in general?

Absolutely! The RENDER semantics don't match other APIs or GPUs very well, and most apps which use RepeatNone probably rather want RepeatPad.

That said, as the FIXME comment in that code indicates, it should be possible to accelerate all cases despite these quirky semantics with current GPUs.
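To make the difference between the two repeat modes concrete, here is a one-dimensional sketch of what a fetch outside the source bounds returns. The names `toy_fetch`, `TOY_NONE` and `TOY_PAD` are invented for this example. It models the Render semantics described above, not any driver or pixman code.

```c
#include <stdint.h>

enum toy_repeat_mode { TOY_NONE, TOY_PAD };

/* Fetch one ARGB32 pixel from a row of `width` pixels. */
uint32_t toy_fetch(const uint32_t *row, int width, int x,
                   enum toy_repeat_mode mode)
{
    if (x >= 0 && x < width)
        return row[x];
    if (mode == TOY_PAD)
        return row[x < 0 ? 0 : width - 1];  /* clamp to the nearest edge */
    return 0;                               /* RepeatNone: alpha=0 pixel */
}
```

With `TOY_PAD` an out-of-bounds fetch returns an edge pixel, which is what a clamping texture sampler does. With `TOY_NONE` it returns transparent black, so an xRGB source with no alpha channel cannot express the result directly.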

Comment 12 Karl Tomlinson 2010-08-11 23:50:22 UTC

(In reply to comment #8)
> Another question (sorry for continuous spamming). If the source image does not
> have alpha channel, isn't OVER operation just equivalent to SRC?

SRC and OVER differ in how the alpha=0 sampling from outside the source picture is applied to the destination.  OVER leaves the destination unaltered, while SRC clears the destination.
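A small per-pixel sketch of that difference, assuming premultiplied ARGB32. `toy_over`, `toy_src` and `toy_mul8` are illustrative names, not Render or pixman API.

```c
#include <stdint.h>

/* Multiply an 8-bit channel by an 8-bit factor, rounding to nearest. */
uint8_t toy_mul8(uint8_t c, uint8_t a)
{
    unsigned t = c * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* OVER on premultiplied ARGB32: src + dst * (1 - src_alpha). */
uint32_t toy_over(uint32_t src, uint32_t dst)
{
    uint8_t inv_alpha = (uint8_t)(255 - (src >> 24));
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned s = (src >> shift) & 0xff;
        unsigned d = toy_mul8((uint8_t)((dst >> shift) & 0xff), inv_alpha);
        unsigned sum = s + d;
        if (sum > 255)
            sum = 255;                  /* saturate */
        out |= (uint32_t)sum << shift;
    }
    return out;
}

/* SRC simply replaces the destination. */
uint32_t toy_src(uint32_t src, uint32_t dst)
{
    (void)dst;
    return src;
}
```

For an opaque source pixel the two agree. For the alpha=0 pixel sampled outside the source, `toy_over(0, dst)` returns `dst` unchanged while `toy_src(0, dst)` returns 0.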

Comment 13 Karl Tomlinson 2010-08-12 05:45:44 UTC

Thanks Alex, for clearly pointing to the cause of the fallback.

And Michel, you are probably right that RepeatPad with a clip would be suitable for Firefox.  I'll see if I can make that change.  One of the reasons for Firefox using RepeatNone though was to work around cairo falling back to software to work around server-side render bugs with RepeatPad.

I'm puzzled by / curious about two things:

1. I don't think/recall that this issue was noticeable with user mode setting.
   The same fallback happens without kms, right?
   Is the migration policy different with kms, or why the apparent regression?

2. The fallback composite seems much slower than even fetching a snapshot of
   the whole screen.
   Is the vram being accessed via mmap or similar rather than a faster bulk
   fetch?  (Is this what Pauli was implying?)

Comment 14 Alex Deucher 2010-08-12 05:55:13 UTC

(In reply to comment #13)
> I'm puzzled by / curious about two things:
> 
> 1. I don't think/recall that this issue was noticeable with user mode setting.
>    The same fallback happens without kms, right?
>    Is the migration policy different with kms, or why the apparent regression?
> 

Fallback is the same regardless of kms vs. ums.  With kms the driver manages pixmaps vs. exa core with ums.

> 2. The fallback composite seems much slower than even fetching a snapshot of
>    the whole screen.
>    Is the vram being accessed via mmap or similar rather than a faster bulk
>    fetch?  (Is this what Pauli was implying?)

EXA migrates the pixmap to system ram for a fallback, then migrates it back to vram when it's needed for accel.

Comment 15 Karl Tomlinson 2010-08-12 23:22:54 UTC

https://bugzilla.mozilla.org/show_bug.cgi?id=581797 covers changing from
RepeatNone to RepeatPad for Firefox video.

However, I fear there still may be situations where software fallback happens
and fallback seems much much slower than necessary (as indicated by better
perf with ums).

Can mmap access be satisfactory/tolerable for software fallback, if only reads
(no writes) are performed?

i.e. Is the big problem here that pixman is alternating reading and writing
from the mmapped vram (which invalidates readahead/caches, triggers barriers,
or something)?

If so, I wonder which of these would be more appropriate:

A) that exa passes a new EXA_PREPARE_RW flag to indicate that read *and*
   write access is required, so that RADEONPrepareAccess_CS can return false
   and the pixmap be migrated to system ram, or

B) that exa or pixman be modified so that read/write ping-ponging does not
   happen?

Comment 16 Michel Dänzer 2010-08-13 02:30:56 UTC

(In reply to comment #15)
> A) that exa passes a new EXA_PREPARE_RW flag to indicate that read *and*
>    write access is required, so that RADEONPrepareAccess_CS can return false
>    and the pixmap be migrated to system ram, or

You can easily test this by making RADEONPrepareAccess_CS always return FALSE.

Comment 17 Karl Tomlinson 2010-08-15 23:03:10 UTC

Given that all the time is spent in bits_image_fetch_bilinear_no_repeat_8888,
I guess the issue is simply reading from the source pixmap.

I'll try out some modifications to DownloadFromScreen and UploadToScreen so that PrepareAccess can return FALSE when the pixmap might be in vram.

Comment 18 Michel Dänzer 2010-08-16 03:22:05 UTC

(In reply to comment #17)
> I'll try out some modifications to DownloadFromScreen and UploadToScreen so
> that PrepareAccess can return FALSE when the pixmap might be in vram.

This already works in the big endian paths, so no modification other than possibly enabling parts of those should be necessary.

Comment 19 Karl Tomlinson 2010-08-16 07:34:49 UTC

Created attachment 37900
Avoid CPU reads of VRAM

This changes video rate from 2 seconds per frame with 100% CPU in Xorg to
something close to 24 fps with 40% CPU in Xorg.  (Of course, CPU usage drops
to 10% when Firefox uses RepeatPad, but this patch is useful so that fallback
is not so punishing.)

The key is avoiding reading from VRAM via CPU, but using instead
the RADEONBlitChunk path in RADEONDownloadFromScreenCS.

Modifications to RADEONUploadToScreenCS and RADEONDownloadFromScreenCS include:

* Completing the operation even when a scratch BO is not necessary
  (like the big endian byte-swap paths).

* Flushing CS before mapping the pixmap BO for read, if CS references the BO
  for writing.  (I don't know exactly which situations lead to an unflushed CS
  here, but RADEONPrepareAccess_CS ensured a flush, so it seems consistent to
  do so here.)

* Completing the operation even when scratch BO space allocation fails.

  This sometimes requires a flush even in UploadToScreen.

  Currently, this just falls back to a similarly slow CPU read even from VRAM.
  I guess scratch allocation could be retried after flushing CS, but I haven't
  added that support here.  I don't know what leads to an unflushed CS here so
  don't really know how much space might be freed by doing that.

* If radeon_bo_is_busy doesn't set src_domain (and it stays zero), then the
  scratch BO path is taken because the pixmap BO might be in VRAM.

Making RADEONUploadToScreenCS and RADEONDownloadFromScreenCS reliable means
that RADEONPrepareAccess_CS can choose when to proceed (succeed).

In this patch, RADEONPrepareAccess_CS proceeds if it knows that the BO is not
going to be in VRAM.  Maybe, in some ways, it might be better to fail so that
EXA can consider migrating the pixmap out.  However, if the BO is in GTT, then
proceeding in PrepareAccess saves some memcpy and leaves the BO available for
future GPU reads.

AFAIK migrating the BO from VRAM to GTT in PrepareAccess doesn't seem to be a
good idea without more information.  Only EXA really knows whether a read is
necessary, and only EXA knows which portions of the pixmap will be read, so
DownloadFromScreen when EXA knows it is necessary seems the best solution.
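The policy described above can be summarised in a few lines. This is a rough model only: the enum values and function names are hypothetical and do not match the identifiers in the patch. Zero stands for the case where the domain was not reported.

```c
#include <stdbool.h>

/* Hypothetical placement values; zero means "not reported". */
enum toy_domain { TOY_DOMAIN_UNKNOWN = 0, TOY_DOMAIN_GTT, TOY_DOMAIN_VRAM };

enum toy_read_path { TOY_MAP_DIRECT, TOY_BLIT_VIA_SCRATCH };

/* PrepareAccess-style decision: map the BO for CPU access only when it
 * is known not to be in VRAM. */
bool toy_prepare_access_ok(enum toy_domain domain)
{
    return domain == TOY_DOMAIN_GTT;
}

/* DownloadFromScreen-style decision: if the BO is, or might be, in VRAM,
 * let the GPU copy it into a scratch buffer instead of reading VRAM
 * with the CPU. */
enum toy_read_path toy_download_path(enum toy_domain domain)
{
    if (domain == TOY_DOMAIN_VRAM || domain == TOY_DOMAIN_UNKNOWN)
        return TOY_BLIT_VIA_SCRATCH;
    return TOY_MAP_DIRECT;
}
```

When `toy_prepare_access_ok` is false, EXA is expected to fall back to DownloadFromScreen, which is why that path has to complete reliably.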

Comment 20 Michel Dänzer 2010-08-16 08:02:24 UTC

The patch seems to contain some good ideas but to try to do too many things at once. Please post it to the xorg-driver-ati mailing list directly using git send-email (or at least generated by git format-patch) for easier review and discussion.

Comment 21 Alex Deucher 2010-08-16 08:20:13 UTC

FWIW, the r6xx/r7xx code in r600_exa.c will need a similar treatment.

Comment 22 Karl Tomlinson 2010-08-16 22:04:30 UTC

In case anyone is trying this at home, this is also needed:

@@ -342,2 +367,3 @@ void RADEONFinishAccess_CS(PixmapPtr pPi
     radeon_bo_unmap(driver_priv->bo);
+    driver_priv->bo_mapped = FALSE;
     pPix->devPrivate.ptr = NULL;
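A toy model of why the flag needs resetting. Only the `bo_mapped` name comes from the diff. The struct, the `map_count` counter and the early return in `toy_finish_access` are assumptions made for the example, not the driver's actual logic.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_pixmap {
    bool  bo_mapped;  /* mirrors the flag in the diff */
    void *ptr;        /* CPU pointer handed to software rendering */
    int   map_count;  /* outstanding mappings (illustrative only) */
};

void toy_prepare_access(struct toy_pixmap *p, void *backing)
{
    p->map_count++;
    p->bo_mapped = true;
    p->ptr = backing;
}

void toy_finish_access(struct toy_pixmap *p)
{
    if (!p->bo_mapped)
        return;               /* assumed guard: nothing to undo */
    p->map_count--;
    p->bo_mapped = false;     /* the line the diff adds */
    p->ptr = NULL;
}
```

In this model, without the reset a second `toy_finish_access` would unmap again, and any code that checks `bo_mapped` afterwards would see a stale TRUE.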

Comment 23 Karl Tomlinson 2010-08-23 00:53:10 UTC

(In reply to comment #20)
> The patch seems to contain some good ideas but to try to do too many things at
> once. Please post it to the xorg-driver-ati mailing list directly using git
> send-email (or at least generated by git format-patch) for easier review and
> discussion.

I broke the patch up and touched up a couple of things.
Apologies for the resend of patches with bad headers due to my own failure.

If someone knows any secrets to make lists.x.org archive more than the first part of multipart messages, please let me know.
Unfortunately I don't know of anywhere where the attachments are archived.

I guess I can configure git send-email if necessary.

Comment 24 Alex Deucher 2010-08-26 11:07:37 UTC

probably makes sense to attach the relevant patches here as well.

Comment 25 Karl Tomlinson 2010-08-26 15:10:02 UTC

Created attachment 38202
[PATCH 1/6] DownloadFromScreenCS: download via a scratch BO if pixmap domain is unknown

Comment 26 Karl Tomlinson 2010-08-26 15:16:48 UTC

Created attachment 38203
[PATCH 2/6] FinishAccess_CS: set bo_mapped to FALSE on unmap

Comment 27 Karl Tomlinson 2010-08-26 15:24:46 UTC

Created attachment 38204
[PATCH 3/6] RADEONDownloadFromScreenCS: flush CS writes before mapping BO for read

Comment 28 Karl Tomlinson 2010-08-26 15:26:11 UTC

Created attachment 38205
[PATCH 3/6] RADEONDownloadFromScreenCS: flush CS writes before mapping BO for read

(A subsequent patch proposes removing #if X_BYTE_ORDER.)

Comment 29 Karl Tomlinson 2010-08-26 15:28:16 UTC

Created attachment 38206
[PATCH 4/6] radeon: complete big endian UTS and DFS even when scratch allocation fails

Comment 30 Karl Tomlinson 2010-08-26 15:29:32 UTC

Created attachment 38207
[PATCH 5/6] radeon: complete UTS and DFS even when a scratch BO is not necessary

Comment 31 Karl Tomlinson 2010-08-26 15:33:56 UTC

Created attachment 38208
[PATCH 6/6] RADEONPrepareAccess_CS: fallback to DFS when pixmap is in VRAM

Perhaps something else to consider in the future is moving the
BO from VRAM to GTT in DFS (and not moving it back in UTS),
but that also has pros and cons.
This approach seems to work well enough so far.

In this patch, RADEONPrepareAccess_CS still proceeds if it knows
that the BO is not going to be in VRAM.  EXA will release its
system memory copy, so that there is only one copy in system
memory.

 (Maybe, in some ways, it might be better to fail so that EXA can
  keep a copy and won't have to refetch if the BO gets moved to
  VRAM, but it seems pointless to keep around two copies in system
  memory and memcpy between them for GPU reads.)

I wondered whether PrepareAccess could fail for the visible screen
with mixed pixmaps as suggested here
http://www.mentby.com/maarten-maathuis/exa-classic-problem-with-xv.html
When I tried that, however, I ended up with pixels in the wrong
places, a bit like what I would expect if the pitch were wrong.

Comment 32 Sedat Dilek 2010-08-30 13:39:37 UTC

I have tested the patch-series on an IBM T40p notebook with RV250 on an i386 Debian/sid system:

Linux-kernel 2.6.36-rc3, libdrm 2.4.21-1 (Debian/sid), mesa-from-git (commit cd4bd4fb53f82361480f388923ef9e2fa7379d68: r600g: use the values from the correct literals), xserver 1.7.7-4 (Debian/sid) Firefox 3.5.11.

[1] Without patch-series:
The system is unusable, has a CPU-load of 100% and dropouts in audio/video while doing html5 video-playback [1] in FFX.

[2] With patch-series:
No A/V dropouts, CPU-load max 70% and system is usable, playback in FFX is a bit jerky (in both cases) but this might be due to the lame GPU.

Thanks Karl!

- Sedat (dile{X,ks} on IRC -

[1] http://www.dailymotion.com/openvideodemo

Comment 33 Török Edwin 2010-08-30 13:44:19 UTC

(In reply to comment #32)
> I have tested the patch-series on an IBM T40p notebook with RV250 on an i386
> Debian/sid system:
> 
> Linux-kernel 2.6.36-rc3, libdrm 2.4.21-1 (Debian/sid), mesa-from-git (commit
> cd4bd4fb53f82361480f388923ef9e2fa7379d68: r600g: use the values from the
> correct literals), xserver 1.7.7-4 (Debian/sid) Firefox 3.5.11.
> 
> [1] Without patch-series:
> The system is unusable, has a CPU-load of 100% and dropouts in audio/video
> while doing html5 video-playback [1] in FFX.
> 
> [2] With patch-series:
> No A/V dropouts, CPU-load max 70% and system is usable, playback in FFX is a
> bit jerky (in both cases) but this might be due to the lame GPU.
> 
> Thanks Karl!
> 
> - Sedat (dile{X,ks} on IRC -
> 
> [1] http://www.dailymotion.com/openvideodemo

I can confirm that it is much better on my GPU as well (RV730pro) on amd64.

Comment 34 Alex Deucher 2010-08-30 15:05:56 UTC

Michel, any objections?  These look good to me.  evergreen will need to be updated as well once I merge it to master.

Comment 35 Török Edwin 2010-09-18 13:25:23 UTC

(In reply to comment #34)
> Michel, any objections?  These look good to me.  evergreen will need to be
> updated as well once I merge it to master.

Any progress on this? I haven't seen commits to xf86-video-ati for quite a while...

Comment 36 Michel Dänzer 2010-09-20 01:25:52 UTC

Pushed Karl's patches to Git master.

Comment 37 Fabio Pedretti 2010-09-24 07:56:49 UTC

Some videos are still slow, but maybe they are different bugs. E.g.:
http://hacks.mozilla.org/2010/04/account-manager-coming-to-firefox/

Comment 38 Clemens Eisserer 2012-04-12 15:04:28 UTC

Firefox still doesn't use SHM for uploading video data to the X-Server, instead they pump all the data through unix domain sockets even for the local case.
At xlib/xcb's default buffer size of 16kb this of course results in context switch storms.
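Some back-of-the-envelope arithmetic for that claim, as a sketch. It assumes 32 bits per pixel, no protocol overhead, 16kb meaning 16384 bytes, and the 1280x678 source size from the fallback message in the description. The helper names are made up.

```c
#include <stddef.h>

/* Bytes in one 32-bit-per-pixel frame. */
size_t toy_frame_bytes(size_t width, size_t height)
{
    return width * height * 4;
}

/* Number of buffer-sized chunks needed to send `bytes` bytes. */
size_t toy_chunks(size_t bytes, size_t buffer_size)
{
    return (bytes + buffer_size - 1) / buffer_size;
}
```

Under those assumptions a frame is 3471360 bytes, i.e. 212 buffer fills per frame, or about 5000 per second at 24 fps. How many context switches each fill actually costs is not something this estimate captures.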