I tried to play back some HTML5 video (Ogg file format), and it is unusable: I get maybe 1 FPS or even lower, and the CPU is pegged at 100%. This is on an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz.

Oprofile shows:

    87534 72.0000 libpixman-1.so.0.16.4 libpixman-1.so.0.16.4 fetch_pixel_x8r8g8b8

suokko on #radeon captured these fallback messages:

    Composite fallback: op Over, src 0x9578268:s fmt XRGB8888 (1280x678), mask None, dst 0x94c3538:s fmt XRGB8888 (801x423),

Test pages with HTML5 video showing bad performance:
http://www.dailymotion.com/openvideodemo
http://videos.videoonwikipedia.org/

HTML5 video pages with good performance:
http://people.xiph.org/~maikmerten/demos/bigbuckbunny-videoonly.html

With FGLRX I remember getting good performance on the Dailymotion HTML5 test page. Sadly, that is not the case with the OSS radeon/r600 driver: Flash playback is much faster than HTML5 playback :(

Xorg: X.Org X Server 1.7.5.902 (1.7.6 RC 2)
xf86-video-ati-6.12.192 - built from git
mesa - built from git master (but I don't think the mesa version matters here, does it?)
Created attachment 34163 [details] Xorg.0.log and dmesg
lspci output:

01:00.0 VGA compatible controller: ATI Technologies Inc RV730 PRO [Radeon HD 4650] (prog-if 00 [VGA controller])
        Subsystem: PC Partner Limited Device e930
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 33
        Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at e5000000 (64-bit, non-prefetchable) [size=64K]
        Region 4: I/O ports at 9000 [size=256]
        [virtual] Expansion ROM at e4000000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: radeon

And this is when using KMS, I didn't try without. Compositing off, or compositing off in KDE4, doesn't matter, same low fps.
IRC comments:

A: standalone the vids work ok
B: yes. It is the scaling composite operation that is missing for the video playback

The problem is the same on r300, by the way, and with pixman-0.17.12. Unscaled, the videos play fine at negligible load (0.00 - 0.05).
I think you are hitting this fallback:

    /* for REPEAT_NONE, Render semantics are that sampling outside the source
     * picture results in alpha=0 pixels. We can implement this with a border color
     * *if* our source texture has an alpha channel, otherwise we need to fall
     * back. If we're not transformed then we hope that upper layers have clipped
     * rendering to the bounds of the source drawable, in which case it doesn't
     * matter. I have not, however, verified that the X server always does such
     * clipping.
     */
    /* FIXME R6xx */
    if (pPict->transform != 0 && repeatType == RepeatNone && PICT_FORMAT_A(pPict->format) == 0) {
        if (!(((op == PictOpSrc) || (op == PictOpClear)) && (PICT_FORMAT_A(pDstPict->format) == 0)))
            RADEON_FALLBACK(("REPEAT_NONE unsupported for transformed xRGB source\n"));
    }

Commenting out that code should improve performance at the expense of Render compliance.
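For readers who want to see which combinations trip the check, here is a minimal sketch of the same condition as a standalone predicate. The enum and the function name are hypothetical and exist only for this illustration; the driver itself uses the Render macros and types shown in the quoted code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the Render operators; not the real PictOp values. */
enum sketch_op { SKETCH_OP_CLEAR, SKETCH_OP_SRC, SKETCH_OP_OVER };

/*
 * Returns true when the quoted check would take the software fallback.
 *   transformed:   the source picture has a transform (scaling, for example)
 *   repeat_none:   the source repeat type is RepeatNone
 *   src_has_alpha: the source format has an alpha channel
 *   dst_has_alpha: the destination format has an alpha channel
 */
static bool sketch_needs_fallback(enum sketch_op op, bool transformed,
                                  bool repeat_none, bool src_has_alpha,
                                  bool dst_has_alpha)
{
    if (transformed && repeat_none && !src_has_alpha) {
        /* Src and Clear onto an alpha-less destination are exempt. */
        bool exempt = (op == SKETCH_OP_SRC || op == SKETCH_OP_CLEAR) &&
                      !dst_has_alpha;
        return !exempt;
    }
    return false;
}
```

With the values from the fallback message in the original report (op Over, scaled XRGB8888 source, XRGB8888 destination), this predicate returns true, which matches the observed fallback.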
(In reply to comment #4)
> I think you are hitting this fallback:
>
>     /* for REPEAT_NONE, Render semantics are that sampling outside the source
>      * picture results in alpha=0 pixels. We can implement this with a border color
>      * *if* our source texture has an alpha channel, otherwise we need to fall
>      * back. If we're not transformed then we hope that upper layers have clipped
>      * rendering to the bounds of the source drawable, in which case it doesn't
>      * matter. I have not, however, verified that the X server always does such
>      * clipping.
>      */
>     /* FIXME R6xx */
>     if (pPict->transform != 0 && repeatType == RepeatNone && PICT_FORMAT_A(pPict->format) == 0) {
>         if (!(((op == PictOpSrc) || (op == PictOpClear)) && (PICT_FORMAT_A(pDstPict->format) == 0)))
>             RADEON_FALLBACK(("REPEAT_NONE unsupported for transformed xRGB source\n"));
>     }
>
> commenting out that code should improve performance at the expense of render
> compliance.

Yes, after putting an #if 0 around that if statement, this performance bug is fixed and the video plays nicely.

Oprofile shows this, so no more fallbacks!

samples  %        image name               app name                 symbol name
25791    38.1028  libxul.so                libxul.so                /usr/lib/xulrunner-1.9.1/libxul.so
7004     10.3475  libmozjs.so.2d           libmozjs.so.2d           /usr/lib/libmozjs.so.2d
2556      3.7761  libcairo.so.2.10800.10   libcairo.so.2.10800.10   /usr/lib/libcairo.so.2.10800.10
Maybe it makes sense trying to patch firefox not to use REPEAT_NONE if it is bad for hardware acceleration in general?
(In reply to comment #3)
> THe problem is the same on r300 btw, and with pixman-0.17.12.

At least 'fetch_pixel_x8r8g8b8' should not be used by pixman-0.17.12; otherwise it is missing some optimized software scaling path there too.

Versions 0.17.x of pixman should have some optimizations for scaling with both the bilinear filter (maybe it gets just 3x faster, so in practice it may still be too slow) and the nearest filter (this should get a really huge boost). They *should* be better than 0.16.x for scaling performance.
Another question (sorry for the continuous spamming). If the source image does not have an alpha channel, isn't the OVER operation just equivalent to SRC? And if the operation is SRC, then it should not fall back according to the quoted fragment of code?
> --- Comment #7 from Siarhei Siamashka <siarhei.siamashka@gmail.com> 2010-03-18 01:46:11 PST ---
> (In reply to comment #3)
>> THe problem is the same on r300 btw, and with pixman-0.17.12.
>
> At least 'fetch_pixel_x8r8g8b8' should not be used by pixman-0.17.12, otherwise
> it is missing some optimized software scaling path there too.
>
> Versions 0.17.x of pixman should have some optimizations for scaling with both
> bilinear (maybe it gets just 3x faster, so in practice it may be still too
> slow) and nearest (this should have a really huge boost) filters. The they
> *should* be better than 0.16.x for scaling performance.

I don't think that scaling should be that slow with CPU. It looks so slow that I suspect that pixman is accessing VRAM directly.
Without the X driver patch, and with latest pixman I get this:

samples  %        image name               app name                 symbol name
45054    94.6075  libpixman-1.so.0.17.12   libpixman-1.so.0.17.12   bits_image_fetch_bilinear_no_repeat_8888
1112      2.3351  libxul.so                libxul.so                /usr/lib/xulrunner-1.9.1/libxul.so

 9669 21.4224 :   3ec6c:  or     0x4(%rsi,%r8,4),%r9d
23512 52.0926 :   3ec71:  mov    0x48(%rsp),%r8
              :   3ec76:  or     (%r8,%r15,4),%r13d
  954  2.1137 :   3ec7a:  mov    0x54(%rsp),%r8d
              :   3ec7f:  or     0x4(%r14,%r15,4),%r8d
10768 23.8573 :   3ec84:  mov    %r10,%r14

              :    /* Alpha and Blue */
              :    tl64 = tl & 0xff0000ff;
 9675 21.4357 :    tr64 = tr & 0xff0000ff;
23517 52.1037 :    bl64 = bl & 0xff0000ff;
  954  2.1137 :    br64 = br & 0xff0000ff;
10817 23.9659 :    f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
   35  0.0775 :    r |= ((f >> 16) & 0x000000ff00000000ull) | (f & 0xff000000ull);
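To make the annotated source line easier to follow, here is a minimal single-channel sketch of the bilinear blend it performs. The weight names come from the profile output above; the function name and the assumption that the four weights are products of 8-bit fractional distances (so that they sum to 65536) are mine, and the real pixman code processes several channels at once in 64-bit arithmetic. Each output pixel needs four source reads (tl, tr, bl, br), which is why slow reads from the source pixmap dominate the profile.

```c
#include <stdint.h>

/*
 * Blend one 8-bit channel of the four neighbouring source pixels.
 * distx, disty: fractional position inside the 2x2 cell, each in 0..255
 * (assumed 8-bit fixed point for this sketch).
 */
static uint8_t sketch_bilinear_channel(uint8_t tl, uint8_t tr,
                                       uint8_t bl, uint8_t br,
                                       unsigned distx, unsigned disty)
{
    /* The four weights always sum to 256 * 256 = 65536. */
    unsigned distxy   = distx * disty;                  /* bottom-right */
    unsigned distxiy  = distx * (256 - disty);          /* top-right */
    unsigned distixy  = (256 - distx) * disty;          /* bottom-left */
    unsigned distixiy = (256 - distx) * (256 - disty);  /* top-left */

    unsigned f = tl * distixiy + tr * distxiy + bl * distixy + br * distxy;

    return (uint8_t)(f >> 16);  /* divide by 65536 */
}
```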
(In reply to comment #6)
> Maybe it makes sense trying to patch firefox not to use REPEAT_NONE if it is
> bad for hardware acceleration in general?

Absolutely! The RENDER semantics don't match other APIs or GPUs very well, and most apps which use RepeatNone probably rather want RepeatPad.

That said, as the FIXME comment in that code indicates, it should be possible to accelerate all cases despite these quirky semantics with current GPUs.
(In reply to comment #8)
> Another question (sorry for continuous spamming). If the source image does not
> have alpha channel, isn't OVER operation just equivalent to SRC?

SRC and OVER differ in how the alpha=0 sampling from outside the source picture is applied to the destination. OVER leaves the destination unaltered, while SRC clears the destination.
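A minimal sketch of that difference, assuming premultiplied ARGB8888 pixels and the standard Porter-Duff formulas. The function names are hypothetical and this is not code from the X server or pixman; it only shows what happens to one destination pixel when the sampled source pixel is the fully transparent value that RepeatNone yields outside the source picture.

```c
#include <stdint.h>

/* OVER on premultiplied ARGB8888: result = src + dst * (1 - src_alpha). */
static uint32_t sketch_over(uint32_t src, uint32_t dst)
{
    uint32_t inv_a = 255 - (src >> 24);
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        uint32_t c = s + (d * inv_a + 127) / 255;

        if (c > 255)
            c = 255;
        out |= c << shift;
    }
    return out;
}

/* SRC: the destination is simply replaced by the source. */
static uint32_t sketch_src(uint32_t src, uint32_t dst)
{
    (void)dst;
    return src;
}
```

For a sample outside the source (0x00000000), `sketch_over` returns the destination unchanged, while `sketch_src` returns 0x00000000, that is, it clears the destination pixel.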
Thanks, Alex, for clearly pointing to the cause of the fallback.

And Michel, you are probably right that RepeatPad with a clip would be suitable for Firefox. I'll see if I can make that change. One of the reasons Firefox uses RepeatNone, though, was to avoid cairo falling back to software, which cairo does to work around server-side Render bugs with RepeatPad.

I'm puzzled by / curious about two things:

1. I don't think/recall that this issue was noticeable with user mode setting.
   The same fallback happens without kms, right?
   Is the migration policy different with kms, or why the apparent regression?

2. The fallback composite seems much slower than even fetching a snapshot of the whole screen.
   Is the vram being accessed via mmap or similar rather than a faster bulk fetch? (Is this what Pauli was implying?)
(In reply to comment #13)
> I'm puzzled by / curious about two things:
>
> 1. I don't think/recall that this issue was noticeable with user mode setting.
>    The same fallback happens without kms, right?
>    Is the migration policy different with kms, or why the apparent regression?

Fallback is the same regardless of kms vs. ums. With kms the driver manages pixmaps vs. exa core with ums.

> 2. The fallback composite seems much slower than even fetching a snapshot of
>    the whole screen.
>    Is the vram being accessed via mmap or similar rather than a faster bulk
>    fetch? (Is this what Pauli was implying?)

EXA migrates the pixmap to system ram for a fallback, then migrates it back to vram when it's needed for accel.
https://bugzilla.mozilla.org/show_bug.cgi?id=581797 covers changing from RepeatNone to RepeatPad for Firefox video.

However, I fear there may still be situations where software fallback happens, and fallback seems much, much slower than necessary (as indicated by better perf with ums).

Can mmap access be satisfactory/tolerable for software fallback, if only reads (no writes) are performed? That is, is the big problem here that pixman is alternating reading and writing from the mmapped vram (which invalidates readahead/caches, triggers barriers, or something)?

If so, I wonder which of these would be more appropriate:

A) that exa passes a new EXA_PREPARE_RW flag to indicate that read *and* write access is required, so that RADEONPrepareAccess_CS can return false and the pixmap be migrated to system ram, or

B) that exa or pixman be modified so that read/write ping-ponging does not happen?
(In reply to comment #15)
> A) that exa passes a new EXA_PREPARE_RW flag to indicate that read *and*
>    write access is required, so that RADEONPrepareAccess_CS can return false
>    and the pixmap be migrated to system ram, or

You can easily test this by making RADEONPrepareAccess_CS always return FALSE.
Given that all the time is spent in bits_image_fetch_bilinear_no_repeat_8888, I guess the issue is simply reading from the source pixmap.

I'll try out some modifications to DownloadFromScreen and UploadToScreen so that PrepareAccess can return FALSE when the pixmap might be in vram.
(In reply to comment #17)
> I'll try out some modifications to DownloadFromScreen and UploadToScreen so
> that PrepareAccess can return FALSE when the pixmap might be in vram.

This already works in the big endian paths, so no modification other than possibly enabling parts of those should be necessary.
Created attachment 37900 [details] [review]
Avoid CPU reads of VRAM

This changes video rate from 2 seconds per frame with 100% CPU in Xorg to something close to 24 fps with 40% CPU in Xorg. (Of course, CPU usage drops to 10% when Firefox uses RepeatPad, but this patch is useful so that fallback is not so punishing.)

The key is avoiding reading from VRAM via CPU, but using instead the RADEONBlitChunk path in RADEONDownloadFromScreenCS.

Modifications to RADEONUploadToScreenCS and RADEONDownloadFromScreenCS include:

* Completing the operation even when a scratch BO is not necessary (like the big endian byte-swap paths).

* Flushing CS before mapping the pixmap BO for read, if CS references the BO for writing. (I don't know exactly which situations lead to an unflushed CS here, but RADEONPrepareAccess_CS ensured a flush, so it seems consistent to do so here.)

* Completing the operation even when scratch BO space allocation fails. This sometimes requires a flush even in UploadToScreen. Currently, this just falls back to a similarly slow CPU read even from VRAM. I guess scratch allocation could be retried after flushing CS, but I haven't added that support here. I don't know what leads to an unflushed CS here so don't really know how much space might be freed by doing that.

* If radeon_bo_is_busy doesn't set src_domain (and it stays zero), then the scratch BO path is taken because the pixmap BO might be in VRAM.

Making RADEONUploadToScreenCS and RADEONDownloadFromScreenCS reliable means that RADEONPrepareAccess_CS can choose when to proceed (succeed).

In this patch, RADEONPrepareAccess_CS proceeds if it knows that the BO is not going to be in VRAM. Maybe, in some ways, it might be better to fail so that EXA can consider migrating the pixmap out. However, if the BO is in GTT, then proceeding in PrepareAccess saves some memcpy and leaves the BO available for future GPU reads.
AFAIK migrating the BO from VRAM to GTT in PrepareAccess doesn't seem to be a good idea without more information. Only EXA really knows whether a read is necessary, and only EXA knows which portions of the pixmap will be read, so DownloadFromScreen when EXA knows it is necessary seems the best solution.
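The policy described in the patch comment can be summarized with a small sketch. This is not the patch itself: the domain constants and the function name are hypothetical, and the real code queries the BO through the radeon buffer-object API. It only illustrates the decision "map directly when the BO is known to be outside VRAM, otherwise decline so that EXA uses DownloadFromScreen".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory-domain bits, for this sketch only. */
#define SKETCH_DOMAIN_GTT  (1u << 0)
#define SKETCH_DOMAIN_VRAM (1u << 1)

/*
 * Returns true if PrepareAccess may map the BO for direct CPU access,
 * false if it should decline so that EXA goes through DownloadFromScreen
 * (the blit-to-scratch path).
 *
 * domain == 0 means the current domain is unknown.
 */
static bool sketch_prepare_access_can_map(uint32_t domain)
{
    if (domain == 0)
        return false;  /* unknown: the BO might be in VRAM */
    if (domain & SKETCH_DOMAIN_VRAM)
        return false;  /* CPU reads from VRAM are the slow case */
    return true;       /* GTT only: mapping avoids an extra memcpy */
}
```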
The patch seems to contain some good ideas but to try to do too many things at once. Please post it to the xorg-driver-ati mailing list directly using git send-email (or at least generated by git format-patch) for easier review and discussion.
FWIW, the r6xx/r7xx code in r600_exa.c will need a similar treatment.
In case anyone is trying this at home, this is also needed:

@@ -342,2 +367,3 @@ void RADEONFinishAccess_CS(PixmapPtr pPi
     radeon_bo_unmap(driver_priv->bo);
+    driver_priv->bo_mapped = FALSE;
     pPix->devPrivate.ptr = NULL;
(In reply to comment #20)
> The patch seems to contain some good ideas but to try to do too many things at
> once. Please post it to the xorg-driver-ati mailing list directly using git
> send-email (or at least generated by git format-patch) for easier review and
> discussion.

I broke the patch up and touched up a couple of things. Apologies for the resend of patches with bad headers due to my own failure.

If someone knows any secrets to make lists.x.org archive more than the first part of multipart messages, please let me know. Unfortunately I don't know of anywhere where the attachments are archived. I guess I can configure git send-email if necessary.
It probably makes sense to attach the relevant patches here as well.
Created attachment 38202 [details] [review] [PATCH 1/6] DownloadFromScreenCS: download via a scratch BO if pixmap domain is unknown
Created attachment 38203 [details] [review] [PATCH 2/6] FinishAccess_CS: set bo_mapped to FALSE on unmap
Created attachment 38204 [details] [review] [PATCH 3/6] RADEONDownloadFromScreenCS: flush CS writes before mapping BO for read
Created attachment 38205 [details] [review] [PATCH 3/6] RADEONDownloadFromScreenCS: flush CS writes before mapping BO for read (A subsequent patch proposes removing #if X_BYTE_ORDER.)
Created attachment 38206 [details] [review] [PATCH 4/6] radeon: complete big endian UTS and DFS even when scratch allocation fails
Created attachment 38207 [details] [review] [PATCH 5/6] radeon: complete UTS and DFS even when a scratch BO is not necessary
Created attachment 38208 [details] [review]
[PATCH 6/6] RADEONPrepareAccess_CS: fallback to DFS when pixmap is in VRAM

Perhaps something else to consider in the future is moving the BO from VRAM to GTT in DFS (and not moving it back in UTS), but that also has pros and cons. This approach seems to work well enough so far.

In this patch, RADEONPrepareAccess_CS still proceeds if it knows that the BO is not going to be in VRAM. EXA will release its system memory copy, so that there is only one copy in system memory. (Maybe, in some ways, it might be better to fail so that EXA can keep a copy and won't have to refetch if the BO gets moved to VRAM, but it seems pointless to keep around two copies in system memory and memcpy between them for GPU reads.)

I wondered whether PrepareAccess could fail for the visible screen with mixed pixmaps as suggested here:
http://www.mentby.com/maarten-maathuis/exa-classic-problem-with-xv.html
When I tried that, however, I ended up with pixels in the wrong places, a bit like what I would expect if the pitch were wrong.
I have tested the patch series on an IBM T40p notebook with RV250 on an i386 Debian/sid system:

Linux kernel 2.6.36-rc3, libdrm 2.4.21-1 (Debian/sid), mesa from git (commit cd4bd4fb53f82361480f388923ef9e2fa7379d68: r600g: use the values from the correct literals), xserver 1.7.7-4 (Debian/sid), Firefox 3.5.11.

[1] Without patch series:
The system is unusable, with a CPU load of 100% and dropouts in audio/video during HTML5 video playback [1] in FFX.

[2] With patch series:
No A/V dropouts, CPU load max 70%, and the system is usable. Playback in FFX is a bit jerky (in both cases), but this might be due to the lame GPU.

Thanks Karl!

- Sedat (dile{X,ks} on IRC -

[1] http://www.dailymotion.com/openvideodemo
(In reply to comment #32)
> I have tested the patch-series on an IBM T40p notebook with RV250 on an i386
> Debian/sid system:
>
> Linux-kernel 2.6.36-rc3, libdrm 2.4.21-1 (Debian/sid), mesa-from-git (commit
> cd4bd4fb53f82361480f388923ef9e2fa7379d68: r600g: use the values from the
> correct literals), xserver 1.7.7-4 (Debian/sid) Firefox 3.5.11.
>
> [1] Without patch-series:
> The system is unusable, has a CPU-load of 100% and dropouts in audio/video
> while doing html5 video-playback [1] in FFX.
>
> [2] With patch-series:
> No A/V dropouts, CPU-load max 70% and system is usable, playback in FFX is a
> bit jerking (in both cases) but this might due to lame GPU.
>
> Thanks Karl!
>
> - Sedat (dile{X,ks} on IRC -
>
> [1] http://www.dailymotion.com/openvideodemo

I can confirm that it is much better on my GPU as well (RV730pro) on amd64.
Michel, any objections? These look good to me. evergreen will need to be updated as well once I merge it to master.
(In reply to comment #34)
> Michel, any objections? These look good to me. evergreen will need to be
> updated as well once I merge it to master.

Any progress on this? I haven't seen commits to xf86-video-ati in quite a while...
Pushed Karl's patches to Git master.
Some videos are still slow, but maybe they are different bugs. E.g.: http://hacks.mozilla.org/2010/04/account-manager-coming-to-firefox/
Firefox still doesn't use SHM for uploading video data to the X server; instead it pumps all the data through Unix domain sockets, even in the local case. At xlib/xcb's default buffer size of 16kb this of course results in context switch storms.
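A rough back-of-the-envelope sketch of why this matters. It assumes, purely for illustration, that a frame is uploaded as 32-bit pixels at the 1280x678 source size seen in the fallback message in the original report, that "16kb" means 16384 bytes, and that each full client buffer turns into one socket write. None of these assumptions is confirmed in this thread, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Number of buffer-sized socket writes needed to push one frame,
 * rounding up for the final partial buffer.
 */
static size_t sketch_writes_per_frame(size_t width, size_t height,
                                      size_t bytes_per_pixel,
                                      size_t buffer_bytes)
{
    size_t frame_bytes = width * height * bytes_per_pixel;

    return (frame_bytes + buffer_bytes - 1) / buffer_bytes;
}
```

Under those assumptions, `sketch_writes_per_frame(1280, 678, 4, 16384)` gives 212 writes per frame, or roughly 5000 writes per second at 24 fps, each of which can wake the X server.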