I've traced the massive slowdown to the memcpy() in "mesa/src/gallium/auxiliary/util/u_upload_mgr.c::u_upload_data()", which seems to be used to move data from host memory into video card memory. The slowdown can be observed when the non-SIMD version of the glibc-2.27 function is used (like the one that ships with 32-bit Slackware-current). The system mesa3d package does not exhibit the same slowdown, but it seems to be linked against glibc-2.5.

I suspect that the slowdown is caused by a memcpy() implementation that copies data backwards, starting from the end and moving toward the beginning. This is likely treated as a non-sequential data transfer over the PCI bus (it probably sends the full 32-bit address for every 32 bits of data). Using the SSE2 memcpy seems to avoid this problem, but I have no idea whether that is because it copies more data at once or because it copies forward.

In my benchmarks, `perf top` showed the problematic memcpy() consuming 25% of CPU time. In a particular game benchmark, I was getting 50fps instead of 70fps. Just replacing that memcpy() with memmove() fixed the issue for me, without having to recompile and replace glibc. However, I do not consider that a reliable fix, as nothing guarantees that memmove() will do the right thing.

I think the correct solution would be to create a new function, memcpy_to_pci(), with assembly implementation(s) specifically crafted to maximize PCI/PCIe throughput. The kernel has memcpy_toio()/memcpy_fromio(), but they don't seem to be asm-optimized. I've seen MPlayer's MMX-optimized mem2agpcpy() in aclib_template.c.
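For illustration, a minimal C sketch of what such a memcpy_to_pci() could look like, using SSE2 intrinsics rather than hand-written assembly. This is only an assumption about the shape of the proposed function, not existing Mesa code: it copies strictly forward and uses non-temporal stores so the destination is never read or cached.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the proposed memcpy_to_pci(): strictly forward,
 * non-temporal stores, never reads the destination. */
static void memcpy_to_pci(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Forward byte copy until the destination is 16-byte aligned. */
    while (n && ((uintptr_t)d & 15)) {
        *d++ = *s++;
        n--;
    }
    /* Bulk copy, forward, with MOVNTDQ (write-combining friendly,
     * bypasses the cache). */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    _mm_sfence();  /* make the non-temporal stores globally visible */

    /* Forward byte copy for the tail. */
    while (n--)
        *d++ = *s++;
}
```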
There already is an asm-optimized version of memcpy() in glibc. Why would we want to reinvent that in Mesa?

glibc should pick the right implementation for your system.
(In reply to Timothy Arceri from comment #1)
> There already is an asm-optimized version of memcpy() in glibc. Why would
> we want to reinvent that in Mesa?
>
> glibc should pick the right implementation for your system.

Because some implementations copy data backwards, and this creates a huge problem when the data is written over PCIe. To be clear:

    for (i = 0; i < size; i++)      dst[i] = src[i];  // forward copy
    for (i = size-1; i >= 0; i--)   dst[i] = src[i];  // backwards copy
Isn't this mapped as WC? In that case I'd expect the direction to make little difference, since the write combining of any decent CPU should be able to combine the writes regardless of the order. Although if it's UC, I suppose someone needs to ensure that the maximum possible access size is picked...
What game/benchmark do you see this with?

Can you try calling _mesa_streaming_load_memcpy() there? It's for reading uncached memory, but by the looks of it, it might be suitable for writing too.
(In reply to Roland Scheidegger from comment #3)
> Isn't this mapped as WC? In that case I'd expect the direction to make
> little difference, since the write combining of any decent CPU should be
> able to combine the writes regardless of the order. Although if it's UC,
> I suppose someone needs to ensure that the maximum possible access size
> is picked...

The theory that this is a caching issue has merit, since the distribution version and my build seem to use the exact same memcpy(), one that goes backwards, yet the distribution one does not trigger the massive slowdown. The memmove() uses `rep movsb` and the direction flag.

The question is: what controls the caching? How does userland Mesa3D control the PAT cache flags? Because I am just changing the libraries, without rebooting the machine or restarting Xorg (I don't even stop the Steam client), the MTRR registers are not changed and the exact same kernel module and configuration are used.

I do use a modified build script that disables support for hardware I don't have, like Intel and NVIDIA. Some of my options might cause the cache problem, but I need to know what I am looking for.

BTW, the system libdrm is the latest version (2.4.92).
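For reference on the direction flag: `rep movsb` copies forward when DF is clear (after `cld`) and backward when DF is set (after `std`). A minimal GCC inline-asm sketch of the two variants (hypothetical helper names, x86 only, an illustration rather than actual glibc code):

```c
#include <stddef.h>

/* Forward copy: DF cleared, pointers walk up. */
static void rep_movsb_forward(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("cld\n\trep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory", "cc");
}

/* Backward copy: DF set, pointers start at the last byte and walk down.
 * DF must be cleared again afterwards; the ABI requires DF=0. */
static void rep_movsb_backward(void *dst, const void *src, size_t n)
{
    if (!n)
        return;
    unsigned char *d = (unsigned char *)dst + n - 1;
    const unsigned char *s = (const unsigned char *)src + n - 1;
    __asm__ volatile ("std\n\trep movsb\n\tcld"
                      : "+D"(d), "+S"(s), "+c"(n)
                      :
                      : "memory", "cc");
}
```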
(In reply to Timothy Arceri from comment #1)
> There already is an asm-optimized version of memcpy() in glibc. Why would
> we want to reinvent that in Mesa?
>
> glibc should pick the right implementation for your system.

How would memcpy() know that the destination is mapped to PCIe address space, i.e. gets transparently transferred over the PCIe bus (which has its own performance constraints)?
(In reply to Grazvydas Ignotas from comment #4)
> What game/benchmark do you see this with?
>
> Can you try calling _mesa_streaming_load_memcpy() there? It's for reading
> uncached memory, but by the looks of it, it might be suitable for writing
> too.

I'm running Left4Dead2 under Wine with Gallium Nine. The game has a `timedemo` option that can replay a previously `record`-ed gameplay session, so the benchmark is consistent. I run it in a window, so I can watch the terminal with `perf top`. When the problem is present, memcpy() is always first with 25% usage, while everything else is below 2%.

I have to point out that I run a 64-bit kernel; I just need the 32-bit libraries, since the game is 32-bit.

_mesa_streaming_load_memcpy() is a little problematic to test, since it is written with intrinsics and I'm compiling for i486 (that's what my distribution does). The function also has a strong alignment requirement for both src and dst, and can fall back to regular memcpy(). Still, its existence is proof that there is a need for such functionality.
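For context, the technique behind a streaming-load copy looks roughly like the sketch below. This is only an illustration of the approach under its alignment assumptions, not the actual _mesa_streaming_load_memcpy() implementation (which is more elaborate and handles the edge cases):

```c
#include <smmintrin.h>  /* SSE4.1: MOVNTDQA */
#include <stddef.h>

/* Illustrative streaming-load copy: MOVNTDQA makes reads from WC/uncached
 * memory fast, but requires 16-byte-aligned pointers; this sketch also
 * assumes n is a multiple of 16. */
static void streaming_load_copy_sketch(void *dst, void *src, size_t n)
{
    __m128i *d = dst;
    __m128i *s = src;

    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(&d[i], _mm_stream_load_si128(&s[i]));
}
```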
(In reply to Eero Tamminen from comment #6)
> (In reply to Timothy Arceri from comment #1)
> > There already is an asm-optimized version of memcpy() in glibc. Why
> > would we want to reinvent that in Mesa?
> >
> > glibc should pick the right implementation for your system.
>
> How would memcpy() know that the destination is mapped to PCIe address
> space, i.e. gets transparently transferred over the PCIe bus (which has
> its own performance constraints)?

"The slowdown can be observed when the non-SIMD version of the glibc-2.27 function is used (like the one that ships with 32-bit Slackware-current). ... Using the SSE2 memcpy seems to avoid this problem"

glibc should select the SSE2 (or better) version of memcpy(). If Slackware doesn't ship SSE2 support for glibc, I don't see how this is Mesa's fault.

If I'm misunderstanding something, please clarify. Otherwise I'm inclined to close this as WONTFIX.
(In reply to Timothy Arceri from comment #8)
> "... Using the SSE2 memcpy seems to avoid this problem"
>
> glibc should select the SSE2 (or better) version of memcpy(). If Slackware
> doesn't ship SSE2 support for glibc, I don't see how this is Mesa's fault.
>
> If I'm misunderstanding something, please clarify. Otherwise I'm inclined
> to close this as WONTFIX.

Please, I'm not done investigating this bug. I also intend to write some patches for it.

1. The glibc memcpy() is optimized for system->system memory transfers. While it might be faster than the problematic one, it still may not be the optimal one. Also, nothing guarantees that glibc's memcpy() will continue to work properly here in the future. That's why it is a good idea for Mesa to have its own implementation that is known to always do the right thing for sys->vid memory transfers. I can write the x86(_64) MMX/AVX assembly; I've written SIMD before. Finding all the functions that would have to use it might be trickier and need help from experts. (The memcpy() I've reported is mostly used by Nine, but I'm getting the same problem with other memcpy()s when using OpenGL.)

---

2. Another issue that has to be checked is related to write-combine caching. In the past, the XFree86 DDX driver set video memory region caching through the MTRR registers. That was removed in favor of using PAT (Page Attribute Table, i.e. setting caching per memory page). I have asked developers where the PAT handling code is. Is it in the kernel KMS code, in libdrm, or in Mesa3D itself? Where exactly? How do I check the caching status? So far nobody has been brave enough to answer. And if nobody has checked that code recently, it might have silently stopped working some time ago. (One reason why SSE2 code might work better is that it usually employs MOVNTQ; that instruction uses write-combining stores, avoiding cache pollution.)

I want Mesa3D to always be fast. So help me help you.
My personal train of thought:

Details such as WC are left to the kernel module. Even in cases where userspace can provide hints, it's ultimately up to the kernel to manage it.

Optimising without knowing the benchmark/game name is _seriously_ moot. Furthermore, doing benchmarks on an i586 build is also fairly moot.

You are correct though - _if_ glibc decides to change things, perf _may_ drop.

If memcpy shows up so prominently in perf, we should look at why we're using it so often. Polishing the memcpy implementation is putting on a band-aid instead of fixing the actual problem.

Again, that's my personal take. Feel free to ignore.
Libc memcpy() obviously won't be optimized for PCI bus transfers; that's way too rare a use-case for it. E.g. libpciaccess would seem a more suitable place for a PCI-bus-transfer-optimized memory copy function, but unfortunately it doesn't (currently) provide an API for that.

(In reply to Emil Velikov from comment #10)
> If memcpy shows up so prominently in perf, we should look at why we're
> using it so often. Polishing the memcpy implementation is putting on a
> band-aid instead of fixing the actual problem.

I.e. are the uploads triggered by something in the driver, rather than the application itself directly doing them? "valgrind --tool=callgrind <program>" would output callgraph info with call counts etc., which can be viewed in kcachegrind.
Why are we even discussing a potential optimisation where the user is _unknown_? It contradicts the principles that we've been using in Mesa for years.
As I've said, I'm still investigating the issue. Here are some of the things I've found so far:

1. Slackware32, i586 and glibc.
Slackware tries to support as many machines as possible; since i586 is still supported by the kernel, Slackware compiles everything to be able to run on i586. The problem is that, for some reason, glibc compiled for i586 does NOT support multi-arch. It does not use CPUID (which is available on all i586 and some i486 CPUs) to pick a specific version for the running CPU. glibc supports multi-arch only for i686 builds.

2. The glibc i586 memcpy().
At first I thought that the problem was writing backwards. It made sense. I was wrong. Actually the glibc i586 memcpy() does a forward copy, but it also reads the _destination_ in order to load the entire cache line (32 bytes) before overwriting that cache line. It seems that Pentium 1 processors had no "write store" and did not "write allocate", so a "write miss" would fall through, i.e. writes would be sent to the system RAM without being cached. So the "optimization" involves manually loading the cache line first, by explicitly reading it. Of course, reading from PCI is slow and not cached; and in this exact case it is also completely unnecessary. Here is the source of the memcpy function and the comment explaining the read-ahead of the destination:
https://github.com/lattera/glibc/blob/master/sysdeps/i386/i586/memcpy.S#L70

3. Why the system package had no issue.
Well, it turned out to be quite simple - gcc inlined its own built-in memcpy(), which was just `rep movsb`. It does not do this if you compile for i486, i686 or newer, or if you touch the compile flags.

4. The upload data function.
I added a printf() to the problem function u_upload_data() to check what parameters its memcpy() gets. An apitrace file I had called the problem function about 820 times per rendered frame: most of the time (56%) with the biggest size, 3136 bytes, then 32% with sizes around 512, and the rest smaller (up to 128).

I suspect that the transfer might actually be vram->vram. E.g. in MTRR I have:
    reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
Most of the logs look like:
    u_upload_data.memcpy(0xec90cf00, 0xed4e84c0, 544)
If I run the trace with `R600_DEBUG=nodma`, I get:
    u_upload_data.memcpy(0xeab60900, 0x7cbef570, 3136)
(The "nodma" does not help with the glibc i586 memcpy slowness.)

It looks to me like the data is first moved ram->vram using DMA, then vram->vram using the CPU... There should be a better way to do that.

My video card is an AMD Radeon HD5670 (Evergreen Redwood), which uses the r600 driver. This transfer function is highlighted by Nine. There are others that involve OpenGL too; I just haven't tracked them down yet.
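To make finding 2 concrete, here is an approximate C rendition of the i586 memcpy.S behaviour linked above (the real code is assembly; the helper name is hypothetical):

```c
#include <stddef.h>

/* Approximation of the glibc i586 memcpy trick: each 32-byte destination
 * cache line is read before it is overwritten, so a Pentium without
 * write-allocate caches it first.  Over uncached/WC PCIe memory, that
 * extra READ of the destination is exactly what kills performance. */
static void i586_style_copy(char *dst, const char *src, size_t n)
{
    volatile const char *probe = dst;
    size_t i = 0;

    for (; i + 32 <= n; i += 32) {
        (void)probe[i];                    /* touch dst's cache line first */
        for (size_t j = 0; j < 32; j++)    /* then a plain forward copy */
            dst[i + j] = src[i + j];
    }
    for (; i < n; i++)                     /* tail */
        dst[i] = src[i];
}
```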
(In reply to iive from comment #13)
> Of course, reading from PCI is slow and not cached; and in this exact case
> it is also completely unnecessary.

Right, reading from uncacheable memory can certainly explain the slowness.

> It looks to me like the data is first moved ram->vram using DMA, then
> vram->vram using the CPU...

No. u_upload_data is for copying data from normal system memory into GPU accessible memory. (You're comparing physical and virtual memory addresses, AKA apples and oranges :)
(In reply to Michel Dänzer from comment #14)
> No. u_upload_data is for copying data from normal system memory into GPU
> accessible memory. (You're comparing physical and virtual memory
> addresses, AKA apples and oranges :)

Not really. It is a custom of the Linux kernel to repeat physical addresses as effective mappings; it simplifies a number of things. Also, this is not system memory that could be freed and reused in any order at the kernel's discretion. It is a frame buffer that is mapped from another device, and the relative addressing should be preserved as much as possible. I.e. I expect that the whole 512MB buffer is mapped at once, so if one of these addresses is in vram, then the other should be too.

You could probably help me (dis)prove this by telling me how to obtain the effective address of the frame buffer. Xorg.0.log lists only:
    [ 114.494] (--) PCI:*(1@0:0:0) 1002:68d8:1458:21d9 rev 0, Mem @ 0xe0000000/268435456, 0xfbdc0000/131072, I/O @ 0x0000ce00/256, BIOS @ 0x????????/131072

I do understand that u_upload_data() is for copying data from normal system memory into GPU accessible memory, so copying vram->vram would be some kind of bug.
(In reply to iive from comment #15)
> I.e. I expect that the whole 512MB buffer is mapped at once.

It's not (if it were, one process could access the buffer object memory of another process, bypassing process separation); TTM maps the memory of each buffer object into userspace individually.

The whole MTRR thing is irrelevant anyway due to PAT.

You've found the problem in glibc's memcpy() reading from the destination; no need to look any further.
(In reply to Michel Dänzer from comment #16)
> It's not (if it were, one process could access the buffer object memory of
> another process, bypassing process separation); TTM maps the memory of
> each buffer object into userspace individually.
>
> The whole MTRR thing is irrelevant anyway due to PAT.
>
> You've found the problem in glibc's memcpy() reading from the destination;
> no need to look any further.

The physical and effective addresses could be mapped 1:1 while each process maps only the pages that belong to it, meaning that pages owned by other processes would simply remain unmapped in the current one.

Anyway, this does not answer my question of how to (dis)prove that the memcpy does or does not do vmem->vmem. It is relevant, as one way to fix this issue is to NOT use memcpy() for the transfer if DMA is already employed.

Axel Davy promised to take a look at that one, as it is related to Nine.
I double-checked that it is indeed likely to be a GTT WC read issue by looking at the mentioned trace. Some vertex buffers are in GTT WC (but with no memcpy inside Mesa) and some buffers are in VRAM, with the content being filled by Nine with buffer_subdata, which does a memcpy inside radeonsi (it maps the buffer, then does the memcpy).

That said, the elements of the default pool are allocated with:

    res->domains = RADEON_DOMAIN_VRAM;
    res->flags |= RADEON_FLAG_GTT_WC;

So while it lists VRAM as the location, I'm not sure how the flag is used and whether it affects the mapping.
Axel, I'm not sure what you're saying. Anyway, if the problem was that the source of the memcpy is uncacheable, surely it would always be slow, regardless of which memcpy implementation is used?

> So while it lists VRAM as the location, I'm not sure how the flag is used
> and whether it affects the mapping.

RADEON_FLAG_GTT_WC means a write-combined CPU mapping will be used while the buffer object resides in the GTT domain. CPU mappings of VRAM are always write-combined.
To clarify what I said: based on our source code and the calls made by the game trace, the only "upload" that could occur every frame is vertex buffer data upload. The game uses two types of d3d vertex buffers:

. One is mapped to the gallium STREAM pool, thus to GTT WC. For this d3d buffer, the game uses unsynchronized writes. There is no memcpy on the Mesa side.

. One is stored in a RAM buffer, linked to a buffer in the gallium DEFAULT pool. The application writes to the RAM buffer, and when required, Nine uploads the dirty locations to the GPU buffer with the buffer_subdata call. si_buffer_subdata seems to map the buffer and memcpy the data.

If, as you say, VRAM mappings are write-combined, then we don't need to look further. The slowdown comes from the distro memcpy, which reads the VRAM content on the buffer_subdata call.
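Schematically, the VRAM path described above boils down to the following (hypothetical names and types for illustration; not the actual radeonsi code):

```c
#include <string.h>

/* Hypothetical stand-in: a GPU buffer whose VRAM is CPU-visible through a
 * write-combined mapping. */
struct gpu_buffer {
    void *wc_map;
};

static void buffer_subdata_sketch(struct gpu_buffer *buf, unsigned offset,
                                  unsigned size, const void *data)
{
    char *dst = (char *)buf->wc_map + offset;

    /* This memcpy() writes to write-combined PCIe memory; a libc
     * implementation that also READS the destination (like the i586 one
     * discussed above) turns every cache line into a slow uncached read. */
    memcpy(dst, data, size);
}
```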
(In reply to iive from comment #13)
> Slackware32, i586 and glibc.
> Slackware tries to support as many machines as possible; since i586 is
> still supported by the kernel, Slackware compiles everything to be able
> to run on i586. The problem is that, for some reason, glibc compiled for
> i586 does NOT support multi-arch. It does not use CPUID (which is
> available on all i586 and some i486 CPUs) to pick a specific version for
> the running CPU. glibc supports multi-arch only for i686 builds.

I'm all for allowing old hardware to continue to be used, but if you want performance you should pick a distro that targets "modern" hardware. Alternatively, file a bug against / submit a patch for glibc.

Given this and comment 20, I'm going to close this as not our bug.