Ubuntu; about: Gallium 0.4 on AMD TAHITI (DRM 2.43.0 / 4.4.0-57-generic, LLVM 3.9.1)

lspci: 02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X]

To reproduce: grab the Godot engine & samples from https://godotengine.org/download and run one of the 3D scenes. Notice the extremely bad performance.

There is a thread open on the Godot GitHub at https://github.com/godotengine/godot/issues/4799, and I'll link back to this bug from there.
I did a small profiling session with my mesa-git RX460 setup. It appears we are spending almost all of our time in amdgpu_cs_context_cleanup. I haven't invested more time into debugging this yet, but thought I'd mention it in case somebody starts working on it.
A small update: It appears we are spending all of the time in a for-loop at the bottom of amdgpu_cs_context_cleanup:

   for (i = 0; i < ARRAY_SIZE(cs->buffer_indices_hashlist); i++) {
      cs->buffer_indices_hashlist[i] = -1;
   }
It seems a VBO with a stride of 14 gets used; u_vbuf determines that this is not 4-byte aligned and then translates it on the CPU to be 4-byte aligned. For each of the 3 vertex attributes it maps the VBO read-only, for which radeonsi uses a staging buffer. As the copy to the staging buffer needs to finish during the map, we do an IB flush. We end up with 3 GFX IB flushes per draw call, which costs a lot of performance. I don't think the translation is even necessary, as all attributes are vectors of GL_BYTE and GL_HALF_FLOAT, which only need 1- and 2-byte alignment respectively.

Two things to improve:
- don't use a staging buffer for small maps (or, if we decide to keep it to serve future maps, make sure we actually share the staging buffer)
- make u_vbuf smarter so that it doesn't translate in this case (see the sketch below)
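A minimal sketch of the second idea. The function name and parameters here are illustrative, not the actual u_vbuf interface; the point is that translation would only kick in when an attribute's components genuinely cannot be naturally aligned in the incoming layout:

   #include <stdbool.h>

   /* Hypothetical per-attribute check, instead of u_vbuf's blanket
    * 4-byte-alignment rule.  component_size is the size in bytes of
    * one component: 1 for GL_BYTE, 2 for GL_HALF_FLOAT, 4 for GL_FLOAT. */
   static bool attrib_needs_translation(unsigned buffer_offset,
                                        unsigned src_offset,
                                        unsigned stride,
                                        unsigned component_size)
   {
      /* The attribute is naturally aligned for every vertex if both its
       * starting byte and the stride are multiples of the component
       * size.  With stride 14, GL_BYTE (1) and GL_HALF_FLOAT (2)
       * components pass this test, so the VBO in this bug would be
       * passed through untouched. */
      return (buffer_offset + src_offset) % component_size != 0 ||
             stride % component_size != 0;
   }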
(In reply to Thomas Helland from comment #2)
> A small update: It appears we are spending all of the time in a
> for-loop at the bottom of amdgpu_cs_context_cleanup:
>
>    for (i = 0; i < ARRAY_SIZE(cs->buffer_indices_hashlist); i++) {
>       cs->buffer_indices_hashlist[i] = -1;
>    }

Interesting fact: GCC with -O2 doesn't optimize loops like that to memcpy or memset, even if they are large. It needs the additional flag -ftree-loop-distribute-patterns (from -O3), or even -mtune options, to do it. Proof: https://godbolt.org/g/VwbSqK
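For what it's worth, a trivial way to sidestep the optimizer question entirely, assuming buffer_indices_hashlist is an array of a two's-complement integer type (it looks like a short array in amdgpu_cs): setting every byte to 0xff yields -1 in every element regardless of element width, and libc's memset is heavily vectorized.

   /* Sketch of a drop-in replacement for the quoted loop
    * (with <string.h> included). */
   memset(cs->buffer_indices_hashlist, 0xff,
          sizeof(cs->buffer_indices_hashlist));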
I've also thrown up a demo showing the situation, with some stuff from my current project. There is a build here: https://drive.google.com/open?id=0B_nQZvJoqbFmRUtYT3pnUm14Szg

It will load straight into a map; press "p" to bring up the performance dialog. On my various AMD cards in the house I get between 2-12 fps. On my integrated Intel I get about 50-60 fps. So there is definitely some issue here with the AMD cards.
Posting to say I have this problem on my RX 460 with Mesa 17. I get good performance in other games, but <10 fps in the Godot platformer demo. Is there anything I can do to help test or debug this?
I think this is actually fixed in Mesa git. I get 60 fps for the platformer demo now.
(In reply to Bas Nieuwenhuizen from comment #7)
> I think this is actually fixed in Mesa git. I get 60 fps for the
> platformer demo now.

If it's really fixed, that's good. It would be interesting to know what the change was.
(In reply to i9i7soft from comment #8)
> It would be interesting to know what the change was.

git bisect is your friend ;-)

If we have good performance numbers and a reasonably well-isolated patch, I won't object to having it in 17.0. See [1], and include the perf numbers as part of the nomination. Thanks.

[1] https://www.mesa3d.org/submittingpatches.html#nominations
I think what actually fixed this was the rework Marek did to stop using the gallium shared VBO format preprocessing (u_vbuf), which was causing this. IIRC that change is not really small enough for backporting, and besides, 17.0 isn't getting releases anymore.
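For context, a rough sketch of why that rework helps here, as I understand it: u_vbuf only takes over vertex buffer handling when the driver advertises restrictive alignment caps. Once radeonsi fetches vertex data in the shader, it can stop advertising them. The cap names below are quoted from gallium's p_defines.h as I recall them, so treat the exact spelling as an assumption, and this is not the actual radeonsi code:

   /* In the driver's get_param callback.  Returning 0 tells u_vbuf
    * that unaligned vertex buffers and strides are fine, so the
    * stride-14 VBO from this bug is no longer translated on the CPU. */
   static int si_get_param(struct pipe_screen *screen, enum pipe_cap param)
   {
      switch (param) {
      case PIPE_CAP_VERTEX_BUFFER_OFFSET_4BYTE_ALIGNED_ONLY:
      case PIPE_CAP_VERTEX_BUFFER_STRIDE_4BYTE_ALIGNED_ONLY:
      case PIPE_CAP_VERTEX_ELEMENT_SRC_OFFSET_4BYTE_ALIGNED_ONLY:
         return 0;
      default:
         return 0; /* all other caps elided from this sketch */
      }
   }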