Dear mesa/freedesktop team! I am a happy user of the open source radeon r300 driver for the Mobility Radeon 200M card in my (pretty old, but good enough) laptop. I recently updated my system and changed distro, and got a complete slowdown. I checked whether the kernel or the distro was to blame and found that I get my performance back if I revert to mesa 11.x and the corresponding xorg while still using the latest Linux kernel. This looks like some kind of performance regression.

The performance drop is heavy: a 50%..100% slowdown and very high CPU usage. For example, Extreme Tux Racer now reports 100% CPU usage, while before it reported 25-50% at most (including any spikes) and averaged around 30%.

I used the perf tool to locate what causes the heavy CPU usage and found a lot of memory movement in a create_bo call. See this log:

Samples: 171K of event 'cycles', Event count (approx.): 67632101793
  Children   Self  Command  Shared Object      Symbol
-   61,12%  0,09%  etr      [kernel.vmlinux]   [k] entry_SYSENTER_32
   - 61,07% entry_SYSENTER_32
      - 60,94% do_fast_syscall_32
         - 57,92% sys_ioctl
            - 57,84% do_vfs_ioctl
               - 57,43% radeon_drm_ioctl
                  - 57,04% drm_ioctl
                     - 56,81% drm_ioctl_kernel
                        - 55,86% radeon_gem_create_ioctl
                           - 55,46% radeon_gem_object_create
                              - 55,36% radeon_bo_create
                                 - 55,20% ttm_bo_init
                                    - 55,14% ttm_bo_init_reserved
                                       - 54,75% ttm_bo_validate
                                          - 54,42% ttm_bo_handle_move_mem
                                             - 54,07% ttm_tt_bind
                                                - 53,36% radeon_ttm_tt_populate
                                                   + 53,33% ttm_populate_and_map_pages
                                                  0,62% radeon_ttm_backend_bind

I see these are code paths in the kernel, but the same thing happens regardless of whether I use an old kernel (and modules) or the newest one, while it almost completely disappears when I revert to the old mesa and X. I do not see anything related to bo (buffer object??) creation in the mesa sources below the gallium/r300 directory, but I have found things that make the ioctl calls in:

gallium/winsys/radeon/drm/radeon_drm_bo.c

^^Is this also used for r300 cards? The "source tree" documentation page seems to tell me this is "shared code for r600 and radeonsi", but where does r300 do its ioctl calls then?

Something happened in the last 1-3 years that changed things to move memory around like crazy for some reason and burn a lot more CPU doing it. It surely did not have this heavy slowdown before. Now it is nearly as slow as llvmpipe for practical cases (but not slower!).

Can anyone help me with this? I am a developer myself, but I am not well versed in the mesa source code or in how to analyse its performance bottlenecks.

PS.: On phoronix I was already analysing the problem for a long time:

https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1099745-how-to-tell-if-a-driver-is-gallium-or-just-mesa-slow-renderng-with-radeon

^^There the whole process of what I tried is written up step by step, but maybe only the perf outputs are of interest from there...

Feel free to ask me anything about the issue. If I am able to help solve this myself I will be happy to, but I have never really made any patches to these kinds of core system libraries and I am quite a rookie at GPU drivers...
PS.: As you can see, I already have the mesa sources and can build and test my own, in case anyone has an idea of what I should search for...
The "gallium/winsys/radeon/drm/radeon_drm_bo.c" is indeed used by r300 because I have put a log in it.
When doing an strace, this is what I am getting:

...
ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0xbfafd880) = 0 <0.000068>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_RADEON_CS, 0xafe2404c) = 0 <0.000102>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfafd9c4) = 0 <0.000030>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_GEM_CLOSE, 0xbfafd99c) = 0 <0.000043>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0xbfafd880) = 0 <0.000070>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_RADEON_CS, 0xafe380e0) = 0 <0.000088>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfafd9c4) = 0 <0.000029>
 > [vdso]() [0x891]
ioctl(6, DRM_IOCTL_GEM_CLOSE, 0xbfafd99c) = 0 <0.000047>
 > [vdso]() [0x891]
...

These continuously appear in the mesa+xorg combination that is slow. I have no idea whether the same happens in the old 11.x version where the speed is still good (sadly I could not try any mesa between 11.x and 19.x so far, but I guess some of them are still fast).

I have tried reverting the following changes manually in my mesa.git version:

1. https://github.com/mesa3d/mesa/commit/9ac5504df5c31e60801d9d37c1f3d547c486bf0a
2. https://github.com/mesa3d/mesa/commit/a1e391e39df2b1d8169e773a30153167ab8e13e8
3. https://github.com/mesa3d/mesa/commit/0e40c6a7b70673734dfecf5957c086b30f11befe

Actually, after reverting the last one things got 1 FPS slower than with unchanged 19.x, but the first two helped a very small amount. In my journey to find the slowdown I started looking around "radeon_create_bo" and "radeon_winsys_bo_create", which is how I ended up trying to revert exactly these changes: they seemed possibly relevant from a git blame lookup.

If anyone knows where to look further or has any idea about my problem (maybe on the level of X or other parts of mesa creating a lot of "bo"s?) please tell me :-)

What is a "bo" in this sense, by the way? Is there any documentation I should read to understand these acronyms in the code? It took a considerable amount of time to work out that cs is some kind of "command stream" (still unsure) and bo is some kind of "buffer object", while I have no idea about "pb_" and a lot of other things, and I don't even know whether a bo is a general buffer for anything (vertex buffers, constant buffers, backbuffers, zbuffers, who-knows-what-else) or just for one specific thing here. Is there a recommended reading list before a noob like me touches the code?
Btw, what is a "slab buffer" or "slab"? It bugs me, and I see it everywhere in the code too...
Still fast with mesa 17.2.8 and X.Org X Server 1.19.5. The problem is somewhere between the 17.x and 19.x mesa versions (and the corresponding xorg).

I have also made an strace on an older system where things are good, to see the number of CREATE and CLOSE ioctl calls (and the number of CS ioctl calls): they are an order of magnitude smaller than with 19.x! For example, 10-20 seconds of glxgears leads to 9-10 calls to DRM_IOCTL_RADEON_GEM_CREATE on mesa 17.2.8, while it leads to 708 (!!!) of the same calls in the same time period on mesa 19.x! That is surely quite a big difference!

The similar pattern in 17.x almost never creates a new gem object:

...
ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000055>
ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000022>
ioctl(6, DRM_IOCTL_RADEON_CS, 0xb307d03c) = 0 <0.000089>
ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000053>
ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000023>
ioctl(6, DRM_IOCTL_RADEON_CS, 0xb30910d0) = 0 <0.000095>
ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000054>
ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000023>
ioctl(6, DRM_IOCTL_RADEON_CS, 0xb307d03c) = 0 <0.000090>
...

Sometimes, when the *_BUSY ioctl call returns -1, it issues a CREATE, but otherwise not. I think GEM is some kind of memory handler for the GPU (just like "ttm" in the perf output), and I think something in the memory handling scheme for the Mobility Radeon 200M (r300) got messed up in some mesa update between 17.x and 19.x.

Will try to bisect to a closer version, as 17.2.8 is from 2017-12-22...
Latest bisect result so far:

Fast mesa version: 17.2.8 (2017 December 22)
Slow mesa version: 18.0.5 (2018 June 3)

The change that slows the system down is between the two. The strace outputs are generally the same as before, so they are not worth sharing, but I have them and they differ the same way as described above - more GEM_CREATE ioctl calls in the slow version.
Latest bisect result so far:

Fast mesa version: 17.2.8 (2017 December 22)
Slow mesa version: 18.0.0 (git: 44c7d1aa2ea)

The change that slows the system down is between the two.
I have become a bit uncertain now: after every bisect step I am choosing "git bisect bad" and have seen no change so far. There are 4 more bisect steps to do, but I expected at least one change by now. Maybe it is in xorg and not mesa??? Before bisecting with git I was "bisecting" with the Arch and Ubuntu repo archives, and there I always had to move xorg and mesa together...

Also, I am now confused about the version string in the glxinfo output, because I went through a point where I expected it to say "17.2.8" and it still says "17.4.0". Is there a trick to know exactly which git revision the 17.2.8 release was built from? I might be bisecting against the wrong "good" target... :-(
The problem is now definitely between cf2e92262b4 (good) and 04ea03d99d8 (bad). There are still a number of bisect steps to do. Also, there is a chance of a minor issue on the other side of 04ea03d99d8, as it was not as bad as the complete end result, but the change on that side is really small, so I went on bisecting the bigger problem.
I am getting really close to bisecting the bigger identified problem.

bad:  b672c3833b7
good: aff1ad0798

Now at this point:

Bisecting: 9 revisions left to test after this (roughly 3 steps)
[3efedf98e8272d1a32a3837b22161582d28c4487] i965: Always flush caches after blitting to a GL buffer object.

Also, compilation is no longer taking an hour like before, when bisect was jumping across big time periods - the code barely changes between nearby revisions, I guess ;-)
Slowdown cause is found when bisecting:

8b3a257851905ff444d981e52938cbf2b36ba830 is the first bad commit
commit 8b3a257851905ff444d981e52938cbf2b36ba830
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Tue Jul 18 16:08:44 2017 -0400

    radeonsi: set a per-buffer flag that disables inter-process sharing (v4)

    For lower overhead in the CS ioctl.
    Winsys allocators are not used with interprocess-sharable resources.

    v2: It shouldn't crash anymore, but the kernel will reject the new flag.
    v3 (christian): Rename the flag, avoid sending those buffers in the BO list.
    v4 (christian): Remove setting the kernel flag for now

    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
My friend "Googling" at the commit hash we found that my problem might possibly relate to this other bug report already existing: https://bugs.freedesktop.org/show_bug.cgi?id=105171 They also report a slowdown, but not the amount I see. My slowdown is of a much higher percentage but I am on gallium/r300 code paths...
This is interesting. I have the exact same problem on an old Turion X2 laptop I've been given. I attributed it to memory controller contention (since the GPU resides in the northbridge and the memory controller is integrated into the CPU), but it feels unusually slow, to the point that I had to completely disable compositing (I'm running Ubuntu MATE 19.04), whereas an even slower Eee PC 901 has no trouble at all running compiz.

On another note, according to the specs, the R300 _should_ be able to run purely on modesetting/GLAMOR (I have a laptop with a Mobility Radeon X1600 and 128 MiB of VRAM running GLAMOR with only some very minor glitches), but the graphics are completely corrupted. This doesn't happen when using the radeon DDX.
Now seeing that the commit diff is actually small, I can put some printf logs in place to see what is going on and where the code ends up.
So it seems similar issues do not affect only me, then...
Created attachment 144427 [details] [review]
Simple diff/patch to test the issue

See the attached diff/patch for what I am trying as a quickfix and for testing what causes the issue. Not a "real fix", just meant for TESTING!!!

My suspicion is that the change in the "r600_buffer_common.c" file - setting the flag for the faster code path - might be needed for other cards too, but maybe it only happens for r600 because it lives in that file?! That is only a wild guess so far; I will test it with the lot of logging in the patch...

It is only provided for those who want to understand the logs I will post after building a version with this patch in place.
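In my head the faster path works roughly like this (only a sketch of my own understanding, with made-up helper names - not the actual Mesa code):

   /* Sketch of my understanding only; reclaim_from_cache() and
    * create_with_gem_create_ioctl() are made-up names for illustration.
    * Only buffers flagged as never shared between processes may be served
    * from the winsys buffer cache; everything else goes to the kernel with
    * a DRM_IOCTL_RADEON_GEM_CREATE every time. */
   if (flags & RADEON_FLAG_NO_INTERPROCESS_SHARING) {
      bo = reclaim_from_cache(ws, size, alignment);   /* reuse, no ioctl */
      if (bo)
         return bo;
   }
   return create_with_gem_create_ioctl(ws, size, alignment); /* kernel call */

If that is right, then since r300 never sets the flag, it never reaches the cache and spams GEM_CREATE.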
Created attachment 144431 [details] [review] possible fix Hey fellas, can you test this patch? Thanks.
Will try that proposed patch as soon as possible. It seems reasonable to me, as far as I understand it. So you basically set the flag for r300. I searched a lot for where to add this, but I was too much of a noob to tell where I was supposed to put it so it only affects/fixes my old card range...

Sorry for answering only now. I was in a wine garden ;-)
Hmmm... I have tried the possible fix and Extreme Tux Racer became fast again, however I still see a lot of "DRM_IOCTL_RADEON_GEM_CREATE" if I do:

strace -T -e ioctl glxgears 2>&1 | vim -

So things are fast now, but I still see these logged. Is this normal in any way? I am happy with the performance of course, I just don't understand this. Maybe I made some mistake...
GEM_CREATE just creates a buffer, you should always see them, but a lot less.
Hi! I have applied this "possible fix" patch to the (nearly) latest 19.x from git and I get an assertion error there:

glxinfo: ../src/gallium/winsys/radeon/drm/radeon_drm_bo.c:1023: radeon_winsys_bo_create: Assertion `heap >= 0 && heap < RADEON_MAX_CACHED_HEAPS' failed.
I also get the same error if I apply my simple hack of use_reusable_pool = true;

Applying the changes to the revision where things went wrong in the past seems to work, though (but I did not test it for very long periods).
I have added this before the assert:

fprintf(stderr, "radeon_get_heap_index(%u, %u): %d", domain, flags, heap);

And I see this logged before the assertion fails:

radeon_get_heap_index(6, 1): -1

Will look around more if possible.
Sorry! I have to add that the above printf only prints that if I use my hacky patch on latest git! I will check what it says with Marek's "proposed fix" patch, since the body of radeon_get_heap_index seems to require RADEON_FLAG_NO_INTERPROCESS_SHARING, so the above might be misleading without stating the exact circumstances. But I was running into the assert even with the proposed fix - I just wanted to see whether there is a difference between the hacky fix and the proper one in whether the assert appears or not...
In the assert problem, this is what the domain and flags mean when only my hack is in place:

domain(6) == RADEON_DOMAIN_GTT(2) | RADEON_DOMAIN_VRAM(4)
flags(1)  == RADEON_FLAG_GTT_WC (1 << 0)

So I get -1 even if I comment out this check in radeon_get_heap_index(..):

   /* Resources with interprocess sharing don't use any winsys allocators.
   if (!(flags & RADEON_FLAG_NO_INTERPROCESS_SHARING))
      return -1;*/

There is a switch statement in this function that switches on the domain parameter:

   switch (domain) {
   case RADEON_DOMAIN_VRAM:
      ...
   case RADEON_DOMAIN_GTT:
      ...
   default:
      break;
   }
   return -1;

I suspect we return -1 here because we fall into the "default" case. I think that is right, because domain is 6 == 2 | 4, i.e. GTT and VRAM at the same time. Maybe it was once meaningful to ask for both GTT (is that GART?) and VRAM or-ed together, but not anymore? I will try to add a hack that forces the domain to always be just VRAM to see what happens... I know that is bad, but it is just for the analysis.
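Something like this is what I mean - purely an analysis hack, not a fix:

   /* HACK, analysis only: collapse a combined GTT|VRAM request into plain
    * VRAM so radeon_get_heap_index() hits the VRAM case instead of falling
    * through to the default case and returning -1. */
   if (domain == (RADEON_DOMAIN_GTT | RADEON_DOMAIN_VRAM))
      domain = RADEON_DOMAIN_VRAM;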
Created attachment 144439 [details] [review]
Hacky fix working also for 19.x versions

I can confirm that the problem with the "possible fix" is that radeon_get_heap_index does not handle both the GTT and VRAM domains at the same time, but r300 still asks for that. I added a hacky fix that works on 19.x versions but forces the VRAM domain. This highlights the problematic areas for the new versions. I think this latter issue is bitrot in r300: we have now re-enabled a code path that never ran after the 17.x slowdown analysed before.

PS.: Of course the "possible fix" still works when applied directly to the 17.x commit that made things slow, and that is the better direction - just not on the latest code, because of this issue.
This is amazing work, Richard. I'd love to help you with testing, since I also have the same hardware, but unfortunately I don't have a build environment ready. I'm using padoka's PPA, though. FWIW, I'll be able to test a couple of weeks after the (proper) fix hits mainline Mesa. In any case, if you'd like to send me a prebuilt x86-64 lib, I can manually replace it and test on my laptop.
Hi Rui!

We should also thank Marek for proposing a possible fix. I am currently creating a "less hacky" solution based on that attachment, with a better idea than always forcing the VRAM domain: force VRAM only if the r300 caller asks for "VRAM or something", and force GTT if it asks for "GTT or something - but not VRAM". This basically makes the driver favor placing buffers in VRAM, while still honoring callers that ask for other domains directly. Also, incorporating the parts Marek proposes is much better and cleaner than my hack, which just sets the flag right before it is read.

After I have finished these changes and their testing I will publish them here, and only then will I maybe have time to compile an x86_64 version for other people. However, I highly suggest that you and others just grab the latest revision from git and apply the changesets until the solution lands in a mesa release. I am on a single-core, 32-bit laptop with 1512-128 MB of memory that I got in 2007, and I think most of the hardware in it is 2005-level technology, so I might be among the slowest people to cross-compile this - please do give putting together a build environment yourself a shot. It is not that hard to build mesa; this was my first time doing so too! Just believe in yourself and read the nice docs on the mesa page about building with meson. If you have any questions about the build process, look around phoronix or the mesa docs and forums, or just ask people. Even I can help now, and you will be able to as well after you build it yourself once ;-)
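The shape of the change I have in mind is roughly this (only a sketch of the idea, not the exact attachment):

   /* Sketch of the idea, not the exact patch: when the caller allows several
    * placements, pick a single domain bit so the heap lookup does not fall
    * into the default case. Favor VRAM when it is allowed, otherwise GTT. */
   if (domain & RADEON_DOMAIN_VRAM)
      domain = RADEON_DOMAIN_VRAM;
   else if (domain & RADEON_DOMAIN_GTT)
      domain = RADEON_DOMAIN_GTT;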
Created attachment 144442 [details] [review]
quickfix, working with 19.x mesa versions (tested: latest)

Added a proposed quickfix based on the "possible fix" from Marek. This version now works with the latest mesa, and I suspect it does not break anything for other cards either. It could still be cleaned up a bit by finding the r300 code paths that ask for the VRAM and GTT domains together and changing the code there, but it works so far.

This change favors the VRAM domain over the GTT domain when both are provided. For more available memory you can easily flip it the other way around, but this version favors speed over available memory.
What is interesting to me is that the number of CREATE calls in the strace output is still higher than it originally was before the first slowdown, but the performance seems to be the same after applying the quickfix to the latest 19.x from git. This still bugs me a bit, but as there is no visible degradation it is fine, I guess.
While bisecting the original issue I saw a point where speed was already bad, though not as bad as with the latest version - but mostly like it (at least 90-95% of the way there). So there might be a little more speedup to gain if someone (or I) bisects between 04ea03d99d8 and the latest to see where the minor slowdown is, but it might not be worth it, as it is so small it might amount to less than a 1% effect. I am just documenting this because I still see it in the strace output after the quickfix patch - but it is not visible or measurable when playing or running 3D apps.

At least there is now a quickfix solution based on Marek's attachment that works with 19.x too - enjoy ;-)
(In reply to Richard Thier from comment #28)
> Hi Rui!

Hi, Richard!

> We should also thank Marek for proposing a possible fix.

Yeah, Marek kicks ass, he's one of the original developers of r300g. :)

> I am on a single-core, 32-bit laptop with 1512-128 MB of memory that I got
> in 2007, and I think most of the hardware in it is 2005-level technology,
> so I might be among the slowest people to cross-compile this - please do
> give putting together a build environment yourself a shot.

My Xpress 200M laptop has a Turion X2 TL-60 with 4 GiB of RAM, so I'll just shut up. :P

> It is not that hard to build mesa; this was my first time doing so too!
> Just believe in yourself and read the nice docs on the mesa page about
> building with meson.

I build my own kernels straight from Linus' git, but that's trivial. I've never built Mesa, but I might as well just bite the bullet. ;) Something tells me I'll be needing it soon, when I try to get an X1950 XT running on my Power Mac G5. I have no idea about the current state of r300g on big-endian; it's most likely bitrotted (I'd love to be wrong, though).

Speaking of which (and totally off-topic), how hard and/or beneficial would it be to also implement shader caching on r300g?

> If you have any questions about the build process, look around phoronix or
> the mesa docs and forums, or just ask people. Even I can help now, and you
> will be able to as well after you build it yourself once ;-)

Great! I often lurk around Phoronix too. I don't think I'll need extra help after RTFM, but I appreciate it.

Thanks a lot, Richard! :)
I have compiled a "how to build" just for you - linking sources I used: http://ballmerpeak.web.elte.hu/devblog/building-your-own-mesa.html Everyone should be able to try the quickfix using this and the links on this page, but I guess even if questions arise there will be people answering them - just please do not clutter the bug report with these anymore. Write emails or ask around forums or something as this is a general topic unrelated to the slowdown bug. Wondering about two things now: - Is my quickfix patch working for other people too? - Is my quickfix good-enough or should we dig deeper and solve the domain bits issue other ways? (latter is a question towards more experiences devs)
Created attachment 144454 [details] [review]
possible fix v2

Please test this one.
(In reply to Marek Olšák from comment #34)
> Created attachment 144454 [details] [review] [review]
> possible fix v2
>
> Please test this one.

Hi, Marek,

Unfortunately I couldn't do it yesterday, but I'll try to do so today.

I'm also entertaining the idea of implementing support for shader disk caching, but I'd like your opinion before moving forward. I read the implementations in nouveau and r600 and they seemed simple enough (or, most likely, I'm missing something fundamental ;)). I don't know how beneficial it would be, though, since R300-R500 shaders are much simpler, so measurements would have to be done.
I think if you just implement the disk cache callback, it'll automatically enable the shader cache for the GLSL compiler.
(In reply to Marek Olšák from comment #36)
> I think if you just implement the disk cache callback, it'll automatically
> enable the shader cache for the GLSL compiler.

Hi, Marek

Yeah, that was my reasoning. Will try. :)
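Roughly what I have in mind, based on how other Gallium drivers seem to do it - just a sketch, where the new screen field, the build-id string and the exact disk_cache_create() arguments are my guesses, not a tested patch:

   #include "util/disk_cache.h"

   /* Sketch only: expose a per-screen disk cache through the
    * get_disk_shader_cache hook so the GLSL compiler can use it.
    * disk_shader_cache is an assumed new field on struct r300_screen and
    * build_id_string is a placeholder for a stable driver/build id. */
   static struct disk_cache *
   r300_get_disk_shader_cache(struct pipe_screen *pscreen)
   {
      return r300_screen(pscreen)->disk_shader_cache;
   }

   /* ...and at screen creation time, something like: */
   r300screen->disk_shader_cache =
      disk_cache_create("r300", build_id_string, 0);
   r300screen->screen.get_disk_shader_cache = r300_get_disk_shader_cache;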
Hello everyone!

While hopefully someone is testing Marek's solution, I went on to test whether the "minor problem" between 04ea03d99d8 and 19.x causes a slowdown or not. Although most of the heavy slowdown has gone away now, I am still unsure whether the extra ioctl calls in later versions (let's call it ioctl spamming) can cause a slowdown in some cases, so I plan to compare the performance of heavier games between a patched 19.x and the last working 17.x just to see how it is, and maybe start bisecting from 04ea03d99d8 if there is still a slowdown. It just gives me a bad feeling, so I want to be sure with heavier games than Extreme Tux Racer...
First I went on to strace a bit and measure directly...

10 sec glxgears - last fast 17.x:
- 2105 ioctl calls
- 6 DRM_IOCTL_RADEON_GEM_CREATE
- 7 DRM_IOCTL_GEM_CLOSE

10 sec glxgears - patched 19.x:
- 2916 ioctl calls
- 712 DRM_IOCTL_RADEON_GEM_CREATE
- 713 DRM_IOCTL_GEM_CLOSE

In comparison, this is the originally unpatched 19.x (10 sec):
-> 12466 (any) ioctl calls
-> 3111 DRM_IOCTL_RADEON_GEM_CREATE
-> 3112 DRM_IOCTL_GEM_CLOSE

Surely it is already an order of magnitude better now, but maybe still not as good as it should be. As I said, while bisecting there was a point where I saw this happening, and it hinted that there are two issues - but one is much smaller. glxgears however runs at roughly the same FPS now. Maybe things start to slow down only when I run out of CPU power, and glxgears doesn't run out now - neither does tuxracer - but I will try heavy things like Mount&Blade: Warband... it was working well before..
Okay... I cannot easily test heavier games because some of them crash with the last good 17.x (maybe because of missing external S3TC or something similar). I will still try to bisect the minor issue too, though, just to see what it is.
Hi, guys

I haven't tried Marek's solution yet. It's the first time I've built Mesa from scratch and I've never used Meson before, so I'm studying along the way. I applied the patch, configured with no DRI drivers and only the r300g driver, compiled and installed to the default location (/usr/local/lib/x86_64-linux-gnu, it seems). Unfortunately, glxinfo can't find the libraries when I point the environment variables to them, so I must have done something stupid. Will dig deeper.

(I know this isn't a developer support thread, I just wanted you to know I haven't given up or stopped!)
(In reply to Rui Salvaterra from comment #41)
> Hi, guys
>
> I haven't tried Marek's solution yet. It's the first time I've built Mesa
> from scratch and I've never used Meson before, so I'm studying along the
> way. I applied the patch, configured with no DRI drivers and only the r300g
> driver, compiled and installed to the default location
> (/usr/local/lib/x86_64-linux-gnu, it seems). Unfortunately, glxinfo can't
> find the libraries when I point the environment variables to them, so I
> must have done something stupid. Will dig deeper.
> (I know this isn't a developer support thread, I just wanted you to know I
> haven't given up or stopped!)

meson actually accepts a prefix, so I use it like this (from a BUILD dir inside the mesa tree):

meson ../ --prefix=/usr/X11R7 rest_of_options

(obviously, for x86-64 the prefix will be different; I just wanted to say I have a non-standard X prefix and install mesa into it just fine)

The full line in my case looks like:

meson ../ --prefix=/usr/X11R7 --strip --buildtype debugoptimized -Degl=true -Ddri-drivers=r100,r200,i965,nouveau -Dplatforms=drm,x11 -Dgallium-drivers=i915,r600,radeonsi,swrast,virgl,nouveau,r300 -Dvulkan-drivers=amd,intel -Dgallium-nine=true -Dgallium-opencl=icd -Dgallium-va=true -Dgallium-xvmc=true -Dgallium-xa=false

You probably want to add --reconfigure if you re-run meson from the same BUILD directory after a git pull.
Hi, everyone!

I finally did it. I was able to point the envvars in the right direction, tested Marek's patch and it seems to be working nicely. Compiz feels much snappier, indeed. Feel free to add:

Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>

Now that I'm able to do code-build-test cycles, it's time to implement on-disk shader caching. Stay tuned… :)
Created attachment 144469 [details] [review]
r300g: implement glsl shader caching.

And here's a preliminary patch. If it looks good enough to you, I'll send it officially to mesa-dev. :)
Yeah that looks good.
In comparison, this is the originally unpatched 19.x (10 sec):
-> 12466 (any) ioctl calls
-> 3111 DRM_IOCTL_RADEON_GEM_CREATE
-> 3112 DRM_IOCTL_GEM_CLOSE

^^Sorry, these values were bad; it is not that much, and I don't know how I measured this back then, but I cannot reproduce it anymore.

Actually it is not that easy to find the "more minor" issue, but I still feel it is there. The only way to find it is to bisect like this:

-> bisect ...
-> apply Marek's patches
-> see if the result has 7-8 CREATE calls or a bit more
-> stash away the patches
-> bisect ...

Only this way is it properly measurable, because "measuring" without the patch in place is much harder to do by eye.
(In reply to Marek Olšák from comment #45)
> Yeah that looks good.

Great! In that case, I'm going to do a proper version and send it. I don't like the fact that I'm doing chip_families[r300screen->caps.family] directly in r300_disk_cache_create, when we already have r300_get_name(struct pipe_screen* pscreen), so I'm going to factor that out into something like r300_get_family_name(struct r300_screen* r300screen).

Thanks for the review, Marek!
(In reply to Rui Salvaterra from comment #43)
> Compiz feels much snappier, indeed.

To be 100 % honest, the difference is noticeable, but my laptop has one big problem: as if the Xpress 200M wasn't slow enough, the screen has a resolution of 1680x1050 when the typical resolution, at the time, was 1280x800. That's 72 % more pixels to toss around on a GPU with shared memory. :(
> as if the Xpress 200M wasn't slow enough, the screen has a resolution of
> 1680x1050 when the typical resolution, at the time, was 1280x800.

I don't get it. Is there no supported resolution below that big one? That is quite hard to believe. Mine is 1024x768 when maxed out, but sometimes I reverse engineer game binaries that do not let me use 640x480, in case they lag haha. Fill-rate demand really grows a lot at high resolutions because of the square relation between linear screen size and pixel count...
(In reply to Richard Thier from comment #49)
> I don't get it. Is there no supported resolution below that big one? That
> is quite hard to believe. Mine is 1024x768 when maxed out, but sometimes I
> reverse engineer game binaries that do not let me use 640x480, in case they
> lag haha.

Heh… :) I only used Compiz at the native resolution of the LCD. I do have an old 1024x768 monitor with VGA input; I might give it a spin this weekend.
> I do have an old 1024x768 monitor with VGA input; I might give it a spin
> this weekend.

But doesn't the LCD also support smaller resolutions natively? Most of them do, even if their best native mode is a more exotic ratio or something. Then you just need to switch to something smaller with xrandr.
(In reply to Richard Thier from comment #51)
> > I do have an old 1024x768 monitor with VGA input; I might give it a spin
> > this weekend.
>
> But doesn't the LCD also support smaller resolutions natively? Most of them
> do, even if their best native mode is a more exotic ratio or something.
> Then you just need to switch to something smaller with xrandr.

Generally panels only have a single native mode. The GPU has a scaler that can scale a smaller image to the native timing. Many LCD monitors also have a scaler built in to provide the same functionality.
(In reply to Richard Thier from comment #49)
> I don't get it. Is there no supported resolution below that big one? That
> is quite hard to believe. Mine is 1024x768 when maxed out, but sometimes I
> reverse engineer game binaries that do not let me use 640x480, in case they
> lag haha.

1024x768?! Jesus, that's less than half the number of pixels, no wonder it's faster…! And yes, like Alex Deucher said, I've never seen an LCD with more than one native resolution. At different resolutions, there's always scaling involved (either at the panel or at the GPU).
> The GPU has a scaler that can scale a smaller image to the native timing.
> Many LCD monitors also have a scaler built in to provide the same
> functionality.

I see, but I still see a big performance rise when I switch to a smaller resolution and let this scaler do its work as the last step. I have no idea whether there is a real speed difference between these two:

- having 640x480 on a 4:3 aspect screen without a scaler
- having hardware pixels of 1024x768, but rendering 640x480 in software and using the scaler at the end

Either way:

- Z tests run for a smaller number of pixels
- the stencil and Z buffers are smaller, just like the backbuffer
- etc.

Memory usage on the GPU should still be smaller, pixel shaders still run fewer times, post-processing still runs on a smaller number of pixels, etc... The only way it could be slower is if for some reason the scaler circuit were so slow at doing its job that it could not keep up the FPS, but I think it is highly specialized hardware doing nothing else, so I hoped it is just "fast", and even if there is some loss I guess it is barely measurable - but tell me if I am wrong, because a really measurable overhead here would be new to me.

In any case, I think it is better to use the same aspect ratio at a smaller resolution than the biggest possible native size. Actually, on my machine I get a faster frame rate even if I run a game in a 640x480 window in the top left corner, in the same way as if I switch the screen to 640x480 itself, and even then I barely see a difference, although in that case I feel there should be some overhead compared to using a video mode for it directly..
Performance will be better because the 3D engine is rendering to a smaller surface (e.g., 640x480) regardless of whether you are fullscreen or windowed. The size of the rendered image is what impacts 3D performance. The scaler is fixed-function hw, so it doesn't impact the 3D engine directly other than using some memory bandwidth to read the source image.
And speaking of performance, I wonder if this will have any effect… I'll try it later. https://git.congatec.com/android/qmx6_kernel/commit/a0a53aa8c7b491a43e2ef66786f9511bae8cbc35
Hello everyone!

I have found one more place where we can end up making extra GEM_CREATE ioctl calls. I remember telling you above, after the very first proposed patch of Marek, that "the number of CREATE calls in the strace output is still higher than it originally was before the first slowdown".

I went on to find the cause, because I was using his patch while bisecting the minor issue, and this makes it really hard to spot the minor issue I mentioned earlier. So I decided to find out what causes the 600-700 CREATE calls despite the performance seeming fast.

This is what I did:

   /* Shared resources don't use cached heaps. */
   if (use_reusable_pool) {
      int heap = radeon_get_heap_index(domain, flags);
      assert(heap >= 0 && heap < RADEON_MAX_CACHED_HEAPS);
      usage = 1 << heap; /* Only set one usage bit for each heap. */

      pb_cache_bucket = radeon_get_pb_cache_bucket_index(heap);
      assert(pb_cache_bucket < ARRAY_SIZE(ws->bo_cache.buckets));

      bo = radeon_bo(pb_cache_reclaim_buffer(&ws->bo_cache, size, alignment,
                                             usage, pb_cache_bucket));
      if (bo)
         return &bo->base;
   } else {
      // first only logs were here
      fprintf(stderr, "1 KULAKVA!!!\n");
      fprintf(stderr, "2 KULAKVA!!!\n");
      fprintf(stderr, "3 KULAKVA!!!\n");
      // this segfault got added later
      uint kula = 0;
      int *kulakva = (int*)kula;
      *kulakva = 42;
   }

I saw that I can still reach the code path where "KULAKVA" gets printed, while in the very original code this was of course never the case, as the whole if and the bool variable deciding whether to use the reusable pool did not exist yet.

I am now recompiling with debug symbols so that running glxgears under gdb will hopefully fail at the last line of the else clause and give me a backtrace showing where I am coming from ;-)
The output from gdb glxgears at the segfault:

#0  0xb75a8892 in radeon_winsys_bo_create (rws=0x45e100, size=4096, alignment=4096, domain=RADEON_DOMAIN_GTT, flags=(unknown: 0)) at radeon_drm_bo.c:993
#1  0xb75a972b in radeon_cs_create_fence (rcs=rcs@entry=0xb06a4010) at radeon_drm_cs.c:753
#2  0xb75aa327 in radeon_drm_cs_flush (rcs=0xb06a4010, flags=2, pfence=0xbffff7d8) at radeon_drm_cs.c:593
#3  0xb757845a in r300_flush_and_cleanup (r300=r300@entry=0x446090, flags=flags@entry=2, fence=fence@entry=0xbffff7d8) at r300_flush.c:56
#4  0xb757860f in r300_flush (pipe=0x446090, flags=2, fence=0xbffff7d8) at r300_flush.c:82
#5  0xb731508f in st_context_flush (stctxi=0x55ed20, flags=3, fence=0xbffff7d8) at state_tracker/st_manager.c:635
#6  0xb74511f8 in dri_flush (cPriv=0x445fd0, dPriv=0x5760d0, flags=3, reason=__DRI2_THROTTLE_SWAPBUFFER) at dri_drawable.c:568
#7  0xb7f9dbe8 in dri2Flush (psc=psc@entry=0x4165d0, ctx=<optimized out>, draw=draw@entry=0x5590e0, flags=3, throttle_reason=__DRI2_THROTTLE_SWAPBUFFER) at dri2_glx.c:553
#8  0xb7f9e09d in dri2SwapBuffers (pdraw=0x5590e0, target_msc=0, divisor=0, remainder=0, flush=1) at dri2_glx.c:845
#9  0xb7f73049 in glXSwapBuffers (dpy=0x407160, drawable=18874370) at glxcmds.c:843
#10 0x00401742 in ?? ()
#11 0xb7b5c669 in __libc_start_main () from /usr/lib/libc.so.6
#12 0x00401cd5 in ?? ()

I have changed the relevant part to add the flag:

   /* Create a fence, which is a dummy BO. */
   fence = cs->ws->base.buffer_create(&cs->ws->base, 1, 1,
                                      RADEON_DOMAIN_GTT,
                                      RADEON_FLAG_NO_SUBALLOC |
                                      RADEON_FLAG_NO_INTERPROCESS_SHARING);

Currently I am testing this on latest, but the build is still going. I will upload a patch for it only if it works. I think it does no harm, but this happens when the bo is a "fence". I have no idea what a fence is in this terminology, but it surely sounds like some synchronization primitive, so I am unsure whether I can add the flag safely; however, knowing that this flag did not exist in the earlier (fast) code at all, I can likely just add it. We will see how this shrinks the GEM_CREATE call numbers.
Richard, your change for create_fence is correct.
Created attachment 144481 [details] [review]
Possible fix v3

Added an attachment for v3, which also adds the flag at the create_fence call. Now there are only 6 (!) GEM_CREATE ioctl calls in 10 seconds of glxgears on 19.x versions, just like before the regression. Performance seems very close to v2, so this was not causing any big change.
I still get some slowdown when playing Mount&Blade: Warband, which worked before, but there is a chance it is something else, or not even in mesa. On my last system I was playing it with wine 1.8 and now with the latest wine-staging 4.x, so there are just too many variables. I have played a bit of Urban Terror though, and it became smooth and playable once again. Many times I actually get 60 FPS at the top end, and 18-20 FPS when looking at big areas. Before our patches Urban Terror ran like a slideshow and was completely unplayable.

Also, I have built my mesa using -O3 now, and updated my blog with notes on how to pass compiler parameters such as debug symbols or this. I think the latter helped gain a bit more FPS too, but there is always room for improvement. The original problem this bug was about seems solved now.

PS.: I have issues/glitches with HyperZ turned on. Should I open a different bug report for that one? I think it belongs somewhere else, as it might not even be a regression and I feel it never worked for my card.
I have added a blog post about the whole issue and the process of solving it here:

http://ballmerpeak.web.elte.hu/devblog/debugging-mesa-and-the-linux-3d-graphics-stack.html

I have linked to this bug report and to every other place and person involved in the communication around solving this slowdown issue. I just wanted to put things together, as it might be valuable for others wanting to do similar things - I was, and still am, a rookie at these things after all.
(In reply to Richard Thier from comment #62)
> I have added a blog post about the whole issue and the process of solving
> it here:
>
> http://ballmerpeak.web.elte.hu/devblog/debugging-mesa-and-the-linux-3d-
> graphics-stack.html
>
> I have linked to this bug report and to every other place and person
> involved in the communication around solving this slowdown issue.
>
> I just wanted to put things together, as it might be valuable for others
> wanting to do similar things - I was, and still am, a rookie at these
> things after all.

Nice article! I'll read it thoroughly when I get home, probably while also testing your v3 and trying to squeeze some performance numbers out of my patch. ;)

Cheers!
Rui!

Btw, I am also already running your patch and have not seen many problems so far, but most of the stuff I am running is DX7 and DX8 level in graphics settings, so I do not know how many times I actually exercised your code - it is built into my mesa, though.

Also, Mount&Blade is working well now too! It was misconfigured in the new wine to use CSMT, and that was über-slow on my single-core machine, leading to starvation and a lot of overhead that was even visible in the perf output.

So far everything works at full speed now!

I am thinking about maybe trying to figure out HyperZ on r300. According to the feature matrix it is untested, and I see bad glitches but elevated performance. I have already made some changes to try it... But if I continue with the HyperZ stuff it will likely be in another bug report, just for the sake of transparency, since this one seems pretty much completely solved...
(In reply to Richard Thier from comment #64)
> Rui!
>
> Btw, I am also already running your patch and have not seen many problems
> so far, but most of the stuff I am running is DX7 and DX8 level in graphics
> settings, so I do not know how many times I actually exercised your code -
> it is built into my mesa, though.

Good to know! It worked fine here, too.

> Also, Mount&Blade is working well now too! It was misconfigured in the new
> wine to use CSMT, and that was über-slow on my single-core machine, leading
> to starvation and a lot of overhead that was even visible in the perf
> output.
>
> So far everything works at full speed now!

Great!

> I am thinking about maybe trying to figure out HyperZ on r300. According to
> the feature matrix it is untested, and I see bad glitches but elevated
> performance. I have already made some changes to try it...
>
> But if I continue with the HyperZ stuff it will likely be in another bug
> report, just for the sake of transparency, since this one seems pretty much
> completely solved...

Yes, I think it's a good idea to open a new bug. By the way, I have a vague memory of HiZ on the R4xx being exclusive (i.e., only the first app to use the Z-buffer gets HiZ), but I could be completely wrong.
> Yes, I think it's a good idea to open a new bug.

Added a separate bug for working on the HyperZ glitches:

https://bugs.freedesktop.org/show_bug.cgi?id=110897

> By the way, I have a vague memory of HiZ on the R4xx being exclusive (i.e.,
> only the first app to use the Z-buffer gets HiZ), but I could be completely
> wrong.

It is exclusive, but not to the first app, it seems. The code that acquires HyperZ ownership is in the buffer clearing function, and I am looking at it with my own eyes right now, so I think it is not only the first app that can get it: if that app exits, one of the next ones (whichever gets clear called first) automatically takes it.

> It worked fine here, too.

Is there any indicator to look for? Like shader files on disk somewhere, or near the binary being run, or in the current dir, with some particular name?
(In reply to Richard Thier from comment #66)
> Is there any indicator to look for? Like shader files on disk somewhere, or
> near the binary being run, or in the current dir, with some particular
> name?

Yes, it's easy to see if it's being used. The cache is per user and is likely stored at ~/.cache/mesa_shader_cache (or under the path pointed to by XDG_CACHE_HOME, if your distro sets it).
I've pushed the fix. Closing.
*** Bug 105171 has been marked as a duplicate of this bug. ***