Bug 110781 - Radeon: heavy r300 performance drop regression between 11.x and 19.x
Summary: Radeon: heavy r300 performance drop regression between 11.x and 19.x
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/r300
Version: git
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Duplicates: 105171
Depends on:
Blocks:
 
Reported: 2019-05-28 11:51 UTC by Richard Thier
Modified: 2019-06-24 08:44 UTC
CC: 4 users

See Also:
i915 platform:
i915 features:


Attachments
Simple diff/patch to test the issue (3.63 KB, patch)
2019-06-03 13:36 UTC, Richard Thier
possible fix (4.20 KB, patch)
2019-06-03 18:52 UTC, Marek Olšák
Hacky fix working also for 19.x versions (2.20 KB, patch)
2019-06-04 08:24 UTC, Richard Thier
quickfix, working with 19.x mesa versions (tested: latest) (5.91 KB, patch)
2019-06-04 10:23 UTC, Richard Thier
possible fix v2 (4.66 KB, patch)
2019-06-04 23:09 UTC, Marek Olšák
r300g: implement glsl shader caching. (3.56 KB, patch)
2019-06-06 19:50 UTC, Rui Salvaterra
Possible fix v3 (5.50 KB, patch)
2019-06-08 06:51 UTC, Richard Thier

Description Richard Thier 2019-05-28 11:51:35 UTC
Dear mesa/freedesktop team!

I am a happy user of the open source radeon r300 driver for my Mobility radeon 200M card in my (pretty old, but good-enough) laptop.

I have updated my system and changed distro, and got a severe slowdown. I checked whether it was the kernel or the distro, and found that I get my performance back if I revert to Mesa 11.x and the corresponding Xorg while still using the latest Linux kernel. This seems to be some kind of performance regression.

The performance drop is heavy: a 50-100% slowdown and very high CPU usage. For example, Extreme Tux Racer now reports 100% CPU usage, whereas before it reported 25-50% at maximum (including any spikes) and mostly around 30% on average.

I have used the perf tool to locate what causes the heavy CPU usage, and I found a lot of memory movement inside a create_bo call.

  See this log:
  Samples: 171K of event 'cycles', Event count (approx.): 67632101793                                                                              
    Children      Self  Command      Shared Object              Symbol                                                                             
  -   61,12%     0,09%  etr          [kernel.vmlinux]           [k] entry_SYSENTER_32
   - 61,07% entry_SYSENTER_32                                                                                                                   
    - 60,94% do_fast_syscall_32                                                                                                               
     - 57,92% sys_ioctl                                                                                                                     
      - 57,84% do_vfs_ioctl                                                                                                               
       - 57,43% radeon_drm_ioctl                                                                                                        
        - 57,04% drm_ioctl                                                                                                            
         - 56,81% drm_ioctl_kernel                                                                                                  
          - 55,86% radeon_gem_create_ioctl                                                                                        
           - 55,46% radeon_gem_object_create                                                                                    
            - 55,36% radeon_bo_create                                                                                         
             - 55,20% ttm_bo_init                                                                                           
              - 55,14% ttm_bo_init_reserved                                                                               
               - 54,75% ttm_bo_validate                                                                                 
                - 54,42% ttm_bo_handle_move_mem                                                                       
                 - 54,07% ttm_tt_bind                                                                               
                  - 53,36% radeon_ttm_tt_populate                                                                 
                   + 53,33% ttm_populate_and_map_pages                                                          
                     0,62% radeon_ttm_backend_bind

I see these are kernel code paths, but the same thing happens whether I use an old kernel (and modules) or the newest one, while the overhead almost completely disappears when I revert to the old Mesa and X.

I do not see anything related to bo (buffer object??) creation in the Mesa sources below the gallium/r300 directory, but I did find code issuing ioctls in:

gallium/winsys/radeon/drm/radeon_drm_bo.c

^^ Is this also used for r300 cards? The "source tree" documentation page seems to say this is "shared code for r600 and radeonsi", but where does r300 issue its ioctls then?

Something happened in the last 1-3 years that, for some reason, made things move memory around like crazy and use more CPU for it. It surely did not have this heavy slowdown before. Now it is nearly as slow as llvmpipe in practical cases (but not slower!).

Can anyone help me with this? I am a developer myself, but I am not well versed in the Mesa source code or in how to analyse its performance bottlenecks.

PS.: I have already been analysing the problem on Phoronix for a while:

https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1099745-how-to-tell-if-a-driver-is-gallium-or-just-mesa-slow-renderng-with-radeon

^^ The whole process of what I tried is written up there step by step, but probably only the perf outputs are of interest...

Feel free to ask me anything about the issue. I would be happy to help solve this myself too, but I have never really written patches for these kinds of core system libraries, and I am quite a rookie at GPU drivers...
Comment 1 Richard Thier 2019-05-28 11:54:30 UTC
PS.: As you can see, I already have the Mesa sources and can build and test my own in case anyone has an idea what I should look for...
Comment 2 Richard Thier 2019-05-28 16:19:27 UTC
"gallium/winsys/radeon/drm/radeon_drm_bo.c" is indeed used by r300 - I verified it by putting a log statement in it.
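
For reference, the probe was nothing fancy - a single log line of roughly this shape, dropped into the buffer creation function in that file (the exact wording and placement here are hypothetical, not from any patch):

    /* Hypothetical probe in radeon_drm_bo.c's buffer creation path: if this
     * prints while an r300 app runs, r300 goes through this winsys.
     * Needs <inttypes.h> for PRIu64. */
    fprintf(stderr, "radeon bo create: size=%" PRIu64 " domain=0x%x\n",
            (uint64_t)size, domain);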
Comment 3 Richard Thier 2019-05-29 10:16:59 UTC
This is what I get when running strace:

  ...
  ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0xbfafd880) = 0 <0.000068>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_RADEON_CS, 0xafe2404c) = 0 <0.000102>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfafd9c4) = 0 <0.000030>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_GEM_CLOSE, 0xbfafd99c) = 0 <0.000043>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0xbfafd880) = 0 <0.000070>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_RADEON_CS, 0xafe380e0) = 0 <0.000088>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfafd9c4) = 0 <0.000029>
   > [vdso]() [0x891]
  ioctl(6, DRM_IOCTL_GEM_CLOSE, 0xbfafd99c) = 0 <0.000047>
   > [vdso]() [0x891]
  ...

These appear continuously in the Mesa+Xorg combination that is slow. I have no idea whether the same happens in the old 11.x version where speed is still good (sadly I could not yet try a Mesa version between 11.x and 19.x, but I guess some of them are still fast).

I have tried reverting the following changes manually in my mesa.git version:

1. https://github.com/mesa3d/mesa/commit/9ac5504df5c31e60801d9d37c1f3d547c486bf0a
2. https://github.com/mesa3d/mesa/commit/a1e391e39df2b1d8169e773a30153167ab8e13e8
3. https://github.com/mesa3d/mesa/commit/0e40c6a7b70673734dfecf5957c086b30f11befe

Actually, after reverting the last one, things got 1 FPS slower than with unchanged 19.x, but the first two helped a very small amount.

In my journey to find the slowdown I started looking around "radeon_create_bo" and "radeon_winsys_bo_create", which is why I tried reverting exactly these changes - a git blame lookup suggested they might be relevant.

If anyone knows where to look further or has any idea about my problem (maybe at the level of X, or other parts of Mesa creating a lot of "bo"s?), please tell me :-)

What is a "bo" in this sense, btw? Is there any documentation I should read to understand these acronyms in the code? It took a considerable amount of time to work out that "cs" is some kind of "command stream" (still unsure) and that a "bo" is some kind of "buffer object", while I have no idea about "pb_" and a lot of other things - I don't even know whether a bo is a general buffer for anything (vertex buffers, constant buffers, backbuffers, z-buffers, who knows what else) or just for one specific thing here. Is there a recommended reading list before a noob like me touches the code?
Comment 4 Richard Thier 2019-05-29 10:32:42 UTC
Btw, what is a "slab buffer" or "slab"? It bugs me, and I keep seeing it everywhere in the code too...
Comment 5 Richard Thier 2019-05-29 15:49:02 UTC
Still fast with mesa 17.2.8 and X.Org X Server 1.19.5

The problem is somewhere between 17.x and 19.x mesa versions (and corresponding xorg).

Also, I made an strace on an older system where things are good, and the number of CREATE and CLOSE ioctl calls (and also the number of CS ioctl calls) is an order of magnitude smaller than on 19.x!

For example, 10-20 seconds of glxgears leads to 9-10 calls to DRM_IOCTL_RADEON_GEM_CREATE on Mesa 17.2.8, while it leads to 708 (!!!) of the same calls in the same period on Mesa 19.x! That is surely quite a big difference!

The corresponding pattern in 17.x almost never creates a new GEM object:

  ...
  ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000055>
  ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000022>
  ioctl(6, DRM_IOCTL_RADEON_CS, 0xb307d03c) = 0 <0.000089>
  ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000053>
  ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000023>
  ioctl(6, DRM_IOCTL_RADEON_CS, 0xb30910d0) = 0 <0.000095>
  ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000054>
  ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000023>
  ioctl(6, DRM_IOCTL_RADEON_CS, 0xb307d03c) = 0 <0.000090>
  ...

Sometimes, when the *_BUSY ioctl call returns -1, it issues a CREATE, but otherwise it does not.

I think GEM is some kind of memory handler for the GPU (just like "ttm" in the perf output), and I think some Mesa update between 17.x and 19.x messed up the memory handling scheme for the Mobility Radeon 200M (r300).

Will try to bisect to a closer version, as 17.2.8 is from 2017-12-22...
Comment 6 Richard Thier 2019-05-29 23:17:27 UTC
Latest bisect result so far:

Fast mesa version: 17.2.8 (2017 December 22)
Slow mesa version: 18.0.5 (2018 June 3)

The change that slows the system down is between the two. The strace outputs are generally the same as before, so they are not worth sharing, but I have them and they differ the same way as described above - more GEM_CREATE ioctl calls in the slow version.
Comment 7 Richard Thier 2019-05-30 12:23:09 UTC
Latest bisect result so far:

Fast mesa version: 17.2.8 (2017 December 22)
Slow mesa version: 18.0.0 (git: 44c7d1aa2ea)

The change that slows the system down is between the two.
Comment 8 Richard Thier 2019-05-31 08:21:07 UTC
I have become a bit uncertain now: after every bisect step I choose "git
bisect bad" and have seen no change so far. There are 4 more bisect steps
to do, but I expected at least one change by now.

Maybe it is in Xorg and not Mesa??? Before bisecting with git I was
"bisecting" with the Arch and Ubuntu repo archives, and there I always
had to move Xorg and Mesa together...

Also, I am now confused about the version string in the glxinfo output,
because I went past a point where I expected it to say "17.2.8" and it
still says "17.4.0". Is there a trick to know exactly which git revision
the 17.2.8 release was built from? I might be bisecting against the
wrong "good" target... :-(
Comment 9 Richard Thier 2019-06-02 20:42:43 UTC
The problem is now surely between cf2e92262b4 (good) and 04ea03d99d8 (bad). There are still a number of bisect steps to do.

Also, there is a chance of a minor issue on the other side of 04ea03d99d8, as it was not as bad there as the complete end result; but the difference on that side is really small, so I went on bisecting the bigger problem.
Comment 10 Richard Thier 2019-06-03 08:47:33 UTC
I am getting really close to bisecting the bigger identified problem.

bad: b672c3833b7
good: aff1ad0798

Now at this point:

Bisecting: 9 revisions left to test after this (roughly 3 steps)
[3efedf98e8272d1a32a3837b22161582d28c4487] i965: Always flush caches after blitting to a GL buffer object.

Also, compilation no longer takes close to an hour like before, when bisect was jumping across big time periods - the code barely changes between nearby revisions, I guess ;-)
Comment 11 Richard Thier 2019-06-03 10:11:35 UTC
The cause of the slowdown was found by bisecting:

8b3a257851905ff444d981e52938cbf2b36ba830 is the first bad commit
commit 8b3a257851905ff444d981e52938cbf2b36ba830
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Tue Jul 18 16:08:44 2017 -0400

    radeonsi: set a per-buffer flag that disables inter-process sharing (v4)
    
    For lower overhead in the CS ioctl.
    Winsys allocators are not used with interprocess-sharable resources.
    
    v2: It shouldn't crash anymore, but the kernel will reject the new flag.
    v3 (christian): Rename the flag, avoid sending those buffers in the BO list.
    v4 (christian): Remove setting the kernel flag for now
    
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Comment 12 Richard Thier 2019-06-03 10:49:28 UTC
Googling the commit hash with a friend, we found that my problem might relate to this already-existing bug report:

https://bugs.freedesktop.org/show_bug.cgi?id=105171

They also report a slowdown, but not of the magnitude I see. My slowdown is a much higher percentage, but then I am on the gallium/r300 code paths...
Comment 13 Rui Salvaterra 2019-06-03 12:35:38 UTC
This is interesting. I have the exact same problem on an old Turion X2 laptop I've been given. I attributed it to memory controller contention (since the GPU resides in the northbridge and the memory controller is integrated on the CPU), but it feels unusually slow, to the point I had to completely disable compositing (I'm running Ubuntu MATE 19.04), whereas an even slower Eee PC 901 has no trouble at all running compiz.

On another note, according to the specs, the R300 _should_ be able to run purely on modesetting/GLAMOR (I have a laptop with a Mobility Radeon X1600 and 128 MiB of VRAM running GLAMOR with only some very minor glitches), but the graphics are completely corrupted. This doesn't happen when using the radeon DDX.
Comment 14 Richard Thier 2019-06-03 12:50:12 UTC
Now seeing that the commit diff is actually small, I can put some printf logs in place to see what is going on and where the code ends up.
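
For those following along, the relevant shape of the change in radeon_winsys_bo_create() is roughly this (a condensed sketch based on the code as I read it, not the literal diff):

    /* Buffers without RADEON_FLAG_NO_INTERPROCESS_SHARING now skip the
     * reusable pool entirely, so each allocation issues a fresh
     * DRM_IOCTL_RADEON_GEM_CREATE instead of reclaiming a cached BO. */
    bool use_reusable_pool = flags & RADEON_FLAG_NO_INTERPROCESS_SHARING;
    if (use_reusable_pool) {
        bo = radeon_bo(pb_cache_reclaim_buffer(&ws->bo_cache, size, alignment,
                                               usage, pb_cache_bucket));
        if (bo)
            return &bo->base;
    }
    /* ...otherwise fall through to the GEM_CREATE ioctl. */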
Comment 15 Richard Thier 2019-06-03 12:50:31 UTC
It seems similar issues affect more than just me, then...
Comment 16 Richard Thier 2019-06-03 13:36:49 UTC
Created attachment 144427 [details] [review]
Simple diff/patch to test the issue

See the attached diff/patch for what I am trying as a quickfix and for testing what causes the issue. It is not a "real fix", it is only meant for TESTING!!!

My suspicion is that the change in the "r600_buffer_common.c" file - setting the flag for the faster code path - might be needed for other cards too, but maybe it only happens for r600 because it lives in this file?!

That is only my wild guess so far, and I will test it with all the logging in the patch... It is provided only for those who want to understand the logs I will post after building a version with this patch in place.
Comment 17 Marek Olšák 2019-06-03 18:52:51 UTC
Created attachment 144431 [details] [review]
possible fix

Hey fellas, can you test this patch? Thanks.
Comment 18 Richard Thier 2019-06-03 20:19:02 UTC
Will try the proposed patch as soon as possible. It seems reasonable to me as far as I understand it: you basically set the flag for r300. I searched a lot for where to add this, but I was too much of a noob to tell where to put it so that it would only affect/fix my old card range...

Sorry for answering only now, I was in a wine garden ;-)
Comment 19 Richard Thier 2019-06-03 20:51:25 UTC
Hmmm... I have tried the possible fix and Extreme Tux Racer became fast again; however, I still see a lot of "DRM_IOCTL_RADEON_GEM_CREATE" if I do an

strace -T -e ioctl glxgears 2>&1 | vim -

So things are fast now, but I still see these logged. Is this normal in any way? I am happy with the performance of course, I just don't understand this.

Maybe I made some mistake...
Comment 20 Marek Olšák 2019-06-04 04:22:24 UTC
GEM_CREATE just creates a buffer; you should always see some, just a lot fewer.
Comment 21 Richard Thier 2019-06-04 05:50:45 UTC
Hi!

I have applied this "possible fix" patch to the (nearly) latest 19.x from git and I get an assertion error there.

glxinfo: ../src/gallium/winsys/radeon/drm/radeon_drm_bo.c:1023: radeon_winsys_bo_create: Assertion `heap >= 0 && heap < RADEON_MAX_CACHED_HEAPS' failed.
Comment 22 Richard Thier 2019-06-04 06:30:36 UTC
I also get the same error if I apply my simple hack of use_reusable_pool = true;

Applying the changes to the revision where things originally went wrong seems to work, though (but I did not test for very long periods).
Comment 23 Richard Thier 2019-06-04 06:48:00 UTC
I have added this before the assert:

fprintf(stderr, "radeon_get_heap_index(%u, %u): %d\n", domain, flags, heap);

And I see this logged out before the assertion fail:

radeon_get_heap_index(6, 1): -1

Will look around more if possible.
Comment 24 Richard Thier 2019-06-04 06:53:29 UTC
Sorry! I have to add that the above printf only prints this if I use my hacky patch on latest git! I will check what it says with Marek's "proposed fix" patch, since the body of radeon_get_heap_index seems to require RADEON_FLAG_NO_INTERPROCESS_SHARING, so the above might be misleading without the exact circumstances spelled out.

But I was running into the assert even with the proposed fix - I just wanted to see whether the assert shows up differently between the hacky fix and the proper one...
Comment 25 Richard Thier 2019-06-04 07:56:23 UTC
For the assert problem, these are the domain and flag values when only my hack is in place:

domain(6) == RADEON_DOMAIN_GTT(2) | RADEON_DOMAIN_VRAM(4)
flags(1) == RADEON_FLAG_GTT_WC (1 << 0)

So I get -1 even if I comment out this check in radeon_get_heap_index(..):

    /* Resources with interprocess sharing don't use any winsys allocators. 
    if (!(flags & RADEON_FLAG_NO_INTERPROCESS_SHARING))
        return -1;*/

There is a switch statement in this function where they switch based on the domain parameter:

    switch (domain) {
    case RADEON_DOMAIN_VRAM:
      ...
    case RADEON_DOMAIN_GTT:
      ...
    default:
        break;
    }
    return -1;

I suspect we return -1 because we fall into the "default" case. That makes sense, because domain is 6 == 2 | 4 == GTT and VRAM at the same time.

Maybe it was once meaningful to have both GTT (is that GART?) and VRAM OR-ed together, but not anymore?

I will try to add a hack to make the domain always just VRAM to see what happens... I know that is bad, but it is just for the analysis.
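
Something like this, right before the heap lookup (an analysis hack only, not a fix):

    /* Analysis hack: collapse the combined domain to VRAM only, so
     * radeon_get_heap_index() no longer falls into the default case. */
    if (domain == (RADEON_DOMAIN_GTT | RADEON_DOMAIN_VRAM))
        domain = RADEON_DOMAIN_VRAM;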
Comment 26 Richard Thier 2019-06-04 08:24:41 UTC
Created attachment 144439 [details] [review]
Hacky fix working also for 19.x versions

I can confirm that the problem with the "possible fix" is that radeon_get_heap_index does not handle the GTT and VRAM domains being requested at the same time, while r300 still asks for that.

I added a hacky fix that works on 19.x versions but forces the VRAM domain. This highlights problematic areas in the new versions. I think this latter issue is bitrot in r300: we have now re-enabled a code path that never ran after the 17.x slowdown analysed above.

PS.: Of course the "possible fix" still works when applied directly on top of the 17.x commit that made things slow, and that is the better direction - it just doesn't apply to the latest code because of this issue.
Comment 27 Rui Salvaterra 2019-06-04 09:43:37 UTC
This is amazing work, Richard. I'd love to help you with testing, since I also have the same hardware, but unfortunately I don't have a build environment ready. I'm using padoka's PPA, though. FWIW, I'll be able to test a couple of weeks after the (proper) fix hits mainline Mesa. In any case, if you'd like to send me a prebuilt x86-64 lib, I can manually replace it and test on my laptop.
Comment 28 Richard Thier 2019-06-04 10:02:02 UTC
Hi Rui!

We should also thank Marek for proposing a possible fix. I am currently creating a "less hacky" solution based on that attachment, with a better idea than always forcing the VRAM domain: force VRAM only if the r300 caller asks for "VRAM or something", and force GTT if it asks for "GTT or something - but not VRAM". This basically makes the driver favor placing buffers in VRAM while still supporting callers that ask for other domains directly.

Also, incorporating the parts Marek proposes is much better and cleaner than my hack that just sets the flag right before it is read.

After I finish these changes and their testing I will publish them here, and only then might I have time to compile an x86_64 version for other people. I highly suggest, however, that you and others just grab the latest revision from git and apply the changesets until the solution arrives in a Mesa release. I am on a single-core, 32-bit laptop with 1512-128 MB of memory that I got in 2007, and I think most of its hardware is 2005-era technology, so I might be among the slowest people to cross-compile - please give putting together a build environment a shot yourself.

It is not that hard to build Mesa; this was my first time doing so too! Just believe in yourself and read the nice docs on the Mesa page about building with Meson. If you have any questions about the build process, look around Phoronix or the Mesa docs and forums, or if you give up you can just ask people. Even I can help now, and you will be able to as well after you build it yourself once ;-)
Comment 29 Richard Thier 2019-06-04 10:23:06 UTC
Created attachment 144442 [details] [review]
quickfix, working with 19.x mesa versions (tested: latest)

Added a proposed quickfix based on Marek's "possible fix".

This version now works with the latest Mesa, and I suspect it does not break anything for other cards either. It could still be cleaned up a bit by finding which r300 code paths ask for the VRAM and GTT domains together and changing the code there, but it works so far.

This change favors the VRAM domain over the GTT domain when both are provided. For more available memory you could easily flip it the other way around, but this version favors speed over available memory.
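
The core of the idea is roughly this (a sketch only; the attached patch is the authoritative version):

    /* Quickfix sketch: when the caller passes a combined domain, favor
     * VRAM; fall back to GTT only when VRAM was not requested.  Swap the
     * branches to favor available memory over speed instead. */
    if (domain & RADEON_DOMAIN_VRAM)
        domain = RADEON_DOMAIN_VRAM;
    else if (domain & RADEON_DOMAIN_GTT)
        domain = RADEON_DOMAIN_GTT;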
Comment 30 Richard Thier 2019-06-04 10:27:06 UTC
What is interesting to me is that the CREATE call count in the strace output is still higher than it was originally, before the first slowdown, but performance seems the same after applying the quickfix to the latest 19.x from git. This still bugs me a bit, but as there is no visible degradation, it is fine I guess.
Comment 31 Richard Thier 2019-06-04 10:33:07 UTC
While bisecting the original issue I saw a point where speed was already bad, though not quite as bad as with the latest version - roughly 90-95% of the full slowdown.

So there might be a bit more speedup to gain if someone (or I) bisects between 04ea03d99d8 and the latest to find the minor slowdown, but it might not be worth it: the effect is so small it might be under 1%. I am just documenting it because I can still see it in the strace output after the quickfix patch - it is not visible or measurable when playing or running 3D apps.

At least now there is a quickfix solution based on Marek's attachment that works with 19.x too - enjoy ;-)
Comment 32 Rui Salvaterra 2019-06-04 10:37:35 UTC
(In reply to Richard Thier from comment #28)
> Hi Rui!

Hi, Richard!

> We should also thank Marek for him proposing a possible fix.

Yeah, Marek kicks ass, he's one of the original developers of r300g. :)

> I am on a single-core, 32-bit laptop with 1512-128 MB of memory that I
> got in 2007, and I think most of its hardware is 2005-era technology, so
> I might be among the slowest people to cross-compile - please give
> putting together a build environment a shot yourself.

My Xpress 200M laptop has a Turion X2 TL-60 with 4 GiB of RAM, so I'll just shut up. :P

> It is not that hard to build Mesa; this was my first time doing so too!
> Just believe in yourself and read the nice docs on the Mesa page about
> building with Meson.

I build my own kernels straight from Linus' git, but that's trivial.
I've never built Mesa, but I might as well just bite the bullet. ;) Something tells me I'll be needing it soon, when I try to get an X1950 XT running on my Power Mac G5. I have no idea about the current state of r300g on big-endian; it's most likely bitrotted (I'd love to be wrong, though).

Speaking of which (and totally off-topic), how hard and/or beneficial would it be to also implement shader caching on r300g?

> If you have any questions about the build process, look around Phoronix
> or the Mesa docs and forums, or if you give up you can just ask people.
> Even I can help now, and you will be able to as well after you build it
> yourself once ;-)

Great! I often lurk around Phoronix too. I don't think I'll need extra help after RTFM, but I appreciate it. Thanks a lot, Richard! :)
Comment 33 Richard Thier 2019-06-04 12:43:24 UTC
I have put together a "how to build" guide just for you, linking the sources I used:

http://ballmerpeak.web.elte.hu/devblog/building-your-own-mesa.html

Everyone should be able to try the quickfix using this and the links on that page, and I guess if questions arise people will answer them - just please do not clutter the bug report with them anymore. Write emails or ask around on forums, as this is a general topic unrelated to the slowdown bug.

Wondering about two things now:

- Is my quickfix patch working for other people too?
- Is my quickfix good enough, or should we dig deeper and solve the domain-bits issue some other way? (the latter is a question for more experienced devs)
Comment 34 Marek Olšák 2019-06-04 23:09:00 UTC
Created attachment 144454 [details] [review]
possible fix v2

Please test this one.
Comment 35 Rui Salvaterra 2019-06-05 09:52:38 UTC
(In reply to Marek Olšák from comment #34)
> Created attachment 144454 [details] [review]
> possible fix v2
> 
> Please test this one.

Hi, Marek,

Unfortunately I couldn't do it yesterday, but I'll try today. I'm also entertaining the idea of implementing support for shader disk caching, but I'd like your opinion before moving forward. I read the implementations in nouveau and r600 and they seemed simple enough (or, most likely, I'm missing something fundamental ;)). I don't know how beneficial it would be, though, since R300-R500 shaders are much simpler, so measurements would have to be done.
Comment 36 Marek Olšák 2019-06-05 15:10:07 UTC
I think if you just implement the disk cache callback, it'll automatically enable the shader cache for the GLSL compiler.
Comment 37 Rui Salvaterra 2019-06-05 16:14:12 UTC
(In reply to Marek Olšák from comment #36)
> I think if you just implement the disk cache callback, it'll automatically
> enable the shader cache for the GLSL compiler.

Hi, Marek

Yeah, that was my reasoning. Will try. :)
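
The shape I have in mind, modeled on the nouveau/r600 pattern, is roughly this (a sketch only - the field and helper names are my guesses until I write the real patch):

    /* Sketch: create a per-screen disk cache keyed on the driver build and
     * expose it through the get_disk_shader_cache() screen callback.
     * Assumes util/disk_cache.h and util/mesa-sha1.h are included and a
     * disk_shader_cache field is added to struct r300_screen. */
    static void r300_disk_cache_create(struct r300_screen *r300screen)
    {
        struct mesa_sha1 ctx;
        unsigned char sha1[20];
        char cache_id[20 * 2 + 1];

        _mesa_sha1_init(&ctx);
        /* Hash the driver binary so the cache invalidates on rebuilds. */
        if (!disk_cache_get_function_identifier(r300_disk_cache_create, &ctx))
            return;
        _mesa_sha1_final(&ctx, sha1);
        disk_cache_format_hex_id(cache_id, sha1, 20 * 2);

        r300screen->disk_shader_cache = disk_cache_create("r300", cache_id, 0);
    }

    static struct disk_cache *
    r300_get_disk_shader_cache(struct pipe_screen *pscreen)
    {
        return r300_screen(pscreen)->disk_shader_cache;
    }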
Comment 38 Richard Thier 2019-06-05 20:35:06 UTC
Hello everyone!

While hopefully someone is testing Marek's solution, I went to check whether the "minor problem" between 04ea03d99d8 and 19.x causes a slowdown or not.

Although most of the heavy slowdown is gone now, I am still unsure whether the higher number of ioctl calls in later versions (let's call it ioctl spamming) can cause a slowdown in some cases, so I plan to compare the performance of heavier games between a patched 19.x and the last working 17.x, just to see how it goes, and maybe start bisecting from 04ea03d99d8 if there is still a slowdown.

It just gives me a bad feeling, so I want to be sure with heavier games than Extreme Tux Racer...
Comment 39 Richard Thier 2019-06-05 20:50:09 UTC
First I went on to strace a bit and measure directly...

10 sec glxgears - last fast 17.x:

- 2105 ioctl calls
- 6 DRM_IOCTL_RADEON_GEM_CREATE
- 7 DRM_IOCTL_GEM_CLOSE

10 sec glxgears - patched 19.x:

- 2916 ioctl calls
- 712 DRM_IOCTL_RADEON_GEM_CREATE
- 713 DRM_IOCTL_GEM_CLOSE

In comparison, this is the original unpatched 19.x (10 sec):

-> 12466 (any) ioctl calls
-> 3111 DRM_IOCTL_RADEON_GEM_CREATE
-> 3112 DRM_IOCTL_GEM_CLOSE

Surely it is already an order of magnitude better now, but maybe still not as good as it should be. As I said, while bisecting there was a point where I saw this happening, hinting that there are two issues - but one is much smaller.

glxgears runs at roughly the same FPS now, though. Maybe things only start to slow down when I run out of CPU power, which glxgears does not cause now - neither does Tux Racer - but I will try heavy things like Mount&Blade: Warband... it was working well before...
Comment 40 Richard Thier 2019-06-05 21:13:43 UTC
Okay... I cannot easily test heavier games because some of them crash with the last good 17.x (maybe because of missing external S3TC support or something similar).

Still, I will try to bisect the minor issue too, just to see what it is.
Comment 41 Rui Salvaterra 2019-06-05 22:47:14 UTC
Hi, guys

I haven't tried Marek's solution yet. It's the first time I build Mesa from scratch and I've never used Meson before, so I'm studying along the way. I applied the patch, configured with no DRI drivers and only the r300g driver, compiled and installed to the default location (/usr/local/lib/x86_64-linux-gnu, it seems). Unfortunately, glxinfo can't find the libraries when I point the environment variables to them, so I must have done something stupid. Will dig deeper.
(I know this isn't a developer support thread, I just wanted you to know I haven't given up or stopped!)
Comment 42 Andrew Randrianasulu 2019-06-06 03:02:45 UTC
(In reply to Rui Salvaterra from comment #41)
> Hi, guys
> 
> I haven't tried Marek's solution yet. It's the first time I build Mesa from
> scratch and I've never used Meson before, so I'm studying along the way. I
> applied the patch, configured with no DRI drivers and only the r300g driver,
> compiled and installed to the default location
> (/usr/local/lib/x86_64-linux-gnu, it seems). Unfortunately, glxinfo can't
> find the libraries when I point the environment variables to them, so I must
> have done something stupid. Will dig deeper.
> (I know this isn't a developer support thread, I just wanted you to know I
> haven't given up or stopped!)

Meson actually accepts a prefix, so I use it like this (from a BUILD dir inside the mesa tree):

meson ../ --prefix=/usr/X11R7 rest_of_options (obviously, for x86-64 the prefix will be different; I just wanted to say that I have a non-standard X prefix and install Mesa into it just fine)

The full line in my case looks like:
meson ../ --prefix=/usr/X11R7 --strip --buildtype debugoptimized -Degl=true -Ddri-drivers=r100,r200,i965,nouveau -Dplatforms=drm,x11 -Dgallium-drivers=i915,r600,radeonsi,swrast,virgl,nouveau,r300 -Dvulkan-drivers=amd,intel  -Dgallium-nine=true -Dgallium-opencl=icd -Dgallium-va=true -Dgallium-xvmc=true -Dgallium-xa=false

You probably want to add --reconfigure if you re-run meson after a git pull from the same BUILD directory.
Comment 43 Rui Salvaterra 2019-06-06 17:46:46 UTC
Hi, everyone!

I finally did it - I was able to point the envvars in the right direction, tested Marek's patch, and it seems to be working nicely. Compiz feels much snappier, indeed. Feel free to add:

Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>

Now that I'm able to do code-build-test cycles, it's time to implement on-disk shader caching. Stay tuned… :)
Comment 44 Rui Salvaterra 2019-06-06 19:50:08 UTC
Created attachment 144469 [details] [review]
r300g: implement glsl shader caching.

And here's a preliminary patch. If it looks good enough to you, I'll send it officially to mesa-dev. :)
Comment 45 Marek Olšák 2019-06-07 01:48:57 UTC
Yeah that looks good.
Comment 46 Richard Thier 2019-06-07 06:36:47 UTC
In comparison, this is the original unpatched 19.x (10 sec):

-> 12466 (any) ioctl calls
-> 3111 DRM_IOCTL_RADEON_GEM_CREATE
-> 3112 DRM_IOCTL_GEM_CLOSE

^^ Sorry, these values were wrong: it is not that much, and I don't know how I measured this back then, but I cannot reproduce it anymore.


Actually, it is not that easy to find the "more minor" issue, but I still feel it is there. The only way to find it is to bisect like this:

-> bisect ...
-> apply Marek's patches
-> see if the result has 7-8 CREATE calls or a bit more
-> stash away the patches
-> bisect ...

Only this way is it properly measurable, because "measuring" without the patch in place is much harder to do by the naked eye.
Comment 47 Rui Salvaterra 2019-06-07 08:05:41 UTC
(In reply to Marek Olšák from comment #45)
> Yeah that looks good.

Great! In that case, I'm going to do a proper version and send it. I don't like the fact that I'm doing chip_families[r300screen->caps.family] directly in r300_disk_cache_create, when we already have r300_get_name(struct pipe_screen* pscreen), so I'm going to factor that out to something like r300_get_family_name(struct r300_screen* r300screen).
Thanks for the review, Marek!
Comment 48 Rui Salvaterra 2019-06-07 08:22:03 UTC
(In reply to Rui Salvaterra from comment #43)
> Compiz feels much snappier, indeed.

To be 100% honest, the difference is noticeable, but my laptop has one big problem: as if the Xpress 200M wasn't slow enough, the screen has a resolution of 1680x1050, when the typical resolution at the time was 1280x800. That's 72% more pixels to toss around on a GPU with shared memory. :(
Comment 49 Richard Thier 2019-06-07 08:38:37 UTC
> as if the Xpress 200M wasn't slow enough, the screen has
> a resolution of > 1680x1050 when the typical resolution,
> at the time, was 1280x800.

I don't get it - is there no supported resolution below that big one? That is quite hard to believe. Mine is 1024x768 at maximum, but sometimes I reverse engineer game binaries that do not let me use 640x480, in case they lag haha.

Frame rate really drops a lot when the resolution is high, because the pixel count grows with the screen area (so with the square of the linear size)...
Comment 50 Rui Salvaterra 2019-06-07 08:49:21 UTC
(In reply to Richard Thier from comment #49)
> I don't get it. Is there no supported resolution below that big one? That is
> quite hard to believe. Mine is 1024x768 when maxed, but sometimes I reverse
> engineer game binaries that do not let me use 640x480 in case they lag haha.

Heh… :) I only used Compiz at the native resolution on the LCD. I do have an old 1024x768 monitor with VGA input, I might give it a spin this weekend.
Comment 51 Richard Thier 2019-06-07 14:43:41 UTC
> I do have an old 1024x768 monitor with VGA input,
> I might give it a spin this weekend.

But doesn't the LCD also support smaller resolutions natively? Most of them do, even if their best native mode has a more exotic ratio or such. Then you just need to switch to something smaller with xrandr.
Comment 52 Alex Deucher 2019-06-07 14:56:46 UTC
(In reply to Richard Thier from comment #51)
> > I do have an old 1024x768 monitor with VGA input,
> > I might give it a spin this weekend.
> 
> But didn't the LCD also supports smaler resolutions natively? Most of them
> do even if their best native is more exotic ratios and stuff. Then you just
> need to change with xrandr to something smaller.

Generally panels only have a single native mode.  The GPU has a scaler that can scale a smaller image to the native timing.  Many LCD monitors also have a scaler built in to provide the same functionality.
Comment 53 Rui Salvaterra 2019-06-07 15:10:12 UTC
(In reply to Richard Thier from comment #49)
> I don't get it. Is there no supported resolution below that big one? That is
> quite hard to believe. Mine is 1024x768 when maxed, but sometimes I reverse
> engineer game binaries that do not let me use 640x480 in case they lag haha.

1024x768?! Jesus, that's less than half the number of pixels, no wonder it's faster…!
And yes, like Alex Deucher said, I've never seen an LCD with more than one native resolution. At different resolutions, there's always scaling involved (either at the panel or at the GPU).
Comment 54 Richard Thier 2019-06-07 15:20:31 UTC
> The GPU has a scaler that can scale a smaller image to the native timing.
> Many LCD monitors also have a scaler built in to provide the same functionality.

I see, but I still see a big performance rise when I switch to a smaller resolution and let this scaler do its work in the last phase.

I have no idea if there is a real speed difference between these two:
- having 640x480 on a 4:3 aspect screen without a scaler
- having hardware pixels of 1024x768, but rendering 640x480 in software and using the scaler at the end

Either way:
- Z tests run for a smaller number of pixels
- stencil and Z buffers are smaller, just like the backbuffer
- etc.

GPU memory usage should still be smaller, pixel shaders still run fewer times, post-processing still runs on fewer pixels, and so on... The only way it could be slower is if the scaler circuit were so slow at its job that it could not keep up the FPS, but I think it is highly specialized hardware that does nothing else, so I hoped it is simply "fast" - even if there is some loss, I guess it is barely measurable. Tell me if I am wrong, as it would be news to me if this had a really measurable overhead.

In any case, I think it is better to use the same aspect ratio at a smaller resolution than the biggest possible native size. Actually, on my machine I get a faster frame rate even if I run a game in a 640x480 window in the top-left corner, much the same as when I switch the screen itself to 640x480 - and even then I barely see a difference, although in that case I feel there should be some overhead compared to using a dedicated video mode for it...
Comment 55 Alex Deucher 2019-06-07 15:25:50 UTC
Performance will be better because the 3D engine is rendering to a smaller surface (e.g., 640x480) regardless of whether you are fullscreen or windowed. The size of the rendered image is what impacts 3D performance. The scaler is fixed-function hardware, so it doesn't impact the 3D engine directly, other than using some memory bandwidth to read the source image.
Comment 56 Rui Salvaterra 2019-06-07 16:23:05 UTC
And speaking of performance, I wonder if this will have any effect… I'll try it later.

https://git.congatec.com/android/qmx6_kernel/commit/a0a53aa8c7b491a43e2ef66786f9511bae8cbc35
Comment 57 Richard Thier 2019-06-07 19:23:35 UTC
Hello everyone!

I have found one more place where we can end up issuing extra GEM_CREATE ioctls. I remember telling you above, after Marek's very first proposed patch, that "Interesting to me is that the CREATE call number is still higher in strace output than how it was originally before the first slowdown".

I went on to find the cause, because I was using his patch while bisecting the minor issue, and this extra noise makes it really hard to spot the minor issue I mentioned earlier.

So I decided to find out what causes the 600-700 CREATE calls despite the performance seeming fast.

This is what I did:

    /* Shared resources don't use cached heaps. */
    if (use_reusable_pool) {
        int heap = radeon_get_heap_index(domain, flags);
        assert(heap >= 0 && heap < RADEON_MAX_CACHED_HEAPS);
        usage = 1 << heap; /* Only set one usage bit for each heap. */

        pb_cache_bucket = radeon_get_pb_cache_bucket_index(heap);
        assert(pb_cache_bucket < ARRAY_SIZE(ws->bo_cache.buckets));

        bo = radeon_bo(pb_cache_reclaim_buffer(&ws->bo_cache, size, alignment,
                                               usage, pb_cache_bucket));
        if (bo) 
            return &bo->base;
    } else {
        /* At first, only these logs were here: */
        fprintf(stderr, "1 KULAKVA!!!\n");
        fprintf(stderr, "2 KULAKVA!!!\n");
        fprintf(stderr, "3 KULAKVA!!!\n");
        /* This deliberate null-pointer write was added later, so gdb
         * catches the crash and shows a backtrace: */
        uint kula = 0;
        int *kulakva = (int*)kula;
        *kulakva = 42;
    }

I saw that I can still reach the code path where "KULAKVA" gets printed, while in the very original code this could never happen, since the whole if statement and the use_reusable_pool bool did not exist yet.

I am now recompiling with debug symbols, so that running glxgears under gdb will hopefully crash at the last line of the else clause and give me a backtrace showing where I am coming from ;-)
Comment 58 Richard Thier 2019-06-07 20:12:53 UTC
The output from gdb glxgears on the segfault:

#0  0xb75a8892 in radeon_winsys_bo_create (rws=0x45e100, size=4096, alignment=4096, domain=RADEON_DOMAIN_GTT, flags=(unknown: 0))
    at radeon_drm_bo.c:993
#1  0xb75a972b in radeon_cs_create_fence (rcs=rcs@entry=0xb06a4010) at radeon_drm_cs.c:753
#2  0xb75aa327 in radeon_drm_cs_flush (rcs=0xb06a4010, flags=2, pfence=0xbffff7d8) at radeon_drm_cs.c:593
#3  0xb757845a in r300_flush_and_cleanup (r300=r300@entry=0x446090, flags=flags@entry=2, fence=fence@entry=0xbffff7d8) at r300_flush.c:56
#4  0xb757860f in r300_flush (pipe=0x446090, flags=2, fence=0xbffff7d8) at r300_flush.c:82
#5  0xb731508f in st_context_flush (stctxi=0x55ed20, flags=3, fence=0xbffff7d8) at state_tracker/st_manager.c:635
#6  0xb74511f8 in dri_flush (cPriv=0x445fd0, dPriv=0x5760d0, flags=3, reason=__DRI2_THROTTLE_SWAPBUFFER) at dri_drawable.c:568
#7  0xb7f9dbe8 in dri2Flush (psc=psc@entry=0x4165d0, ctx=<optimized out>, draw=draw@entry=0x5590e0, flags=3, 
    throttle_reason=__DRI2_THROTTLE_SWAPBUFFER) at dri2_glx.c:553
#8  0xb7f9e09d in dri2SwapBuffers (pdraw=0x5590e0, target_msc=0, divisor=0, remainder=0, flush=1) at dri2_glx.c:845
#9  0xb7f73049 in glXSwapBuffers (dpy=0x407160, drawable=18874370) at glxcmds.c:843
#10 0x00401742 in ?? ()
#11 0xb7b5c669 in __libc_start_main () from /usr/lib/libc.so.6
#12 0x00401cd5 in ?? ()

I have changed the relevant part, adding the flag:

    /* Create a fence, which is a dummy BO. */
    fence = cs->ws->base.buffer_create(&cs->ws->base, 1, 1,
                                       RADEON_DOMAIN_GTT,
                                       RADEON_FLAG_NO_SUBALLOC
                                       | RADEON_FLAG_NO_INTERPROCESS_SHARING);

Currently I am testing this on latest; the build is still going. I will upload a patch for it only if it works. I think it does no harm, but this happens when the BO is a "fence". I have no idea what a fence is in this terminology, but it surely sounds like some synchronization primitive, so I was unsure whether I can add the flag safely - but since the flag was not in the earlier code at all, I can likely just add it.

Will see how this shrinks the GEM_CREATE call numbers.
Comment 59 Marek Olšák 2019-06-08 03:28:51 UTC
Richard, your change for create_fence is correct.
Comment 60 Richard Thier 2019-06-08 06:51:08 UTC
Created attachment 144481 [details] [review]
Possible fix v3

Added attachment for v3 with also adding the flag at the create_fence calls.

Now there are only 6 (!) GEM_CREATE ioctl calls in 10 seconds of glxgears on 19.x versions, just like it once was before.

Performance seems very close to v2, so this was not causing any big change.
Comment 61 Richard Thier 2019-06-08 16:43:24 UTC
I still get some slowdown when playing Mount&Blade: Warband, which worked well before, but there is some chance it is something else, or not even in Mesa.

On my last system I was playing it with Wine 1.8, and now with the latest wine-staging 4.x, so there are just too many variables.

I have played a bit of Urban Terror, though, and it became smooth and playable once again. I often actually hit 60 FPS at the top end, and 18-20 FPS when looking at big areas.

Before our patches, Urban Terror ran like a slideshow and was completely unplayable.

Also, I have built my Mesa using -O3 now - and updated my blog with notes on how to pass extra compiler parameters, such as for debug symbols or this. I think the latter helped gain a bit more FPS too, but there is always room for improvement.

The original problem this bug was about seems solved now.

PS.: I have issues/glitches with HyperZ turned on. Should I open a different bug report for that? I think it belongs somewhere else, as it might not even be a regression - I feel it never worked on my card.
Comment 62 Richard Thier 2019-06-10 22:34:25 UTC
Added a blog post about the whole issue and its solving process here:

http://ballmerpeak.web.elte.hu/devblog/debugging-mesa-and-the-linux-3d-graphics-stack.html

I have linked this bug report and all the other places and people involved in the communication that solved this slowdown issue.

I just wanted to put things together, as it might be valuable for others wanting to do similar things - I was, and still am, a rookie at these things after all.
Comment 63 Rui Salvaterra 2019-06-11 11:23:55 UTC
(In reply to Richard Thier from comment #62)
> Added a blog post about the whole issue and its solving process here:
> 
> http://ballmerpeak.web.elte.hu/devblog/debugging-mesa-and-the-linux-3d-
> graphics-stack.html
> 
> I have linked this bug report and all the other places and people
> involved in the communication that solved this slowdown issue.
> 
> I just wanted to put things together, as it might be valuable for others
> wanting to do similar things - I was, and still am, a rookie at these
> things after all.

Nice article! I'll read it thoroughly when I get home, probably while also testing your v3 and trying to squeeze some performance numbers out of my patch. ;)

Cheers!
Comment 64 Richard Thier 2019-06-11 17:12:18 UTC
Rui!

Btw, I am also already running your patch and did not see any problems so far, but much of the stuff I am running is DX7- or DX8-level in graphics settings, so I do not know how many times I actually exercised your code - but it is built into my Mesa too.

Also, Mount&Blade is working well now too! It was misconfigured in the new Wine to use CSMT, and that was über-slow on my single-core machine, leading to starvation and a lot of overhead that was even visible in the perf output.

So far everything works at full speed now!

I am thinking about maybe trying to figure out HyperZ on r300. According to the feature matrix it is untested, and I see bad glitches but elevated performance. I have already made some changes to try it...

But if I continue with the HyperZ stuff it will likely be in another bug report, just for the sake of transparency, since this one seems pretty much completely solved...
Comment 65 Rui Salvaterra 2019-06-11 17:33:50 UTC
(In reply to Richard Thier from comment #64)
> Rui!
> 
> Btw, I am also already running your patch and did not see any problems
> so far, but much of the stuff I am running is DX7- or DX8-level in
> graphics settings, so I do not know how many times I actually exercised
> your code - but it is built into my Mesa too.

Good to know! It worked fine here, too.

> Also, Mount&Blade is working well now too! It was misconfigured in the
> new Wine to use CSMT, and that was über-slow on my single-core machine,
> leading to starvation and a lot of overhead that was even visible in the
> perf output.
> 
> So far everything works at full speed now!

Great!

> I am thinking about maybe trying to figure out HyperZ on r300. According
> to the feature matrix it is untested, and I see bad glitches but elevated
> performance. I have already made some changes to try it...
> 
> But if I continue with the HyperZ stuff it will likely be in another bug
> report, just for the sake of transparency, since this one seems pretty
> much completely solved...

Yes, I think it's a good idea to open a new bug.
By the way, I have a vague memory of HiZ on the R4xx being exclusive (i.e., only the first app to use the Z-buffer gets HiZ), but I could be completely wrong.
Comment 66 Richard Thier 2019-06-11 18:24:37 UTC
> Yes, I think it's a good idea to open a new bug.

Added a separate bug for working on the HyperZ glitches:

https://bugs.freedesktop.org/show_bug.cgi?id=110897

> By the way, I have a vague memory of HiZ on the R4xx being exclusive (i.e.,
> only the first app to use the Z-buffer gets HiZ), but I could be completely wrong.

It is exclusive, but not to the first app, it seems. The code that acquires HyperZ ownership is in the buffer clearing function - I am looking at it right now - so I think it is not only the first app that can get it: if you exit that app, one of the next ones (whichever gets its clear called first) takes over automatically.

> It worked fine here, too.

Is there any indicator to look for? Like shader files on disk somewhere, or near the binary being run, or in the current dir, with some particular name?
Comment 67 Rui Salvaterra 2019-06-11 23:14:23 UTC
(In reply to Richard Thier from comment #66)
> Is there any indicator to look for? Like shader files on disk at some places
> or near the runned binary or current dir or whatever with this or that name?
Yes, it's easy to see if it's being used. The cache is per user and is likely stored at ~/.cache/mesa_shader_cache (or at the path pointed to by XDG_CACHE_HOME, if your distro sets it).
Comment 68 Marek Olšák 2019-06-12 00:46:16 UTC
I've pushed the fix. Closing.
Comment 69 Michel Dänzer 2019-06-24 08:44:38 UTC
*** Bug 105171 has been marked as a duplicate of this bug. ***

