105171 – performance regression (3x slower) running glamor with PutImage workload (radeonsi)

Status:	RESOLVED DUPLICATE of bug 110781

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	17.3
Hardware:	Other All

Importance:	medium normal
Assignee:	mesa-dev
QA Contact:	mesa-dev

Keywords:	bisected, regression

Reported:	2018-02-20 08:40 UTC by Clemens Eisserer
Modified:	2019-06-24 08:44 UTC
CC List:	1 user

Attachments
test case (15.93 KB, application/x-java-archive) 2018-02-20 08:40 UTC, Clemens Eisserer

Description Clemens Eisserer 2018-02-20 08:40:08 UTC

Created attachment 137456
test case

After updating Fedora from mesa-17.2.4 to 17.3.5 I noticed throughput in XPutImage/XShmPutImage based workloads dropped significantly.

I'd noticed this before with a self-compiled version of mesa-18.0-rc, but thought this had something to do with the chosen compile flags.

One test went from 40fps to 15fps (very small XPutImage requests immediately followed by XRenderComposite), while others degraded by about 30%.

System details:
* AMD Kaveri 7650k
* 4k + FullHD displays
* linux 4.15.3
* radeon kernel driver


How to test:
Run the attached Java program and enable the "antialiasing" checkbox:
java -jar JGears2.jar

Comment 1 Clemens Eisserer 2018-02-20 09:28:14 UTC

currently bisecting...

Comment 2 Clemens Eisserer 2018-02-20 10:12:04 UTC

The commit causing this regression is:

8b3a257851905ff444d981e52938cbf2b36ba830 is the first bad commit
commit 8b3a257851905ff444d981e52938cbf2b36ba830
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Tue Jul 18 16:08:44 2017 -0400

    radeonsi: set a per-buffer flag that disables inter-process sharing (v4)
    
    For lower overhead in the CS ioctl.
    Winsys allocators are not used with interprocess-sharable resources.
    
    v2: It shouldn't crash anymore, but the kernel will reject the new flag.
    v3 (christian): Rename the flag, avoid sending those buffers in the BO list.
    v4 (christian): Remove setting the kernel flag for now
    
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>

:040000 040000 b775b6b0ea5b971d2165a644ea8912c120f54431 2e4b2737f37ede2bbdbbe6815fe0fa562177c2b7 M      src



x11perf -putimage10 regressed from about 77 kOps/s to about 22 kOps/s after this patch, running Xephyr with vblank disabled.

before:
     400000 reps @   0.0130 msec ( 77000.0/sec): PutImage 10x10 square

after:
     120000 reps @   0.0457 msec ( 21900.0/sec): PutImage 10x10 square
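
The slowdown factor can be read straight off the two x11perf result lines above. The following is a minimal sketch of that arithmetic; the regular expression and the helper name `parse_rate` are illustrative assumptions and are not part of x11perf itself.

```python
import re

# Matches x11perf result lines such as:
#   "400000 reps @   0.0130 msec ( 77000.0/sec): PutImage 10x10 square"
LINE_RE = re.compile(r"\s*(\d+) reps @\s*([\d.]+) msec \(\s*([\d.]+)/sec\): (.+)")


def parse_rate(line):
    """Return (test name, operations per second) from one x11perf result line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"not an x11perf result line: {line!r}")
    return m.group(4).strip(), float(m.group(3))


# The two result lines quoted in this comment.
before = "     400000 reps @   0.0130 msec ( 77000.0/sec): PutImage 10x10 square"
after = "     120000 reps @   0.0457 msec ( 21900.0/sec): PutImage 10x10 square"

name, rate_before = parse_rate(before)
_, rate_after = parse_rate(after)
slowdown = rate_before / rate_after
print(f"{name}: {rate_before:.0f}/sec -> {rate_after:.0f}/sec, {slowdown:.2f}x slower")
```

With these two lines the ratio comes out at roughly 3.5, which is consistent with the "3x slower" in the bug title.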

Comment 3 Emil Velikov 2018-02-22 14:39:22 UTC

Thanks for the bisection Clemens.

In the future, feel free to add the commit author/reviewer to the CC list. It should help flag the issue amongst the dozens of others.

Comment 4 Clemens Eisserer 2018-03-08 06:38:34 UTC

just some unrelated, interesting numbers:

Sync time adjustment is 0.0355 msecs.
    8000000 reps @   0.0012 msec (816000.0/sec): ShmPutImage 10x10 square
    8000000 reps @   0.0012 msec (818000.0/sec): ShmPutImage 10x10 square

These are the results achieved by a GeForce 8800 GTS (an 11-year-old dGPU) using the proprietary driver in the same system.

This confirms my subjective experience: the glamor-based open-source driver stack is really slow for some operations. The proprietary nvidia driver seems to have much lower driver overhead (a 10x10 PutImage won't saturate the GPU).

Comment 5 Dieter Nützel 2018-03-08 07:00:13 UTC

Marek,

any ideas?
My Polaris 20 is somewhat faster, but nowhere near the Nvidia blob.
'git revert xxx' does NOT apply cleanly.

Someone on Phoronix mentioned that fglrx was even much faster than Mesa git before your commit.

Comment 6 Clemens Eisserer 2018-03-08 10:32:14 UTC

Strange, after tinkering around with my system, I cannot reproduce the issue anymore. Even with Mesa-17.3.x, x11perf -shmput10 is now at ~70-80 kOps/s, so maybe it was a configuration issue that was somehow triggered by the commit in question?

This still leaves open the question of how/why the nvidia blob can be orders of magnitude faster for XPutImage-based workloads.

Comment 7 Michel Dänzer 2018-03-08 10:35:05 UTC

(In reply to Clemens Eisserer from comment #6)
> This still leaves open the question of how/why the nvidia blob can be
> orders of magnitude faster for XPutImage-based workloads.

If somebody wants to improve this, the place to start is probably glamor rather than the drivers.

Comment 8 Clemens Eisserer 2018-03-08 10:46:40 UTC

> If somebody wants to improve this, 
> the place to start is probably glamor rather than the drivers.

I wonder, what could glamor do better (especially for small uploads) than call into glTexSubImage2D?

Comment 9 Clemens Eisserer 2018-03-10 06:27:04 UTC

So, shmput10 is now equally fast with Mesa-17.3.6 and Mesa-17.2.4. However, the real-world workload still suffers.

Please have a look at http://93.83.133.214/downloads/JXRenderMark-1.0.1.zip - it is a stand-alone benchmark which emulates the XRender sequences generated by the Java XRender backend.

CentOS-7 + updates (Mesa 17.0.1):
./render 3 32 3 32 3 32 3 32 3 32 3 32 3 32
18621.335408 Ops/s; put composition (!); 32x32
18901.781304 Ops/s; put composition (!); 32x32
18903.572785 Ops/s; put composition (!); 32x32

Fedora 27 + updates (Mesa 17.3.6):
./render 3 32 3 32 3 32 3 32 3 32 3 32 3 32
[ce@localhost temp]$ ./JXRenderMark-1.0.1 3 32 3 32 3 32 3 32 3 32 3 32
6938.738245 Ops/s; put composition (!); 32x32
6825.050537 Ops/s; put composition (!); 32x32
6955.692404 Ops/s; put composition (!); 32x32

So there it is ... a slowdown by a factor of about 2.5 :/

Comment 10 Clemens Eisserer 2018-03-10 10:43:54 UTC

I bisected the regression again, this time with the benchmark mentioned in the post above (JXRenderMark), and I was again led to the following commit:

[ce@localhost mesa]$ git bisect good
8b3a257851905ff444d981e52938cbf2b36ba830 is the first bad commit
commit 8b3a257851905ff444d981e52938cbf2b36ba830
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Tue Jul 18 16:08:44 2017 -0400

    radeonsi: set a per-buffer flag that disables inter-process sharing (v4)
    

So regardless of different manifestations, this commit seems to introduce regressions for antialiased rendering using the Xrender Java2D backend.
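
A bisection like this can in principle be automated with `git bisect run`, which reads the exit code of a script at every step: 0 marks the commit good, 1 marks it bad, and 125 skips it. The sketch below only shows the classification step applied to JXRenderMark-style output. The threshold value and the helper name `bisect_exit_code` are hypothetical, and the build, install and benchmark invocation steps are deliberately left out.

```python
import re
import statistics

# Hypothetical cut-off between the fast and the slow results quoted in this
# report; it has to be tuned for the machine being bisected.
THRESHOLD_OPS = 9000.0

# Matches result lines such as "18621.335408 Ops/s; put composition (!); 32x32".
OPS_RE = re.compile(r"^([\d.]+) Ops/s; put composition", re.MULTILINE)


def bisect_exit_code(benchmark_output, threshold=THRESHOLD_OPS):
    """Map benchmark output to an exit code understood by `git bisect run`.

    0 marks the commit good, 1 marks it bad, and 125 asks git to skip the
    commit (used here when no result line could be parsed).
    """
    rates = [float(r) for r in OPS_RE.findall(benchmark_output)]
    if not rates:
        return 125
    return 0 if statistics.median(rates) >= threshold else 1


# Result lines copied from comment 9.
good_run = """18621.335408 Ops/s; put composition (!); 32x32
18901.781304 Ops/s; put composition (!); 32x32
18903.572785 Ops/s; put composition (!); 32x32"""
bad_run = """6938.738245 Ops/s; put composition (!); 32x32
6825.050537 Ops/s; put composition (!); 32x32
6955.692404 Ops/s; put composition (!); 32x32"""

codes = (bisect_exit_code(good_run), bisect_exit_code(bad_run),
         bisect_exit_code("build failed"))
print(codes)
```

A complete script would rebuild Mesa at each step, capture the benchmark output and pass the returned code to `sys.exit`. Using the median rather than a single run makes the verdict less sensitive to one noisy sample.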

Comment 11 Michel Dänzer 2018-03-15 18:08:07 UTC

https://patchwork.freedesktop.org/patch/210907/ helps for this benchmark with the r600 driver, but radeonsi already has the same code...

Clemens, are you still seeing the problem with current Mesa Git master?

Comment 12 Marek Olšák 2018-03-15 20:35:04 UTC

(In reply to Clemens Eisserer from comment #10)
> I bisected the regression again, this time with the benchmark mentioned in
> the post above (JXRenderMark), and I was again led to the following commit:
> 
> [ce@localhost mesa]$ git bisect good
> 8b3a257851905ff444d981e52938cbf2b36ba830 is the first bad commit
> commit 8b3a257851905ff444d981e52938cbf2b36ba830
> Author: Marek Olšák <marek.olsak@amd.com>
> Date:   Tue Jul 18 16:08:44 2017 -0400
> 
>     radeonsi: set a per-buffer flag that disables inter-process sharing (v4)
>     
> 
> So regardless of different manifestations, this commit seems to introduce
> regressions for antialiased rendering using the Xrender Java2D backend.

8b3a257851905ff444d981e52938cbf2b36ba830 indeed regressed performance, but it was fixed later. The regression is not reproducible with branches 17.3, 18.0, and master.

Comment 13 Clemens Eisserer 2018-03-15 20:46:09 UTC

For my Kaveri system I got the following numbers (compositing manager disabled, Xephyr):

./JXRenderMark-1.0.1 3 32 3 32 3 32

#amdgpu, IOMMU enabled
12325.077581 Ops/s; put composition (!); 32x32    # mesa-17.2.4 self-compiled
10582.511406 Ops/s; put composition (!); 32x32    # mesa-17.3.6, fedora 27, updates repo
8636.834555 Ops/s; put composition (!); 32x32     # mesa-18.1.0-devel self-compiled

#radeon, IOMMU enabled
12060.500868 Ops/s; put composition (!); 32x32    # mesa-17.2.4, self-compiled
6330.459659 Ops/s; put composition (!); 32x32     # mesa-17.3.6, fedora 27, updates repo
6100.570157 Ops/s; put composition (!); 32x32     # mesa-18.1.0-devel self-compiled


So amdgpu didn't regress as badly as radeon, but performance is constantly decreasing.
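
To put numbers on that comparison, the figures above can be expressed as a percentage change relative to the mesa-17.2.4 baseline for each kernel driver. This is only a sketch of the arithmetic; the dictionary layout and the helper name `relative_change` are illustrative choices.

```python
# Ops/s figures copied from the comment above, keyed by kernel driver.
results = {
    "amdgpu": {"17.2.4": 12325.077581, "17.3.6": 10582.511406,
               "18.1.0-devel": 8636.834555},
    "radeon": {"17.2.4": 12060.500868, "17.3.6": 6330.459659,
               "18.1.0-devel": 6100.570157},
}


def relative_change(baseline, value):
    """Percentage change of value relative to baseline (negative = slower)."""
    return (value - baseline) / baseline * 100.0


changes = {}
for driver, by_version in results.items():
    baseline = by_version["17.2.4"]
    for version, ops in by_version.items():
        change = relative_change(baseline, ops)
        changes[(driver, version)] = change
        print(f"{driver:7s} mesa-{version:13s} {ops:10.1f} Ops/s  {change:+6.1f}%")
```

By this measure amdgpu loses roughly 14% with 17.3.6 and 30% with 18.1.0-devel, while radeon loses close to half of its throughput in both cases.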

Comment 14 Marek Olšák 2018-03-15 22:31:45 UTC

Can you test this patch?
https://patchwork.freedesktop.org/patch/210920/

Comment 15 Dieter Nützel 2018-03-16 03:52:18 UTC

(In reply to Marek Olšák from comment #14)
> Can you test this patch?
> https://patchwork.freedesktop.org/patch/210920/

I see hardly any changes with radeonsi on RX 580.

Comment 16 Clemens Eisserer 2018-03-16 08:19:39 UTC

Same here: on my Kaveri 7650k, results with the patch are basically unchanged:

amdgpu:
8557.942992 Ops/s; put composition (!); 32x32

should I test with radeon too?

Dieter: Just out of curiosity, which values do you obtain with your polaris GPU?

Comment 17 Dieter Nützel 2018-03-17 00:42:22 UTC

(In reply to Clemens Eisserer from comment #16)
> Same here: on my Kaveri 7650k, results with the patch are basically
> unchanged:
> 
> amdgpu:
> 8557.942992 Ops/s; put composition (!); 32x32
> 
> should I test with radeon too?
> 
> Dieter: Just out of curiosity, which values do you obtain with your polaris GPU?

RX580 (DC enabled) 'cpupower frequency-set -g performance'

composit (faster): !!! ;-)
./JXRenderMark-1.0.1 3 32 3 32 3 32 3 32 3 32 3 32
29845.626072 Ops/s; put composition (!); 32x32                                                      
30745.957643 Ops/s; put composition (!); 32x32                                                      
30922.973502 Ops/s; put composition (!); 32x32                                                      
30460.302141 Ops/s; put composition (!); 32x32                                                      
30330.232018 Ops/s; put composition (!); 32x32                                                      
30757.257217 Ops/s; put composition (!); 32x32

without (slower):
28507.546115 Ops/s; put composition (!); 32x32                                                      
29570.588821 Ops/s; put composition (!); 32x32                                                      
29909.051450 Ops/s; put composition (!); 32x32                                                      
29839.934108 Ops/s; put composition (!); 32x32                                                      
30024.853684 Ops/s; put composition (!); 32x32                                                      
29852.673826 Ops/s; put composition (!); 32x32
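
Since the two sets of runs differ by only a few percent, it helps to compare the gap between their means with the run-to-run spread. The sketch below does that for the six values in each set; the variable names simply follow the "composit" and "without" labels used above and make no assumption about what exactly was toggled.

```python
import statistics

# Ops/s values copied from the two sets of runs above.
composit_runs = [29845.626072, 30745.957643, 30922.973502,
                 30460.302141, 30330.232018, 30757.257217]
without_runs = [28507.546115, 29570.588821, 29909.051450,
                29839.934108, 30024.853684, 29852.673826]


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of Ops/s values."""
    return statistics.mean(samples), statistics.stdev(samples)


mean_composit, sd_composit = summarize(composit_runs)
mean_without, sd_without = summarize(without_runs)
gain_percent = (mean_composit - mean_without) / mean_without * 100.0

print(f"composit: mean {mean_composit:.1f} Ops/s, stdev {sd_composit:.1f}")
print(f"without:  mean {mean_without:.1f} Ops/s, stdev {sd_without:.1f}")
print(f"difference of means: {gain_percent:+.1f}%")
```

The means differ by about 3%, which is not much larger than the spread within each set, so more runs would be needed before reading much into the difference.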

Comment 18 Dieter Nützel 2018-03-17 01:23:29 UTC

(In reply to Dieter Nützel from comment #17)
> (In reply to Clemens Eisserer from comment #16)
> > Same here: on my Kaveri 7650k, results with the patch are basically
> > unchanged:
> > 
> > amdgpu:
> > 8557.942992 Ops/s; put composition (!); 32x32
> > 
> > should I test with radeon too?
> > 
> > Dieter: Just out of curiosity, which values do you obtain with your polaris GPU?
> 
> RX580 (DC enabled) 'cpupower frequency-set -g performance'
> 
> composit (faster): !!! ;-)
> ./JXRenderMark-1.0.1 3 32 3 32 3 32 3 32 3 32 3 32
> 29845.626072 Ops/s; put composition (!); 32x32
> 30745.957643 Ops/s; put composition (!); 32x32
> 30922.973502 Ops/s; put composition (!); 32x32
> 30460.302141 Ops/s; put composition (!); 32x32
> 30330.232018 Ops/s; put composition (!); 32x32
> 30757.257217 Ops/s; put composition (!); 32x32
> 
> without (slower):
> 28507.546115 Ops/s; put composition (!); 32x32
> 29570.588821 Ops/s; put composition (!); 32x32
> 29909.051450 Ops/s; put composition (!); 32x32
> 29839.934108 Ops/s; put composition (!); 32x32
> 30024.853684 Ops/s; put composition (!); 32x32
> 29852.673826 Ops/s; put composition (!); 32x32

This was with Marek's patch from Comment 14.

Comment 19 Richard Thier 2019-06-03 10:54:56 UTC

Possibly related problem on r300 code paths:

https://bugs.freedesktop.org/show_bug.cgi?id=110781
https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1099745-how-to-tell-if-a-driver-is-gallium-or-just-mesa-slow-renderng-with-radeon/page10
https://bbs.archlinux32.org/viewtopic.php?pid=5973#p5973

It took me a whole lot of time to analyse the source of the problem and this is the commit that causes slowdown for me too.

For me it was really useful to run strace before and after this commit: the number of GEM_CREATE calls rises from around 7-11 to thousands during just 10 seconds of glxgears, which is clearly wrong and causes my slowdown.

It might be useful to test whether the problem is also related to GEM/TTM in this case. I am just informing you because I found this earlier issue when googling the commit hash...
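
The before/after comparison of an strace log can be reduced to counting DRM ioctl requests by name. The following is a minimal sketch under the assumption that strace decodes the request as `DRM_IOCTL_RADEON_GEM_CREATE`; the sample log lines are made up for illustration and the helper name `count_drm_ioctls` is hypothetical.

```python
import re
from collections import Counter

# Picks the request name out of decoded strace lines of the assumed form:
#   ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0x7ffc3a10) = 0
IOCTL_RE = re.compile(r"ioctl\(\d+, (DRM_IOCTL_\w+)")


def count_drm_ioctls(strace_lines):
    """Count DRM ioctl requests by name in an iterable of strace output lines."""
    counts = Counter()
    for line in strace_lines:
        m = IOCTL_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Made-up excerpt for illustration only; a real log would come from a file
# written by something like `strace -f -e trace=ioctl -o log.txt glxgears`.
sample_log = [
    "ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0x7ffc3a10) = 0",
    "ioctl(6, DRM_IOCTL_RADEON_CS, 0x7ffc3b20) = 0",
    "ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0x7ffc3a10) = 0",
    "ioctl(6, DRM_IOCTL_GEM_CLOSE, 0x7ffc3a40) = 0",
    "write(1, \"301 frames in 5.0 seconds\\n\", 26) = 26",
]

counts = count_drm_ioctls(sample_log)
print(counts["DRM_IOCTL_RADEON_GEM_CREATE"], "GEM_CREATE calls")
```

Running the same count on logs captured before and after the commit would show whether buffer creation has gone from a handful of calls to thousands, as described above.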

prenex

Comment 20 Clemens Eisserer 2019-06-23 08:56:58 UTC

Hi Richard,

Unfortunately there was very little interest in tackling the issue itself, even though bisecting it was a real pain.

For me the problem was "fixed" by switching to amdgpu, a luxury the r300/r600 code paths don't have - so I guess the report is still valid. Thanks for re-opening it.

Comment 21 Michel Dänzer 2019-06-24 08:44:38 UTC

Let's assume this is the same as bug 110781, which is now fixed.

*** This bug has been marked as a duplicate of bug 110781 ***
