Bug 80792 - [IVB/HSW/BYT-M Bisected]3D ( SynMark2_v5.3.0 /SynMark2_v6.0 /Lightsmarkv2008 &/warsow_v1.0 /GLBenchmarkv2.7.0/GpuTest GiMark) performance reduced 20%~90%
Summary: [IVB/HSW/BYT-M Bisected]3D ( SynMark2_v5.3.0 /SynMark2_v6.0 /Lightsmarkv2008 ...
Status: VERIFIED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965 (show other bugs)
Version: git
Hardware: All Linux (All)
: high major
Assignee: Iago Toral
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-02 07:46 UTC by zhoujian
Modified: 2014-07-22 05:56 UTC (History)
6 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Disable SOL statistics during meta operations (1.34 KB, patch)
2014-07-02 08:41 UTC, Iago Toral
Details | Splinter Review
Disable SOL buffers and decls when TF is disabled (5.35 KB, patch)
2014-07-02 08:42 UTC, Iago Toral
Details | Splinter Review

Description zhoujian 2014-07-02 07:46:24 UTC
System Environment:       
----------------------------------------------
Platform: IVB/HSW/BYT-M
Libdrm:(master)libdrm-2.4.54-17-ge8c3c1358ecaf4e90f7d43762357
Mesa:(master)1bfc0a11027449ae7ab7c28eb695f26de530eccf
Xserver:(master)xorg-server-1.15.99.902-121-g2f5cf9ff9a0f713b7e0386364
Xf86_video_intel: (master)2.99.912-200-ge6e5330857097eb2caafa89d571d12e
Cairo: (master)550385fb004e6064305518cf265adc03bd2d0c0b
Libva: (master)c61d8c6ce9ffc27320e9e177c1e1123d5f1b5014
Libva_intel_driver:(master)c5cb17ea86f0065a939d3636dd26651c93d497c8
Kernel:(drm-intel-nightly) git-a7665f

Bug detailed description:
----------------------------------------------
3D ( SynMark2_v5.3.0 /SynMark2_v6.0 /Lightsmarkv2008 &/warsow_v1.0 /GLBenchmarkv2.7.0) performance reduced 20%~70% on IVB/BYT-M/HSW,It works well on BDW. 
The problem exists both on gnome-session and Raw X.

It’s Mesa regression.By bisected show first bad commit is:
3178d2474ae5bdd1102fb3d76a60d1d63c961ff5
Author:     Iago Toral Quiroga <itoral@igalia.com>
AuthorDate: Tue Jun 17 13:45:18 2014 +0200
Commit:     Iago Toral Quiroga <itoral@igalia.com>
CommitDate: Mon Jun 30 08:08:50 2014 +0200
    i965: Implement GL_PRIMITIVES_GENERATED with non-zero streams.

Reproduce steps:
---------------------------------------------
1. xinit&
2. vblank_mode=0 ./backend silent 1920x1080
Comment 1 Iago Toral 2014-07-02 08:29:22 UTC
Besides this, there are other problems with this patch as well. We are currently looking into it.

Kenneth sent a couple of patches that modify mine to disable SOL statistics when doing meta operations and also to change certain configurations when transform feedback was inactive:

http://lists.freedesktop.org/archives/mesa-dev/2014-July/062571.html
http://lists.freedesktop.org/archives/mesa-dev/2014-July/062572.html

I wonder if these fix the performance drop you detected. I'll attach these patches here so you can test again (the second one needs to be modified to prevent a GPU hang, so I'll attach the updated version)

If that does not fix the problem then it would look like activating the SOL unit even when disabling output to all buffers may have too big of a performance impact, which I think Chris was not expecting. In that case maybe we have to rethink the approach to GL_PRIMITIVES_GENERATED.

Adding Kenneth and Chris to the CC just in case they have something to add.
Comment 2 Iago Toral 2014-07-02 08:41:13 UTC
Created attachment 102117 [details] [review]
Disable SOL statistics during meta operations
Comment 3 Iago Toral 2014-07-02 08:42:06 UTC
Created attachment 102118 [details] [review]
Disable SOL buffers and decls when TF is disabled

This version includes the fix I mentioned in the mailing list.
Comment 4 zhoujian 2014-07-03 01:47:54 UTC
(In reply to comment #3)
I have tried above the patch,it can fixed this issue,please upload the patch to upstream.
Comment 5 Eero Tamminen 2014-07-09 14:53:51 UTC
On HSW GT3e, this change drops 3Dgeeks' GpuTest geometry instancing test performance to 10% of the performance without the indicated change:
  MESA_GL_VERSION_OVERRIDE=3.0 ./GpuTest /test=gi

(This is unrelated to the rendering error regression visible in the test, I saw that also with 2014-06-27 git version of Mesa.)
Comment 6 Eero Tamminen 2014-07-16 10:44:18 UTC
GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e).  If there's some issue with the fix still, at least revert the patch until fix is ready.
Comment 7 Iago Toral 2014-07-16 10:53:39 UTC
(In reply to comment #6)
> GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e).  If
> there's some issue with the fix still, at least revert the patch until fix
> is ready.

Since this performance drop is specific to Haswell I think it would be enough if we keep the patch for other gens only and leave the original behavior for Haswell.

Kenneth, what do you think?
Comment 8 Eero Tamminen 2014-07-16 12:00:43 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e).  If
> > there's some issue with the fix still, at least revert the patch until fix
> > is ready.
> 
> Since this performance drop is specific to Haswell

As the bug title and description state, this isn't specific to HSW or GT3e. Eg. on BYT, TriangleList processing performance has halved and GLB T-Rex performance has dropped >10%.  On IVB results are worse, and on HSW worst.
Comment 9 Iago Toral 2014-07-16 12:13:31 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e).  If
> > > there's some issue with the fix still, at least revert the patch until fix
> > > is ready.
> > 
> > Since this performance drop is specific to Haswell
> 
> As the bug title and description state, this isn't specific to HSW or GT3e.
> Eg. on BYT, TriangleList processing performance has halved and GLB T-Rex
> performance has dropped >10%.  On IVB results are worse, and on HSW worst.

But zhoujian's comment above suggested that the patches I uploaded here fixed the performance drop on the reported platforms... Eero are you testing this with master and the two attached patches on top or with a clean checkout of master? I think Kenneth has not pushed these patches to master yet.
Comment 10 Eero Tamminen 2014-07-16 13:21:36 UTC
I tested clean master.


Btw. the patch comment states:
------------
 So far we have been using CL_INVOCATION_COUNT to resolve this query
 but this is no good with streams, as only stream 0 reaches the
 clipping stage. Instead we will use SO_PRIM_STORAGE_NEEDED which
 can keep track of the primitives sent to each individual stream.

 Since SO_PRIM_STORAGE_NEEDED is related to the SOL stage and according
 to ARB_transform_feedback3 we need to be able to query primitives
 generated in each stream whether transform feedback is active or not
 what we do is to enable the SOL unit even if transform feedback is not
 active but disable all output buffers in that case. This effectively
 disables transform feedback but permits activation of statistics
 enabling SO_PRIM_STORAGE_NEEDED even when transform feedback is not
 active.
----------

But looking at the spec:
----------
When geometry shaders are not used, or when an old geometry shader not
writing multiple streams is used, all vertices produced by the GL are
directed at the stream numbered zero.  The set of transform feedback-
related query targets is extended to accommodate multiple vertex
streams, so it is possible to count the number of processed and
recorded primitives for each stream separately.
----------

-> As the issue was with statistics for non-zero streams and those are only
used with geometry shaders, it seems that clipper invocation count can
still be used if there's no geometry shader.  None of the programs which
had regressed hugely, used geometry shaders.
Comment 11 Kenneth Graunke 2014-07-16 21:22:22 UTC
I've reverted the patch.

We'll have to come up with a proper implementation for GL_PRIMITIVES_GENERATED in the multiple streams case.  It looks like none of the counters work in all cases.  We may need to use atomics in the geometry shader to count things manually for streams 1-3, and hack stream 0 to use CL_INVOCATIONS_COUNT unless a GS program that uses streams is active, then use atomics, and add both sources together to get the final value.  Ugly, but I can't think of anything better...
Comment 12 Eero Tamminen 2014-07-18 07:54:11 UTC
(In reply to comment #9)
> But zhoujian's comment above suggested that the patches I uploaded here
> fixed the performance drop on the reported platforms... Eero are you testing
> this with master and the two attached patches on top or with a clean
> checkout of master? I think Kenneth has not pushed these patches to master
> yet.

Kenneth commented that they had failures, so they're no-go.

But I had tested performance with them: Egpyt, T-Rex and GiMark have still few (4-5%) percent regression, and basic TriList test has >30% regression (on HSW GT3e).  IMHO its too much even if the patch would work.

FYI: the max amount of primitives per frame are fairly high with these tests):
- T-Rex:   3.8M
- Egypt:   3.2M
- GiMark: 38.0M  (without instancing 420)
- Valley:  6.4M  (without instancing 4.2M)
- Heaven:  3.8M  (without instancing 3.2M)
- TriList: 6.0M 


(In reply to comment #11)
> We'll have to come up with a proper implementation for
> GL_PRIMITIVES_GENERATED in the multiple streams case.  It looks like none of
> the counters work in all cases.  We may need to use atomics in the geometry
> shader to count things manually for streams 1-3, and hack stream 0 to use
> CL_INVOCATIONS_COUNT unless a GS program that uses streams is active, then
> use atomics, and add both sources together to get the final value.  Ugly,
> but I can't think of anything better...

Do you mean that the original patch didn't provide the correct GL_PRIMITIVES_GENERATED counts, or that there are some other problems with just doing:
  if geometry shader
    use SO_PRIM_STORAGE_NEEDED // for all streams
  else
    use CL_INVOCATIONS_COUNT

(Is there some way to use non-zero streams without geometry shader?)
Comment 13 Iago Toral 2014-07-18 08:17:10 UTC
(In reply to comment #12)
> (In reply to comment #9)
> > But zhoujian's comment above suggested that the patches I uploaded here
> > fixed the performance drop on the reported platforms... Eero are you testing
> > this with master and the two attached patches on top or with a clean
> > checkout of master? I think Kenneth has not pushed these patches to master
> > yet.
> 
> Kenneth commented that they had failures, so they're no-go.
> 
> But I had tested performance with them: Egpyt, T-Rex and GiMark have still
> few (4-5%) percent regression, and basic TriList test has >30% regression
> (on HSW GT3e).  IMHO its too much even if the patch would work.

Yeah, I agree.

> FYI: the max amount of primitives per frame are fairly high with these
> tests):
> - T-Rex:   3.8M
> - Egypt:   3.2M
> - GiMark: 38.0M  (without instancing 420)
> - Valley:  6.4M  (without instancing 4.2M)
> - Heaven:  3.8M  (without instancing 3.2M)
> - TriList: 6.0M 
> 
> 
> (In reply to comment #11)
> > We'll have to come up with a proper implementation for
> > GL_PRIMITIVES_GENERATED in the multiple streams case.  It looks like none of
> > the counters work in all cases.  We may need to use atomics in the geometry
> > shader to count things manually for streams 1-3, and hack stream 0 to use
> > CL_INVOCATIONS_COUNT unless a GS program that uses streams is active, then
> > use atomics, and add both sources together to get the final value.  Ugly,
> > but I can't think of anything better...
> 
> Do you mean that the original patch didn't provide the correct
> GL_PRIMITIVES_GENERATED counts, or that there are some other problems with
> just doing:
>   if geometry shader
>     use SO_PRIM_STORAGE_NEEDED // for all streams
>   else
>     use CL_INVOCATIONS_COUNT

I think there are two problems here:

One is related to performance and I think we don't want to have that even if we could reduce it only to the case where we have geometry shaders active. The problem here is related to the fact that we need to activate the SOL unit even when we don't need to do transform feedback, and that seems to come with a severe performance penalty, as you discovered. 

Besides that, I think there are some other problems specific to Haswell where this patch is not producing correct counts. I think that multi-stream patches have not been ported to Haswell yet though, so maybe it is related to that but it could also be something else. Maybe Kenneth can confirm this.

> (Is there some way to use non-zero streams without geometry shader?)

No, streams are only available in geometry shaders.
Comment 14 zhoujian 2014-07-18 08:25:04 UTC
3DMMES2.0 hoverjet case can not fixed,other cases have fixed,so reopen this bug.
Comment 15 Eero Tamminen 2014-07-18 09:15:48 UTC
(In reply to comment #14)
> 3DMMES2.0 hoverjet case can not fixed,other cases have fixed,so reopen this
> bug.

Regressing commit was reverted (revert is identical to original commit), so it cannot anymore be causing regressions.  You need to find the right commit that regressed hoverjet *before this commit*, or regressed it more *after this commit* (before it got fixed), and file a separate bug about that.

I.e. you may need to do two bisect runs. I would suggest starting from one after this commit until the one before the revert.  If that doesn't find it, then it's before the commit discussed here.
Comment 16 zhoujian 2014-07-22 05:42:29 UTC
(In reply to comment #15)
Found the 3DMMES2.0 hoverjet case can't be fixed reason,it's build machine lack of 32 library problem,retest it can works well.
Comment 17 zhoujian 2014-07-22 05:56:08 UTC
Verified it,fixed this issue commit:git-99f8ea2


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.