System Environment: ---------------------------------------------- Platform: IVB/HSW/BYT-M Libdrm:(master)libdrm-2.4.54-17-ge8c3c1358ecaf4e90f7d43762357 Mesa:(master)1bfc0a11027449ae7ab7c28eb695f26de530eccf Xserver:(master)xorg-server-1.15.99.902-121-g2f5cf9ff9a0f713b7e0386364 Xf86_video_intel: (master)2.99.912-200-ge6e5330857097eb2caafa89d571d12e Cairo: (master)550385fb004e6064305518cf265adc03bd2d0c0b Libva: (master)c61d8c6ce9ffc27320e9e177c1e1123d5f1b5014 Libva_intel_driver:(master)c5cb17ea86f0065a939d3636dd26651c93d497c8 Kernel:(drm-intel-nightly) git-a7665f Bug detailed description: ---------------------------------------------- 3D ( SynMark2_v5.3.0 /SynMark2_v6.0 /Lightsmarkv2008 &/warsow_v1.0 /GLBenchmarkv2.7.0) performance reduced 20%~70% on IVB/BYT-M/HSW,It works well on BDW. The problem exists both on gnome-session and Raw X. Itβs Mesa regression.By bisected show first bad commit is: 3178d2474ae5bdd1102fb3d76a60d1d63c961ff5 Author: Iago Toral Quiroga <itoral@igalia.com> AuthorDate: Tue Jun 17 13:45:18 2014 +0200 Commit: Iago Toral Quiroga <itoral@igalia.com> CommitDate: Mon Jun 30 08:08:50 2014 +0200 i965: Implement GL_PRIMITIVES_GENERATED with non-zero streams. Reproduce steps: --------------------------------------------- 1. xinit& 2. vblank_mode=0 ./backend silent 1920x1080
Besides this, there are other problems with this patch as well. We are currently looking into it. Kenneth sent a couple of patches that modify mine to disable SOL statistics when doing meta operations and also to change certain configurations when transform feedback was inactive: http://lists.freedesktop.org/archives/mesa-dev/2014-July/062571.html http://lists.freedesktop.org/archives/mesa-dev/2014-July/062572.html I wonder if these fix the performance drop you detected. I'll attach these patches here so you can test again (the second one needs to be modified to prevent a GPU hang, so I'll attach the updated version) If that does not fix the problem then it would look like activating the SOL unit even when disabling output to all buffers may have too big of a performance impact, which I think Chris was not expecting. In that case maybe we have to rethink the approach to GL_PRIMITIVES_GENERATED. Adding Kenneth and Chris to the CC just in case they have something to add.
Created attachment 102117 [details] [review] Disable SOL statistics during meta operations
Created attachment 102118 [details] [review] Disable SOL buffers and decls when TF is disabled This version includes the fix I mentioned in the mailing list.
(In reply to comment #3) I have tried above the patch,it can fixed this issue,please upload the patch to upstream.
On HSW GT3e, this change drops 3Dgeeks' GpuTest geometry instancing test performance to 10% of the performance without the indicated change: MESA_GL_VERSION_OVERRIDE=3.0 ./GpuTest /test=gi (This is unrelated to the rendering error regression visible in the test, I saw that also with 2014-06-27 git version of Mesa.)
GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e). If there's some issue with the fix still, at least revert the patch until fix is ready.
(In reply to comment #6) > GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e). If > there's some issue with the fix still, at least revert the patch until fix > is ready. Since this performance drop is specific to Haswell I think it would be enough if we keep the patch for other gens only and leave the original behavior for Haswell. Kenneth, what do you think?
(In reply to comment #7) > (In reply to comment #6) > > GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e). If > > there's some issue with the fix still, at least revert the patch until fix > > is ready. > > Since this performance drop is specific to Haswell As the bug title and description state, this isn't specific to HSW or GT3e. Eg. on BYT, TriangleList processing performance has halved and GLB T-Rex performance has dropped >10%. On IVB results are worse, and on HSW worst.
(In reply to comment #8) > (In reply to comment #7) > > (In reply to comment #6) > > > GLBenchmark Egpyt & T-Rex performance drops to 1/3 (on HSW GT3e). If > > > there's some issue with the fix still, at least revert the patch until fix > > > is ready. > > > > Since this performance drop is specific to Haswell > > As the bug title and description state, this isn't specific to HSW or GT3e. > Eg. on BYT, TriangleList processing performance has halved and GLB T-Rex > performance has dropped >10%. On IVB results are worse, and on HSW worst. But zhoujian's comment above suggested that the patches I uploaded here fixed the performance drop on the reported platforms... Eero are you testing this with master and the two attached patches on top or with a clean checkout of master? I think Kenneth has not pushed these patches to master yet.
I tested clean master. Btw. the patch comment states: ------------ So far we have been using CL_INVOCATION_COUNT to resolve this query but this is no good with streams, as only stream 0 reaches the clipping stage. Instead we will use SO_PRIM_STORAGE_NEEDED which can keep track of the primitives sent to each individual stream. Since SO_PRIM_STORAGE_NEEDED is related to the SOL stage and according to ARB_transform_feedback3 we need to be able to query primitives generated in each stream whether transform feedback is active or not what we do is to enable the SOL unit even if transform feedback is not active but disable all output buffers in that case. This effectively disables transform feedback but permits activation of statistics enabling SO_PRIM_STORAGE_NEEDED even when transform feedback is not active. ---------- But looking at the spec: ---------- When geometry shaders are not used, or when an old geometry shader not writing multiple streams is used, all vertices produced by the GL are directed at the stream numbered zero. The set of transform feedback- related query targets is extended to accommodate multiple vertex streams, so it is possible to count the number of processed and recorded primitives for each stream separately. ---------- -> As the issue was with statistics for non-zero streams and those are only used with geometry shaders, it seems that clipper invocation count can still be used if there's no geometry shader. None of the programs which had regressed hugely, used geometry shaders.
I've reverted the patch. We'll have to come up with a proper implementation for GL_PRIMITIVES_GENERATED in the multiple streams case. It looks like none of the counters work in all cases. We may need to use atomics in the geometry shader to count things manually for streams 1-3, and hack stream 0 to use CL_INVOCATIONS_COUNT unless a GS program that uses streams is active, then use atomics, and add both sources together to get the final value. Ugly, but I can't think of anything better...
(In reply to comment #9) > But zhoujian's comment above suggested that the patches I uploaded here > fixed the performance drop on the reported platforms... Eero are you testing > this with master and the two attached patches on top or with a clean > checkout of master? I think Kenneth has not pushed these patches to master > yet. Kenneth commented that they had failures, so they're no-go. But I had tested performance with them: Egpyt, T-Rex and GiMark have still few (4-5%) percent regression, and basic TriList test has >30% regression (on HSW GT3e). IMHO its too much even if the patch would work. FYI: the max amount of primitives per frame are fairly high with these tests): - T-Rex: 3.8M - Egypt: 3.2M - GiMark: 38.0M (without instancing 420) - Valley: 6.4M (without instancing 4.2M) - Heaven: 3.8M (without instancing 3.2M) - TriList: 6.0M (In reply to comment #11) > We'll have to come up with a proper implementation for > GL_PRIMITIVES_GENERATED in the multiple streams case. It looks like none of > the counters work in all cases. We may need to use atomics in the geometry > shader to count things manually for streams 1-3, and hack stream 0 to use > CL_INVOCATIONS_COUNT unless a GS program that uses streams is active, then > use atomics, and add both sources together to get the final value. Ugly, > but I can't think of anything better... Do you mean that the original patch didn't provide the correct GL_PRIMITIVES_GENERATED counts, or that there are some other problems with just doing: if geometry shader use SO_PRIM_STORAGE_NEEDED // for all streams else use CL_INVOCATIONS_COUNT (Is there some way to use non-zero streams without geometry shader?)
(In reply to comment #12) > (In reply to comment #9) > > But zhoujian's comment above suggested that the patches I uploaded here > > fixed the performance drop on the reported platforms... Eero are you testing > > this with master and the two attached patches on top or with a clean > > checkout of master? I think Kenneth has not pushed these patches to master > > yet. > > Kenneth commented that they had failures, so they're no-go. > > But I had tested performance with them: Egpyt, T-Rex and GiMark have still > few (4-5%) percent regression, and basic TriList test has >30% regression > (on HSW GT3e). IMHO its too much even if the patch would work. Yeah, I agree. > FYI: the max amount of primitives per frame are fairly high with these > tests): > - T-Rex: 3.8M > - Egypt: 3.2M > - GiMark: 38.0M (without instancing 420) > - Valley: 6.4M (without instancing 4.2M) > - Heaven: 3.8M (without instancing 3.2M) > - TriList: 6.0M > > > (In reply to comment #11) > > We'll have to come up with a proper implementation for > > GL_PRIMITIVES_GENERATED in the multiple streams case. It looks like none of > > the counters work in all cases. We may need to use atomics in the geometry > > shader to count things manually for streams 1-3, and hack stream 0 to use > > CL_INVOCATIONS_COUNT unless a GS program that uses streams is active, then > > use atomics, and add both sources together to get the final value. Ugly, > > but I can't think of anything better... > > Do you mean that the original patch didn't provide the correct > GL_PRIMITIVES_GENERATED counts, or that there are some other problems with > just doing: > if geometry shader > use SO_PRIM_STORAGE_NEEDED // for all streams > else > use CL_INVOCATIONS_COUNT I think there are two problems here: One is related to performance and I think we don't want to have that even if we could reduce it only to the case where we have geometry shaders active. The problem here is related to the fact that we need to activate the SOL unit even when we don't need to do transform feedback, and that seems to come with a severe performance penalty, as you discovered. Besides that, I think there are some other problems specific to Haswell where this patch is not producing correct counts. I think that multi-stream patches have not been ported to Haswell yet though, so maybe it is related to that but it could also be something else. Maybe Kenneth can confirm this. > (Is there some way to use non-zero streams without geometry shader?) No, streams are only available in geometry shaders.
3DMMES2.0 hoverjet case can not fixed,other cases have fixed,so reopen this bug.
(In reply to comment #14) > 3DMMES2.0 hoverjet case can not fixed,other cases have fixed,so reopen this > bug. Regressing commit was reverted (revert is identical to original commit), so it cannot anymore be causing regressions. You need to find the right commit that regressed hoverjet *before this commit*, or regressed it more *after this commit* (before it got fixed), and file a separate bug about that. I.e. you may need to do two bisect runs. I would suggest starting from one after this commit until the one before the revert. If that doesn't find it, then it's before the commit discussed here.
(In reply to comment #15) Found the 3DMMES2.0 hoverjet case can't be fixed reason,it's build machine lack of 32 library problem,retest it can works well.
Verified it,fixed this issue commit:git-99f8ea2
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.