|Summary:||6% performance drop in GpuTest v0.7 Volplosion with "remove GLSL IR optimisation loop" (-> fails to compile as SIMD16)|
|Product:||Mesa||Reporter:||Eero Tamminen <eero.t.tamminen>|
|Component:||Drivers/DRI/i965||Assignee:||Intel 3D Bugs Mailing List <intel-3d-bugs>|
|Status:||RESOLVED MOVED||QA Contact:||Intel 3D Bugs Mailing List <intel-3d-bugs>|
|Priority:||medium||CC:||kenneth, mattst88, t_arceri|
|i915 platform:||i915 features:|
|Attachments:||Register liveness & fragmentation for the SIMD16 assembly before the change|
Description Eero Tamminen 2017-05-03 13:36:40 UTC
Patch series culminating in this commit changed performance in several tests: ------------------------------------------------------- commit ad55b1a7701ad51234af3b9fc30f4c54d2546b86 Author: Timothy Arceri <firstname.lastname@example.org> AuthorDate: Wed Jan 18 10:28:22 2017 +1100 Commit: Timothy Arceri <email@example.com> CommitDate: Mon Apr 24 12:08:14 2017 +1000 i965: remove GLSL IR optimisation loop IVB is running into some spilling issues in piglit with the loop removed. However those tests are not really reflective of a real world use case, also fp64 is brand new to IVB so we leave the spilling issues to be resolved at a later time. ------------------------------------------------------- On most of the platforms GpuTest v0.7 PixMark Volplosion test performance dropped by ~6%, on SKL GT4e, a bit more. (There were few performance improvement around same time, but they were small enough that I couldn't bisect them to this commit, besides the SynMark shader compilation speed test which improved by ~9%.) Volplosion test is fairly pure ALU test. While it has one copy shader, none of the other shaders do any texture accesses, including the heavy main fragment shader, i.e. texture access scheduling doesn't affect it.
Comment 1 Eero Tamminen 2017-05-03 14:18:12 UTC
After the change, the main fragment shader doesn't anymore compile as SIMD16 (benchmark seems to compile shader 7 twice so you get the warning twice): FS compile failed: Failure to register allocate. Reduce number of live scalar values to avoid this. Resulting SIMD8 shader has about the same amount of instructions (nearly two thousand) as earlier, with following changes: - One mad() more - One add() & mul() less - slightly more register bank conflicts - at max, 71 live registers instead of earlier 61 -> somehow this change messes up the current register allocation. Before the change, the SIMD16 variant of the shader had at max 109 live registers.
Comment 2 Matt Turner 2017-05-04 02:26:15 UTC
Looks like this regression could/should have been caught by shader-db: LOST: shaders/closed/gputest/volplosion/7.shader_test FS SIMD16 Separately, Curro is having an interesting time with this particular shader with his new scheduler. Perhaps if that is sorted out, the regression will be fixed.
Comment 3 Kenneth Graunke 2017-05-04 03:09:02 UTC
(In reply to Matt Turner from comment #2) > Looks like this regression could/should have been caught by shader-db: > > LOST: shaders/closed/gputest/volplosion/7.shader_test FS SIMD16 Sorry about that - I saw 48 LOST and 48 GAINED in a random assortment of programs and didn't think much of it. There are always some teetering on the edge, and most of them aren't crazy ALU-bound like volplosion where it makes such a large difference.
Comment 4 Eero Tamminen 2017-05-04 10:08:42 UTC
Created attachment 131197 [details] Register liveness & fragmentation for the SIMD16 assembly before the change
Comment 5 Timothy Arceri 2017-05-05 02:53:35 UTC
(In reply to Matt Turner from comment #2) > Looks like this regression could/should have been caught by shader-db: > > LOST: shaders/closed/gputest/volplosion/7.shader_test FS SIMD16 > > > Separately, Curro is having an interesting time with this particular shader > with his new scheduler. Perhaps if that is sorted out, the regression will > be fixed. Right. As Ken said there were differences noticed, but it would be pretty difficult to not see any difference considering the change. The results from BDW as per the commit message were: LOST: 17 GAINED: 40 Looking at the output the only real difference I can see is the ordering of the final nir instructions which seems makes a big diff to the scheduler. It appears when the vectors are split in nir the instructions seem to be grouped together as per the execution order of the vectors in the shader, in contrast the GLSL IR the instructions seem to be order such that they are grouped slightly more towards executing more instructions on a channel one after the other. This is probably enough to free up some registers. I suspect the only thing we can do is wait for Curros new scheduler. We could revert the change, but that just feels like it hides the problem rather than fixes it and would be a step backwards IMO.
Comment 6 Timothy Arceri 2017-05-05 03:00:21 UTC
By the way I should also note that I did look into a large number of the regressions when working on this series (otherwise I couldn't have gotten to this point) and all the remaining regressions (besides a couple of instructions here and there) appeared to be scheduler related as per the commit message.
Comment 7 Eero Tamminen 2017-05-16 15:59:32 UTC
I was able to do better bisection with ezBench on SKL GT2, there drop to Volplosion was slightly smaller (5%), and FurMark actually improved by 1%. Previous "nir: shuffle constants to the top" commit also affected both of these tests. According to ezBench, on SKL GT2 Volplosion dropped by 1% and FurMark improved by 2%.
Comment 8 Eero Tamminen 2019-04-16 13:24:01 UTC
Most of the perf regression got fixed less than year ago, in Jan/Feb 2018, but main shader still fails to build as SIMD16 and therefore perf remains below original level.
Comment 9 GitLab Migration User 2019-09-25 19:02:05 UTC
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1593.