Bisecting on BXT (J4205, 18 EUs) identified the following commit:
i965: Set "Subslice Hashing Mode" to 16x16 on Apollolake.
As of 4.11, the kernel isn't bothering to set the subslice hashing mode
on Apollolake, leaving it at the default of 8x8. (It initializes it to
16x4 on most platforms.)
Performance data for GPUTest Triangle on Apollolake at 1024x640:
... <max ~1% perf improvement> ...
Based on this, we choose 16x16 for Apollolake.
Skylake GT2 with X-tiled buffers appears to be a toss-up between 16x4
and 16x16, and with Y-tiled buffers it doesn't seem to really matter.
So we'll leave Skylake alone for now.
The hashing mode doesn't seem to make a measurable impact on more
Acked-by: Matt Turner <firstname.lastname@example.org>
as dropping performance in several test cases:
- 6% in SynMark v7 TerrainPanTess
- 3% in GpuTest v0.7 GiMark
- 3% in GpuTest v0.7 FurMark
- 2% in SynMark v7 TerrainFlyTess
It also causes an additional drop in maximum sampling rate, on top of bug 102258.
(Unigine Valley had also dropped by 1% somewhere around this time, but that was too small a change to bisect reliably.)
GpuTest tests were run in a HalfHD window, SynMark ones in FullHD fullscreen.
I don't yet have reliable data on the potential improvements from this commit, but around the same time:
- GpuTest Triangle does indeed seem to have improved marginally
- raw GPU texture read, copy & blend bandwidth has improved slightly
- the bug 102258 perf drop in SynMark TexMem*, TexFilterTri & GLB 2.7 Fill cases is mostly compensated
- SynMark ZBuffer test, which does a lot of depth buffer reads, also improves slightly
-> I will bisect these too (when the BXT machine is free again), to verify that they come from the same commit and to measure their exact impact.
Note: The data above is for 18 EU BXT. If the commit was tested with the 12 EU variant, it's possible that the hashing mode has less impact there.
I.e. it seems that cases depending on raw sampler throughput drop, and everything that is completely memory bandwidth bound improves a bit.
If the latter is verified by bisection, I'm not sure anything should be done about this bug, as real-world use cases are more likely to be bandwidth limited than purely sampler limited.
Both FurMark and GiMark use anisotropic filtering, but I think changing the hashing mode per draw, based on the sampling mode, would have too much overhead.
Terrain tessellation doesn't use costly filtering, so I don't yet have a good explanation for its drop.
It's possible that using a larger (16x16) area for the cross-slice load balancing checkerboard works worse for very small triangles, but the impact "should" then be visible in both the instancing and the tessellation shader terrain tests, as both run about the same number of pixel shader instances, look the same and have identical pixel shaders. Maybe a different vertex order explains the difference.
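To make the small-triangle hypothesis a bit more concrete, here is a toy model. It assumes just two subslices and a simple checkerboard where a pixel goes to subslice `((x // tile_w) + (y // tile_h)) % 2`; the real hardware hash (and the subslice layout of the 18 EU part) is not modeled, and the helper names are hypothetical. It only shows the general effect: work clustered inside one 16x16 screen block splits evenly in 8x8 or 16x4 mode, but lands entirely on one subslice in 16x16 mode.

```python
from collections import Counter

# Toy model only (assumption): two subslices alternating in a checkerboard
# of tile_w x tile_h blocks. This is NOT the actual hardware hashing function.
def subslice_for_pixel(x, y, tile_w, tile_h, num_subslices=2):
    return ((x // tile_w) + (y // tile_h)) % num_subslices

def subslice_load(pixels, tile_w, tile_h, num_subslices=2):
    """Count how many of the given pixels each subslice would shade."""
    counts = Counter(subslice_for_pixel(x, y, tile_w, tile_h, num_subslices)
                     for x, y in pixels)
    return [counts[s] for s in range(num_subslices)]

def imbalance(load):
    """Busiest subslice relative to a perfectly even split (1.0 = balanced)."""
    return max(load) / (sum(load) / len(load))

# Many tiny triangles, all covering one 16x16 block of the screen.
block = [(x, y) for x in range(16) for y in range(16)]
for name, (tw, th) in {"8x8": (8, 8), "16x4": (16, 4), "16x16": (16, 16)}.items():
    load = subslice_load(block, tw, th)
    print(name, load, imbalance(load))
```

In this model 8x8 and 16x4 give `[128, 128]` (imbalance 1.0), while 16x16 gives `[256, 0]` (imbalance 2.0). Whether tessellated terrain actually produces such clustered work more often than instanced terrain is exactly the open question above.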
Bisect verified that the memory bandwidth improvements on 18 EU BXT:
- 5% GLB 2.7 Fill, SynMark ZBuffer, GpuTest Triangle (fullscreen)
- 3% SynMark TexMem* & TexFilterTri
also came from the same Hashing Mode change that caused the perf drops.
Investigating the Fill case with performance counters shows that the 16x16 hashing mode improves the cache hit rate as expected, which explains the performance improvement.
-> Marking as wontfix; the perf regressions from the commit are an acceptable compromise given the improvements.
If there are important use cases with huge amounts of really small triangles where 16x16 can regress performance, it may make sense to expose this as a DRI conf option at context creation.
(If terrain tessellation perf drops with 16x16 mode while geometry instancing perf does not because of their different triangle ordering, that raises the question of why GPU-side tessellation produces a triangle order that aligns worse with the 16x16 hashing mode.)
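The cache side of the trade-off can be illustrated with a similar toy model. Assuming each subslice has its own cache and the same simplified two-subslice checkerboard (hypothetical helpers, not the real BXT hash or cache hierarchy), the sketch counts how many adjacent pixel pairs are shaded on different subslices. Neighbouring pixels usually touch overlapping texture and depth data, so every split pair is data that may have to be fetched into two caches instead of one.

```python
# Toy model only (assumption): two subslices alternating in a checkerboard
# of tile_w x tile_h blocks, each subslice with its own cache.
def subslice_for_pixel(x, y, tile_w, tile_h, num_subslices=2):
    return ((x // tile_w) + (y // tile_h)) % num_subslices

def cross_subslice_fraction(width, height, tile_w, tile_h):
    """Fraction of horizontally/vertically adjacent pixel pairs that are
    shaded on different subslices (a rough proxy for duplicated fetches)."""
    crossing = total = 0
    for y in range(height):
        for x in range(width):
            s = subslice_for_pixel(x, y, tile_w, tile_h)
            if x + 1 < width:
                total += 1
                crossing += s != subslice_for_pixel(x + 1, y, tile_w, tile_h)
            if y + 1 < height:
                total += 1
                crossing += s != subslice_for_pixel(x, y + 1, tile_w, tile_h)
    return crossing / total

for name, (tw, th) in {"8x8": (8, 8), "16x4": (16, 4), "16x16": (16, 16)}.items():
    print(name, round(cross_subslice_fraction(64, 64, tw, th), 3))
```

For a 64x64 region this gives about 0.111 for 8x8, 0.143 for 16x4 and 0.048 for 16x16: larger tiles keep more neighbouring work on the same subslice, which is consistent with the better cache hit rate seen in the Fill case, at the cost of coarser load balancing.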
There were also the following perf drops during the same time frame, which I assume are due to the same change:
- 2% GpuTest v0.7 PixMark Piano
- 1% GpuTest v0.7 PixMark Volplosion
And now that bug 102258 is fixed, SynMark Anisotropic filtering test performance is still 6-7% lower than it was before that bug.
Except for the terrain tessellation tests, all the regressing test cases use anisotropic filtering.
So I think both cases with a lot of really small triangles and cases where anisotropic filtering has a clear performance impact suffer from this change.
Bug 102258 may also have hidden other impacts of this change, so it makes sense to re-test the impact of the hashing mode on 18 EU BXT, where it's more visible. The affected test cases were:
- GLB 2.7 Fill*
- GpuTest v0.7 GiMark, FurMark, Piano, Volplosion & Triangle*
- SynMark v7 TerrainPanTess, TerrainFlyTess, TexFilterAniso, TexFilterTri*, TexMem128, TexMem512* & ZBuffer*
Potentially also Unigine Valley 1.0.
(*) These were improved, others regressed with 8x8 -> 16x16 change.
Looking at GLK data, Valley dropped by ~1.3% on the day of this commit, but the GLB 2.7 Fill improvement was even larger than on BXT.