Currently, loop unrolling is done in the frontend. It is based on heuristics, not on actual data from the backend about whether, and how much, unrolling makes sense for the given HW: will unrolling run out of registers for SIMD16, how many instructions fit into the cache, and so on. The backend should provide this kind of information and the frontend should use it (either before unrolling, to control how much to unroll, or after unrolling, to revert it). (The frontend should probably be fixed to move loop-invariant code out of loops before time is spent on improving unrolling itself.)
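To make the request more concrete, here is a minimal sketch of how a frontend could pick an unroll factor from backend-provided limits instead of fixed heuristics. Everything in it is hypothetical: `BackendLimits`, `LoopCost`, `pick_unroll_factor` and the linear register-pressure model are assumptions for illustration, not existing Mesa interfaces.

```python
from dataclasses import dataclass


@dataclass
class BackendLimits:
    """Hypothetical limits a backend could report for one dispatch width."""
    max_registers: int        # registers available per thread (e.g. at SIMD16)
    icache_instructions: int  # instructions that fit into the instruction cache


@dataclass
class LoopCost:
    """Hypothetical per-loop estimates computed by the frontend."""
    body_instructions: int   # instructions in one iteration of the loop body
    base_registers: int      # registers live regardless of unrolling
    regs_per_iteration: int  # extra live registers per unrolled copy (assumed linear)
    trip_count: int          # known constant trip count


def pick_unroll_factor(loop: LoopCost, limits: BackendLimits) -> int:
    """Return the largest unroll factor that divides the trip count and
    stays within both the register and the instruction-cache budget.
    A factor of 1 means "do not unroll"."""
    best = 1
    for factor in range(1, loop.trip_count + 1):
        if loop.trip_count % factor != 0:
            continue  # only consider factors that leave no remainder loop
        regs = loop.base_registers + factor * loop.regs_per_iteration
        insts = factor * loop.body_instructions
        if regs <= limits.max_registers and insts <= limits.icache_instructions:
            best = factor
    return best
```

The same check could also be run after unrolling, to decide whether an already unrolled loop should be reverted. A real cost model would almost certainly not be linear in the unroll factor; the point is only that the limits come from the backend.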
FWIW, there is existing code that detects expressions that are constant within a loop (during loop analysis). I think it could be used as a starting point for such an optimization.
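For illustration only, detecting such expressions can be sketched on a toy IR as below. This is not the existing loop analysis code; the instruction format, the function name `find_loop_invariants` and the assumption that every instruction is a pure, single-assignment operation are all made up for this example.

```python
def find_loop_invariants(loop_body, defined_outside):
    """Return the destinations of loop-body instructions whose value
    cannot change between iterations.

    loop_body: list of (dest, op, operands) tuples; each dest is assumed
               to be assigned exactly once and each op to be side-effect free.
    defined_outside: set of names (including constants) defined before the loop.
    """
    invariant = set()
    changed = True
    # Iterate to a fixed point: an instruction becomes invariant once all of
    # its operands are either defined outside the loop or already invariant.
    while changed:
        changed = False
        for dest, _op, operands in loop_body:
            if dest in invariant:
                continue
            if all(o in defined_outside or o in invariant for o in operands):
                invariant.add(dest)
                changed = True
    return invariant


# Hypothetical loop body: "scale" and "bias" do not depend on the induction
# variable "i", so they are candidates for hoisting out of the loop.
body = [
    ("i", "phi", ["zero", "i_next"]),
    ("scale", "mul", ["light_intensity", "material_gain"]),
    ("bias", "add", ["scale", "ambient"]),
    ("sample", "mul", ["bias", "i"]),
    ("i_next", "add", ["i", "one"]),
]
outside = {"zero", "one", "light_intensity", "material_gain", "ambient"}
hoistable = find_loop_invariants(body, outside)
```

Instructions reported this way could be moved in front of the loop, which shrinks the loop body before any unrolling decision is made.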
(In reply to comment #1)
> FWIW there exists code to detect expressions that are constant within a loop
> (during loop analysis), I think this could be used as starting point for
> such optimization.

Loop-invariant code handling could also be the root cause of e.g. SynMark2 PSPhong being 3x slower than it should be:
- the code doesn't get unrolled while all the loop-invariant code is inside the loop
- this results in a cascade effect that also causes other problems, e.g. stuff that on Windows GL is handled with push & no sampler changes to (unnecessarily looped) pull, which makes things sampler bound with Mesa instead of PS ALU bound
(In reply to Eero Tamminen from comment #2)
> (In reply to comment #1)
> > FWIW there exists code to detect expressions that are constant within a loop
> > (during loop analysis), I think this could be used as starting point for
> > such optimization.
>
> Loop invariant code handling could also be the root cause making e.g.
> SynMark2 PSPhong 3x slower than it should be:
> - code doesn't get unrolled with all the loop invariant code inside the loop
> - this resulting in cascade effect that causes also other problems. e.g.
> stuff which on Windows GL is handled with push & no sampler, changes to
> (unnecessarily looped) pull and makes thing sampler bound with Mesa instead
> of it being PS ALU bound

To me it looks like invariant code in the loop is not the problem in the PSPhong example. I will try to write a loop with invariant code by hand to investigate this optimization opportunity further.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1453.