Bug 94477

Summary: Too many temporary expressions in shader freeze glLinkProgram
Product: Mesa Reporter: Martina Kollarova <martina.kollarova>
Component: Drivers/DRI/i965Assignee: Ian Romanick <idr>
Status: RESOLVED MOVED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: eero.t.tamminen, kenneth, lemody, yang.gu
Version: 11.2   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: glxinfo output
patch to reduce time spent
fix
use hash in opt_copy_propagate
use hash in copy propagation elements

Description Martina Kollarova 2016-03-10 15:00:15 UTC
Created attachment 122206 [details]
glxinfo output

This bug is causing a webgl conformance test in Chromium to freeze and time out. https://bugs.chromium.org/p/chromium/issues/detail?id=593680

The shader contains something like:

  temp += 
      u_uniform +
      u_uniform -
      u_uniform /
      u_uniform *
      u_uniform +
      u_uniform -
      u_uniform /
      u_uniform *
      u_uniform +
      u_uniform -
      u_uniform ;

repeated 1000 times. It freezes forever (I waited 30 minutes) on glLinkProgram. It doesn't freeze if it's repeated only ~100 times.
I created a small reproducer that uses approximately the same shaders as the test in Chromium: https://github.com/mkollaro/opengl_snippets, you can try it out with:

mkdir build && cd build && cmake .. && make && ./bin/shader_freeze
Comment 1 Matt Turner 2016-03-11 20:08:25 UTC
Thank you for the excellent bug report. I wish the average report was half this good.

Running your example program with a profiler shows that the do_copy_propagation and do_copy_propagation_elements passes are taking all the time.

I suspect this is because internally they use linked lists for the acp/kill sets when they should be using hash tables. Ken fixed a similar problem in commit 4654439fd.

I'll look into making this change.
Comment 2 Martina Kollarova 2016-04-11 14:01:07 UTC
*** Bug 94302 has been marked as a duplicate of this bug. ***
Comment 3 Matt Turner 2016-07-25 18:22:39 UTC
I've come back to this recently.

The first problem appears to be in the do_copy_propagation() pass. visit_leave(ir_assignment *) calls kill(), and kill() does a linked-list walk across many thousands of elements in the "ACP". visit_leave() executes for each assignment. Printing the list length in kill() shows it to continually increase, so I'm not sure if replacing the ACP with a different data structure is sufficient or if there is another bug...

Replacing the ACP with a hash table, for instance, isn't completely straightforward, because we need to be able to look up entries by "lhs", of which there only may be one, and "rhs", of which there may be many (or none).

do_copy_propagation_elements() suffers from the same problem.

But neither of these would be a problem if we didn't call brw_do_channel_expressions(). It makes a huge mess of the program, massively increasing the number of assignments.

Removing it has been something we want to do for a long time, but we aren't there yet because it actually hurts programs in our shader-db if we remove it.
Comment 4 Tapani Pälli 2016-09-02 07:34:37 UTC
Created attachment 126167 [details] [review]
patch to reduce time spent

Martina, could you test with this patch? For me it reduces time spent in the test case as well with the WebGL test. I did not spot any regressions with this so maybe something like this could be done to help reducing time spent in optimizations and lowering passes.
Comment 5 Yang Gu 2016-09-02 08:52:22 UTC
I tried this patch, and it can improve the time consumed a lot for WebGL case conformance/glsl/bugs/temp-expressions-should-not-crash.html (https://bugs.freedesktop.org/show_bug.cgi?id=94302). On my skylake machine, it's from 610s to 130s.
Comment 6 Tapani Pälli 2016-09-02 10:09:37 UTC
Created attachment 126170 [details] [review]
fix

This should result in same effect as the first experimentation, I will run testing on this and if it works fine then I'll send it to mesa-dev mailing list.
Comment 7 Tapani Pälli 2016-09-02 11:28:23 UTC
I run some testing and unfortunately there are regressions with my patch, it won't work correctly for cases where there inf/nan (division by zero), I guess there is no way to make it to .. will look if I can come up with any other ideas
Comment 8 Matt Turner 2016-09-02 17:51:37 UTC
(In reply to Tapani Pälli from comment #6)
> Created attachment 126170 [details] [review] [review]
> fix
> 
> This should result in same effect as the first experimentation, I will run
> testing on this and if it works fine then I'll send it to mesa-dev mailing
> list.

That's pretty clever, but unfortunately it cannot work. Cases of dividing by zero, inf, nan, etc cannot be handled. We don't do this optimization in opt_algebraic for exactly that reason.

On top of that, the purpose of the test is to check whether you can compile a huge expression, and the one it uses is hilariously unrepresentative and prone to abuse by hacks like this one :)

The test has uncovered real flaws in our compiler: GLSL IR's copy propagation runs in O(n^2), and the channel_expressions pass creates a huge number of expressions.

Ken and I have spent time recently trying to handle the shader-db regressions by disabling channel_expressions. It's slow going and other tasks keep taking priority.

The most direct way forward then, is to improve copy propagate's run time like I described in comment 3. Maybe you want to take a stab at replacing the linked lists in the pass with hash tables? I spent some time on it... maybe you'll have better luck. :)
Comment 9 Tapani Pälli 2016-09-05 10:39:33 UTC
(In reply to Matt Turner from comment #8)
> (In reply to Tapani Pälli from comment #6)
> > Created attachment 126170 [details] [review] [review] [review]
> > fix
> > 
> > This should result in same effect as the first experimentation, I will run
> > testing on this and if it works fine then I'll send it to mesa-dev mailing
> > list.
> 
> That's pretty clever, but unfortunately it cannot work. Cases of dividing by
> zero, inf, nan, etc cannot be handled. We don't do this optimization in
> opt_algebraic for exactly that reason.

right .. I knew there to be something missing as it was too good to be true :)

> On top of that, the purpose of the test is to check whether you can compile
> a huge expression, and the one it uses is hilariously unrepresentative and
> prone to abuse by hacks like this one :)
> 
> The test has uncovered real flaws in our compiler: GLSL IR's copy
> propagation runs in O(n^2), and the channel_expressions pass creates a huge
> number of expressions.
> 
> Ken and I have spent time recently trying to handle the shader-db
> regressions by disabling channel_expressions. It's slow going and other
> tasks keep taking priority.
> 
> The most direct way forward then, is to improve copy propagate's run time
> like I described in comment 3. Maybe you want to take a stab at replacing
> the linked lists in the pass with hash tables? I spent some time on it...
> maybe you'll have better luck. :)

I made a wrapper class for 'acp' that internally uses hash table but provides some methods of exec_list (just to easily experiment with this). I'm seeing some performance gains but not nearly as big as with my hack. I'll try to do the same for 'propagation_elements' and see if total effect would be big enough to continue with this exercise.
Comment 10 Tapani Pälli 2016-09-06 07:33:13 UTC
Created attachment 126236 [details] [review]
use hash in opt_copy_propagate

this helps a bit but probably not enough
Comment 11 Martina Kollarova 2016-09-09 14:28:29 UTC
The last patch in comment #10 seems to have about halved the time it takes to link (using the reproducer - in Chromium it's hard to tell because it tends to freeze the whole Xserver somehow and I didn't manage to make it actually finish, though maybe I just didn't wait enough).
Comment 12 Tapani Pälli 2016-09-14 06:31:42 UTC
I have upcoming 2nd patch which should help more, there's some regressions with the patch though so I'm trying to figure out those before.
Comment 13 Tapani Pälli 2016-09-15 10:23:07 UTC
Created attachment 126539 [details] [review]
use hash in copy propagation elements

Martina, can you try out this patch, this together with first one *should* give nice speedup.
Comment 14 Martina Kollarova 2016-09-16 13:44:48 UTC
Thanks. Patch in comment #13, together with the previous one, speeds it up about 8x. Linking the reproducer currently takes about 2 minutes 25 seconds on average.
Comment 15 Yang Gu 2016-09-18 11:20:51 UTC
With these 2 patches, WebGL case conformance/glsl/bugs/temp-expressions-should-not-crash.html (https://bugs.freedesktop.org/show_bug.cgi?id=94302) is improved from 610s to 280s.
Comment 16 Yang Gu 2016-09-26 06:09:07 UTC
When will the second patch land?
Comment 17 Tapani Pälli 2016-09-26 10:46:32 UTC
(In reply to Yang Gu from comment #16)
> When will the second patch land?

Code review has not been done, I'll commit this as soon as patch gets reviewed. There's a lot of traffic in mesa-dev so unfortunately sometimes this can take quite a bit of time.
Comment 18 Yang Gu 2016-09-26 11:12:29 UTC
No problem and thanks a lot!
Comment 19 Tapani Pälli 2016-09-28 11:05:20 UTC
There is debate if the copy_propagate_elements pass should be removed instead of optimized. I tried removing it but it seems to cause functional regression, with only 1 test though: arb_gpu_shader5-interpolateAtSample-dynamically-nonuniform. CC Ken for possible hints what is going wrong with this and how to proceed ...
Comment 20 Tapani Pälli 2016-10-10 05:11:13 UTC
(In reply to Yang Gu from comment #16)
> When will the second patch land?

It's now available in Mesa master as commit c64093e. Does this fix the bug or does the test still timeout?
Comment 21 Yang Gu 2016-10-10 10:01:07 UTC
I just had a try with latest ToT code, and this case took 262s, while current budget for each WebGL conformance test case is 20s. I think you already did a very great improvement from 610s to 262s, and it might be hard to further improve this to be within 20s. I will talk with Khronos to see if we can ask more budget specially for this case. 
Thank you very much!
Comment 22 Tapani Pälli 2016-10-10 10:15:12 UTC
(In reply to Yang Gu from comment #21)
> I just had a try with latest ToT code, and this case took 262s, while
> current budget for each WebGL conformance test case is 20s. I think you
> already did a very great improvement from 610s to 262s, and it might be hard
> to further improve this to be within 20s. I will talk with Khronos to see if
> we can ask more budget specially for this case. 
> Thank you very much!

You are welcome :) 20 secs is a really tough budget for us. Upcoming shader cache will sort of workaround this but even then the first run will be slow.

It would be interesting to know how fast Nvidia does without shader cache, you should be able to disable cache by removing '$HOME/.nv/GLCache' or exporting __GL_SHADER_DISK_CACHE=false before running the app.

I assume we *should* be able to do better than 262s, will need to do some profiling to see. One thing to try would be Marek's patches to use jemalloc instead of libc's malloc. This test causes extensive usage of exec_list and ralloc since there are so many temporary variables being created and destroyed.
Comment 23 Yang Gu 2016-10-10 10:28:32 UTC
I tried both ways to disable cache (remove GL_Cache and set __GL_SHADER_DISK_CACHE). But w/ or w/o cache, the result of Nvidia seems similar: 500-600 "ms". What kind of magic Nvidia does to make it so fast!
BTW, jemalloc may help a bit, but I don't think it would improve much.
Comment 24 Tapani Pälli 2016-10-10 10:40:04 UTC
(In reply to Yang Gu from comment #23)
> I tried both ways to disable cache (remove GL_Cache and set
> __GL_SHADER_DISK_CACHE). But w/ or w/o cache, the result of Nvidia seems
> similar: 500-600 "ms". What kind of magic Nvidia does to make it so fast!

For me it sounds like it is using some cache, maybe it is actually chrome's shader cache in question here? AFAIK chrome implements such cache via GL_ARB_get_program_binary when supported, try passing '--disable-gpu-program-cache' for it (?)
Comment 25 Yang Gu 2016-10-11 06:02:25 UTC
I tried with both "--disable-gpu-program-cache" and "--disable-gpu-shader-disk-cache", and it seems to take effect. Now it doubles the previous time, about 1200ms.
Comment 26 Tapani Pälli 2016-10-11 11:16:49 UTC
here you can find experimental branch that has a bit older Mesa with both optimization patches, Marek's allocator changes and some fixes to backend:

https://cgit.freedesktop.org/~tpalli/mesa/log/?h=kukkakaali

I will try to get some performance figures with this and Martina's test.
Comment 27 Yang Gu 2016-10-13 06:53:06 UTC
BTW, I just had a check with similiar HW and closed source Intel driver on Windows, and it took about 7s for temp-expressions-should-not-crash.html after disabling the cache.
Comment 28 Tapani Pälli 2016-11-08 08:49:11 UTC
Is this still a problem, I saw a report that we would be passing 100% webgl tests on Linux now?
Comment 29 Yang Gu 2016-11-08 16:44:44 UTC
The 20s budget is hard coded in WebGL Conformance Test Suite, but I don't know why it wouldn't always take effect. I can run the whole tests including this case smoothly on local machine, so I don't think this is a blocker issue for WebGL 2 release. Google Chrome bots may still fail as they have strict timeout mechanism. 
Anyway, I'm fine to close this bug now.
Comment 30 Thomas Helland 2017-01-08 23:03:32 UTC
See [1] for a patch that pretty much fixes the problem in opt_copy_propagation. It still has to be regression tested with piglit, etc, but it looks promising.

[1] https://lists.freedesktop.org/archives/mesa-dev/2017-January/139693.html
Comment 31 Thomas Helland 2017-01-08 23:04:16 UTC
See [1] for a patch that pretty much fixes the problem in opt_copy_propagation. It still has to be regression tested with piglit, etc, but it looks promising.

[1] https://lists.freedesktop.org/archives/mesa-dev/2017-January/139693.html
Comment 32 Thomas Helland 2017-01-16 18:17:56 UTC
The patch I posted does not work as it should. It will end up doing some bad things when there are multiple blocks in play. It might be possible to create copies of the acp_entries and fix up the pointers when decending into a block. I will try to cook up a refined example shader where the patch fails.
Comment 33 GitLab Migration User 2019-09-25 18:56:36 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1518.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.