Created attachment 122206 [details]
This bug is causing a WebGL conformance test in Chromium to freeze and time out. https://bugs.chromium.org/p/chromium/issues/detail?id=593680
The shader contains something like:
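(The actual shader is in the attachment; as a loudly hypothetical sketch of the shape being described, it is a statement along these lines:)

```glsl
// Hypothetical reconstruction -- see the attachment for the real snippet.
// Each repetition feeds the previous temporary back into a new expression:
tmp = tmp * tmp + tmp;
```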
repeated 1000 times. It freezes indefinitely (I waited 30 minutes) in glLinkProgram. It doesn't freeze if it's repeated only ~100 times.
I created a small reproducer that uses approximately the same shaders as the test in Chromium: https://github.com/mkollaro/opengl_snippets. You can try it out with:
mkdir build && cd build && cmake .. && make && ./bin/shader_freeze
Thank you for the excellent bug report. I wish the average report was half this good.
Running your example program with a profiler shows that the do_copy_propagation and do_copy_propagation_elements passes are taking all the time.
I suspect this is because internally they use linked lists for the acp/kill sets when they should be using hash tables. Ken fixed a similar problem in commit 4654439fd.
I'll look into making this change.
*** Bug 94302 has been marked as a duplicate of this bug. ***
I've come back to this recently.
The first problem appears to be in the do_copy_propagation() pass. visit_leave(ir_assignment *) calls kill(), and kill() does a linked-list walk across many thousands of elements in the "ACP". visit_leave() executes for each assignment. Printing the list length in kill() shows it to continually increase, so I'm not sure if replacing the ACP with a different data structure is sufficient or if there is another bug...
Replacing the ACP with a hash table, for instance, isn't completely straightforward, because we need to be able to look up entries by "lhs", of which there may only be one, and by "rhs", of which there may be many (or none).
do_copy_propagation_elements() suffers from the same problem.
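A minimal sketch (hypothetical names and types, not Mesa's actual data structures) of the dual-index idea: each entry is reachable by its single lhs through a hash map and by any of its rhs uses through a multimap, so kill() touches only the matching entries instead of walking the whole ACP:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an ACP entry: "lhs is a copy of rhs".
struct AcpEntry {
    std::string lhs;
    std::string rhs;
    bool alive = true;
};

class AcpTable {
    std::vector<AcpEntry> entries;                       // stable index storage
    std::unordered_map<std::string, size_t> by_lhs;      // each lhs appears at most once
    std::unordered_multimap<std::string, size_t> by_rhs; // a rhs may appear many times
public:
    void add(const std::string &lhs, const std::string &rhs) {
        size_t i = entries.size();
        entries.push_back({lhs, rhs});
        by_lhs[lhs] = i;
        by_rhs.emplace(rhs, i);
    }
    // Invalidate every entry that reads or writes 'var', without a full-list walk.
    void kill(const std::string &var) {
        auto l = by_lhs.find(var);
        if (l != by_lhs.end()) {
            entries[l->second].alive = false;
            by_lhs.erase(l);
        }
        auto range = by_rhs.equal_range(var);
        for (auto it = range.first; it != range.second; ++it)
            entries[it->second].alive = false;
        by_rhs.erase(var);
    }
    // Copy-propagation query: what does 'var' currently alias, if anything?
    const AcpEntry *find(const std::string &var) const {
        auto l = by_lhs.find(var);
        if (l == by_lhs.end() || !entries[l->second].alive)
            return nullptr;
        return &entries[l->second];
    }
};
```

The kill() cost becomes proportional to the number of entries actually mentioning the killed variable, rather than to the total size of the ACP.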
But neither of these would be a problem if we didn't call brw_do_channel_expressions(). It makes a huge mess of the program, massively increasing the number of assignments.
Removing it is something we have wanted to do for a long time, but we aren't there yet, because removing it actually hurts programs in our shader-db.
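As an illustration (conceptual, not Mesa's literal IR output), channel-expression lowering scalarizes each vector operation, so a single vec4 assignment becomes four assignments that copy propagation must then track and kill:

```glsl
// Before lowering: one assignment.
vec4 v = a * b + c;

// After per-channel lowering (conceptually): four assignments.
v.x = a.x * b.x + c.x;
v.y = a.y * b.y + c.y;
v.z = a.z * b.z + c.z;
v.w = a.w * b.w + c.w;
```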
Created attachment 126167 [details] [review]
patch to reduce time spent
Martina, could you test with this patch? For me it reduces the time spent both in the test case and in the WebGL test. I did not spot any regressions with it, so maybe something like this could be done to help reduce the time spent in optimization and lowering passes.
I tried this patch, and it greatly reduces the time consumed by the WebGL case conformance/glsl/bugs/temp-expressions-should-not-crash.html (https://bugs.freedesktop.org/show_bug.cgi?id=94302). On my Skylake machine, it goes from 610s to 130s.
Created attachment 126170 [details] [review]
This should have the same effect as the first experiment. I will run testing on this, and if it works fine I'll send it to the mesa-dev mailing list.
I ran some testing and unfortunately there are regressions with my patch: it won't work correctly for cases where there are inf/nan values (division by zero). I guess there is no way to make it .. will look if I can come up with any other ideas
(In reply to Tapani Pälli from comment #6)
> Created attachment 126170 [details] [review] [review]
> This should result in same effect as the first experimentation, I will run
> testing on this and if it works fine then I'll send it to mesa-dev mailing
That's pretty clever, but unfortunately it cannot work. Cases of dividing by zero, inf, nan, etc cannot be handled. We don't do this optimization in opt_algebraic for exactly that reason.
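A small demonstration (mine, not from the thread) of why such algebraic folds are unsafe under IEEE-754 arithmetic: `x / x` is not always 1 and `x - x` is not always 0 once zero, infinity, or NaN enter the picture:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Would folding x / x -> 1.0 preserve the result for this input?
inline bool div_fold_safe(float x) {
    float folded = 1.0f;
    float actual = x / x;
    // NaN compares unequal to everything, so 0/0 and inf/inf can never
    // match the folded constant.
    return actual == folded;
}

// Would folding x - x -> 0.0 preserve the result for this input?
inline bool sub_fold_safe(float x) {
    return (x - x) == 0.0f;
}
```

This is exactly the class of inputs that makes an optimizer refuse such rewrites unless it can prove the operand is finite and nonzero.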
On top of that, the purpose of the test is to check whether you can compile a huge expression, and the one it uses is hilariously unrepresentative and prone to abuse by hacks like this one :)
The test has uncovered real flaws in our compiler: GLSL IR's copy propagation runs in O(n^2), and the channel_expressions pass creates a huge number of expressions.
Ken and I have spent time recently trying to handle the shader-db regressions by disabling channel_expressions. It's slow going and other tasks keep taking priority.
The most direct way forward, then, is to improve copy propagation's run time like I described in comment 3. Maybe you want to take a stab at replacing the linked lists in the pass with hash tables? I spent some time on it... maybe you'll have better luck. :)
(In reply to Matt Turner from comment #8)
> (In reply to Tapani Pälli from comment #6)
> > Created attachment 126170 [details] [review] [review] [review]
> > fix
> > This should result in same effect as the first experimentation, I will run
> > testing on this and if it works fine then I'll send it to mesa-dev mailing
> > list.
> That's pretty clever, but unfortunately it cannot work. Cases of dividing by
> zero, inf, nan, etc cannot be handled. We don't do this optimization in
> opt_algebraic for exactly that reason.
right .. I knew something had to be missing, as it was too good to be true :)
> On top of that, the purpose of the test is to check whether you can compile
> a huge expression, and the one it uses is hilariously unrepresentative and
> prone to abuse by hacks like this one :)
> The test has uncovered real flaws in our compiler: GLSL IR's copy
> propagation runs in O(n^2), and the channel_expressions pass creates a huge
> number of expressions.
> Ken and I have spent time recently trying to handle the shader-db
> regressions by disabling channel_expressions. It's slow going and other
> tasks keep taking priority.
> The most direct way forward then, is to improve copy propagate's run time
> like I described in comment 3. Maybe you want to take a stab at replacing
> the linked lists in the pass with hash tables? I spent some time on it...
> maybe you'll have better luck. :)
I made a wrapper class for 'acp' that internally uses a hash table but provides some of exec_list's methods (just to experiment with this easily). I'm seeing some performance gains, but not nearly as big as with my hack. I'll try to do the same for 'propagation_elements' and see whether the total effect is big enough to continue with this exercise.
Created attachment 126236 [details] [review]
use hash in opt_copy_propagate
this helps a bit but probably not enough
The last patch in comment #10 seems to have roughly halved the time it takes to link (using the reproducer - in Chromium it's hard to tell because it tends to freeze the whole X server somehow and I didn't manage to make it actually finish, though maybe I just didn't wait long enough).
I have an upcoming second patch which should help more; there are some regressions with it though, so I'm trying to figure those out first.
Created attachment 126539 [details] [review]
use hash in copy propagation elements
Martina, can you try out this patch? Together with the first one it *should* give a nice speedup.
Thanks. Patch in comment #13, together with the previous one, speeds it up about 8x. Linking the reproducer currently takes about 2 minutes 25 seconds on average.
With these 2 patches, WebGL case conformance/glsl/bugs/temp-expressions-should-not-crash.html (https://bugs.freedesktop.org/show_bug.cgi?id=94302) is improved from 610s to 280s.
When will the second patch land?
(In reply to Yang Gu from comment #16)
> When will the second patch land?
Code review has not been done yet; I'll commit this as soon as the patch gets reviewed. There's a lot of traffic on mesa-dev, so unfortunately this can sometimes take quite a bit of time.
No problem and thanks a lot!
There is debate about whether the copy_propagate_elements pass should be removed rather than optimized. I tried removing it, but it seems to cause a functional regression, though in only one test: arb_gpu_shader5-interpolateAtSample-dynamically-nonuniform. CC'ing Ken for possible hints on what is going wrong here and how to proceed ...
(In reply to Yang Gu from comment #16)
> When will the second patch land?
It's now available in Mesa master as commit c64093e. Does this fix the bug, or does the test still time out?
I just tried the latest ToT code, and this case took 262s, while the current budget for each WebGL conformance test case is 20s. Going from 610s to 262s is already a great improvement, and it might be hard to get this within 20s. I will talk with Khronos to see if we can ask for a bigger budget specifically for this case.
Thank you very much!
(In reply to Yang Gu from comment #21)
> I just had a try with latest ToT code, and this case took 262s, while
> current budget for each WebGL conformance test case is 20s. I think you
> already did a very great improvement from 610s to 262s, and it might be hard
> to further improve this to be within 20s. I will talk with Khronos to see if
> we can ask more budget specially for this case.
> Thank you very much!
You are welcome :) 20 secs is a really tough budget for us. The upcoming shader cache will sort of work around this, but even then the first run will be slow.
It would be interesting to know how fast Nvidia is without a shader cache; you should be able to disable the cache by removing '$HOME/.nv/GLCache' or exporting __GL_SHADER_DISK_CACHE=false before running the app.
I assume we *should* be able to do better than 262s; I'll need to do some profiling to see. One thing to try would be Marek's patches to use jemalloc instead of libc's malloc. This test causes extensive usage of exec_list and ralloc, since there are so many temporary variables being created and destroyed.
I tried both ways to disable the cache (removing GLCache and setting __GL_SHADER_DISK_CACHE). But with or without the cache, the result on Nvidia seems similar: 500-600 "ms". What kind of magic does Nvidia do to make it so fast!
BTW, jemalloc may help a bit, but I don't think it would improve much.
(In reply to Yang Gu from comment #23)
> I tried both ways to disable cache (remove GL_Cache and set
> __GL_SHADER_DISK_CACHE). But w/ or w/o cache, the result of Nvidia seems
> similar: 500-600 "ms". What kind of magic Nvidia does to make it so fast!
To me it sounds like it is using some cache; maybe it is actually Chrome's shader cache in question here? AFAIK Chrome implements such a cache via GL_ARB_get_program_binary when supported; try passing '--disable-gpu-program-cache' to it (?)
I tried with both "--disable-gpu-program-cache" and "--disable-gpu-shader-disk-cache", and it seems to take effect. It now takes about double the previous time, roughly 1200ms.
Here you can find an experimental branch with a slightly older Mesa that has both optimization patches, Marek's allocator changes, and some fixes to the backend:
I will try to get some performance figures with this and Martina's test.
BTW, I just had a check with similar HW and the closed-source Intel driver on Windows, and it took about 7s for temp-expressions-should-not-crash.html after disabling the cache.
Is this still a problem? I saw a report that we are now passing 100% of WebGL tests on Linux.
The 20s budget is hard-coded in the WebGL Conformance Test Suite, but I don't know why it doesn't always take effect. I can run the whole suite, including this case, smoothly on a local machine, so I don't think this is a blocker for the WebGL 2 release. Google Chrome bots may still fail, as they have a strict timeout mechanism.
Anyway, I'm fine with closing this bug now.
See  for a patch that pretty much fixes the problem in opt_copy_propagation. It still has to be regression tested with piglit, etc, but it looks promising.
The patch I posted does not work as it should. It will end up doing some bad things when there are multiple blocks in play. It might be possible to create copies of the acp_entries and fix up the pointers when descending into a block. I will try to cook up a refined example shader where the patch fails.
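One possible shape for the fix floated above (hypothetical, not Mesa code): give the pass a per-block view of its state that is cloned on block entry, so kills inside a nested block only touch the copy, and the enclosing scope's entries are merged back conservatively afterwards:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical copy-propagation state: variable -> the variable it copies.
using AcpState = std::unordered_map<std::string, std::string>;

// Entering a nested block (if/else arm, loop body): work on a copy of the
// parent's entries so writes inside the block only invalidate the copy.
AcpState enter_block(const AcpState &parent) {
    return parent; // deep copy; the parent stays valid for the later merge
}

// After a conditionally executed block, keep only entries that survived
// unchanged on BOTH paths -- a conservative intersection.
AcpState merge_block(const AcpState &parent, const AcpState &block) {
    AcpState out;
    for (const auto &kv : parent) {
        auto it = block.find(kv.first);
        if (it != block.end() && it->second == kv.second)
            out.insert(kv);
    }
    return out;
}
```

The intersection-on-merge is the conservative choice: any copy invalidated on either path is dropped, which avoids the bad pointer fix-ups the comment describes.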
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1518.