Bug 103732 - [swr] often gets stuck in piglit's glx-multi-context-single-window test
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/swr
Version: 17.2
Hardware: Other All
Importance: medium normal
Assignee: Bruce Cherniak
QA Contact: mesa-dev
Reported: 2017-11-14 09:20 UTC by Andrés Gómez García
Modified: 2017-12-22 15:05 UTC

Attachments
BT from the stuck glx-multi-context-single-window process (56.51 KB, text/plain)
2017-11-17 11:40 UTC, Andrés Gómez García
attachment-11419-0.html (2.35 KB, text/html)
2017-11-29 20:41 UTC, Bruce Cherniak

Description Andrés Gómez García 2017-11-14 09:20:54 UTC
For some time now, when running the complete "all" profile in piglit against Mesa's swr driver, the glx-multi-context-single-window test often gets stuck in an infinite loop, using 100% CPU, and needs to be killed.

This doesn't always happen, but it does quite often.

Piglit report is as follows:


Detail      |   Value
------------+---------------
Returncode  |   -15
------------+---------------
Time        |   9:34:48.086238
------------+---------------
Stdout      |
------------+---------------
Stderr      | SWR detected AVX2 
            | vert shader  0x7f65e7e39000
            | frag shader  0x7f65e7e37000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65e7e33000
            | frag shader  0x7f65e7e31000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65e7c42000
            | frag shader  0x7f65e7c40000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65e7c3e000
            | frag shader  0x7f65c218a000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65c20fe000
            | frag shader  0x7f65c20fc000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65c2070000
            | frag shader  0x7f65c206e000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65c1fe2000
            | frag shader  0x7f65c1fe0000
            | fetch shader 0x7f65e7e35000
            | vert shader  0x7f65c1f54000
            | frag shader  0x7f65c1f52000
            | fetch shader 0x7f65e7e35000
------------+---------------
Environment | PIGLIT_PLATFORM="mixed_glx_egl" PIGLIT_SOURCE_DIR="/home/local/piglit"
------------+---------------
Command     | /home/local/piglit/bin/glx-multi-context-single-window -auto
------------+---------------
dmesg       |
------------+---------------

The environment is Ubuntu Xenial with custom LLVM packages installed and locally compiled Mesa and Mesa dependencies. If needed, I can provide a Docker image for testing.
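
For reference, in Python's subprocess module a negative return code -N means the child was terminated by signal N, so the Returncode of -15 above is consistent with the stuck test being killed with SIGTERM. A minimal Python sketch of a watchdog that runs a test binary with a deadline and reports how it ended could look like the following. The helper names run_with_timeout and describe, and any timeout value, are hypothetical and not part of piglit.

```python
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; if it exceeds timeout_s seconds, terminate it.

    Returns (returncode, timed_out). On POSIX, a negative returncode -N
    means the child was terminated by signal N.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        proc.communicate(timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        proc.terminate()  # sends SIGTERM on POSIX
        try:
            proc.communicate(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # the child ignored SIGTERM, so force it
            proc.communicate()
        return proc.returncode, True


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode
```

For example, run_with_timeout(["/home/local/piglit/bin/glx-multi-context-single-window", "-auto"], 60) would be expected to return (-15, True) if the test were still spinning after 60 seconds (the 60-second limit is an arbitrary choice).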
Comment 1 Michel Dänzer 2017-11-14 10:12:07 UTC
I've run into this with radeonsi as well.

It didn't happen for me up to September 13th, while I was using a CPU with 4 cores and 4 logical threads. After a break, I started running piglit again on October 9th, using a CPU with 8 cores and 16 logical threads, and ran into this issue. So either it depends on the number of CPU cores/threads, or it's a regression introduced between September 13th and October 9th.
Comment 2 Emil Velikov 2017-11-14 11:33:49 UTC
Changing the component to SWR ;-)
Comment 3 Michel Dänzer 2017-11-14 15:41:27 UTC
(In reply to Emil Velikov from comment #2)
> Changing the component to SWR ;-)

See comment 1, this isn't SWR specific.
Comment 4 Andrés Gómez García 2017-11-15 13:48:02 UTC
FWIW, under similar conditions, I haven't been able to reproduce this with llvmpipe, softpipe, or i965.
Comment 5 Michel Dänzer 2017-11-15 14:29:37 UTC
(In reply to Andrés Gómez García from comment #4)
> FWIW, with similar conditions, I've not been able to reproduce with
> llvmpipe, softpipe nor i965.

Hmm, then maybe it is related to threading done by SWR and radeonsi.
Comment 6 Andrés Gómez García 2017-11-17 11:40:27 UTC
Created attachment 135551
BT from the stuck glx-multi-context-single-window process

This is a fairly complete backtrace from the stuck glx-multi-context-single-window piglit process.
Comment 7 Bruce Cherniak 2017-11-28 17:14:12 UTC
I'll take a look into this.  The first thing I notice is that you are running with the DRI drivers.  Most of our customers use only the standalone GLX drivers.  We do not test DRI heavily.

You appear to be running a debug build (from the stderr output) of either mesa or llvm.  Does this occur with a release build as well?  And, is there a reason you are running with a debug build?

From the very complete BT (thank you!), it appears that the API thread is waiting for a fence to complete, but all of the worker threads are sitting idle, suggesting that the fence should be complete.  Once you hit this stuck loop, can you step into swr_is_fence_done and run "print *fence"?
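
To illustrate the reasoning above, here is a toy Python model of a fence that counts as done once every submitted unit of work has been retired. This is not SWR code: the ToyFence class and its submitted/retired counters are hypothetical. If a submission never reaches a worker, the workers sit idle while the waiter keeps polling. The sketch bounds the wait so the condition is observable instead of hanging.

```python
import threading
import time


class ToyFence:
    """Toy fence: done when every submitted unit of work has been retired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.submitted = 0  # bumped by the API thread
        self.retired = 0    # bumped by a worker when the work completes

    def submit(self):
        with self._lock:
            self.submitted += 1

    def retire(self):
        with self._lock:
            self.retired += 1

    def is_done(self):
        with self._lock:
            return self.retired == self.submitted

    def wait(self, timeout_s):
        """Poll until done; give up after timeout_s instead of spinning forever."""
        deadline = time.monotonic() + timeout_s
        while not self.is_done():
            if time.monotonic() >= deadline:
                return False
            time.sleep(0.001)
        return True


def demo(lose_work):
    """Return True if the fence completed, False if the bounded wait gave up."""
    fence = ToyFence()
    fence.submit()
    if not lose_work:
        worker = threading.Thread(target=fence.retire)
        worker.start()
        worker.join()
    # With lose_work=True no worker ever retires the submission, so an
    # unbounded wait would loop forever while all workers are idle.
    return fence.wait(timeout_s=0.2)
```

Calling demo(False) models the healthy case, and demo(True) models a fence whose work was lost. Printing the two counters at that point plays the same role as inspecting the real fence in the debugger.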

Thanks.  I'll report back as soon as I find anything.

(assigning back to Gallium/swr until something suggests otherwise)
Comment 8 Bruce Cherniak 2017-11-28 20:46:46 UTC
The root cause of this bug was fixed in a post-17.2 patch (b9aa0fa7) "swr: Handle resource across context changes".  It's in mesa master and the forthcoming 17.3.

The test still fails occasionally, but does not get stuck.
Comment 9 Andrés Gómez García 2017-11-29 13:50:14 UTC
(In reply to Bruce Cherniak from comment #8)
> The root cause to this bug was fixed in a post-17.2 patch (b9aa0fa7) "swr:
> Handle resource across context changes".  It's in mesa master and the
> forthcoming 17.3.
> 
> The test still fails occasionally, but does not get stuck.

Wow! That was quick!

Thanks a lot, Bruce. Should we mark this as "ALREADYFIXED" or rename it to cover the occasional failure?

Also, should we pick b9aa0fa7 for the 17.2 stable queue? It seems to apply cleanly ...
Comment 10 Bruce Cherniak 2017-11-29 20:41:20 UTC
Created attachment 135812
attachment-11419-0.html

(In reply to Andrés Gómez García from comment #9)
> Also, should we pick b9aa0fa7 for the 17.2 stable queue? It seems to apply
> clean ...

Yes, I do believe this is a good candidate for picking to the 17.2 stable queue.  What do I need to do to enable that?

Thanks,
Bruce
Comment 11 Emil Velikov 2017-11-30 13:47:28 UTC
> Yes, I do believe this is a good candidate for picking to the 17.2 stable
> queue.  What do I need to do to enable that?
> 
I've just done it [1], but for future patches check the instructions [2].
Feel free to send patches if you think the instructions could be improved ;-)

[1] https://lists.freedesktop.org/archives/mesa-stable/2017-November/007531.html
[2] https://www.mesa3d.org/submittingpatches.html#nominations
Comment 12 Bruce Cherniak 2017-11-30 13:55:24 UTC
> I've just did it [1] but for future patches check the instructions[2].
> Feel free to send patches if you think the instructions could be improved ;-)

Many thanks, Emil!  Instructions are good.  As usual, it's me that could be improved. ;-)
Comment 13 Emil Velikov 2017-12-22 15:05:17 UTC
Should be fixed in Mesa 17.2.7.