Bug 104553

Summary: mat4: m[i][j] incorrect result with row_major UBO & SSBO
Product: Mesa Reporter: florian.will
Component: glsl-compilerAssignee: mesa-dev
Status: RESOLVED MOVED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: apinheiro
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: Failing piglit test (when test_variant = 2) for this bug.
Proposed patch, needs more testing
Changes to piglit UBO test generator
attachment-29990-0.html
Simpler piglit test demonstrating the issue

Description florian.will 2018-01-09 11:39:00 UTC
Created attachment 136630 [details]
Failing piglit test (when test_variant = 2) for this bug.

I hit another bug while trying to get Banshee 3D <https://github.com/BearishSun/BansheeEngine> to work correctly on Mesa radeonsi (HD 7870) / amdgpu kernel module. I use git commit 2719467eb6 right now, which is a few days old.

Accessing a "mat4 m" component, e.g. m[1][2], returns unexpected results if m is declared inside a UBO block and uses row_major matrix format. col_major works as expected. m[1].z also works. Some experiments indicate that for m[i][j], the UBO buffer is accessed at offset (i+j)*4 instead of (i+j*4)*4.

The invalid offsets for loading floats from the UBO are visible in the Mesa IR when linker.cpp is done (probably introduced by lower_ubo_reference()).

While exploring this issue, I prepared a piglit test case that tests for this. It fails on my setup (when test_variant = 2, the other tests succeed). I will attach it to this bug report and send it to the piglit list for review.



Possible Fix:

I doubt this is enough/correct, because I haven't fully grasped the IR processing in the glsl compiler and which fields/data types can/can't be row_major and the implications, but this simple change in lower_buffer_access.cpp fixes this test and apparently causes no piglit regressions, so it's probably a good place to start (line 309):

 if (deref_array->array->type->is_vector()) {
    /* We get this when storing or loading a component out of a vector
     * with a non-constant index. This happens for v[i] = f where v is
     * a vector (or m[i][j] = f where m is a matrix). If we don't
     * lower that here, it gets turned into v = vector_insert(v, i,
     * f), which loads the entire vector, modifies one component and
     * then write the entire thing back.  That breaks if another
     * thread or SIMD channel is modifying the same vector.
     */
    //array_stride = 4; // git master
    array_stride = *row_major ? 16 : 4; // "fixed" line
    if (deref_array->array->type->is_64bit())
       array_stride *= 2;
Comment 1 Ilia Mirkin 2018-01-09 11:46:38 UTC
Ian Romanick (idr) wrote a test generator which generated random shader_test files with different ubo arrangements. It caught a lot of bugs back in the day, but I don't think tests from it were ever checked in. May be worth resurfacing, and extending to SSBO's while one's at it.
Comment 2 florian.will 2018-01-09 12:00:11 UTC
I found this by Ian Romanick:
https://cgit.freedesktop.org/piglit/tree/generated_tests/random_ubo.py

I'm not sure if it's used in a standard "tests/all" piglit run. If it is used but doesn't catch this bug, maybe the generated tests don't attempt to verify the the m[i][j] result.
Comment 3 Timothy Arceri 2018-01-10 01:07:15 UTC
(In reply to Ilia Mirkin from comment #1)
> Ian Romanick (idr) wrote a test generator which generated random shader_test
> files with different ubo arrangements. It caught a lot of bugs back in the
> day, but I don't think tests from it were ever checked in. May be worth
> resurfacing, and extending to SSBO's while one's at it.

I believe I pushed a version of the generator with arrays of arrays and doubles support merged in but it was never hooked up to run in standard piglit runs. I believe people were worried about running an excessive number of tests.
Comment 4 florian.will 2018-01-10 09:09:12 UTC
I used the generated_tests/random_ubo-arb_uniform_buffer_object.py script and it generated 270 tests, and initially they all passed. That's because the tests only verify matrix values using the [i].{x,y,z,w} method. I added the [i][j] method and 187 out of the 270 tests started to fail.

I tested the Mesa change I suggested in the orignal bug report, i.e.
array_stride = *row_major ? 16 : 4;
and while this works for most of the generated tests, 45 tests still fail (and they only check UBOs, so there's no std430 involved, maybe that change looks even worse when taking that into account).

Looking at two of the individual checks that fail:
if (!float_match(m22_1[0][1], 29246.2736706, 0x46e47c8cu)), and
if (!float_match(m24_2[0][3], 13624.6092622, 0x4654e270u)),
it turns out that m22_1 and m24_2 are both row_major and 2-column.

So (obviously?) the array_stride needs to be set to 8 (instead of 4 or 16) for 2-column row-major matrices. I'm unable to figure out how to identify the matrix type from that part of the code though. deref_array->array->type appears to be the type of m[i], i.e. a vector data type with no info about the number of other columns in m.
Comment 5 florian.will 2018-01-10 11:31:45 UTC
Created attachment 136648 [details] [review]
Proposed patch, needs more testing

I now have a patch that works for all the 270 random UBO test cases generated using Ian Romanick's useful script (and also my own simple test case).

It still needs more testing (maybe arrays of arrays of mat2 still fail), but I'm attaching it to this bug.

Unless more testing results in new issues, I plan to eventually submit it for review.
Comment 6 Ilia Mirkin 2018-01-10 11:34:41 UTC
(In reply to florian.will from comment #5)
> I now have a patch that works for all the 270 random UBO test cases
> generated using Ian Romanick's useful script (and also my own simple test
> case).

There was a process whereby you could just run that script in an infinite loop, and it would save off any failures somewhere. You could then take the failures and run them through a bisection script which would reduce them down to the simplest (local minimum) it could for the failure to still occur. Hopefully these were also checked in somewhere. If not I can search around.
Comment 7 florian.will 2018-01-21 15:19:23 UTC
Created attachment 136878 [details] [review]
Changes to piglit UBO test generator

I have now extended the random UBO piglit test generator python script (in a hackish way) to generate SSBO tests as well, and added std430 packing rules to generate std430 SSBO tests. My changes are in the attached patch file, but I'd say it's not suitable for piglit git (too ugly).

It was helpful to validate the mesa patch I've attached to this bug report earlier. Using mesa git master, 391 out of the 540 generated UBO&SSBO tests fail. After applying my patch, only a few tests (3-7) fail. The failing tests are always very huge test files (some have more than 10k lines and sometimes up to 5MB shader_test files). Apparently they hit something like an internal size limit for vertex shaders, because the tests pass when commenting out one half of the test conditions in the vertex shader, and they still pass when commenting out the other half of the vertex shader.

So I'm now fairly confident that my patch improves the SSBO / UBO buffer access behaviour when reading from SSBOs and UBOs.

Is there anything else that should be tested? Or any comments about the patch by someone who knows the lower_buffer_access code better than I do?
Comment 8 Ilia Mirkin 2018-01-22 15:04:06 UTC
Created attachment 136897 [details]
attachment-29990-0.html

FYI idr had a separate script to reduce the giant test cases to the
smallest that would still fail. It should be in one of his piglit branches
in his personal fd.o git repo. (I'm on the go, hence no specific URL.)

On Jan 22, 2018 09:02, <bugzilla-daemon@freedesktop.org> wrote:

> florian.will@googlemail.com changed bug 104553
> <https://bugs.freedesktop.org/show_bug.cgi?id=104553>
> What Removed Added
> Attachment #136630 [details] is obsolete   1
>
> *Comment # 7 <https://bugs.freedesktop.org/show_bug.cgi?id=104553#c7> on
> bug 104553 <https://bugs.freedesktop.org/show_bug.cgi?id=104553> from
> florian.will@googlemail.com <florian.will@googlemail.com> *
>
> Created attachment 136878 [details] [review] <https://bugs.freedesktop.org/attachment.cgi?id=136878> [details] <https://bugs.freedesktop.org/attachment.cgi?id=136878&action=edit> [review] <https://bugs.freedesktop.org/page.cgi?id=splinter.html&bug=104553&attachment=136878>
> Changes to piglit UBO test generator
>
> I have now extended the random UBO piglit test generator python script (in a
> hackish way) to generate SSBO tests as well, and added std430 packing rules to
> generate std430 SSBO tests. My changes are in the attached patch file, but I'd
> say it's not suitable for piglit git (too ugly).
>
> It was helpful to validate the mesa patch I've attached to this bug report
> earlier. Using mesa git master, 391 out of the 540 generated UBO&SSBO tests
> fail. After applying my patch, only a few tests (3-7) fail. The failing tests
> are always very huge test files (some have more than 10k lines and sometimes up
> to 5MB shader_test files). Apparently they hit something like an internal size
> limit for vertex shaders, because the tests pass when commenting out one half
> of the test conditions in the vertex shader, and they still pass when
> commenting out the other half of the vertex shader.
>
> So I'm now fairly confident that my patch improves the SSBO / UBO buffer access
> behaviour when reading from SSBOs and UBOs.
>
> Is there anything else that should be tested? Or any comments about the patch
> by someone who knows the lower_buffer_access code better than I do?
>
> ------------------------------
> You are receiving this mail because:
>
>    - You are the assignee for the bug.
>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
>
Comment 9 Alejandro Piñeiro (freenode IRC: apinheiro) 2018-02-22 17:08:16 UTC
Found this bug when I was about to report the same problem. Some comments below:

(In reply to florian.will from comment #0)
> Created attachment 136630 [details]
> Failing piglit test (when test_variant = 2) for this bug.
> 
> I hit another bug while trying to get Banshee 3D
> <https://github.com/BearishSun/BansheeEngine> to work correctly on Mesa
> radeonsi (HD 7870) / amdgpu kernel module. I use git commit 2719467eb6 right
> now, which is a few days old.
> 
> Accessing a "mat4 m" component, e.g. m[1][2], returns unexpected results if
> m is declared inside a UBO block and uses row_major matrix format. col_major
> works as expected. m[1].z also works. Some experiments indicate that for
> m[i][j], the UBO buffer is accessed at offset (i+j)*4 instead of (i+j*4)*4.
> 
> The invalid offsets for loading floats from the UBO are visible in the Mesa
> IR when linker.cpp is done (probably introduced by lower_ubo_reference()).

As you imply below, this problem also affects ssbo. So perhaps it would be a good idea to update the bug description?

> While exploring this issue, I prepared a piglit test case that tests for
> this. It fails on my setup (when test_variant = 2, the other tests succeed).
> I will attach it to this bug report and send it to the piglit list for
> review.
> 
> 
> 
> Possible Fix:
> 
> I doubt this is enough/correct, because I haven't fully grasped the IR
> processing in the glsl compiler and which fields/data types can/can't be
> row_major and the implications, but this simple change in
> lower_buffer_access.cpp 

FWIW, while I was doing a skimming, I also got to lower_buffer_access and lower_ubo_reference (this one touches both ssbo and ubo), so I agree that the issue is likely at that code. Not sure if your fix is correct or in the good path sorry.

(In reply to florian.will from comment #7)
> Created attachment 136878 [details] [review] [review]
> Changes to piglit UBO test generator
> 
> I have now extended the random UBO piglit test generator python script (in a
> hackish way) to generate SSBO tests as well, and added std430 packing rules
> to generate std430 SSBO tests. My changes are in the attached patch file,
> but I'd say it's not suitable for piglit git (too ugly).
> 
> It was helpful to validate the mesa patch I've attached to this bug report
> earlier. Using mesa git master, 391 out of the 540 generated UBO&SSBO tests
> fail. After applying my patch, only a few tests (3-7) fail. The failing
> tests are always very huge test files (some have more than 10k lines and
> sometimes up to 5MB shader_test files). Apparently they hit something like
> an internal size limit for vertex shaders, because the tests pass when
> commenting out one half of the test conditions in the vertex shader, and
> they still pass when commenting out the other half of the vertex shader.

Somewhat off-topic: Timothy mentioned that in the past it was not included due all the amount of tests added. So perhaps a compromise would be added some (~10?) barebone tests, to at least cover the most basic cases. Something like this test I wrote while debugging this:
https://github.com/Igalia/piglit/blob/apinheiro/matrix-row-major-failure/tests/spec/arb_fake/execution/ubo/matrix-column-vs-row.shader_test

or in a ideal world, get the script to be configurable on how many tests to create, and the default being a reasonable amount of tests (fwiw, 540 generated tests seems somewhat too much).

> So I'm now fairly confident that my patch improves the SSBO / UBO buffer
> access behaviour when reading from SSBOs and UBOs.
> 
> Is there anything else that should be tested? Or any comments about the
> patch by someone who knows the lower_buffer_access code better than I do?

Unfourtunately although I would be interested on working on this, I don't have the time right now.


And now totally off-topic, but probably it is worth to mention here to not forget: VK-GL-CTS doesn't catch this problem either. And they have tons of row_major tests, for example:
KHR-GL45.shaders.uniform_block.single_basic_type.std140.row_major_mediump_mat4

is passing properly. So or the test is wrong or it is incomplete. I tried to take a look to the test, but it is somewhat hard to understand.
https://github.com/KhronosGroup/VK-GL-CTS/blob/master/external/openglcts/modules/common/glcUniformBlockCase.cpp#L1346
Comment 10 Timothy Arceri 2018-03-28 00:35:14 UTC
(In reply to Alejandro Piñeiro (freenode IRC: apinheiro) from comment #9)
> Found this bug when I was about to report the same problem. Some comments
> below:
> 
> (In reply to florian.will from comment #0)
> > Created attachment 136630 [details]
> > Failing piglit test (when test_variant = 2) for this bug.
> > 
> > I hit another bug while trying to get Banshee 3D
> > <https://github.com/BearishSun/BansheeEngine> to work correctly on Mesa
> > radeonsi (HD 7870) / amdgpu kernel module. I use git commit 2719467eb6 right
> > now, which is a few days old.
> > 
> > Accessing a "mat4 m" component, e.g. m[1][2], returns unexpected results if
> > m is declared inside a UBO block and uses row_major matrix format. col_major
> > works as expected. m[1].z also works. Some experiments indicate that for
> > m[i][j], the UBO buffer is accessed at offset (i+j)*4 instead of (i+j*4)*4.
> > 
> > The invalid offsets for loading floats from the UBO are visible in the Mesa
> > IR when linker.cpp is done (probably introduced by lower_ubo_reference()).
> 
> As you imply below, this problem also affects ssbo. So perhaps it would be a
> good idea to update the bug description?
> 
> > While exploring this issue, I prepared a piglit test case that tests for
> > this. It fails on my setup (when test_variant = 2, the other tests succeed).
> > I will attach it to this bug report and send it to the piglit list for
> > review.
> > 
> > 
> > 
> > Possible Fix:
> > 
> > I doubt this is enough/correct, because I haven't fully grasped the IR
> > processing in the glsl compiler and which fields/data types can/can't be
> > row_major and the implications, but this simple change in
> > lower_buffer_access.cpp 
> 
> FWIW, while I was doing a skimming, I also got to lower_buffer_access and
> lower_ubo_reference (this one touches both ssbo and ubo), so I agree that
> the issue is likely at that code. Not sure if your fix is correct or in the
> good path sorry.
> 
> (In reply to florian.will from comment #7)
> > Created attachment 136878 [details] [review] [review] [review]
> > Changes to piglit UBO test generator
> > 
> > I have now extended the random UBO piglit test generator python script (in a
> > hackish way) to generate SSBO tests as well, and added std430 packing rules
> > to generate std430 SSBO tests. My changes are in the attached patch file,
> > but I'd say it's not suitable for piglit git (too ugly).
> > 
> > It was helpful to validate the mesa patch I've attached to this bug report
> > earlier. Using mesa git master, 391 out of the 540 generated UBO&SSBO tests
> > fail. After applying my patch, only a few tests (3-7) fail. The failing
> > tests are always very huge test files (some have more than 10k lines and
> > sometimes up to 5MB shader_test files). Apparently they hit something like
> > an internal size limit for vertex shaders, because the tests pass when
> > commenting out one half of the test conditions in the vertex shader, and
> > they still pass when commenting out the other half of the vertex shader.
> 
> Somewhat off-topic: Timothy mentioned that in the past it was not included
> due all the amount of tests added. So perhaps a compromise would be added
> some (~10?) barebone tests, to at least cover the most basic cases.
> Something like this test I wrote while debugging this:
> https://github.com/Igalia/piglit/blob/apinheiro/matrix-row-major-failure/
> tests/spec/arb_fake/execution/ubo/matrix-column-vs-row.shader_test
> 

Yes for now I'd be happy with someone committing a piglit test that trips up the  bug. Please go ahead and push your test as its the simplest of the two tests. Thanks.
Comment 11 florian.will 2018-06-27 08:27:29 UTC
> And now totally off-topic, but probably it is worth to mention here to not
> forget: VK-GL-CTS doesn't catch this problem either. And they have tons of
> row_major tests, for example:
> KHR-GL45.shaders.uniform_block.single_basic_type.std140.row_major_mediump_mat4

> is passing properly. So or the test is wrong or it is incomplete. I tried to
> take a look to the test, but it is somewhat hard to understand.
> https://github.com/KhronosGroup/VK-GL-CTS/blob/master/external/openglcts/modules/common/glcUniformBlockCase.cpp#L1346

That one probably works because it compares "mat"s by delegating to compare "vec"s. Getting a vector from a mat using the [i] syntax works, if I remember correctly, but getting a float out of a mat directly using the [i][j] syntax fails. At least that's my take on lines 671-695 of that file. I assume that it fails (unless this bug has been fixed in Mesa by now and nobody noticed?) if the compare_mat* lines are expanded to include "*compare_float(a[i][j], b[i][j])" for all valid i,j.

I'm on Ubuntu 18.04's mesa right now (called "18.0.0~rc5-1ubuntu1"). There's a rather simple piglit test case attached to my original bug report, and that one still fails on that 18.0.0 version:
$ ./bin/shader_runner 104553.shader_test 
Probe color at (0,0)
  Expected: 2 33 10
  Observed: 3 33 33
Test failure on line 83


I marked that test case obsolete because I modified the piglit "UBO test generator" to produce many, many failing test cases that helped me validate my patch a bit in lots of different situations. But those generated test cases are quite huge, so they are probably not fit for inclusion in piglit.

Maybe that older, simpler, "obsolete" test case attached to this bug report can be added to piglit instead? It simply writes floats into two uniform matrices using piglit script, where one matrix is row_major and the other column_major, and then extracts four values using the [i][j] ("index") and the [i].{x,y,z,w} ("swizzle") methods in a shader and compares them to the expected values in the piglit script.

My plans to submit the mesa patch for inclusion are canceled as I'm not confident enough it doesn't break anything, but so far I haven't found any issues in my limited testing, so if someone wants to pick it up, feel free to do so.


> As you imply below, this problem also affects ssbo. So perhaps it would be
> a good idea to update the bug description?

Yep. :-)
Comment 12 Alejandro Piñeiro (freenode IRC: apinheiro) 2019-01-30 15:40:45 UTC
Hi, today I found that row_major matrices are working fine, at least on my tests. I briefly tried to bisect when this started, without too much luck.

Having said so, I don't think that checking when this got fixed is really relevant.

Could you re-check with your tests if things are working fine now?
Comment 13 florian.will 2019-05-05 19:14:22 UTC
Created attachment 144167 [details]
Simpler piglit test demonstrating the issue

Thanks for taking a look, Alejandro! And I'm sorry this took longer than I had hoped. Anyway, I updated my mesa to latest master git-6823873246 (purely numeric commit hash, wow) and those tests are still failing for me. My patch still fixes them, but I'm also still not sure about the general validity of the patch.

I feel like all the tests so far have been disappointing because they were a bit difficult to understand, or because that UBO test generator is kind of complicated. So I wanted to find a short and simple to understand test case.

Since what seems to be going on is that [i][j] access is changed to something similar to [0][i+j] access, I wrote a piglit test based on an existing piglit test that basically does this on a row_major matrix "m" in a compute shader, and then stores the resulting color in a shader image:

    m[0][3] = 0.1f;
    m[1][2] = 0.2f;
    m[2][1] = 0.3f;

    m[1][3] = 0.4f;
    m[3][1] = 0.5f;

    vec4 color = vec4(m[0][3], m[1][2], m[2][1], m[1][3]);

It also has a memory barrier and barrier before setting the color. As I understand it (I'm not very OpenGL experienced though, so correct me if I'm wrong), this should set the color to (0.1, 0.2, 0.3, 0.4) reliably.

What seems to happen though when I use a row_major matrix, is that the [1][2] write access overwrites [0][3], and then [2][1] overwrites both, so all 3 values end up as the same value, 0.3.

Same for the [3][1] write access, it overwrites [1][3], so that value ends up as 0.5. So at the end, the color is incorrectly set to (0.3, 0.3, 0.3, 0.5):
Probe color at (0,0)
  Expected: 25 51 76 102
  Observed: 77 77 77 128

The test case works as expected if I change the row_major matrix to be column_major.

I'm attaching this new and simpler piglit test. I'm not sure if I should change the status to "NEW" again or keep it at "NEEDINFO".

Maybe someone who is not an OpenGL newbie could have a look at my test and see if it's valid, and see if it fails in the same way for you.
Comment 14 GitLab Migration User 2019-09-18 19:46:03 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/815.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.