Bug 89597 - Implement SSBOs in GLSL front-end and i965
Summary: Implement SSBOs in GLSL front-end and i965
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965 (show other bugs)
Version: unspecified
Hardware: Other All
: medium enhancement
Assignee: Iago Toral
QA Contact: Intel 3D Bugs Mailing List
URL: https://www.opengl.org/registry/specs...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-16 18:52 UTC by Kristian Høgsberg
Modified: 2015-09-25 22:53 UTC (History)
5 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Prototype supporting buffer loads (32.88 KB, patch)
2015-03-16 18:52 UTC, Kristian Høgsberg
Details | Splinter Review

Description Kristian Høgsberg 2015-03-16 18:52:06 UTC
We need to add support for SSBOs (link in URL field).  Very brief summary: SSBOs are GL buffer objects accessed through interface blocks.  In other words, similar to UBOs, except 1) they're writable, 2) they can hold atomics and 3) they have an optional dynamically sized array at the end.

Attached patch is a quick prototype that implements parts of the front-end changes (adding the new 'buffer' keyword) and backend changes to add new opcodes and implement SSBO loads using the date cache oword read message.

The implementation is not expected to exercise the hardware in new ways, we already oword reads and writes and atomics. One exception is that for bounds checking access to the variable length array we could use the resinfo sampler message.
Comment 1 Kristian Høgsberg 2015-03-16 18:52:59 UTC
Created attachment 114351 [details] [review]
Prototype supporting buffer loads
Comment 2 Kristian Høgsberg 2015-03-16 19:09:04 UTC
Update after talking to Ken: we've been using resinfo for a while so nothing new there.  Also, we'd would focus on GEN7+ for now.
Comment 3 Iago Toral 2015-03-18 11:32:00 UTC
Samuel and I are working on this
Comment 4 Kristian Høgsberg 2015-03-18 19:24:09 UTC
(In reply to Iago Toral from comment #3)
> Samuel and I are working on this

Great!  And btw, resinfo isn't needed for bounds checking, the hw does that of course. However, there's a way to query the length of the variable size array from GLSL which will need resinfo.
Comment 5 Iago Toral 2015-03-27 07:48:00 UTC
Kristian, the prototype patch you attached defines a ir_binop_ssbo_store operator for ir_expressions. I have a quick implementation of the ssbo write as an expression operator that looks like this in GLSL IR:

(declare (temporary ) vec4 ssbo_write_temp)
(assign  (zw) (var_ref ssbo_write_temp)  (constant vec2 (-3.000000 -4.000000)) ) 
(expression vec4 ssbo_store (constant uint (0)) (constant uint (0)) (var_ref ssbo_write_temp) ) 

(I made it a triop because besides the block number and offset we also need the value to write. In fact, it would have to end up being quadop because we also need a write mask operand to know which channels we need to write).

In the beginning I found odd to implement ssbo_store as an expression operator, mostly because it is not expected to return any value. In fact, I expected that it would be killed by dead code elimination passes because of this reason. It seems that I was wrong and this is working fine, but I am not aware of other expression operators that work like this (I mean, which are not used as part of a RHS in an assignment), so I wonder if this is how you envisioned the implementation or if you think that we should implement the ssbo write differently (maybe having a separate ir_ssbo_store IR node).

What do you think?
Comment 6 Kenneth Graunke 2015-03-27 07:57:32 UTC
Memory writes should definitely not be ir_expressions - those are pure, with no side-effects.  It probably makes more sense to handle them as ir_call intrinsics, like we do for atomics.  Or as a new node type.
Comment 7 Iago Toral 2015-03-27 08:00:12 UTC
(In reply to Kenneth Graunke from comment #6)
> Memory writes should definitely not be ir_expressions - those are pure, with
> no side-effects.  It probably makes more sense to handle them as ir_call
> intrinsics, like we do for atomics.  Or as a new node type.

Yeah, that is what I figured. Thanks for the fast-reply Ken!
Comment 8 Iago Toral 2015-04-10 10:44:31 UTC
Kristian, I have a question concerting writes to row_major matrices:

Since we are using owrod block write messages, we always write 16 bytes and we need to setup the writemask bits in the SEND's dst register to select the channels we want effectively written.

When writing to a column in a row_major mat4 (i.e. mat[2] = ...) we will have to emit 4 separate writes (since each component in that column falls into a different OWORD offset in the buffer. Specifically, if we write all components of mat[2] we want to write to dword offsets 2, 6, 10, 14, which translates to writing the Y channel only in oword offsets 0, 1, 2, 3.

With a constant index, setting the writemask on the SEND's dst register to select the Y channel is easy, because the offset is a constant expression and we can simply do something like this in the visitor code:

unsigned const_offset = offset_ir->value.u[0];
int write_mask = 1 << (const_offset % 4);
struct brw_reg brw_dst = brw_set_writemask(dst, write_mask);

However, if we index into a row_major matrix with an non-constant expression, then we can't do that. I imagine there is no way to set a register's writemask dynamically from values stored in other registers, so in these cases I think we will have to emit a read of the same dword offset we are going to write, then overwrite the components we want to change, and finally write the entire oword back to memory.

Am I missing a better way to do this?
Comment 9 Iago Toral 2015-04-10 11:41:09 UTC
(In reply to Iago Toral from comment #8)
> Kristian, I have a question concerting writes to row_major matrices:

Actually, the problem is more general than row_major matrices, for example this:

struct TB {
    float a, b, c, d;
};

layout(std140, binding=2) buffer Fragments {
   TB s[3];
   int index;
};

uniform mat4 MVP;

void main()
{
   s[index].d = -1.0;
}

Generally, we have the problem when using dynamic indices into row_major matrices or arrays of types that are not a vec4 (if the type is a vec4 it works because the offset of a vec4 is always 16-byte aligned, so we can ignore the offset and just set the writemask based on the vec4 channels being written).
Comment 10 Iago Toral 2015-04-10 15:56:29 UTC
(In reply to Iago Toral from comment #9)
> (In reply to Iago Toral from comment #8)
> > Kristian, I have a question concerting writes to row_major matrices:
> 
> Actually, the problem is more general than row_major matrices, for example
> this:
> 
> struct TB {
>     float a, b, c, d;
> };
> 
> layout(std140, binding=2) buffer Fragments {
>    TB s[3];
>    int index;
> };
> 
> uniform mat4 MVP;
> 
> void main()
> {
>    s[index].d = -1.0;
> }
> 
> Generally, we have the problem when using dynamic indices into row_major
> matrices or arrays of types that are not a vec4 (if the type is a vec4 it
> works because the offset of a vec4 is always 16-byte aligned, so we can
> ignore the offset and just set the writemask based on the vec4 channels
> being written).

It seems that I can fix this example (and others like this) by ignoring the index into the array (for the purpose of computing the writemask, not when computing the offset where we will write of course) and simply take the constant offset of the element being accessed ('d') within the type (that would be the constant value 12 in this specific. Since arrays are aligned to 16-byte boundaries this works fine for the purpose of computing the writemask.

Row-major matrices are still an issue though, so my question from comment #8 still applies.
Comment 11 Jason Ekstrand 2015-04-14 19:43:48 UTC
When you get into SIMD8 (or maybe even for vec4), you'll want to look at the SCATTERED_READ/WRITE messages.  They are basically exactly what you want for SSBOs on SIMD8/16.
Comment 12 Iago Toral 2015-04-15 06:02:14 UTC
(In reply to Jason Ekstrand from comment #11)
> When you get into SIMD8 (or maybe even for vec4), you'll want to look at the
> SCATTERED_READ/WRITE messages.  They are basically exactly what you want for
> SSBOs on SIMD8/16.

Hi Jason, thanks for the pointer. Right now I think I have most of the SIMD8/16 working, at least for std140, except for row-major matrices, using a combination of OWord Block (for reads and writes) and Unaligned OWord Block Reads (for non-constant offset reads).

If this can fix row-major (in vec4 and fs) then it will come also in handy for std430.
Comment 13 Iago Toral 2015-04-16 10:59:08 UTC
Jason, scattered writes did fix the problem, thanks!

I noticed an unexpected behavior though, according to the PRM, the scattered write message is supposed to write 8 DWords at 8 offsets (for a block size of 8), however, for me it only writes 4. It completely ignores offsets stored in M1.4:M1.7 and data stored in M2.4:M2.7 of the message payload.

This issue actually works great for me here because a vector type is at most 4 elements so we want to write 4 DWords tops with each message, but I wonder why this this happening  and if it is safe to assume that it is going to write 4 Dwords always. The PRM says that the hardware uses the 8 lower bits of the execution mask to select which of the 8 channels are effectively written, so I wonder if that could be affecting here or if this issue might be related to something else.

Any thoughts?

This is important because if I can't be sure that only 4 Dwords are going to be written then I need to disable the writes from offsets M1.4:M1.7. Ideally I would do this by altering the execution mask for the SEND instruction so that it only considers the the channels we want to write. Is this possible? I have not found any examples in the driver where this is done.

Alternatively, I could replicate the writes from offsets 0..3 into 4..7 (the PRM says that the hardware optimizes writes to the same offset so this may not be that bad).
Comment 14 Jason Ekstrand 2015-04-16 16:19:06 UTC
(In reply to Iago Toral from comment #13)
> Jason, scattered writes did fix the problem, thanks!
> 
> I noticed an unexpected behavior though, according to the PRM, the scattered
> write message is supposed to write 8 DWords at 8 offsets (for a block size
> of 8), however, for me it only writes 4. It completely ignores offsets
> stored in M1.4:M1.7 and data stored in M2.4:M2.7 of the message payload.

I'm not sure if this is your problem, but something that took me by surprised about the scattered read/write messages is that they don't do what you might first expect.  The 8 dwords are written to the 8 different offsets provided.  This means that, if all 8 offsets are the same, one of those 8 values will end up there and the other 7 won't get written at all.  If you want to use it (as I did to spill an entire register), you have to give it 8 different offsets.  I did this using an add with a vector int immediate:

http://cgit.freedesktop.org/~jekstrand/mesa/tree/src/mesa/drivers/dri/i965/brw_fs.cpp?h=wip/fs-indirects-v0.5#n1740

For SSBO's, however, scattered read/write should be exactly what you want because because you get an offset per SIMD channel and you just have to put the data there.  The user is responsible for making sure that data from different fragments or vertices end up in different locations.

> This issue actually works great for me here because a vector type is at most
> 4 elements so we want to write 4 DWords tops with each message, but I wonder
> why this this happening  and if it is safe to assume that it is going to
> write 4 Dwords always. The PRM says that the hardware uses the 8 lower bits
> of the execution mask to select which of the 8 channels are effectively
> written, so I wonder if that could be affecting here or if this issue might
> be related to something else.
> 
> Any thoughts?

I'm confused.  Are you trying to use dword scattered read/write for vec4?  In SIMD8, you only write one float at a time anyway.  Unless, of course, I'm massively misunderstanding SSBO's.  For vec4, I think you want the 

> This is important because if I can't be sure that only 4 Dwords are going to
> be written then I need to disable the writes from offsets M1.4:M1.7. Ideally
> I would do this by altering the execution mask for the SEND instruction so
> that it only considers the the channels we want to write. Is this possible?
> I have not found any examples in the driver where this is done.
> 
> Alternatively, I could replicate the writes from offsets 0..3 into 4..7 (the
> PRM says that the hardware optimizes writes to the same offset so this may
> not be that bad).

If you're trying to use scattered read/write in vec4, then you may be running into execution mask issues.  I don't know how the execution mask in 4x2 is laid out but scattered read/write is usually a SIMD8 message.  It can be used in 4x2 mode but you'll have to monkey with the writemask yourself.  I'm not sure how you do that.  Ken would know.
Comment 15 Iago Toral 2015-04-17 06:18:29 UTC
(In reply to Jason Ekstrand from comment #14)
> (In reply to Iago Toral from comment #13)
> > Jason, scattered writes did fix the problem, thanks!
> > 
> > I noticed an unexpected behavior though, according to the PRM, the scattered
> > write message is supposed to write 8 DWords at 8 offsets (for a block size
> > of 8), however, for me it only writes 4. It completely ignores offsets
> > stored in M1.4:M1.7 and data stored in M2.4:M2.7 of the message payload.
> 
> I'm not sure if this is your problem, but something that took me by
> surprised about the scattered read/write messages is that they don't do what
> you might first expect.  The 8 dwords are written to the 8 different offsets
> provided.  This means that, if all 8 offsets are the same, one of those 8
> values will end up there and the other 7 won't get written at all.  If you
> want to use it (as I did to spill an entire register), you have to give it 8
> different offsets.  I did this using an add with a vector int immediate:
> 
> http://cgit.freedesktop.org/~jekstrand/mesa/tree/src/mesa/drivers/dri/i965/
> brw_fs.cpp?h=wip/fs-indirects-v0.5#n1740
> 
> For SSBO's, however, scattered read/write should be exactly what you want
> because because you get an offset per SIMD channel and you just have to put
> the data there.  The user is responsible for making sure that data from
> different fragments or vertices end up in different locations.

Nope, that is not my problem. I provide 8 different consecutive dword offets but I only see 4 of these actually written.

> > This issue actually works great for me here because a vector type is at most
> > 4 elements so we want to write 4 DWords tops with each message, but I wonder
> > why this this happening  and if it is safe to assume that it is going to
> > write 4 Dwords always. The PRM says that the hardware uses the 8 lower bits
> > of the execution mask to select which of the 8 channels are effectively
> > written, so I wonder if that could be affecting here or if this issue might
> > be related to something else.
> > 
> > Any thoughts?
> 
> I'm confused.  Are you trying to use dword scattered read/write for vec4? 
> In SIMD8, you only write one float at a time anyway.  Unless, of course, I'm
> massively misunderstanding SSBO's.  For vec4, I think you want the 

Nope, this is SIMD8/16, haven't tried to use this with vec4. The thing is, imagine that I have a vector type at the IR element with element count > 1. Initially I would loop through the elements and write each one individually by passing offset(value_reg, i) as src to the write message, but then I noticed that I could use the same message to write all the elements in the vector (up to 4) in one go if I provided 4 different offsets to the scattered message and prepared the message payload with the 4 floats to write at each offset. That is, I do something like this in the visitor:

   /* Prepare scattered write message payload.
    * M1.0..M1.3: Dword offsets to be added to the global offset
    * M2.0..M2.3: Dword values
    */
   int base_mrf = 1;
   for (int i = 0; i < ir->val->type->vector_elements; i++) {
      int component_mask = 1 << i;
      if (ir->write_mask & component_mask) {
         fs_reg mrf = fs_reg(MRF, base_mrf + 1, BRW_REGISTER_TYPE_UD);
         mrf.subreg_offset += i * type_sz(mrf.type);
         emit(MOV(mrf, brw_imm_ud(i)));

         mrf = fs_reg(MRF, base_mrf + 2, val_reg.type);
         mrf.subreg_offset += i * type_sz(mrf.type);
         emit(MOV(mrf, offset(val_reg, i)));
      }
   }

   /* Set the writemask so we only write to the offsets we want */
   struct brw_reg brw_dst =
      brw_set_writemask(brw_vec8_grf(0, 0), ir->write_mask);
   fs_reg push_dst = fs_reg(brw_dst);
   fs_inst *inst =
      new(mem_ctx) fs_inst(SHADER_OPCODE_SCATTERED_BUFFER_STORE, 8,
                           push_dst, surf_index, offset_reg);

This seems to work well, and for vectors I end up only needing one message to write all the channels I need to write. Now that I think about it, the reason I only get 4 channels written at most is probably because ir->write_mask can be 0xf at most, I imagine that in SIMD8 the wridst temask would have to be 0xff to cover all 8 channels, unlike vec4.

> > This is important because if I can't be sure that only 4 Dwords are going to
> > be written then I need to disable the writes from offsets M1.4:M1.7. Ideally
> > I would do this by altering the execution mask for the SEND instruction so
> > that it only considers the the channels we want to write. Is this possible?
> > I have not found any examples in the driver where this is done.
> > 
> > Alternatively, I could replicate the writes from offsets 0..3 into 4..7 (the
> > PRM says that the hardware optimizes writes to the same offset so this may
> > not be that bad).
> 
> If you're trying to use scattered read/write in vec4, then you may be
> running into execution mask issues.  I don't know how the execution mask in
> 4x2 is laid out but scattered read/write is usually a SIMD8 message.  It can
> be used in 4x2 mode but you'll have to monkey with the writemask yourself. 
> I'm not sure how you do that.  Ken would know.

Haven't tried this for vec4 yet, but if I end up needing it there too I'll ask Ken.

Thanks Jason.
Comment 16 Jason Ekstrand 2015-04-17 07:33:35 UTC
(In reply to Iago Toral from comment #15)
> (In reply to Jason Ekstrand from comment #14)
> > I'm confused.  Are you trying to use dword scattered read/write for vec4? 
> > In SIMD8, you only write one float at a time anyway.  Unless, of course, I'm
> > massively misunderstanding SSBO's.  For vec4, I think you want the 
> 
> Nope, this is SIMD8/16, haven't tried to use this with vec4. The thing is,
> imagine that I have a vector type at the IR element with element count > 1.
> Initially I would loop through the elements and write each one individually
> by passing offset(value_reg, i) as src to the write message, but then I
> noticed that I could use the same message to write all the elements in the
> vector (up to 4) in one go if I provided 4 different offsets to the
> scattered message and prepared the message payload with the 4 floats to
> write at each offset. That is, I do something like this in the visitor:
> 
>    /* Prepare scattered write message payload.
>     * M1.0..M1.3: Dword offsets to be added to the global offset
>     * M2.0..M2.3: Dword values
>     */
>    int base_mrf = 1;
>    for (int i = 0; i < ir->val->type->vector_elements; i++) {
>       int component_mask = 1 << i;
>       if (ir->write_mask & component_mask) {
>          fs_reg mrf = fs_reg(MRF, base_mrf + 1, BRW_REGISTER_TYPE_UD);
>          mrf.subreg_offset += i * type_sz(mrf.type);
>          emit(MOV(mrf, brw_imm_ud(i)));
> 
>          mrf = fs_reg(MRF, base_mrf + 2, val_reg.type);
>          mrf.subreg_offset += i * type_sz(mrf.type);
>          emit(MOV(mrf, offset(val_reg, i)));
>       }
>    }
> 
>    /* Set the writemask so we only write to the offsets we want */
>    struct brw_reg brw_dst =
>       brw_set_writemask(brw_vec8_grf(0, 0), ir->write_mask);
>    fs_reg push_dst = fs_reg(brw_dst);
>    fs_inst *inst =
>       new(mem_ctx) fs_inst(SHADER_OPCODE_SCATTERED_BUFFER_STORE, 8,
>                            push_dst, surf_index, offset_reg);
> 
> This seems to work well, and for vectors I end up only needing one message
> to write all the channels I need to write. Now that I think about it, the
> reason I only get 4 channels written at most is probably because
> ir->write_mask can be 0xf at most, I imagine that in SIMD8 the wridst temask
> would have to be 0xff to cover all 8 channels, unlike vec4.

I think you are misunderstanding how these SIMD8/16 write messages work.  I'll assume 8 in the following discussion but it all applies to 16.

As the shader executes, it is executes 8 pixels at a time.  Each sub-register represents the same symbolic value in GLSL but for a different pixel.  Suppose I have an SSBO declared as follows:

buffer Block {
    vec4 s[128];
};

And suppose I execute the line of code "s[i].xzw = foo;" where foo is some vec3.  When the SIMD8 shader reaches this line, it stores 12 values in the SSBO; 3 per pixel.  If the client doesn't want the values to stomp on each other, it is up to the client to ensure that i is different for each pixel.

How does this work with the scattered read/write messages?  They are designed for exactly a case like this.  When you get to this statement, you will have one register that holds the value of i and three more for foo.  Each of these registers has 8 sub-registers one for each SIMD channel (or pixel).  All you should have to do is build 3 messages each one of which is i + some math for the address part and a component of foo for the payload part.  Each scattered write writes 8 values but they are the different values from the different SIMD channels, not from different components of foo.  The first one will write all 8 of the s[i].x, the next one s[i].y, etc.

Does that make more sense?

> > If you're trying to use scattered read/write in vec4, then you may be
> > running into execution mask issues.  I don't know how the execution mask in
> > 4x2 is laid out but scattered read/write is usually a SIMD8 message.  It can
> > be used in 4x2 mode but you'll have to monkey with the writemask yourself. 
> > I'm not sure how you do that.  Ken would know.
> 
> Haven't tried this for vec4 yet, but if I end up needing it there too I'll
> ask Ken.

For vec4, the oword messages are *probably* what you want, but I'm not sure how that plays with packing.  That said, I think it's probably best to get this working for the FS backend as it's a good deal simpler there.  It also allows us to enable it on BDW+ before you get it working in vec4.
Comment 17 Iago Toral 2015-04-17 07:56:59 UTC
(In reply to Jason Ekstrand from comment #16)
> (In reply to Iago Toral from comment #15)
> > (In reply to Jason Ekstrand from comment #14)
> > > I'm confused.  Are you trying to use dword scattered read/write for vec4? 
> > > In SIMD8, you only write one float at a time anyway.  Unless, of course, I'm
> > > massively misunderstanding SSBO's.  For vec4, I think you want the 
> > 
> > Nope, this is SIMD8/16, haven't tried to use this with vec4. The thing is,
> > imagine that I have a vector type at the IR element with element count > 1.
> > Initially I would loop through the elements and write each one individually
> > by passing offset(value_reg, i) as src to the write message, but then I
> > noticed that I could use the same message to write all the elements in the
> > vector (up to 4) in one go if I provided 4 different offsets to the
> > scattered message and prepared the message payload with the 4 floats to
> > write at each offset. That is, I do something like this in the visitor:
> > 
> >    /* Prepare scattered write message payload.
> >     * M1.0..M1.3: Dword offsets to be added to the global offset
> >     * M2.0..M2.3: Dword values
> >     */
> >    int base_mrf = 1;
> >    for (int i = 0; i < ir->val->type->vector_elements; i++) {
> >       int component_mask = 1 << i;
> >       if (ir->write_mask & component_mask) {
> >          fs_reg mrf = fs_reg(MRF, base_mrf + 1, BRW_REGISTER_TYPE_UD);
> >          mrf.subreg_offset += i * type_sz(mrf.type);
> >          emit(MOV(mrf, brw_imm_ud(i)));
> > 
> >          mrf = fs_reg(MRF, base_mrf + 2, val_reg.type);
> >          mrf.subreg_offset += i * type_sz(mrf.type);
> >          emit(MOV(mrf, offset(val_reg, i)));
> >       }
> >    }
> > 
> >    /* Set the writemask so we only write to the offsets we want */
> >    struct brw_reg brw_dst =
> >       brw_set_writemask(brw_vec8_grf(0, 0), ir->write_mask);
> >    fs_reg push_dst = fs_reg(brw_dst);
> >    fs_inst *inst =
> >       new(mem_ctx) fs_inst(SHADER_OPCODE_SCATTERED_BUFFER_STORE, 8,
> >                            push_dst, surf_index, offset_reg);
> > 
> > This seems to work well, and for vectors I end up only needing one message
> > to write all the channels I need to write. Now that I think about it, the
> > reason I only get 4 channels written at most is probably because
> > ir->write_mask can be 0xf at most, I imagine that in SIMD8 the wridst temask
> > would have to be 0xff to cover all 8 channels, unlike vec4.
> 
> I think you are misunderstanding how these SIMD8/16 write messages work. 
> I'll assume 8 in the following discussion but it all applies to 16.
> 
> As the shader executes, it is executes 8 pixels at a time.  Each
> sub-register represents the same symbolic value in GLSL but for a different
> pixel.  Suppose I have an SSBO declared as follows:
> 
> buffer Block {
>     vec4 s[128];
> };
> 
> And suppose I execute the line of code "s[i].xzw = foo;" where foo is some
> vec3.  When the SIMD8 shader reaches this line, it stores 12 values in the
> SSBO; 3 per pixel.  If the client doesn't want the values to stomp on each
> other, it is up to the client to ensure that i is different for each pixel.
> 
> How does this work with the scattered read/write messages?  They are
> designed for exactly a case like this.  When you get to this statement, you
> will have one register that holds the value of i and three more for foo. 
> Each of these registers has 8 sub-registers one for each SIMD channel (or
> pixel).  All you should have to do is build 3 messages each one of which is
> i + some math for the address part and a component of foo for the payload
> part.  Each scattered write writes 8 values but they are the different
> values from the different SIMD channels, not from different components of
> foo.  The first one will write all 8 of the s[i].x, the next one s[i].y, etc.
> 
> Does that make more sense?

It does, thanks for the detailed explanation! I'll revert the implementation to what I had before then. I suppose what I have now works simply because all pixels are writing the same value to the same offset...

> > > If you're trying to use scattered read/write in vec4, then you may be
> > > running into execution mask issues.  I don't know how the execution mask in
> > > 4x2 is laid out but scattered read/write is usually a SIMD8 message.  It can
> > > be used in 4x2 mode but you'll have to monkey with the writemask yourself. 
> > > I'm not sure how you do that.  Ken would know.
> > 
> > Haven't tried this for vec4 yet, but if I end up needing it there too I'll
> > ask Ken.
> 
> For vec4, the oword messages are *probably* what you want, but I'm not sure
> how that plays with packing.  That said, I think it's probably best to get
> this working for the FS backend as it's a good deal simpler there.  It also
> allows us to enable it on BDW+ before you get it working in vec4.

Sure, will focus on that first. Thanks again Jason!
Comment 18 Iago Toral 2015-04-17 12:08:26 UTC
Jason, I think I got it working for SIMD8 but I have a question regarding SIMD16:

> (In reply to Jason Ekstrand from comment #16)
> > I think you are misunderstanding how these SIMD8/16 write messages work. 
> > I'll assume 8 in the following discussion but it all applies to 16.
> > 
> > As the shader executes, it is executes 8 pixels at a time.  Each
> > sub-register represents the same symbolic value in GLSL but for a different
> > pixel.  Suppose I have an SSBO declared as follows:
> > 
> > buffer Block {
> >     vec4 s[128];
> > };
> > 
> > And suppose I execute the line of code "s[i].xzw = foo;" where foo is some
> > vec3.  When the SIMD8 shader reaches this line, it stores 12 values in the
> > SSBO; 3 per pixel.  If the client doesn't want the values to stomp on each
> > other, it is up to the client to ensure that i is different for each pixel.
> > 
> > How does this work with the scattered read/write messages?  They are
> > designed for exactly a case like this.  When you get to this statement, you
> > will have one register that holds the value of i and three more for foo. 
> > Each of these registers has 8 sub-registers one for each SIMD channel (or
> > pixel).

In SIMD16 the instructions operate on 16 elements, but I understand that registers still have 8 sub-registers, so this instruction:

mov(16)   g116<1>F     1.0F                { align1 1H };

is writing 1.0 in all sub-registers of g116 (8 elements) and all sub-registers of g117 (8 elements). Is this correct? If I am correct, then I would expect this assembly code for a SIMD16 scattered write to work:

mov(8)    g113<1>UD    g0<8,8,1>UD         { align1 WE_all 1Q compacted };
mov(1)    g113.2<1>UD  0x00000000UD        { align1 WE_all compacted };
mov(16)   g114<1>UD    g13<8,8,1>UD        { align1 1H compacted };
mov(16)   g116<1>F     1.0F                { align1 1H };
send(16)  g0<1>F       g113<8,8,1>F
        data ( DC DWORD scatterd write, 1, 3) mlen 5 rlen 0 { align1 1H };

The first mov(16) would write the offset payload to M1,M2 (g114,g115) and the second mov(16) would write the data payload to M3,M4 (g116,g117). However, I see that this does not produce correct writes into the buffer, I see writes to the correct offsets but with wrong data, so I guess I am understanding something wrong again?.

For the record, this same code works fine if I make the second mov(16) write to g115 (like I do in SIMD8, where we want offsets in M1 and data in M2), but as far as my understanding goes, this should actually be incorrect for SIMD16.

> > All you should have to do is build 3 messages each one of which is
> > i + some math for the address part and a component of foo for the payload
> > part.  Each scattered write writes 8 values but they are the different
> > values from the different SIMD channels, not from different components of
> > foo.  The first one will write all 8 of the s[i].x, the next one s[i].y, etc.
Comment 19 Kristian Høgsberg 2015-04-17 16:21:19 UTC
(In reply to Iago Toral from comment #18)
> Jason, I think I got it working for SIMD8 but I have a question regarding
> SIMD16:
> 
> > (In reply to Jason Ekstrand from comment #16)
> > > I think you are misunderstanding how these SIMD8/16 write messages work. 
> > > I'll assume 8 in the following discussion but it all applies to 16.
> > > 
> > > As the shader executes, it is executes 8 pixels at a time.  Each
> > > sub-register represents the same symbolic value in GLSL but for a different
> > > pixel.  Suppose I have an SSBO declared as follows:
> > > 
> > > buffer Block {
> > >     vec4 s[128];
> > > };
> > > 
> > > And suppose I execute the line of code "s[i].xzw = foo;" where foo is some
> > > vec3.  When the SIMD8 shader reaches this line, it stores 12 values in the
> > > SSBO; 3 per pixel.  If the client doesn't want the values to stomp on each
> > > other, it is up to the client to ensure that i is different for each pixel.
> > > 
> > > How does this work with the scattered read/write messages?  They are
> > > designed for exactly a case like this.  When you get to this statement, you
> > > will have one register that holds the value of i and three more for foo. 
> > > Each of these registers has 8 sub-registers one for each SIMD channel (or
> > > pixel).
> 
> In SIMD16 the instructions operate on 16 elements, but I understand that
> registers still have 8 sub-registers, so this instruction:
> 
> mov(16)   g116<1>F     1.0F                { align1 1H };
> 
> is writing 1.0 in all sub-registers of g116 (8 elements) and all
> sub-registers of g117 (8 elements). Is this correct? If I am correct, then I
> would expect this assembly code for a SIMD16 scattered write to work:
> 
> mov(8)    g113<1>UD    g0<8,8,1>UD         { align1 WE_all 1Q compacted };
> mov(1)    g113.2<1>UD  0x00000000UD        { align1 WE_all compacted };
> mov(16)   g114<1>UD    g13<8,8,1>UD        { align1 1H compacted };
> mov(16)   g116<1>F     1.0F                { align1 1H };
> send(16)  g0<1>F       g113<8,8,1>F
>         data ( DC DWORD scatterd write, 1, 3) mlen 5 rlen 0 { align1 1H };
> 
> The first mov(16) would write the offset payload to M1,M2 (g114,g115) and
> the second mov(16) would write the data payload to M3,M4 (g116,g117).
> However, I see that this does not produce correct writes into the buffer, I
> see writes to the correct offsets but with wrong data, so I guess I am
> understanding something wrong again?.
> 
> For the record, this same code works fine if I make the second mov(16) write
> to g115 (like I do in SIMD8, where we want offsets in M1 and data in M2),
> but as far as my understanding goes, this should actually be incorrect for
> SIMD16.
> 
> > > All you should have to do is build 3 messages each one of which is
> > > i + some math for the address part and a component of foo for the payload
> > > part.  Each scattered write writes 8 values but they are the different
> > > values from the different SIMD channels, not from different components of
> > > foo.  The first one will write all 8 of the s[i].x, the next one s[i].y, etc.

Are you setting the block size in the message descriptor?

Bits 9:8 should be

  10: 8 DWords
  11: 16 DWords
Comment 20 Jason Ekstrand 2015-04-17 17:28:33 UTC
(In reply to Kristian Høgsberg from comment #19)
> (In reply to Iago Toral from comment #18)
> > Jason, I think I got it working for SIMD8 but I have a question regarding
> > SIMD16:
> > 
> > > (In reply to Jason Ekstrand from comment #16)
> > > > I think you are misunderstanding how these SIMD8/16 write messages work. 
> > > > I'll assume 8 in the following discussion but it all applies to 16.
> > > > 
> > > > As the shader executes, it is executes 8 pixels at a time.  Each
> > > > sub-register represents the same symbolic value in GLSL but for a different
> > > > pixel.  Suppose I have an SSBO declared as follows:
> > > > 
> > > > buffer Block {
> > > >     vec4 s[128];
> > > > };
> > > > 
> > > > And suppose I execute the line of code "s[i].xzw = foo;" where foo is some
> > > > vec3.  When the SIMD8 shader reaches this line, it stores 12 values in the
> > > > SSBO; 3 per pixel.  If the client doesn't want the values to stomp on each
> > > > other, it is up to the client to ensure that i is different for each pixel.
> > > > 
> > > > How does this work with the scattered read/write messages?  They are
> > > > designed for exactly a case like this.  When you get to this statement, you
> > > > will have one register that holds the value of i and three more for foo. 
> > > > Each of these registers has 8 sub-registers one for each SIMD channel (or
> > > > pixel).
> > 
> > In SIMD16 the instructions operate on 16 elements, but I understand that
> > registers still have 8 sub-registers, so this instruction:
> > 
> > mov(16)   g116<1>F     1.0F                { align1 1H };
> > 
> > is writing 1.0 in all sub-registers of g116 (8 elements) and all
> > sub-registers of g117 (8 elements). Is this correct? If I am correct, then I
> > would expect this assembly code for a SIMD16 scattered write to work:
> > 
> > mov(8)    g113<1>UD    g0<8,8,1>UD         { align1 WE_all 1Q compacted };
> > mov(1)    g113.2<1>UD  0x00000000UD        { align1 WE_all compacted };
> > mov(16)   g114<1>UD    g13<8,8,1>UD        { align1 1H compacted };
> > mov(16)   g116<1>F     1.0F                { align1 1H };
> > send(16)  g0<1>F       g113<8,8,1>F
> >         data ( DC DWORD scatterd write, 1, 3) mlen 5 rlen 0 { align1 1H };
> > 
> > The first mov(16) would write the offset payload to M1,M2 (g114,g115) and
> > the second mov(16) would write the data payload to M3,M4 (g116,g117).
> > However, I see that this does not produce correct writes into the buffer, I
> > see writes to the correct offsets but with wrong data, so I guess I am
> > understanding something wrong again?.
> > 
> > For the record, this same code works fine if I make the second mov(16) write
> > to g115 (like I do in SIMD8, where we want offsets in M1 and data in M2),
> > but as far as my understanding goes, this should actually be incorrect for
> > SIMD16.
> > 
> > > > All you should have to do is build 3 messages each one of which is
> > > > i + some math for the address part and a component of foo for the payload
> > > > part.  Each scattered write writes 8 values but they are the different
> > > > values from the different SIMD channels, not from different components of
> > > > foo.  The first one will write all 8 of the s[i].x, the next one s[i].y, etc.
> 
> Are you setting the block size in the message descriptor?
> 
> Bits 9:8 should be
> 
>   10: 8 DWords
>   11: 16 DWords

Yes, I think this is most likely the problem. We actually have a nice #define for this.  You can see it in use in my wip/fs-indirects-v0.5 branch in this commit:

http://cgit.freedesktop.org/~jekstrand/mesa/commit/?h=wip/fs-indirects-v0.5&id=df4293526f873102b45dd89dc20b084bc8662181

In fact, feel free to just cherry-pick that if you think it's what you want.  It also handles setting the right opcode for the different gens.
Comment 21 Iago Toral 2015-04-20 06:11:04 UTC
(In reply to Jason Ekstrand from comment #20)
> (In reply to Kristian Høgsberg from comment #19)
> > (In reply to Iago Toral from comment #18)
> > > Jason, I think I got it working for SIMD8 but I have a question regarding
> > > SIMD16:
> > > 
> > > > (In reply to Jason Ekstrand from comment #16)
> > > > > I think you are misunderstanding how these SIMD8/16 write messages work. 
> > > > > I'll assume 8 in the following discussion but it all applies to 16.
> > > > > 
> > > > > As the shader executes, it is executes 8 pixels at a time.  Each
> > > > > sub-register represents the same symbolic value in GLSL but for a different
> > > > > pixel.  Suppose I have an SSBO declared as follows:
> > > > > 
> > > > > buffer Block {
> > > > >     vec4 s[128];
> > > > > };
> > > > > 
> > > > > And suppose I execute the line of code "s[i].xzw = foo;" where foo is some
> > > > > vec3.  When the SIMD8 shader reaches this line, it stores 12 values in the
> > > > > SSBO; 3 per pixel.  If the client doesn't want the values to stomp on each
> > > > > other, it is up to the client to ensure that i is different for each pixel.
> > > > > 
> > > > > How does this work with the scattered read/write messages?  They are
> > > > > designed for exactly a case like this.  When you get to this statement, you
> > > > > will have one register that holds the value of i and three more for foo. 
> > > > > Each of these registers has 8 sub-registers one for each SIMD channel (or
> > > > > pixel).
> > > 
> > > In SIMD16 the instructions operate on 16 elements, but I understand that
> > > registers still have 8 sub-registers, so this instruction:
> > > 
> > > mov(16)   g116<1>F     1.0F                { align1 1H };
> > > 
> > > is writing 1.0 in all sub-registers of g116 (8 elements) and all
> > > sub-registers of g117 (8 elements). Is this correct? If I am correct, then I
> > > would expect this assembly code for a SIMD16 scattered write to work:
> > > 
> > > mov(8)    g113<1>UD    g0<8,8,1>UD         { align1 WE_all 1Q compacted };
> > > mov(1)    g113.2<1>UD  0x00000000UD        { align1 WE_all compacted };
> > > mov(16)   g114<1>UD    g13<8,8,1>UD        { align1 1H compacted };
> > > mov(16)   g116<1>F     1.0F                { align1 1H };
> > > send(16)  g0<1>F       g113<8,8,1>F
> > >         data ( DC DWORD scatterd write, 1, 3) mlen 5 rlen 0 { align1 1H };
> > > 
> > > The first mov(16) would write the offset payload to M1,M2 (g114,g115) and
> > > the second mov(16) would write the data payload to M3,M4 (g116,g117).
> > > However, I see that this does not produce correct writes into the buffer, I
> > > see writes to the correct offsets but with wrong data, so I guess I am
> > > understanding something wrong again?.
> > > 
> > > For the record, this same code works fine if I make the second mov(16) write
> > > to g115 (like I do in SIMD8, where we want offsets in M1 and data in M2),
> > > but as far as my understanding goes, this should actually be incorrect for
> > > SIMD16.
> > > 
> > > > > All you should have to do is build 3 messages each one of which is
> > > > > i + some math for the address part and a component of foo for the payload
> > > > > part.  Each scattered write writes 8 values but they are the different
> > > > > values from the different SIMD channels, not from different components of
> > > > > foo.  The first one will write all 8 of the s[i].x, the next one s[i].y, etc.
> > 
> > Are you setting the block size in the message descriptor?
> > 
> > Bits 9:8 should be
> > 
> >   10: 8 DWords
> >   11: 16 DWords
> 
> Yes, I think this is most likely the problem. We actually have a nice
> #define for this.  You can see it in use in my wip/fs-indirects-v0.5 branch
> in this commit:
> 
> http://cgit.freedesktop.org/~jekstrand/mesa/commit/?h=wip/fs-indirects-v0.
> 5&id=df4293526f873102b45dd89dc20b084bc8662181
> 
> In fact, feel free to just cherry-pick that if you think it's what you want.
> It also handles setting the right opcode for the different gens.

Nope, that shouldn't be it, this is what I have:

int mlen, msg_type;
if (dispatch_width == 8) {
   msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_8DWORDS;
   mlen = 3;
} else {
   msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_16DWORDS;
   mlen = 5;
}

I use dispatch_width rather than the inst->exec_size to check if we are in SIMD16 or SIMD8 mode, but both should be valid I guess. Anyway, if my understanding of how things operate in SIMD16 mode is correct then there must be something silly getting in the way, I'll try to track it down.

On a related note, I am trying to test the behavior for SIMD8/SIMD16 with a fragment shader like this:

   int index = int(mod(gl_FragCoord.x, 32));
   data[index] = index;


so that I have each pixel write a different value to a different index. To my surprise, if I always use  the SIMD8 implementation  (i.e. only write 8 offsets to M1 and 8 values to M2 and use BRW_DATAPORT_DWORD_SCATTERED_BLOCK_8DWORDS), the result is correct even for a SIMD16 execution (that is, I read data[i] == i after rendering). Shouldn't this shader produce bogus results at least for some of the indices in SIMD16 mode with this setup?
Comment 22 Iago Toral 2015-04-20 07:35:00 UTC
(...)
> > > (In reply to Iago Toral from comment #18)
> > > Are you setting the block size in the message descriptor?
> > > 
> > > Bits 9:8 should be
> > > 
> > >   10: 8 DWords
> > >   11: 16 DWords
> > 
> > Yes, I think this is most likely the problem. We actually have a nice
> > #define for this.  You can see it in use in my wip/fs-indirects-v0.5 branch
> > in this commit:
> > 
> > http://cgit.freedesktop.org/~jekstrand/mesa/commit/?h=wip/fs-indirects-v0.
> > 5&id=df4293526f873102b45dd89dc20b084bc8662181
> > 
> > In fact, feel free to just cherry-pick that if you think it's what you want.
> > It also handles setting the right opcode for the different gens.
> 
> Nope, that shouldn't be it, this is what I have:
> 
> int mlen, msg_type;
> if (dispatch_width == 8) {
>    msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_8DWORDS;
>    mlen = 3;
> } else {
>    msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_16DWORDS;
>    mlen = 5;
> }
> 
> I use dispatch_width rather than the inst->exec_size to check if we are in
> SIMD16 or SIMD8 mode, but both should be valid I guess. Anyway, if my
> understanding of how things operate in SIMD16 mode is correct then there
> must be something silly getting in the way, I'll try to track it down.

I fixed the problem, in case you are curious: I had to update the implementation of fs_inst::regs_read in brw_fs.cpp. As soon as I added that things started to work fine.
Comment 23 Jason Ekstrand 2015-04-20 19:07:50 UTC
(In reply to Iago Toral from comment #22)
> (...)
> > > > (In reply to Iago Toral from comment #18)
> > > > Are you setting the block size in the message descriptor?
> > > > 
> > > > Bits 9:8 should be
> > > > 
> > > >   10: 8 DWords
> > > >   11: 16 DWords
> > > 
> > > Yes, I think this is most likely the problem. We actually have a nice
> > > #define for this.  You can see it in use in my wip/fs-indirects-v0.5 branch
> > > in this commit:
> > > 
> > > http://cgit.freedesktop.org/~jekstrand/mesa/commit/?h=wip/fs-indirects-v0.
> > > 5&id=df4293526f873102b45dd89dc20b084bc8662181
> > > 
> > > In fact, feel free to just cherry-pick that if you think it's what you want.
> > > It also handles setting the right opcode for the different gens.
> > 
> > Nope, that shouldn't be it, this is what I have:
> > 
> > int mlen, msg_type;
> > if (dispatch_width == 8) {

This should be inst->exec_size.  That way it will work of someone does an exec_size 8 in SIMD16 mode.

> >    msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_8DWORDS;
> >    mlen = 3;

mlen should be set on the instruction in the visitor not at the generator level.

> > } else {
> >    msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_16DWORDS;
> >    mlen = 5;
> > }
> > 
> > I use dispatch_width rather than the inst->exec_size to check if we are in
> > SIMD16 or SIMD8 mode, but both should be valid I guess. Anyway, if my
> > understanding of how things operate in SIMD16 mode is correct then there
> > must be something silly getting in the way, I'll try to track it down.
> 
> I fixed the problem, in case you are curious: I had to update the
> implementation of fs_inst::regs_read in brw_fs.cpp. As soon as I added that
> things started to work fine.

Also make sure you update has_side_effects as well
--Jason
Comment 24 Iago Toral 2015-04-21 05:58:05 UTC
(In reply to Jason Ekstrand from comment #23)
> (In reply to Iago Toral from comment #22)
> > (...)
> > > > > (In reply to Iago Toral from comment #18)
> > > > > Are you setting the block size in the message descriptor?
> > > > > 
> > > > > Bits 9:8 should be
> > > > > 
> > > > >   10: 8 DWords
> > > > >   11: 16 DWords
> > > > 
> > > > Yes, I think this is most likely the problem. We actually have a nice
> > > > #define for this.  You can see it in use in my wip/fs-indirects-v0.5 branch
> > > > in this commit:
> > > > 
> > > > http://cgit.freedesktop.org/~jekstrand/mesa/commit/?h=wip/fs-indirects-v0.
> > > > 5&id=df4293526f873102b45dd89dc20b084bc8662181
> > > > 
> > > > In fact, feel free to just cherry-pick that if you think it's what you want.
> > > > It also handles setting the right opcode for the different gens.
> > > 
> > > Nope, that shouldn't be it, this is what I have:
> > > 
> > > int mlen, msg_type;
> > > if (dispatch_width == 8) {
> 
> This should be inst->exec_size.  That way it will work of someone does an
> exec_size 8 in SIMD16 mode.

I see, I'll change this.

> > >    msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_8DWORDS;
> > >    mlen = 3;
> 
> mlen should be set on the instruction in the visitor not at the generator
> level.

Yeah, I have changed that already.

> > > } else {
> > >    msg_type = BRW_DATAPORT_DWORD_SCATTERED_BLOCK_16DWORDS;
> > >    mlen = 5;
> > > }
> > > 
> > > I use dispatch_width rather than the inst->exec_size to check if we are in
> > > SIMD16 or SIMD8 mode, but both should be valid I guess. Anyway, if my
> > > understanding of how things operate in SIMD16 mode is correct then there
> > > must be something silly getting in the way, I'll try to track it down.
> > 
> > I fixed the problem, in case you are curious: I had to update the
> > implementation of fs_inst::regs_read in brw_fs.cpp. As soon as I added that
> > things started to work fine.
> 
> Also make sure you update has_side_effects as well

Yep, I did this already.

Thanks for the tips Jason.
Comment 25 Iago Toral 2015-04-21 08:13:45 UTC
I think the FS bits work well now and I have switched focus to vec4. Here the problem is again with non-constant offsets that are not 16-byte aligned (this is actually not working in master for UBOs either). I have a working solution for this, but I'd like to discuss here the details and see what you think about it:

Instead of using scattered messages like we do for writes in the FS I am experimenting with unaligned oword block reads in this case. The solution, that I describe below, seems to work well but I had to work around a couple of issues that maybe can be dealt with in a better way. The solution looks like this (only for non-constant offsets, for constant offsets we can just use dual oword read):

1) Since unaligned oword messages are not "dual", we have to handle each of the two SIMD4x2 vertices separately, so I emit two separate unaligned reads, one from offset.0 (first vertex) and another from offset.4 (second vertex). I store the results of these reads to separate virtual registers read_result0 and read_result1 respectively.

2) In the next step I merge both read results into a single register suitable for SIMD4x2 operation. That is, if we call dst the destination of the SIMD4x2 operation, I move the lower half of read_result0 to the lower half of dst and the lower half of read_result1 to the higher half of dst. For this part I have defined a generator opcode (let's call it simd4x2_merge) that I initially implemented as two mov(4) operations, like this:

   brw_MOV(p,
           brw_vec4_reg(dst.file, dst.nr, 0),
           brw_vec4_reg(src0.file, src0.nr, 0));
   brw_MOV(p,
           brw_vec4_reg(dst.file, dst.nr, 4),
           brw_vec4_reg(src0.file, src1.nr, 0));

The first problem I found with this solution is that in some examples it would trigger the following assertion:

brw_vec4_generator.cpp:1927: void brw::vec4_generator::generate_code(const cfg_t*): Assertion `p->nr_insn == pre_emit_nr_insn + 1 || !"conditional_mod, no_dd_check, or no_dd_clear set for IR " "emitting more than 1 instruction"' failed.

This problem comes from the fact that in these cases, opt_set_dependency_control would set no_dd_clear/no_dd_check on the simd4x2_merge instruction and this does not seem to like the fact that this generator opcode actually expands to more than just one assembly instruction. I did not see an obvious way to deal with this other than skipping opt_set_dependency_control for this generator opcode, since the fact that it spawns more than just one assembly instruction will inevitably lead to this problem. I imagine that it could be possible to fix this more generally by having the generator set dependency control flags on all the instructions emitted by the opcode maybe. Anyway, since there is a  is_dep_ctrl_unsafe() function that seems to be there to select situations where we want to avoid opt_set_dependency_control to kick in, I just added this opcode there. Is there anything else to this scenario that I am missing?

With that fixed, I found another issue with the register coalesce optimization pass, as it attempted to rewrite the simd4x2_merge instruction to write directly to an MRF register using a writemask. As I show above, my first approach to the generator opcode was to have two MOVs that would always write all of the dst register (since that made sense in my context), but if other optimization passes can take the liberty to rewrite the instruction to introduce a writemask then that approach is no longer valid and I have to honor the writemask on the dst. The (small?) problem with that is that as far as I know, I cannot do mov(4) operations that honor dst.dw1.bits.writemask, so I have to do that manually emitting mov(1) operations for each channel enabled. Is there a better way to handle this?
Comment 26 Kristian Høgsberg 2015-04-21 16:06:46 UTC
(In reply to Iago Toral from comment #25)
> I think the FS bits work well now and I have switched focus to vec4. Here
> the problem is again with non-constant offsets that are not 16-byte aligned
> (this is actually not working in master for UBOs either). I have a working
> solution for this, but I'd like to discuss here the details and see what you
> think about it:
> 
> Instead of using scattered messages like we do for writes in the FS I am
> experimenting with unaligned oword block reads in this case. The solution,
> that I describe below, seems to work well but I had to work around a couple
> of issues that maybe can be dealt with in a better way. The solution looks
> like this (only for non-constant offsets, for constant offsets we can just
> use dual oword read):
> 
> 1) Since unaligned oword messages are not "dual", we have to handle each of
> the two SIMD4x2 vertices separately, so I emit two separate unaligned reads,
> one from offset.0 (first vertex) and another from offset.4 (second vertex).
> I store the results of these reads to separate virtual registers
> read_result0 and read_result1 respectively.
> 
> 2) In the next step I merge both read results into a single register
> suitable for SIMD4x2 operation. That is, if we call dst the destination of
> the SIMD4x2 operation, I move the lower half of read_result0 to the lower
> half of dst and the lower half of read_result1 to the higher half of dst.
> For this part I have defined a generator opcode (let's call it
> simd4x2_merge) that I initially implemented as two mov(4) operations, like
> this:
> 
>    brw_MOV(p,
>            brw_vec4_reg(dst.file, dst.nr, 0),
>            brw_vec4_reg(src0.file, src0.nr, 0));
>    brw_MOV(p,
>            brw_vec4_reg(dst.file, dst.nr, 4),
>            brw_vec4_reg(src0.file, src1.nr, 0));
> 
> The first problem I found with this solution is that in some examples it
> would trigger the following assertion:
> 
> brw_vec4_generator.cpp:1927: void brw::vec4_generator::generate_code(const
> cfg_t*): Assertion `p->nr_insn == pre_emit_nr_insn + 1 || !"conditional_mod,
> no_dd_check, or no_dd_clear set for IR " "emitting more than 1 instruction"'
> failed.
> 
> This problem comes from the fact that in these cases,
> opt_set_dependency_control would set no_dd_clear/no_dd_check on the
> simd4x2_merge instruction and this does not seem to like the fact that this
> generator opcode actually expands to more than just one assembly
> instruction. I did not see an obvious way to deal with this other than
> skipping opt_set_dependency_control for this generator opcode, since the
> fact that it spawns more than just one assembly instruction will inevitably
> lead to this problem. I imagine that it could be possible to fix this more
> generally by having the generator set dependency control flags on all the
> instructions emitted by the opcode maybe. Anyway, since there is a 
> is_dep_ctrl_unsafe() function that seems to be there to select situations
> where we want to avoid opt_set_dependency_control to kick in, I just added
> this opcode there. Is there anything else to this scenario that I am missing?
> 
> With that fixed, I found another issue with the register coalesce
> optimization pass, as it attempted to rewrite the simd4x2_merge instruction
> to write directly to an MRF register using a writemask. As I show above, my
> first approach to the generator opcode was to have two MOVs that would
> always write all of the dst register (since that made sense in my context),
> but if other optimization passes can take the liberty to rewrite the
> instruction to introduce a writemask then that approach is no longer valid
> and I have to honor the writemask on the dst. The (small?) problem with that
> is that as far as I know, I cannot do mov(4) operations that honor
> dst.dw1.bits.writemask, so I have to do that manually emitting mov(1)
> operations for each channel enabled. Is there a better way to handle this?

Would it be better to use the scattered read in vec4 mode as well to load both vec4s in one operation? Either way, there is no unaligned write message, so we'd have to use scatter write for writing.
Comment 27 Matt Turner 2015-04-21 16:17:51 UTC
(In reply to Iago Toral from comment #25)
> I think the FS bits work well now and I have switched focus to vec4. Here
> the problem is again with non-constant offsets that are not 16-byte aligned
> (this is actually not working in master for UBOs either). I have a working
> solution for this, but I'd like to discuss here the details and see what you
> think about it:
> 
> Instead of using scattered messages like we do for writes in the FS I am
> experimenting with unaligned oword block reads in this case. The solution,
> that I describe below, seems to work well but I had to work around a couple
> of issues that maybe can be dealt with in a better way. The solution looks
> like this (only for non-constant offsets, for constant offsets we can just
> use dual oword read):
> 
> 1) Since unaligned oword messages are not "dual", we have to handle each of
> the two SIMD4x2 vertices separately, so I emit two separate unaligned reads,
> one from offset.0 (first vertex) and another from offset.4 (second vertex).
> I store the results of these reads to separate virtual registers
> read_result0 and read_result1 respectively.
> 
> 2) In the next step I merge both read results into a single register
> suitable for SIMD4x2 operation. That is, if we call dst the destination of
> the SIMD4x2 operation, I move the lower half of read_result0 to the lower
> half of dst and the lower half of read_result1 to the higher half of dst.
> For this part I have defined a generator opcode (let's call it
> simd4x2_merge) that I initially implemented as two mov(4) operations, like
> this:
> 
>    brw_MOV(p,
>            brw_vec4_reg(dst.file, dst.nr, 0),
>            brw_vec4_reg(src0.file, src0.nr, 0));
>    brw_MOV(p,
>            brw_vec4_reg(dst.file, dst.nr, 4),
>            brw_vec4_reg(src0.file, src1.nr, 0));
> 
> The first problem I found with this solution is that in some examples it
> would trigger the following assertion:
> 
> brw_vec4_generator.cpp:1927: void brw::vec4_generator::generate_code(const
> cfg_t*): Assertion `p->nr_insn == pre_emit_nr_insn + 1 || !"conditional_mod,
> no_dd_check, or no_dd_clear set for IR " "emitting more than 1 instruction"'
> failed.
> 
> This problem comes from the fact that in these cases,
> opt_set_dependency_control would set no_dd_clear/no_dd_check on the
> simd4x2_merge instruction and this does not seem to like the fact that this
> generator opcode actually expands to more than just one assembly
> instruction. I did not see an obvious way to deal with this other than
> skipping opt_set_dependency_control for this generator opcode, since the
> fact that it spawns more than just one assembly instruction will inevitably
> lead to this problem. I imagine that it could be possible to fix this more
> generally by having the generator set dependency control flags on all the
> instructions emitted by the opcode maybe. Anyway, since there is a 
> is_dep_ctrl_unsafe() function that seems to be there to select situations
> where we want to avoid opt_set_dependency_control to kick in, I just added
> this opcode there. Is there anything else to this scenario that I am missing?

Your analysis is correct. Though it's easy to work around like we did in brw_fs_generator.cpp. See commit 7452f18b -- basically just add a 'multiple_instructions_emitted' bool and use it to skip setting the dependency control bits.

> With that fixed, I found another issue with the register coalesce
> optimization pass, as it attempted to rewrite the simd4x2_merge instruction
> to write directly to an MRF register using a writemask. As I show above, my
> first approach to the generator opcode was to have two MOVs that would
> always write all of the dst register (since that made sense in my context),
> but if other optimization passes can take the liberty to rewrite the
> instruction to introduce a writemask then that approach is no longer valid
> and I have to honor the writemask on the dst. The (small?) problem with that
> is that as far as I know, I cannot do mov(4) operations that honor
> dst.dw1.bits.writemask, so I have to do that manually emitting mov(1)
> operations for each channel enabled. Is there a better way to handle this?

With the multiple instructions thing fixed, I don't think this needs to be solved? I think it's simpler to emit the two MOV(4)s in the generator than figuring out what needs to be fixed in the vec4 optimizations to allow it.
Comment 28 Iago Toral 2015-04-22 06:13:06 UTC
(In reply to Kristian Høgsberg from comment #26)
> (In reply to Iago Toral from comment #25)
> > I think the FS bits work well now and I have switched focus to vec4. Here
> > the problem is again with non-constant offsets that are not 16-byte aligned
> > (this is actually not working in master for UBOs either). I have a working
> > solution for this, but I'd like to discuss here the details and see what you
> > think about it:
> > 
> > Instead of using scattered messages like we do for writes in the FS I am
> > experimenting with unaligned oword block reads in this case. The solution,
> > that I describe below, seems to work well but I had to work around a couple
> > of issues that maybe can be dealt with in a better way. The solution looks
> > like this (only for non-constant offsets, for constant offsets we can just
> > use dual oword read):
> > 
> > 1) Since unaligned oword messages are not "dual", we have to handle each of
> > the two SIMD4x2 vertices separately, so I emit two separate unaligned reads,
> > one from offset.0 (first vertex) and another from offset.4 (second vertex).
> > I store the results of these reads to separate virtual registers
> > read_result0 and read_result1 respectively.
> > 
> > 2) In the next step I merge both read results into a single register
> > suitable for SIMD4x2 operation. That is, if we call dst the destination of
> > the SIMD4x2 operation, I move the lower half of read_result0 to the lower
> > half of dst and the lower half of read_result1 to the higher half of dst.
> > For this part I have defined a generator opcode (let's call it
> > simd4x2_merge) that I initially implemented as two mov(4) operations, like
> > this:
> > 
> >    brw_MOV(p,
> >            brw_vec4_reg(dst.file, dst.nr, 0),
> >            brw_vec4_reg(src0.file, src0.nr, 0));
> >    brw_MOV(p,
> >            brw_vec4_reg(dst.file, dst.nr, 4),
> >            brw_vec4_reg(src0.file, src1.nr, 0));
> > 
> > The first problem I found with this solution is that in some examples it
> > would trigger the following assertion:
> > 
> > brw_vec4_generator.cpp:1927: void brw::vec4_generator::generate_code(const
> > cfg_t*): Assertion `p->nr_insn == pre_emit_nr_insn + 1 || !"conditional_mod,
> > no_dd_check, or no_dd_clear set for IR " "emitting more than 1 instruction"'
> > failed.
> > 
> > This problem comes from the fact that in these cases,
> > opt_set_dependency_control would set no_dd_clear/no_dd_check on the
> > simd4x2_merge instruction and this does not seem to like the fact that this
> > generator opcode actually expands to more than just one assembly
> > instruction. I did not see an obvious way to deal with this other than
> > skipping opt_set_dependency_control for this generator opcode, since the
> > fact that it spawns more than just one assembly instruction will inevitably
> > lead to this problem. I imagine that it could be possible to fix this more
> > generally by having the generator set dependency control flags on all the
> > instructions emitted by the opcode maybe. Anyway, since there is a 
> > is_dep_ctrl_unsafe() function that seems to be there to select situations
> > where we want to avoid opt_set_dependency_control to kick in, I just added
> > this opcode there. Is there anything else to this scenario that I am missing?
> > 
> > With that fixed, I found another issue with the register coalesce
> > optimization pass, as it attempted to rewrite the simd4x2_merge instruction
> > to write directly to an MRF register using a writemask. As I show above, my
> > first approach to the generator opcode was to have two MOVs that would
> > always write all of the dst register (since that made sense in my context),
> > but if other optimization passes can take the liberty to rewrite the
> > instruction to introduce a writemask then that approach is no longer valid
> > and I have to honor the writemask on the dst. The (small?) problem with that
> > is that as far as I know, I cannot do mov(4) operations that honor
> > dst.dw1.bits.writemask, so I have to do that manually emitting mov(1)
> > operations for each channel enabled. Is there a better way to handle this?
> 
> Would it be better to use the scattered read in vec4 mode as well to load
> both vec4s in one operation? Either way, there is no unaligned write
> message, so we'd have to use scatter write for writing.

I already had some initial work started with unaligned reads for vec4 so that is why I continued with that approach. For writes you are right that we will have to use scattered messages in any case.
Comment 29 Iago Toral 2015-04-22 06:22:14 UTC
(In reply to Matt Turner from comment #27)
> (In reply to Iago Toral from comment #25)
> > I think the FS bits work well now and I have switched focus to vec4. Here
> > the problem is again with non-constant offsets that are not 16-byte aligned
> > (this is actually not working in master for UBOs either). I have a working
> > solution for this, but I'd like to discuss here the details and see what you
> > think about it:
> > 
> > Instead of using scattered messages like we do for writes in the FS I am
> > experimenting with unaligned oword block reads in this case. The solution,
> > that I describe below, seems to work well but I had to work around a couple
> > of issues that maybe can be dealt with in a better way. The solution looks
> > like this (only for non-constant offsets, for constant offsets we can just
> > use dual oword read):
> > 
> > 1) Since unaligned oword messages are not "dual", we have to handle each of
> > the two SIMD4x2 vertices separately, so I emit two separate unaligned reads,
> > one from offset.0 (first vertex) and another from offset.4 (second vertex).
> > I store the results of these reads to separate virtual registers
> > read_result0 and read_result1 respectively.
> > 
> > 2) In the next step I merge both read results into a single register
> > suitable for SIMD4x2 operation. That is, if we call dst the destination of
> > the SIMD4x2 operation, I move the lower half of read_result0 to the lower
> > half of dst and the lower half of read_result1 to the higher half of dst.
> > For this part I have defined a generator opcode (let's call it
> > simd4x2_merge) that I initially implemented as two mov(4) operations, like
> > this:
> > 
> >    brw_MOV(p,
> >            brw_vec4_reg(dst.file, dst.nr, 0),
> >            brw_vec4_reg(src0.file, src0.nr, 0));
> >    brw_MOV(p,
> >            brw_vec4_reg(dst.file, dst.nr, 4),
> >            brw_vec4_reg(src0.file, src1.nr, 0));
> > 
> > The first problem I found with this solution is that in some examples it
> > would trigger the following assertion:
> > 
> > brw_vec4_generator.cpp:1927: void brw::vec4_generator::generate_code(const
> > cfg_t*): Assertion `p->nr_insn == pre_emit_nr_insn + 1 || !"conditional_mod,
> > no_dd_check, or no_dd_clear set for IR " "emitting more than 1 instruction"'
> > failed.
> > 
> > This problem comes from the fact that in these cases,
> > opt_set_dependency_control would set no_dd_clear/no_dd_check on the
> > simd4x2_merge instruction and this does not seem to like the fact that this
> > generator opcode actually expands to more than just one assembly
> > instruction. I did not see an obvious way to deal with this other than
> > skipping opt_set_dependency_control for this generator opcode, since the
> > fact that it spawns more than just one assembly instruction will inevitably
> > lead to this problem. I imagine that it could be possible to fix this more
> > generally by having the generator set dependency control flags on all the
> > instructions emitted by the opcode maybe. Anyway, since there is a 
> > is_dep_ctrl_unsafe() function that seems to be there to select situations
> > where we want to avoid opt_set_dependency_control to kick in, I just added
> > this opcode there. Is there anything else to this scenario that I am missing?
> 
> Your analysis is correct. Though it's easy to work around like we did in
> brw_fs_generator.cpp. See commit 7452f18b -- basically just add a
> 'multiple_instructions_emitted' bool and use it to skip setting the
> dependency control bits.

Ah, i see. That works, thanks!

> > With that fixed, I found another issue with the register coalesce
> > optimization pass, as it attempted to rewrite the simd4x2_merge instruction
> > to write directly to an MRF register using a writemask. As I show above, my
> > first approach to the generator opcode was to have two MOVs that would
> > always write all of the dst register (since that made sense in my context),
> > but if other optimization passes can take the liberty to rewrite the
> > instruction to introduce a writemask then that approach is no longer valid
> > and I have to honor the writemask on the dst. The (small?) problem with that
> > is that as far as I know, I cannot do mov(4) operations that honor
> > dst.dw1.bits.writemask, so I have to do that manually emitting mov(1)
> > operations for each channel enabled. Is there a better way to handle this?
> 
> With the multiple instructions thing fixed, I don't think this needs to be
> solved? I think it's simpler to emit the two MOV(4)s in the generator than
> figuring out what needs to be fixed in the vec4 optimizations to allow it.

No, this behavior is exclusively related to the doings of opt_register_coalesce, which kicks in before code generation, so the multiple instructions thing does not help.

In this particular case I think it is not too bad since I think I only need to deal with the writemask and swizzles of the dst and source operands (I have it fixed locally), but I was surprised by this effect because if this is happening here then I guess it means that the same issue could, at least in theory, happen to other generator opcodes that are probably not considering this scenario.
Comment 30 Samuel Iglesias Gonsálvez 2015-04-23 07:03:59 UTC
I have been working on the .length() calculation of unsized arrays in SSBOs. As it was said in comment 2, the size of the allocated buffer by the GPU can be queried by resinfo message. I have that code working fine with that approach but recently I discovered an inconsistency:

Intel driver is allocating internally buffers of multiples of 4k (see buffer object returned by intel_bufferobj_buffer() in brw_upload_ubo_surfaces()), so if the user wants a buffer of size 32 bytes, it would allocate a buffer of 4096 bytes. Then, the emitted resinfo message returns 4096.

However if the application executes glGetBufferParameteriv(GL_BUFFER_SIZE) it will return 32 bytes and I suppose this is the buffer size value that should be used for unsized array's length calculations.

How can I fix this inconsistency?

Should I modify intel_bufferobj_buffer() to return a buffer object of the same size it was asked for and not round it up to multiples of 4k? Is there a good reason for that behavior?

What do you think?
Comment 31 Jason Ekstrand 2015-04-23 14:28:27 UTC
(In reply to Samuel Iglesias from comment #30)
> I have been working on the .length() calculation of unsized arrays in SSBOs.
> As it was said in comment 2, the size of the allocated buffer by the GPU can
> be queried by resinfo message. I have that code working fine with that
> approach but recently I discovered an inconsistency:
> 
> Intel driver is allocating internally buffers of multiples of 4k (see buffer
> object returned by intel_bufferobj_buffer() in brw_upload_ubo_surfaces()),
> so if the user wants a buffer of size 32 bytes, it would allocate a buffer
> of 4096 bytes. Then, the emitted resinfo message returns 4096.

Yes, it does.  There are a couple of reasons for this.  First, is that kernel bo's (I think) work in terms of whole pages.  Second, drm has an internal buffer cache for recycling buffers.  If you ask it for a new buffer and there is already a free one available of the correct size, you get that one instead of a new one.  We do this to save on some of the up-front costs of creating and mapping buffers.  However, in order for this to work, we have to greatly reduce the number of different possible sizes.  I think there's also a power-of-two size restriction in there somewhere to accomplish this.

> However if the application executes glGetBufferParameteriv(GL_BUFFER_SIZE)
> it will return 32 bytes and I suppose this is the buffer size value that
> should be used for unsized array's length calculations.

Yes, we want to use what the client thinks is the size for those calculations.  Otherwise, the client will get very confused.

> How can I fix this inconsistency?

I'm not particularly familiar with the resinfo message.  However, there is another message that we use to implement the textureSize function.  This one reads the width/height out of the SURFACE_STATE object and hands that back.  Given that an SSBO has to have a surface state associated with it since it goes in the binding table, you could probably use this message to pull the length out.

> Should I modify intel_bufferobj_buffer() to return a buffer object of the
> same size it was asked for and not round it up to multiples of 4k? Is there
> a good reason for that behavior?

Explained above.
Comment 32 Kristian Høgsberg 2015-04-23 15:39:43 UTC
(In reply to Jason Ekstrand from comment #31)
> (In reply to Samuel Iglesias from comment #30)
> > I have been working on the .length() calculation of unsized arrays in SSBOs.
> > As it was said in comment 2, the size of the allocated buffer by the GPU can
> > be queried by resinfo message. I have that code working fine with that
> > approach but recently I discovered an inconsistency:
> > 
> > Intel driver is allocating internally buffers of multiples of 4k (see buffer
> > object returned by intel_bufferobj_buffer() in brw_upload_ubo_surfaces()),
> > so if the user wants a buffer of size 32 bytes, it would allocate a buffer
> > of 4096 bytes. Then, the emitted resinfo message returns 4096.
> 
> Yes, it does.  There are a couple of reasons for this.  First, is that
> kernel bo's (I think) work in terms of whole pages.  Second, drm has an
> internal buffer cache for recycling buffers.  If you ask it for a new buffer
> and there is already a free one available of the correct size, you get that
> one instead of a new one.  We do this to save on some of the up-front costs
> of creating and mapping buffers.  However, in order for this to work, we
> have to greatly reduce the number of different possible sizes.  I think
> there's also a power-of-two size restriction in there somewhere to
> accomplish this.
> 
> > However if the application executes glGetBufferParameteriv(GL_BUFFER_SIZE)
> > it will return 32 bytes and I suppose this is the buffer size value that
> > should be used for unsized array's length calculations.
> 
> Yes, we want to use what the client thinks is the size for those
> calculations.  Otherwise, the client will get very confused.
> 
> > How can I fix this inconsistency?
> 
> I'm not particularly familiar with the resinfo message.  However, there is
> another message that we use to implement the textureSize function.  This one
> reads the width/height out of the SURFACE_STATE object and hands that back. 
> Given that an SSBO has to have a surface state associated with it since it
> goes in the binding table, you could probably use this message to pull the
> length out.
> 
> > Should I modify intel_bufferobj_buffer() to return a buffer object of the
> > same size it was asked for and not round it up to multiples of 4k? Is there
> > a good reason for that behavior?
> 
> Explained above.

The allocations in mesa and gem buffer objects in the kernel are all just sw abstractions. They're rounded up and aligned for various reasons, but once the hardware starts executing a batchbuffer it's all a big linear address space. The only size the hardware sees is what we put in the SURFAFCE_STATE object. This is filled in in gen4/gen7/gen8_emit_buffer_surface_state(). The problem is in brw_upload_ubo_surfaces() where we use the size from the drm_intel_bo instead of the API provided size from binding->BufferObject in the calls to brw_create_constant_surface() and your new brw_create_buffer_surface().
Comment 33 Samuel Iglesias Gonsálvez 2015-04-24 05:01:00 UTC
(In reply to Kristian Høgsberg from comment #32)
> (In reply to Jason Ekstrand from comment #31)
> > (In reply to Samuel Iglesias from comment #30)
> > > I have been working on the .length() calculation of unsized arrays in SSBOs.
> > > As it was said in comment 2, the size of the allocated buffer by the GPU can
> > > be queried by resinfo message. I have that code working fine with that
> > > approach but recently I discovered an inconsistency:
> > > 
> > > Intel driver is allocating internally buffers of multiples of 4k (see buffer
> > > object returned by intel_bufferobj_buffer() in brw_upload_ubo_surfaces()),
> > > so if the user wants a buffer of size 32 bytes, it would allocate a buffer
> > > of 4096 bytes. Then, the emitted resinfo message returns 4096.
> > 
> > Yes, it does.  There are a couple of reasons for this.  First, is that
> > kernel bo's (I think) work in terms of whole pages.  Second, drm has an
> > internal buffer cache for recycling buffers.  If you ask it for a new buffer
> > and there is already a free one available of the correct size, you get that
> > one instead of a new one.  We do this to save on some of the up-front costs
> > of creating and mapping buffers.  However, in order for this to work, we
> > have to greatly reduce the number of different possible sizes.  I think
> > there's also a power-of-two size restriction in there somewhere to
> > accomplish this.
> > 
> > > However if the application executes glGetBufferParameteriv(GL_BUFFER_SIZE)
> > > it will return 32 bytes and I suppose this is the buffer size value that
> > > should be used for unsized array's length calculations.
> > 
> > Yes, we want to use what the client thinks is the size for those
> > calculations.  Otherwise, the client will get very confused.
> > 
> > > How can I fix this inconsistency?
> > 
> > I'm not particularly familiar with the resinfo message.  However, there is
> > another message that we use to implement the textureSize function.  This one
> > reads the width/height out of the SURFACE_STATE object and hands that back. 
> > Given that an SSBO has to have a surface state associated with it since it
> > goes in the binding table, you could probably use this message to pull the
> > length out.
> > 
> > > Should I modify intel_bufferobj_buffer() to return a buffer object of the
> > > same size it was asked for and not round it up to multiples of 4k? Is there
> > > a good reason for that behavior?
> > 
> > Explained above.
> 
> The allocations in mesa and gem buffer objects in the kernel are all just sw
> abstractions. They're rounded up and aligned for various reasons, but once
> the hardware starts executing a batchbuffer it's all a big linear address
> space. The only size the hardware sees is what we put in the SURFAFCE_STATE
> object. This is filled in in gen4/gen7/gen8_emit_buffer_surface_state(). The
> problem is in brw_upload_ubo_surfaces() where we use the size from the
> drm_intel_bo instead of the API provided size from binding->BufferObject in
> the calls to brw_create_constant_surface() and your new
> brw_create_buffer_surface().

Thanks both for your comments!

Khristian, I used the size from binding->BufferObject when calling brw_create_buffer_surface() and everything is working fine now. I am going to do the same for the brw_create_constant_surface() call.

Thanks again!
Comment 34 Iago Toral 2015-04-30 09:04:40 UTC
Kristian, Jason:

we are almost done with development and we are doing some more elaborate testing before preparing the patchset for review. In one of these tests I found a situation that I tracked down to what I think is an incorrect value of the pixel mask being delivered in the FS payload. I explain below:

I have a fragment shader like this:

layout(std140, binding=0) buffer Fragments {
   vec4 vf[8];
   int a[8];
   int last;
   vec4 d[];
};

layout(binding = 0, offset = 0) uniform atomic_uint counter;

in vec4 fragmentColor;
out vec4 color;

void main() {
   atomicCounterIncrement(counter);

   int index = int(mod(gl_FragCoord.x, 8));
   int i = atomicAdd(a[index], 1);

   if (i == 0)
      vf[index] = gl_FragCoord;

   i = atomicAdd(last, 1);
   d[i] = gl_FragCoord;

   color = fragmentColor;
}

When drawing a single point primitive I find this to produce the following result:

- The atomic counter's value is 1
- a[3] == 1, a[i] = 0 for i != 3
- vf[3] = (199.5, 165.5, 0.5, 1.0)
- vf[2] = (198.5, 165.5, 0.5, 1.0)
- vf[i] = (0,0,0,0) for i != 2,3
- last == 1
- d[0] = (199.5, 165.5, 0.5, 1.0)
- d[i] i > 0 is 0

My understanding of this result is the following: the rendering of this single point is producing two fragments that are both processed in a single execution of the fragment shader (hence the two writes into vf). I verified that the writes into vf come from offset.0 and offset.1 in the scattered write message for that instruction, and seeing the coordinates  stored in vf it seems to make sense.

However, in this scenario, what is the expected behavior of the atomic
operations? I would expect the atomicAdd on a[index] to increase a[3] *and* a[2], since index has two different values for each of the fragments involved. It seems like it is ignoring one of the fragments in this case. I would also
expect the acotmic counter to be incremented twice, etc

Since my implementation of the atomic operations on SSBOs reuse the existing infrastructure used for atomic counters (so fs_visitor::emit_untyped_atomic in this case) and I noticed this copies the pixel mask from g1.7 into the message, I thought there could be a problem there, so I forced the pixel mask to be 3 (so that it aknowledges data in channels 0,1 where I know valid fragments are stored in this particular case) and then I see this working as expected, that is, after renderin the point primitive I see:

- The atomic counter's value is 2
- a[2] = 1 and a[3] = 1
- last = 2
- d[1] and d[2] have the two fragments stored in vf[2], vf[3]

So based on this I think something is wrong with the pixel mask that is being delivered in the thread payload, which for some reason is removing a valid channel. I guess this does not affect the scattered writes because these don't include a pixel mask in their header... does this make sense to you?
Comment 35 Kenneth Graunke 2015-04-30 21:59:05 UTC
Is this on Gen7?  Does setting GEN7_PS_VECTOR_MASK_ENABLE in 3DSTATE_PS DWord 2 help?  (Gen8+ already does this.)
Comment 36 Iago Toral 2015-05-01 09:29:26 UTC
(In reply to Kenneth Graunke from comment #35)
> Is this on Gen7?  Does setting GEN7_PS_VECTOR_MASK_ENABLE in 3DSTATE_PS
> DWord 2 help?  (Gen8+ already does this.)

Yes, it is gen7. I've just tested this but it does not seem to have any effect in this case, the problem persists.
Comment 37 Iago Toral 2015-05-04 07:01:12 UTC
(In reply to Iago Toral from comment #36)
> (In reply to Kenneth Graunke from comment #35)
> > Is this on Gen7?  Does setting GEN7_PS_VECTOR_MASK_ENABLE in 3DSTATE_PS
> > DWord 2 help?  (Gen8+ already does this.)
> 
> Yes, it is gen7. I've just tested this but it does not seem to have any
> effect in this case, the problem persists.

As far as I can see, this issue spawns from the fact that atomic messages can include a message header with pixel mask information while scattered read/write messages (like many other read/write messages) can't, which can lead to some inconsistencies when both are used together, something that I imagine being common in SSBO usage patterns, unfortunately.

The PRMs clearly state that the pixel mask and the dispatch mask can be different, which means that in these scenarios atomic operations will operate on less channels (since the pixel mask is implicitly anded with the execution mask) than our read/write messages, leading the inconsistent behavior I explained in comment #34.

One way to work around this issue would be to fix the pixel mask in atomic operations to always be 0xffff. Since the mask is anded with the dispatch mask this should make atomic operations effectively use the dispatch mask and operate on the same channels as our read/write messages. I have tested this locally with a couple of examples and this seems to work, producing consistent results.

Would this be an option?

If that is not a good idea, then I guess the alternative would be to do it the other way around: fix the dispatch mask in the read/write messages to be like the pixel mask we use in atomic operations, but I don't know if that is possible.
Comment 38 Francisco Jerez 2015-05-04 11:05:33 UTC
Hi Iago,

(In reply to Iago Toral from comment #37)
> (In reply to Iago Toral from comment #36)
> > (In reply to Kenneth Graunke from comment #35)
> > > Is this on Gen7?  Does setting GEN7_PS_VECTOR_MASK_ENABLE in 3DSTATE_PS
> > > DWord 2 help?  (Gen8+ already does this.)
> >
> > Yes, it is gen7. I've just tested this but it does not seem to have any
> > effect in this case, the problem persists.
>
> As far as I can see, this issue spawns from the fact that atomic messages
> can include a message header with pixel mask information while scattered
> read/write messages (like many other read/write messages) can't, which can
> lead to some inconsistencies when both are used together, something that I
> imagine being common in SSBO usage patterns, unfortunately.
>
> The PRMs clearly state that the pixel mask and the dispatch mask can be
> different, which means that in these scenarios atomic operations will
> operate on less channels (since the pixel mask is implicitly anded with the
> execution mask) than our read/write messages, leading the inconsistent
> behavior I explained in comment #34.

The sample and execution masks differ in cases where a subset of
fragments in a single subspan (2x2 fragment block) are either not
covered by the primitive being drawn or discarded by the early
depth or stencil tests.  This is required for derivatives to give
well-defined results if they are calculated explicitly or
implicitly by texture sampling operations.

>
> One way to work around this issue would be to fix the pixel mask in atomic
> operations to always be 0xffff. Since the mask is anded with the dispatch
> mask this should make atomic operations effectively use the dispatch mask
> and operate on the same channels as our read/write messages. I have tested
> this locally with a couple of examples and this seems to work, producing
> consistent results.
>

That wouldn't be compliant, see section 7.1 of the GLSL spec
version 4.5:

| Fragment shader helper invocations execute the same shader code
| as non-helper invocations, but will not have side effects that
| modify the framebuffer or other shader-accessible memory. In
| particular:
| [..]
|  - Stores to image and buffer variables performed by helper
|    invocations have no effect on the underlying image or buffer
|    memory.
|  - Atomic operations to image, buffer, or atomic counter
|    variables performed by helper invocations have no effect on
|    the underlying image or buffer memory. The values returned by
|    such atomic operations are undefined.
|

> Would this be an option?
>
> If that is not a good idea, then I guess the alternative would be to do it
> the other way around: fix the dispatch mask in the read/write messages to be
> like the pixel mask we use in atomic operations, but I don't know if that is
> possible.

AFAIK scattered DWORD read and write messages don't to take a
sample mask independent from the execution mask.  It seems to me
that they aren't particularly well-suited for this extension.
Is there any reason you aren't using untyped surface reads and
writes instead?  That would allow you to provide an explicit
sample mask, share some code with ARB_shader_image_load_store by
using the same instructions (I'll land my patches adding support
for untyped surface writes shortly), and access up to 128 bits of
data per channel and message, likely giving better performance in
the long run.
Comment 39 Iago Toral 2015-05-04 11:44:06 UTC
Hi Francisco, thanks a lot for the input,

(In reply to Francisco Jerez from comment #38)
> Hi Iago,
> 
> (In reply to Iago Toral from comment #37)
> > (In reply to Iago Toral from comment #36)
> > > (In reply to Kenneth Graunke from comment #35)
> > > > Is this on Gen7?  Does setting GEN7_PS_VECTOR_MASK_ENABLE in 3DSTATE_PS
> > > > DWord 2 help?  (Gen8+ already does this.)
> > >
> > > Yes, it is gen7. I've just tested this but it does not seem to have any
> > > effect in this case, the problem persists.
> >
> > As far as I can see, this issue spawns from the fact that atomic messages
> > can include a message header with pixel mask information while scattered
> > read/write messages (like many other read/write messages) can't, which can
> > lead to some inconsistencies when both are used together, something that I
> > imagine being common in SSBO usage patterns, unfortunately.
> >
> > The PRMs clearly state that the pixel mask and the dispatch mask can be
> > different, which means that in these scenarios atomic operations will
> > operate on less channels (since the pixel mask is implicitly anded with the
> > execution mask) than our read/write messages, leading the inconsistent
> > behavior I explained in comment #34.
> 
> The sample and execution masks differ in cases where a subset of
> fragments in a single subspan (2x2 fragment block) are either not
> covered by the primitive being drawn or discarded by the early
> depth or stencil tests.  This is required for derivatives to give
> well-defined results if they are calculated explicitly or
> implicitly by texture sampling operations.

I see, thanks for explaining this.

> >
> > One way to work around this issue would be to fix the pixel mask in atomic
> > operations to always be 0xffff. Since the mask is anded with the dispatch
> > mask this should make atomic operations effectively use the dispatch mask
> > and operate on the same channels as our read/write messages. I have tested
> > this locally with a couple of examples and this seems to work, producing
> > consistent results.
> >
> 
> That wouldn't be compliant, see section 7.1 of the GLSL spec
> version 4.5:
> 
> | Fragment shader helper invocations execute the same shader code
> | as non-helper invocations, but will not have side effects that
> | modify the framebuffer or other shader-accessible memory. In
> | particular:
> | [..]
> |  - Stores to image and buffer variables performed by helper
> |    invocations have no effect on the underlying image or buffer
> |    memory.
> |  - Atomic operations to image, buffer, or atomic counter
> |    variables performed by helper invocations have no effect on
> |    the underlying image or buffer memory. The values returned by
> |    such atomic operations are undefined.
> |

Right, good catch.

> > Would this be an option?
> >
> > If that is not a good idea, then I guess the alternative would be to do it
> > the other way around: fix the dispatch mask in the read/write messages to be
> > like the pixel mask we use in atomic operations, but I don't know if that is
> > possible.
> 
> AFAIK scattered DWORD read and write messages don't to take a
> sample mask independent from the execution mask.  It seems to me
> that they aren't particularly well-suited for this extension.
> Is there any reason you aren't using untyped surface reads and
> writes instead?  That would allow you to provide an explicit
> sample mask, share some code with ARB_shader_image_load_store by
> using the same instructions (I'll land my patches adding support
> for untyped surface writes shortly), and access up to 128 bits of
> data per channel and message, likely giving better performance in
> the long run.

Not really, I started this using oword writes, but that had some issues with unaligned offsets so Jason suggested using scattered writes instead. At that moment it seemed like that was a good fit for fragment shaders: it would allow us to write up to 8/16 dwords at random dword offsets and that seemed to work great in all scenarios... until I played with some more elaborate tests that combined atomics and writes in the fragment shader and noticed this problem with the pixel mask included with atomic messages.

So unless someone else has a clever idea to respect the pixel mask with scattered write messages I guess I should at least look into untyped write messages. I imagine that reads are fine with scattered messages since channels that are not in the pixel mask will be ignored at write time.
Comment 40 Francisco Jerez 2015-05-04 12:36:34 UTC
(In reply to Iago Toral from comment #39)
>[...]
> > > Would this be an option?
> > >
> > > If that is not a good idea, then I guess the alternative would be to do it
> > > the other way around: fix the dispatch mask in the read/write messages to be
> > > like the pixel mask we use in atomic operations, but I don't know if that is
> > > possible.
> > 
> > AFAIK scattered DWORD read and write messages don't to take a
> > sample mask independent from the execution mask.  It seems to me
> > that they aren't particularly well-suited for this extension.
> > Is there any reason you aren't using untyped surface reads and
> > writes instead?  That would allow you to provide an explicit
> > sample mask, share some code with ARB_shader_image_load_store by
> > using the same instructions (I'll land my patches adding support
> > for untyped surface writes shortly), and access up to 128 bits of
> > data per channel and message, likely giving better performance in
> > the long run.
> 
> Not really, I started this using oword writes, but that had some issues with
> unaligned offsets so Jason suggested using scattered writes instead. At that
> moment it seemed like that was a good fit for fragment shaders: it would
> allow us to write up to 8/16 dwords at random dword offsets and that seemed
> to work great in all scenarios... until I played with some more elaborate
> tests that combined atomics and writes in the fragment shader and noticed
> this problem with the pixel mask included with atomic messages.
> 
> So unless someone else has a clever idea to respect the pixel mask with
> scattered write messages I guess I should at least look into untyped write
> messages. I imagine that reads are fine with scattered messages since
> channels that are not in the pixel mask will be ignored at write time.

I doubt there is much benefit from using scattered dword messages even if they would work for reads, untyped surface messages can do a superset of what the scatter messages do, and they might save bandwidth in cases where contiguous dwords are read sequentially, as they can read up to four dwords at once.  Aside from that you would avoid the hassle of defining and implementing new opcodes.
Comment 41 Iago Toral 2015-05-04 13:43:58 UTC
(In reply to Francisco Jerez from comment #40)
> (In reply to Iago Toral from comment #39)
> >[...]
> > > > Would this be an option?
> > > >
> > > > If that is not a good idea, then I guess the alternative would be to do it
> > > > the other way around: fix the dispatch mask in the read/write messages to be
> > > > like the pixel mask we use in atomic operations, but I don't know if that is
> > > > possible.
> > > 
> > > AFAIK scattered DWORD read and write messages don't to take a
> > > sample mask independent from the execution mask.  It seems to me
> > > that they aren't particularly well-suited for this extension.
> > > Is there any reason you aren't using untyped surface reads and
> > > writes instead?  That would allow you to provide an explicit
> > > sample mask, share some code with ARB_shader_image_load_store by
> > > using the same instructions (I'll land my patches adding support
> > > for untyped surface writes shortly), and access up to 128 bits of
> > > data per channel and message, likely giving better performance in
> > > the long run.
> > 
> > Not really, I started this using oword writes, but that had some issues with
> > unaligned offsets so Jason suggested using scattered writes instead. At that
> > moment it seemed like that was a good fit for fragment shaders: it would
> > allow us to write up to 8/16 dwords at random dword offsets and that seemed
> > to work great in all scenarios... until I played with some more elaborate
> > tests that combined atomics and writes in the fragment shader and noticed
> > this problem with the pixel mask included with atomic messages.
> > 
> > So unless someone else has a clever idea to respect the pixel mask with
> > scattered write messages I guess I should at least look into untyped write
> > messages. I imagine that reads are fine with scattered messages since
> > channels that are not in the pixel mask will be ignored at write time.
> 
> I doubt there is much benefit from using scattered dword messages even if
> they would work for reads, untyped surface messages can do a superset of
> what the scatter messages do, and they might save bandwidth in cases where
> contiguous dwords are read sequentially, as they can read up to four dwords
> at once.  Aside from that you would avoid the hassle of defining and
> implementing new opcodes.

You are probably right. I'll use this for reads as well then. Thanks again!
Comment 42 Jason Ekstrand 2015-05-04 16:11:44 UTC
(In reply to Francisco Jerez from comment #38)
> (In reply to Iago Toral from comment #37)
> > If that is not a good idea, then I guess the alternative would be to do it
> > the other way around: fix the dispatch mask in the read/write messages to be
> > like the pixel mask we use in atomic operations, but I don't know if that is
> > possible.
> 
> AFAIK scattered DWORD read and write messages don't to take a
> sample mask independent from the execution mask.  It seems to me
> that they aren't particularly well-suited for this extension.
> Is there any reason you aren't using untyped surface reads and
> writes instead?  That would allow you to provide an explicit
> sample mask, share some code with ARB_shader_image_load_store by
> using the same instructions (I'll land my patches adding support
> for untyped surface writes shortly), and access up to 128 bits of
> data per channel and message, likely giving better performance in
> the long run.

I'll take the blame for that.  I'm not nearly as familiar with the image read/write instructions as you are (having worked on load_store) and I wasn't thinking about pixel masks at the time.  I previously used the scattered read/write instructions for doing spilling on registers with indirects which is why I suggested them.  It certainly sounds like the untyped image read/write instructions are better suited for the task.
Comment 43 Iago Toral 2015-05-05 10:33:26 UTC
Francisco, since you have already landed the untyped write messages in master I have started to rewrite my code to use them but it seems as if these messages don't write anything for me. Looking at the generated assembly it seems that everything is fine though. For example:

mov(8)    g2<1>UD    0x00000000UD     { align1 WE_all 1Q compacted };
mov(16)   g3<1>UD    0x00000000UD     { align1 WE_all 1H compacted };
mov(16)   g5<1>UD    0x0000000aUD     { align1 WE_all 1H compacted };
mov(1)    g2.7<1>UD  g1.7<0,1,0>UD    { align1 WE_all };
send(8)   null       g2<8,8,1>UD
     data ( DC untyped surface write, 1, 46) mlen 5 rlen 0 { align1 1Q };

that is supposed to write the value 10 (0xa) to offset 0 of surface index 1, which comes from a single assignment of that value to a integer variable at the beginning of the buffer.

The assembly looks correct to me:
- g2 is the header and carries the pixel mask as delivered to the thread
- g3/g4 have the offset payload for SIMD16 (0 for all channels)
- g5/g6 have the data payload for SIMD16  (0xa for all channels)
- The message length is 5 for SIMD16 (header, offsetx2, datax2)
- surface index is 1, which is correct
- msg_control is 46, meaning only red channel enabled and SIMD16 mode.

but it does not seem to take any effect: the buffer contents are not modified at any offset after this.

The surface I am writing to is a BRW_SURFACE_BUFFER with format BRW_SURFACEFORMAT_RAW and a pitch of 1.

Am I missing something obvious here?
Comment 44 Iago Toral 2015-05-05 11:10:46 UTC
(In reply to Iago Toral from comment #43)
> Francisco, since you have already landed the untyped write messages in
> master I have started to rewrite my code to use them but it seems as if
> these messages don't write anything for me. Looking at the generated
> assembly it seems that everything is fine though. For example:
> 
> mov(8)    g2<1>UD    0x00000000UD     { align1 WE_all 1Q compacted };
> mov(16)   g3<1>UD    0x00000000UD     { align1 WE_all 1H compacted };
> mov(16)   g5<1>UD    0x0000000aUD     { align1 WE_all 1H compacted };
> mov(1)    g2.7<1>UD  g1.7<0,1,0>UD    { align1 WE_all };
> send(8)   null       g2<8,8,1>UD
>      data ( DC untyped surface write, 1, 46) mlen 5 rlen 0 { align1 1Q };
> 
> that is supposed to write the value 10 (0xa) to offset 0 of surface index 1,
> which comes from a single assignment of that value to a integer variable at
> the beginning of the buffer.
> 
> The assembly looks correct to me:
> - g2 is the header and carries the pixel mask as delivered to the thread
> - g3/g4 have the offset payload for SIMD16 (0 for all channels)
> - g5/g6 have the data payload for SIMD16  (0xa for all channels)
> - The message length is 5 for SIMD16 (header, offsetx2, datax2)
> - surface index is 1, which is correct
> - msg_control is 46, meaning only red channel enabled and SIMD16 mode.
> 
> but it does not seem to take any effect: the buffer contents are not
> modified at any offset after this.
> 
> The surface I am writing to is a BRW_SURFACE_BUFFER with format
> BRW_SURFACEFORMAT_RAW and a pitch of 1.
> 
> Am I missing something obvious here?

I was, seems like I removed the marking of the surface as used by mistake. I have put that back and it works fine now.

Sorry for the noise... :-/
Comment 45 Iago Toral 2015-05-05 11:21:02 UTC
(In reply to Iago Toral from comment #44)
> (In reply to Iago Toral from comment #43)
> > Francisco, since you have already landed the untyped write messages in
> > master I have started to rewrite my code to use them but it seems as if
> > these messages don't write anything for me. Looking at the generated
> > assembly it seems that everything is fine though. For example:
> > 
> > mov(8)    g2<1>UD    0x00000000UD     { align1 WE_all 1Q compacted };
> > mov(16)   g3<1>UD    0x00000000UD     { align1 WE_all 1H compacted };
> > mov(16)   g5<1>UD    0x0000000aUD     { align1 WE_all 1H compacted };
> > mov(1)    g2.7<1>UD  g1.7<0,1,0>UD    { align1 WE_all };
> > send(8)   null       g2<8,8,1>UD
> >      data ( DC untyped surface write, 1, 46) mlen 5 rlen 0 { align1 1Q };
> > 
> > that is supposed to write the value 10 (0xa) to offset 0 of surface index 1,
> > which comes from a single assignment of that value to a integer variable at
> > the beginning of the buffer.
> > 
> > The assembly looks correct to me:
> > - g2 is the header and carries the pixel mask as delivered to the thread
> > - g3/g4 have the offset payload for SIMD16 (0 for all channels)
> > - g5/g6 have the data payload for SIMD16  (0xa for all channels)
> > - The message length is 5 for SIMD16 (header, offsetx2, datax2)
> > - surface index is 1, which is correct
> > - msg_control is 46, meaning only red channel enabled and SIMD16 mode.
> > 
> > but it does not seem to take any effect: the buffer contents are not
> > modified at any offset after this.
> > 
> > The surface I am writing to is a BRW_SURFACE_BUFFER with format
> > BRW_SURFACEFORMAT_RAW and a pitch of 1.
> > 
> > Am I missing something obvious here?
> 
> I was, seems like I removed the marking of the surface as used by mistake. I
> have put that back and it works fine now.
> 
> Sorry for the noise... :-/

Oh, actually it is not me that removed it, it just wasn't there to begin with. Francisco, I will send a patch to fix this.
Comment 46 Iago Toral 2015-05-06 09:30:49 UTC
I have implemented reads and writes in the fragment shader using untyped messages. Works fine, but I found a very strange quirk that seems to happen in a very specific scenario and that once again seems related to the pixel mask. I don't know what is going on.

The scenario that triggers this is always with an instance block although I am not sure that the actual problem has anything to do with that. The shader looks like this:

layout(std140, binding=2) buffer Fragments {
   vec4 v0;
   vec4 v1;
} inst[3];

out vec4 color;

void main()
{
   int index = int(inst[0].v0.x);
   inst[index].v1 = vec4(-1, -2, -3, -4);
   color = vec4(int(inst[0].v0.x), int(inst[0].v0.y), int(inst[0].v0.z), 1);
}

inst[0].v0 is initialized to (1,1,1,1), so  the shader should be writing (-1,-2,-3,-4) to inst[1].v1. What I see, however, is that it writes to inst[0].v1 instead.

I tracked this down to the pixel mask again:

If I force the pixel mask used in the untyped read messages to be 0x1, then we obtain the result we expect (that is, a write into inst[1]). My interpretation of this is that the write is operating on channel 0, while the read, for some reason, is not, even if they both use the same pixel mask. However, if I do this, the pixel color shown on the screen is incorrect. Again, my interpretation of this is that the framebuffer write is choosing data from a channel other than 0.

If I force the pixel mask in the untyped read messages to be 0x8 (channel 3), then the pixel color on the screen is correct. My interpretation of this is that framebuffer writes take data from channel 3.

If I don't force any pixel mask in the untyped read messages (i.e., I use the one delivered to the thread in g1.7), then the color in the screen is also correct. My interpretation of this is that untyped reads and framebuffer writes operate on channel 3.

So I would guess the problem here is that the untyped write operates on channel 0 when it should operate on channel 3. However, I don't see what could be causing this when both untyped reads and untyped writes are using the same pixel mask for their headers (the one from g1.7). Also, if I force the pixel mask in the write message to be 0x8 so it operates on channel 3, it still writes to inst[0], which again, I find unexpected, since channel 3 should be carrying the right index data as per the tests above.

Again, this only seems to happen in a couple of examples where instance blocks are involved, but I don't see how that could be related.

Any ideas as to what could be going on here?
Comment 47 Iago Toral 2015-05-06 10:15:06 UTC
(In reply to Iago Toral from comment #46)
> I have implemented reads and writes in the fragment shader using untyped
> messages. Works fine, but I found a very strange quirk that seems to happen
> in a very specific scenario and that once again seems related to the pixel
> mask. I don't know what is going on.
> 
> The scenario that triggers this is always with an instance block although I
> am not sure that the actual problem has anything to do with that. The shader
> looks like this:
> 
> layout(std140, binding=2) buffer Fragments {
>    vec4 v0;
>    vec4 v1;
> } inst[3];
> 
> out vec4 color;
> 
> void main()
> {
>    int index = int(inst[0].v0.x);
>    inst[index].v1 = vec4(-1, -2, -3, -4);
>    color = vec4(int(inst[0].v0.x), int(inst[0].v0.y), int(inst[0].v0.z), 1);
> }
> 
> inst[0].v0 is initialized to (1,1,1,1), so  the shader should be writing
> (-1,-2,-3,-4) to inst[1].v1. What I see, however, is that it writes to
> inst[0].v1 instead.
> 
> I tracked this down to the pixel mask again:
> 
> If I force the pixel mask used in the untyped read messages to be 0x1, then
> we obtain the result we expect (that is, a write into inst[1]). My
> interpretation of this is that the write is operating on channel 0, while
> the read, for some reason, is not, even if they both use the same pixel
> mask. However, if I do this, the pixel color shown on the screen is
> incorrect. Again, my interpretation of this is that the framebuffer write is
> choosing data from a channel other than 0.
> 
> If I force the pixel mask in the untyped read messages to be 0x8 (channel
> 3), then the pixel color on the screen is correct. My interpretation of this
> is that framebuffer writes take data from channel 3.
> 
> If I don't force any pixel mask in the untyped read messages (i.e., I use
> the one delivered to the thread in g1.7), then the color in the screen is
> also correct. My interpretation of this is that untyped reads and
> framebuffer writes operate on channel 3.
> 
> So I would guess the problem here is that the untyped write operates on
> channel 0 when it should operate on channel 3. However, I don't see what
> could be causing this when both untyped reads and untyped writes are using
> the same pixel mask for their headers (the one from g1.7). Also, if I force
> the pixel mask in the write message to be 0x8 so it operates on channel 3,
> it still writes to inst[0], which again, I find unexpected, since channel 3
> should be carrying the right index data as per the tests above.

And to make things even more confusing, if I use scattered writes instead of untyped writes, the problem persists (strange, since these ignore the pixel mask and should write data from any channel enabled in the execution mask, including the channel where reads are operating on. It is at least consistent with the fact that fixing the pixel mask in the untyped write message didn't work). However, if I use scattered reads instead of untyped reads (and keeping untyped writes) it works well (correct write into inst[1] and correct pixel color on screen). This is also consistent with the tests I did above, since with scattered reads we will get the right index into the channel that the write is using.

As a side note, in the case that we can't find an explanation for this behavior and fix it, I was thinking that a workaround that seems to work could be to ignore the pixel mask for untyped reads. The rationale is that the requirements from the spec as shared in comment #38 only affect operations that can modify the underlying surfaces. Since reads do not modify anything, I think it should be fine to have them ignore the pixel mask and only set a pixel mask for the write messages and atomic operations. That would fix the problem by making sure that writes that depend on the result of a read do not use data from channels that the reads end up ignoring because of the pixel mask.
Comment 48 Francisco Jerez 2015-05-06 11:22:06 UTC
(In reply to Iago Toral from comment #47)
>[...]
> And to make things even more confusing, if I use scattered writes instead of
> untyped writes, the problem persists (strange, since these ignore the pixel
> mask and should write data from any channel enabled in the execution mask,
> including the channel where reads are operating on. It is at least
> consistent with the fact that fixing the pixel mask in the untyped write
> message didn't work). However, if I use scattered reads instead of untyped
> reads (and keeping untyped writes) it works well (correct write into inst[1]
> and correct pixel color on screen). This is also consistent with the tests I
> did above, since with scattered reads we will get the right index into the
> channel that the write is using.
> 
> As a side note, in the case that we can't find an explanation for this
> behavior and fix it, I was thinking that a workaround that seems to work
> could be to ignore the pixel mask for untyped reads. The rationale is that
> the requirements from the spec as shared in comment #38 only affect
> operations that can modify the underlying surfaces. Since reads do not
> modify anything, I think it should be fine to have them ignore the pixel
> mask and only set a pixel mask for the write messages and atomic operations.
> That would fix the problem by making sure that writes that depend on the
> result of a read do not use data from channels that the reads end up
> ignoring because of the pixel mask.

You seem to have found the explanation yourself ;).  The spec requires shader storage block reads to give well-defined values even for helper invocations, so the sample mask should *not* be used for untyped surface reads -- I think that's not only a workaround, it's how it should be.  I suggest you simply remove the header from the payload built by emit_untyped_read() and unset the "header present" bit in brw_eu_emit.c, so that the shared unit uses the default sample mask of all ones.
Comment 49 Iago Toral 2015-05-06 11:40:31 UTC
(In reply to Francisco Jerez from comment #48)
> (In reply to Iago Toral from comment #47)
> >[...]
> > And to make things even more confusing, if I use scattered writes instead of
> > untyped writes, the problem persists (strange, since these ignore the pixel
> > mask and should write data from any channel enabled in the execution mask,
> > including the channel where reads are operating on. It is at least
> > consistent with the fact that fixing the pixel mask in the untyped write
> > message didn't work). However, if I use scattered reads instead of untyped
> > reads (and keeping untyped writes) it works well (correct write into inst[1]
> > and correct pixel color on screen). This is also consistent with the tests I
> > did above, since with scattered reads we will get the right index into the
> > channel that the write is using.
> > 
> > As a side note, in the case that we can't find an explanation for this
> > behavior and fix it, I was thinking that a workaround that seems to work
> > could be to ignore the pixel mask for untyped reads. The rationale is that
> > the requirements from the spec as shared in comment #38 only affect
> > operations that can modify the underlying surfaces. Since reads do not
> > modify anything, I think it should be fine to have them ignore the pixel
> > mask and only set a pixel mask for the write messages and atomic operations.
> > That would fix the problem by making sure that writes that depend on the
> > result of a read do not use data from channels that the reads end up
> > ignoring because of the pixel mask.
> 
> You seem to have found the explanation yourself ;).  The spec requires
> shader storage block reads to give well-defined values even for helper
> invocations, so the sample mask should *not* be used for untyped surface
> reads -- I think that's not only a workaround, it's how it should be.  I
> suggest you simply remove the header from the payload built by
> emit_untyped_read() and unset the "header present" bit in brw_eu_emit.c, so
> that the shared unit uses the default sample mask of all ones.

Ah, that makes sense, otherwise helper invocations would not do be reliable, of course. Thanks Francisco, I'll do that.
Comment 50 Iago Toral 2015-05-07 06:27:51 UTC
Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c fixes the mask to use for the dst, which in turn decides which channels are effectively written to memory.

For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but I think we want clients of this function to decide the writemask to use. For example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we are using SIMD8 mode) to get all 4 components written directly. Fixing the writemask inside this function seems a bit restrictive.

Would you be okay with the addition of a dst parameter to this function? The clients (generator or visitor) would be responsible for setting the appropriate writemask on that dst depending on the situation.

Since we are using WRITEMASK_XYZW for all FS invocations I think in that case we can have the generator set the dst parameter as  brw_writemask(brw_null_reg(), WRITEMASK_XYZW), but for vec4 the  visitor would do the same, but injecting the writemask it needs depending on the case, then the vec4 generator would just pass that dst to brw_untyped_surface_write. I was also thinking about passing the writemask as a separate parameter to the generator opcode, but it looks like generator opcodes can only take 3 src arguments and we are already using them, so passing it as part of the dst seems like the best way to go.
Comment 51 Francisco Jerez 2015-05-07 15:43:48 UTC
(In reply to Iago Toral from comment #50)
> Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c
> fixes the mask to use for the dst, which in turn decides which channels are
> effectively written to memory.
> 
> For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but
> I think we want clients of this function to decide the writemask to use. For
> example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we
> are using SIMD8 mode) to get all 4 components written directly. Fixing the
> writemask inside this function seems a bit restrictive.
> 

IVB didn't have a SIMD4x2 variant of the untyped surface write
message, so the SIMD8 one has to be used.  The writemask is
required because the dataport will reinterpret the execution mask
sent by the EU as if each bit mapped to a separate scalar
channel, just like is the case for a FS thread, so, if you set
the writemask to XYZW the dataport might end up writing *eight*
separate vec4s to memory, which is almost certainly not what you
want.

> Would you be okay with the addition of a dst parameter to this function? The
> clients (generator or visitor) would be responsible for setting the
> appropriate writemask on that dst depending on the situation.
> 

It shouldn't be necessary.  The untyped surface write message
already stores as much data as you need (2 vec4s) with the
writemask set to one component.  The problem is that the way it
expects vectors to be laid out in the payload is somewhat
annoying: each component of the argument needs to be passed in a
separate register, just like in SIMD8 mode, with the value for
the first and second channels stored in the first and fourth
32-bit elements of each register respectively.

To solve this problem I wrote some helper functions (emit_insert
and emit_extract) while implementing ARB_shader_image_load_store
which might be useful for you as starting point to convert
between the native vector layout of the EU and the vector layout
of a message payload.  See [1], but note that the branch is
somewhat outdated.

> Since we are using WRITEMASK_XYZW for all FS invocations I think in that

The writemask doesn't exist at all in Align1 instructions (which is
what the FS back-end uses).

> case we can have the generator set the dst parameter as 
> brw_writemask(brw_null_reg(), WRITEMASK_XYZW), but for vec4 the  visitor
> would do the same, but injecting the writemask it needs depending on the
> case, then the vec4 generator would just pass that dst to
> brw_untyped_surface_write. I was also thinking about passing the writemask
> as a separate parameter to the generator opcode, but it looks like generator
> opcodes can only take 3 src arguments and we are already using them, so
> passing it as part of the dst seems like the best way to go.

[1] http://cgit.freedesktop.org/~currojerez/mesa/tree/src/mesa/drivers/dri/i965/brw_ir_surface_builder.h?h=image-load-store#n186
Comment 52 Iago Toral 2015-05-08 06:38:50 UTC
(In reply to Francisco Jerez from comment #51)
> (In reply to Iago Toral from comment #50)
> > Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c
> > fixes the mask to use for the dst, which in turn decides which channels are
> > effectively written to memory.
> > 
> > For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but
> > I think we want clients of this function to decide the writemask to use. For
> > example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we
> > are using SIMD8 mode) to get all 4 components written directly. Fixing the
> > writemask inside this function seems a bit restrictive.
> > 
> 
> IVB didn't have a SIMD4x2 variant of the untyped surface write
> message, so the SIMD8 one has to be used.  The writemask is
> required because the dataport will reinterpret the execution mask
> sent by the EU as if each bit mapped to a separate scalar
> channel, just like is the case for a FS thread, so, if you set
> the writemask to XYZW the dataport might end up writing *eight*
> separate vec4s to memory, which is almost certainly not what you
> want.

I think this should not happen. The IVB PRMs, 3.9.9.10
Message Payload, says:

"For SIMD16 and SIMD8 messages, the message length is used to determine how may address parameters are included in the message. The number of message registers in the write data payload is determined by the number of channel mask bits that are enabled"

So, if we only enable one channel (red), it should only write up to 8 dwords, never 8 vec4s. With this in mind, this is what I am doing:

I write up to 8 offsets to M1 and up to 8 values to M2 (so red channel only). The first 4 values in the red channel (M2.0 to M2.3) are the four vector components of the vertex stored in the lower half of the SIMD4x2 execution, the data from second vertex of the SIMD4x2 execution goes in M2.4 to M2.7. Since I only provide data for the red channel, the message can only write up to 8 dwords, no matter the writemask I use. Then, with WRITEMASK_X, only M2.0 and M.2.4 get written. With WRITEMASK_XYZW I get all 8 dwords written, with other writetemasks I can get any subset I need, which works great because I only need to pass the writemask we get from the GLSL IR as is to get exactly what we want.

I have tested this in multiple scenarios and seems to work fine in all of them, and the implementation is straight forward. Do you see any issues that I might be missing?

> > Would you be okay with the addition of a dst parameter to this function? The
> > clients (generator or visitor) would be responsible for setting the
> > appropriate writemask on that dst depending on the situation.
> > 
> 
> It shouldn't be necessary.  The untyped surface write message
> already stores as much data as you need (2 vec4s) with the
> writemask set to one component.  The problem is that the way it
> expects vectors to be laid out in the payload is somewhat
> annoying: each component of the argument needs to be passed in a
> separate register, just like in SIMD8 mode, with the value for
> the first and second channels stored in the first and fourth
> 32-bit elements of each register respectively.

Right, but as you explain, this seems a bit less direct that what I am doing now. As I explain above, I can get this to work by writing all my data elements to just one register, which I think makes the implementation a bit simpler if there are no other caveats that I may be overlooking.

> To solve this problem I wrote some helper functions (emit_insert
> and emit_extract) while implementing ARB_shader_image_load_store
> which might be useful for you as starting point to convert
> between the native vector layout of the EU and the vector layout
> of a message payload.  See [1], but note that the branch is
> somewhat outdated.
> 
> > Since we are using WRITEMASK_XYZW for all FS invocations I think in that
> 
> The writemask doesn't exist at all in Align1 instructions (which is
> what the FS back-end uses).
> 
> > case we can have the generator set the dst parameter as 
> > brw_writemask(brw_null_reg(), WRITEMASK_XYZW), but for vec4 the  visitor
> > would do the same, but injecting the writemask it needs depending on the
> > case, then the vec4 generator would just pass that dst to
> > brw_untyped_surface_write. I was also thinking about passing the writemask
> > as a separate parameter to the generator opcode, but it looks like generator
> > opcodes can only take 3 src arguments and we are already using them, so
> > passing it as part of the dst seems like the best way to go.
> 
> [1]
> http://cgit.freedesktop.org/~currojerez/mesa/tree/src/mesa/drivers/dri/i965/
> brw_ir_surface_builder.h?h=image-load-store#n186
Comment 53 Iago Toral 2015-05-08 08:08:38 UTC
(In reply to Iago Toral from comment #52)
> (In reply to Francisco Jerez from comment #51)
> > (In reply to Iago Toral from comment #50)
> > > Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c
> > > fixes the mask to use for the dst, which in turn decides which channels are
> > > effectively written to memory.
> > > 
> > > For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but
> > > I think we want clients of this function to decide the writemask to use. For
> > > example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we
> > > are using SIMD8 mode) to get all 4 components written directly. Fixing the
> > > writemask inside this function seems a bit restrictive.
> > > 
> > 
> > IVB didn't have a SIMD4x2 variant of the untyped surface write
> > message, so the SIMD8 one has to be used.  The writemask is
> > required because the dataport will reinterpret the execution mask
> > sent by the EU as if each bit mapped to a separate scalar
> > channel, just like is the case for a FS thread, so, if you set
> > the writemask to XYZW the dataport might end up writing *eight*
> > separate vec4s to memory, which is almost certainly not what you
> > want.
> 
> I think this should not happen. The IVB PRMs, 3.9.9.10
> Message Payload, says:
> 
> "For SIMD16 and SIMD8 messages, the message length is used to determine how
> may address parameters are included in the message. The number of message
> registers in the write data payload is determined by the number of channel
> mask bits that are enabled"
> 
> So, if we only enable one channel (red), it should only write up to 8
> dwords, never 8 vec4s. With this in mind, this is what I am doing:
> 
> I write up to 8 offsets to M1 and up to 8 values to M2 (so red channel
> only). The first 4 values in the red channel (M2.0 to M2.3) are the four
> vector components of the vertex stored in the lower half of the SIMD4x2
> execution, the data from second vertex of the SIMD4x2 execution goes in M2.4
> to M2.7. Since I only provide data for the red channel, the message can only
> write up to 8 dwords, no matter the writemask I use. Then, with WRITEMASK_X,
> only M2.0 and M.2.4 get written. With WRITEMASK_XYZW I get all 8 dwords
> written, with other writetemasks I can get any subset I need, which works
> great because I only need to pass the writemask we get from the GLSL IR as
> is to get exactly what we want.
> 
> I have tested this in multiple scenarios and seems to work fine in all of
> them, and the implementation is straight forward. Do you see any issues that
> I might be missing?

There is another benefit of this implementation that I have just noticed: the structure of the payload I use in SIMD8 mode for IVB is the same as the structure of the payload for haswell in SIMD4x2, which means that the implementation is the same for both (if anything I can optimize the haswell version since it only needs M1.0 and M1.4 in the address payload). If I change the implementation to do as you suggested then I would need to write different implementations for both systems, right?
Comment 54 Francisco Jerez 2015-05-08 11:40:29 UTC
(In reply to Iago Toral from comment #53)
> (In reply to Iago Toral from comment #52)
> > (In reply to Francisco Jerez from comment #51)
> > > (In reply to Iago Toral from comment #50)
> > > > Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c
> > > > fixes the mask to use for the dst, which in turn decides which channels are
> > > > effectively written to memory.
> > > > 
> > > > For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but
> > > > I think we want clients of this function to decide the writemask to use. For
> > > > example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we
> > > > are using SIMD8 mode) to get all 4 components written directly. Fixing the
> > > > writemask inside this function seems a bit restrictive.
> > > > 
> > > 
> > > IVB didn't have a SIMD4x2 variant of the untyped surface write
> > > message, so the SIMD8 one has to be used.  The writemask is
> > > required because the dataport will reinterpret the execution mask
> > > sent by the EU as if each bit mapped to a separate scalar
> > > channel, just like is the case for a FS thread, so, if you set
> > > the writemask to XYZW the dataport might end up writing *eight*
> > > separate vec4s to memory, which is almost certainly not what you
> > > want.
> > 
> > I think this should not happen. The IVB PRMs, 3.9.9.10
> > Message Payload, says:
> > 
> > "For SIMD16 and SIMD8 messages, the message length is used to determine how
> > may address parameters are included in the message. The number of message
> > registers in the write data payload is determined by the number of channel
> > mask bits that are enabled"
> > 
> > So, if we only enable one channel (red), it should only write up to 8
> > dwords, never 8 vec4s. With this in mind, this is what I am doing:
> > 
> > I write up to 8 offsets to M1 and up to 8 values to M2 (so red channel
> > only). The first 4 values in the red channel (M2.0 to M2.3) are the four
> > vector components of the vertex stored in the lower half of the SIMD4x2
> > execution, the data from second vertex of the SIMD4x2 execution goes in M2.4
> > to M2.7. Since I only provide data for the red channel, the message can only
> > write up to 8 dwords, no matter the writemask I use. Then, with WRITEMASK_X,
> > only M2.0 and M.2.4 get written. With WRITEMASK_XYZW I get all 8 dwords
> > written, with other writetemasks I can get any subset I need, which works
> > great because I only need to pass the writemask we get from the GLSL IR as
> > is to get exactly what we want.
> > 
> > I have tested this in multiple scenarios and seems to work fine in all of
> > them, and the implementation is straight forward. Do you see any issues that
> > I might be missing?
> 
> There is another benefit of this implementation that I have just noticed:
> the structure of the payload I use in SIMD8 mode for IVB is the same as the
> structure of the payload for haswell in SIMD4x2, which means that the
> implementation is the same for both (if anything I can optimize the haswell
> version since it only needs M1.0 and M1.4 in the address payload). If I
> change the implementation to do as you suggested then I would need to write
> different implementations for both systems, right?

No, not necessarily.  In fact in the link I shared earlier there
is a single implementation of untyped surface write shared among
all generations which works regardless of whether SIMD8, 16 or
4x2 is being used.  The only reason why that's possible despite
the strange vector layout of the message on IVB is that both the
X-only SIMD8 message and the HSW SIMD4x2 messages have the exact
same semantics -- IOW they take and return the same set of values
up to a transposition, which can be applied consistently to all
vector values based on the has_simd4x2 flag, in a manner
transparent for the implementation of the typed and untyped
surface messages.

Even though your idea would work in this specific case it can't
be extended to the typed messages (because by tweaking the
coordinates you get the same single color component from
different locations of the image rather than different color
components), and it breaks the symmetry (up to a transpose)
between the FS and VEC4 implementations and between the HSW+ and
IVB implementations.
Comment 55 Iago Toral 2015-05-08 12:04:39 UTC
(In reply to Francisco Jerez from comment #54)
> (In reply to Iago Toral from comment #53)
> > (In reply to Iago Toral from comment #52)
> > > (In reply to Francisco Jerez from comment #51)
> > > > (In reply to Iago Toral from comment #50)
> > > > > Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c
> > > > > fixes the mask to use for the dst, which in turn decides which channels are
> > > > > effectively written to memory.
> > > > > 
> > > > > For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but
> > > > > I think we want clients of this function to decide the writemask to use. For
> > > > > example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we
> > > > > are using SIMD8 mode) to get all 4 components written directly. Fixing the
> > > > > writemask inside this function seems a bit restrictive.
> > > > > 
> > > > 
> > > > IVB didn't have a SIMD4x2 variant of the untyped surface write
> > > > message, so the SIMD8 one has to be used.  The writemask is
> > > > required because the dataport will reinterpret the execution mask
> > > > sent by the EU as if each bit mapped to a separate scalar
> > > > channel, just like is the case for a FS thread, so, if you set
> > > > the writemask to XYZW the dataport might end up writing *eight*
> > > > separate vec4s to memory, which is almost certainly not what you
> > > > want.
> > > 
> > > I think this should not happen. The IVB PRMs, 3.9.9.10
> > > Message Payload, says:
> > > 
> > > "For SIMD16 and SIMD8 messages, the message length is used to determine how
> > > may address parameters are included in the message. The number of message
> > > registers in the write data payload is determined by the number of channel
> > > mask bits that are enabled"
> > > 
> > > So, if we only enable one channel (red), it should only write up to 8
> > > dwords, never 8 vec4s. With this in mind, this is what I am doing:
> > > 
> > > I write up to 8 offsets to M1 and up to 8 values to M2 (so red channel
> > > only). The first 4 values in the red channel (M2.0 to M2.3) are the four
> > > vector components of the vertex stored in the lower half of the SIMD4x2
> > > execution, the data from second vertex of the SIMD4x2 execution goes in M2.4
> > > to M2.7. Since I only provide data for the red channel, the message can only
> > > write up to 8 dwords, no matter the writemask I use. Then, with WRITEMASK_X,
> > > only M2.0 and M.2.4 get written. With WRITEMASK_XYZW I get all 8 dwords
> > > written, with other writetemasks I can get any subset I need, which works
> > > great because I only need to pass the writemask we get from the GLSL IR as
> > > is to get exactly what we want.
> > > 
> > > I have tested this in multiple scenarios and seems to work fine in all of
> > > them, and the implementation is straight forward. Do you see any issues that
> > > I might be missing?
> > 
> > There is another benefit of this implementation that I have just noticed:
> > the structure of the payload I use in SIMD8 mode for IVB is the same as the
> > structure of the payload for haswell in SIMD4x2, which means that the
> > implementation is the same for both (if anything I can optimize the haswell
> > version since it only needs M1.0 and M1.4 in the address payload). If I
> > change the implementation to do as you suggested then I would need to write
> > different implementations for both systems, right?
> 
> No, not necessarily.  In fact in the link I shared earlier there
> is a single implementation of untyped surface write shared among
> all generations which works regardless of whether SIMD8, 16 or
> 4x2 is being used.  The only reason why that's possible despite
> the strange vector layout of the message on IVB is that both the
> X-only SIMD8 message and the HSW SIMD4x2 messages have the exact
> same semantics -- IOW they take and return the same set of values
> up to a transposition, which can be applied consistently to all
> vector values based on the has_simd4x2 flag, in a manner
> transparent for the implementation of the typed and untyped
> surface messages.
> 
> Even though your idea would work in this specific case it can't
> be extended to the typed messages (because by tweaking the
> coordinates you get the same single color component from
> different locations of the image rather than different color
> components), and it breaks the symmetry (up to a transpose)
> between the FS and VEC4 implementations and between the HSW+ and
> IVB implementations.

I see, thanks again for taking the time to explain this. I'll rewrite my solution to use SIMD8 messages with WRITEMASK_X only so we can keep that symmetry.
Comment 56 Francisco Jerez 2015-05-09 16:50:23 UTC
(In reply to Iago Toral from comment #55)
> (In reply to Francisco Jerez from comment #54)
> > (In reply to Iago Toral from comment #53)
> > > (In reply to Iago Toral from comment #52)
> > > > (In reply to Francisco Jerez from comment #51)
> > > > > (In reply to Iago Toral from comment #50)
> > > > > > Francisco, your implementation of brw_untyped_surface_write in brw_eu_emit.c
> > > > > > fixes the mask to use for the dst, which in turn decides which channels are
> > > > > > effectively written to memory.
> > > > > > 
> > > > > > For vec4 and gen7 (except haswell) it sets the writemask to WRITEMASK_X, but
> > > > > > I think we want clients of this function to decide the writemask to use. For
> > > > > > example, if we are writing a vec4, I want to use WRITEMASK_XYZW (even if we
> > > > > > are using SIMD8 mode) to get all 4 components written directly. Fixing the
> > > > > > writemask inside this function seems a bit restrictive.
> > > > > > 
> > > > > 
> > > > > IVB didn't have a SIMD4x2 variant of the untyped surface write
> > > > > message, so the SIMD8 one has to be used.  The writemask is
> > > > > required because the dataport will reinterpret the execution mask
> > > > > sent by the EU as if each bit mapped to a separate scalar
> > > > > channel, just like is the case for a FS thread, so, if you set
> > > > > the writemask to XYZW the dataport might end up writing *eight*
> > > > > separate vec4s to memory, which is almost certainly not what you
> > > > > want.
> > > > 
> > > > I think this should not happen. The IVB PRMs, 3.9.9.10
> > > > Message Payload, says:
> > > > 
> > > > "For SIMD16 and SIMD8 messages, the message length is used to determine how
> > > > may address parameters are included in the message. The number of message
> > > > registers in the write data payload is determined by the number of channel
> > > > mask bits that are enabled"
> > > > 
> > > > So, if we only enable one channel (red), it should only write up to 8
> > > > dwords, never 8 vec4s. With this in mind, this is what I am doing:
> > > > 
> > > > I write up to 8 offsets to M1 and up to 8 values to M2 (so red channel
> > > > only). The first 4 values in the red channel (M2.0 to M2.3) are the four
> > > > vector components of the vertex stored in the lower half of the SIMD4x2
> > > > execution, the data from second vertex of the SIMD4x2 execution goes in M2.4
> > > > to M2.7. Since I only provide data for the red channel, the message can only
> > > > write up to 8 dwords, no matter the writemask I use. Then, with WRITEMASK_X,
> > > > only M2.0 and M.2.4 get written. With WRITEMASK_XYZW I get all 8 dwords
> > > > written, with other writetemasks I can get any subset I need, which works
> > > > great because I only need to pass the writemask we get from the GLSL IR as
> > > > is to get exactly what we want.
> > > > 
> > > > I have tested this in multiple scenarios and seems to work fine in all of
> > > > them, and the implementation is straight forward. Do you see any issues that
> > > > I might be missing?
> > > 
> > > There is another benefit of this implementation that I have just noticed:
> > > the structure of the payload I use in SIMD8 mode for IVB is the same as the
> > > structure of the payload for haswell in SIMD4x2, which means that the
> > > implementation is the same for both (if anything I can optimize the haswell
> > > version since it only needs M1.0 and M1.4 in the address payload). If I
> > > change the implementation to do as you suggested then I would need to write
> > > different implementations for both systems, right?
> > 
> > No, not necessarily.  In fact in the link I shared earlier there
> > is a single implementation of untyped surface write shared among
> > all generations which works regardless of whether SIMD8, 16 or
> > 4x2 is being used.  The only reason why that's possible despite
> > the strange vector layout of the message on IVB is that both the
> > X-only SIMD8 message and the HSW SIMD4x2 messages have the exact
> > same semantics -- IOW they take and return the same set of values
> > up to a transposition, which can be applied consistently to all
> > vector values based on the has_simd4x2 flag, in a manner
> > transparent for the implementation of the typed and untyped
> > surface messages.
> > 
> > Even though your idea would work in this specific case it can't
> > be extended to the typed messages (because by tweaking the
> > coordinates you get the same single color component from
> > different locations of the image rather than different color
> > components), and it breaks the symmetry (up to a transpose)
> > between the FS and VEC4 implementations and between the HSW+ and
> > IVB implementations.
> 
> I see, thanks again for taking the time to explain this. I'll rewrite my
> solution to use SIMD8 messages with WRITEMASK_X only so we can keep that
> symmetry.

I was thinking that it wouldn't make sense to re-implement this
madness in the SSBO handling code.  ARB_shader_image_load_store
uses the exact same messages so we should share a common
implementation.  Check out the for-iago branch of my mesa
repository [1], I've ripped out everything specific to
ARB_shader_image_load_store, what is left is the code for
building and sending typed and untyped surface messages, along
with some dependencies.  It should make your task considerably
easier.  There are emit_untyped_read(), emit_untyped_write() and
emit_untyped_atomic() functions you can call passing the surface
index you want to access, the address and argument vectors,
number of address dimensions (typically one for SSBOs) and number
of 32-bit components to write or read.  It will give you the
result (if any) in a register returned by the function.  It
should take care of most hardware quirks for you (SIMD mode
restrictions, differences in header and vector layout across
generations).

[1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=for-iago
Comment 57 Iago Toral 2015-05-11 14:08:30 UTC
(In reply to Francisco Jerez from comment #56)
(...)
> I was thinking that it wouldn't make sense to re-implement this
> madness in the SSBO handling code.  ARB_shader_image_load_store
> uses the exact same messages so we should share a common
> implementation.  Check out the for-iago branch of my mesa
> repository [1], I've ripped out everything specific to
> ARB_shader_image_load_store, what is left is the code for
> building and sending typed and untyped surface messages, along
> with some dependencies.  It should make your task considerably
> easier.  There are emit_untyped_read(), emit_untyped_write() and
> emit_untyped_atomic() functions you can call passing the surface
> index you want to access, the address and argument vectors,
> number of address dimensions (typically one for SSBOs) and number
> of 32-bit components to write or read.  It will give you the
> result (if any) in a register returned by the function.  It
> should take care of most hardware quirks for you (SIMD mode
> restrictions, differences in header and vector layout across
> generations).
> 
> [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=for-iago

Thanks a lot Francisco, I have merged your code into our tree and reworked our FS reads and writes successfully, however, I am not sure if we can use emit_untyped_write() to implement ivybridge vec4 writes.

The problem I see is that this function takes a size parameter with the number of components we want to write, but it does not know which components they are.

For example, if I have something lke v.xz = vec2(1,3) there is no way for the function to know that I want to write at offsets 0 and 8. If we pass a size=2 it will write a value of 1 at offset 0 but instead of writing 3 at offset 8 it will write garbage at offset 4. If we pass size=4 then it will write 1, 3 at offsets 0 and 8 as expected, but it will also write garbage at offsets 4 and 12 of course.

Unfortunately, the PRM also says this:

"For the Untyped Surface Write message, each channel mask cannot be 0 unless all of the lower mask bits are also zero. This means that the only 4-bit channel mask values allowed are 0000b, 1000b, 1100b, and 1110b. Other messages allow any combination of channel masks."

So we cannot set the channel mask to inform the message that it should only write to offsets 0 and 4 even if we changed the signature of the function (for example to add a writemask parameter).

I only see two ways to fix this:

1) Go back to the solution I explained in comment #52 (in which we only use one channel and instead use the writemask to filter the components we want to write). Since you made a good case for keeping the symmetry between typed and untyped messages, if we go with this I guess it would mean to write a new opcode specifically for vec4/ivybridge so we can keep the current messages untouched.

2) Emit a separate write message for  each component of the vector if we are not writing the entire vector.

The second option does not look like a good idea in any case though, so I'd go with 1) unless you have a better idea.
Comment 58 Francisco Jerez 2015-05-11 17:05:55 UTC
(In reply to Iago Toral from comment #57)
>[...]
> Thanks a lot Francisco, I have merged your code into our tree and reworked
> our FS reads and writes successfully, however, I am not sure if we can use
> emit_untyped_write() to implement ivybridge vec4 writes.
> 
> The problem I see is that this function takes a size parameter with the
> number of components we want to write, but it does not know which components
> they are.
> 
> For example, if I have something lke v.xz = vec2(1,3) there is no way for
> the function to know that I want to write at offsets 0 and 8. If we pass a
> size=2 it will write a value of 1 at offset 0 but instead of writing 3 at
> offset 8 it will write garbage at offset 4. If we pass size=4 then it will
> write 1, 3 at offsets 0 and 8 as expected, but it will also write garbage at
> offsets 4 and 12 of course.
> 
> Unfortunately, the PRM also says this:
> 
> "For the Untyped Surface Write message, each channel mask cannot be 0 unless
> all of the lower mask bits are also zero. This means that the only 4-bit
> channel mask values allowed are 0000b, 1000b, 1100b, and 1110b. Other
> messages allow any combination of channel masks."
> 
> So we cannot set the channel mask to inform the message that it should only
> write to offsets 0 and 4 even if we changed the signature of the function
> (for example to add a writemask parameter).
> 
> I only see two ways to fix this:
> 
> 1) Go back to the solution I explained in comment #52 (in which we only use
> one channel and instead use the writemask to filter the components we want
> to write). Since you made a good case for keeping the symmetry between typed
> and untyped messages, if we go with this I guess it would mean to write a
> new opcode specifically for vec4/ivybridge so we can keep the current
> messages untouched.
> 
> 2) Emit a separate write message for  each component of the vector if we are
> not writing the entire vector.
> 
> The second option does not look like a good idea in any case though, so I'd
> go with 1) unless you have a better idea.

I doubt it's terribly important, worst-case 2) will increase the
latency of the request if you need to emit two rather than one
message, but they will still be pipelined so it's unlikely to
matter much.  I think I would refrain from introducing
significant complexity for an IVB- and vec4-specific optimization
until we have some evidence that it helps in some realistic
scenario.  We can always go back and add the new opcode if we
find out it's helpful for some application.
Comment 59 Iago Toral 2015-05-12 05:53:25 UTC
(In reply to Francisco Jerez from comment #58)
> (In reply to Iago Toral from comment #57)
> >[...]
> > Thanks a lot Francisco, I have merged your code into our tree and reworked
> > our FS reads and writes successfully, however, I am not sure if we can use
> > emit_untyped_write() to implement ivybridge vec4 writes.
> > 
> > The problem I see is that this function takes a size parameter with the
> > number of components we want to write, but it does not know which components
> > they are.
> > 
> > For example, if I have something lke v.xz = vec2(1,3) there is no way for
> > the function to know that I want to write at offsets 0 and 8. If we pass a
> > size=2 it will write a value of 1 at offset 0 but instead of writing 3 at
> > offset 8 it will write garbage at offset 4. If we pass size=4 then it will
> > write 1, 3 at offsets 0 and 8 as expected, but it will also write garbage at
> > offsets 4 and 12 of course.
> > 
> > Unfortunately, the PRM also says this:
> > 
> > "For the Untyped Surface Write message, each channel mask cannot be 0 unless
> > all of the lower mask bits are also zero. This means that the only 4-bit
> > channel mask values allowed are 0000b, 1000b, 1100b, and 1110b. Other
> > messages allow any combination of channel masks."
> > 
> > So we cannot set the channel mask to inform the message that it should only
> > write to offsets 0 and 4 even if we changed the signature of the function
> > (for example to add a writemask parameter).
> > 
> > I only see two ways to fix this:
> > 
> > 1) Go back to the solution I explained in comment #52 (in which we only use
> > one channel and instead use the writemask to filter the components we want
> > to write). Since you made a good case for keeping the symmetry between typed
> > and untyped messages, if we go with this I guess it would mean to write a
> > new opcode specifically for vec4/ivybridge so we can keep the current
> > messages untouched.
> > 
> > 2) Emit a separate write message for  each component of the vector if we are
> > not writing the entire vector.
> > 
> > The second option does not look like a good idea in any case though, so I'd
> > go with 1) unless you have a better idea.
> 
> I doubt it's terribly important, worst-case 2) will increase the
> latency of the request if you need to emit two rather than one
> message, but they will still be pipelined so it's unlikely to
> matter much.  I think I would refrain from introducing
> significant complexity for an IVB- and vec4-specific optimization
> until we have some evidence that it helps in some realistic
> scenario.  We can always go back and add the new opcode if we
> find out it's helpful for some application.

Ok, I'll implement 2) in that case.
Comment 60 Samuel Iglesias Gonsálvez 2015-06-10 10:08:03 UTC
Francisco, I have a question for you related to untyped write messages source code:

Today, I have been rebasing our SSBO patch series to master, finding that FS builder was already pushed upstream and a lot of code was migrated to use it. I found a crash that is related to one of the patches we picked from you some time ago that emits untyped write messages. This is the problem:

When processing nir_intrinsic_store_ssbo in brw_fs_nir.cpp, there is a call to
surface_access::emit_untyped_write() which ends up internally calling to emit_header().
    
The function emit_header() (in src/mesa/drivers/dri/i965/brw_fs_surface_builder.cpp)
copies the sample mask to the header register:
    
       exec_all(ubld.MOV(component(dst, 7), sample_mask));
    
This MOV fails the following assert in function emit(instruction *inst) at
src/mesa/drivers/dri/i965/brw_fs_builder.h:
    
    assert(inst->exec_size == dispatch_width() || force_writemask_all);
   
According to GDB, those variables have the following values:
    
    * inst->exec_size = 1
    * dispatch_width() = 8
    * force_writemask_all = 0
    
Removing this assert makes the code to work properly but I wonder if we need to update our code in some way or if you have already a fix for this.

If you want to test it, I have published a branch which includes the assert deletion in the last commit:

$ git clone -b wip/siglesias/ARB_shader_storage_buffer_object-v2.3 https://github.com/Igalia/mesa.git

You can run the following piglit test to reproduce the crash, if you revert last commit:

$ bin/arb_shader_storage_buffer_object-layout-std140-write-shader -fbo -auto
Comment 61 Francisco Jerez 2015-06-10 12:07:49 UTC
(In reply to Samuel Iglesias from comment #60)
> Francisco, I have a question for you related to untyped write messages
> source code:
> 
> Today, I have been rebasing our SSBO patch series to master, finding that FS
> builder was already pushed upstream and a lot of code was migrated to use
> it. I found a crash that is related to one of the patches we picked from you
> some time ago that emits untyped write messages. This is the problem:
> 
> When processing nir_intrinsic_store_ssbo in brw_fs_nir.cpp, there is a call
> to
> surface_access::emit_untyped_write() which ends up internally calling to
> emit_header().
>     
> The function emit_header() (in
> src/mesa/drivers/dri/i965/brw_fs_surface_builder.cpp)
> copies the sample mask to the header register:
>     
>        exec_all(ubld.MOV(component(dst, 7), sample_mask));
>     
> This MOV fails the following assert in function emit(instruction *inst) at
> src/mesa/drivers/dri/i965/brw_fs_builder.h:
>     
>     assert(inst->exec_size == dispatch_width() || force_writemask_all);
>    
> According to GDB, those variables have the following values:
>     
>     * inst->exec_size = 1
>     * dispatch_width() = 8
>     * force_writemask_all = 0
>     
> Removing this assert makes the code to work properly but I wonder if we need
> to update our code in some way or if you have already a fix for this.
> 
> If you want to test it, I have published a branch which includes the assert
> deletion in the last commit:
> 
> $ git clone -b wip/siglesias/ARB_shader_storage_buffer_object-v2.3
> https://github.com/Igalia/mesa.git
> 
> You can run the following piglit test to reproduce the crash, if you revert
> last commit:
> 
> $ bin/arb_shader_storage_buffer_object-layout-std140-write-shader -fbo -auto

Ah, yeah, that's because the way channel enable masking is handled has changed since I sent you that branch.  Instead of the exec_all() function there's now an exec_all() method in the builder you need to call to get the same effect.  I've rebased your branch on top of the latest builder changes (including some changes in the VEC4 builder which is not upstream yet), see the for-samuel branch of my mesa tree:

git://people.freedesktop.org/~currojerez/mesa
Comment 62 Samuel Iglesias Gonsálvez 2015-06-10 13:01:02 UTC
(In reply to Francisco Jerez from comment #61)
[...]
> Ah, yeah, that's because the way channel enable masking is handled has
> changed since I sent you that branch.  Instead of the exec_all() function
> there's now an exec_all() method in the builder you need to call to get the
> same effect.  I've rebased your branch on top of the latest builder changes
> (including some changes in the VEC4 builder which is not upstream yet), see
> the for-samuel branch of my mesa tree:
> 
> git://people.freedesktop.org/~currojerez/mesa

I am going to clone it and take a look at it.

Thanks!
Comment 63 Matt Turner 2015-08-10 18:02:47 UTC
*** Bug 91259 has been marked as a duplicate of this bug. ***
Comment 64 Jordan Justen 2015-09-25 22:53:09 UTC
Landed in mesa master:

commit 065e7d37f16814646dfd1c95c15a73605042e277

    docs: Mark ARB_shader_storage_buffer_object as done for i965


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.