Bug 92760 - Add FP64 support to the i965 shader backends
Summary: Add FP64 support to the i965 shader backends
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: All All
Importance: medium enhancement
Assignee: Iago Toral
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 94442
 
Reported: 2015-10-31 21:29 UTC by Jason Ekstrand
Modified: 2017-06-22 08:26 UTC
CC List: 12 users

See Also:
i915 platform:
i915 features:


Attachments
NIR indirect lowering pass (8.22 KB, text/plain)
2016-01-11 18:02 UTC, Jason Ekstrand
Sample test showcasing non-uniform control flow problems in align16 (1.29 KB, text/plain)
2016-04-04 09:19 UTC, Iago Toral

Description Jason Ekstrand 2015-10-31 21:29:09 UTC
We've had a variety of people take a crack at this over the last year or two and haven't actually gotten there yet.  I'm making this bug as a place to put notes for the next person to pick it up.  Connor Abbott started on this towards the end of this summer but there's still more work to do.  The best place to pick up would probably be with his branch which can be found here:

http://cgit.freedesktop.org/~cwabbott0/mesa/log/?h=i965-fp64-v2

With this branch, most of the core NIR bits are in place and most of the i965 fs backend code is in place but there's still more to do:

 1) Review core NIR bits (Probably Jason)

 2) Write a bit-width well-definedness validator for nir_algebraic (Probably Jason or Connor)

 3) Write piglit tests for and fix up uniform alignment issues.  The hardware requires that 64-bit values be 64-bit aligned and we're doing nothing right now to ensure that (a tiny layout sketch follows this list).

 4) Implement NIR lowering of ftrunc and use that to implement fround etc.

 5) Write more tests.  Right now, the piglit tests for fp64 basically only test the math operations.  There are many other things that need testing, such as uniform packing, UBOs, SSBOs, pack/unpack operations, and whatever other interactions there are.  One of the issues Connor ran into was that the "simplest" tests were doing double-precision operations like sqrt that were lowered to lots of operations and made it very hard to debug simple issues.

 6) Implement a NIR pass to lower double-precision vec4 operations to vec2 operations for use in the vec4 backend.

 7) Get fp64 working in the vec4 backend.  The vec4 backend can only work with doubles as vec2's; hence the splitting pass in (6)

There's probably more I've missed here but this is what I remember.
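
As a concrete illustration of the alignment requirement in (3), here is a minimal, hypothetical C helper (not Mesa code; the name and signature are made up) showing the rule a uniform layout pass would have to enforce:

#include <stdint.h>

/* Hypothetical helper: round a uniform slot's byte offset up so that
 * 64-bit values land on a 64-bit boundary, as the hardware requires. */
static uint32_t
align_uniform_offset(uint32_t offset_bytes, uint32_t type_size_bytes)
{
   if (type_size_bytes == 8)
      offset_bytes = (offset_bytes + 7) & ~7u;   /* round up to 8 bytes */
   return offset_bytes;
}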
Comment 1 Connor Abbott 2015-10-31 21:48:09 UTC
(In reply to Jason Ekstrand from comment #0)
> We've had a variety of people take a crack at this over the last year or two
> and haven't actually gotten there yet.  I'm making this bug as a place to
> put notes for the next person to pick it up.  Connor Abbott started on this
> towards the end of this summer but there's still more work to do.  The best
> place to pick up would probably be with his branch which can be found here:
> 
> http://cgit.freedesktop.org/~cwabbott0/mesa/log/?h=i965-fp64-v2
> 
> With this branch, most of the core NIR bits are in place and most of the
> i965 fs backend code is in place but there's still more to do:
> 
>  1) Review core NIR bits (Probably Jason)
> 
>  2) Write a bit-width well-definedness validator for nir_algebraic (Probably
> Jason or Connor)
> 
>  3) Write piglit tests for and fix up Uniform alignment issues.  The
> hardware requires that 64-bit things be 64-bit aligned and we're doing
> nothing right now to ensure that.
> 
>  4) Implement NIR lowering of ftrunc and use that to implement fround etc.
> 
>  5) Write more tests.  Right now, the piglit tests for fp64 basically only
> test the math operations.  There are many other things that need testing
> such as uniform packing, UBOs, SSBOs, pack/unpack operations, and whatever
> other interactions need testing.  One of the issues Connor ran into was that
> the "simplest" tests were doing double-precision operations like sqrt that
> were lowered lots of operations and made it very hard to debug simple issues.
> 
>  6) Implement a NIR pass to lower double-precision vec4 operations to vec2
> operations for use in the vec4 backend.

I actually have this pass in my branch already. What's missing is support in the vec4 backend for handling something like:

dvec2 foo = dvec2(bar.x, baz.y);

which happens when lowering vec4 operations that "cross" the vec2 boundaries, as in e.g.

dvec4 foo = bar + baz.xzyw;

> 
>  7) Get fp64 working in the vec4 backend.  The vec4 backend can only work
> with doubles as vec2's; hence the splitting pass in (6)

I actually got this half-working, besides the above-mentioned issue. I believe there are issues with the exec_mask, unless I'm misunderstanding something, but I don't think there are any tests that actually do anything non-dynamically-uniform, so that's another test that needs to be written.

> 
> There's probably more I've missed here but this is what I remember.

QWORD spilling! And some of the changes that came out of the discussion with curro. Other than that, I don't remember anything in particular.
Comment 2 Connor Abbott 2015-11-02 05:21:14 UTC
FWIW, I've posted an i965-fp64-v3 branch in my fdo tree that's based on a much more recent master, and with some more fixes: http://cgit.freedesktop.org/~cwabbott0/mesa/log/?h=i965-fp64-v3

It's not as clean as the v2 branch, though, and I'll probably be shuffling stuff around on it and force-pushing.

current piglit results:

[2226/2226] skip: 14, pass: 1768, fail: 392, crash: 50
Comment 3 Samuel Iglesias Gonsálvez 2015-11-03 07:24:29 UTC
Iago and I will work on adding FP64 support to the i965 shader backends.
Comment 4 Connor Abbott 2015-11-03 18:16:40 UTC
I've pushed my latest changes to the i965-fp64-v3 branch. It's based on the patch series I sent out a little while ago to make glsl-to-nir use the visitor. It's not cleaned-up, but it's got a decent amount of things working.

Sam and Iago: for the most part, the core NIR changes are pretty stable. The one thing that Jason mentioned needed to be done there (the nir_algebraic bitwidth verifier) is something that would be best for Jason or me to do, since we've discussed it extensively in-person and know what it needs to look like. It's also not essential to get it working (it's basically just a tool to validate at compile-time that certain asserts/validator failures won't happen when rewriting expressions with different bit-widths). What do you want to do there?

Now, I think the main thing remaining to get the existing tests to pass (besides implementing ftrunc and friends) is to make our handling of splitting dvec3's and dvec4's into dvec2's and doubles more robust. Right now, it mostly works, but falls over whenever we hit a case where we need to create a vec3. There are two solutions I can see for this:

1. Special-case lower_vec_to_movs to skip over vec3/vec4 with doubles, and handle vec3/vec4 specially in the i965 vec4 backend.

2. Add the ability to handle math on dvec3's and dvec4's, but with a writemask that's never larger than 2 (i.e. we never have to emit more than one thing). lower_vec_to_movs may coalesce movs away, so we would have to handle this in the backend in order to deal with what NIR currently gives us.

#1 seems like a better solution to me, but unfortunately we'd have to modify lower_vec_to_movs in a somewhat ugly, i965-specific way.

There's another problem with the vec4 backend and exec_masks, but I want to send another email to see what other people who know about i965 have to say.
Comment 5 Iago Toral 2015-11-04 08:32:19 UTC
(In reply to Connor Abbott from comment #4)
> I've pushed my latest changes to the i965-fp64-v3 branch. It's based on the
> patch series I sent out a little while ago to make glsl-to-nir use the
> visitor. It's not cleaned-up, but it's got a decent amount of things working.

Thanks Connor! Our idea was to start by going through all the commits in this branch to get familiar with the current state of things. Right now I think we need to understand better how the hardware deals with doubles before pursuing anything specific.

Also, since we are going to go through all your patches I think we should also take that opportunity to clean up the series and make it more manageable, or were you planning to work on further clean-ups / fixes yourself in the next few days?

> Sam and Iago: for the most part, the core NIR changes are pretty stable. The
> one thing that Jason mentioned needed to be done there (the nir_algebraic
> bitwidth verifier) is something that would be best for Jason or me to do,
> since we've discussed it extensively in-person and know what it needs to
> look like. It's also not essential to get it working (it's basically just a
> tool to validate at compile-time that certain asserts/validator failures
> won't happen when rewriting expressions with different bit-widths). What do
> you want to do there?

Yeah, that's fine by me. If you know exactly what you want to do there it probably makes more sense that you take it on. As you say, it is not something we need in order to make progress on other things anyway.

> Now, I think the main thing remaining to get the existing tests to pass
> (besides implementing ftrunc and friends) is to make our handling of
> splitting dvec3's and dvec4's into dvec2's and doubles more robust. Right
> now, it mostly works, but falls over whenever we hit a case where we need to
> create a vec3. There are two solutions I can see for this:

Ok. My understanding is that this issue is specific to the vec4 backend only, right?

> 1. Special-case lower_vec_to_movs to skip over vec3/vec4 with doubles, and
> handle vec3/vec4 specially in the i965 vec4 backend.
> 
> 2. Add the ability to handle math on dvec3's and dvec4's, but with a
> writemask that's never larger than 2 (i.e. we never have to emit more than
> one thing). lower_vec_to_movs may coalesce movs away, so we would have to do
> handle this in the backend in order to handle what NIR currently gives us.

So if I understand #2 right you mean, for example, splitting a vec4 addition in NIR into 4 additions that each operate on individual components of the vec4 operands (and I guess write the results into different regs as well).

> #1 seems like a better solution to me, but unfortunately we'd have to modify
> lower_vec_to_movs in a somewhat ugly, i965-specific way.

I suppose that if we want to go with #1 we could make that behavior configurable so that NIR users can decide if that suits them or not.

> There's another problem with the vec4 backend and exec_masks, but I want to
> send another email to see what other people who know about i965 have to say.

Yeah, I saw that in the mailing list. Thanks for starting that discussion.
Comment 6 Iago Toral 2015-11-05 12:47:32 UTC
While having a look at the patches I noticed a couple of obvious bugs that I thought would break things outside fp64, so I ran piglit and found that this in fact breaks a ton of things on my IVB laptop. A run of shader.py (without fp64 tests) gives:

[14079/14079] skip: 7034, pass: 5104, fail: 1046, crash: 895

including a handful of GPU hangs :(

I have been fixing a lot of these today, mostly small things here and there and now we are at:

[14079/14079] skip: 6973, pass: 7043, fail: 46, crash: 3

This means a total of 48 regressions compared to master excluding fp64 functionality (and only for shader.py). These regressions include 9 GPU hangs.

In case you want to have a look, I have uploaded my fixes to this branch (rebased against current master too):
https://github.com/Igalia/mesa/commits/connor-i965-fp64-v3-rebased

I think we should focus on tracking and fixing the remaining regressions before moving on to doing actual fp64 stuff.

BTW, right now we have things like nir_type_float and nir_type_float32, or nir_type_int and nir_type_int32... there are a number of patches in your branch that change some uses of the basic types to the bit-sized versions, and I had to add a few more in my branch to fix a good number of crashes and problems, since the driver has some code paths that expect a bit-sized type in any case (in fact, they have assertions for this). Why don't we just drop the types that don't have the bit-size information completely? As it is now I think they only cause confusion and bugs.
Comment 7 Connor Abbott 2015-11-06 14:31:54 UTC
(In reply to Iago Toral from comment #5)
> (In reply to Connor Abbott from comment #4)
> > I've pushed my latest changes to the i965-fp64-v3 branch. It's based on the
> > patch series I sent out a little while ago to make glsl-to-nir use the
> > visitor. It's not cleaned-up, but it's got a decent amount of things working.
> 
> Thanks Connor! Our idea was to start by going through all the commits in
> this branch to get familiar with the current state of things. Right now I
> think we need to understand better how the hardware deals with doubles
> before pursuing anything specific.
> 
> Also, since we are going to go through all your patches I think we should
> also take that opportunity to clean-up the series and make it more
> manageable, or were you planning to work on further clean-ups / fixes
> yourself in the next days?

No, go ahead... I'm going to be quite busy until next week, so I won't be able to do much.

> 
> > Sam and Iago: for the most part, the core NIR changes are pretty stable. The
> > one thing that Jason mentioned needed to be done there (the nir_algebraic
> > bitwidth verifier) is something that would be best for Jason or me to do,
> > since we've discussed it extensively in-person and know what it needs to
> > look like. It's also not essential to get it working (it's basically just a
> > tool to validate at compile-time that certain asserts/validator failures
> > won't happen when rewriting expressions with different bit-widths). What do
> > you want to do there?
> 
> Yeah, that's fine by me. If you know exactly what you want to do there it
> probably makes more sense that you take on it. As you day, it is not
> something we need to make progress on other things anyway.
> 
> > Now, I think the main thing remaining to get the existing tests to pass
> > (besides implementing ftrunc and friends) is to make our handling of
> > splitting dvec3's and dvec4's into dvec2's and doubles more robust. Right
> > now, it mostly works, but falls over whenever we hit a case where we need to
> > create a vec3. There are two solutions I can see for this:
> 
> Ok. My understanding is that this issue is specific to the vec4 backend
> only, right?

Yes, that's correct.

> 
> > 1. Special-case lower_vec_to_movs to skip over vec3/vec4 with doubles, and
> > handle vec3/vec4 specially in the i965 vec4 backend.
> > 
> > 2. Add the ability to handle math on dvec3's and dvec4's, but with a
> > writemask that's never larger than 2 (i.e. we never have to emit more than
> > one thing). lower_vec_to_movs may coalesce movs away, so we would have to do
> > handle this in the backend in order to handle what NIR currently gives us.
> 
> So if I understand #2 right you mean, for example, splitting a vec4 addition
> in NIR into 4 additions that operate on individual components of the vec4
> operands each (and I guess write the results into different regs as well).

Not quite. The vec4 backend needs to split things into vec2's. As I explained, I wrote a NIR pass that does that for us, but we're possibly still left with things like

dvec4 foo;
foo.xz = bar.xz + baz.zw;

and I meant that we could just make the vec4 backend deal with something like that directly, by moving the sources to temporary dvec2's, doing the operation, and then emitting two mov's to the upper and lower parts of the destination.

> 
> > #1 seems like a better solution to me, but unfortunately we'd have to modify
> > lower_vec_to_movs in a somewhat ugly, i965-specific way.
> 
> I suppose that if we want to go with #1 we could do that behavior
> configurable so that NIR users can decide if that suits them or not.
> 
> > There's another problem with the vec4 backend and exec_masks, but I want to
> > send another email to see what other people who know about i965 have to say.
> 
> Yeah, I saw that in the mailing list. Thanks for starting that discussion.

Yeah. curro discovered some really weird fp64 bugs with writemasking in Align16 mode, which make it seem like the best solution might be to give up and scalarize 64-bit things instead.
Comment 8 Connor Abbott 2015-11-06 14:49:19 UTC
(In reply to Iago Toral from comment #6)
> While having a look at the patches I noticed a couple of obvious bugs that I
> thought would break things outside fp64, so I ran piglit and found that this
> in fact breaks a ton of things on my IVB laptop. A run of shader.py (without
> fp64 tests) gives:
> 
> [14079/14079] skip: 7034, pass: 5104, fail: 1046, crash: 895 \       
> 
> including a handful GPU hangs :(
> 
> I have been fixing a lot of these today, mostly small things here and there
> and now we are at:
> 
> [14079/14079] skip: 6973, pass: 7043, fail: 46, crash: 3
> 
> This means a total of 48 regressions compared to master excluding fp64
> functionality (and only for shader.py). These regressions include 9 GPU
> hangs.
> 
> In case you want to have a look, I have uploaded my fixes to this branch
> (rebased against current master too):
> https://github.com/Igalia/mesa/commits/connor-i965-fp64-v3-rebased
> 
> I think we should focus on tracking and fixing the remaining regressions
> before moving on to doing actual fp64 stuff.
> 
> BTW, right now we have things like nir_type_float and nir_type_float32, or
> nit_type_int and nir_type_int32...there is a number of patches in your
> branch that changes some uses of basic types by the bit-sized versions, and
> I had to add a few more in my branch to fix a good number of crashes and
> problems since the driver has some code paths that expect a bit-sized type
> in any case (in fact, they have assertions for this). Why don't we just drop
> the types that don't have the bit-size information completely? As it is now
> I think they only cause confusion and bugs.

No... in fact, the unsized types were kind of the entire point of Jason's original proposal that I took as the basis of my implementation. A type of nir_type_float means "this instruction can take any bit-width, but it has to match with any other source/destination of this instruction that is nir_type_float as well," sort of like how an input_size/output_size of 0 allows parameterizing on vector width. For example, the new definition of fadd is:

binop("fadd", tfloat, commutative + associative, "src0 + src1")

which means that output_type, input_type[0], and input_type[1] are all nir_type_float. The same opcode is used for f32 and f64, and when we add an f16 type and a mediump/relaxed-precision "f24" type, not much will have to change.

Now, if there are asserts we're missing to catch these sorts of things, then by all means go and add them. But having a nir_type_float around to mean "any width float will do" is one of the central points of the whole idea. Maybe we could rename it to "nir_type_float_all" or something?
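
To make the rule concrete, here is a small self-contained C sketch of the kind of assert we'd want (illustrative only -- the types and the helper are made up, this is not the actual nir_validate code):

#include <assert.h>

/* Declared operand types; the sized/unsized split mirrors NIR's, the
 * values are just for illustration. */
enum alu_type {
   TYPE_FLOAT   = 0,    /* unsized: any width, but all unsized operands must agree */
   TYPE_FLOAT32 = 32,
   TYPE_FLOAT64 = 64,
};

/* 'declared' holds the opcode's declared type for dest + each source,
 * 'actual' the bit widths of the corresponding SSA values. */
static void
validate_bit_widths(const enum alu_type *declared,
                    const unsigned *actual, unsigned count)
{
   unsigned unsized_width = 0;

   for (unsigned i = 0; i < count; i++) {
      if (declared[i] != TYPE_FLOAT) {
         /* Sized type: the value must have exactly that width. */
         assert(actual[i] == (unsigned)declared[i]);
      } else if (unsized_width == 0) {
         unsized_width = actual[i];          /* first unsized operand picks the width */
      } else {
         assert(actual[i] == unsized_width); /* the rest must match it */
      }
   }
}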

BTW, for the GPU hangs, some of them might be caused by this commit:

http://cgit.freedesktop.org/~cwabbott0/mesa/commit/?h=i965-fp64-v3&id=1b38ed4e53da1a6a3e4ca1d659ce52f159e16037

it removes an old hack (from back when we used to guess the exec_size of an instruction) that breaks fp64, but I wouldn't be surprised if removing it uncovers other hidden bugs. I didn't manage to test it (I think Jenkins was down when I tried last summer), but it would be nice to fix any regressions and get that merged ahead of the rest of the series.
Comment 9 Iago Toral 2015-11-09 07:43:03 UTC
(In reply to Connor Abbott from comment #8)
> (In reply to Iago Toral from comment #6)
> > While having a look at the patches I noticed a couple of obvious bugs that I
> > thought would break things outside fp64, so I ran piglit and found that this
> > in fact breaks a ton of things on my IVB laptop. A run of shader.py (without
> > fp64 tests) gives:
> > 
> > [14079/14079] skip: 7034, pass: 5104, fail: 1046, crash: 895 \       
> > 
> > including a handful GPU hangs :(
> > 
> > I have been fixing a lot of these today, mostly small things here and there
> > and now we are at:
> > 
> > [14079/14079] skip: 6973, pass: 7043, fail: 46, crash: 3
> > 
> > This means a total of 48 regressions compared to master excluding fp64
> > functionality (and only for shader.py). These regressions include 9 GPU
> > hangs.
> > 
> > In case you want to have a look, I have uploaded my fixes to this branch
> > (rebased against current master too):
> > https://github.com/Igalia/mesa/commits/connor-i965-fp64-v3-rebased
> > 
> > I think we should focus on tracking and fixing the remaining regressions
> > before moving on to doing actual fp64 stuff.
> > 
> > BTW, right now we have things like nir_type_float and nir_type_float32, or
> > nit_type_int and nir_type_int32...there is a number of patches in your
> > branch that changes some uses of basic types by the bit-sized versions, and
> > I had to add a few more in my branch to fix a good number of crashes and
> > problems since the driver has some code paths that expect a bit-sized type
> > in any case (in fact, they have assertions for this). Why don't we just drop
> > the types that don't have the bit-size information completely? As it is now
> > I think they only cause confusion and bugs.
> 
> No... in fact, the unsized types were kind of the entire point of Jason's
> original proposal that I took as the basis of my implementation. A type of
> nir_type_float means "this instruction can take any bit-width, but it has to
> match with any other source/destination of this instruction that is
> nir_type_float as well," sort of like how an input_size/output_size of 0
> allows to parameterize on vector width. For example, the new definition of
> fadd is:
> 
> binop("fadd", tfloat, commutative + associative, "src0 + src1")
> 
> which mean that output_type, input_type[0], and input_type[1] are all
> nir_type_float. The same opcode is used for f32 and f64, and when we add an
> f16 type and a mediump/relaxed-precision "f24" type, not much will have to
> change.
> 
> Now, if there are asserts we're missing to catch these sorts of things, then
> by all means go and add them. But having a nir_type_float around to mean
> "any width float will do" is one of the central points of the whole idea.
> Maybe we could rename it to "nir_type_float_all" or something?

Ok, I see.

> BTW, for the GPU hangs, some of them might be caused by this commit:
> 
> http://cgit.freedesktop.org/~cwabbott0/mesa/commit/?h=i965-fp64-
> v3&id=1b38ed4e53da1a6a3e4ca1d659ce52f159e16037
> 
> it removes an old hack from back when we used to guess the exec_size of an
> instruction which breaks fp64, but I wouldn't be surprised if removing it
> uncovers other hidden bugs. I didn't manage to test it (I think Jenkins was
> down when I tried last summer), but it would be nice to fix any regressions
> and get that merged ahead of the rest of the series.

Yep, I noticed this on Friday, and removing it does indeed fix the remaining hangs. I need to think more about what to do with that, if that really breaks fp64 we'll have to try and fix the bugs that it is working around.
Comment 10 Iago Toral 2015-11-18 16:13:15 UTC
Hi Jason, Connor:

we have fixed all the regressions in gen7 and cleaned-up the code, the result can be found in this branch:

https://github.com/Igalia/mesa/commits/i965-fp64-v4-no-regressions-clean

There are probably a few more things we can squash from the top commits, but I don't think it is going to be much smaller than this. Right now there are 113 commits in there, so it is a big series (and it is not even complete!)

While reviewing the code I noticed that there are some commits that fix things in master or are even unrelated to fp64, so we probably want to send these for review early. One of these commits is the nir_type_unsigned -> nir_type_uint rename from Jason that I guess we want to land soon to ease future rebases. These patches are at the beginning of the series but I moved them to this branch for easy reference:

https://github.com/Igalia/mesa/commits/fp64-early-review

If you don't have objections I'd like to send these for review ASAP.
Comment 11 Connor Abbott 2015-11-18 16:35:05 UTC
(In reply to Iago Toral from comment #10)
> Hi Jason, Connor:
> 
> we have fixed all the regressions in gen7 and cleaned-up the code, the
> result can be found in this branch:
> 
> https://github.com/Igalia/mesa/commits/i965-fp64-v4-no-regressions-clean
> 
> There are probably a few more things we can squash from the top commits, but
> I don't think it is going to be much smaller than this. Right now there are
> 113 commits in there, so it is a  big series (and it is not even complete!)
> 
> While reviewing the code I noticed that there are some commits that fix
> things in master or are even unrelated to fp64, so we probably want to send
> these for review early. One of these commits is the nir_type_unsigned ->
> nir_type_uint rename from Jason that I guess we want to land soon to ease
> future rebases. These patches are at the beginning of the series but I moved
> them to this branch for easy reference:
> 
> https://github.com/Igalia/mesa/commits/fp64-early-review
> 
> If you don't have objections I'd like to send these for review ASAP.

Sure, go ahead. Note that earlier I sent out a series with some of those patches:

http://lists.freedesktop.org/archives/mesa-dev/2015-August/091724.html

and some of them got r-b's so you can probably just push them.

One commit I might suggest moving earlier and sending out now is "i965/vec4: avoid dependency control around Align1 instructions". It uses the DOUBLE_TO_FLOAT and FLOAT_TO_DOUBLE opcodes that are introduced in the series, so you'd have to split the patch into two parts: the existing patch minus those opcodes, and a follow-up adding those opcodes back in. Then we could get the first part reviewed now, since the thing it's trying to do isn't directly related to fp64, and it could theoretically fix existing bugs. Also, apparently the PACK opcode I added could be used for images as well, so we could try landing that ahead of time too.
Comment 12 Iago Toral 2015-11-19 10:08:05 UTC
(In reply to Connor Abbott from comment #11)
> (In reply to Iago Toral from comment #10)
> > Hi Jason, Connor:
> > 
> > we have fixed all the regressions in gen7 and cleaned-up the code, the
> > result can be found in this branch:
> > 
> > https://github.com/Igalia/mesa/commits/i965-fp64-v4-no-regressions-clean
> > 
> > There are probably a few more things we can squash from the top commits, but
> > I don't think it is going to be much smaller than this. Right now there are
> > 113 commits in there, so it is a  big series (and it is not even complete!)
> > 
> > While reviewing the code I noticed that there are some commits that fix
> > things in master or are even unrelated to fp64, so we probably want to send
> > these for review early. One of these commits is the nir_type_unsigned ->
> > nir_type_uint rename from Jason that I guess we want to land soon to ease
> > future rebases. These patches are at the beginning of the series but I moved
> > them to this branch for easy reference:
> > 
> > https://github.com/Igalia/mesa/commits/fp64-early-review
> > 
> > If you don't have objections I'd like to send these for review ASAP.
> 
> Sure, go ahead. Note that earlier I sent out a series with some of those
> patches:
> 
> http://lists.freedesktop.org/archives/mesa-dev/2015-August/091724.html
> 
> and some of them got r-b's so you can probably just push them.
> 
> One commit I might suggest moving earlier and sending out now is "i965/vec4:
> avoid dependency control around Align1 instructions". It uses the
> DOUBLE_TO_FLOAT and FLOAT_TO_DOUBLE opcodes that are introduced in the
> series, so you'd have to split the patch into two parts: the existing patch
> minus those opcodes, and adding those opcodes back in. Then we could get the
> first part reviewed now, since the thing its trying to do isn't directly
> related to fp64, and it could theoretically fix existing bugs. Also,
> apparently the PACK opcode I added could be used for images as well, so we
> could try landing that ahead of time too.

Thanks for pointing me to that.

I added the Rbs to the ones that were reviewed and pushed them. I also added a couple more unreviewed commits from that list into the bunch and sent them for review to the list. I added my Rb to all of them but one, for which I could not find all the relevant documentation; hopefully someone else can give the Rb to that one.
Comment 13 Connor Abbott 2015-11-20 04:59:50 UTC
(In reply to Iago Toral from comment #10)
> Hi Jason, Connor:
> 
> we have fixed all the regressions in gen7 and cleaned-up the code

BTW, in case this hasn't been mentioned before: Jason and I agreed that it would be best to get fp64 working on gen8 first, and then add support for gen7 later. Gen7 is liable to have more bugs wrt fp64, and there is some weirdness around immediates, so it's better to focus on the easier gen first.
Comment 14 Iago Toral 2015-11-20 12:18:31 UTC
(In reply to Connor Abbott from comment #13)
> (In reply to Iago Toral from comment #10)
> > Hi Jason, Connor:
> > 
> > we have fixed all the regressions in gen7 and cleaned-up the code
> 
> BTW, in case this hasn't been mentioned before: Jason and I agreed that it
> would be best to get fp64 working on gen8 first, and then add support for
> gen7 later. Gen7 is liable to have more bugs wrt fp64, and there is some
> weirdness around immediates, so it's better to focus on the easier gen first.

Yes, that's the plan. We have been working on gen7 for now because of two reasons:

1) We don't have gen8 hardware available yet. We ordered some Broadwell hardware a few days ago but it won't get here until some time next week.

2) Even if we don't work on fp64 for gen < 8 for now, we don't want to break anything there, so that is work we would have to do anyway.

Once we get our hands on gen8 hardware we will focus on that platform and only look at gen < 8 to make sure we don't add regressions for non-fp64 stuff there.
Comment 15 Connor Abbott 2015-11-30 16:05:01 UTC
Hey, since you posted a link to a branch with some updated patches, I noticed a few things:

1. Your patch " i965/fs: Fix register regioning for fp64" isn't correct. In particular, it's not correct that "The size of the type is irrelevant, since we manipulate DF as F with stride 2." The stride of a register is in units of the typesize, and we don't "manipulate DF as F with stride 2" anywhere at all in the backend (you can check for yourself!), so we need to take the type size into account when figuring out if the source crosses a SIMD8 register boundary. I suspect this is just papering over bugs due to reverting "XXX remove exec size hack" -- the right thing to do would be to fix the exec_size of any instructions we create that would otherwise hit this hack so we can remove it. The other possibility I can think of is that the crashes are due to Matt's regioning validation falling over, in which case you'll need to fix that. In any case, as-is you're going to produce a region of <8,8,1> for doubles which is incorrect -- you want a region of <4,4,1> since only the vertical stride can cross the SIMD8 register boundary, which my original i965-fp64-v3 branch does correctly.

2. "i965/fs: fix ceil/floor for doubles" -- sorry, you're not going to get off that easily :) You can't use floats here, since you're going to run into range and precision issues. If the piglit tests don't test for those, then they need to be fixed. Since we don't have HW support for ceil/floor/trunc/etc., you're going to need to implement them using integer operations. I was planning on doing this in nir_lower_double_ops.c. I think trunc is the easiest to implement using integer ops -- you don't need to worry about the sign, you just have to mask off the appropriate number of bits in the mantissa based on the exponent -- but the rub is that 64-bit integer arithmetic isn't implemented yet in NIR or the backend, and it doesn't even exist prior to Broadwell, so you have to split the double into 32-bit chunks and operate on those. Then once you've implemented trunc, you can implement all the other missing operations in terms  of it. Have fun :)
Comment 16 Matt Turner 2015-12-01 04:48:35 UTC
(In reply to Connor Abbott from comment #15)
> 2. "i965/fs: fix ceil/floor for doubles" -- sorry, you're not going to get
> off that easily :) You can't use floats here, since you're going to run into
> range and precision issues. If the piglit tests don't test for those, then
> they need to be fixed. Since we don't have HW support for
> ceil/floor/trunc/etc., you're going to need to implement them using integer
> operations. I was planning on doing this in nir_lower_double_ops.c. I think
> trunc is the easiest to implement using integer ops -- you don't need to
> worry about the sign, you just have to mask off the appropriate number of
> bits in the mantissa based on the exponent -- but the rub is that 64-bit
> integer arithmetic isn't implemented yet in NIR or the backend, and it
> doesn't even exist prior to Broadwell, so you have to split the double into
> 32-bit chunks and operate on those. Then once you've implemented trunc, you
> can implement all the other missing operations in terms  of it. Have fun :)

I knew at one point how to do these...

You basically implement trunc using bit operations (by looking at the exponent to figure out how many bits of the mantissa you need to mask off, I think), and then you can implement floor and ceil by trunc(x ± nextafter(0.5, 0.0)).

Round-to-even was a little harder and I think required an explicit test against x.5.

I'll see if I can dig it back up.
Comment 17 Iago Toral 2015-12-01 07:02:39 UTC
Hey Connor, thanks a lot for looking into this:

(In reply to Connor Abbott from comment #15)
> Hey, since you posted a link to a branch with some updated patches, I
> noticed a few things:
> 
> 1. Your patch " i965/fs: Fix register regioning for fp64" isn't correct. In
> particular, it's not correct that "The size of the type is irrelevant, since
> we manipulate DF as F with stride 2." The stride of a register is in units
> of the typesize, and we don't "manipulate DF as F with stride 2" anywhere at
> all in the backend (you can check for yourself!), so we need to take the
> type size into account when figuring out if the source crosses a SIMD8
> register boundary. I suspect this is just papering over bugs due to
> reverting "XXX remove exec size hack" -- the right thing to do would be to
> fix the exec_size of any instructions we create that would otherwise hit
> this hack so we can remove it. The other possibility I can think of is that
> the crashes are due to Matt's regioning validation falling over, in which
> case you'll need to fix that. In any case, as-is you're going to produce a
> region of <8,8,1> for doubles which is incorrect -- you want a region of
> <4,4,1> since only the vertical stride can cross the SIMD8 register
> boundary, which my original i965-fp64-v3 branch does correctly.

Ok, I actually created a new testing branch because I wasn't too sure about this patch yet. The thing that confused me is this text from the BDW docs:

"In Align1 mode, all regioning parameters must use the syntax of a pair of packed floats, including channel selects and channel enables.

// Example:
mov (8) r10.0.xyzw:df r11.0.xyzw:df
// The above instruction moves four double floats. The .x picks the
// low 32 bits and the .y picks the high 32 bits of the double float."

This talks about Align1 mode, but it actually seems to mean Align16... I am surprised that this actually fixed so much stuff without seeming to create any trouble; I was thinking that if we were producing invalid regioning parameters we would either hang the GPU or at least see incorrect results somewhere...

Anyway, I think we should go back to trying to remove the exec_size hack as you suggest before trying to fix more tests, otherwise it seems that we are likely to be papering over broken things.

> 2. "i965/fs: fix ceil/floor for doubles" -- sorry, you're not going to get
> off that easily :) You can't use floats here, since you're going to run into
> range and precision issues. If the piglit tests don't test for those, then
> they need to be fixed. Since we don't have HW support for
> ceil/floor/trunc/etc., you're going to need to implement them using integer
> operations. I was planning on doing this in nir_lower_double_ops.c. I think
> trunc is the easiest to implement using integer ops -- you don't need to
> worry about the sign, you just have to mask off the appropriate number of
> bits in the mantissa based on the exponent -- but the rub is that 64-bit
> integer arithmetic isn't implemented yet in NIR or the backend, and it
> doesn't even exist prior to Broadwell, so you have to split the double into
> 32-bit chunks and operate on those. Then once you've implemented trunc, you
> can implement all the other missing operations in terms  of it. Have fun :)

Yeah, I noticed that a bit later: the piglit tests don't really check for this, so they need to be fixed; there is only one test for trunc that catches this kind of thing. I'll dig more, thanks!
Comment 18 Iago Toral 2015-12-01 07:28:43 UTC
(In reply to Matt Turner from comment #16)
> (In reply to Connor Abbott from comment #15)
> > 2. "i965/fs: fix ceil/floor for doubles" -- sorry, you're not going to get
> > off that easily :) You can't use floats here, since you're going to run into
> > range and precision issues. If the piglit tests don't test for those, then
> > they need to be fixed. Since we don't have HW support for
> > ceil/floor/trunc/etc., you're going to need to implement them using integer
> > operations. I was planning on doing this in nir_lower_double_ops.c. I think
> > trunc is the easiest to implement using integer ops -- you don't need to
> > worry about the sign, you just have to mask off the appropriate number of
> > bits in the mantissa based on the exponent -- but the rub is that 64-bit
> > integer arithmetic isn't implemented yet in NIR or the backend, and it
> > doesn't even exist prior to Broadwell, so you have to split the double into
> > 32-bit chunks and operate on those. Then once you've implemented trunc, you
> > can implement all the other missing operations in terms  of it. Have fun :)
> 
> I knew at one point how to do these...
> 
> You basically implement trunc using bit operations (by looking at the
> exponent figure out how many bits of the mantissa you need to mask off, I
> think), and then you can implement floor and ceil by trunc(x ±
> nextafter(0.5, 0.0)).
> 
> Round-to-even was a little harder and I think required an explicit test
> against x.5.
> 
> I'll see if I can dig it back up.

Thanks Matt! I had a quick look for this yesterday but did not find exactly what I need yet. If you happen to dig it back up that would be great; I'll keep looking for it on my side though, so no pressure :)
Comment 19 Connor Abbott 2015-12-02 02:23:49 UTC
(In reply to Iago Toral from comment #17)
> Heu Connor, thanks a lot for looking into this:
> 
> (In reply to Connor Abbott from comment #15)
> > Hey, since you posted a link to a branch with some updated patches, I
> > noticed a few things:
> > 
> > 1. Your patch " i965/fs: Fix register regioning for fp64" isn't correct. In
> > particular, it's not correct that "The size of the type is irrelevant, since
> > we manipulate DF as F with stride 2." The stride of a register is in units
> > of the typesize, and we don't "manipulate DF as F with stride 2" anywhere at
> > all in the backend (you can check for yourself!), so we need to take the
> > type size into account when figuring out if the source crosses a SIMD8
> > register boundary. I suspect this is just papering over bugs due to
> > reverting "XXX remove exec size hack" -- the right thing to do would be to
> > fix the exec_size of any instructions we create that would otherwise hit
> > this hack so we can remove it. The other possibility I can think of is that
> > the crashes are due to Matt's regioning validation falling over, in which
> > case you'll need to fix that. In any case, as-is you're going to produce a
> > region of <8,8,1> for doubles which is incorrect -- you want a region of
> > <4,4,1> since only the vertical stride can cross the SIMD8 register
> > boundary, which my original i965-fp64-v3 branch does correctly.
> 
> Ok, I actually created a new testing branch because I wasn't too sure about
> this patch yet. The thing that confused me is that text from the BDW docs:
> 
> "In Align1 mode, all regioning parameters must use the syntax of a pair of
> packed floats, including channel selects and channel enables.
> 
> // Example:
> mov (8) r10.0.xyzw:df r11.0.xyzw:df
> // The above instruction moves four double floats. The .x picks the
> // low 32 bits and the .y picks the high 32 bits of the double float."
> 
> This talks about align1 mode, but it actually seems to mean Align16...

Yeah, seems like a mistake. The PRM's can be... less than reliable at times.

> I am
> surprised that this actually fixed so much stuff and did not seem to create
> any trouble, I was thinking that if we are producing invalid regioning
> parameters we would either hang the GPU or at least see incorrect results
> somewhere...

While working on bringing stuff up, I also ran into a lot of scenarios like that. "The test shows green, so it must work!" At one point, I thought that you actually could do sqrt/round/etc. on doubles, but it turned out that I was just doing double comparisons wrong :) and I only wrote the patch that yours (mostly) reverts after hitting an assertion in the simulator. A stride of <8,8,1> may work, but it's much safer to follow the documented restrictions in the PRM (unless the PRM is just plain wrong!). It's a little unfortunate that Matt's assembly validator didn't catch this... it might be nice to fix it up to test for this.
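
To see why the documented region matters for DF, here is a small illustrative C helper (not driver code; the name is made up) for how an Align1 region <vstride; width, hstride> maps channel i to a byte offset, with all three parameters in units of the element size:

/* Byte offset of channel i for a <vstride; width, hstride> region. */
static unsigned
region_byte_offset(unsigned i, unsigned vstride, unsigned width,
                   unsigned hstride, unsigned type_size)
{
   return (i / width) * vstride * type_size +
          (i % width) * hstride * type_size;
}

With :DF sources, an <8,8,1> region puts all eight 8-byte channels in one 64-byte row, which spans two physical registers; <4,4,1> keeps each row inside a single 32-byte register and lets only the vertical step cross the boundary, which is what the restriction asks for.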

> 
> Anyway, I think we should go back to trying to remove the exec_size hack as
> you suggest before trying to fix more tests, otherwise it seems that we are
> likely to be papering over broken things.
> 
> > 2. "i965/fs: fix ceil/floor for doubles" -- sorry, you're not going to get
> > off that easily :) You can't use floats here, since you're going to run into
> > range and precision issues. If the piglit tests don't test for those, then
> > they need to be fixed. Since we don't have HW support for
> > ceil/floor/trunc/etc., you're going to need to implement them using integer
> > operations. I was planning on doing this in nir_lower_double_ops.c. I think
> > trunc is the easiest to implement using integer ops -- you don't need to
> > worry about the sign, you just have to mask off the appropriate number of
> > bits in the mantissa based on the exponent -- but the rub is that 64-bit
> > integer arithmetic isn't implemented yet in NIR or the backend, and it
> > doesn't even exist prior to Broadwell, so you have to split the double into
> > 32-bit chunks and operate on those. Then once you've implemented trunc, you
> > can implement all the other missing operations in terms  of it. Have fun :)
> 
> Yeah, I noticed that a bit later, piglit tests don't really check for this,
> so they need to be fixed, there is only one test for trunc that catches this
> kind of thing. I'll dig more, thanks!
Comment 20 Iago Toral 2015-12-02 12:46:50 UTC
Connor, I have a question about the code we generate for the double unpack x/y opcodes, which looks like this (in SIMD8):
 
mov(8)   g2<1>UD    g18.1<8,4,2>UD   { align1 1Q };
mov(8)   g28<2>UD   g18<8,4,2>UD     { align1 1Q };

Each of these instructions is intended to copy 4 UD elements from the source to the destination; however, the execution size of the instructions is set to 8, not 4, which means that we have a vertical dimension of 2 and we are actually operating on more data elements than we need. Shouldn't these two instructions have an execution size of 4 instead?

The thing is that I tried to set the execution size to 4 (this also requires that I set force_writemask_all to true) but that produces GPU hangs and some regressions in the fp64 tests, so I guess I am missing something here...
Comment 21 Connor Abbott 2015-12-02 15:26:36 UTC
(In reply to Iago Toral from comment #20)
> Connor, I have a question about the code we generate for the double unpack
> x/y opcodes, which looks like this (in SIMD8):
>  
> mov(8)   g2<1>UD    g18.1<8,4,2>UD   { align1 1Q };
> mov(8)   g28<2>UD   g18<8,4,2>UD     { align1 1Q };
> 
> Each of these instructions is intended to copy 4 UD elements from the source
> the destination, however, the execution size of the instructions is set to
> 8, not 4, which means that we have a a vertical dimension of 2 and we we are
> actually operating on more data elements than we need. Shouldn't these two
> instructions have an execution size of 4 instead?

No, it's meant to copy 8 UD elements -- it's decomposing an 8-component fp64 value, which takes up 2 SIMD8 registers, into 2 32-bit values, each of which takes up 1 SIMD8 register. Unlike in the vec4 backend, in FS fp64 values have to have the same number of components as fp32 values (since we always operate on 8 or 16 pixels/vertices/things at a time, and there's no way to decompose that in NIR), which means that they're twice as large. If you're wondering about why the destination has a stride of 2 in the second instruction... I don't have the rest of the assembly, but it's probably because we're going to operate on the upper bits and then reconstruct the double in g28, so we're avoiding another copy here by copying the low bits directly to g28. Nice job, optimizer!
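
If it helps, here is the data movement those two MOVs perform, written as a plain-C sketch (illustrative only; it ignores the strided destination of the second MOV, which just writes the low dwords back at stride 2 so the double can be reassembled in place later):

#include <stdint.h>

/* An 8-wide fp64 value occupies 16 dwords (two SIMD8 registers).  The
 * <8,4,2> regions read every other dword: offset .0 picks the low 32 bits
 * of each double, offset .1 picks the high 32 bits. */
static void
unpack_double_2x32_simd8(const uint32_t df_regs[16],
                         uint32_t lo[8], uint32_t hi[8])
{
   for (int ch = 0; ch < 8; ch++) {
      lo[ch] = df_regs[2 * ch];        /* g18<8,4,2>UD   */
      hi[ch] = df_regs[2 * ch + 1];    /* g18.1<8,4,2>UD */
   }
}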

> 
> The thing is that I tried to set the execution size to 4 (this also requires
> that I set force_writemask_all to true) but that produces GPU hangs and some
> regressions in the fp64 tests, so I guess I am missing something here...

Well, I'm not sure why you got GPU hangs, but... meh. The current code is correct as-is. I wouldn't expect to find bugs in basic, used-all-the-time stuff like this, since I did get a large portion of the tests to pass already. The bugs exist, but they're not here :)
Comment 22 Iago Toral 2015-12-03 06:52:03 UTC
(In reply to Connor Abbott from comment #21)
> (In reply to Iago Toral from comment #20)
> > Connor, I have a question about the code we generate for the double unpack
> > x/y opcodes, which looks like this (in SIMD8):
> >  
> > mov(8)   g2<1>UD    g18.1<8,4,2>UD   { align1 1Q };
> > mov(8)   g28<2>UD   g18<8,4,2>UD     { align1 1Q };
> > 
> > Each of these instructions is intended to copy 4 UD elements from the source
> > the destination, however, the execution size of the instructions is set to
> > 8, not 4, which means that we have a a vertical dimension of 2 and we we are
> > actually operating on more data elements than we need. Shouldn't these two
> > instructions have an execution size of 4 instead?
> 
> No, it's meant to copy 8 UD elements -- it's decomposing an 8-component fp64
> value, which takes up 2 SIMD8 registers, into 2 32-bit values, each of which
> takes up 1 SIMD8 register. Unlike in the vec4 backend, in FS fp64 values
> have to have the same number of components as fp32 values (since we always
> operate on 8 or 16 pixels/vertices/things at a time, and there's no way to
> decompose that in NIR), which means that they're twice as large. 

Right, I figured this out after thinking about it for a while; I lost track of the fact that this is actually a DF operation done with integers and got confused.

> If you're
> wondering about why the destination has a stride of 2 in the second
> instruction... I don't have the rest of the assembly, but it's probably
> because we're going to operate on the upper bits and then reconstruct the
> double in g28, so we're avoiding another copy here by copying the low bits
> directly to g28. Nice job, optimizer!

Yeah, I was curious about that too and inspecting the assembly it does seem to be what you suggest.
 
> > 
> > The thing is that I tried to set the execution size to 4 (this also requires
> > that I set force_writemask_all to true) but that produces GPU hangs and some
> > regressions in the fp64 tests, so I guess I am missing something here...
> 
> Well, I'm not sure why you got GPU hangs, but... meh. The current code is
> correct as-is. I wouldn't expect to find bugs in basic, used-all-the-time
> stuff like this, since I did get a large portion of the tests to pass
> already. The bugs exist, but they're not here :)

Agreed.

Thanks for all the feedback! I think I am starting to have a better grasp on how things need to work with fp64 on the backend now, so hopefully I won't need to keep bugging you too much :)
Comment 23 Iago Toral 2015-12-04 12:16:17 UTC
We spent some time working on fixing code that needs the execsize hack. For now we decided to only fix the cases that are absolutely required, which involve instructions with a dst.width of 4 (we don't currently generate instructions on doubles with a width < 4). This makes the number of changes required for gen7/8 pretty small (fixing this for widths < 4 would require a huge amount of changes everywhere...). This keeps non-fp64 piglit tests on gen7/8 and fp64 tests on gen8 happy with a small number of changes. Basically with this we need to keep the exec_size hack as is for gen < 7 and tweak it for gen7/8 so it only acts on instructions with a dst.width < 4.

I would not try to remove the execsize hack entirely (that is, cover the cases where dst.width < 4), because that would mean a really huge number of changes throughout the driver and across all hw gens.

We could try to avoid special-casing gen7/8 by fixing instructions with a dst.width of 4 in previous gens too, but this will require a lot more patches, so I am not sure that it is worth it. Specifically:

* In gen6 there are instances of instructions injected at the vec4_visitor level that need the exec_size hack to set the execution size to 4. The problem here is that vec4_instruction does not provide means to alter the execution size of the instruction, so that support would have to be added (or, alternatively, write generator opcodes that do that as needed). We think fixing gen6 would involve at least 5 patches, so it is not too bad.

* In gen5 the problem is that the strip-and-fans and clipping modules are full of code that mixes width4 and width8 instructions, so it requires tweaking execsizes in a fair number of places. I have patches to do this that in fact change the default execution size for the strip and fans to 4 and fix the cases that need 8, since that seemed the more reasonable way to limit the number of necessary changes. This involves something like ~10 patches, but the changes look a bit scarier to me (piglit seems to be happy though).

* We don't have gen4 or gen9 hardware available, so we would need help fixing these paths. I would expect a few more patches required for gen4 at least.

So the summary is that if we don't want to special-case the execsize for gen7/8, we will need a significant number of changes, help from Intel to track down issues in gen4 and gen9, and crossed fingers that piglit actually exercises all the paths we need to tweak. All things considered, I wonder if we should just special-case the execsize hack for gen7/8 and forget about the other gens for now, since we are not planning fp64 for them yet anyway.

What do you think?
Comment 24 Connor Abbott 2015-12-05 01:04:58 UTC
(In reply to Iago Toral from comment #23)
> We spent some time working on fixing code that needs the execsize hack. For
> now we decided to only fix the cases that are absolutely required, which
> involve instructions with a dst.width of 4 (we don't currently generate
> instructions on doubles with a width < 4). This makes the number of changes
> required for gen7/8 pretty small (fixing this for widths < 4 would require a
> huge amount of changes everywhere...). This keeps non-fp64 piglit tests on
> gen7/8 and fp64 tests on gen8 happy with a small number of changes.
> Basically with this we need to keep the exec_size hack as is for gen < 7 and
> tweak it for gen7/8 so it only acts on instructions with a dst.width < 4.
> 
> I would not try to remove the execsize hack entirely (that is cover the
> cases where dst.witdh < 4), because that would mean a really huge number of
> things throughout the driver and across all hw gens.
> 
> We could try to avoid special-casing gen7/8 by fixing instructions with a
> dst.width of 4 in previous gens too, but this will require a lot more
> patches, so I am not sure that it is worth it. Specifically:
> 
> * In gen6 there are instances of instructions injected at the vec4_visitor
> level that need the exec_size hack to set the execution size to 4. The
> problem here is that vec4_instruction does not provide means to alter the
> execution size of the instruction, so that support would have to be added
> (or, alternatively, write generator opcodes that do that as needed). We
> think fixing gen6 would involve at least 5 patches, so it is not too bad.

In this case, I think the right thing to do would be to push the decision of whether to do exec_size 4 or 8 based on dst.width into the vec4 generator -- essentially, move the hack up one level.

> 
> * In gen5 the problem is that the strip-and-fans and clipping modules are
> full with code that mixes width4 and width8 instructions, so it requires to
> tweak execsizes in a fair amount of places. I have patches to do this that
> in fact change the default execution size for the strip and fans to 4 and
> fix the cases that need 8, since that seemed more reasonable to limit the
> number of necessary changes. This involves something like ~10 patches, but
> the changes look a bit scarier to me (piglit seems to be happy though).
> 
> * We don't have gen4 or gen9 hardware available, so we would need help
> fixing these paths. I would expect a few more patches required for gen4 at
> least.
> 
> So the summary is that if we don't want to special-case the execsize for
> gen7/8, we will need a significant number of changes, help from Intel to
> track down issues in gen4 and gen9 and cross fingers that piglit actually
> exercises all the paths we need to tweak, so all things considered I wonder
> if we should just special case the execsize hack for gen7/8 for now and
> forget about other gens for now, since we are not planning fp64 for them yet
> anyway.
> 
> What do you think?

Ugh, I didn't realize we were using this so much... nice job looking into it! Indeed, it seems like special-casing this for gen7/gen8 is the easiest thing that at least gets us moving in the right direction. I would like to get others' opinions on this, though. I think you should send out an RFC that does the minimum necessary to solve the problem (i.e. only fix the width-4 case on gen7/8), explaining the problem and what you came across when trying to solve it (basically this comment).

I was thinking about this, and there is another way to fix the problem. Since we don't actually use vstride or width in the destination, we could avoid dividing the width and vstride by 2 for destination registers. However, I don't think we want to go in this direction, since this is a step toward conflating dst.width and exec_size when, at least in FS, there is already a clearly-defined notion of what the exec_size should be that's part of the instruction. Jason has already done a lot of work towards not inferring exec_size from dst.width (although for somewhat different reasons, and at a higher level), and we should try to get rid of the assumption at a lower level too.
Comment 25 Iago Toral 2015-12-05 16:45:29 UTC
(In reply to Connor Abbott from comment #24)
> (In reply to Iago Toral from comment #23)
> > We spent some time working on fixing code that needs the execsize hack. For
> > now we decided to only fix the cases that are absolutely required, which
> > involve instructions with a dst.width of 4 (we don't currently generate
> > instructions on doubles with a width < 4). This makes the number of changes
> > required for gen7/8 pretty small (fixing this for widths < 4 would require a
> > huge amount of changes everywhere...). This keeps non-fp64 piglit tests on
> > gen7/8 and fp64 tests on gen8 happy with a small number of changes.
> > Basically with this we need to keep the exec_size hack as is for gen < 7 and
> > tweak it for gen7/8 so it only acts on instructions with a dst.width < 4.
> > 
> > I would not try to remove the execsize hack entirely (that is cover the
> > cases where dst.witdh < 4), because that would mean a really huge number of
> > things throughout the driver and across all hw gens.
> > 
> > We could try to avoid special-casing gen7/8 by fixing instructions with a
> > dst.width of 4 in previous gens too, but this will require a lot more
> > patches, so I am not sure that it is worth it. Specifically:
> > 
> > * In gen6 there are instances of instructions injected at the vec4_visitor
> > level that need the exec_size hack to set the execution size to 4. The
> > problem here is that vec4_instruction does not provide means to alter the
> > execution size of the instruction, so that support would have to be added
> > (or, alternatively, write generator opcodes that do that as needed). We
> > think fixing gen6 would involve at least 5 patches, so it is not too bad.
> 
> In this case, I think the right thing to do would be to push the decision of
> whether to do exec_size 4 or 8 based on dst.width into the vec4 generator --
> essentially, move the hack up one level.
> 
> > 
> > * In gen5 the problem is that the strip-and-fans and clipping modules are
> > full with code that mixes width4 and width8 instructions, so it requires to
> > tweak execsizes in a fair amount of places. I have patches to do this that
> > in fact change the default execution size for the strip and fans to 4 and
> > fix the cases that need 8, since that seemed more reasonable to limit the
> > number of necessary changes. This involves something like ~10 patches, but
> > the changes look a bit scarier to me (piglit seems to be happy though).
> > 
> > * We don't have gen4 or gen9 hardware available, so we would need help
> > fixing these paths. I would expect a few more patches required for gen4 at
> > least.
> > 
> > So the summary is that if we don't want to special-case the execsize for
> > gen7/8, we will need a significant number of changes, help from Intel to
> > track down issues in gen4 and gen9 and cross fingers that piglit actually
> > exercises all the paths we need to tweak, so all things considered I wonder
> > if we should just special case the execsize hack for gen7/8 for now and
> > forget about other gens for now, since we are not planning fp64 for them yet
> > anyway.
> > 
> > What do you think?
> 
> Ugh, I didn't realize we were using this so much... nice job looking into
> it! Indeed, it seems like special-casing this for gen7/gen8 is the easiest
> thing that at least gets us moving in the right direction. I would like to
> get others' opinions on this, though. I think you should send out an RFC
> that does the minimum necessary to solve the problem (i.e. only fix the
> width-4 case on gen7/8), explaining the problem and what you came across
> when trying to solve it (basically this comment).

Sure, I'll do that.

> I was thinking about this, and there is another way to fix the problem.
> Since we don't actually use vstride or width in the destination, we could
> avoid dividing the width and vstride by 2 for destination registers.
> However, I don't think we want to go in this direction, since this is a step
> toward conflating dst.width and exec_size when, at least in FS, there is
> already a clearly-defined notion of what the exec_size should be that's part
> of the instruction. Jason has already done a lot of work towards not
> inferring exec_size from dst.width (although for somewhat different reasons,
> and at a higher level), and we should try to get rid of the assumption at a
> lower level too.

Yeah, I agree. 

Thanks Connor!
Comment 26 Samuel Iglesias Gonsálvez 2016-01-07 11:59:40 UTC
I found an issue related to having interleaved uniform definitions of 32-bit and 64-bit data types in the push constant buffer.

The bug is easily shown when we have a double defined just after a 32-bit data type. For example, we have the following definitions in a GLSL fragment shader:

	uniform double arg0;
	uniform bool arg1;
	uniform double arg2;

The generated code that copies those push constant values does the following in SIMD16:

mov(8)          g19<1>DF        g2<0,1,0>DF
mov(8)          g23<1>DF        g2<0,1,0>DF
mov(16)         g9<1>D          g2.2<0,1,0>D
mov(8)          g5<1>DF         g2.1<0,1,0>DF
mov(8)          g7<1>DF         g2.1<0,1,0>DF

As you can see, there is a misalignment in the access that copies the contents of 'arg2': the 32 bits of arg1 end up in arg2's copy. Note that sub-register numbers are scaled by the type size, so g2.1<0,1,0>DF and g2.2<0,1,0>D both refer to byte offset 8 of g2.

My proposal is to do a 64-bit alignment when uploading push constant doubles and when reading them from the push constant buffer. The upload and access of 32-bit push constants would not change. The generated code for the same example would then look like this:

mov(8)          g19<1>DF        g2<0,1,0>DF
mov(8)          g23<1>DF        g2<0,1,0>DF
mov(16)         g9<1>D          g2.2<0,1,0>D
mov(8)          g5<1>DF         g2.2<0,1,0>DF
mov(8)          g7<1>DF         g2.2<0,1,0>DF

This solution has the drawback of adding padding inside the push constant buffer when we have a mixture of 32-bit and 64-bit constants, so it is not memory efficient; we also need to take that padding into account to avoid exceeding the push buffer size limit. The advantage is that it does not add new instructions to the generated code.
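To illustrate the packing rule I have in mind, here is a minimal sketch (a made-up helper, not the actual assign_constant_locations() code) that counts uniform space in 4-byte slots and rounds every 64-bit uniform up to an even slot:

#include <stdbool.h>

/* One slot == 4 bytes, so an even slot is 8-byte aligned. Returns the
 * slot assigned to the uniform and advances the allocation cursor. */
static unsigned
assign_uniform_slot(unsigned *next_slot, unsigned num_slots, bool is_64bit)
{
   if (is_64bit)
      *next_slot = (*next_slot + 1) & ~1u;   /* pad to an even slot */

   unsigned slot = *next_slot;
   *next_slot += num_slots;
   return slot;
}

With the example above, arg0 gets slots 0-1, arg1 gets slot 2, and arg2 is pushed from slot 3 up to slot 4 (byte offset 16), which is exactly the g2.2<0,1,0>DF region in the fixed code; slot 3 becomes padding.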

Do you like the proposed solution, or do you have another solution in mind?

BTW, I expect to have a similar problem when reading doubles from the pull constant buffer contents but I have not checked it yet.
Comment 27 Jason Ekstrand 2016-01-08 06:36:10 UTC
(In reply to Samuel Iglesias from comment #26)
> I found an issue related to having interleaved uniform definitions of 32-bit
> and 64-bit data types in the push constant buffer.
> 
> The bug is easily shown when we have a double defined just after a 32-bit
> data type. For example, we have following definition in a GLSL fragment
> shader:
> 
> 	uniform double arg0;
> 	uniform bool arg1;
> 	uniform double arg2;
> 
> The generated code that copies those push constant values does the following
> in SIMD16:
> 
> mov(8)          g19<1>DF        g2<0,1,0>DF
> mov(8)          g23<1>DF        g2<0,1,0>DF
> mov(16)         g9<1>D          g2.2<0,1,0>D
> mov(8)          g5<1>DF         g2.1<0,1,0>DF
> mov(8)          g7<1>DF         g2.1<0,1,0>DF
> 
> As you see, there is a misalignment in the memory access that copies 'arg2'
> contents: we are copying the 32 bits of arg1 into the copy of arg2 (notice
> that g2.1<0,1,0>DF is at the same offset than g2.2<0,1,0>D).

This issue was anticipated.  We came across it in theory if not in practice this summer while Connor was working on it.

> My proposal is to do a 64-bit alignment when uploading push constant doubles
> and when reading them from the push constant buffer. The 32-bit push
> constants' upload and access would not be changed. So the generated code for
> the same example would be like:
> 
> mov(8)          g19<1>DF        g2<0,1,0>DF
> mov(8)          g23<1>DF        g2<0,1,0>DF
> mov(16)         g9<1>D          g2.2<0,1,0>D
> mov(8)          g5<1>DF         g2.2<0,1,0>DF
> mov(8)          g7<1>DF         g2.2<0,1,0>DF
> 
> This solution has the drawback of adding padding inside push constant buffer
> when we have a mixture of 32 bits and 64-bit data type constants, so it is
> not memory efficient; plus take it into account to avoid exceeding the push
> buffer size limitation. The advantage is that it does not add new
> instructions in the generated code.
> 
> Do you like the proposed solution? Or do you have other solution in mind?

That seems like what we need to do.  Unfortunately, executing it might be a bit interesting.  The uniform packing code we have (assign_constant_locations) isn't aware of the base data type.  However, you do have the type on the source, so you can probably get it.  You may want to take a look at this series (which still needs review) http://patchwork.freedesktop.org/series/1669/  It addresses some of the same problems you'll need to solve but for a different reason.

> BTW, I expect to have a similar problem when reading doubles from the pull
> constant buffer contents but I have not checked it yet.

No, that shouldn't be a problem.  We will need to maybe emit two pulls for a whole dvec4, but that's about it.  There should be no alignment problems.
Comment 28 Samuel Iglesias Gonsálvez 2016-01-08 07:30:36 UTC
(In reply to Jason Ekstrand from comment #27)
> (In reply to Samuel Iglesias from comment #26)
> > I found an issue related to having interleaved uniform definitions of 32-bit
> > and 64-bit data types in the push constant buffer.
> > 
> > The bug is easily shown when we have a double defined just after a 32-bit
> > data type. For example, we have following definition in a GLSL fragment
> > shader:
> > 
> > 	uniform double arg0;
> > 	uniform bool arg1;
> > 	uniform double arg2;
> > 
> > The generated code that copies those push constant values does the following
> > in SIMD16:
> > 
> > mov(8)          g19<1>DF        g2<0,1,0>DF
> > mov(8)          g23<1>DF        g2<0,1,0>DF
> > mov(16)         g9<1>D          g2.2<0,1,0>D
> > mov(8)          g5<1>DF         g2.1<0,1,0>DF
> > mov(8)          g7<1>DF         g2.1<0,1,0>DF
> > 
> > As you see, there is a misalignment in the memory access that copies 'arg2'
> > contents: we are copying the 32 bits of arg1 into the copy of arg2 (notice
> > that g2.1<0,1,0>DF is at the same offset than g2.2<0,1,0>D).
> 
> This issue was anticipated.  We came across it in theory if not in practice
> this summer while Connor was working on it.
> 
> > My proposal is to do a 64-bit alignment when uploading push constant doubles
> > and when reading them from the push constant buffer. The 32-bit push
> > constants' upload and access would not be changed. So the generated code for
> > the same example would be like:
> > 
> > mov(8)          g19<1>DF        g2<0,1,0>DF
> > mov(8)          g23<1>DF        g2<0,1,0>DF
> > mov(16)         g9<1>D          g2.2<0,1,0>D
> > mov(8)          g5<1>DF         g2.2<0,1,0>DF
> > mov(8)          g7<1>DF         g2.2<0,1,0>DF
> > 
> > This solution has the drawback of adding padding inside push constant buffer
> > when we have a mixture of 32 bits and 64-bit data type constants, so it is
> > not memory efficient; plus take it into account to avoid exceeding the push
> > buffer size limitation. The advantage is that it does not add new
> > instructions in the generated code.
> > 
> > Do you like the proposed solution? Or do you have other solution in mind?
> 
> That seems like what we need to do.  Unfortunately, executing it might be a
> bit interesting.  The uniform packing code we have
> (assign_constant_locations) isn't aware of the base data type.  However, you
> do have the type on the source, so you can probably get it.  You may want to
> take a look at this series (which still needs review)
> http://patchwork.freedesktop.org/series/1669/  It addresses some of the same
> problems you'll need to solve but for a different reason.
> 

OK, thanks for the tips. I will take a look at that patch series.

> > BTW, I expect to have a similar problem when reading doubles from the pull
> > constant buffer contents but I have not checked it yet.
> 
> No, that shouldn't be a problem.  We will need to maybe emit two pulls for a
> whole dvec4, but that's about it.  There should be no alignment problems.

OK :-)

Thanks Jason!
Comment 29 Iago Toral 2016-01-11 15:20:38 UTC
Hey Jason/Connor,

the lowering of trunc for doubles has some code that looks like this (pseudo-code):

if (exponent < 0) {
   mask = 0x0
} else if (exponent > 52) {
   mask = 0x7fffffffffffffff;
} else {
   /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
   mask =  (1LL << frac_bits) - 1;
}

The current implementation I have works fine using bcsel. It looks something like this (again, pseudo-code):

mask = bcsel(exponent < 0,
             0x7fffffffffffffff,
             bcsel(exponent > 52,
                   0x0000000000000000,
                   (1LL << frac_bits) -1))

My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer operation that we have to implement in terms of hi/lo 32-bit integer operations (at least until we support 64-bit integers), so it is really a bunch of instructions. Because I use bcsel, it means that we generate code for that even if exponent is not in [1..51], which is not ideal.
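For reference, this is the arithmetic that has to be open-coded: a CPU-side sketch of building the hi/lo 32-bit halves of (1LL << frac_bits) - 1 for frac_bits in [0, 52] (the NIR lowering has to express the same thing with 32-bit ALU ops, and without the branch):

#include <stdint.h>

static void
frac_mask_hi_lo(unsigned frac_bits, uint32_t *hi, uint32_t *lo)
{
   if (frac_bits >= 32) {
      /* The low dword is fully covered; the high dword gets the rest. */
      *lo = 0xffffffffu;
      *hi = (frac_bits == 32) ? 0 : (1u << (frac_bits - 32)) - 1;
   } else {
      *lo = (frac_bits == 0) ? 0 : (1u << frac_bits) - 1;
      *hi = 0;
   }
}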

I was thinking about rewriting this as an if/else ladder instead. However, since this lowering runs on SSA form, I would have to deal with the phi nodes etc. manually, and I don't see any other case where we do something like that outside the to-SSA pass, so I wonder whether this is actually a good idea at all. What do you think?

If you think the if/else ladder is the way to go, is there any documentation or code reference I can look at to get an idea of how that should be implemented for a lowering pass in SSA mode?
Comment 30 Jason Ekstrand 2016-01-11 18:02:36 UTC
Created attachment 120957 [details]
NIR indirect lowering pass

(In reply to Iago Toral from comment #29)
> Hey Jason/Connor,
> 
> the lowering of trunc for doubles has some code that looks like this
> (pseudo-code):
> 
> if (exponent < 0) {
>    mask = 0x0
> } else if (exponent > 52) {
>    mask = 0x7fffffffffffffff;
> } else {
>    /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
>    mask =  (1LL << frac_bits) - 1;
> }
> 
> The current implementation I have works fine using bcsel. It looks something
> like this (again, pseudo-code):
> 
> mask = bcsel(exponent < 0,
>              0x7fffffffffffffff,
>              bcsel(exponent > 52,
>                    0x0000000000000000,
>                    (1LL << frac_bits) -1))
> 
> My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> operation that we have to implement in terms of hi/lo 32-bit integer
> operations (at least until we support 64-bit integers), so it is really a
> bunch of instructions. Because I use bcsel, it means that we generate code
> for that even if exponent is not in [1..51], which is not ideal.

Right.  I would encourage you not to use if's too much because branching may be more expensive than bcsel depending on what paths different invocations take.  However, if one side of the if is overwhelmingly more likely than the other, then control-flow is probably a good idea.

> I was thinking about rewriting this as an if/else ladder instead, however, I
> noticed that because this occurs in SSA mode I would have to deal with the
> phi nodes etc manually and I don't see any other case where we do something
> like that outside the NIR to SSA pass, so I wonder if this is actually a
> good idea at all. What do you think?
> 
> If you think the if/else ladder is the way to go, is there any documentation
> or
> code references I can look at to have an idea as to how that should be
> implemented for a lowering pass in SSA mode?

I attached a pass that I've written recently (not yet sent out for review, but it does work) that does exactly this.  It replaces indirect load/store operations with if-ladders and phi nodes (if needed).  Most of it comes down to using nir_insert_cf_node to insert it at the builder's cursor and then making sure you set the cursor to something reasonable when you're done.
Comment 31 Iago Toral 2016-01-12 07:34:18 UTC
(In reply to Jason Ekstrand from comment #30)
> Created attachment 120957 [details]
> NIR indirect lowering pass
> 
> (In reply to Iago Toral from comment #29)
> > Hey Jason/Connor,
> > 
> > the lowering of trunc for doubles has some code that looks like this
> > (pseudo-code):
> > 
> > if (exponent < 0) {
> >    mask = 0x0
> > } else if (exponent > 52) {
> >    mask = 0x7fffffffffffffff;
> > } else {
> >    /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
> >    mask =  (1LL << frac_bits) - 1;
> > }
> > 
> > The current implementation I have works fine using bcsel. It looks something
> > like this (again, pseudo-code):
> > 
> > mask = bcsel(exponent < 0,
> >              0x7fffffffffffffff,
> >              bcsel(exponent > 52,
> >                    0x0000000000000000,
> >                    (1LL << frac_bits) -1))
> > 
> > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> > operation that we have to implement in terms of hi/lo 32-bit integer
> > operations (at least until we support 64-bit integers), so it is really a
> > bunch of instructions. Because I use bcsel, it means that we generate code
> > for that even if exponent is not in [1..51], which is not ideal.
> 
> Right.  I would encourage you not to use if's too much because branching may
> be more expensive than bcsel depending on what paths different invocations
> take.  However, if one side of the if is overwhelmingly more likely than the
> other, then control-flow is probably a good idea.

Yeah, in this case exponents in the range 0..52 would be a lot more common than anything else.

> > I was thinking about rewriting this as an if/else ladder instead, however, I
> > noticed that because this occurs in SSA mode I would have to deal with the
> > phi nodes etc manually and I don't see any other case where we do something
> > like that outside the NIR to SSA pass, so I wonder if this is actually a
> > good idea at all. What do you think?
> > 
> > If you think the if/else ladder is the way to go, is there any documentation
> > or
> > code references I can look at to have an idea as to how that should be
> > implemented for a lowering pass in SSA mode?
> 
> I attached a pass that I've written recently (not yet sent out for review,
> but it does work) that does exactly this.  It replaces indirect load/store
> operations with if-ladders and phi nodes (if needed).  Most of it comes down
> to using nir_insert_cf_node to insert it at the builder's cursor and then
> making sure you set the cursor to something reasonable when you're done.

Awesome, thanks a lot Jason!
Comment 32 Samuel Iglesias Gonsálvez 2016-01-18 16:14:39 UTC
I found an issue related to the opt_algebraic passes and our lowering of the lrp instruction when it has double-based arguments (the hardware lrp instruction does not support doubles on BDW).

Currently, the lowering of double ops runs as part of the optimization loop in nir_optimize(). In that lowering function, I convert lrp to its equivalent mathematical expression when it has doubles as arguments... which is then reverted by opt_algebraic() in the next iteration of the optimization loop :-)
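For reference, the equivalent mathematical expression is essentially the standard linear-interpolation identity; a trivial C sketch with illustrative names (not the exact NIR expansion):

/* lrp(x, y, a) == x * (1 - a) + y * a, evaluated in double precision. */
static double
lrp_lowered(double x, double y, double a)
{
   return x * (1.0 - a) + y * a;
}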

My idea is to remove my lrp lowering pass, add a new flag in nir_shader_compiler_options called 'lower_lrp_doubles', and modify nir_opt_algebraic.py to run its lrp lowering when the arguments are doubles and that flag is enabled. Any driver that does not support lrp with double arguments would just enable that NIR option in its configuration.

I would like to know your opinion about this solution before implementing it. Do you have other solutions in mind?

Thanks!
Comment 33 Connor Abbott 2016-01-18 16:52:36 UTC
(In reply to Samuel Iglesias from comment #32)
> I found an issue related to the opt_algebraic passes after we lowered lrp
> instruction when it has double-based types as arguments (lrp instruction
> does not support doubles in BDW).
> 
> Currently, the lowering of double ops run as part the optimization loop in
> nir_optimize(). In that lowering function, I converted the lrp to its
> equivalent mathematical function when we have doubles as arguments... which
> is reverted by opt_algebraic() in next iteration of the optimization loop :-)
> 
> My idea is: remove my lrp's lowering pass, add a new flag in
> nir_shader_compiler_options called 'lower_lrp_doubles' and modify
> nir_opt_algebraic.py to run its lrp's lowering when the arguments are
> doubles, if that flag is enabled. Any driver that does not support lrp with
> double arguments, just enable that nir option in its configuration.
> 
> I would like to know your opinion about this solution before implementing
> it. Do you have other solutions in mind?
> 
> Thanks!

That seems like a fine solution -- we already do similar things in a number of different places in opt_algebraic. Just keep in mind that you'll have to disable the lrp optimization in opt_algebraic for doubles when that option is enabled to avoid running into the same problem.
Comment 34 Samuel Iglesias Gonsálvez 2016-01-19 06:40:06 UTC
(In reply to Connor Abbott from comment #33)
> (In reply to Samuel Iglesias from comment #32)
> > I found an issue related to the opt_algebraic passes after we lowered lrp
> > instruction when it has double-based types as arguments (lrp instruction
> > does not support doubles in BDW).
> > 
> > Currently, the lowering of double ops run as part the optimization loop in
> > nir_optimize(). In that lowering function, I converted the lrp to its
> > equivalent mathematical function when we have doubles as arguments... which
> > is reverted by opt_algebraic() in next iteration of the optimization loop :-)
> > 
> > My idea is: remove my lrp's lowering pass, add a new flag in
> > nir_shader_compiler_options called 'lower_lrp_doubles' and modify
> > nir_opt_algebraic.py to run its lrp's lowering when the arguments are
> > doubles, if that flag is enabled. Any driver that does not support lrp with
> > double arguments, just enable that nir option in its configuration.
> > 
> > I would like to know your opinion about this solution before implementing
> > it. Do you have other solutions in mind?
> > 
> > Thanks!
> 
> That seems like a fine solution -- we already do similar things in a number
> of different places in opt_algebraic. Just keep in mind that you'll have to
> disable the lrp optimization in opt_algebraic for doubles when that option
> is enabled to avoid running into the same problem.

OK, thanks Connor!
Comment 35 Samuel Iglesias Gonsálvez 2016-01-25 11:25:06 UTC
I found that the i965 driver currently runs the vec4 visitor and generator for geometry shaders, and we have some crashing/failing tests due to its lack of support for doubles. After enabling scalar GS, piglit shows no regressions.

I would like to know if there is a strong reason to keep scalar GS disabled by default (perhaps I am missing something), because otherwise we would need to add support for doubles in the vec4 backend.
Comment 36 Samuel Iglesias Gonsálvez 2016-01-25 12:12:50 UTC
(In reply to Samuel Iglesias from comment #35)
> I found i965 driver is currently running vec4 visitor and generator for
> geometry shaders and we have some crashed/failed tests due to its lack of
> support for doubles. After enabling scalar GS, piglit shows no regressions.
> 

I forgot to mention that we are currently focusing on the scalar backend to fix failing fp64 tests, and that we are running the tests on Broadwell (gen8) for now.

By enabling scalar GS on Broadwell by default, we would take advantage of our previous work to add support for doubles in the scalar backend. Thanks to that work, scalar GS not only shows no piglit regressions but also passes the fp64 geometry shader tests that crash or fail with the vec4 backend.

So my question is whether we can enable scalar GS on Broadwell by default, or whether there are reasons to keep it disabled.

> I would like to know if there is a strong reason to disable scalar GS by
> default (perhaps I am missing something) because then we would need to add
> support for doubles in vec4.
Comment 37 Samuel Iglesias Gonsálvez 2016-01-25 14:24:52 UTC
(In reply to Samuel Iglesias from comment #36)
> (In reply to Samuel Iglesias from comment #35)
> > I found i965 driver is currently running vec4 visitor and generator for
> > geometry shaders and we have some crashed/failed tests due to its lack of
> > support for doubles. After enabling scalar GS, piglit shows no regressions.
> > 
> 
> I forgot to mention we are currently focusing on the scalar backend to fix
> failed fp64 tests and running the tests on Broadwell (gen8) for now.
> 
> By enabling scalar GS in Broadwell by default, we would take advantage of
> our previous work to add support for doubles in the scalar backend. Thanks
> to that work, scalar GS not only shows no regressions on piglit but also
> passes the crashed/failed fp64's geometry shader tests we have seen with
> vec4 backend.
> 
> Then, my question is if we can enable scalar GS in Broadwell by default or
> if there are reasons to keep it disabled.
> 
> > I would like to know if there is a strong reason to disable scalar GS by
> > default (perhaps I am missing something) because then we would need to add
> > support for doubles in vec4.

Oh, I made a wrong comparison of the piglit results:

There are 11 piglit regressions when enabling scalar GS. They are unrelated to fp64, but they show why scalar GS was disabled by default on Broadwell: missing instanced GS support.

7 regressions are due to a failed assert in brw_compile_gs(): assert(prog_data->invocations == 1). After removing that assert, they pass. Is instanced GS already supported in the scalar backend?

The other 4 regressions seem to be a different issue, but I have not analyzed it yet:

shader_runner: brw_fs_nir.cpp:1813: void fs_visitor::emit_gs_input_load(const fs_reg&, const nir_src&, unsigned int, const nir_src&, unsigned int): Assertion `imm_offset == 0' failed.
Comment 38 Connor Abbott 2016-01-25 14:33:23 UTC
Unfortunately, one of the tessellation stages (I forget which) still runs in vec4 mode on Broadwell, and even if we supported scalar mode there are certain cases where we need to use vec4 (Ken would know the details), so we have to support fp64 on vec4 for Broadwell. There are some extra things that need to be fixed about fp64 on vec4 mode, but I think that I've communicated most of them in this bug.
Comment 39 Kenneth Graunke 2016-01-25 19:35:26 UTC
There are a few bugs left with INTEL_SCALAR_GS=1 - I just filed bugs for them:

https://bugs.freedesktop.org/show_bug.cgi?id=93859
https://bugs.freedesktop.org/show_bug.cgi?id=93860

The tessellation control shader (TCS) currently operates in vec4 mode on all hardware.  I have a branch that begins implementing a SIMD8 backend, but I haven't finished it, as I don't think there's much advantage over SIMD4x2 mode.  However, I believe everything *could* run in SIMD8 mode if we wanted.  I could probably be talked into finishing it, if you need it.
Comment 40 Iago Toral 2016-01-26 14:42:25 UTC
(In reply to Kenneth Graunke from comment #39)
> There are a few bugs left with INTEL_SCALAR_GS=1 - I just filed bugs for
> them:
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=93859
> https://bugs.freedesktop.org/show_bug.cgi?id=93860
> 
> The tessellation control shader (TCS) currently operates in vec4 mode on all
> hardware.  I have a branch that begins implementing a SIMD8 backend, but I
> haven't finished it, as I don't think there's much advantage over SIMD4x2
> mode.  However, I believe everything *could* run in SIMD8 mode if we wanted.
> I could probably be talked into finishing it, if you need it.

Not really. We have been focusing on the FS backend for the fp64 implementation so far, and if we could make Broadwell work exclusively in scalar mode we would have it fully covered (the implementation of fp64 in the FS backend is mostly done). But the plan was to work on the vec4 backend next anyway (we will need that in any case if we aim to support fp64 on gen7 at some point), so this only means that claiming full fp64 support on Broadwell will have to wait until we get the vec4 support in place.
Comment 41 Iago Toral 2016-02-01 11:15:46 UTC
Hi Connor, I have a question about the brw_nir_split_doubles pass that you wrote for the vec4 backend. The pass does not lower nir_op_vec3/4 on purpose with this comment:

/* These ops are the ones that group up dvec2's and doubles into dvec3's
 * and dvec4's when necessary, so we don't lower them. If they're
 * unnecessary, copy propagation will clean them up.
 */

However, this leads to 64-bit instructions writing to channels ZW, which we don't want since our NIR->vec4 pass expects that 64-bit operations never have a writemask including channels other than XY.

Right now, the lower_vec_to_movs pass that we run right after nir_from_ssa generates one MOV per channel of the vecN instruction's destination, so for 64-bit data it emits MOVs that write to components Z and W of a dvec3/4.

I suppose your idea was to break up ALU operations, then group them back as vec3/vec4 operations so we don't lose track of the original size of the data elements involved in the operations. If that is the case, I think we can disable lower_vec_to_movs() on dvec3/dvec4 and let the nir-vec4 pass handle those. Does this make sense to you? Did you have a different idea about how this should work?
Comment 42 Iago Toral 2016-02-01 15:37:34 UTC
(In reply to Iago Toral from comment #41)
> Hi Connor, I have a question about the brw_nir_split_doubles pass that you
> wrote for the vec4 backend. The pass does not lower nir_op_vec3/4 on purpose
> with this comment:
> 
> /* These ops are the ones that group up dvec2's and doubles into dvec3's
>  * and dvec4's when necessary, so we don't lower them. If they're
>  * unnecessary, copy propagation will clean them up.
>  */
> 
> However, this obviously leads to 64-bit instructions writing to channels ZW,
> which we don't want to have since our Nir->vec4 pass expects that any 64-bit
> operation won't have a writemask including channels other than XY.
> 
> Right now, the lower_vec_to_movs pass that we run right after the
> nir_from_ssa pass seems to generate MOVs that write to each channel of the
> vecN instruction dest, so with this, it generates MOVs with 64-bit things
> that write to components Z and W of a dvec3/4.
> 
> I suppose your idea was to break up ALU operations, then group them back as
> vec3/vec4 operations so we don't lose track of the original size of the data
> elements involved in the operations. If that is the case, I think we can
> disable lower_vec_to_movs() on dvec3/dvec4 and let the nir-vec4 pass handle
> those. Does this make sense to you? Did you have a different idea about how
> this should work?

Or maybe you expected that the MOVs in lower_vec_to_movs would always be coalesced so we would never really emit instructions to generate the vec3/4 at all? This is not happening because of the presence of source modifiers in the instructions that use the result of the vecN operation. I suppose we can detect these cases and fix them by inserting a MOV to a temporary with the source modifier and then rewriting the instruction to consume this instead of the original value.
Comment 43 Jason Ekstrand 2016-02-01 15:51:31 UTC
(In reply to Iago Toral from comment #42)
> (In reply to Iago Toral from comment #41)
> > Hi Connor, I have a question about the brw_nir_split_doubles pass that you
> > wrote for the vec4 backend. The pass does not lower nir_op_vec3/4 on purpose
> > with this comment:
> > 
> > /* These ops are the ones that group up dvec2's and doubles into dvec3's
> >  * and dvec4's when necessary, so we don't lower them. If they're
> >  * unnecessary, copy propagation will clean them up.
> >  */
> > 
> > However, this obviously leads to 64-bit instructions writing to channels ZW,
> > which we don't want to have since our Nir->vec4 pass expects that any 64-bit
> > operation won't have a writemask including channels other than XY.
> > 
> > Right now, the lower_vec_to_movs pass that we run right after the
> > nir_from_ssa pass seems to generate MOVs that write to each channel of the
> > vecN instruction dest, so with this, it generates MOVs with 64-bit things
> > that write to components Z and W of a dvec3/4.
> > 
> > I suppose your idea was to break up ALU operations, then group them back as
> > vec3/vec4 operations so we don't lose track of the original size of the data
> > elements involved in the operations. If that is the case, I think we can
> > disable lower_vec_to_movs() on dvec3/dvec4 and let the nir-vec4 pass handle
> > those. Does this make sense to you? Did you have a different idea about how
> > this should work?
> 
> Or maybe you expected that the MOVs in lower_vec_to_movs would always be
> coalesced so we would never really emit instructions to generate the vec3/4
> at all? This is not happening because of the presence of source modifiers in
> the instructions that use the result of the vecN operation. I suppose we can
> detect these cases and fix them by inserting a MOV to a temporary with the
> source modifier and then rewriting the instruction to consume this instead
> of the original value.

I believe that was the intention.  With full scalarizing, this works because every instruction reads or writes exactly one component by the time you're done.  With breaking things into vec2's, you can have things cross channels and, as you observed, this creates problems.  Adding movs seems like the right idea, but I think you want to add vecN instructions instead.  That way, you get the nice property that each source of the vecN only reads one channel so the vec that combines results gets copy-propagated into the source of the vec that sets up the source.  If you don't need the vecN on the source and can just swizzle (what you want), copy-prop should take care of it.
Comment 44 Connor Abbott 2016-02-01 16:22:53 UTC
(In reply to Jason Ekstrand from comment #43)
> (In reply to Iago Toral from comment #42)
> > (In reply to Iago Toral from comment #41)
> > > Hi Connor, I have a question about the brw_nir_split_doubles pass that you
> > > wrote for the vec4 backend. The pass does not lower nir_op_vec3/4 on purpose
> > > with this comment:
> > > 
> > > /* These ops are the ones that group up dvec2's and doubles into dvec3's
> > >  * and dvec4's when necessary, so we don't lower them. If they're
> > >  * unnecessary, copy propagation will clean them up.
> > >  */
> > > 
> > > However, this obviously leads to 64-bit instructions writing to channels ZW,
> > > which we don't want to have since our Nir->vec4 pass expects that any 64-bit
> > > operation won't have a writemask including channels other than XY.
> > > 
> > > Right now, the lower_vec_to_movs pass that we run right after the
> > > nir_from_ssa pass seems to generate MOVs that write to each channel of the
> > > vecN instruction dest, so with this, it generates MOVs with 64-bit things
> > > that write to components Z and W of a dvec3/4.
> > > 
> > > I suppose your idea was to break up ALU operations, then group them back as
> > > vec3/vec4 operations so we don't lose track of the original size of the data
> > > elements involved in the operations. If that is the case, I think we can
> > > disable lower_vec_to_movs() on dvec3/dvec4 and let the nir-vec4 pass handle
> > > those. Does this make sense to you? Did you have a different idea about how
> > > this should work?
> > 
> > Or maybe you expected that the MOVs in lower_vec_to_movs would always be
> > coalesced so we would never really emit instructions to generate the vec3/4
> > at all? This is not happening because of the presence of source modifiers in
> > the instructions that use the result of the vecN operation. I suppose we can
> > detect these cases and fix them by inserting a MOV to a temporary with the
> > source modifier and then rewriting the instruction to consume this instead
> > of the original value.
> 
> I believe that was the intention.  With full scalarizing, this works because
> every instruction reads or writes exactly one component by the time you're
> done.  With breaking things into vec2's, you can have things cross channels
> and, as you observed, this creates problems.  Adding movs seems like the
> right idea, but I think you want to add vecN instructions instead.  That
> way, you get the nice property that each source of the vecN only reads one
> channel so the vec that combines results gets copy-propagated into the
> source of the vec that sets up the source.  If you don't need the vecN on
> the source and can just swizzle (what you want), copy-prop should take care
> of it.

I think that what I intended was actually to disable lower_vec_to_mov's for dvec3/dvec4. I *think* this is essentially what you're saying. The issue with representing this with MOV's is that some of the MOV's that lower_vec_to_movs creates may straddle the boundary between the first and second half of the dvec4, which would be more work to handle in the backend than a single dvec3/4 operation with a separate source for each component. This is pretty similar to how scalarizing works in the FS backend, with the rub being that copy propagation can still decide to give you, say, a yz swizzle of a dvec4 as a source. This is similar to how, in FS, you can get a component of a larger vector as a source, except there we can just offset it but here we have to do extra work if the source straddles the boundary. We can still be guaranteed that all destinations are dvec2 or smaller, though, since everything larger than a dvec2 is only going to be created by a load op or a dvec3/4 op.
Comment 45 Iago Toral 2016-02-02 08:53:01 UTC
(In reply to Connor Abbott from comment #44)
> (In reply to Jason Ekstrand from comment #43)
> > (In reply to Iago Toral from comment #42)
> > > (In reply to Iago Toral from comment #41)
> > > > Hi Connor, I have a question about the brw_nir_split_doubles pass that you
> > > > wrote for the vec4 backend. The pass does not lower nir_op_vec3/4 on purpose
> > > > with this comment:
> > > > 
> > > > /* These ops are the ones that group up dvec2's and doubles into dvec3's
> > > >  * and dvec4's when necessary, so we don't lower them. If they're
> > > >  * unnecessary, copy propagation will clean them up.
> > > >  */
> > > > 
> > > > However, this obviously leads to 64-bit instructions writing to channels ZW,
> > > > which we don't want to have since our Nir->vec4 pass expects that any 64-bit
> > > > operation won't have a writemask including channels other than XY.
> > > > 
> > > > Right now, the lower_vec_to_movs pass that we run right after the
> > > > nir_from_ssa pass seems to generate MOVs that write to each channel of the
> > > > vecN instruction dest, so with this, it generates MOVs with 64-bit things
> > > > that write to components Z and W of a dvec3/4.
> > > > 
> > > > I suppose your idea was to break up ALU operations, then group them back as
> > > > vec3/vec4 operations so we don't lose track of the original size of the data
> > > > elements involved in the operations. If that is the case, I think we can
> > > > disable lower_vec_to_movs() on dvec3/dvec4 and let the nir-vec4 pass handle
> > > > those. Does this make sense to you? Did you have a different idea about how
> > > > this should work?
> > > 
> > > Or maybe you expected that the MOVs in lower_vec_to_movs would always be
> > > coalesced so we would never really emit instructions to generate the vec3/4
> > > at all? This is not happening because of the presence of source modifiers in
> > > the instructions that use the result of the vecN operation. I suppose we can
> > > detect these cases and fix them by inserting a MOV to a temporary with the
> > > source modifier and then rewriting the instruction to consume this instead
> > > of the original value.
> > 
> > I believe that was the intention.  With full scalarizing, this works because
> > every instruction reads or writes exactly one component by the time you're
> > done.  With breaking things into vec2's, you can have things cross channels
> > and, as you observed, this creates problems.  Adding movs seems like the
> > right idea, but I think you want to add vecN instructions instead.  That
> > way, you get the nice property that each source of the vecN only reads one
> > channel so the vec that combines results gets copy-propagated into the
> > source of the vec that sets up the source.  If you don't need the vecN on
> > the source and can just swizzle (what you want), copy-prop should take care
> > of it.
> 
> I think that what I intended was actually to disable lower_vec_to_mov's for
> dvec3/dvec4. I *think* this is essentially what you're saying. The issue
> with representing this with MOV's is that some of the MOV's that
> lower_vec_to_movs creates may straddle the boundary between the first and
> second half of the dvec4, which would be more work to handle in the backend
> than a single dvec3/4 operation with a separate source for each component.
> This is pretty similar to how scalarizing works in the FS backend, with the
> rub being that copy propagation can still decide to give you, say, a yz
> swizzle of a dvec4 as a source. This is similar to how, in FS, you can get a
> component of a larger vector as a source, except there we can just offset it
> but here we have to do extra work if the source straddles the boundary. We
> can still be guaranteed that all destinations are dvec2 or smaller, though,
> since everything larger than a dvec2 is only going to be created by a load
> op or a dvec3/4 op.

Thank you both for the fast replies! Based on your comments it looks like I should be experimenting with breaking up the vec3/4 operations into vec2's in the NIR->vec4 translator, so I'll start there and see how that works out.
Comment 46 Iago Toral 2016-02-05 10:52:01 UTC
I was thinking that, since the modifications we want to make to lower_vec_to_movs for dvec3 and dvec4 are really specific to Intel (they are only needed to play along with brw_nir_split_doubles), we should put that behavior behind a compiler option. Actually, I was wondering if we should move the split_doubles pass to NIR as well and make it run or not based on the same compiler option. Does this make sense to you?
Comment 47 Connor Abbott 2016-02-05 14:31:25 UTC
(In reply to Iago Toral from comment #46)
> I was thinking that since we want to make some modifications to
> lower_vec_to_movs for dvec3 and dvec4 that are really specific for Intel
> (they are only needed to play along with brw_nir_split_doubles) we should
> make that work under a compiler option. Actually, I was wondering if we
> should move the split_doubles pass to NIR as well and make it run or not
> based on the same compiler option. Does this make sense to you?

Yeah, that probably makes sense. In case NIR-to-TGSI ever happens they'll want to do something similar as well.
Comment 48 Iago Toral 2016-02-11 13:07:11 UTC
Connor, I think there is a problem with the implementation of d2f and f2d in the vec4 backend. These use a generator opcode that switches to scalar mode like this (for d2f):

brw_set_default_access_mode(p, BRW_ALIGN_1);
dst.hstride = BRW_HORIZONTAL_STRIDE_2;
dst.width = BRW_WIDTH_4;
brw_MOV(p, dst, src[0]);
brw_set_default_access_mode(p, BRW_ALIGN_16);

The problem with this is that align1 (scalar) mode does not apply the swizzle in src[0], and because of copy propagation we can end up with a swizzle like zwzw here.

In the case of d2f, it seems that I can just do a normal MOV in align16 mode and that works, but f2d is not so easy and we probably need align1 mode there.

Of course, we can fix this by disabling copy propagation for the generator opcodes we use here; I suppose that would be alright, but I wonder if you think there is a better approach.
Comment 49 Connor Abbott 2016-02-11 14:38:54 UTC
(In reply to Iago Toral from comment #48)
> Connor, I think there is a problem with the implementation of d2f and f2d in
> the vec4 backend. These use a generator opcode that switches to scalar mode
> like this (for d2f):
> 
> brw_set_default_access_mode(p, BRW_ALIGN_1);
> dst.hstride = BRW_HORIZONTAL_STRIDE_2;
> dst.width = BRW_WIDTH_4;
> brw_MOV(p, dst, src[0]);
> brw_set_default_access_mode(p, BRW_ALIGN_16);
> 
> The problem with this is that scalar mode does not consider the swizzle in
> src[0] and because of copy-propagation, we can end up with a swizzle like
> zwzw here.
> 
> In the case of d2f, it seems that  I can just do a normal MOV in align16
> mode and that seems to work, but f2d is not so easy and we probably need
> align1 mode there.
> 
> Of course, we can fix this by disabling copy-propagation for the generator
> opcodes we use for this, I suppose that should be alright, but I wonder if
> you think there is a better approach.

Yeah, I think that my plan was just to disable copy propagation into those. We already do this for other opcodes where we use align1 mode, and I thought I had already disabled it, although I could be mis-remembering. If we try and handle the swizzle in the generator by emitting another MOV, we would just be undoing the copy propagation which seems a little silly.
Comment 50 Connor Abbott 2016-02-11 14:39:25 UTC
(In reply to Connor Abbott from comment #49)
> (In reply to Iago Toral from comment #48)
> > Connor, I think there is a problem with the implementation of d2f and f2d in
> > the vec4 backend. These use a generator opcode that switches to scalar mode
> > like this (for d2f):
> > 
> > brw_set_default_access_mode(p, BRW_ALIGN_1);
> > dst.hstride = BRW_HORIZONTAL_STRIDE_2;
> > dst.width = BRW_WIDTH_4;
> > brw_MOV(p, dst, src[0]);
> > brw_set_default_access_mode(p, BRW_ALIGN_16);
> > 
> > The problem with this is that scalar mode does not consider the swizzle in
> > src[0] and because of copy-propagation, we can end up with a swizzle like
> > zwzw here.
> > 
> > In the case of d2f, it seems that  I can just do a normal MOV in align16
> > mode and that seems to work, but f2d is not so easy and we probably need
> > align1 mode there.
> > 
> > Of course, we can fix this by disabling copy-propagation for the generator
> > opcodes we use for this, I suppose that should be alright, but I wonder if
> > you think there is a better approach.
> 
> Yeah, I think that my plan was just to disable copy propagation into those.
> We already do this for other opcodes where we use align1 mode, and I thought
> I had already disabled it, although I could be mis-remembering. If we try
> and handle the swizzle in the generator by emitting another MOV, we would
> just be undoing the copy propagation which seems a little silly.

Err, disable copy propagation with swizzles of course...
Comment 51 Iago Toral 2016-02-15 10:48:40 UTC
I noticed that nir_lower_locals_to_regs can insert MOVs of 64-bit things, and we need to catch these in our double splitting pass for the vec4 backend. However, I am a bit confused here, because nir_lower_locals_to_regs injects nir_registers rather than SSA definitions, so the double splitting pass cannot currently handle the NIR generated after it:

decl_reg vec4 64 r0[4]
(...)
vec4 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
r0[3] = imov ssa_6
(...)
vec4 64 ssa_12 = imov r0[0 + ssa_11]

If this is correct and expected, then I guess we will have to amend the double splitting pass to handle nir_registers as well, right?
Comment 52 Iago Toral 2016-02-15 11:03:40 UTC
(In reply to Iago Toral from comment #51)
> I noticed that nir_lower_locals_to_regs can insert MOVs of 64-bit things and
> we need to catch these in our double splitting pass for the vec4 backend.
> However, I am a bit confused here because nir_lower_locals_to_regs injects
> nir_registers and not SSA definitions so the double splitting pass can't
> handle the generated NIR after it at the moment:
> 
> decl_reg vec4 64 r0[4]
> (...)
> vec4 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
> r0[3] = imov ssa_6
> (...)
> vec4 64 ssa_12 = imov r0[0 + ssa_11]
> 
> If this is correct and expected, then I guess we will have to amend the
> double splitting pass to handle nir_registers as well, right?

Or maybe we should split dvec3/4 loads into two dvec2 loads plus a 64-bit vec3/4 operation. So far I was working under the assumption that vecN operations and dvec loads were the two cases where the vec4 backend could see writes bigger than a dvec2. I actually implemented that for UBOs and SSBOs, but seeing this, maybe it is better to split them into dvec2 loads.
Comment 53 Iago Toral 2016-02-15 13:25:54 UTC
(In reply to Iago Toral from comment #52)
> (In reply to Iago Toral from comment #51)
> > I noticed that nir_lower_locals_to_regs can insert MOVs of 64-bit things and
> > we need to catch these in our double splitting pass for the vec4 backend.
> > However, I am a bit confused here because nir_lower_locals_to_regs injects
> > nir_registers and not SSA definitions so the double splitting pass can't
> > handle the generated NIR after it at the moment:
> > 
> > decl_reg vec4 64 r0[4]
> > (...)
> > vec4 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
> > r0[3] = imov ssa_6
> > (...)
> > vec4 64 ssa_12 = imov r0[0 + ssa_11]
> > 
> > If this is correct and expected, then I guess we will have to amend the
> > double splitting pass to handle nir_registers as well, right?
> 
> Or maybe we should split dvec3/4 loads into two dvec2 loads plus a 64-bit 
> vec3/4 operation. So far I was working with the assumption that vecN
> operations and dvec loads where the two cases where the vec4 backend could
> see writes bigger than a dvec2. I actually implemented that for UBOs and
> SSBOs but seeing this, maybe it is better to split them into dvec2 loads.

FYI, I went ahead and implemented this, but the problem persists, as you can see here:

 	vec2 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
	vec1 32 ssa_7 = load_const (0x00000010 /* 0.000000 */)
	vec1 32 ssa_8 = iadd ssa_5, ssa_7
	vec2 64 ssa_9 = intrinsic load_ubo (ssa_0, ssa_8) () ()
	vec4 64 ssa_10 = vec4 ssa_6, ssa_6.y, ssa_9, ssa_9.y
	r0[3] = imov ssa_10

so it seems like we really need to run the splitting pass after nir_lower_locals_to_regs and make it handle nir_register destinations too.

That said, are we still interested in breaking UBO/SSBO/shared-variable loads into dvec2 chunks? That would make the backend implementation slightly easier, but not by much. The current implementation I have can already handle loads of dvec3/dvec4 without much effort:

https://github.com/Igalia/mesa/commit/efe6233e7bb00fc583d393058b559091e73729f9
Comment 54 Jason Ekstrand 2016-02-15 16:04:47 UTC
(In reply to Iago Toral from comment #53)
> (In reply to Iago Toral from comment #52)
> > (In reply to Iago Toral from comment #51)
> > > I noticed that nir_lower_locals_to_regs can insert MOVs of 64-bit things and
> > > we need to catch these in our double splitting pass for the vec4 backend.
> > > However, I am a bit confused here because nir_lower_locals_to_regs injects
> > > nir_registers and not SSA definitions so the double splitting pass can't
> > > handle the generated NIR after it at the moment:
> > > 
> > > decl_reg vec4 64 r0[4]
> > > (...)
> > > vec4 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
> > > r0[3] = imov ssa_6
> > > (...)
> > > vec4 64 ssa_12 = imov r0[0 + ssa_11]
> > > 
> > > If this is correct and expected, then I guess we will have to amend the
> > > double splitting pass to handle nir_registers as well, right?
> > 
> > Or maybe we should split dvec3/4 loads into two dvec2 loads plus a 64-bit 
> > vec3/4 operation. So far I was working with the assumption that vecN
> > operations and dvec loads where the two cases where the vec4 backend could
> > see writes bigger than a dvec2. I actually implemented that for UBOs and
> > SSBOs but seeing this, maybe it is better to split them into dvec2 loads.
> 
> FYI, I went ahead and implemented this, but the problem persists, as you can
> see here:
> 
>  	vec2 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
> 	vec1 32 ssa_7 = load_const (0x00000010 /* 0.000000 */)
> 	vec1 32 ssa_8 = iadd ssa_5, ssa_7
> 	vec2 64 ssa_9 = intrinsic load_ubo (ssa_0, ssa_8) () ()
> 	vec4 64 ssa_10 = vec4 ssa_6, ssa_6.y, ssa_9, ssa_9.y
> 	r0[3] = imov ssa_10
> 
> so it seems like we really need to run the splitting pass after 
> nir_lower_locals_to_regs and make it handle nir_register destinations too.
> 
> That said, are we still interested in breaking ubo/ssbo/shared-variable
> loads in dvec2 chunks? That would make the backend implementation slightly
> easier, but not too much. The current implementation I have can already
> handle loads of dvec3/dvec4 without much effort:
> 
> https://github.com/Igalia/mesa/commit/
> efe6233e7bb00fc583d393058b559091e73729f9

I think the real solution here is to make the back-end handle dvec4 MOVs.  You wouldn't have to handle any other dvec4 ALU ops, but MOV may be a special case.  We could try to do some sort of splitting operation on variables prior to running locals_to_regs, but I think that may not be worth it.
Comment 55 Iago Toral 2016-02-16 07:41:20 UTC
(In reply to Jason Ekstrand from comment #54)
> (In reply to Iago Toral from comment #53)
> > (In reply to Iago Toral from comment #52)
> > > (In reply to Iago Toral from comment #51)
> > > > I noticed that nir_lower_locals_to_regs can insert MOVs of 64-bit things and
> > > > we need to catch these in our double splitting pass for the vec4 backend.
> > > > However, I am a bit confused here because nir_lower_locals_to_regs injects
> > > > nir_registers and not SSA definitions so the double splitting pass can't
> > > > handle the generated NIR after it at the moment:
> > > > 
> > > > decl_reg vec4 64 r0[4]
> > > > (...)
> > > > vec4 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
> > > > r0[3] = imov ssa_6
> > > > (...)
> > > > vec4 64 ssa_12 = imov r0[0 + ssa_11]
> > > > 
> > > > If this is correct and expected, then I guess we will have to amend the
> > > > double splitting pass to handle nir_registers as well, right?
> > > 
> > > Or maybe we should split dvec3/4 loads into two dvec2 loads plus a 64-bit 
> > > vec3/4 operation. So far I was working with the assumption that vecN
> > > operations and dvec loads where the two cases where the vec4 backend could
> > > see writes bigger than a dvec2. I actually implemented that for UBOs and
> > > SSBOs but seeing this, maybe it is better to split them into dvec2 loads.
> > 
> > FYI, I went ahead and implemented this, but the problem persists, as you can
> > see here:
> > 
> >  	vec2 64 ssa_6 = intrinsic load_ubo (ssa_0, ssa_5) () ()
> > 	vec1 32 ssa_7 = load_const (0x00000010 /* 0.000000 */)
> > 	vec1 32 ssa_8 = iadd ssa_5, ssa_7
> > 	vec2 64 ssa_9 = intrinsic load_ubo (ssa_0, ssa_8) () ()
> > 	vec4 64 ssa_10 = vec4 ssa_6, ssa_6.y, ssa_9, ssa_9.y
> > 	r0[3] = imov ssa_10
> > 
> > so it seems like we really need to run the splitting pass after 
> > nir_lower_locals_to_regs and make it handle nir_register destinations too.
> > 
> > That said, are we still interested in breaking ubo/ssbo/shared-variable
> > loads in dvec2 chunks? That would make the backend implementation slightly
> > easier, but not too much. The current implementation I have can already
> > handle loads of dvec3/dvec4 without much effort:
> > 
> > https://github.com/Igalia/mesa/commit/
> > efe6233e7bb00fc583d393058b559091e73729f9
> 
> I think the real solution here is to make the back-end handle dvec4 MOVs.
> You wouldn't have to handle any other dvec4 ALU ops, but MOV may be a special
> case.  We could try to do some sort of splitting operation on variables
> prior to running locals_to_regs, but I think that may not be worth it.

Yeah, that could be a way to go about this. I'll do that and see how it works. Thanks Jason!

How about dvec3/4 loads (ubo, ssbo, shared variables)? Do we want to split these or is it fine to handle them in the backend? I think they are easy enough to handle in the backend (I am doing that already), but it is not difficult to have NIR split them either.
Comment 56 Samuel Iglesias Gonsálvez 2016-02-16 12:26:16 UTC
In our current implementation, the lowering passes for floor() and ceil() use trunc() to do part of the calculation. Previously, we scalarized the trunc() implementation because of its use of nir_if.

I scalarized both the floor() and ceil() lowering passes under the hypothesis that nir_bcsel only reads one channel of its condition (like nir_if), because in both passes there is at least one bcsel operation before the call to trunc(). With that, the floor and ceil tests pass for dvecN.

Is that assumption valid for nir_bcsel?
Comment 57 Jason Ekstrand 2016-02-16 14:20:31 UTC
(In reply to Iago Toral from comment #55)
> How about dvec3/4 loads (ubo, ssbo, shared variables)? Do we want to split
> these or is it fine to handle them in the backend? I think they are easy
> enough to handle in the backend (I am doing that already), but it is not
> difficult to have NIR split them either.

Whatever makes your life easier.  If they're not bad to just implement in the backend, do it.  If it makes things easier to have the splitting pass split them, then do that.

(In reply to Samuel Iglesias from comment #56)
> In our current implementation, the lowering passes for floor() and ceil()
> are using trunc() to do part of the calculations. Previosly, we scalarized
> trunc() implementation because of its use of nir_if instruction.
> 
> I scalarized both floor() and ceil() lowering passes under the hypothesis
> that nir_bcsel only reads from one channel (like nir_if), because in both
> passes there is at least one bcsel operation before calling trunc(). Now,
> the floor and ceil tests pass for dvecN.
> 
> Is that assumption valid for nir_bcsel?

No.  nir_bcsel is supposed to be fully vectorized.  I'm not sure how well tested that is though, so there may be a bug in the vec4 NIR code.
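
For what it's worth, "fully vectorized" here means bcsel selects per channel. A minimal hypothetical sketch (assuming vec2 operands built elsewhere; illustrative names only):

#include "nir_builder.h"

/* Each channel of the result comes from the matching channel of a or c,
 * depending on the matching channel of cond -- unlike a nir_if, which
 * branches the whole block on a single scalar condition. */
static nir_ssa_def *
select_per_channel(nir_builder *b, nir_ssa_def *cond_vec2,
                   nir_ssa_def *a_vec2, nir_ssa_def *c_vec2)
{
   return nir_bcsel(b, cond_vec2, a_vec2, c_vec2);
}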
Comment 58 Iago Toral 2016-02-23 08:13:50 UTC
(In reply to Jason Ekstrand from comment #30)
> Created attachment 120957 [details]
> NIR indirect lowering pass
> 
> (In reply to Iago Toral from comment #29)
> > Hey Jason/Connor,
> > 
> > the lowering of trunc for doubles has some code that looks like this
> > (pseudo-code):
> > 
> > if (exponent < 0) {
> >    mask = 0x0
> > } else if (exponent > 52) {
> >    mask = 0x7fffffffffffffff;
> > } else {
> >    /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
> >    mask =  (1LL << frac_bits) - 1;
> > }
> > 
> > The current implementation I have works fine using bcsel. It looks something
> > like this (again, pseudo-code):
> > 
> > mask = bcsel(exponent < 0,
> >              0x7fffffffffffffff,
> >              bcsel(exponent > 52,
> >                    0x0000000000000000,
> >                    (1LL << frac_bits) -1))
> > 
> > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> > operation that we have to implement in terms of hi/lo 32-bit integer
> > operations (at least until we support 64-bit integers), so it is really a
> > bunch of instructions. Because I use bcsel, it means that we generate code
> > for that even if exponent is not in [1..51], which is not ideal.
> 
> Right.  I would encourage you not to use if's too much because branching may
> be more expensive than bcsel depending on what paths different invocations
> take.  However, if one side of the if is overwhelmingly more likely than the
> other, then control-flow is probably a good idea.

I have been revisiting this. Because if statements in NIR are strictly scalar, this lowering needs to be scalarized as well. I wonder if the scalarized code resulting from this defeats the purpose of using the if statement for the vec4 backend, since we lose the ability to use vector instructions.

Some quick experiments with a simple trunc() test show these results (#instructions):

backend        bcsel      if (unscalarized)     if (scalarized)
----------------------------------------------------------------
vec4            65              69                   102
fs (simd8)      67              85                    85
fs (simd16)     95             119                   119

The bcsel implementations have fewer instructions overall, as expected, although, as discussed before, the if implementations may still be better in some cases since they might end up executing fewer instructions. However, it is clear that the required scalarization of the if statement in the vec4 backend makes things much worse, to the point that I am not sure anymore that it is a win in that scenario.
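
For reference, here is a rough nir_builder sketch of where the extra instructions in the bcsel variant come from: the 64-bit (1LL << frac_bits) - 1 has to be built as separate lo/hi 32-bit halves, and both arms of every bcsel are always emitted. This is just an illustration with made-up names (scalar-only, and it ignores the exponent < 0 / > 52 cases), not the actual lowering code:

#include "nir_builder.h"

/* Hypothetical helper: build the lo/hi 32-bit halves of (1LL << frac_bits) - 1
 * without 64-bit integer ops.  Both arms of each bcsel get emitted, which is
 * where the extra instruction count comes from. */
static void
build_frac_mask(nir_builder *b, nir_ssa_def *frac_bits,
                nir_ssa_def **mask_lo, nir_ssa_def **mask_hi)
{
   nir_ssa_def *one = nir_imm_int(b, 1);
   nir_ssa_def *c32 = nir_imm_int(b, 32);

   /* frac_bits >= 32: low word is all ones, high word is (1 << (n - 32)) - 1;
    * frac_bits <  32: low word is (1 << n) - 1, high word is 0. */
   *mask_lo = nir_bcsel(b, nir_ige(b, frac_bits, c32),
                        nir_imm_int(b, ~0),
                        nir_isub(b, nir_ishl(b, one, frac_bits), one));
   *mask_hi = nir_bcsel(b, nir_ilt(b, frac_bits, c32),
                        nir_imm_int(b, 0),
                        nir_isub(b, nir_ishl(b, one,
                                             nir_isub(b, frac_bits, c32)),
                                  one));
}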

So we have 2 options again:

1) Go back to the bcsel implementation for both backends.
2) Pass an is_scalar flag to the lowering pass and choose the bcsel implementation for non-scalar backends and the scalarized if implementation for scalar.

2) _might_ be better overall from a performance standpoint but I wonder if it is worth having two different implementations of this. This decision would also affect the implementation of roundEven().

What do you think?
Comment 59 Iago Toral 2016-03-01 15:02:26 UTC
I think the fp64 implementation is now mostly complete. Over the last few days we have been going through the patch set cleaning things up a bit in preparation for review; the result is in this branch:

https://github.com/Igalia/mesa/tree/i965-fp64

FYI, we have also expanded the fp64 piglit tests to cover things that we found weren't covered by existing tests. There is a branch with these here:

https://github.com/Igalia/piglit/tree/arb_gpu_shader_fp64

It still needs a rebase and some clean up work, but we should be sending this for review soon too.

As it is now, we pass all the fp64 tests except for a few that fail because of spilling, which will probably take some optimization work to fix. We also have no regressions in non-double functionality on any hardware generation that we have available (gen5 to gen9).

There are still a couple of things that we need to confirm so we can squash or remove a few patches in the series:

1. We need to make a decision about d2b. We discussed on the list that f2b does not need to handle denorms, and NVIDIA does not handle them either. On the other hand, NVIDIA does handle d2b with denorms (that is, it returns true for them). Since doubles are more about precision than performance, this could make sense for us too. If we want this behavior I have a patch for it in the series, following a similar approach to the one I sent to the list for f2b (I should get rid of the resolve_source_modifiers() for gen8 as Jason mentioned for the f2b case):

https://github.com/Igalia/mesa/commit/df6c1d940be455d0e40d9eee2cb0e8987e4baa16

2. We need to make a decision regarding the scalarization of the lowerings of trunc() and roundEven() in NIR to use if statements instead of bcsel. For scalar this seems like a possible win, but for vec4 it probably isn't since we lose the ability to do vector instructions. See my previous comment for more details.

With those two questions answered we would have everything ready for review; the only important thing missing that I can think of is spilling. As I mentioned above, there are a few piglit tests that fail because of this. I think this might be a combination of the spilling code not being able to spill registers with a size > 1 (at least for vec4) and some needed optimization work, but it is probably a good idea to start the review while we look into this.

With the two questions above pending, the series should be 190+ patches long, so it is pretty big. The composition looks more or less like this:

1) execsize/width fixes: 9 patches (the ones we sent in December to the list)
2) NIR: ~60 patches
3) i965: ~130 patches

Notice that some of the i965 patches are _required_ after the NIR patches so that we don't break things in the driver, because NIR introduces sized types that drivers need to handle even if they don't care about fp64 yet.

Also, as soon as we get sized types in NIR we are going to break drivers that don't support sized types (freedreno/vc4). I suppose we should ping Eric and Rob and point them at our branch so they can have some time to fix their drivers accordingly. We did fix some obvious things for them, but they should run piglit and probably fix some more stuff.

I suppose we could try to break the series into smaller chunks that make sense to review individually. Initially I was thinking that we might review the NIR stuff before the i965 bits... maybe that's a reasonable way to start with this, including the minimum i965 stuff required to not break non-double things.

Does this make sense to you?
Comment 60 Jason Ekstrand 2016-03-01 17:12:59 UTC
(In reply to Iago Toral from comment #58)
> (In reply to Jason Ekstrand from comment #30)
> > Created attachment 120957 [details]
> > NIR indirect lowering pass
> > 
> > (In reply to Iago Toral from comment #29)
> > > Hey Jason/Connor,
> > > 
> > > the lowering of trunc for doubles has some code that looks like this
> > > (pseudo-code):
> > > 
> > > if (exponent < 0) {
> > >    mask = 0x0
> > > } else if (exponent > 52) {
> > >    mask = 0x7fffffffffffffff;
> > > } else {
> > >    /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
> > >    mask =  (1LL << frac_bits) - 1;
> > > }
> > > 
> > > The current implementation I have works fine using bcsel. It looks something
> > > like this (again, pseudo-code):
> > > 
> > > mask = bcsel(exponent < 0,
> > >              0x7fffffffffffffff,
> > >              bcsel(exponent > 52,
> > >                    0x0000000000000000,
> > >                    (1LL << frac_bits) -1))
> > > 
> > > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> > > operation that we have to implement in terms of hi/lo 32-bit integer
> > > operations (at least until we support 64-bit integers), so it is really a
> > > bunch of instructions. Because I use bcsel, it means that we generate code
> > > for that even if exponent is not in [1..51], which is not ideal.
> > 
> > Right.  I would encourage you not to use if's too much because branching may
> > be more expensive than bcsel depending on what paths different invocations
> > take.  However, if one side of the if is overwhelmingly more likely than the
> > other, then control-flow is probably a good idea.
> 
> I have been revisiting this. Because if statements in NIR are strictly
> scalar, this lowering needs to be scalarized as well. I wonder if the
> scalarized code resulting of this defeats the purpose of using the if
> statement for the vec4 backend, since we lose the ability to use vector
> instructions.
> 
> Some quick experiments with a simple trunc() test show these results
> (#instructions):
> 
> backed         bcsel      if (unscalarized)     if (scalarized)
> ----------------------------------------------------------------
> vec4            65              69                   102
> fs (simd8)      67              85                    85
> fs (simd16)     95             119                   119
> 
> bcsel implementations have less overall instructions as expected, although
> as discussed before, if implementations may be better in some cases since
> they might end up executing less instructions in some cases. However, it is
> clear that the required scalarization for the if statement in the vec4
> backend makes things much worse, to a point that  I am not sure any more
> that this is a win in this scenario.
> 
> So we have 2 options again:
> 
> 1) Go back to the bcsel implementation for both backends.
> 2) Pass an is_scalar flag to the lowering pass, choose the bcsel
> implementation for non scalar backends and the scalarized if implementation
> for scalar.
> 
> 2) _might_ be better overall from a performance standpoint but I wonder if
> it is worth having two different implementations of this. This decision
> would also affect the implementation of roundEven().
> 
> What do you think?

I'm sorry that it has taken so long for me to get back to you.  We've had a lot going on at the office lately.

I think (1) is probably the best option for a couple of reasons:

 a) Simplicity, as you mentioned, and the ability to vectorize in vec4.

 b) Even in simd8 mode, you are executing 8 threads at a time and you will only actually skip instructions if all 8 threads take the same side of the branch.  While this may happen fairly frequently depending on the context, there's still a decent chance that all of the instructions will get executed anyway.

 c) Even if (b) isn't a problem, using bcsel gives us much more freedom when scheduling instructions.  Any sort of control flow acts as an instruction scheduling barrier and prevents us from moving things around.  This is a problem for both register pressure (causing spilling) and instruction latency.  If we use bcsel, there's a decent chance that we can move instructions around and hide latency enough to make up for the extra instructions being executed.
Comment 61 Jason Ekstrand 2016-03-01 17:22:11 UTC
(In reply to Iago Toral from comment #59)
> I think the fp64 is now mostly complete, in the last days we have been going
> through the patch set cleaning things up a bit in preparation for review,
> the result is in this branch:
> 
> https://github.com/Igalia/mesa/tree/i965-fp64
> 
> FYI, we have also expanded the fp64 piglit tests to cover things that we
> found weren't covered by existing tests. There is a branch with these here:
> 
> https://github.com/Igalia/piglit/tree/arb_gpu_shader_fp64
> 
> It still needs a rebase and some clean up work, but we should be sending
> this for review soon too.
> 
> As it is now, we pass all the fp64 tests, except for a few that fail because
> of spilling, that could probably take some optimization work to fix. We also
> have no regressions in non-double funcionality in any hardware generation
> that we have available (gen5 to gen9).

Yes, spilling is important but I think we can start landing things before we get that 100% fixed.

> There are still a couple of things that we need to confirm so we can squash
> or remove a few patches in the series:
> 
> 1. We need to make a decision about d2b. We discussed in the list that f2b
> does not need to handle denorms and neither does NVIDIA. On the other hand,
> NVIDIA does handle d2b with denorms (that is, it returns true for them).
> Since doubles are more about precission than performance this could make
> sense for us too. If we want this behavior I have a patch for this in the
> series, following a similar approach to the one I sento to the list for f2b,
> (I should get rid of the resolve_source_modifiers() for gen8 as Jason
> mentioned for the f2b case):
> 
> https://github.com/Igalia/mesa/commit/
> df6c1d940be455d0e40d9eee2cb0e8987e4baa16

I think the spec allows for a lot of wiggle room.  With doubles, it may be easier to implement with bitwise operations like you did in that f2b patch.  Apps shouldn't count on any particular behaviour there other than 0.0 -> false and obviously non-zero -> true.  Any app that counts on the behaviour of d2b on denorms is going to be broken.

> 2. We need to make a decision regarding the scalarization of the lowerings
> of trunc() and roundEven() in NIR to use if statements instead of bcsel. For
> scalar this seems like a possible win, but for vec4 it probably isn't since
> we lose the ability to do vector instructions. See my previous comment for
> more details.

See above.

> With those two questions answered we would have everything ready for review,
> the only important thing missing I can think of is spilling. As I mentioned
> above there are a few piglit tests that fail because of this, I think this
> might be a combination of the spilling code not being able to spill
> registers with a size > 1 (at least for vec4) and some needed optimization
> work, but I think it is probably a good idea to start the review while we
> look into this.
> 
> With the two questions above pending, the series should be 190+ patches
> long, so it is pretty big. The composition looks more or less like this:
> 
> 1) execsize/width fixes: 9 patches (the ones we sent in December to the list)
> 2) NIR: ~60 patches
> 3) i965: ~130 patches
> 
> Notice that some of the i965 patches are _required_ after the NIR patches so
> that we don't break things in the driver because NIR introduces sized types
> that drivers need to handle, even if we don't care about fp64 yet.
> 
> Also, as soon as we get sized types in NIR we are going to break drivers
> that don't support sized types (freedreno/vc4). I suppose we should ping
> Eric and Rob and point them at our branch so they can have some time to fix
> their drivers accordingly. We did fix some obvious things for them, but they
> should run piglit and probably fix some more stuff.

That should happen sooner rather than later.

> I suppose we could try to break the series into smaller chunks that make
> sense to review individually. Initially I was thinking that we might review
> the NIR stuff before the i965 bits... maybe that's a reasonable way to start
> with this, including the minimum i965 stuff required to not break non-double
> things.

I think the thing to do there would be to break at least one chunk off.  Namely, the bare minimum NIR and i965 changes required to support the new sized types.  That will all have to be squashed into a single commit along with the vc4 and freedreno changes.  We should get that out ASAP so that Rob and Eric have a chance to fix their drivers so we aren't waiting on them before we can start landing the rest of it.

Beyond that, I don't really care how it's split up.  The NIR pass for doing double lowering can be its own thing.  I guess fs and vec4 could be separate.  I'll leave it up to you how you want to do that.  It's the same set of patches at the end of the day.
Comment 62 Iago Toral 2016-03-02 07:32:41 UTC
(In reply to Jason Ekstrand from comment #60)
> (In reply to Iago Toral from comment #58)
> > (In reply to Jason Ekstrand from comment #30)
> > > Created attachment 120957 [details]
> > > NIR indirect lowering pass
> > > 
> > > (In reply to Iago Toral from comment #29)
> > > > Hey Jason/Connor,
> > > > 
> > > > the lowering of trunc for doubles has some code that looks like this
> > > > (pseudo-code):
> > > > 
> > > > if (exponent < 0) {
> > > >    mask = 0x0
> > > > } else if (exponent > 52) {
> > > >    mask = 0x7fffffffffffffff;
> > > > } else {
> > > >    /* This is a 64-bit integer op, needs to be split into hi/lo 32-bit ops */
> > > >    mask =  (1LL << frac_bits) - 1;
> > > > }
> > > > 
> > > > The current implementation I have works fine using bcsel. It looks something
> > > > like this (again, pseudo-code):
> > > > 
> > > > mask = bcsel(exponent < 0,
> > > >              0x7fffffffffffffff,
> > > >              bcsel(exponent > 52,
> > > >                    0x0000000000000000,
> > > >                    (1LL << frac_bits) -1))
> > > > 
> > > > My problem with this is that "(1LL << frac_bits) - 1" is a 64-bit integer
> > > > operation that we have to implement in terms of hi/lo 32-bit integer
> > > > operations (at least until we support 64-bit integers), so it is really a
> > > > bunch of instructions. Because I use bcsel, it means that we generate code
> > > > for that even if exponent is not in [1..51], which is not ideal.
> > > 
> > > Right.  I would encourage you not to use if's too much because branching may
> > > be more expensive than bcsel depending on what paths different invocations
> > > take.  However, if one side of the if is overwhelmingly more likely than the
> > > other, then control-flow is probably a good idea.
> > 
> > I have been revisiting this. Because if statements in NIR are strictly
> > scalar, this lowering needs to be scalarized as well. I wonder if the
> > scalarized code resulting of this defeats the purpose of using the if
> > statement for the vec4 backend, since we lose the ability to use vector
> > instructions.
> > 
> > Some quick experiments with a simple trunc() test show these results
> > (#instructions):
> > 
> > backed         bcsel      if (unscalarized)     if (scalarized)
> > ----------------------------------------------------------------
> > vec4            65              69                   102
> > fs (simd8)      67              85                    85
> > fs (simd16)     95             119                   119
> > 
> > bcsel implementations have less overall instructions as expected, although
> > as discussed before, if implementations may be better in some cases since
> > they might end up executing less instructions in some cases. However, it is
> > clear that the required scalarization for the if statement in the vec4
> > backend makes things much worse, to a point that  I am not sure any more
> > that this is a win in this scenario.
> > 
> > So we have 2 options again:
> > 
> > 1) Go back to the bcsel implementation for both backends.
> > 2) Pass an is_scalar flag to the lowering pass, choose the bcsel
> > implementation for non scalar backends and the scalarized if implementation
> > for scalar.
> > 
> > 2) _might_ be better overall from a performance standpoint but I wonder if
> > it is worth having two different implementations of this. This decision
> > would also affect the implementation of roundEven().
> > 
> > What do you think?
> 
> I'm sorry that it has taken so long for me to get back to you.  We've had a
> lot going on at the office lately.

Don't worry at all, we were not blocked on this. We have both implementations so it is only a matter of deciding which one to go with.

> I think (1) is probably the best option for a couple of reasons:
> 
>  a) Simplicity, as you mentioned, and the ability to vectorize in vec4.
> 
>  b) Even in simd8 mode, you are executing 8 threads at a time and you will
> only actually skip instructions if all 8 threads take the same side of the
> branch.  While this may happen fairly frequently depending on the context,
> there's still a decent chance that all of the instructions will get executed
> anyway.
> 
>  c) Even if (b) isn't a problem, using bcsel gives us much more freedom when
> scheduling instructions.  Any sort of control flow acts as an instruction
> scheduling barrier and prevents us from moving things around.  This is a
> problem for both register pressure (causing spilling) and instruction
> latency.  If we use bcsel, there's a decent chance that we can move
> instructions around and hide latency enough to make up for the extra
> instructions being executed.

Yep, that makes a good case for 1), let's go with that.
Comment 63 Iago Toral 2016-03-02 07:45:38 UTC
(In reply to Jason Ekstrand from comment #61)
> (In reply to Iago Toral from comment #59)
(...) 
> > There are still a couple of things that we need to confirm so we can squash
> > or remove a few patches in the series:
> > 
> > 1. We need to make a decision about d2b. We discussed in the list that f2b
> > does not need to handle denorms and neither does NVIDIA. On the other hand,
> > NVIDIA does handle d2b with denorms (that is, it returns true for them).
> > Since doubles are more about precission than performance this could make
> > sense for us too. If we want this behavior I have a patch for this in the
> > series, following a similar approach to the one I sento to the list for f2b,
> > (I should get rid of the resolve_source_modifiers() for gen8 as Jason
> > mentioned for the f2b case):
> > 
> > https://github.com/Igalia/mesa/commit/
> > df6c1d940be455d0e40d9eee2cb0e8987e4baa16
> 
> I think the spec allows for a lot of wiggle room.  With doubles, it may be
> easier to implement with bitwise operations like you did in that f2b patch. 
> Apps shouldn't count on any particular behaviour there other than 0.0 ->
> false and obviously non-zero -> true.  Any app that counts on the behaviour
> of d2b on denorms is going to be broken.

If we want the easiest implementation, it is actually the one that does not handle denorms, since that is just a CMP plus a swizzled MOV to generate a 32-bit result. The bitwise implementation requires an AND (to remove the sign), an OR of both 32-bit chunks of the DF (to generate a 32-bit result we can compare with 0) and a CMP to generate the boolean result.

So, I'll go with the CMP + MOV solution unless you prefer otherwise.

(...)

> > Also, as soon as we get sized types in NIR we are going to break drivers
> > that don't support sized types (freedreno/vc4). I suppose we should ping
> > Eric and Rob and point them at our branch so they can have some time to fix
> > their drivers accordingly. We did fix some obvious things for them, but they
> > should run piglit and probably fix some more stuff.
> 
> That should happen sooner rather than later.
> 
> > I suppose we could try to break the series into smaller chunks that make
> > sense to review individually. Initially I was thinking that we might review
> > the NIR stuff before the i965 bits... maybe that's a reasonable way to start
> > with this, including the minimum i965 stuff required to not break non-double
> > things.
> 
> I think the thing to do there would be to break at least one chunk off. 
> Namely, the bare minimum NIR and i965 changes required to support the new
> sized types.  That will all have to be squashed into a single commit along
> with the vc4 and freedreno changes.  We should get that out ASAP so that Rob
> and Eric have a chance to fix their drivers so we aren't waiting on them
> before we can start landing the rest of it.

Sounds good to me, we will start preparing that and ping Eric and Rob. We'll send the result for review as soon as it is done.

And thanks for the quick reply, as always! :)

> Beyond that, I don't really care how it's split up.  The NIR pass for doing
> double lowering can be it's own thing.  I guess fs and vec4 could be
> separate.  I'll leave it up to you how you want to do that.  It's the same
> set of patches at the end of the day.
Comment 64 Iago Toral 2016-03-14 15:05:51 UTC
Since we imagine that reviewing the gen8+ fp64 implementation is going to take a while, we have already started implementing support for gen7. With a few fixes to handle 64-bit immediates (unsupported by gen7 hardware) and to correct some bugs in copy propagation triggered by this work, gen7 is already looking pretty good, with only 92 fails out of ~2900 tests. However, there is a major issue that we have bumped into: apparently gen7 does not like writes with a stride > 1, which we need to do all the time in fp64.

The Haswell PRM (vol7, 3D Media GPGPU Engine, Register Region Restrictions) says the following:

"When destination spans two registers, the source MUST span two registers."

This restriction is not present in the Broadwell PRMs. Unfortunately, it looks like changing things to obey it does not fix anything. We bumped into this while implementing support for 64-bit immediates. Our initial implementation would do something like this (pseudo-code):

fs_reg setup_imm_df(double v) {
   vgrf<double> tmp;
   tmp = retype(tmp, unsigned)

   vgrf<double> v_low, v_high;
   v_low = retype(v_low, unsigned)
   v_high = retype(v_high, unsigned)

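   /* write the low/high 32 bits of the immediate into the UD temporaries */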
   mov(v_low, brw_imm_ud(low32(v)));
   mov(v_high, brw_imm_ud(high32(v)));

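   /* interleave the halves into tmp with stride-2 UD writes (each spans 2 registers) */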
   mov(stride(tmp, 2), stride(v_low, 2));
   mov(stride(horiz_offset(tmp, 1), 2), stride(v_high, 2));

   return retype(tmp, double)
}

That implementation respects the HSW restriction for writes that span two registers; however, I found that the second SIMD register (reg_offset 1) of tmp is never written. Of course, we don't need to do this for immediates (we can just return a stride-0 region for tmp and that works fine), but this is going to be a problem everywhere else.

I am not sure what to do about this, since I did not find anything useful in the PRMs to explain what could be going on other than that restriction, which we are already obeying. Do you have any ideas? If not, would it be possible for someone at Intel to run this through the simulator (I'd provide the aub trace file) to check if that gives any clues?
Comment 65 Iago Toral 2016-03-14 15:12:33 UTC
> fs_reg setup_imm_df(double v) {
>    vgrf<double> tmp;
>    tmp = retype(tmp, unsigned)
> 
>    vgrf<double> v_low, v_high;
>    v_low = retype(v_low, unsigned)
>    v_high = retype(v_high, unsigned)
> 
>    mov(v_low, brw_imm_ud(low32(v)));
>    mov(v_high, brw_imm_ud(high32(v)));

Just to be clear, these two MOVs above move to both reg_offset 0 and reg_offset 1, so they write both SIMD registers allocated for both v_low and v_high.

>    mov(stride(tmp, 2), stride(v_low, 2));
>    mov(stride(horiz_offset(tmp, 1), 2), stride(v_high, 2));
> 
>    return retype(tmp, double)
> }
Comment 66 Iago Toral 2016-03-28 12:25:18 UTC
Jason, Connor:

Last week Curro spent some time looking at our fp64 branch and testing some things, and we have been discussing some aspects of fp64 on this hardware that are not all that well documented (or not documented at all :)) and that may have important implications for the implementation, specifically for the vec4 backend.

I am sharing the main takeaways from that discussion here so we can have them documented for future reference. For some of these we need to discuss solutions; Curro has some suggestions, which I share below, and I'd like to know your thoughts on them.

1. It seems that IVB requires fp64 instructions to have their execsize doubled, unlike HSW+, where the execsize matches the actual number of fp64 elements processed. I think we might be able to do this in brw_set_dest, at the same place where we adjust the execution size of normal instructions, acting only when we detect a dst with a 64-bit type.

2. It seems that the ExecMask in fp64 instructions is interpreted as 1 bit per fp64 component and not as 1 bit per 32-bit component. This has implications for non-uniform control flow. For example, in a gen7 VS SIMD4x2 execution (2 logical threads, one for each vertex) of a dvec2 fp64 ALU operation subject to control flow, the execution of the operation for the first vertex is governed by ExecMask bits 0 and 1, which map to components X/Y of the first logical thread, while execution for the second vertex will consider bits 2 and 3 of the ExecMask, which map to components Z/W, also of the first thread. This means that the operation will execute for the second vertex or not depending on whether it executes for the first vertex. Piglit does not have any instances of fp64 tests involving non-uniform control flow like this, which is why we did not observe this behavior before, but we did write an ad-hoc test that seems to verify that this problem exists. 

Curro suggested that we can work around this problem in HSW+ by taking advantage of the fact that in these gens it is possible to process 8 DF elements in ALU operations, so we can simply not split dvec4 operations into dvec2 chunks, which would fix the problem automatically. Unfortunately, this also means that we need to figure out a solution to the problem of swizzles and writemasks working on 32-bit elements (more on that topic below).

Unfortunately, that solution would not work for IVB, since that hardware cannot run ALU instructions operating on more than 4 DF components. Curro thinks that in this case we could divide the SIMD4x2 execution into 2 SIMD4x1 instructions. That is, we generate one dvec4 instruction for each logical thread and use NibCtrl to fix up execution masking for the second instruction.

3. If we implement any solution that involves getting rid of the double splitting pass, like the ones suggested above, we need to figure out a new solution to address the fact that swizzles operate on 32-bit elements, since we are back to a situation where we have to deal with dvec4 operands.

Below is a quick reference of the layout of a full dvec4 operand on a SIMD8 register (the l/h suffixes stand for the low/high 32 bits).


         low dvec2       high dvec2 
       0   1   2   3  | 4   5   6   7 <- 32-bit channels
       xl  xh  yl  yh   zl  zh  wl  wh


My understanding here is that the hardware is going to replicate the swizzle we use across both dvec2 boundaries, so we should be able to represent a logical swizzle like XZ as XY (which would select the first 2 32-bit channels on each side of the dvec2 boundary) directly, but there will be a lot of combinations that are not directly representable. Also, we can't operate on logical channels that belong to different sides of the dvec2 boundary (i.e. add vgrf1.x:df vgrf2.x:df vgrf3.w:df). To do something like this I think we might need to use Align1 mode to "align" the data so it sits on the same side of the dvec2 boundary. This only gets more complicated if we think about all the permutations of swizzles that are possible... I suppose we will need something to break the swizzles into separate things we can handle individually, etc., and then something else to recombine the results.

4. Apparently, in fp64 some writemasks operate on 64-bit elements and others operate on 32-bit elements (!). Specifically, Curro says that XY and ZW operate as 32-bit and map to components XZ and YW, whereas any other writemask represents 64-bit components. This means that our current code that uses the splitting pass and XY/ZW writemasks is actually operating with 32-bit channels as we thought, but it is not really writing to the channels we thought it was :). This is probably irrelevant in the current implementation because we only ever use one of these 2 writemasks and they both write to different channels. However, if we start working with full dvec4 elements, this is going to matter. In fact, this means that logical writemasks XY and ZW cannot be represented in hardware (!) and that we would need to work around them by using temporaries and swizzling data around, which immediately combines with the issue above to create a small version of hell :(.

5. It seems that gen7 hardware has a bug by which writes that span more than one register and operate on 32-bit components won't work. Unfortunately, we use this all the time in fp64, since accessing data in 32-bit chunks is a common pattern. Curro verified this in the simulator: apparently there is inconsistent execmasking with the second half of the instruction in this situation, which results in the second part of the write not happening reliably. Curro thinks that the best way to deal with this is to split the operations in two, each one operating on a single register, which would require monkeying with NibCtrl again.

Opinions?
Comment 67 Francisco Jerez 2016-03-28 21:27:44 UTC
(In reply to Iago Toral from comment #66)
>[..] 
> 2. It seems that the ExecMask in fp64 instructions is interpreted as 1-bit
> per fp64 component and not as 1-bit per 32-bit component. This has
> implications for non-uniform control flow. For example, in a gen7 VS SIMD4x2
> execution (2 logical threads, one for each vertex) of a dvec2 fp64 ALU
> operation subject to control-flow, the execution of the operation for the
> first vertex is done according to the ExecMask bits 0 and 1, which map
> components X/Y of the first logical thread and execution of the second
> vertex will consider bits 2 and 3 of the ExecMask which map to components
> Z/W , also of the first thread. This means that the operation will execute
> for the second vertex or not depending on whether the operation executes for
> the first vertex. Piglit does have any instances of fp64 tests involving
> non-uniform control flow like this which is why we did not observe this
> behavior before, but we did write an ad-hoc test that seems to verify that
> this problem exists. 
> 
> Curro suggested that we can work around this problem in HSW+ by taking
> advantage of the fact that in these gens it is possible to process 8 DF
> elements in ALU operations, so we can simply not split dvec4 operations in
> dvec2 chunks, which would fix the problem automatically. Unfortunately, this
> also means that we need to figure out a solution to the problem with
> swizzles and writemasks working on 32-bit elements (more on that topic
> below).
> 
> Unfortunately, that solution would not work for IVB, since that hardware
> cannot run ALU instructions operating on more than 4 DF components. Curro
> thinks that in this case we could divide the SIMD4x2 execution in 2 SIMD4x1
> instructions. That is, we generate one dvec4 instruction for each logical
> thread and use NibCtrl to fixup execution masking for the second instruction.
> 

I don't think that getting this to work on IVB would be particularly difficult -- you can just emit 8-wide instructions universally and then hook up the FS back-end's SIMD lowering pass so things are chopped up into pieces that IVB parts will be able to handle.  NibCtrl seems to work okay on IVB for this purpose; we just need some way to represent it in the VEC4 IR.

>[...]
> 5) It seems that gen7 hardware has a bug by which writes that span more than
> one register and operate on 32-bit components won't work. Unfortunately, we
> use this all the time in fp64 since accessing data in 32-bit element chunks
> is a common pattern, Curro verified this in the simulator, apparently there
> is inconsistent execmasking in this situation with the second half of the
> instruction which results in the second part of the write not happening
> reliably. Curro thinks that the best way to deal with this is to split the
> operations in two, each one operating on a single register, which would
> require to monkey with NibCtrl again.

For completeness, the problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is the 8-bit quarter of the execution mask signals specified in the instruction control fields) for the second compressed half of any single-precision instruction (for double-precision instructions it's hardwired to use NibCtrl+1), which means that the EU will apply the wrong execution controls for the second sequential GRF write if the number of channels per GRF is not exactly eight in single-precision mode (or four in double-float mode).

> 
> Opinions?
Comment 68 Samuel Iglesias Gonsálvez 2016-04-01 12:07:44 UTC
Hello Jason, Connor:

I have found a problem with mod() working with double operands: mod(x, y) is calculated as x - y * floor(x/y).

When working with doubles on i965, we need to do two lowering passes:
one to calculate the reciprocal of 'y' and another for the floor().

If x == y, then x/y == 1.0, but I found that the calculated value is slightly less than 1.0 because of rounding errors. In that case, floor() returns 0.0 and mod(x,y) ends up returning 'y' instead of 0. This does not happen for all values, which is why we have not seen it before.

We propose to create a lowering pass for mod() with doubles in nir_lowering_double_ops, where we check if x == y and return zero or the formula depending on the case. It would affect the performance of the mod() operation due to the added bcsel, but as it only affects doubles (we can make nir_opt_algebraic skip the doubles case), we think it is not a bad idea.

What do you think?
Comment 69 Iago Toral 2016-04-02 08:15:45 UTC
(In reply to Samuel Iglesias from comment #68)
> Hello Jason, Connor:
> 
> I have found a problem with mod() working with double operands: mod(x, y) is
> calculated as x - y * floor(x/y).
> 
> When working with doubles on i965, we need to do two lowering passes:
> one to calculate the reciprocal of 'y' and other for the floor().
> 
> If x == y, then x/y == 1.0 but I found that calculated value is slightly
> less than 1.0 because of rounding errors. In that case, floor() returns 0.0
> and mod(x,y) ends up returning 'y' instead of 0. This is not happening for
> all values and that's the reason we have not seen it before.
> 
> We propose to create a lowering pass for mod() with doubles in
> nir_lowering_double_ops, where we check if x == y and return zero or the
> formula depending on the case. It would affect the performance for mod()
> operation due to the added bcsel but, as it only affects doubles (we can
> make nir_opt_algebraic to skip the double's case), we think it is not a bad
> idea.
> 
> What do you think?

I've been thinking about this, and this is not a good idea; we probably have the same problem every time that a = b * N where N is an integer. If we want to fix this, I think we need to check the fractional part of a / b and round the result if it is "almost" 1.0... That is going to lead to some awful code, because round is also lowered, and it is a truly ugly hack, but at least it should get things like mod(10, 5) or mod(10.6, 5.2) working as expected. It might break things like mod(10, 4.999...), but I think that's probably more acceptable.

BTW, we found this issue in the desktop fp64 CTS tests.
Comment 70 Iago Toral 2016-04-04 09:19:12 UTC
Created attachment 122690 [details]
Sample test showcasing non-uniform control flow problems in align16

I forgot to attach this. The test executes non-uniform control flow with DF operations in a vertex shader. If we run this in gen7, it will execute in align16 mode.

The test renders a quad and is expected to produce 3 red vertices and 1 green vertex, but the result is 2 red and 2 green, which means that one vertex is taking the incorrect branch of the conditional, confirming Curro's point that we have an execmask problem. The test produces the expected result on gen8 because there the VS runs in scalar mode.
Comment 71 Grazvydas Ignotas 2016-04-06 20:03:20 UTC
Out of interest I've tried the https://github.com/Igalia i965-attrib64 branch and it breaks font rendering on my Skylake machine (xfce desktop, DRI3, X.Org 1.18.1). I could try bisecting if that helps; I just thought it might be something known or expected.
Comment 72 Jason Ekstrand 2016-04-06 23:17:35 UTC
(In reply to Iago Toral from comment #66)
> Jason, Connor:
> 
> last week Curro spent some time looking at our fp64 branch and testing some
> things and we have been discussing some aspects of the hardware in fp64 that
> are not all that well documented (or not even documented at all :)) and that
> may have some important implications in the implementation, specifically for
> the vec4 backend.
>
> Opinions?

Short version:

a) The hardware is busted.
b) I think Curro knows what he's talking about.  :-)

Longer version:

I see a couple of options here:  One is to just scalarize all double stuff.  On Ivy Bridge, I think you would also have to double up instructions, one for each half.  It's not a great option from the perspective of performance but is perhaps the easiest to implement.

The second option is what Curro's suggesting where you try and use the hardware as much as possible and fall back to nasty things only when you have to.  Unfortunately, this is going to cause a lot of pain in the generator because suddenly lots of stuff may become align1 at least on IVB.

If we want to do option 2, I think we want some sort of lowering pass that runs after all the optimizations and before generation that applies all the nasty work-arounds.  Prior to the lowering pass we would treat swizzles and writemasks as something sensible, such as actually working on dvec4 components.  The lowering pass would turn those into something the hardware can handle and maybe add extra code to resolve swizzles and writemasks that the hardware can't do natively.  On Ivy Bridge, the hardware can do almost nothing natively and the pass would be aware of that.  It would need to run after all the optimizations because we want optimizations to be able to assume swizzles are sensible so we don't have to implement crazy hardware restrictions at every level.  If it's practical, this seems like the best plan.

As far as the NIR splitting pass goes: If it's not useful, throw it away!  No one will get their feelings hurt.  Connor and I were working under the assumption that the hardware was at least mildly sane and, as it turns out, we were wrong.
Comment 73 Francisco Jerez 2016-04-06 23:47:36 UTC
(In reply to Jason Ekstrand from comment #72)
> (In reply to Iago Toral from comment #66)
> > Jason, Connor:
> > 
> > last week Curro spent some time looking at our fp64 branch and testing some
> > things and we have been discussing some aspects of the hardware in fp64 that
> > are not all that well documented (or not even documented at all :)) and that
> > may have some important implications in the implementation, specifically for
> > the vec4 backend.
> >
> > Opinions?
> 
> Short version:
> 
> a) The hardware is busted.
> b) I think Curro knows what he's talking about.  :-)
> 
> Longer version:
> 
> I see a couple of options here:  One is to just scalarize all double stuff. 
> On Ivy Bridge, I think you would also have to double up instructions, one
> for each half.  It's not a great option from the perspective of performance
> but is perhaps the easiest to implement.
> 
> The second option is what Curro's suggesting where you try and use the
> hardware as much as possible and fall back to nasty things only when you
> have to.  Unfortunately, this is going to cause a lot of pain in the
> generator because suddenly lots of stuff may become align1 at least on IVB.
> 
Just a short comment on this point: I don't think IVB will be much worse. From the functional point of view it's not that much different from HSW+; its primary limitation is that Align16 FP64 instructions can't do more than one dvec4 at a time, but that's easily solvable by hooking up the SIMD width lowering pass (in addition to the swizzle and writemask lowering pass that could be used on later gens), because NibCtrl behaves as expected on FP64 Align16 instructions even on IVB, thankfully.

> If we want to do option 2, I think we want some sort of lowering pass that
> runs after all the optimizations and before generation that applies all the
> nasty work-arounds.  Proior to the lowering pass we would treat swizzles and
> writemasks as something sensible such as actually working on dvec4
> components.  The lowering pass would turn those into something the hardware
> can handle and maybe add extra code to resolve swizzles and writemasks that
> the hardware can't do natively.

Sounds reasonable.

> On Ivy Bridge, the hardware can do almost
> nothing natively and the pass would be aware of that.  It would need to run
> after all the optimizations because we want optimizations to be able to
> assume swizzles are sensible so we don't have to implement crazy hardware
> restrictions at every level.  If it's practical, this seems like the best
> plan.
> 
> As far as the NIR splitting pass goes: If it's not useful, throw it away! 
> No one will get their feelings hurt.  Connor and I were working under the
> assumption that the hardware was at least mildly sane and, as it turns out,
> we were wrong.
Comment 74 Iago Toral 2016-04-07 06:40:20 UTC
(In reply to Grazvydas Ignotas from comment #71)
> Out of the interest I've tried the https://github.com/Igalia i965-attrib64
> branch and it breaks font rendering on my Skylake machine (xfce desktop,
> DRI3, X.Org 1.18.1). I could try bisecting if that helps, just thought it
> might be something known or expected.

Thanks for reporting this! That branch is not the one we are working with, though; it is for the development of ARB_vertex_attrib_64bit, which is still at an early stage and is done on top of i965-fp64, which is the branch that has the fp64 implementation for gen8+. It would be interesting if you could try i965-fp64 to see whether your problems are strictly related to the fp64 implementation or to vertex attrib 64. We are aware of a number of problems in the fp64 implementation that we have been discussing here, but I don't think any of them should lead to the issues you report, so if you can see that problem with the i965-fp64 branch alone it might be something else that we need to look into.
Comment 75 Iago Toral 2016-04-07 06:52:36 UTC
(In reply to Jason Ekstrand from comment #72)
> (In reply to Iago Toral from comment #66)
> > Jason, Connor:
> > 
> > last week Curro spent some time looking at our fp64 branch and testing some
> > things and we have been discussing some aspects of the hardware in fp64 that
> > are not all that well documented (or not even documented at all :)) and that
> > may have some important implications in the implementation, specifically for
> > the vec4 backend.
> >
> > Opinions?
> 
> Short version:
> 
> a) The hardware is busted.
> b) I think Curro knows what he's talking about.  :-)
> 
> Longer version:
> 
> I see a couple of options here:  One is to just scalarize all double stuff. 
> On Ivy Bridge, I think you would also have to double up instructions, one
> for each half.  It's not a great option from the perspective of performance
> but is perhaps the easiest to implement.
> 
> The second option is what Curro's suggesting where you try and use the
> hardware as much as possible and fall back to nasty things only when you
> have to.  Unfortunately, this is going to cause a lot of pain in the
> generator because suddenly lots of stuff may become align1 at least on IVB.
> 
> If we want to do option 2, I think we want some sort of lowering pass that
> runs after all the optimizations and before generation that applies all the
> nasty work-arounds.  Proior to the lowering pass we would treat swizzles and
> writemasks as something sensible such as actually working on dvec4
> components.  The lowering pass would turn those into something the hardware
> can handle and maybe add extra code to resolve swizzles and writemasks that
> the hardware can't do natively.  On Ivy Bridge, the hardware can do almost
> nothing natively and the pass would be aware of that.  It would need to run
> after all the optimizations because we want optimizations to be able to
> assume swizzles are sensible so we don't have to implement crazy hardware
> restrictions at every level.  If it's practical, this seems like the best
> plan.

OK, I think we can try to do 2) first and fall back to 1) if we run into too much trouble for some reason. At this point it is difficult to be certain that the hardware is going to work in any specific way, especially across all these generations...

> As far as the NIR splitting pass goes: If it's not useful, throw it away! 
> No one will get their feelings hurt.  Connor and I were working under the
> assumption that the hardware was at least mildly sane and, as it turns out,
> we were wrong.
Comment 76 Iago Toral 2016-04-07 06:54:58 UTC
(In reply to Francisco Jerez from comment #73)
> (In reply to Jason Ekstrand from comment #72)
> > (In reply to Iago Toral from comment #66)
> > > Jason, Connor:
> > > 
> > > last week Curro spent some time looking at our fp64 branch and testing some
> > > things and we have been discussing some aspects of the hardware in fp64 that
> > > are not all that well documented (or not even documented at all :)) and that
> > > may have some important implications in the implementation, specifically for
> > > the vec4 backend.
> > >
> > > Opinions?
> > 
> > Short version:
> > 
> > a) The hardware is busted.
> > b) I think Curro knows what he's talking about.  :-)
> > 
> > Longer version:
> > 
> > I see a couple of options here:  One is to just scalarize all double stuff. 
> > On Ivy Bridge, I think you would also have to double up instructions, one
> > for each half.  It's not a great option from the perspective of performance
> > but is perhaps the easiest to implement.
> > 
> > The second option is what Curro's suggesting where you try and use the
> > hardware as much as possible and fall back to nasty things only when you
> > have to.  Unfortunately, this is going to cause a lot of pain in the
> > generator because suddenly lots of stuff may become align1 at least on IVB.
> > 
> Just a short comment on this point: I don't think IVB will be much worse.
> From the functional point of view it's not that much different from HSW+,
> its primary limitation is that Align16 FP64 instructions can't do more than
> one dvec4 at a time, but that's easily solvable by hooking up the SIMD width
> lowering pass (in addition to the swizzle and writemask lowering pass that
> could be used on later gens), because NibCtrl behaves as expected on FP64
> Align16 instructions even on IVB thankfully.

I think there is still a problem with this: NibCtrl only works with DF instructions, but we would still need to deal with UD access to DF data... wouldn't that be broken in this case?
Comment 77 Francisco Jerez 2016-04-07 21:09:57 UTC
(In reply to Iago Toral from comment #76)
> (In reply to Francisco Jerez from comment #73)
> > (In reply to Jason Ekstrand from comment #72)
> > > (In reply to Iago Toral from comment #66)
> > > > Jason, Connor:
> > > > 
> > > > last week Curro spent some time looking at our fp64 branch and testing some
> > > > things and we have been discussing some aspects of the hardware in fp64 that
> > > > are not all that well documented (or not even documented at all :)) and that
> > > > may have some important implications in the implementation, specifically for
> > > > the vec4 backend.
> > > >
> > > > Opinions?
> > > 
> > > Short version:
> > > 
> > > a) The hardware is busted.
> > > b) I think Curro knows what he's talking about.  :-)
> > > 
> > > Longer version:
> > > 
> > > I see a couple of options here:  One is to just scalarize all double stuff. 
> > > On Ivy Bridge, I think you would also have to double up instructions, one
> > > for each half.  It's not a great option from the perspective of performance
> > > but is perhaps the easiest to implement.
> > > 
> > > The second option is what Curro's suggesting where you try and use the
> > > hardware as much as possible and fall back to nasty things only when you
> > > have to.  Unfortunately, this is going to cause a lot of pain in the
> > > generator because suddenly lots of stuff may become align1 at least on IVB.
> > > 
> > Just a short comment on this point: I don't think IVB will be much worse.
> > From the functional point of view it's not that much different from HSW+,
> > its primary limitation is that Align16 FP64 instructions can't do more than
> > one dvec4 at a time, but that's easily solvable by hooking up the SIMD width
> > lowering pass (in addition to the swizzle and writemask lowering pass that
> > could be used on later gens), because NibCtrl behaves as expected on FP64
> > Align16 instructions even on IVB thankfully.
> 
> I think there is still a problem with this: the fact that NibCtrl only works
> with DF instructions, but we would still need to deal with UD access to DF
> data... wouldn't that be broken in this case?

Sure, but that problem is fully orthogonal to IVB's level of support for FP64 in Align16 mode and needs to be addressed for later platforms anyway.  IVB's limitation on the execution size is specific to double precision types in Align16 mode, so if you do either Align1 or UD execution type you avoid the problem.
Comment 78 Grazvydas Ignotas 2016-04-07 22:55:25 UTC
(In reply to Iago Toral from comment #74)
> It would be interesting if you could
> try with i965-fp64 to see if your problems are strictly related to the fp64
> implementation or vertex attrib 64.
I've just tested i965-fp64 and it seems to be fine; only i965-attrib64 has the issue.
Comment 79 Iago Toral 2016-04-08 06:12:49 UTC
(In reply to Francisco Jerez from comment #77)
> (In reply to Iago Toral from comment #76)
> > (In reply to Francisco Jerez from comment #73)
> > > (In reply to Jason Ekstrand from comment #72)
> > > > (In reply to Iago Toral from comment #66)
> > > > > Jason, Connor:
> > > > > 
> > > > > last week Curro spent some time looking at our fp64 branch and testing some
> > > > > things and we have been discussing some aspects of the hardware in fp64 that
> > > > > are not all that well documented (or not even documented at all :)) and that
> > > > > may have some important implications in the implementation, specifically for
> > > > > the vec4 backend.
> > > > >
> > > > > Opinions?
> > > > 
> > > > Short version:
> > > > 
> > > > a) The hardware is busted.
> > > > b) I think Curro knows what he's talking about.  :-)
> > > > 
> > > > Longer version:
> > > > 
> > > > I see a couple of options here:  One is to just scalarize all double stuff. 
> > > > On Ivy Bridge, I think you would also have to double up instructions, one
> > > > for each half.  It's not a great option from the perspective of performance
> > > > but is perhaps the easiest to implement.
> > > > 
> > > > The second option is what Curro's suggesting where you try and use the
> > > > hardware as much as possible and fall back to nasty things only when you
> > > > have to.  Unfortunately, this is going to cause a lot of pain in the
> > > > generator because suddenly lots of stuff may become align1 at least on IVB.
> > > > 
> > > Just a short comment on this point: I don't think IVB will be much worse.
> > > From the functional point of view it's not that much different from HSW+,
> > > its primary limitation is that Align16 FP64 instructions can't do more than
> > > one dvec4 at a time, but that's easily solvable by hooking up the SIMD width
> > > lowering pass (in addition to the swizzle and writemask lowering pass that
> > > could be used on later gens), because NibCtrl behaves as expected on FP64
> > > Align16 instructions even on IVB thankfully.
> > 
> > I think there is still a problem with this: the fact that NibCtrl only works
> > with DF instructions, but we would still need to deal with UD access to DF
> > data... wouldn't that be broken in this case?
> 
> Sure, but that problem is fully orthogonal to IVB's level of support for
> FP64 in Align16 mode and needs to be addressed for later platforms anyway. 
> IVB's limitation on the execution size is specific to double precision types
> in Align16 mode, so if you do either Align1 or UD execution type you avoid
> the problem.

I see, yeah, I suppose we should be safe then.
Comment 80 Iago Toral 2016-04-08 06:26:12 UTC
(In reply to Grazvydas Ignotas from comment #78)
> (In reply to Iago Toral from comment #74)
> > It would be interesting if you could
> > try with i965-fp64 to see if your problems are strictly related to the fp64
> > implementation or vertex attrib 64.
> I've just tested i965-fp64 and it seems to be fine, only i965-attrib64 has
> the issue.

Cool, thanks for testing this! I'll let the guys working on that know about this issue. If you're interested, they are tracking their progress in this bug report:

https://bugs.freedesktop.org/show_bug.cgi?id=94442

You mentioned that you could bisect the problem; I am sure that would be really useful for them too.
Comment 81 Samuel Iglesias Gonsálvez 2016-04-11 06:21:24 UTC
(In reply to Iago Toral from comment #69)
> (In reply to Samuel Iglesias from comment #68)
> > Hello Jason, Connor:
> > 
> > I have found a problem with mod() working with double operands: mod(x, y) is
> > calculated as x - y * floor(x/y).
> > 
> > When working with doubles on i965, we need to do two lowering passes:
> > one to calculate the reciprocal of 'y' and other for the floor().
> > 
> > If x == y, then x/y == 1.0 but I found that calculated value is slightly
> > less than 1.0 because of rounding errors. In that case, floor() returns 0.0
> > and mod(x,y) ends up returning 'y' instead of 0. This is not happening for
> > all values and that's the reason we have not seen it before.
> > 
> > We propose to create a lowering pass for mod() with doubles in
> > nir_lowering_double_ops, where we check if x == y and return zero or the
> > formula depending on the case. It would affect the performance for mod()
> > operation due to the added bcsel but, as it only affects doubles (we can
> > make nir_opt_algebraic to skip the double's case), we think it is not a bad
> > idea.
> > 
> > What do you think?
> 
> I've been thinking about this and this is not a good idea, we probably have
> the same problem every time that a = b * N where N is an integer. If we want
> to fix this I think we need to check the fractional part of a / b and round
> the result if it is "almost" 1.0... that is going to lead to some awful
> code, because round is also lowered and it is a truly ugly hack but at least
> it should get things like mod(10, 5)  or mod (10.6, 5.2) working as
> expected. It might break things like mod(10, 4.999...) but I think that's
> probably more acceptable.
> 
> BTW, we found this issue in the desktop fp64 CTS tests.

The generic solution would be: if x = N * y, the lowered floor(x/y) can come out as N - 1, so mod(x,y) evaluates to 'y', which is outside the valid range since mod(x,y) ∈ [0,y). So whenever the computed result equals 'y' we should return 0 instead. I think this is the proper solution, even though it adds a bcsel in that case.
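
For illustration, here is a minimal C sketch of that idea. The helper name is made up, and the exact inputs that trigger the rounding issue depend on the precision of the lowered rcp/floor steps, so this is only meant to show the shape of the lowering plus the added bcsel, not to reproduce the bug on a CPU:

  #include <math.h>

  /* Hypothetical sketch of the proposed lowering for double-precision mod().
   * The naive lowering computes x - y * floor(x * (1/y)); when the lowered
   * rcp/floor steps round x/y down to N - 1 for x = N * y, the result comes
   * out as exactly 'y'.  The final select corresponds to the extra bcsel
   * discussed above and snaps that case back to 0, since mod(x, y) must lie
   * in [0, y). */
  static double
  lowered_dmod(double x, double y)
  {
     double m = x - y * floor(x * (1.0 / y));
     return (m == y) ? 0.0 : m;
  }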

What do you think?
Comment 82 Francisco Jerez 2016-05-04 11:43:46 UTC
I've been running some experiments on vec4 hardware and playing some
combinatorics trying to come up with a plan to address the limitations
of Align16 double-precision instructions we've been discussing.  Part
of the following is optional and could be left out for a first
approximation without sacrificing functional correctness -- The point
is to have a reasonably workable initial implementation that still
allows some room for improvement in the future.

The high-level idea is to have the front-end feed the back-end with
(mostly) unlowered dvec4 instructions that would be processed by the
back-end optimizer as usual, without considering the extremely
annoying limitations of the hardware.  After the back-end optimization
loop some additional (fully self-contained) machinery would lower
double-precision instructions into a form the EUs would be able to
deal with.  The bulk of the required infrastructure would be:

 - A swizzle lowering pass that would eliminate unsupported source
   regions by breaking SIMD vector instructions into chunks so that
   each individual instruction would write a subset of the destination
   components and only use legal swizzle combinations (in cases where
   there is source/destination overlap some minor copying would be
   necessary) -- The execution of each generated vector chunk could
   potentially be pipelined if the destination dependency controls are
   set up correctly.  More details on how the splitting could be done
   below.

 - A writemask lowering pass that would run after swizzle lowering and
   split the resulting instructions into smaller chunks to clean up
   illegal writemask combinations (XY and ZW).  The swizzle lowering
   approach proposed below attempts to minimize the amount of
   additional division that this pass would need to do by emitting
   legal writemask combinations where possible -- In fact the most
   naive swizzle lowering algorithm conceivable that simply expands
   the instruction into scalar components would render this pass
   unnecessary so this could be left out completely in the initial
   implementation.

 - A SIMD width lowering pass that would divide SIMD4x2 instructions
   that are too wide for the hardware to execute or hit some sort of
   restriction that can be eliminated by dividing the instruction into
   independent SIMD4 channels.  The SIMD lowering pass from the scalar
   back-end could be taken as starting point.

The following trick allows representing many swizzles that cross vec2
boundaries by exploiting some of the hardware's stupidity.  Assume you
have a compressed SIMD4x2 DF instruction and a given source has
vstride set to zero and identity swizzle.  For each channel the
hardware would read the following locations from the register file, at
least ideally:

 x0 <- r0.0:df
 y0 <- r0.1:df
 z0 <- r0.0:df
 w0 <- r0.1:df
 x1 <- r0.0:df
 y1 <- r0.1:df
 z1 <- r0.0:df
 w1 <- r0.1:df

assuming that r0.0 is the base register.  This is actually what
happens on BDW+ and doesn't help us much except for representing
uniforms.  Pre-Gen8 hardware however behaves like:

 x0 <- r0.0:df
 y0 <- r0.1:df
 z0 <- r0.0:df
 w0 <- r0.1:df
 x1 <- r1.0:df
 y1 <- r1.1:df
 z1 <- r1.0:df
 w1 <- r1.1:df

due to an instruction decompression bug (or feature?) that causes it
to increment the register offset of the second half by a fixed amount
of one GRF regardless of the vertical stride.  The end result is the
instruction behaves as if the source had the logical region
r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
units of whole 64bit components) and we have managed to get the ZW
components to cross the vec2 boundary and read the XY components from
the first oword of the register.  If vector splitting is applied
wisely in addition to this, we effectively gain one degree of freedom
and can now represent a large portion of the dvec4 swizzle space by
splitting into two instructions only (close to 70% of all swizzle
combinations).  A particularly fruitful way to split (works for
roughly 55% of the swizzle space) involves dividing the XW and YZ
components into separate instructions, because on the one hand it
ensures full orthogonality between the swizzle of each component and
on the other hand it allows shifting the second component in each pair
with respect to the first by 16B freely using the vstride trick, and
at the same time it sidesteps the writemask limitation (for
comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
properties at the same time).  In cases where the vstride trick
doesn't work (e.g. BDW) SIMD splitting would have to be applied
afterwards to unroll the instruction into a pair of SIMD4 instructions
with vstride=0 and so that the base register of the second one is
shifted by one GRF from the first one -- The end result is still
strictly better than scalarizing the original instruction completely
in most cases since the latter requires four compressed instructions
while the proposed approach generates the same number of uncompressed
instructions.

The following table summarizes how the splitting could be carried out
for each swizzle combination.  Each row corresponds to some set of
combinations of the first two swizzle components and each column
corresponds to some set of combinations of the last two swizzle
components.  'v' is used as a shorthand for the set of DF swizzles
that read from the first oword of a vec4 (i.e. X or Y) while '^' is
used for the second oword (i.e. Z or W).  The letters indicate the
kind of splitting that needs to be carried out, from A which has the
highest likelihood to be supported natively as a single instruction,
to E which is the worst and needs to be fully scalarized in all cases.
The meaning of the class subscripts is explained below together with
the description of each splitting class.

      vv  v^   ^v   ^^
  vv  A   B    B    A
  v^  Cy  Dyz  B    B
  ^v  Cx  B    Dxw  B
  ^^  E   Cz   Cw   A

E.g. swizzle XZYX reads the first oword for all components except the
second which reads the second oword, so it would belong to the set
v^vv, which corresponds to splitting class Cy.  Because lower
(i.e. worse) splitting classes are typically more general than higher
ones (with some exceptions discussed below), it would be allowable for
an implementation to demote swizzle combinations to a splitting class
worse than the one shown on the table in order to avoid implementing
certain splitting classes -- E.g. it would be valid for an
implementation to ignore the table above altogether and implement
splitting class E exclusively.
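
For illustration, a minimal C sketch of this table lookup (the function name is made up, and the +/- refinements described in the breakdown below are not included):

  /* Sketch of the classification table above.  swz[] holds the four logical
   * dvec4 swizzle components (0=X, 1=Y, 2=Z, 3=W); 'v' means the component
   * reads the first oword (X or Y), '^' the second oword (Z or W). */
  static const char *
  df_swizzle_class(const unsigned swz[4])
  {
     static const char *table[4][4] = {
        /*         vv     v^     ^v     ^^   */
        /* vv */ { "A",   "B",   "B",   "A"  },
        /* v^ */ { "Cy",  "Dyz", "B",   "B"  },
        /* ^v */ { "Cx",  "B",   "Dxw", "B"  },
        /* ^^ */ { "E",   "Cz",  "Cw",  "A"  },
     };
     const unsigned row = ((swz[0] >= 2) << 1) | (swz[1] >= 2);
     const unsigned col = ((swz[2] >= 2) << 1) | (swz[3] >= 2);
     return table[row][col];   /* e.g. XZYX = {0, 2, 1, 0} -> "Cy" */
  }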

 - Class A (48/256 swizzles) can be subdivided into:
   - Class A+ (12/256 swizzles) which can always be supported
     natively.  It applies for swizzles such that the first and third
     components pair up correctly and the second and fourth components
     pair up as well -- To be more precise, two swizzle components i
     and j are said to "pair up" correctly if they are equal modulo
     two, i.e. if they refer to the same subcomponent within the oword
     they refer to respectively -- E.g. x pairs up with z and y pairs
     up with w.

   - Class A- (36/256 swizzles) which applies in every other case and
     requires splitting into XW+YZ components.

 - Class B (96/256 swizzles) can be handled by splitting into XW+YZ
   regardless of the swizzles.

 - Class Ci (64/256 swizzles) can be subdivided into:
   - Class Ci+ (32/256 swizzles) requires splitting into a single
     scalar component 'i' plus a dvec3 with all remaining components.
     It applies when the swizzle of the "lone" component in the dvec3
     (i.e. i XOR 1) is paired up correctly.

   - Class Ci- (32/256 swizzles) applies in every other case and requires
     splitting into the scalar component 'i', the scalar component 'i
     XOR 3' and a dvec2 with all remaining components.

 - Class Dij (32/256 swizzles) can be subdivided into:
   - Class Dij+ (8/256 swizzles) requires splitting into XZ+YW and
     applies when the swizzles for the first and third components and
     second and fourth components pair up respectively.
  
   - Class Dij- (24/256 swizzles) applies otherwise and requires
     splitting into scalar component 'i', scalar component 'j' and a
     dvec2 with the two remaining components.

 - Class E (16/256 swizzles) requires full unrolling into scalar
   components.
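
For illustration, the "pair up" test used above written out as a minimal C sketch (the helper names are made up; the A+ check assumes the swizzle has already been classified as A by the table):

  #include <stdbool.h>

  /* Two swizzle components pair up when they are equal modulo two, i.e.
   * they refer to the same subcomponent within their respective oword
   * (x pairs up with z, y pairs up with w). */
  static bool
  pairs_up(unsigned i, unsigned j)
  {
     return (i % 2) == (j % 2);
  }

  /* For a swizzle already in class A, distinguish A+ (supportable as a
   * single native instruction) from A- (requires an XW+YZ split). */
  static bool
  is_class_a_plus(const unsigned swz[4])
  {
     return pairs_up(swz[0], swz[2]) && pairs_up(swz[1], swz[3]);
  }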

The following diagram summarizes how classes can be demoted into each
other (which implies that one is able to represent a superset of the
swizzles from the other).  When the original class had subscripts the
union for all possible subscripts is intended in the diagram.  The
symbol '->' denotes 'can be demoted to' and symbol '~~' denotes 'is
equivalent to'.

  A+ -> A- ~~ B -> C- -> D- -> E
               C+ -^ D+ -^

This approach could be extended to instructions with multiple sources
without much difficulty by dividing the instruction iteratively for
each source -- The caveat is that for multiple-source instructions the
probability of having to make use of one of the worse classes
increases, but the situation doesn't change qualitatively and the best-
(one instruction) and worst-case scenarios (full scalarization) remain
unchanged.

P.S.: Would be great to be able to use LaTeX markup in bugzilla. :P
Comment 83 Iago Toral 2016-05-18 10:32:54 UTC
(In reply to Francisco Jerez from comment #82)
> I've been running some experiments on vec4 hardware and playing some
> combinatorics trying to come up with a plan to address the limitations
> of Align16 double-precision instructions we've been discussing.  Part
> of the following is optional and could be left out for a first
> approximation without sacrificing functional correctness -- The point
> is to have a reasonably workable initial implementation that still
> allows some room for improvement in the future.

Thanks for the very detailed post, Curro! It is going to be super useful to have it documented here going forward :)

Now that we have landed the scalar backend I will be back to this.

> The high-level idea is to have the front-end feed the back-end with
> (mostly) unlowered dvec4 instructions that would be processed by the
> back-end optimizer as usual, without considering the extremely
> annoying limitations of the hardware.  After the back-end optimization
> loop some additional (fully self-contained) machinery would lower
> double-precision instructions into a form the EUs would be able to
> deal with.

Right, that sounds like a good idea and, like you pointed out, will also make for a much cleaner solution, since it would allow us to separate out the parts that deal with all the peculiar aspects of fp64 in the vec4 backend.

(...)

> The following trick allows representing many swizzles that cross vec2
> boundaries by exploiting some of the hardware's stupidity.  Assume you
> have a compressed SIMD4x2 DF instruction and a given source has
> vstride set to zero and identity swizzle.  For each channel the
> hardware would read the following locations from the register file, at
> least ideally:
> 
>  x0 <- r0.0:df
>  y0 <- r0.1:df
>  z0 <- r0.0:df
>  w0 <- r0.1:df
>  x1 <- r0.0:df
>  y1 <- r0.1:df
>  z1 <- r0.0:df
>  w1 <- r0.1:df
> 
> assuming that r0.0 is the base register.  This is actually what
> happens on BDW+ and doesn't help us much except for representing
> uniforms.  Pre-Gen8 hardware however behaves like:
> 
>  x0 <- r0.0:df
>  y0 <- r0.1:df
>  z0 <- r0.0:df
>  w0 <- r0.1:df
>  x1 <- r1.0:df
>  y1 <- r1.1:df
>  z1 <- r1.0:df
>  w1 <- r1.1:df
> 
> due to an instruction decompression bug (or feature?) that causes it
> to increment the register offset of the second half by a fixed amount
> of one GRF regardless of the vertical stride.  The end result is the
> instruction behaves as if the source had the logical region
> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
> units of whole 64bit components) and we have managed to get the ZW
> components to cross the vec2 boundary and read the XY components from
> the first oword of the register.

For the sake of completeness, one idea that we discussed was to try and exploit this to access components Z/W like:

r0.2<0,2,1>.xyzw:df

That is, we start the region at offset 16B (i.e. right where Z for the first vertex starts) and then use the vstride=0 trick to read past the first GRF and access the Z/W data for the second vertex in the SIMD4x2 execution.

I have to say, however, that I tried to do some quick tests to verify that this actually works and it didn't for me. I think in this case we might be violating a few hardware restrictions because:

a) The region (for an execsize of 8) would span over 3 GRFs (which I think is not allowed?)
b) The first and last GRFs covered would select fewer DF elements than the second. I think I read something in the docs requiring that regions select "homogeneous" parts of the registers involved.

>  If vector splitting is applied
> wisely in addition to this, we effectively gain one degree of freedom
> and can now represent a large portion of the dvec4 swizzle space by
> splitting into two instructions only (close to 70% of all swizzle
> combinations).  A particularly fruitful way to split (works for
> roughly 55% of the swizzle space) involves dividing the XW and YZ
> components into separate instructions, because on the one hand it
> ensures full orthogonality between the swizzle of each component and
> on the other hand it allows shifting the second component in each pair
> with respect to the first by 16B freely using the vstride trick, and

To clarify this last sentence a bit further, the idea here is that with an execsize of 8 we use regions with a vstride of 2 (so 4 regions in total, each one 16B large). With XW and YZ swizzles each component in the swizzle falls in a separate 16B region and we select each of the components using a different 32-bit swizzle. For example, with XW, for the first vertex in the SIMD4x2 execution, component X falls in the first 16B region and W in the second (for the second vertex they fall in the 3rd and 4th respectively). We select X in the first region using an xy 32-bit swizzle and we select W in the second region by using a zw 32-bit swizzle.

This means that we could implement something like:

mov(8) r0.xyzw:df r2.xyzw:df

By splitting it in two like this:

mov(8) r0.xw:df r2.xw:df
mov(8) r0.yz:df r2.yz:df

And then generating:

mov(8) r0<1>.xw r2<2,2,1>.xyzw
mov(8) r0<1>.yz r2<2,2,1>.zwxy
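
For illustration, a minimal C sketch of this per-component mapping (the struct and names are made up; it only captures which 32-bit channel pair selects a given logical DF component and which oword of the dvec4 it lives in, while the actual region parameters and the packing into the instruction's swizzle field are left out):

  /* Hypothetical per-component logical->physical decomposition: within its
   * 16B row, X/Z are selected by the xy pair of 32-bit channels and Y/W by
   * the zw pair, while the row itself (first or second oword of the dvec4,
   * per vertex) has to come from the region (subreg offset / vstride), not
   * from the swizzle. */
  struct df_chan_sel {
     unsigned oword;      /* 0: first 16B of the dvec4, 1: second 16B   */
     unsigned chan_lo;    /* first 32-bit channel selecting the DF      */
     unsigned chan_hi;    /* second 32-bit channel (chan_lo + 1)        */
  };

  static struct df_chan_sel
  df_logical_to_physical(unsigned s)   /* s: 0=X, 1=Y, 2=Z, 3=W */
  {
     struct df_chan_sel sel;
     sel.oword   = s / 2;
     sel.chan_lo = 2 * (s % 2);
     sel.chan_hi = sel.chan_lo + 1;
     return sel;
  }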

> at the same time it sidesteps the writemask limitation (for
> comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
> properties at the same time).  In cases where the vstride trick
> doesn't work (e.g. BDW) SIMD splitting would have to be applied
> afterwards to unroll the instruction into a pair of SIMD4 instructions
> with vstride=0 and so that the base register of the second one is
> shifted by one GRF from the first one -- The end result is still
> strictly better than scalarizing the original instruction completely
> in most cases since the latter requires four compressed instructions
> while the proposed approach generates the same number of uncompressed
> instructions.

Are we still interested in enabling the vec4 backend on BDW+ now that these gens are fully scalar?

> The following table summarizes how the splitting could be carried out
> for each swizzle combination.  Each row corresponds to some set of
> combinations of the first two swizzle components and each column
> corresponds to some set of combinations of the last two swizzle
> components.  'v' is used as a shorthand for the set of DF swizzles
> that read from the first oword of a vec4 (i.e. X or Y) while '^' is
> used for the second oword (i.e. Z or W).  The letters indicate the
> kind of splitting that needs to be carried out, from A which has the
> highest likelihood to be supported natively as a single instruction,
> to E which is the worst and needs to be fully scalarized in all cases.
> The meaning of the class subscripts is explained below together with
> the description of each splitting class.
> 
>       vv  v^   ^v   ^^
>   vv  A   B    B    A
>   v^  Cy  Dyz  B    B
>   ^v  Cx  B    Dxw  B
>   ^^  E   Cz   Cw   A
> 
> E.g. swizzle XZYX reads the first oword for all components except the
> second which reads the second oword, so it would belong to the set
> v^vv, which corresponds to splitting class Cy.  Because lower
> (i.e. worse) splitting classes are typically more general than higher
> ones (with some exceptions discussed below), it would be allowable for
> an implementation to demote swizzle combinations to a splitting class
> worse than the one shown on the table in order to avoid implementing
> certain splitting classes -- E.g. it would be valid for an
> implementation to ignore the table above altogether and implement
> splitting class E exclusively.
> 
>  - Class A (48/256 swizzles) can be subdivided into:
>    - Class A+ (12/256 swizzles) which can always be supported
>      natively.  It applies for swizzles such that the first and third
>      components pair up correctly and the second and fourth components
>      pair up as well -- To be more precise, two swizzle components i
>      and j are said to "pair up" correctly if they are equal modulo
>      two, i.e. if they refer to the same subcomponent within the oword
>      they refer to respectively -- E.g. x pairs up with z and y pairs
>      up with w.

As far as I understand, some of these don't need the vstride trick (like XYZW, which we can represent with a region like <2,2,1>.xyzw:df), while others do (like XYXY, which would need something like <0,2,1>.xyzw:df).

>    - Class A- (36/256 swizzles) which applies in every other case and
>      requires splitting into XW+YZ components.

For example, XXYY. In this case we would split it into XX and YY. The former would be represented using a <2,2,1>.xyxy region (which actually selects XXZZ) and the latter using <2,2,1>.zwzw (which actually selects YYWW).

>  - Class B (96/256 swizzles) can be handled by splitting into XW+YZ
>    regardless of the swizzles.
> 
>  - Class Ci (64/256 swizzles) can be subdivided into:
>    - Class Ci+ (32/256 swizzles) requires splitting into a single
>      scalar component 'i' plus a dvec3 with all remaining components.
>      It applies when the swizzle of the "lone" component in the dvec3
>      (i.e. i XOR 1) is paired up correctly.
> 
>    - Class Ci- (32/256 swizzles) applies in every other case and requires
>      splitting into the scalar component 'i', the scalar component 'i
>      XOR 3' and a dvec2 with all remaining components.
> 
>  - Class Dij (32/256 swizzles) can be subdivided into:
>    - Class Dij+ (8/256 swizzles) requires splitting into XZ+YW and
>      applies when the swizzles for the first and third components and
>      second and fourth components pair up respectively.
>   
>    - Class Dij- (24/256 swizzles) applies otherwise and requires
>      splitting into scalar component 'i', scalar component 'j' and a
>      dvec2 with the two remaining components.
> 
>  - Class E (16/256 swizzles) requires full unrolling into scalar
>    components.
> 
> The following diagram summarizes how classes can be demoted into each
> other (which implies that one is able to represent a superset of the
> swizzles from the other).  When the original class had subscripts the
> union for all possible subscripts is intended in the diagram.  The
> symbol '->' denotes 'can be demoted to' and symbol '~~' denotes 'is
> equivalent to'.
> 
>   A+ -> A- ~~ B -> C- -> D- -> E
>                C+ -^ D+ -^
> 
> This approach could be extended to instructions with multiple sources
> without much difficulty by dividing the instruction iteratively for
> each source -- The caveat is that for multiple-source instructions the
> probability of having to make use of one of the worse classes
> increases, but the situation doesn't change qualitatively and the best-
> (one instruction) and worst-case scenarios (full scalarization) remain
> unchanged.

Yeah, 2-operand instructions are going to lead to more splitting in the end but the good thing about this is that we can start by fully scalarizing and then improve things by supporting more swizzle classes iteratively.

> P.S.: Would be great to be able to use LaTeX markup in bugzilla. :P
Comment 84 Francisco Jerez 2016-05-21 21:10:48 UTC
(In reply to Iago Toral from comment #83)
>[...]
>> due to an instruction decompression bug (or feature?) that causes it
>> to increment the register offset of the second half by a fixed amount
>> of one GRF regardless of the vertical stride.  The end result is the
>> instruction behaves as if the source had the logical region
>> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
>> units of whole 64bit components) and we have managed to get the ZW
>> components to cross the vec2 boundary and read the XY components from
>> the first oword of the register.
>
> For the sake of completeness, one idea that we discussed was to try and exploit
> this to access components Z/W like:
>
> r0.2<0,2,1>.xyzw:df
>
> That is, we start the region at offset 16B (i.e. right where Z for the first
> vertex starts) and then use the vstride=0 trick to read past the first GRF and
> access the Z/W data for the second vertex in the SIMD4x2 execution.
>
> I have to say, however, that I tried to do some quick tests to verify that this
> actually works and it didn't for me. I think in this case we might be violating
> a few hardware restrictions because:
>

I'm pretty sure I tried that on HSW at some point and it worked fine for
me, I'll see if I can dig up the exact assembly code I used.

> a) The region (for an execsize of 8) would span over 3 GRFs (which I think is
> not allowed?)

Not really, the region would span four rows (because the width has a
fixed value of two for DF instructions in Align16 mode and as you say
the execution size is eight), like:

 r0.2 r0.3
 r0.2 r0.3
 r1.2 r1.3
 r1.2 r1.3

So the whole region is contained in two GRF registers which is allowed
by the hardware.

> b) The first and last GRFs covered would select less DF elements than the
> second. I think I read something in the docs that required that regions
> selected "homogeneous" parts of the registers selected.
>

The first GRF (r0) would be read by four channels and the second GRF
(r1) by another four, so the hardware should be happy AFAICT. ;)

>>  If vector splitting is applied
>> wisely in addition to this, we effectively gain one degree of freedom
>> and can now represent a large portion of the dvec4 swizzle space by
>> splitting into two instructions only (close to 70% of all swizzle
>> combinations).  A particularly fruitful way to split (works for
>> roughly 55% of the swizzle space) involves dividing the XW and YZ
>> components into separate instructions, because on the one hand it
>> ensures full orthogonality between the swizzle of each component and
>> on the other hand it allows shifting the second component in each pair
>> with respect to the first by 16B freely using the vstride trick, and
>
> To clarify this last sentence a bit further, the idea here is that with an
> execsize of 8 we use regions with a vstride of 2 (so 4 regions in total, each

VStride doesn't necessarily have to be 2, and the fact that you can
control the vstride independently from the other regioning parameters
triples the number of different swizzles you can represent using
this trick.  E.g. if you want both components of the XW (or YZ)
instruction to read from the same oword of the input (either the first
or the second one) you set vstride to zero and then pick the appropriate
swizzle for each component, which is always possible because the
swizzles for the X and W components of the instruction (or respectively
Y and Z) are fully independent.

> one 16B large). With XW and YZ swizzles each component in the swizzle falls in
> a separate 16B region and we select each of the components using a different
> 32-bit swizzle. For example, with XW, for the first vertex in the SIMD4x2
> execution, component X falls in the first 16B region and W in the second (for
> the second vertex they fall inthe 3rd and 4th respectively). We select X in the
> first region using a xy 32-bit swizzle and we select W in the second region by
> using a zw 32-bit swizzle.
>
> This means that we could implement something like:
>
> mov(8) r0.xyzw:df r2.xyzw:df
>

That's the identity swizzle you could implement as-is ;).  Or is it just
an example for the sake of explanation?

> By splitting it in two like this:
>
> mov(8) r0.xw:df r2.xw:df
> mov(8) r0.yz:df r2.yz:df
>
> And then generating:
>
> mov(8) r0<1>.xw r2<2,2,1>.xyzw
> mov(8) r0<1>.yz r2<2,2,1>.zwxy
>

The swizzles of the second instruction seem backwards, they would have
to be:

| mov(8) r0<1>.yz r2<2,2,1>.xyzw

for the Y component of the instruction to read r2.1 (the Y component of
the original dvec4) and the Z component of the instruction to read r2.2
(the Z component of the vector).  The assembly you have written would be
equivalent to the logical instruction (with whole-DF swizzle notation):

| mov(8) r0.xyzw:df r2.xxww:df

>> at the same time it sidesteps the writemask limitation (for
>> comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
>> properties at the same time).  In cases where the vstride trick
>> doesn't work (e.g. BDW) SIMD splitting would have to be applied
>> afterwards to unroll the instruction into a pair of SIMD4 instructions
>> with vstride=0 and so that the base register of the second one is
>> shifted by one GRF from the first one -- The end result is still
>> strictly better than scalarizing the original instruction completely
>> in most cases since the latter requires four compressed instructions
>> while the proposed approach generates the same number of uncompressed
>> instructions.
>
> Are we still interested in enabling the vec4 backend on BDW+ now that these
> gens are fully scalar?
>

I guess it doesn't matter, I wouldn't worry too much about BDW+ right
now, but I think we can support it almost for free in terms of
implementation complexity (because you'll need a SIMD lowering pass for
other reasons anyway), so it should still be possible to make them go
back to vector in the future if it happens to be useful (e.g. where
register pressure kills the scalar back-end).

>> The following table summarizes how the splitting could be carried out
>> for each swizzle combination.  Each row corresponds to some set of
>> combinations of the first two swizzle components and each column
>> corresponds to some set of combinations of the last two swizzle
>> components.  'v' is used as a shorthand for the set of DF swizzles
>> that read from the first oword of a vec4 (i.e. X or Y) while '^' is
>> used for the second oword (i.e. Z or W).  The letters indicate the
>> kind of splitting that needs to be carried out, from A which has the
>> highest likelihood to be supported natively as a single instruction,
>> to E which is the worst and needs to be fully scalarized in all cases.
>> The meaning of the class subscripts is explained below together with
>> the description of each splitting class.
>> 
>>       vv  v^   ^v   ^^
>>   vv  A   B    B    A
>>   v^  Cy  Dyz  B    B
>>   ^v  Cx  B    Dxw  B
>>   ^^  E   Cz   Cw   A
>> 
>> E.g. swizzle XZYX reads the first oword for all components except the
>> second which reads the second oword, so it would belong to the set
>> v^vv, which corresponds to splitting class Cy.  Because lower
>> (i.e. worse) splitting classes are typically more general than higher
>> ones (with some exceptions discussed below), it would be allowable for
>> an implementation to demote swizzle combinations to a splitting class
>> worse than the one shown on the table in order to avoid implementing
>> certain splitting classes -- E.g. it would be valid for an
>> implementation to ignore the table above altogether and implement
>> splitting class E exclusively.
>> 
>>  - Class A (48/256 swizzles) can be subdivided into:
>>    - Class A+ (12/256 swizzles) which can always be supported
>>      natively.  It applies for swizzles such that the first and third
>>      components pair up correctly and the second and fourth components
>>      pair up as well -- To be more precise, two swizzle components i
>>      and j are said to "pair up" correctly if they are equal modulo
>>      two, i.e. if they refer to the same subcomponent within the oword
>>      they refer to respectively -- E.g. x pairs up with z and y pairs
>>      up with w.
>
> As far as I understand, some of these don't need the vstride trick (like XYZW,
> which we can represent with a region like <2,2,1>.xyzw:df) while others do
> (like XYXY, which would need something like <0,2,1>.xyzw:df)
>

Yeah, it depends on whether the second pair of vector components of the
instruction map to the same oword of the input (in which case you need
the trick) or a different one (in which case you don't).

>>    - Class A- (36/256 swizzles) which applies in every other case and
>>      requires splitting into XW+YZ components.
>
> For example, XXYY. In this case we would split it in XX and YY. The former
> would be represented using a <2,2,1>.xyxy region  (which actually selects XXZZ)
> and the latter using <2,2,1>.zwzw. (which actually selects YYWW).
>

Hm, the example XXYY does belong to class A-, but (like everything else
in this class) you would have to split the instruction into XW+YZ
components and then emit code like:

| mov(8) r0.xw:df r2<0,2,1>.xyzw:df
| mov(8) r0.yz:df r2<0,2,1>.zwxy:df

>>[...]
>> This approach could be extended to instructions with multiple sources
>> without much difficulty by dividing the instruction iteratively for
>> each source -- The caveat is that for multiple-source instructions the
>> probability of having to make use of one of the worse classes
>> increases, but the situation doesn't change qualitatively and the best-
>> (one instruction) and worst-case scenarios (full scalarization) remain
>> unchanged.
>
> Yeah, 2-operand instructions are going to lead to more splitting in the end but
> the good thing about this is that we can start by fully scalarizing and then
> improve things by supporting more swizzle classes iteratively.
>

Agreed. :)
Comment 85 Francisco Jerez 2016-05-22 00:19:07 UTC
(In reply to Francisco Jerez from comment #84)
>[...]
> >>  - Class A (48/256 swizzles) can be subdivided into:
> >>    - Class A+ (12/256 swizzles) which can always be supported
> >>      natively.  It applies for swizzles such that the first and third
> >>      components pair up correctly and the second and fourth components
> >>      pair up as well -- To be more precise, two swizzle components i
> >>      and j are said to "pair up" correctly if they are equal modulo
> >>      two, i.e. if they refer to the same subcomponent within the oword
> >>      they refer to respectively -- E.g. x pairs up with z and y pairs
> >>      up with w.
> >
> > As far as I understand, some of these don't need the vstride trick (like XYZW,
> > which we can represent with a region like <2,2,1>.xyzw:df) while others do
> > (like XYXY, which would need something like <0,2,1>.xyzw:df)
> >
> 
> Yeah, it depends on whether the second pair of vector components of the
> instruction map to the same oword of the input (in which case you need
> the trick) or a different one (in which case you don't).

Er, that sentence should have read "it depends on whether the first and second pairs of vector components...", also sorry for the weird line wrapping -- Should have used the preview function before saving. ;)

>[...] 
>
Comment 86 Iago Toral 2016-05-23 06:30:55 UTC
(In reply to Francisco Jerez from comment #84)
> (In reply to Iago Toral from comment #83)
> >[...]
> >> due to an instruction decompression bug (or feature?) that causes it
> >> to increment the register offset of the second half by a fixed amount
> >> of one GRF regardless of the vertical stride.  The end result is the
> >> instruction behaves as if the source had the logical region
> >> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
> >> units of whole 64bit components) and we have managed to get the ZW
> >> components to cross the vec2 boundary and read the XY components from
> >> the first oword of the register.
> >
> > For the sake of completeness, one idea that we discussed was to try and exploit
> > this to access components Z/W like:
> >
> > r0.2<0,2,1>.xyzw:df
> >
> > That is, we start the region at offset 16B (i.e. right where Z for the first
> > vertex starts) and then use the vstride=0 trick to read past the first GRF and
> > access the Z/W data for the second vertex in the SIMD4x2 execution.
> >
> > I have to say, however, that I tried to do some quick tests to verify that this
> > actually works and it didn't for me. I think in this case we might be violating
> > a few hardware restrictions because:
> >
> 
> I'm pretty sure I tried that on HSW at some point and it worked fine for
> me, I'll see if I can dig up the exact assembly code I used.


Don't worry, I am not 100% certain that this is the example I tried. Maybe I was trying uniforms or something slightly different. Let me try to hit the problem again and if I do then we can check what is going on with it.

> > a) The region (for an execsize of 8) would span over 3 GRFs (which I think is
> > not allowed?)
> 
> Not really, the region would span four rows (because the width has a
> fixed value of two for DF instructions in Align16 mode and as you say
> the execution size is eight), like:
> 
>  r0.2 r0.3
>  r0.2 r0.3
>  r1.2 r1.3
>  r1.2 r1.3
> 
> So the whole region is contained in two GRF registers which is allowed
> by the hardware.

Yeah, you're right, for some reason I was thinking of the same case but with a vstride of 2.

> > b) The first and last GRFs covered would select less DF elements than the
> > second. I think I read something in the docs that required that regions
> > selected "homogeneous" parts of the registers selected.
> >
> 
> The first GRF (r0) would be read by four channels and the second GRF
> (r1) by another four, so the hardware should be happy AFAICT. ;)

Yep.

> >>  If vector splitting is applied
> >> wisely in addition to this, we effectively gain one degree of freedom
> >> and can now represent a large portion of the dvec4 swizzle space by
> >> splitting into two instructions only (close to 70% of all swizzle
> >> combinations).  A particularly fruitful way to split (works for
> >> roughly 55% of the swizzle space) involves dividing the XW and YZ
> >> components into separate instructions, because on the one hand it
> >> ensures full orthogonality between the swizzle of each component and
> >> on the other hand it allows shifting the second component in each pair
> >> with respect to the first by 16B freely using the vstride trick, and
> >
> > To clarify this last sentence a bit further, the idea here is that with an
> > execsize of 8 we use regions with a vstride of 2 (so 4 regions in total, each
> 
> VStride doesn't necessarily has to be 2, and the fact that you can
> control the vstride independently from the other regioning parameters
> triplicates the number of different swizzles you can represent using
> this trick.  E.g. if you want both components of the XW (or YZ)
> instruction to read from the same oword of the input (either the first
> or the second one) you set vstride to zero and then pick the appropriate
> swizzle for each component, which is always possible because the
> swizzles for the X and W components of the instruction (or respectively
> Y and Z) are fully independent.
> 
> > one 16B large). With XW and YZ swizzles each component in the swizzle falls in
> > a separate 16B region and we select each of the components using a different
> > 32-bit swizzle. For example, with XW, for the first vertex in the SIMD4x2
> > execution, component X falls in the first 16B region and W in the second (for
> > the second vertex they fall inthe 3rd and 4th respectively). We select X in the
> > first region using a xy 32-bit swizzle and we select W in the second region by
> > using a zw 32-bit swizzle.
> > This means that we could implement something like:
> >
> > mov(8) r0.xyzw:df r2.xyzw:df
> >
> 
> That's the identity swizzle you could implement as-is ;).  Or is it just
> an example for the sake of explanation?

Yeah, I just wanted to make an example to showcase how we would split and use vstride.

> > By splitting it in two like this:
> >
> > mov(8) r0.xw:df r2.xw:df
> > mov(8) r0.yz:df r2.yz:df
> >
> > And then generating:
> >
> > mov(8) r0<1>.xw r2<2,2,1>.xyzw
> > mov(8) r0<1>.yz r2<2,2,1>.zwxy
> >
> 
> The swizzles of the second instruction seem backwards, they would have
> to be:
> 
> | mov(8) r0<1>.yz r2<2,2,1>.xyzw
> 
> for the Y component of the instruction to read r2.1 (the Y component of
> the original dvec4) and the Z component of the instruction to read r2.2
> (the Z component of the vector).

Oops, yes. Thanks for pointing this out! :)

>  The assembly you have written would be
> equivalent to the logical instruction (with whole-DF swizzle notation):
> 
> | mov(8) r0.xyzw:df r2.xxww:df
> 
> >> at the same time it sidesteps the writemask limitation (for
> >> comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
> >> properties at the same time).  In cases where the vstride trick
> >> doesn't work (e.g. BDW) SIMD splitting would have to be applied
> >> afterwards to unroll the instruction into a pair of SIMD4 instructions
> >> with vstride=0 and so that the base register of the second one is
> >> shifted by one GRF from the first one -- The end result is still
> >> strictly better than scalarizing the original instruction completely
> >> in most cases since the latter requires four compressed instructions
> >> while the proposed approach generates the same number of uncompressed
> >> instructions.
> >
> > Are we still interested in enabling the vec4 backend on BDW+ now that these
> > gens are fully scalar?
> >
> 
> I guess it doesn't matter, I wouldn't worry too much about BDW+ right
> now, but I think we can support it almost for free in terms of
> implementation complexity (because you'll need a SIMD lowering pass for
> other reasons anyway), so it should still be possible to make them go
> back to vector in the future if it happens to be useful (e.g. where
> register pressure kills the scalar back-end).

Right.

> >> The following table summarizes how the splitting could be carried out
> >> for each swizzle combination.  Each row corresponds to some set of
> >> combinations of the first two swizzle components and each column
> >> corresponds to some set of combinations of the last two swizzle
> >> components.  'v' is used as a shorthand for the set of DF swizzles
> >> that read from the first oword of a vec4 (i.e. X or Y) while '^' is
> >> used for the second oword (i.e. Z or W).  The letters indicate the
> >> kind of splitting that needs to be carried out, from A which has the
> >> highest likelihood to be supported natively as a single instruction,
> >> to E which is the worst and needs to be fully scalarized in all cases.
> >> The meaning of the class subscripts is explained below together with
> >> the description of each splitting class.
> >> 
> >>       vv  v^   ^v   ^^
> >>   vv  A   B    B    A
> >>   v^  Cy  Dyz  B    B
> >>   ^v  Cx  B    Dxw  B
> >>   ^^  E   Cz   Cw   A
> >> 
> >> E.g. swizzle XZYX reads the first oword for all components except the
> >> second which reads the second oword, so it would belong to the set
> >> v^vv, which corresponds to splitting class Cy.  Because lower
> >> (i.e. worse) splitting classes are typically more general than higher
> >> ones (with some exceptions discussed below), it would be allowable for
> >> an implementation to demote swizzle combinations to a splitting class
> >> worse than the one shown on the table in order to avoid implementing
> >> certain splitting classes -- E.g. it would be valid for an
> >> implementation to ignore the table above altogether and implement
> >> splitting class E exclusively.
> >> 
> >>  - Class A (48/256 swizzles) can be subdivided into:
> >>    - Class A+ (12/256 swizzles) which can always be supported
> >>      natively.  It applies for swizzles such that the first and third
> >>      components pair up correctly and the second and fourth components
> >>      pair up as well -- To be more precise, two swizzle components i
> >>      and j are said to "pair up" correctly if they are equal modulo
> >>      two, i.e. if they refer to the same subcomponent within the oword
> >>      they refer to respectively -- E.g. x pairs up with z and y pairs
> >>      up with w.
> >
> > As far as I understand, some of these don't need the vstride trick (like XYZW,
> > which we can represent with a region like <2,2,1>.xyzw:df) while others do
> > (like XYXY, which would need something like <0,2,1>.xyzw:df)
> >
> 
> Yeah, it depends on whether the second pair of vector components of the
> instruction map to the same oword of the input (in which case you need
> the trick) or a different one (in which case you don't).
> 
> >>    - Class A- (36/256 swizzles) which applies in every other case and
> >>      requires splitting into XW+YZ components.
> >
> > For example, XXYY. In this case we would split it in XX and YY. The former
> > would be represented using a <2,2,1>.xyxy region  (which actually selects XXZZ)
> > and the latter using <2,2,1>.zwzw. (which actually selects YYWW).
> >
> 
> Hm, the example XXYY does belong to class A-, but (like everything else
> in this class) you would have to split the instruction into XW+YZ
> components and then emit code like:
> 
> | mov(8) r0.xw:df r2<0,2,1>.xyzw:df
> | mov(8) r0.yz:df r2<0,2,1>.zwxy:df

Oh yes, absolutely.

> >>[...]
> >> This approach could be extended to instructions with multiple sources
> >> without much difficulty by dividing the instruction iteratively for
> >> each source -- The caveat is that for multiple-source instructions the
> >> probability of having to make use of one of the worse classes
> >> increases, but the situation doesn't change qualitatively and the best-
> >> (one instruction) and worst-case scenarios (full scalarization) remain
> >> unchanged.
> >
> > Yeah, 2-operand instructions are going to lead to more splitting in the end but
> > the good thing about this is that we can start by fully scalarizing and then
> > improve things by supporting more swizzle classes iteratively.
> >
> 
> Agreed. :)
Comment 87 Iago Toral 2016-05-26 06:54:33 UTC
(In reply to Iago Toral from comment #86)
> (In reply to Francisco Jerez from comment #84)
> > (In reply to Iago Toral from comment #83)
> > >[...]
> > >> due to an instruction decompression bug (or feature?) that causes it
> > >> to increment the register offset of the second half by a fixed amount
> > >> of one GRF regardless of the vertical stride.  The end result is the
> > >> instruction behaves as if the source had the logical region
> > >> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
> > >> units of whole 64bit components) and we have managed to get the ZW
> > >> components to cross the vec2 boundary and read the XY components from
> > >> the first oword of the register.
> > >
> > > For the sake of completeness, one idea that we discussed was to try and exploit
> > > this to access components Z/W like:
> > >
> > > r0.2<0,2,1>.xyzw:df
> > >
> > > That is, we start the region at offset 16B (i.e. right where Z for the first
> > > vertex starts) and then use the vstride=0 trick to read past the first GRF and
> > > access the Z/W data for the second vertex in the SIMD4x2 execution.
> > >
> > > I have to say, however, that I tried to do some quick tests to verify that this
> > > actually works and it didn't for me. I think in this case we might be violating
> > > a few hardware restrictions because:
> > >
> > 
> > I'm pretty sure I tried that on HSW at some point and it worked fine for
> > me, I'll see if I can dig up the exact assembly code I used.
> 
> 
> Don't worry, I am not 100% certain that this is the example I tried. Maybe I
> was trying uniforms or something slightly different. Let me try to hit the
> problem again and if I do then we can check what is going on with it.

Ok, so I got back to this and it does work fine. The problem I was having before is that I was using a dvec3 uniform, and due to the vstride=0 bug we need SIMD splitting to source from that. We haven't implemented that yet; so far I had been using a hack that copies the uniform to a vgrf and sources from that instead, so I could test some things in the meantime. When I wrote that code I only considered double and dvec2 uniforms, so I had to update the hack to deal with dvec3 uniforms too.
Comment 88 Juan A. Suarez 2016-06-06 12:34:14 UTC
(In reply to Iago Toral from comment #66)
> Curro suggested that we can work around this problem in HSW+ by taking
> advantage of the fact that in these gens it is possible to process 8 DF
> elements in ALU operations, so we can simply not split dvec4 operations in
> dvec2 chunks, which would fix the problem automatically. Unfortunately, this
> also means that we need to figure out a solution to the problem with
> swizzles and writemasks working on 32-bit elements (more on that topic
> below).
> 
> Unfortunately, that solution would not work for IVB, since that hardware
> cannot run ALU instructions operating on more than 4 DF components. Curro
> thinks that in this case we could divide the SIMD4x2 execution in 2 SIMD4x1
> instructions. That is, we generate one dvec4 instruction for each logical
> thread and use NibCtrl to fixup execution masking for the second instruction.
> 
> 3. If we implement any solution that involves getting rid of the double
> splitting pass like the ones suggested above, we need to figure out a new
> solution to address the fact that swizzles operate in 32-bit elements, since
> we are back to a situation where we have to deal with dvec4 operands.
> 


Related to this topic, the current liveness analysis code (and also the DCE optimization) just ignores the types when dealing with the swizzles. This works fine, except when a channel is read/written as a DF type (64-bit channel) and later written/read as F (32-bit channel).

For instance, for this piece of code:

    mov(8) vgrf3.0.x:UD[1], 2576980378U[1] NoMask
    mov(8) vgrf3.0.y:UD[1], 1071225241U[1] NoMask
    mov(8) vgrf3.1.x:UD[1], 2576980378U[1] NoMask
    mov(8) vgrf3.1.y:UD[1], 1071225241U[1] NoMask
    mov(8) vgrf2.0.xy:DF[2], vgrf3.0.xxxx:DF[2]

Our liveness analysis just decides that the vgrf3.0.y channel is not live anymore and thus DCE removes it. That is because in the last instruction it doesn't realize that vgrf3.0.x:DF is reading both vgrf3.0.x:F and vgrf3.0.y:F.

In order to avoid introducing too many changes in our current code, I've added a pair of commits[1][2] that basically try to check whether a 32-bit channel was previously read as 64-bit (and the other way around). This seems to fix the above problem.

But as was said in previous comments, HSW interprets all swizzle channels as 32 bits, no matter the type. Iago added a pass that expands the 64-bit swizzles into 32-bit ones. But this means that once we expand them, we need to read all the channels as 32 bits, no matter the type.

In order to control this, I've added a boolean to specify whether all the channels must be interpreted as 32-bit or not[3]. This is used both in liveness analysis and DCE.

Just dropping this comment here to get feedback on whether the approach sounds reasonable, or if we should change it.

[1] https://github.com/Igalia/mesa/commit/7bb5dd6b8263f73ecf2f10ff1efc85130cc87bcd
[2] https://github.com/Igalia/mesa/commit/06ad62a92515af242ba105e635f24e5c5117dda7
[3] https://github.com/Igalia/mesa/commit/ca8cf009aa2eb61c145211c3461e29e4bf9a62ac
Comment 89 Francisco Jerez 2016-06-06 18:16:39 UTC
(In reply to Juan A. Suarez from comment #88)
>[..]
> Related with this topic, current liveness analysis code (and also DCE
> optimization) just ignore the types when dealing with the swizzles. This
> works fine, except when reading/writing a channel from DF type (64-bit
> channel), and write/read it later as F (32-bit channel).
> 
> For instance, for this piece is code:
> 
>     mov(8) vgrf3.0.x:UD[1], 2576980378U[1] NoMask
>     mov(8) vgrf3.0.y:UD[1], 1071225241U[1] NoMask
>     mov(8) vgrf3.1.x:UD[1], 2576980378U[1] NoMask
>     mov(8) vgrf3.1.y:UD[1], 1071225241U[1] NoMask
>     mov(8) vgrf2.0.xy:DF[2], vgrf3.0.xxxx:DF[2]
> 
> Our liveness analysis just decide vgrf3.0.y channel is not live anymore and
> thus DCE removes it. That is because in latest instruction it doesn't
> realize vgrf3.0.x:DF is reading both vgrf3.0.x:F and vgrf3.0.y:F.
> 
> In order to avoid introducing too many changes in our current code, I've
> added a pair of commits[1][2] that basically try to check if a 32-bit
> channels was previously read as 64-bit (and the other way around). This
> seems to fix above problem.

Seems very sketchy.  You need to fix the current dataflow analysis logic to
flip/check as many bits per channel from the live, use or def bitsets as the
corresponding source or destination datatype takes, just like the FS back-end
handles multi-GRF defs and uses.
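
For illustration, a minimal sketch of that idea using a plain per-register channel mask (the helper name and parameters are made up; the real change would operate on the live/use/def bitsets):

  #include <stdint.h>

  /* Flag as many 32-bit channels as the access type spans, so a DF (64-bit)
   * component covers two consecutive 32-bit channels and a later 32-bit
   * access to either of them is seen as a dependency.  'channel' is the
   * logical channel index for the access type (e.g. the DF .y is channel 1). */
  static void
  mark_channel(uint32_t *chan_mask, unsigned channel, unsigned type_size_bytes)
  {
     const unsigned n = type_size_bytes / 4;      /* 2 for DF/Q/UQ, 1 for F/D/UD */
     for (unsigned i = 0; i < n; i++)
        *chan_mask |= 1u << (channel * n + i);    /* DF .x covers 32-bit channels x and y */
  }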

> But as it was said in previous comments HSW interprets all swizzles channels
> as 32 bits, no matter the type. Iago added a pass that expands the 64bit
> swizzles in 32bits. But this means once we expand them, we need to read all
> the channels as 32bits,no matter the type.

I'd recommend against mixing logical and physical swizzle representations in
the IR, because you lose the main benefit of the logical representation which
is you don't need to think about the physical representation (which is hard
because there is no longer a 1:1 correspondence between swizzle units and
instruction component units), and OTOH because some of the logical swizzle
combinations (the ones that require vstride to be set to zero) are currently
irrepresentable if you use physical swizzles.  I think the most sensible
approach would be to have the swizzle and writemask lowering passes output
IR with swizzles encoded according to the logical representation (or rather
the subset of it which can be directly supported by the hardware), and then do
the logical to physical translation at codegen time (e.g. during
convert_to_hw_regs()).

> In order to control this, I've added a boolean to specify if all the
> channels must be interpreted as 32bit or not[3]. This is used both in
> liveness analysis and DCE.

*Shudder* I'd rather not have the semantics of the IR depend on an external
boolean flag stored in the visitor class.

> Just dropping here this comment to get feedback if the approach sounds
> reasonable, or if we must change it.
> 
> [1]
> https://github.com/Igalia/mesa/commit/
> 7bb5dd6b8263f73ecf2f10ff1efc85130cc87bcd
> [2]
> https://github.com/Igalia/mesa/commit/
> 06ad62a92515af242ba105e635f24e5c5117dda7
> [3]
> https://github.com/Igalia/mesa/commit/
> ca8cf009aa2eb61c145211c3461e29e4bf9a62ac
Comment 90 Iago Toral 2016-06-07 06:24:22 UTC
(In reply to Francisco Jerez from comment #89)
> (In reply to Juan A. Suarez from comment #88)
> >[..]
> > Related with this topic, current liveness analysis code (and also DCE
> > optimization) just ignore the types when dealing with the swizzles. This
> > works fine, except when reading/writing a channel from DF type (64-bit
> > channel), and write/read it later as F (32-bit channel).
> > 
> > For instance, for this piece is code:
> > 
> >     mov(8) vgrf3.0.x:UD[1], 2576980378U[1] NoMask
> >     mov(8) vgrf3.0.y:UD[1], 1071225241U[1] NoMask
> >     mov(8) vgrf3.1.x:UD[1], 2576980378U[1] NoMask
> >     mov(8) vgrf3.1.y:UD[1], 1071225241U[1] NoMask
> >     mov(8) vgrf2.0.xy:DF[2], vgrf3.0.xxxx:DF[2]
> > 
> > Our liveness analysis just decide vgrf3.0.y channel is not live anymore and
> > thus DCE removes it. That is because in latest instruction it doesn't
> > realize vgrf3.0.x:DF is reading both vgrf3.0.x:F and vgrf3.0.y:F.
> > 
> > In order to avoid introducing too many changes in our current code, I've
> > added a pair of commits[1][2] that basically try to check if a 32-bit
> > channels was previously read as 64-bit (and the other way around). This
> > seems to fix above problem.
> 
> Seems very sketchy.  You need to fix the current dataflow analysis logic to
> flip/check as many bits per channel from the live, use or def bitsets as the
> corresponding source or destination datatype takes, just like the FS back-end
> handles multi-GRF defs and uses.
> 
> > But as it was said in previous comments HSW interprets all swizzles channels
> > as 32 bits, no matter the type. Iago added a pass that expands the 64bit
> > swizzles in 32bits. But this means once we expand them, we need to read all
> > the channels as 32bits,no matter the type.
> 
> I'd recommend against mixing logical and physical swizzle representations in
> the IR, because you lose the main benefit of the logical representation which
> is you don't need to think about the physical representation (which is hard
> because there is no longer a 1:1 correspondence between swizzle units and
> instruction component units), and OTOH because some of the logical swizzle
> combinations (the ones that require vstride to be set to zero) are currently
> irrepresentable if you use physical swizzles.  I think the most sensible
> approach would be to have the swizzle and writemask lowering passes output
> IR with swizzles encoded according to the logical representation (or rather
> the subset of it which can be directly supported by the hardware), and then
> do
> the logical to physical translation at codegen time (e.g. during
> convert_to_hw_regs()).

Yes, I was thinking about doing the same thing. If we call the swizzle translation pass from convert_to_hw_regs directly we should never have to worry about someone adding another opt pass that runs after it.

> > In order to control this, I've added a boolean to specify if all the
> > channels must be interpreted as 32bit or not[3]. This is used both in
> > liveness analysis and DCE.
> 
> *Shudder* I'd rather not have the semantics of the IR depend on an external
> boolean flag stored in the visitor class.
> 
> > Just dropping this comment here to get feedback on whether the approach
> > sounds reasonable, or whether we must change it.
> > 
> > [1] https://github.com/Igalia/mesa/commit/7bb5dd6b8263f73ecf2f10ff1efc85130cc87bcd
> > [2] https://github.com/Igalia/mesa/commit/06ad62a92515af242ba105e635f24e5c5117dda7
> > [3] https://github.com/Igalia/mesa/commit/ca8cf009aa2eb61c145211c3461e29e4bf9a62ac
Comment 91 Juan A. Suarez 2016-06-10 14:18:11 UTC
(In reply to Francisco Jerez from comment #89)
> (In reply to Juan A. Suarez from comment #88)
> >[..]
> > Related with this topic, current liveness analysis code (and also DCE
> > optimization) just ignore the types when dealing with the swizzles. This
> > works fine, except when reading/writing a channel from DF type (64-bit
> > channel), and write/read it later as F (32-bit channel).
> > 
> > For instance, for this piece of code:
> > 
> >     mov(8) vgrf3.0.x:UD[1], 2576980378U[1] NoMask
> >     mov(8) vgrf3.0.y:UD[1], 1071225241U[1] NoMask
> >     mov(8) vgrf3.1.x:UD[1], 2576980378U[1] NoMask
> >     mov(8) vgrf3.1.y:UD[1], 1071225241U[1] NoMask
> >     mov(8) vgrf2.0.xy:DF[2], vgrf3.0.xxxx:DF[2]
> > 
> > Our liveness analysis just decides that channel vgrf3.0.y is not live anymore
> > and thus DCE removes it. That is because in the last instruction it doesn't
> > realize that vgrf3.0.x:DF reads both vgrf3.0.x:F and vgrf3.0.y:F.
> > 
> > In order to avoid introducing too many changes in our current code, I've
> > added a pair of commits[1][2] that basically try to check whether a 32-bit
> > channel was previously read as 64-bit (and the other way around). This
> > seems to fix the above problem.
> 
> Seems very sketchy.  You need to fix the current dataflow analysis logic to
> flip/check as many bits per channel from the live, use or def bitsets as the
> corresponding source or destination datatype takes, just like the FS back-end
> handles multi-GRF defs and uses.
> 

Thanks for the feedback!

I threw away the previous changes and redid them following your suggestion. Now it flips/checks as many bits as the datatypes require.

I find this implementation easier to understand.

Commit is in https://github.com/Igalia/mesa/commit/59fd1cc6b99dc993e0da772674a57390a54955f3
Comment 92 Francisco Jerez 2016-06-17 20:53:19 UTC
(In reply to Juan A. Suarez from comment #91)
> (In reply to Francisco Jerez from comment #89)
> > (In reply to Juan A. Suarez from comment #88)
> > >[..]
> > > Related with this topic, current liveness analysis code (and also DCE
> > > optimization) just ignore the types when dealing with the swizzles. This
> > > works fine, except when reading/writing a channel from DF type (64-bit
> > > channel), and write/read it later as F (32-bit channel).
> > > 
> > > For instance, for this piece of code:
> > > 
> > >     mov(8) vgrf3.0.x:UD[1], 2576980378U[1] NoMask
> > >     mov(8) vgrf3.0.y:UD[1], 1071225241U[1] NoMask
> > >     mov(8) vgrf3.1.x:UD[1], 2576980378U[1] NoMask
> > >     mov(8) vgrf3.1.y:UD[1], 1071225241U[1] NoMask
> > >     mov(8) vgrf2.0.xy:DF[2], vgrf3.0.xxxx:DF[2]
> > > 
> > > Our liveness analysis just decides that channel vgrf3.0.y is not live anymore
> > > and thus DCE removes it. That is because in the last instruction it doesn't
> > > realize that vgrf3.0.x:DF reads both vgrf3.0.x:F and vgrf3.0.y:F.
> > > 
> > > In order to avoid introducing too many changes in our current code, I've
> > > added a pair of commits[1][2] that basically try to check whether a 32-bit
> > > channel was previously read as 64-bit (and the other way around). This
> > > seems to fix the above problem.
> > 
> > Seems very sketchy.  You need to fix the current dataflow analysis logic to
> > flip/check as many bits per channel from the live, use or def bitsets as the
> > corresponding source or destination datatype takes, just like the FS back-end
> > handles multi-GRF defs and uses.
> > 
> 
> Thanks for the feedback!
> 
> I threw away the previous changes and redid them following your suggestion.
> Now it flips/checks as many bits as the datatypes require.
> 
> I find this implementation easier to understand.
> 
> Commit is in
> https://github.com/Igalia/mesa/commit/59fd1cc6b99dc993e0da772674a57390a54955f3

Looks somewhat better, but we need to settle on some well-defined meaning for the bits of the live/defs/uses sets and stick to it consistently; you cannot interpret the live bitsets as either eight or four bits per GRF depending on the instruction and expect things to work smoothly. The meaning of the live bitsets should be independent of the instructions writing to or reading from them -- the dataflow tracking logic is already complicated enough without having different sections of the same live bitset represent variable liveness differently.

The previous implementation used to represent live sets in terms of logical 32-bit components, which breaks down when you use non-32-bit types in the program, or when you do any kind of SIMD splitting.  It would be possible to support DF code with only four bits per GRF by redefining the current representation slightly, so that bit '4*j + i' of the live set would refer to the logical OR of the liveness of the i-th and (i+4)-th dwords of the same GRF, but that would make the live intervals calculated for DF instructions less accurate, and it would be rather awkward to implement -- what you've done instead, allocating eight bits per GRF, looks fine (each bit represents one physical dword of the GRF), but you need to initialize (or check) all eight bits of each GRF written (or read) by the instruction regardless of whether the instruction is double- or single-precision.
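
For illustration, a minimal sketch of the per-dword bookkeeping described above (hypothetical helper, not the actual vec4 live-variables code; BITSET_SET and type_sz() are assumed to be the usual Mesa utilities). It sets as many bits per enabled channel as the datatype takes, assuming eight bits per GRF, and glosses over the rest of the per-instruction footprint handling discussed in this comment:

    /* Hypothetical sketch: mark the dwords of a GRF touched by one operand,
     * assuming the live/def/use bitsets allocate eight bits per GRF (one bit
     * per physical dword).  A 32-bit channel sets one bit and a 64-bit
     * channel sets two, so F and DF accesses to the same register are
     * accounted for consistently.
     */
    static void
    mark_channels(BITSET_WORD *set, unsigned grf, unsigned writemask,
                  enum brw_reg_type type)
    {
       const unsigned dwords = type_sz(type) / 4;   /* 1 for F/D/UD, 2 for DF */

       for (unsigned c = 0; c < 4; c++) {
          if (!(writemask & (1 << c)))
             continue;

          for (unsigned d = 0; d < dwords; d++)
             BITSET_SET(set, grf * 8 + c * dwords + d);
       }
    }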
Comment 93 Samuel Iglesias Gonsálvez 2017-06-22 08:26:26 UTC
FP64 support has already been added to both of i965's shader backends (vec4 and scalar) in Mesa master. Closing the bug.

