In order to use NIR for everything in the i965 backend (which we would like to do), we need to write a NIR -> vec4 pass. Most of the work should be fairly straightforward. The brw_fs_nir.cpp and brw_vec4_visitor.cpp files should provide a pretty good template of what needs to be done. A grep for INTEL_USE_NIR should show you how we actually plumb it through for the FS backend.

It would probably be good to start with vertex shaders on HSW or IVB, as SNB- has issues with booleans and we already have NIR doing scalar VS on BDW.

It may be a bit tricky to get NIR to generate comparable code quality. However, we can only start looking into that once we have a decent NIR -> vec4 pass, so don't worry about it for now.

I (Jason) will act as the primary contact for anyone working on this, but I'm also going to Cc Matt Turner as he knows vec4 better than I do.
SNB's boolean handling is fine, actually. Pre-SNB are the problematic ones.
Antía and I will be looking at this.
(In reply to Eduardo Lima Mitev from comment #2)
> Antía and I will be looking at this.

What is the current status on this? Is there a branch somewhere that I can look at?
We have been making progress, a bit slowly at the beginning while we were getting comfortable with NIR, but more steadily lately.

We have a very basic shader working, and the status is roughly:

- For now we are reusing the brw_vec4_visitor class, and adding the implementation of our NIR methods in a separate file, brw_vec4_nir. It is mostly boilerplate at this point, and we hook into it from brw_vec4::run().

- The load_input and store_output intrinsics are partially working. Right now we only support one attribute (expected at offset 0) and one output (gl_Position). This is for simplicity, and also because until a few days ago there was a weird bug causing const_index[0] of all inputs and outputs to be zero instead of the driver offset. I managed to track the problem down to the nir_lower_io code, but after checking against master I saw it had already been fixed. So I'm currently extending load_input and store_output to handle any number of attributes/outputs.

- ALU instructions: Antía has implemented a lot of these already.

- load_const: For this we did the same as fs_nir, allocating a new register and loading the constant on demand, as part of get_nir_src(). We then realized that this is not optimal for us, since we can emit vector instructions and save a few MOVs. So we are planning to update the current implementation to keep a map with a register for each SSA value, and pick from that table when emitting. Thoughts on that?

This is pretty much it. I have also looked into uniforms and system values, and after implementing the other intrinsics I expect those will go very similarly.

We have also been tracking changes upstream that affect us, more or less regularly. And there have been quite a few :), it feels like shooting at a moving target sometimes.

We already had plans to put together a more or less coherent branch by the end of this week, so I just created:

https://github.com/Igalia/mesa/commits/nir-vec4

after rebasing on master and fixing a few things. Notice that many of those are just dirty commits from our daily development, so it probably only makes sense to look at the branch as a snapshot of where we stand. Notice also that there is work we have done which is not in that branch, like a new implementation of the load_input and store_output intrinsics (and, partially, uniforms), and likely more ALU ops.

Thank you beforehand for the feedback!
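To make the idea concrete, a rough sketch of what we have in mind (the names here are illustrative only, not the code in our branch):

   /* Sketch only: keep one backend register per NIR SSA definition, so a
    * vec4 constant is loaded once with a vector MOV instead of being
    * re-emitted at every use. */
   src_reg nir_ssa_values[MAX_SSA_DEFS];   /* assumed table, sized elsewhere */

   void
   nir_emit_load_const(nir_load_const_instr *instr)
   {
      dst_reg reg = vgrf(instr->def.num_components);   /* assumed helper */
      emit_vector_imm(reg, instr->value.f);            /* assumed helper: one MOV
                                                          with a vector immediate */
      nir_ssa_values[instr->def.index] = src_reg(reg);
   }

   src_reg
   get_nir_src(const nir_src &src)
   {
      assert(src.is_ssa);
      return nir_ssa_values[src.ssa->index];           /* just look it up */
   }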
(In reply to Eduardo Lima Mitev from comment #4)
> We have been making progress, a bit slowly at the beginning while we were getting comfortable with NIR, but more steadily lately.
> 
> We have a very basic shader working, and the status is roughly:
> 
> - For now we are reusing the brw_vec4_visitor class, and adding the implementation of our NIR methods in a separate file, brw_vec4_nir. It is mostly boilerplate at this point, and we hook into it from brw_vec4::run().
> 
> - The load_input and store_output intrinsics are partially working. Right now we only support one attribute (expected at offset 0) and one output (gl_Position). This is for simplicity, and also because until a few days ago there was a weird bug causing const_index[0] of all inputs and outputs to be zero instead of the driver offset. I managed to track the problem down to the nir_lower_io code, but after checking against master I saw it had already been fixed. So I'm currently extending load_input and store_output to handle any number of attributes/outputs.
> 
> - ALU instructions: Antía has implemented a lot of these already.
> 
> - load_const: For this we did the same as fs_nir, allocating a new register and loading the constant on demand, as part of get_nir_src(). We then realized that this is not optimal for us, since we can emit vector instructions and save a few MOVs. So we are planning to update the current implementation to keep a map with a register for each SSA value, and pick from that table when emitting. Thoughts on that?

Yes, it's much better to use a VF immediate for the cases where the constant is a float and each component is representable in 8 bits. This should cover a lot of the common constants. The reason I did it that way in the FS backend is that a) we scalarize already, so it doesn't matter much, and b) we don't actually have a concept of vectors most of the time.

The reason for re-allocating and re-emitting every time is that the constant can be anywhere in the program (such as at the top) and we want the copy-propagation pass to be able to propagate them. Having them in different control flow can cause problems for this.

> This is pretty much it. I have also looked into uniforms and system values, and after implementing the other intrinsics I expect those will go very similarly.
> 
> We have also been tracking changes upstream that affect us, more or less regularly. And there have been quite a few :), it feels like shooting at a moving target sometimes.
> 
> We already had plans to put together a more or less coherent branch by the end of this week, so I just created:
> 
> https://github.com/Igalia/mesa/commits/nir-vec4

Thanks for putting that together. I just took a quick look through and it looks like you're making decent progress. I do have a couple of comments:

1) There are two places so far, brw_type_for_nir_type and brw_conditional_for_nir_comparison, where we would like to put things in a common place. Recently, I added a brw_nir.h file and I believe Ken added a brw_nir.c file, which would be a perfect place to put those. In fact, we could probably simplify some of the current FS NIR code with your conditionals function.

2) We probably want to have emit_intrinsic be one big switch that does everything, like we do in the FS. I know that one big switch can be annoying at times, but so is having a pile of little 3-line helpers that we have to declare in the C++ class. Not a big deal either way.

> after rebasing on master and fixing a few things. Notice that many of those are just dirty commits from our daily development, so it probably only makes sense to look at the branch as a snapshot of where we stand. Notice also that there is work we have done which is not in that branch, like a new implementation of the load_input and store_output intrinsics (and, partially, uniforms), and likely more ALU ops.
> 
> Thank you beforehand for the feedback!

Thank you for the extremely prompt (given that it was the middle of the night there) reply! Keep up the good work!
(In reply to Jason Ekstrand from comment #5)
> (In reply to Eduardo Lima Mitev from comment #4)
> > - load_const: For this we did the same as fs_nir, allocating a new register and loading the constant on demand, as part of get_nir_src(). We then realized that this is not optimal for us, since we can emit vector instructions and save a few MOVs. So we are planning to update the current implementation to keep a map with a register for each SSA value, and pick from that table when emitting. Thoughts on that?
> 
> The reason for re-allocating and re-emitting every time is that the constant can be anywhere in the program (such as at the top) and we want the copy-propagation pass to be able to propagate them. Having them in different control flow can cause problems for this.

Hmm, yes, we will have to take that into account. Pre-allocating registers for every SSA value will have to be done carefully (i.e., on a per-implementation basis) to avoid unnecessary register spilling. I will give it some thought and ask you before implementing it. For now, the current (unoptimized) solution is enough to move on.

> I do have a couple of comments:
> 
> 1) There are two places so far, brw_type_for_nir_type and brw_conditional_for_nir_comparison, where we would like to put things in a common place. Recently, I added a brw_nir.h file and I believe Ken added a brw_nir.c file, which would be a perfect place to put those. In fact, we could probably simplify some of the current FS NIR code with your conditionals function.

Yes, it is a good idea and will leave our code less cluttered. However, I will need to turn brw_nir.c into C++ to add these methods. Any caveat with that?

> 2) We probably want to have emit_intrinsic be one big switch that does everything, like we do in the FS. I know that one big switch can be annoying at times, but so is having a pile of little 3-line helpers that we have to declare in the C++ class. Not a big deal either way.

Agreed, I will change that and take note.

> Thank you for the extremely prompt (given that it was the middle of the night there) reply! Keep up the good work!

Well, it was totally worth the effort getting some code ready for you to look at. Thanks a lot for the early review!
(In reply to Eduardo Lima Mitev from comment #6)
> (In reply to Jason Ekstrand from comment #5)
> > 1) There are two places so far, brw_type_for_nir_type and brw_conditional_for_nir_comparison, where we would like to put things in a common place. Recently, I added a brw_nir.h file and I believe Ken added a brw_nir.c file, which would be a perfect place to put those. In fact, we could probably simplify some of the current FS NIR code with your conditionals function.
> 
> Yes, it is a good idea and will leave our code less cluttered. However, I will need to turn brw_nir.c into C++ to add these methods. Any caveat with that?

That's OK if we have a good reason for it. However, neither of those functions does anything that requires C++, so I don't see any good reason to change it either. They just use #defines from brw_defines.h and brw_reg.h, both of which are C-friendly.
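For instance, the comparison helper is roughly of this shape (a sketch from memory, not your exact code), and it is plain C:

   /* Map NIR comparison opcodes to the conditional-mod #defines from
    * brw_defines.h; nothing here needs C++. */
   uint32_t
   brw_conditional_for_nir_comparison(nir_op op)
   {
      switch (op) {
      case nir_op_flt: return BRW_CONDITIONAL_L;
      case nir_op_fge: return BRW_CONDITIONAL_GE;
      case nir_op_feq: return BRW_CONDITIONAL_Z;
      case nir_op_fne: return BRW_CONDITIONAL_NZ;
      /* ...integer comparisons follow the same pattern... */
      default:
         unreachable("not a comparison");
      }
   }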
Hi Jason, just to keep you updated on our progress, we are done implementing most of the core functionality:

- Control flow (impl, block, loop, jump, if, etc.)

- Most intrinsics that are relevant in the vertex shader, except load_ubo, atomic counters, and the indirect versions of uniforms, inputs and outputs. We are deferring indirects until a later stage, when we have a better understanding of them. There is a known bug in the setup of uniform values that I'm currently dealing with.

- Most ALU operations. We are currently testing them extensively, fixing bugs, etc., so this is ongoing work.

Texture-related instructions are still missing. We will probably start working on them by the end of this week. I also still need to revisit the way we emit constant values.

Regarding global values, we have been working on them using the code in brw_fs_nir as a reference, but we were not able to find any example that requires this support. As far as we can see, the combination of nir_lower_global_vars_to_local and function inlining can handle all of them. As a test/hack, we removed the lowering, though that requires supporting nir_intrinsic_load_var and store_var, which were not previously handled in brw_fs_nir. As a final test, we removed the support for globals in brw_fs_nir (the nir_globals variable and its uses) and ran piglit (the whole test suite) without finding any regressions. Are you aware of any specific case in which handling globals is still needed?

This week we will start running specific piglit/dEQP tests on our branch to get a better picture of where we stand. I foresee a lot of new tasks coming from these runs. Do you have any advice as to what (sub)set of tests we could try to cover first?

As usual, we have a snapshot of our (dirty) git tree at https://github.com/Igalia/mesa/commits/nir-vec4

Thanks!
(In reply to Eduardo Lima Mitev from comment #8)
> Hi Jason, just to keep you updated on our progress, we are done implementing most of the core functionality:
> 
> - Control flow (impl, block, loop, jump, if, etc.)
> 
> - Most intrinsics that are relevant in the vertex shader, except load_ubo, atomic counters, and the indirect versions of uniforms, inputs and outputs. We are deferring indirects until a later stage, when we have a better understanding of them. There is a known bug in the setup of uniform values that I'm currently dealing with.
> 
> - Most ALU operations. We are currently testing them extensively, fixing bugs, etc., so this is ongoing work.

Good work! I'll try and give your branch another look here soon. It sounds like you're making good progress!

> Texture-related instructions are still missing. We will probably start working on them by the end of this week. I also still need to revisit the way we emit constant values.
> 
> Regarding global values, we have been working on them using the code in brw_fs_nir as a reference, but we were not able to find any example that requires this support. As far as we can see, the combination of nir_lower_global_vars_to_local and function inlining can handle all of them. As a test/hack, we removed the lowering, though that requires supporting nir_intrinsic_load_var and store_var, which were not previously handled in brw_fs_nir. As a final test, we removed the support for globals in brw_fs_nir (the nir_globals variable and its uses) and ran piglit (the whole test suite) without finding any regressions. Are you aware of any specific case in which handling globals is still needed?

I'm not sure what you mean by "global values". If you're referring to variables with type global, you'll never see them. Before you even see the code, nir_lower_global_vars_to_local gets run and, since everything is inlined, that lowers *all* the globals to locals. The locals_to_regs pass then turns them into registers.

> This week we will start running specific piglit/dEQP tests on our branch to get a better picture of where we stand. I foresee a lot of new tasks coming from these runs. Do you have any advice as to what (sub)set of tests we could try to cover first?

I'd start by just doing a full piglit run both with and without NIR and comparing the two. My usual method is to look at the piles of crashes/failures and grab whatever looks like the simplest test. Most of the shader_runner tests need at least uniforms and two URB writes (one for vertices and one for the success/failure color) to work. If you think those are working, I'd probably start with the ones labeled "algebraic". Also, you probably want to avoid generated tests as long as you can. They tend to have very long shaders (that's why we generate them), so they can be a real pain to debug.

> As usual, we have a snapshot of our (dirty) git tree at https://github.com/Igalia/mesa/commits/nir-vec4

Thanks! I'll try and take a look!
(In reply to Jason Ekstrand from comment #9)
> (In reply to Eduardo Lima Mitev from comment #8)
> > Regarding global values, we have been working on them using the code in brw_fs_nir as a reference, but we were not able to find any example that requires this support. As far as we can see, the combination of nir_lower_global_vars_to_local and function inlining can handle all of them. As a test/hack, we removed the lowering, though that requires supporting nir_intrinsic_load_var and store_var, which were not previously handled in brw_fs_nir. As a final test, we removed the support for globals in brw_fs_nir (the nir_globals variable and its uses) and ran piglit (the whole test suite) without finding any regressions. Are you aware of any specific case in which handling globals is still needed?
> 
> I'm not sure what you mean by "global values". If you're referring to variables with type global, you'll never see them. Before you even see the code, nir_lower_global_vars_to_local gets run and, since everything is inlined, that lowers *all* the globals to locals. The locals_to_regs pass then turns them into registers.

OK, that explains it. Indeed I was talking about global variables, and we also suspected that all of those get lowered to locals. Apparently we just got confused between global variables and the global register handling in fs_nir. It is clear now.

> > This week we will start running specific piglit/dEQP tests on our branch to get a better picture of where we stand. I foresee a lot of new tasks coming from these runs. Do you have any advice as to what (sub)set of tests we could try to cover first?
> 
> I'd start by just doing a full piglit run both with and without NIR and comparing the two. My usual method is to look at the piles of crashes/failures and grab whatever looks like the simplest test. Most of the shader_runner tests need at least uniforms and two URB writes (one for vertices and one for the success/failure color) to work. If you think those are working, I'd probably start with the ones labeled "algebraic". Also, you probably want to avoid generated tests as long as you can. They tend to have very long shaders (that's why we generate them), so they can be a real pain to debug.

Great, we will take this into account. Thanks!

> > As usual, we have a snapshot of our (dirty) git tree at https://github.com/Igalia/mesa/commits/nir-vec4
> 
> Thanks! I'll try and take a look!

I found out that my implementation of uniform setup and loading is completely buggy, so I'm fixing it now. Just so you know.
This is odd. In some piglit tests (like "bin/glean -o -v -v -v -t +api2"), I'm getting a nir_shader with no uniforms declared, yet it loads and uses some:

NIR (final form) for vertex shader:
decl_var shader_in vec4 in_0 (0, 0)
decl_var shader_out vec4 out_0 (0, 0)
decl_overload main returning void

impl main {
   decl_reg vec4 r2
   decl_reg vec4 r3
   decl_reg vec1 r4
   decl_reg vec4 r5
   decl_reg vec1 r6
   decl_reg vec4 r7
   decl_reg vec1 r8
   decl_reg vec4 r9
   decl_reg vec1 r10
   decl_reg vec4 r11
   block block_0:
   /* preds: */
   r2 = intrinsic load_input () () (0, 1)
   r3 = intrinsic load_uniform () () (0, 1)
   r4 = fdot4 r2, r3
   r5 = intrinsic load_uniform () () (4, 1)
   r6 = fdot4 r2, r5
   r7 = intrinsic load_uniform () () (8, 1)
   r8 = fdot4 r2, r7
   r9 = intrinsic load_uniform () () (12, 1)
   r10 = fdot4 r2, r9
   r11 = vec4 r4, r6, r8, r10
   intrinsic store_output (r11) () (0, 1)
   /* succs: block_1 */
   block block_1:
}

This looks like a bug to me. Am I missing something?
(In reply to Eduardo Lima Mitev from comment #11)
> This is odd. In some piglit tests (like "bin/glean -o -v -v -v -t +api2"), I'm getting a nir_shader with no uniforms declared, yet it loads and uses some:
> 
> NIR (final form) for vertex shader:
> decl_var shader_in vec4 in_0 (0, 0)
> decl_var shader_out vec4 out_0 (0, 0)
> decl_overload main returning void
> 
> impl main {
>    decl_reg vec4 r2
>    decl_reg vec4 r3
>    decl_reg vec1 r4
>    decl_reg vec4 r5
>    decl_reg vec1 r6
>    decl_reg vec4 r7
>    decl_reg vec1 r8
>    decl_reg vec4 r9
>    decl_reg vec1 r10
>    decl_reg vec4 r11
>    block block_0:
>    /* preds: */
>    r2 = intrinsic load_input () () (0, 1)
>    r3 = intrinsic load_uniform () () (0, 1)
>    r4 = fdot4 r2, r3
>    r5 = intrinsic load_uniform () () (4, 1)
>    r6 = fdot4 r2, r5
>    r7 = intrinsic load_uniform () () (8, 1)
>    r8 = fdot4 r2, r7
>    r9 = intrinsic load_uniform () () (12, 1)
>    r10 = fdot4 r2, r9
>    r11 = vec4 r4, r6, r8, r10
>    intrinsic store_output (r11) () (0, 1)
>    /* succs: block_1 */
>    block block_1:
> }
> 
> This looks like a bug to me. Am I missing something?

Taking a wild stab in the dark, I'm going to guess that you're hitting an ARB or fixed-function program. Those come in via the prog_to_nir path and the uniform setup is a bit different.
Yeah, that definitely looks bizarre - but I think it's expected.

Fixed-function vertex programs and ARB programs go through prog_to_nir.c (the Mesa IR -> NIR translator), which generates load_uniform intrinsics directly, instead of creating uniform variables then having nir_lower_io lower variable access to intrinsics. It just sets nir->num_uniforms directly, and brw_nir.c skips the part that inspects uniform variables when dealing with FF/ARB programs.

   /* Set the number of uniforms */
   shader->num_uniforms = 4 * c->prog->Parameters->NumParameters;

Sorry for the confusion :)
(In reply to Kenneth Graunke from comment #13)
> Yeah, that definitely looks bizarre - but I think it's expected.
> 
> Fixed-function vertex programs and ARB programs go through prog_to_nir.c (the Mesa IR -> NIR translator), which generates load_uniform intrinsics directly, instead of creating uniform variables then having nir_lower_io lower variable access to intrinsics. It just sets nir->num_uniforms directly, and brw_nir.c skips the part that inspects uniform variables when dealing with FF/ARB programs.
> 
>    /* Set the number of uniforms */
>    shader->num_uniforms = 4 * c->prog->Parameters->NumParameters;

Thanks Jason and Kenneth! I managed to fix my implementation based on your comments.

Just for the sake of clarity, I wonder if it is worth "fixing" the inconsistency in nir_shader for these cases, perhaps injecting the missing declarations or something like that. Dunno, probably not.
(In reply to Eduardo Lima Mitev from comment #14)
> (In reply to Kenneth Graunke from comment #13)
> > Yeah, that definitely looks bizarre - but I think it's expected.
> > 
> > Fixed-function vertex programs and ARB programs go through prog_to_nir.c (the Mesa IR -> NIR translator), which generates load_uniform intrinsics directly, instead of creating uniform variables then having nir_lower_io lower variable access to intrinsics. It just sets nir->num_uniforms directly, and brw_nir.c skips the part that inspects uniform variables when dealing with FF/ARB programs.
> > 
> >    /* Set the number of uniforms */
> >    shader->num_uniforms = 4 * c->prog->Parameters->NumParameters;
> 
> Thanks Jason and Kenneth! I managed to fix my implementation based on your comments.
> 
> Just for the sake of clarity, I wonder if it is worth "fixing" the inconsistency in nir_shader for these cases, perhaps injecting the missing declarations or something like that. Dunno, probably not.

I don't think so. I think what we want is a longer-term fix that provides some way to declare in the nir_shader what uniform ranges get used and what to do with them. Unfortunately, that's kind of ill-defined at the moment. I'm hoping to get around to making it better-defined soon.
Hi Jason,

I have tracked most of our current piglit failures (>800 test fails, a few crashes and a GPU hang) down to this loop in get_mul_for_src (nir_opt_peephole_ffma.c), which runs when trying to produce a multiply-add operation:

   for (unsigned i = 0; i < 4; i++) {
      if (!(alu->dest.write_mask & (1 << i)))
         break;

      swizzle[i] = swizzle[src->swizzle[i]];
   }

Does this early break look correct to you? If I remove it I don't see any piglit regressions, but I do get >800 new piglit passes with our NIR->vec4 pass (including the GPU hang and a number of the crashes).

In a debug session I see that we are doing a multiply with only one component on both operands, and the add uses all 4 components. Because the mul part only uses one channel, it has a writemask of 1, which triggers the early break in the loop, producing an output swizzle of 0123 instead of 0000 (which is the swizzle we get without the early break). This looks strange to me, because src.swizzle, which represents the swizzle of the mul operation as a source of the add operation, is 0000.

In this particular case that leads to a crash later on, because we end up using that swizzle to compute the following for operand 1 of the multiply-add:

   ffma->src[1].swizzle = 0123
   ffma->src[1].src.ssa->num_components = 1

which then breaks this assertion in validate_alu_src:

   if (nir_alu_instr_channel_used(instr, index, i))
      assert(src->swizzle[i] < num_components);

It seems clear that the swizzle we get for ffma->src[1] is not right and we would expect 0000 here. Removing the early break fixes that, along with a whole lot of other issues, but I wonder if that is a proper fix or just a workaround for something else lurking elsewhere... What do you think?
(In reply to Iago Toral from comment #16)
> Hi Jason,
> 
> I have tracked most of our current piglit failures (>800 test fails, a few crashes and a GPU hang) down to this loop in get_mul_for_src (nir_opt_peephole_ffma.c), which runs when trying to produce a multiply-add operation:
> 
>    for (unsigned i = 0; i < 4; i++) {
>       if (!(alu->dest.write_mask & (1 << i)))
>          break;
> 
>       swizzle[i] = swizzle[src->swizzle[i]];
>    }

Right... That one's tricky. We should be breaking based on which channels are read rather than written. There are two options here: we can either use the writemask or the number of channels used by the *add*. Since this pass always runs with the shader in SSA form, it doesn't matter which you use, so long as you use the information from the add.
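For example, roughly (this is just a sketch, not a finished patch; num_components would have to be passed into get_mul_for_src from the add's side):

   /* num_components describes how many channels the add reads through this
    * source; in SSA form the mul's writemask tells us nothing useful here. */
   for (unsigned i = 0; i < num_components; i++)
      swizzle[i] = swizzle[src->swizzle[i]];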
It has been a few weeks since my last update on the status of this work, so here it goes:

* Support for all instructions is already in place, except for texture operations. We currently have partial support for texture operations and are working hard to finish the remaining ones and start testing them all.

* We are passing a large set of piglit tests:

$ INTEL_USE_NIR=1 ./piglit-run.py tests/all ./results/nir-vec4/ -x "tex-" -x texture -x teximage -x texcolor -x texel -x texwrap
[23905/23905] crash: 9, fail: 147, pass: 14755, skip: 8990, warn: 4
Thank you for running Piglit!

Our branch shows only ~81 regressions in piglit compared to vec4_visitor. We are excluding texture-related tests as best we can, but it is very likely that some of these 81 regressions are due to the missing texture-op support.

In dEQP we still have a number of regressions, but for various reasons we have been unable to quantify them (the test runs abort). So we are focusing on being regression-free on piglit first.

As usual, a snapshot of our (work-in-progress) branch can be found at https://github.com/Igalia/mesa/commits/nir-vec4. We hope to start cleaning up and squashing patches very soon, and I will try to keep you posted more regularly.
Last week we finished texture support, and today we fixed *all* the remaining piglit regressions compared to the vec4_visitor path.

We still have some work left improving certain implementations and getting rid of some FIXMEs, but we are now starting to clean up and squash patches.

https://github.com/Igalia/mesa/commits/nir-vec4
(In reply to Eduardo Lima Mitev from comment #19)
> Last week we finished texture support, and today we fixed *all* the remaining piglit regressions compared to the vec4_visitor path.
> 
> We still have some work left improving certain implementations and getting rid of some FIXMEs, but we are now starting to clean up and squash patches.
> 
> https://github.com/Igalia/mesa/commits/nir-vec4

Good work! I look forward to seeing the patches. I haven't had much of a chance to look over things for a while. At this point, I'll probably just wait for patches unless you want feedback on something specific prior to sending them.

Once we get the initial pass merged, it would be good to spend some time looking at how the NIR pass performs relative to the old visitor code. One tool we have for this is shader-db. Matt ran shader-db on your branch yesterday and the instruction count is at +1% vs. non-NIR in affected programs. That is actually pretty fantastic. When we initially did the FS backend, we started out at something like +4%, and when we initially did scalar VS, it started at +6%. It shouldn't take terribly long to figure out the major pain points and get them fixed. Talk to me, Matt, or Ken offline and we'll find a way to get shader-db to you.

Improving the shader-db numbers is *not* a prerequisite for getting it merged, just a prerequisite for turning it on by default.
(In reply to Jason Ekstrand from comment #20)
> (In reply to Eduardo Lima Mitev from comment #19)
> > We still have some work left improving certain implementations and getting rid of some FIXMEs, but we are now starting to clean up and squash patches.
> > 
> > https://github.com/Igalia/mesa/commits/nir-vec4
> 
> Good work! I look forward to seeing the patches. I haven't had much of a chance to look over things for a while. At this point, I'll probably just wait for patches unless you want feedback on something specific prior to sending them.

Yes, that sounds good. We are preparing the patch set and hope to send it this week already.

The only controversial part at this point is the uniform support. The way vec4_visitor and fs_nir handle the uniform file differs significantly. Right now we have the vec4_visitor implementation, but to make it work well with NIR we have had to patch NIR directly in some rather hackish ways. So Iago and Samuel are evaluating a full rewrite of the uniform support to make it similar to the fs_nir layout instead. I will let them comment on it directly, if/when they see fit.

> Once we get the initial pass merged, it would be good to spend some time looking at how the NIR pass performs relative to the old visitor code. One tool we have for this is shader-db.

Yes, we know shader-db and have used it already. We will definitely be looking into performance after we get the support in.

> Matt ran shader-db on your branch yesterday and the instruction count is at +1% vs. non-NIR in affected programs. That is actually pretty fantastic.
> ...
> It shouldn't take terribly long to figure out the major pain points and get them fixed. Talk to me, Matt, or Ken offline and we'll find a way to get shader-db to you.

OK, great. Thank you both for the offer.

> Improving the shader-db numbers is *not* a prerequisite for getting it merged, just a prerequisite for turning it on by default.

Yes, it makes perfect sense.
(In reply to Eduardo Lima Mitev from comment #21)
> (In reply to Jason Ekstrand from comment #20)
> > (In reply to Eduardo Lima Mitev from comment #19)
> > > We still have some work left improving certain implementations and getting rid of some FIXMEs, but we are now starting to clean up and squash patches.
> > > 
> > > https://github.com/Igalia/mesa/commits/nir-vec4
> > 
> > Good work! I look forward to seeing the patches. I haven't had much of a chance to look over things for a while. At this point, I'll probably just wait for patches unless you want feedback on something specific prior to sending them.
> 
> Yes, that sounds good. We are preparing the patch set and hope to send it this week already.

The current implementation only targets GLSL vertex shaders. It does not cover geometry shaders or ARB_vertex_program; our plan is to tackle those in a second iteration while we get feedback on the GLSL vertex shader implementation.

> The only controversial part at this point is the uniform support. The way vec4_visitor and fs_nir handle the uniform file differs significantly. Right now we have the vec4_visitor implementation, but to make it work well with NIR we have had to patch NIR directly in some rather hackish ways. So Iago and Samuel are evaluating a full rewrite of the uniform support to make it similar to the fs_nir layout instead. I will let them comment on it directly, if/when they see fit.

I'll explain this issue in a separate comment.
Jason, about the stuff with uniforms that has been bugging me:

The thing is that the vec4 backend and the nir_lower_io pass do not seem to work very well together for uniforms. The reason is that the vec4 backend assumes that uniform elements are vec4-sized, while nir_lower_io (specifically, the get_io_offset function) doesn't. This creates a problem because the offsets we compute in NIR are not in the same units as what the vec4 backend expects.

The get_io_offset function accumulates the constant part of the offset into the uniform we are accessing (the base offset) and then emits code to compute the variable indexing part (if we have non-constant array access). Both use a type_size() function that does not measure sizes in units of vec4 like the current vec4 implementation does. This creates a problem specifically when we have uniform arrays with indirect accesses, since the way the vec4 backend handles these assumes that each uniform element is vec4-sized, padding it if it isn't.

Initially, I fixed this by patching get_io_offset so that, as soon as it was computing an offset into something with indirect indexing in a vertex shader, it would use vec4 units for all the size computations in get_io_offset. That solution worked fine, but it did not look very clean to me (I imagine we do not want NIR to deal with this sort of thing, since the decision to pad uniform elements to vec4 size seems like a backend thing to me, and other drivers could decide not to do that), so I was trying to find a different solution.

The current implementation manages to get away without patching get_io_offset (for the most part) by handling the difference between what it needs and what nir_lower_io delivers in the nir_vec4 backend. The only situation where we can't do this is when we have indirect indexing into an array (more on that later). This solution is cleaner in the sense that it does not have to modify NIR as much (only when indirect indexing is involved), but at the same time it feels like the nir_vec4 backend has to do extra work only because nir_lower_io is not doing exactly what it needs, so that does not feel completely right to me either, but maybe it is unavoidable.

If indirect indexing into arrays is present, however, we still need to patch get_io_offset, because in that case the function emits code directly to multiply the indirect index by the size of the array elements, which we need in units of vec4. I guess this diff with the changes we currently have to nir_lower_io.c should make the issue clear: http://pastebin.com/iqGSabsv

Therefore, lately I have been thinking that the only way to avoid this is to have the vec4 backend work without assuming that uniform elements are vec4-sized. I started a quick experiment to see what this would involve. Basically, I modified things so that uniform_size[] does not compute sizes in vec4 units, then modified the uniform setup code so that we don't pad uniform elements smaller than a vec4 with zeros. This also required small changes in move_uniform_array_access_to_pull_constants, because that also assumes uniform values padded to a vec4 size, of course, and I think we would also need to modify brw_wm_surface_state.c so that we don't create constant surfaces for uniforms with BRW_SURFACEFORMAT_R32G32B32A32_FLOAT.
Then I noticed that the sampler messages we use to pull uniform array data with a variable index also expect uniform data padded to a vec4 size, since they use oword offsets (so I hacked this to use unaligned messages that take a byte offset instead). Then I realized that uniforms loaded as push constants also need to be padded to a vec4 size, and I am still not sure about the implications of changing that... and I am sure there will be plenty of other things that would need to be fixed.

The bottom line is that going down this path looks like a rather big change and a significant departure from what the current vec4_visitor does, which means that even if we do it successfully, maintaining both implementations until we can get rid of the old vec4_visitor would probably be a pain.

So at this point I am not sure about the best approach to follow. I think our current patch for nir_lower_io with indirect array accesses is a bit ugly, but at the same time it is small, quite self-contained, and it works. The alternative seems to be a large change that would make the nir-vec4 backend quite different from the current vec4_visitor, and that also looks troublesome (and I guess it would also take some time to get working), so I am not sure if it is a good idea right now or if there is a better solution to this problem that I am missing.

What do you think?
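To make the unit mismatch concrete, a vec4-sized type_size would look roughly like this (just a sketch, not the patch in our branch; struct and sampler handling elided):

   /* Measure types in vec4 slots, so that e.g. a float array element still
    * occupies a full slot, matching the padding the vec4 backend assumes. */
   static int
   type_size_vec4(const struct glsl_type *type)
   {
      if (type->is_array())
         return type->length * type_size_vec4(type->fields.array);
      if (type->is_matrix())
         return type->matrix_columns;
      /* structs would sum their members, each padded to a full slot */
      return 1;   /* scalars and vectors take one vec4 slot */
   }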
(In reply to Iago Toral from comment #23)
> Jason, about the stuff with uniforms that has been bugging me:
> 
> The thing is that the vec4 backend and the nir_lower_io pass do not seem to work very well together for uniforms. The reason is that the vec4 backend assumes that uniform elements are vec4-sized, while nir_lower_io (specifically, the get_io_offset function) doesn't. This creates a problem because the offsets we compute in NIR are not in the same units as what the vec4 backend expects.

[...]

> Initially, I fixed this by patching get_io_offset so that, as soon as it was computing an offset into something with indirect indexing in a vertex shader, it would use vec4 units for all the size computations in get_io_offset. That solution worked fine, but it did not look very clean to me (I imagine we do not want NIR to deal with this sort of thing, since the decision to pad uniform elements to vec4 size seems like a backend thing to me, and other drivers could decide not to do that), so I was trying to find a different solution.

[...]

> So at this point I am not sure about the best approach to follow. I think our current patch for nir_lower_io with indirect array accesses is a bit ugly, but at the same time it is small, quite self-contained, and it works. The alternative seems to be a large change that would make the nir-vec4 backend quite different from the current vec4_visitor, and that also looks troublesome (and I guess it would also take some time to get working), so I am not sure if it is a good idea right now or if there is a better solution to this problem that I am missing.
> 
> What do you think?

The plan Connor and I have had is to rename nir_lower_io to nir_lower_io_scalar and add a nir_lower_io_vec4 for vec4-based backends. Hopefully, those can share most of their code. The whole point of nir_lower_io is to go from variables and derefs to something that is nice for drivers to work with. If that means we need a vec4 version in order to be nice to vec4-based backends, that's fine; Intel isn't the only hardware with vec4-based shaders.
(In reply to Jason Ekstrand from comment #24)
> (In reply to Iago Toral from comment #23)
> > Jason, about the stuff with uniforms that has been bugging me:
> > 
> > The thing is that the vec4 backend and the nir_lower_io pass do not seem to work very well together for uniforms. The reason is that the vec4 backend assumes that uniform elements are vec4-sized, while nir_lower_io (specifically, the get_io_offset function) doesn't. This creates a problem because the offsets we compute in NIR are not in the same units as what the vec4 backend expects.
> 
> [...]
> 
> > Initially, I fixed this by patching get_io_offset so that, as soon as it was computing an offset into something with indirect indexing in a vertex shader, it would use vec4 units for all the size computations in get_io_offset. That solution worked fine, but it did not look very clean to me (I imagine we do not want NIR to deal with this sort of thing, since the decision to pad uniform elements to vec4 size seems like a backend thing to me, and other drivers could decide not to do that), so I was trying to find a different solution.
> 
> [...]
> 
> > So at this point I am not sure about the best approach to follow. I think our current patch for nir_lower_io with indirect array accesses is a bit ugly, but at the same time it is small, quite self-contained, and it works. The alternative seems to be a large change that would make the nir-vec4 backend quite different from the current vec4_visitor, and that also looks troublesome (and I guess it would also take some time to get working), so I am not sure if it is a good idea right now or if there is a better solution to this problem that I am missing.
> > 
> > What do you think?
> 
> The plan Connor and I have had is to rename nir_lower_io to nir_lower_io_scalar and add a nir_lower_io_vec4 for vec4-based backends. Hopefully, those can share most of their code. The whole point of nir_lower_io is to go from variables and derefs to something that is nice for drivers to work with. If that means we need a vec4 version in order to be nice to vec4-based backends, that's fine; Intel isn't the only hardware with vec4-based shaders.

Nice, that makes sense. In that case I guess we will leave our implementation as is and modify it when that plan materializes.
(In reply to Iago Toral from comment #25)
> (In reply to Jason Ekstrand from comment #24)
> > (In reply to Iago Toral from comment #23)
> > > Jason, about the stuff with uniforms that has been bugging me:
> > > 
> > > The thing is that the vec4 backend and the nir_lower_io pass do not seem to work very well together for uniforms. The reason is that the vec4 backend assumes that uniform elements are vec4-sized, while nir_lower_io (specifically, the get_io_offset function) doesn't. This creates a problem because the offsets we compute in NIR are not in the same units as what the vec4 backend expects.
> > 
> > [...]
> > 
> > > Initially, I fixed this by patching get_io_offset so that, as soon as it was computing an offset into something with indirect indexing in a vertex shader, it would use vec4 units for all the size computations in get_io_offset. That solution worked fine, but it did not look very clean to me (I imagine we do not want NIR to deal with this sort of thing, since the decision to pad uniform elements to vec4 size seems like a backend thing to me, and other drivers could decide not to do that), so I was trying to find a different solution.
> > 
> > [...]
> > 
> > > So at this point I am not sure about the best approach to follow. I think our current patch for nir_lower_io with indirect array accesses is a bit ugly, but at the same time it is small, quite self-contained, and it works. The alternative seems to be a large change that would make the nir-vec4 backend quite different from the current vec4_visitor, and that also looks troublesome (and I guess it would also take some time to get working), so I am not sure if it is a good idea right now or if there is a better solution to this problem that I am missing.
> > > 
> > > What do you think?
> > 
> > The plan Connor and I have had is to rename nir_lower_io to nir_lower_io_scalar and add a nir_lower_io_vec4 for vec4-based backends. Hopefully, those can share most of their code. The whole point of nir_lower_io is to go from variables and derefs to something that is nice for drivers to work with. If that means we need a vec4 version in order to be nice to vec4-based backends, that's fine; Intel isn't the only hardware with vec4-based shaders.
> 
> Nice, that makes sense. In that case I guess we will leave our implementation as is and modify it when that plan materializes.

It sounded, from what you said above, like you had already modified nir_lower_io to handle vec4. If that's the case, just go ahead and do the rename and add the vec4 version. No reason to wait for me or Connor to do it.
(In reply to Jason Ekstrand from comment #26)
> It sounded, from what you said above, like you had already modified nir_lower_io to handle vec4. If that's the case, just go ahead and do the rename and add the vec4 version. No reason to wait for me or Connor to do it.

Yeah, I think your approach is great.

Instead of passing a gl_shader_stage and checking for != MESA_SHADER_FRAGMENT, we should just pass a boolean indicating whether we should make things scalar or pad to vec4s. Or have nir_lower_io_scalar and nir_lower_io_vec4 functions (effectively the same thing).

That way, i965's scalar-or-not choices aren't baked into the generic code, and everybody can pick what they want.
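Roughly this shape (names as proposed above; the shared worker and its type_size callback are only a sketch, not committed API):

   /* One shared worker, parameterized on how types are sized. */
   static void
   lower_io_impl(nir_shader *shader, int (*type_size)(const struct glsl_type *));

   void
   nir_lower_io_scalar(nir_shader *shader)
   {
      lower_io_impl(shader, type_size_scalar);   /* sizes in components */
   }

   void
   nir_lower_io_vec4(nir_shader *shader)
   {
      lower_io_impl(shader, type_size_vec4);     /* sizes in vec4 slots */
   }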
(In reply to Kenneth Graunke from comment #27)
> (In reply to Jason Ekstrand from comment #26)
> > It sounded, from what you said above, like you had already modified nir_lower_io to handle vec4. If that's the case, just go ahead and do the rename and add the vec4 version. No reason to wait for me or Connor to do it.

Partially: I did that only when indirect indexing is involved; otherwise we can work with scalar units, which is not great. In any case, it makes sense that I add the change as part of the series, so I'll do that.

> Yeah, I think your approach is great.
> 
> Instead of passing a gl_shader_stage and checking for != MESA_SHADER_FRAGMENT, we should just pass a boolean indicating whether we should make things scalar or pad to vec4s. Or have nir_lower_io_scalar and nir_lower_io_vec4 functions (effectively the same thing).
> 
> That way, i965's scalar-or-not choices aren't baked into the generic code, and everybody can pick what they want.

Yeah, that sounds good to me. Thanks for the feedback!
Hi all, FYI we submitted the series to the mailing list late last week:

http://lists.freedesktop.org/archives/mesa-dev/2015-June/087448.html

We are now focusing on extending the NIR->vec4 pass to support geometry shaders and ARB_vertex_program shaders.

A tree containing the patch set we submitted can be found at:

https://github.com/Igalia/mesa/tree/nir-vec4-v1
Hi Eduardo, Antía,

FYI, I've mostly gotten SIMD8 Geometry Shaders (new on Broadwell) working - which is the first backend using NIR for geometry shaders. You can find my work-in-progress branch here:

http://cgit.freedesktop.org/~kwg/mesa/log/?h=simd8gs

I've done things a little differently than the old vec4 backend - namely, I moved the vertex count (how many times EmitVertex() has been called) into NIR, introducing new intrinsics.

My thinking is that NIR's optimizations should be able to constant fold/propagate the counter values, so in common/simple cases, NIR will simply eliminate the "safety check" if statements, as they'll be if (true). Knowing that the vertex count is an immediate may allow us to avoid per-slot-offset messages as well, and should also help us use the "Static Vertex Count" feature on Broadwell.

It's not quite done yet: I haven't actually added code to do safety checks yet; there are a few Piglit regressions still, and a few tests try to use 700 registers for the payload data :) But I figured I'd share it so you can see where I'm going with this, and in case you wanted to reuse the NIR intrinsics.
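To illustrate what I mean (the names here are hypothetical stand-ins, not the intrinsics in my branch), each EmitVertex() conceptually lowers to:

   static void emit_vertex_with_count(unsigned count);   /* stand-in for the
                                                            counter-carrying intrinsic */

   static void
   emit_vertex_lowered(unsigned *vertex_count, unsigned max_vertices)
   {
      if (*vertex_count < max_vertices) {        /* the "safety check" */
         emit_vertex_with_count(*vertex_count);
         *vertex_count += 1;
      }
   }

When a shader calls EmitVertex() a fixed number of times, vertex_count is a known constant at each call site, so constant folding turns the condition into if (true) and the check disappears.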
(In reply to Kenneth Graunke from comment #30)
> Hi Eduardo, Antía,
> 
> FYI, I've mostly gotten SIMD8 Geometry Shaders (new on Broadwell) working - which is the first backend using NIR for geometry shaders. You can find my work-in-progress branch here:
> 
> http://cgit.freedesktop.org/~kwg/mesa/log/?h=simd8gs
> 
> I've done things a little differently than the old vec4 backend - namely, I moved the vertex count (how many times EmitVertex() has been called) into NIR, introducing new intrinsics.
> 
> My thinking is that NIR's optimizations should be able to constant fold/propagate the counter values, so in common/simple cases, NIR will simply eliminate the "safety check" if statements, as they'll be if (true). Knowing that the vertex count is an immediate may allow us to avoid per-slot-offset messages as well, and should also help us use the "Static Vertex Count" feature on Broadwell.
> 
> It's not quite done yet: I haven't actually added code to do safety checks yet; there are a few Piglit regressions still, and a few tests try to use 700 registers for the payload data :) But I figured I'd share it so you can see where I'm going with this, and in case you wanted to reuse the NIR intrinsics.

Hi Ken, thanks for sharing this early. Eduardo and Antía are on holidays this week, but they will look into it as soon as they are back.

Since the new lowering pass will lower all instances of nir_intrinsic_emit_vertex to the new intrinsics, we will have to adapt our GS implementation. Our plan was to send a v2 of our patches next week, which would include support for vec4 GS using the old intrinsics, but seeing this, maybe it makes more sense to wait until your work gets merged. What do you think?
Ah, if you've already written code to use the old intrinsics, then feel free to proceed with that. They should work fine. I suspect vec4-nir will be mergeable before my scalar-GS work. Once vec4-nir lands, I can always write patches to introduce/use my new intrinsics in the existing vec4 backend, which would prove whether they're useful or not.
Hi, just a heads-up that today we submitted the second version of the series: http://lists.freedesktop.org/archives/mesa-dev/2015-July/089576.html
A little update: the backend is already upstream. Jason reviewed the series and, after some polishing, pushed it upstream himself (thanks!).

There were some regressions on gen4, gen5 and gen8 which we (the Igalia team) couldn't detect due to lack of available hardware. However, Jason sent a mini-series fixing all of these issues: <http://lists.freedesktop.org/archives/mesa-dev/2015-August/090617.html>. With those patches, the backend can be considered to have zero regressions on all Intel gens.

Now we have started working on improving the quality of the emitted code, so that we don't perform worse than vec4_visitor. This is a requirement for activating the backend by default. I think we can keep this bug open to track that work too, but I'm OK with closing this one and opening a new bug just for the optimization work.

For reference, this is more or less where we stand today comparing the NIR-vec4 path against non-NIR, on HSW:

total instructions in shared programs: 1853749 -> 1838155 (-0.84%)
instructions in affected programs: 1707212 -> 1691618 (-0.91%)
helped: 5310
HURT: 13531
GAINED: 0
LOST: 0

In the coming weeks we are going to work on identifying optimization opportunities and implementing them. We will create a series of patches and at some point submit them to mesa-dev for review. If you want to follow progress or do early testing, we will be pushing to this branch:

https://github.com/Igalia/mesa/tree/nir-vec4-quality

Cheers
About time for another update: the NIR-vec4 backend is the default now. Even though there are a considerable number of code-quality regressions compared to the vec4_visitor backend, it has been made the default already so that it hits the next release window and gets proper testing.

For reference, here is where we stand now compared to the non-NIR path:

total instructions in shared programs: 1853747 -> 1801527 (-2.82%)
instructions in affected programs: 1694180 -> 1641960 (-3.08%)
helped: 6913
HURT: 10932
GAINED: 0
LOST: 0

At this point, the huge majority of the regressions come from lack of register coalescing. To illustrate with a simple example, chunks like this are what NIR is giving to backends:

r2 = fdot4 r0, r1
r3.x = imov r2.x

when we could simply have:

r3.x = fdot4 r0, r1

This is mostly nir_lower_vec_to_movs's fault. While this behavior can and should be corrected by the opt_register_coalesce pass, we have been trying to mitigate the problem at the NIR level too, for two reasons: 1) to give backends better, more optimized code that uses fewer registers, and 2) since NIR is shared by different backends, it is desirable to provide an optimization pass that more backends could potentially benefit from.

At this point we have produced a new NIR pass, which we called 'nir_lower_vec_and_coalesce'. It takes the NIR shader right after SSA and tries to coalesce registers by propagating the components of a vecN destination into the instructions that define its sources. The pass produces new, reduced vecN instructions containing the channels that were not propagated. Hence, this pass is compatible with lower_vec_to_movs(), and if in the future we decide to disable it, we can still use lower_vec_and_coalesce transparently.

Right now the pass is very conservative about the conditions under which it will propagate register components. The reason is that at this point we are very interested in feedback about the whole idea before we add more complexity to the pass.

The branch is here:
https://github.com/Igalia/mesa/commits/elima/nir-vec4-quality

There are only 2 patches: one that adds the pass to NIR, and a second that just activates it on i965 for non-scalar shaders. Only the first one is interesting.

With this pass enabled, we get these shader-db VS results against vec4_visitor:

total instructions in shared programs: 1853747 -> 1762126 (-4.94%)
instructions in affected programs: 1681255 -> 1589634 (-5.45%)
helped: 7751
HURT: 9344
GAINED: 0
LOST: 0

And these against NIR-vec4 as in current master:

total instructions in shared programs: 1801527 -> 1762126 (-2.19%)
instructions in affected programs: 1156923 -> 1117522 (-3.41%)
helped: 10283
HURT: 1281
GAINED: 0
LOST: 0

It is not very impressive, but there are a lot of clear opportunities left out that would add complexity that is maybe not worth it at this point.

There are no piglit or dEQP regressions observed.

What do you think?
(In reply to Eduardo Lima Mitev from comment #35)
> At this point we have produced a new NIR pass, which we called 'nir_lower_vec_and_coalesce'. It takes the NIR shader right after SSA and tries to coalesce registers by propagating the components of a vecN destination into the instructions that define its sources. The pass produces new, reduced vecN instructions containing the channels that were not propagated. Hence, this pass is compatible with lower_vec_to_movs(), and if in the future we decide to disable it, we can still use lower_vec_and_coalesce transparently.
> 
> Right now the pass is very conservative about the conditions under which it will propagate register components. The reason is that at this point we are very interested in feedback about the whole idea before we add more complexity to the pass.
> 
> The branch is here:
> https://github.com/Igalia/mesa/commits/elima/nir-vec4-quality
> 
> There are only 2 patches: one that adds the pass to NIR, and a second that just activates it on i965 for non-scalar shaders. Only the first one is interesting.
> 
> With this pass enabled, we get these shader-db VS results against vec4_visitor:
> 
> total instructions in shared programs: 1853747 -> 1762126 (-4.94%)
> instructions in affected programs: 1681255 -> 1589634 (-5.45%)
> helped: 7751
> HURT: 9344
> GAINED: 0
> LOST: 0
> 
> And these against NIR-vec4 as in current master:
> 
> total instructions in shared programs: 1801527 -> 1762126 (-2.19%)
> instructions in affected programs: 1156923 -> 1117522 (-3.41%)
> helped: 10283
> HURT: 1281
> GAINED: 0
> LOST: 0
> 
> It is not very impressive, but there are a lot of clear opportunities left out that would add complexity that is maybe not worth it at this point.
> 
> There are no piglit or dEQP regressions observed.
> 
> What do you think?

I'm sorry I haven't had a chance to get back to you; I've had a lot on my plate the last few weeks. I did take a brief look at it today but didn't really give it a full review. Could you please send it to the list as at least an RFC so I can more easily make comments inline? Thanks!

I also took a stab at helping things out:
http://lists.freedesktop.org/archives/mesa-dev/2015-September/093737.html

I'm not 100% sure that I like it, but it seems to be effective:

> total instructions in shared programs: 1801527 -> 1761720 (-2.21%)
> instructions in affected programs: 978662 -> 938855 (-4.07%)
> helped: 7924
> HURT: 497

With the total GLSL IR vs. NIR delta being:

> total instructions in shared programs: 1853737 -> 1761720 (-4.96%)
> instructions in affected programs: 1686064 -> 1594047 (-5.46%)
> helped: 7717
> HURT: 9942

Not sure how the two passes interact; I put your branch on top of mine, and the final GLSL IR vs. NIR results with both of them are:

> total instructions in shared programs: 1853737 -> 1697708 (-8.42%)
> instructions in affected programs: 1695435 -> 1539406 (-9.20%)
> helped: 11648
> HURT: 5540

That seems pretty decent to me. Eventually, I do think we want to make register coalescing in the back-end non-terrible, but these seem to help at the very least.

If you get it sent to the list, I'll try and read your pass again and wrap my head around it.
I had a chat with Matt on IRC today, and he's getting more and more convinced that making a better register coalescing pass for vec4 is highly impractical. Therefore, it looks like anything we can do in NIR to make things nicer for the backend is probably the way to do it. I've got two series sitting on the list that both help quite a bit. One is the one mentioned in comment #36:

http://lists.freedesktop.org/archives/mesa-dev/2015-September/093737.html

The other I've been talking to Eduardo about already.
I'm going ahead and marking this bug as resolved. I'm sure we can keep finding performance things and I'm working with Alejandro on one at the moment. However, the pass is there, on by default, and the numbers are looking pretty good so I don't think we need the bug open anymore. Good work everyone!