INTEL_DEBUG=shader_time reports that almost 0 time is spent in the vertex shader in a microbenchmark (OglTerrainFlyInst). However, the FS is really simple, and the VS is pretty large. I managed to cut roughly 50% of the VS instructions, and that roughly doubled the performance. So clearly the VS is important.
Using performance monitoring, it looks like the VS actually takes up ~70% of the time. So, shader_time is just full of lies. It appears that it's undercounting the VS time in Unigine Valley as well.
I believe the timestamp register is getting reset in almost every VS, so we just count 0 most of the time.
Perhaps it can be improved? Perhaps we need to develop something better?
This should pretty much be fixed by:
Author: Kenneth Graunke <firstname.lastname@example.org>
Date: Sat Jun 14 03:53:07 2014 -0700
i965/vec4: Fix dead code elimination for VGRFs of size > 1.
When faced with code such as:
mov vgrf31.0:UD, 960D
mov vgrf31.1:UD, vgrf30.xxxx:UD
The dead code eliminator didn't consider reg_offsets, so it decided that
the second instruction was writing to the same register as
the first one, and eliminated the first one. But they're actually
different registers.
This fixes INTEL_DEBUG=shader_time for vertex shaders. In the above
code, vgrf31.0 represents the offset into the shader_time buffer where
the data should be written, and vgrf31.1 represents the actual time
data. With a completely undefined offset, results were...unexpected.
I think this is probably one of the few cases (maybe only case) where we
generate multiple MOVs to a large VGRF. Normally, we just use them as
texturing results; the other SEND-from-GRF uses a size 1 VGRF.
Signed-off-by: Kenneth Graunke <email@example.com>
Reviewed-by: Matt Turner <firstname.lastname@example.org>
It's at least now giving me data in the same ballpark as the other methods, though I haven't checked if it's entirely the same. Thankfully it was something simple.
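For anyone reading this later, here is a rough toy model of the bug the commit message describes. It is not the Mesa code: the `Mov` class, the `eliminate_dead_writes` function and the `use_reg_offset` switch are all made up for illustration, and the pass only models "a later write to the same destination kills an earlier one", with no reads in between. It just shows why keying the destination on the VGRF number alone throws away the first MOV, while keying on (VGRF, reg_offset) keeps both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mov:
    """Hypothetical stand-in for a vec4 MOV instruction."""
    vgrf: int        # virtual register number, e.g. 31 for vgrf31
    reg_offset: int  # which register inside a multi-register VGRF
    src: str         # source operand, kept as text for readability


def eliminate_dead_writes(insts, use_reg_offset):
    """Drop a write when a later write hits the same destination.

    Toy model only: it assumes nothing reads the destination between
    the two writes. With use_reg_offset=False the destination is
    identified by the VGRF number alone (the buggy behaviour); with
    use_reg_offset=True it is identified by (vgrf, reg_offset).
    """
    kept = []
    overwritten = set()
    # Walk backwards so that later writes are seen first.
    for inst in reversed(insts):
        if use_reg_offset:
            key = (inst.vgrf, inst.reg_offset)
        else:
            key = (inst.vgrf,)
        if key in overwritten:
            continue  # considered dead: a later write overwrites it
        overwritten.add(key)
        kept.append(inst)
    kept.reverse()
    return kept


# The two instructions from the commit message.
program = [
    Mov(31, 0, "960D"),         # offset into the shader_time buffer
    Mov(31, 1, "vgrf30.xxxx"),  # the actual time data
]

buggy = eliminate_dead_writes(program, use_reg_offset=False)
fixed = eliminate_dead_writes(program, use_reg_offset=True)

print("ignoring reg_offset:", buggy)
print("using reg_offset:   ", fixed)
```

With `use_reg_offset=False` only the write to vgrf31.1 survives, so the buffer offset in vgrf31.0 is never set, which matches the "completely undefined offset" in the commit message.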