Bug 87478

Summary: i915_drv OpenGL very CPU intensive under Qt5/QtQuick 2
Product: Mesa
Reporter: jpsinthemix
Component: Drivers/DRI/i915
Assignee: Ian Romanick <idr>
Status: RESOLVED MOVED
QA Contact:
Severity: blocker
Priority: medium
CC: chgena, jpsinthemix, v_2e
Version: 10.4
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
  log given by INTEL_DEBUG=all
  Correct (portion) of an INTEL_DEBUG=all log

Description jpsinthemix 2014-12-19 06:54:33 UTC
Hi,
I am testing Qt5/QtQuick 2 on an admittedly old desktop with an Intel Corporation 82915G/GV/910GL Integrated Graphics Controller. The issue I'm seeing is very high i915_drv CPU usage for all QML scripts. The problem is related to the way QtQuick 2 renders: the entire scene is redrawn on each change in the scene (rather than redrawing only what has actually changed). For example, a static full-screen scene with a text field in the center is fully redrawn on each cursor blink. The result is continually very high (60-80%) CPU usage. Here is a partial oprofile report (approx. 5-10 sec):

CPU: P4 / Xeon, speed 3200.2 MHz (estimated)

samples  %        image name               symbol name
100478   30.4292  i915_dri.so              fetch_texel_2d_A_UNORM8
55176    16.7097  i915_dri.so              fetch_vector4
42980    13.0162  i915_dri.so              _mesa_execute_program
25721     7.7895  i915_dri.so              store_vector4
20444     6.1913  i915_dri.so              linear_texel_locations
19952     6.0423  i915_dri.so              _mesa_unpack_ubyte_rgba_row
11530     3.4918  i915_dri.so              sample_linear_2d
10194     3.0872  i915_dri.so              unpack_uint_z_Z24_UNORM_X8_UINT
7353      2.2268  i915_dri.so              _swrast_write_rgba_span
6993      2.1178  i915_dri.so              pack_uint_Z24_UNORM_S8_UINT
3505      1.0615  i915_dri.so              __x86.get_pc_thunk.bx
3229      0.9779  i915_dri.so              fetch_texel_lod
2717      0.8228  i915_dri.so              _swrast_exec_fragment_program
2562      0.7759  no-vmlinux               /no-vmlinux
2379      0.7205  i915_dri.so              fetch_vector1
1609      0.4873  i915_dri.so              blend_general
1503      0.4552  i915_dri.so              blend_general_float.isra.1
1459      0.4418  libc-2.20.so             _int_free
1414      0.4282  libc-2.20.so             malloc
1197      0.3625  i915_dri.so              general_triangle
1190      0.3604  libc-2.20.so             _int_malloc
1095      0.3316  i915_dri.so              _swrast_depth_test_span
952       0.2883  i915_dri.so              _mesa_convert_colors
372       0.1127  i915_dri.so              pack_row_ubyte_B8G8R8A8_UNORM
290       0.0878  i915_dri.so              run_vp
262       0.0793  libQt5Quick.so.5.4.0     /usr/lib/libQt5Quick.so.5.4.0
219       0.0663  i915_dri.so              _mesa_pack_ubyte_rgba_row
196       0.0594  i915_dri.so              _mesa_get_format_bytes
179       0.0542  i915_dri.so              _swrast_span_interpolate_z
175       0.0530  libc-2.20.so             free
170       0.0515  libQt5Core.so.5.4.0      /usr/lib/libQt5Core.so.5.4.0

In comparison, if I uninstall the i915_drv to force the use of the llvm swrast driver, I get 25% CPU usage: still too high, but much lower than with i915_drv.

Can anyone familiar with the i915_drv code think of where the bottleneck is occurring and possibly suggest code modifications which might improve things here? Note that using Qt4 or Qt5 OpenGL Widget-based GUIs with the i915_drv driver presents no problems (the problem lies squarely with the implementation of QtQuick 2, in that most rendering decisions have been off-loaded to the GPU and associated driver). I'd be happy to test patches and even mess with the code. Any help or suggestions would be greatly appreciated.

thanks much for your time,
John
Comment 1 Ian Romanick 2014-12-19 23:03:24 UTC
It sounds like the application requires more resources than the aging i915 has to offer.  Seeing _swrast_exec_fragment_program in the profile tells me that the driver is falling back to software rasterization.  This usually occurs when the fragment program is too big.  The i915 can only handle shaders with up to 64 instructions.

Unless the shader is "just barely" too big, it is unlikely that we'll be able to do anything about it.  What output does running with 'INTEL_DEBUG=perf' give?
Comment 2 jpsinthemix 2014-12-30 05:32:45 UTC
(In reply to Ian Romanick from comment #1)
> It sounds like the application requires more resources than the aging i915
> has to offer.  Seeing _swrast_exec_fragment_program in the profile tells me
> that the driver is falling back to software rasterization.  This usually
> occurs when the fragment program is too big.  The i915 can only handle
> shaders with up to 64 instructions.
> 
> Unless the shader is "just barely" too big, it is unlikely that we'll be
> able to do anything about it.  What output does running with
> 'INTEL_DEBUG=perf' give?

For some reason I am unable to get any output via INTEL_DEBUG=perf. This is for the i915_drv, not the i965_drv driver. I think the Qt program I'm using for testing is re-directing both stdin/out to /dev/null.

The Qt program I'm running is the display manager 'sddm', which basically runs a QtQuick2 Main.qml script. I started intel_gpu_top and then ran sddm, and it shows 2-3% GPU usage; not surprising, I suppose, because as you noted the fragment shader is being run on the CPU via the swrast fallback.

I have, however, found the problematic shaders; they are hiqsubpixeldistancefieldtext.{frag,vert}, shown below. This shader pair handles the text rendering for sddm using 'distance-field' anti-aliasing. Note that the 'highp/lowp/mediump' qualifiers (for GLES) are not relevant (and are unset) here.

There are also simpler (lower-quality) versions, loqsubpixeldistancefieldtext.{frag,vert}, which are not employed by QtQuick2 here, but if I rebuild the QtQuick library and force their use instead of the hiqsubpixeldistancefieldtext.* versions, then the program runs as expected: intel_gpu_top usage remains at about 2-3% and CPU usage fluctuates in the 0-1% range. Listings of these lo* shaders are also shown below.

My guess here is that the 5 texture2D/texture2DProj() calls in hiqsubpixeldistancefieldtext.frag are triggering the swrast fallback; there are only 2 texture2DProj() calls in loqsubpixeldistancefieldtext.frag. If so, is this related to the i915 limits (i915_context.h) for maximum tex instructions and/or maximum tex indirections (I915_MAX_TEX_INSN=32 and I915_MAX_TEX_INDIRECT=4)?

If this is so, is there any wiggle room here in the driver whereby, say, up to 6 texture2D/texture2DProj() calls could be handled w/o a swrast fallback?

Thanks again for your time,
John


// ==== hiqsubpixeldistancefieldtext.vert
// ===============================================

uniform highp mat4 matrix;
uniform highp vec2 textureScale;
uniform highp float fontScale;
uniform highp vec4 vecDelta;

attribute highp vec4 vCoord;
attribute highp vec2 tCoord;

varying highp vec2 sampleCoord;
varying highp vec3 sampleFarLeft;
varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;
varying highp vec3 sampleFarRight;

void main()
{
    sampleCoord = tCoord * textureScale;
    gl_Position = matrix * vCoord;

    // Calculate neighbor pixel position in item space.
    highp vec3 wDelta = gl_Position.w * vecDelta.xyw;
    highp vec3 farLeft = vCoord.xyw - 0.667 * wDelta;
    highp vec3 nearLeft = vCoord.xyw - 0.333 * wDelta;
    highp vec3 nearRight = vCoord.xyw + 0.333 * wDelta;
    highp vec3 farRight = vCoord.xyw + 0.667 * wDelta;

    // Calculate neighbor texture coordinate.
    highp vec2 scale = textureScale / fontScale;
    highp vec2 base = sampleCoord - scale * vCoord.xy;
    sampleFarLeft = vec3(base * farLeft.z + scale * farLeft.xy, farLeft.z);
    sampleNearLeft = vec3(base * nearLeft.z + scale * nearLeft.xy, nearLeft.z);
    sampleNearRight = vec3(base * nearRight.z + scale * nearRight.xy, nearRight.z);
    sampleFarRight = vec3(base * farRight.z + scale * farRight.xy, farRight.z);
}

// ==== hiqsubpixeldistancefieldtext.frag
// ===============================================

varying highp vec2 sampleCoord;
varying highp vec3 sampleFarLeft;
varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;
varying highp vec3 sampleFarRight;

uniform sampler2D _qt_texture;
uniform lowp vec4 color;
uniform mediump float alphaMin;
uniform mediump float alphaMax;

void main()
{
    highp vec4 n;
    n.x = texture2DProj(_qt_texture, sampleFarLeft).a;
    n.y = texture2DProj(_qt_texture, sampleNearLeft).a;
    highp float c = texture2D(_qt_texture, sampleCoord).a;
    n.z = texture2DProj(_qt_texture, sampleNearRight).a;
    n.w = texture2DProj(_qt_texture, sampleFarRight).a;
#if 0
    // Blurrier, faster.
    n = smoothstep(alphaMin, alphaMax, n);
    c = smoothstep(alphaMin, alphaMax, c);
#else
    // Sharper, slower.
    highp vec2 d = min(abs(n.yw - n.xz) * 2., 0.67);
    highp vec2 lo = mix(vec2(alphaMin), vec2(0.5), d);
    highp vec2 hi = mix(vec2(alphaMax), vec2(0.5), d);
    n = smoothstep(lo.xxyy, hi.xxyy, n);
    c = smoothstep(lo.x + lo.y, hi.x + hi.y, 2. * c);
#endif
    gl_FragColor = vec4(0.333 * (n.xyz + n.yzw + c), c) * color.w;
}

// ==== loqsubpixeldistancefieldtext.vert
// ===============================================

uniform highp mat4 matrix;
uniform highp vec2 textureScale;
uniform highp float fontScale;
uniform highp vec4 vecDelta;

attribute highp vec4 vCoord;
attribute highp vec2 tCoord;

varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;

void main()
{
    highp vec2 sampleCoord = tCoord * textureScale;
    gl_Position = matrix * vCoord;

    // Calculate neighbor pixel position in item space.
    highp vec3 wDelta = gl_Position.w * vecDelta.xyw;
    highp vec3 nearLeft = vCoord.xyw - 0.25 * wDelta;
    highp vec3 nearRight = vCoord.xyw + 0.25 * wDelta;

    // Calculate neighbor texture coordinate.
    highp vec2 scale = textureScale / fontScale;
    highp vec2 base = sampleCoord - scale * vCoord.xy;
    sampleNearLeft = vec3(base * nearLeft.z + scale * nearLeft.xy, nearLeft.z);
    sampleNearRight = vec3(base * nearRight.z + scale * nearRight.xy, nearRight.z);
}

// ==== loqsubpixeldistancefieldtext.frag
// ===============================================

varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;

uniform sampler2D _qt_texture;
uniform lowp vec4 color;
uniform mediump float alphaMin;
uniform mediump float alphaMax;

void main()
{
    highp vec2 n;
    n.x = texture2DProj(_qt_texture, sampleNearLeft).a;
    n.y = texture2DProj(_qt_texture, sampleNearRight).a;
    n = smoothstep(alphaMin, alphaMax, n);
    highp float c = 0.5 * (n.x + n.y);
    gl_FragColor = vec4(n.x, c, n.y, c) * color.w;
}
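As a rough check of the guess in this comment about the number of texture lookups, the sampling calls in each fragment shader can be counted mechanically. This is only a sketch: the two source strings hold just the sampling lines copied from the listings above, `count_lookups` is a hypothetical helper name, and a plain regex would also count calls inside GLSL comments or disabled `#if 0` branches.

```python
import re

# Sampling lines copied from hiqsubpixeldistancefieldtext.frag (listing above).
HIQ_FRAG = """
    n.x = texture2DProj(_qt_texture, sampleFarLeft).a;
    n.y = texture2DProj(_qt_texture, sampleNearLeft).a;
    highp float c = texture2D(_qt_texture, sampleCoord).a;
    n.z = texture2DProj(_qt_texture, sampleNearRight).a;
    n.w = texture2DProj(_qt_texture, sampleFarRight).a;
"""

# Sampling lines copied from loqsubpixeldistancefieldtext.frag (listing above).
LOQ_FRAG = """
    n.x = texture2DProj(_qt_texture, sampleNearLeft).a;
    n.y = texture2DProj(_qt_texture, sampleNearRight).a;
"""

# The longer name comes first so "texture2DProj(" is not matched as "texture2D".
SAMPLE_CALL = re.compile(r"\b(texture2DProj|texture2D)\s*\(")


def count_lookups(glsl_source):
    """Count texture2D() and texture2DProj() calls in a GLSL source string."""
    counts = {"texture2D": 0, "texture2DProj": 0}
    for match in SAMPLE_CALL.finditer(glsl_source):
        counts[match.group(1)] += 1
    return counts


hiq = count_lookups(HIQ_FRAG)
loq = count_lookups(LOQ_FRAG)
print("hiq:", hiq, "total", sum(hiq.values()))
print("loq:", loq, "total", sum(loq.values()))
```

The totals are 5 lookups for the hiq shader and 2 for the loq shader, matching the counts given in this comment.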
Comment 3 jpsinthemix 2015-01-13 11:19:37 UTC
Created attachment 112165 [details]
log given by INTEL_DEBUG=all

Attached is a log generated using INTEL_DEBUG=all
Comment 4 jpsinthemix 2015-01-13 11:24:07 UTC
I finally got the INTEL_DEBUG=perf output:

QML debugging is enabled. Only use this in a safe environment.
i915_program_error: Exceeded max nr indirect texture lookups (6 out of 4)
ENTER FALLBACK 10000: Program
LEAVE FALLBACK Program
ENTER FALLBACK 10000: Program <--- These simply repeat
LEAVE FALLBACK Program <--------------|

This is what I expected from looking at the fragment shader: too many texture lookups.
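Since the ENTER/LEAVE pairs repeat for every fallback, a long capture is easier to read once summarized. The following is a minimal sketch that assumes the log lines have exactly the shape quoted above; `summarize` and the sample text are illustrative, and the regular expressions are not derived from the driver source.

```python
import re
from collections import Counter

# Sample copied from the output quoted above (reporter annotations removed).
SAMPLE_LOG = """\
QML debugging is enabled. Only use this in a safe environment.
i915_program_error: Exceeded max nr indirect texture lookups (6 out of 4)
ENTER FALLBACK 10000: Program
LEAVE FALLBACK Program
ENTER FALLBACK 10000: Program
LEAVE FALLBACK Program
"""

ERROR_RE = re.compile(r"^i915_program_error:\s*(.+)$")
ENTER_RE = re.compile(r"^ENTER FALLBACK\s+([0-9A-Fa-f]+):\s*(\S+)")


def summarize(log_text):
    """Return (program error messages, count of fallback entries per reason)."""
    errors = []
    fallbacks = Counter()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        error = ERROR_RE.match(line)
        if error:
            errors.append(error.group(1))
            continue
        enter = ENTER_RE.match(line)
        if enter:
            fallbacks[enter.group(2)] += 1
    return errors, fallbacks


errors, fallbacks = summarize(SAMPLE_LOG)
for message in errors:
    print("program error:", message)
for reason, count in fallbacks.items():
    print(f"fallback '{reason}' entered {count} time(s)")
```

On a real capture the same function could be fed the contents of the log file; a fallback count that grows with every frame would be consistent with the shader being run in software on each redraw.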
Comment 5 jpsinthemix 2015-01-16 09:41:27 UTC
Created attachment 112334 [details]
Correct (portion) of an INTEL_DEBUG=all log

The previous INTEL_DEBUG=all log (attachment # ) was for a modified version of the fragment shader hiqsubpixeldistancefieldtext.frag; my apologies for posting the wrong log.

This attachment is a portion of an INTEL_DEBUG=all log showing the ARB assembly of the (unmodified) fragment shader hiqsubpixeldistancefieldtext.frag.

Based on Issue (24) of https://www.opengl.org/registry/specs/ARB/fragment_program.txt I'm confused by the 6 indirections: there is the base indirection, 4 texture2DProj() calls, and 1 texture2D() call. So if there is an indirection per texture2D*() call, we would indeed get 1+5=6 indirections. But why can't the texture coordinates for all texture lookups be computed in one phase/node and the texture2D*() calls done in a second phase/node? Or is it that the texture coordinate TEMPs have to be set up on a per-texture2D*() basis?
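To make the question concrete, here is a small sketch of the API-level counting rule as I read Issue (24): a texture instruction starts a new node if its coordinate is a temporary already written in the current node, or if its result is a temporary already read or written in the current node. This is a paraphrase of the spec, not the i915 driver's own accounting, and the instruction streams are hypothetical, not the attached assembly. Writemasks and swizzles are ignored.

```python
# Sketch of API-level texture indirection counting, paraphrasing Issue (24)
# of ARB_fragment_program. This does NOT model the i915 driver's accounting.

TEX_OPS = {"TEX", "TXP", "TXB"}


def count_indirections(program, temps):
    """program: list of (opcode, dest, [sources]); temps: set of temporary names."""
    indirections = 1  # a program with no dependent lookups has a single node
    read, written = set(), set()
    for op, dst, srcs in program:
        if op in TEX_OPS:
            coord = srcs[0]
            coord_dirty = coord in temps and coord in written
            dest_dirty = dst in temps and (dst in read or dst in written)
            if coord_dirty or dest_dirty:
                indirections += 1
                read, written = set(), set()  # this lookup starts a new node
        read.update(s for s in srcs if s in temps)
        if dst in temps:
            written.add(dst)
    return indirections


# Hypothetical instruction streams (register and attribute names are made up).
TEMPS = {f"R{i}" for i in range(10)}


def coord_op(i):
    # An ALU instruction that computes a texture coordinate into a temporary.
    return ("MUL", f"R{2 * i}", [f"fragment.texcoord[{i}]", "program.local[0]"])


def lookup(i):
    # A texture lookup whose coordinate is the temporary written by coord_op(i).
    return ("TEX", f"R{2 * i + 1}", [f"R{2 * i}"])


interleaved = [ins for i in range(5) for ins in (coord_op(i), lookup(i))]
grouped = [coord_op(i) for i in range(5)] + [lookup(i) for i in range(5)]
direct = [("TEX", f"R{i}", [f"fragment.texcoord[{i}]"]) for i in range(5)]

print("interleaved:", count_indirections(interleaved, TEMPS))
print("grouped:", count_indirections(grouped, TEMPS))
print("direct:", count_indirections(direct, TEMPS))
```

Under this rule the interleaved stream counts 6 indirections, the grouped stream 2, and the stream that samples directly from attributes 1. So instruction ordering alone can account for a 1+5 count with five lookups, but the sketch does not establish whether the code generated for this shader actually has that shape.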
Comment 6 gnn 2015-07-13 10:47:06 UTC
This is due to bug 89062. A temporary workaround is to define the environment variable QML_DISABLE_DISTANCEFIELD for Qt5 apps.
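One way to apply the workaround to a single program, rather than to the whole session, is to launch it with the variable added to its environment. This is a minimal sketch: the value "1" is an assumption (the comment only says to define the variable), and `env_without_distance_fields` and `launch` are hypothetical helper names.

```python
import os
import subprocess


def env_without_distance_fields(base_env=None):
    """Return a copy of the environment with QML_DISABLE_DISTANCEFIELD defined."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: any non-empty value such as "1" counts as "defined".
    env["QML_DISABLE_DISTANCEFIELD"] = "1"
    return env


def launch(command):
    """Run a Qt5 application (given as an argument list) with the workaround set."""
    return subprocess.run(command, env=env_without_distance_fields(), check=False)


# Demonstration on a small, explicit environment; nothing is launched here.
env = env_without_distance_fields({"PATH": "/usr/bin"})
print(env)
```

A call such as `launch(["/path/to/qt5-app"])` would then start the application with the variable set; the path here is a placeholder.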
Comment 7 GitLab Migration User 2019-09-18 19:38:19 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/745.
