Hi,

I am testing Qt5/QtQuick 2 on an admittedly old desktop with an Intel Corporation 82915G/GV/910GL Integrated Graphics Controller. The issue I'm seeing is very high i915_drv CPU usage for all QML scripts.

The problem is related to the way QtQuick 2 renders: an entire scene redraw is triggered on each change in the scene (rather than redrawing only what has in fact changed). For example, a static full-screen scene with a text field in the center is fully redrawn on each cursor blink. The result is continual very high (60-80%) CPU usage.

Here is a partial (oprofile) report (approx 5-10 sec):

CPU: P4 / Xeon, speed 3200.2 MHz (estimated)
samples  %        image name            symbol name
100478   30.4292  i915_dri.so           fetch_texel_2d_A_UNORM8
55176    16.7097  i915_dri.so           fetch_vector4
42980    13.0162  i915_dri.so           _mesa_execute_program
25721     7.7895  i915_dri.so           store_vector4
20444     6.1913  i915_dri.so           linear_texel_locations
19952     6.0423  i915_dri.so           _mesa_unpack_ubyte_rgba_row
11530     3.4918  i915_dri.so           sample_linear_2d
10194     3.0872  i915_dri.so           unpack_uint_z_Z24_UNORM_X8_UINT
7353      2.2268  i915_dri.so           _swrast_write_rgba_span
6993      2.1178  i915_dri.so           pack_uint_Z24_UNORM_S8_UINT
3505      1.0615  i915_dri.so           __x86.get_pc_thunk.bx
3229      0.9779  i915_dri.so           fetch_texel_lod
2717      0.8228  i915_dri.so           _swrast_exec_fragment_program
2562      0.7759  no-vmlinux            /no-vmlinux
2379      0.7205  i915_dri.so           fetch_vector1
1609      0.4873  i915_dri.so           blend_general
1503      0.4552  i915_dri.so           blend_general_float.isra.1
1459      0.4418  libc-2.20.so          _int_free
1414      0.4282  libc-2.20.so          malloc
1197      0.3625  i915_dri.so           general_triangle
1190      0.3604  libc-2.20.so          _int_malloc
1095      0.3316  i915_dri.so           _swrast_depth_test_span
952       0.2883  i915_dri.so           _mesa_convert_colors
372       0.1127  i915_dri.so           pack_row_ubyte_B8G8R8A8_UNORM
290       0.0878  i915_dri.so           run_vp
262       0.0793  libQt5Quick.so.5.4.0  /usr/lib/libQt5Quick.so.5.4.0
219       0.0663  i915_dri.so           _mesa_pack_ubyte_rgba_row
196       0.0594  i915_dri.so           _mesa_get_format_bytes
179       0.0542  i915_dri.so           _swrast_span_interpolate_z
175       0.0530  libc-2.20.so          free
170       0.0515  libQt5Core.so.5.4.0   /usr/lib/libQt5Core.so.5.4.0

In comparison, if I uninstall the i915_drv to force the use of the llvm swrast driver, I get 25% CPU usage: still too high, but much lower than with i915_drv.

Can anyone familiar with the i915_drv code think of where the bottleneck is occurring and possibly suggest code modifications which might improve things here?

Note that using Qt4 or Qt5 OpenGL Widget-based GUIs with the i915_drv driver presents no problems; the problem lies squarely with the implementation of QtQuick 2 (in that most rendering decisions have been off-loaded to the GPU and associated driver).

I'd be happy to test patches and even mess with the code. Any help or suggestions would be greatly appreciated.

Thanks much for your time,
John
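For anyone who wants to see at a glance where the samples go, here is a minimal Python sketch that totals the oprofile sample counts per image. It is only an illustration: the six rows are copied from the report above, while the function name `samples_by_image` and the whitespace-separated parsing are my own assumptions, not part of oprofile.

```python
from collections import defaultdict

# Excerpt of the report above: samples, percent, image name, symbol name.
REPORT_EXCERPT = """\
100478 30.4292 i915_dri.so fetch_texel_2d_A_UNORM8
55176 16.7097 i915_dri.so fetch_vector4
42980 13.0162 i915_dri.so _mesa_execute_program
2562 0.7759 no-vmlinux /no-vmlinux
1459 0.4418 libc-2.20.so _int_free
1414 0.4282 libc-2.20.so malloc
"""


def samples_by_image(report):
    """Sum the sample counts per image (hypothetical helper, not an oprofile API)."""
    totals = defaultdict(int)
    for line in report.splitlines():
        fields = line.split()
        # Skip header or malformed lines: a data row starts with an integer count.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        totals[fields[2]] += int(fields[0])
    return dict(totals)


if __name__ == "__main__":
    ranked = sorted(samples_by_image(REPORT_EXCERPT).items(), key=lambda kv: -kv[1])
    for image, samples in ranked:
        print(f"{samples:>8}  {image}")
```

Even on this excerpt, nearly all samples land in i915_dri.so, in texel-fetch and program-execution symbols.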
It sounds like the application requires more resources than the aging i915 has to offer. Seeing _swrast_exec_fragment_program in the profile tells me that the driver is falling back to software rasterization. This usually occurs when the fragment program is too big. The i965 can only handle shaders with up to 64 instructions.

Unless the shader is "just barely" too big, it is unlikely that we'll be able to do anything about it. What output does running with 'INTEL_DEBUG=perf' give?
(In reply to Ian Romanick from comment #1)
> It sounds like the application requires more resources than the aging i915
> has to offer. Seeing _swrast_exec_fragment_program in the profile tells me
> that the driver is falling back to software rasterization. This usually
> occurs when the fragment program is too big. The i965 can only handle
> shaders with up to 64 instructions.
>
> Unless the shader is "just barely" too big, it is unlikely that we'll be
> able to do anything about it. What output does running with
> 'INTEL_DEBUG=perf' give?

For some reason I am unable to get any output via INTEL_DEBUG=perf. This is for the i915_drv, not the i965_drv driver. I think the Qt program I'm using for testing is redirecting both stdin/out to /dev/null. The Qt program I'm running is the display manager 'sddm', which basically runs a QtQuick2 Main.qml script.

I've started intel_gpu_top and then run sddm, and it shows 2-3% GPU usage. That is not surprising, I suppose, because as you noted the fragment shader is being run on the CPU via the swrast fallback.

I have, however, found the problematic shaders; they are hiqsubpixeldistancefieldtext.{frag,vert}, shown below. This shader pair handles the text rendering for sddm using 'distance-field' anti-aliasing. Note that the 'highp/lowp/mediump' qualifiers (for GLES) are not relevant (and are unset) here.

There are also simpler (lower-quality) versions, loqsubpixeldistancefieldtext.{frag,vert}, which are not employed by QtQuick2 here. If I rebuild the QtQuick library and force their use instead of the hiqsubpixeldistancefieldtext.* versions, then the program runs as expected: intel_gpu_top usage remains at about 2-3% and CPU usage fluctuates in the 0-1% range. Listings of these lo* shaders are also shown below.

My guess here is that the 5 texture2D/texture2DProj() calls in hiqsubpixeldistancefieldtext.frag are triggering the swrast fallback; there are only 2 texture2DProj() calls in loqsubpixeldistancefieldtext.frag.
If so, is this related to the i915 limits (i915_context.h) for maximum tex instructions and/or maximum tex indirections (I915_MAX_TEX_INSN=32 and I915_MAX_TEX_INDIRECT=4)? If this is so, is there any wiggle room here in the driver whereby, say, up to 6 texture2D/texture2DProj() calls could be handled w/o a swrast fallback?

Thanks again for your time,
John

// ==== hiqsubpixeldistancefieldtext.vert
// ===============================================
uniform highp mat4 matrix;
uniform highp vec2 textureScale;
uniform highp float fontScale;
uniform highp vec4 vecDelta;

attribute highp vec4 vCoord;
attribute highp vec2 tCoord;

varying highp vec2 sampleCoord;
varying highp vec3 sampleFarLeft;
varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;
varying highp vec3 sampleFarRight;

void main()
{
    sampleCoord = tCoord * textureScale;
    gl_Position = matrix * vCoord;

    // Calculate neighbor pixel position in item space.
    highp vec3 wDelta = gl_Position.w * vecDelta.xyw;
    highp vec3 farLeft = vCoord.xyw - 0.667 * wDelta;
    highp vec3 nearLeft = vCoord.xyw - 0.333 * wDelta;
    highp vec3 nearRight = vCoord.xyw + 0.333 * wDelta;
    highp vec3 farRight = vCoord.xyw + 0.667 * wDelta;

    // Calculate neighbor texture coordinate.
    highp vec2 scale = textureScale / fontScale;
    highp vec2 base = sampleCoord - scale * vCoord.xy;
    sampleFarLeft = vec3(base * farLeft.z + scale * farLeft.xy, farLeft.z);
    sampleNearLeft = vec3(base * nearLeft.z + scale * nearLeft.xy, nearLeft.z);
    sampleNearRight = vec3(base * nearRight.z + scale * nearRight.xy, nearRight.z);
    sampleFarRight = vec3(base * farRight.z + scale * farRight.xy, farRight.z);
}

// ==== hiqsubpixeldistancefieldtext.frag
// ===============================================
varying highp vec2 sampleCoord;
varying highp vec3 sampleFarLeft;
varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;
varying highp vec3 sampleFarRight;

uniform sampler2D _qt_texture;
uniform lowp vec4 color;
uniform mediump float alphaMin;
uniform mediump float alphaMax;

void main()
{
    highp vec4 n;
    n.x = texture2DProj(_qt_texture, sampleFarLeft).a;
    n.y = texture2DProj(_qt_texture, sampleNearLeft).a;
    highp float c = texture2D(_qt_texture, sampleCoord).a;
    n.z = texture2DProj(_qt_texture, sampleNearRight).a;
    n.w = texture2DProj(_qt_texture, sampleFarRight).a;
#if 0
    // Blurrier, faster.
    n = smoothstep(alphaMin, alphaMax, n);
    c = smoothstep(alphaMin, alphaMax, c);
#else
    // Sharper, slower.
    highp vec2 d = min(abs(n.yw - n.xz) * 2., 0.67);
    highp vec2 lo = mix(vec2(alphaMin), vec2(0.5), d);
    highp vec2 hi = mix(vec2(alphaMax), vec2(0.5), d);
    n = smoothstep(lo.xxyy, hi.xxyy, n);
    c = smoothstep(lo.x + lo.y, hi.x + hi.y, 2. * c);
#endif
    gl_FragColor = vec4(0.333 * (n.xyz + n.yzw + c), c) * color.w;
}

// ==== loqsubpixeldistancefieldtext.vert
// ===============================================
uniform highp mat4 matrix;
uniform highp vec2 textureScale;
uniform highp float fontScale;
uniform highp vec4 vecDelta;

attribute highp vec4 vCoord;
attribute highp vec2 tCoord;

varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;

void main()
{
    highp vec2 sampleCoord = tCoord * textureScale;
    gl_Position = matrix * vCoord;

    // Calculate neighbor pixel position in item space.
    highp vec3 wDelta = gl_Position.w * vecDelta.xyw;
    highp vec3 nearLeft = vCoord.xyw - 0.25 * wDelta;
    highp vec3 nearRight = vCoord.xyw + 0.25 * wDelta;

    // Calculate neighbor texture coordinate.
    highp vec2 scale = textureScale / fontScale;
    highp vec2 base = sampleCoord - scale * vCoord.xy;
    sampleNearLeft = vec3(base * nearLeft.z + scale * nearLeft.xy, nearLeft.z);
    sampleNearRight = vec3(base * nearRight.z + scale * nearRight.xy, nearRight.z);
}

// ==== loqsubpixeldistancefieldtext.frag
// ===============================================
varying highp vec3 sampleNearLeft;
varying highp vec3 sampleNearRight;

uniform sampler2D _qt_texture;
uniform lowp vec4 color;
uniform mediump float alphaMin;
uniform mediump float alphaMax;

void main()
{
    highp vec2 n;
    n.x = texture2DProj(_qt_texture, sampleNearLeft).a;
    n.y = texture2DProj(_qt_texture, sampleNearRight).a;
    n = smoothstep(alphaMin, alphaMax, n);
    highp float c = 0.5 * (n.x + n.y);
    gl_FragColor = vec4(n.x, c, n.y, c) * color.w;
}
Created attachment 112165
log given by INTEL_DEBUG=all

Attached is a log generated using INTEL_DEBUG=all.
I finally got the INTEL_DEBUG=perf output:

QML debugging is enabled. Only use this in a safe environment.
i915_program_error: Exceeded max nr indirect texture lookups (6 out of 4)
ENTER FALLBACK 10000: Program
LEAVE FALLBACK Program
ENTER FALLBACK 10000: Program   <--- These simply repeat
LEAVE FALLBACK Program          <--------------|

This is what I expected from looking at the fragment shader: too many texture lookups.
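In case it helps anyone sifting through a longer capture, here is a small Python sketch that pulls the limit violation and the number of fallback entries out of such a log. The log text is the output quoted above (minus my annotations); the regular expression and the `summarize` helper are assumptions of mine, based only on the message format seen here.

```python
import re

# The INTEL_DEBUG=perf output quoted above, without the "<---" annotations.
PERF_LOG = """\
QML debugging is enabled. Only use this in a safe environment.
i915_program_error: Exceeded max nr indirect texture lookups (6 out of 4)
ENTER FALLBACK 10000: Program
LEAVE FALLBACK Program
ENTER FALLBACK 10000: Program
LEAVE FALLBACK Program
"""

# Assumed message shape: "... Exceeded max nr <what> (<used> out of <limit>)".
LIMIT_RE = re.compile(
    r"i915_program_error: Exceeded max nr (.+?) \((\d+) out of (\d+)\)"
)


def summarize(log):
    """Return ([(resource, used, limit), ...], number of ENTER FALLBACK lines)."""
    exceeded = [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in LIMIT_RE.finditer(log)
    ]
    fallbacks = sum(
        1 for line in log.splitlines() if line.startswith("ENTER FALLBACK")
    )
    return exceeded, fallbacks


if __name__ == "__main__":
    exceeded, fallbacks = summarize(PERF_LOG)
    for what, used, limit in exceeded:
        print(f"{what}: {used} used, limit {limit}")
    print(f"fallback entries: {fallbacks}")
```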
Created attachment 112334
Correct (portion) of an INTEL_DEBUG=all log

The previous INTEL_DEBUG=all log (attachment # ) was for a modified version of the fragment shader hiqsubpixeldistancefieldtext.frag; my apologies for posting the wrong log. This attachment is a portion of an INTEL_DEBUG=all log showing the ARB assembly of the (unmodified) fragment shader hiqsubpixeldistancefieldtext.frag.

Based on Issue (24) of https://www.opengl.org/registry/specs/ARB/fragment_program.txt I'm confused by the 6 indirections: there is the base indirection, 4 texture2DProj() calls, and 1 texture2D() call. So if there is an indirection per texture2D*() call, we would indeed get 1+5=6 indirections.

But why can't the texture coordinates for all texture2D*() lookups be set up in one phase/node and the texture2D*() calls done in a second phase/node? Or is it that the texture coordinate TEMPs have to be set up on a per-texture2D*() basis?
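To make the question concrete, here is a rough Python model of the API-level counting rule as I read Issue (24): start at one node, and open a new node when a texture instruction either takes its coordinate from a temporary already written in the current node, or writes a temporary that an ALU instruction has already read or written in the current node. This is only my simplified reading of the spec text. It does not model how the i915 driver actually counts, and the instruction tuples, register names, and both sample programs are hypothetical.

```python
def is_temp(reg):
    # Assumption for this sketch: temporaries are named R0, R1, ...;
    # anything else (e.g. "texcoord0") is an attribute or parameter.
    return reg.startswith("R")


def count_indirections(program):
    """Count nodes for a list of (op, dst, srcs) tuples, op being "TEX" or "ALU"."""
    nodes = 1
    written = set()      # temporaries written in the current node (TEX or ALU)
    alu_touched = set()  # temporaries read or written by ALU in the current node
    for op, dst, srcs in program:
        if op == "TEX":
            coord = srcs[0]
            dependent = is_temp(coord) and coord in written
            clobbers = dst in alu_touched
            if dependent or clobbers:
                # This instruction starts a new node; tracking restarts.
                nodes += 1
                written, alu_touched = set(), set()
            written.add(dst)
        else:
            alu_touched.update(s for s in srcs if is_temp(s))
            alu_touched.add(dst)
            written.add(dst)
    return nodes


# Hypothetical: five lookups from interpolated coordinates into distinct temps.
INDEPENDENT = [("TEX", f"R{i}", [f"texcoord{i}"]) for i in range(5)]

# Hypothetical: a temp is reused after an ALU read, then a dependent lookup.
DEPENDENT = [
    ("TEX", "R0", ["texcoord0"]),
    ("ALU", "R1", ["R0"]),
    ("TEX", "R0", ["texcoord1"]),  # R0 was read by ALU in this node
    ("ALU", "R2", ["R0"]),
    ("TEX", "R3", ["R2"]),         # coordinate written in this node
]

if __name__ == "__main__":
    print(count_indirections(INDEPENDENT), count_indirections(DEPENDENT))
```

Under this reading, lookups that take their coordinates straight from attributes and write distinct temporaries all share the first node, which is why the reported count of 6 puzzles me; whether the driver's bookkeeping follows the same rule is exactly what I am asking.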
This is due to bug 89062. As a temporary workaround, define the environment variable QML_DISABLE_DISTANCEFIELD for Qt5 apps.
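For a launcher script, the workaround could look roughly like the Python sketch below. The value "1" is an assumption on my part (the comment above only says to define the variable), and the helper names and the example command are hypothetical.

```python
import os
import subprocess


def env_with_distance_field_disabled(base=None):
    """Copy an environment and define QML_DISABLE_DISTANCEFIELD in it."""
    env = dict(os.environ if base is None else base)
    # Assumed value: "1". The workaround only requires the variable to be defined.
    env["QML_DISABLE_DISTANCEFIELD"] = "1"
    return env


def launch(argv):
    """Run a Qt5 application with distance-field text rendering disabled."""
    return subprocess.run(argv, env=env_with_distance_field_disabled(), check=False)
```

For example, `launch(["qmlscene", "Main.qml"])` would start a QML file with the variable set, assuming qmlscene is installed and Main.qml is in the current directory.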
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/745.