Created attachment 59896 [details] intel_error_decode dump When resizing the Thunar (Xfce file manager) window to more than 2048px, screen-wide rendering corruption appears and eventually a GPU hang is triggered. i945GM linux-3.4_rc2 Fedora16 + updates
In future just attach the raw i915_error_state; so that if we improve the decoder we can re-run against the original data.

So something ate your batchbuffer:

0x01300368: 0x7d040441: 3DSTATE_LOAD_STATE_IMMEDIATE_1
0x0130036c: 0xfffffff0: S2: texcoord formats: 0=2D 1=NP 2=NP 3=NP 4=NP 5=NP 6=NP 7=NP
0x01300370: 0x00008144: S6: alpha_test=always, alpha_ref=0x0, depth_test=always, cbuf blend enable, src_blnd_fct=zero, dst_blnd_fct=inv_src_colr, cbuf write enable, tristrip_provoking_vertex=0
0x01300374: 0x7d000003: 3DSTATE_MAP_STATE
0x01300378: 0x00000001: mask
0x0130037c: 0x08044d40: map 0 MS2
0x01300380: 0x01402d80: map 0 MS3 [width=12, height=11, format=32b argb8888, tiling=none]
0x01300384: 0x01600000: map 0 MS4 [pitch=48, max_lod=0, vol_depth=0, cube_face_ena=0, miplayout right]
0x01300388: 0xf6010000: UNKNOWN
0x0130038c: HEAD 0x19180000: MI UNKNOWN
0x01300390: 0x00000000: MI_NOOP
0x01300394: 0x00000000: MI_NOOP
0x01300398: 0x19083c00: MI UNKNOWN
0x0130039c: 0x00000000: MI_NOOP
0x013003a0: 0x00000000: MI_NOOP
0x013003a4: 0x15200000: MI UNKNOWN
0x013003a8: 0x01000000: MI_USER_INTERRUPT
0x013003ac: 0x00000000: MI_NOOP
0x013003b0: 0x7f9c0003: 3DPRIMITIVE sequential indirect RECTLIST, 3 starting from 11
0x013003b4: 0x0000000b: start

Where the GPU hung should have been a 3DSTATE_SAMPLER_STATE
Nope, actually should have been

0x01300434: 0x7d050008: 3DSTATE_PIXEL_SHADER_PROGRAM
0x01300438: 0x19180000: PS000: DCL S0 2D
0x0130043c: 0x00000000: PS000
0x01300440: 0x00000000: PS000
0x01300444: 0x19083c00: PS001: DCL T0.xyzw
0x01300448: 0x00000000: PS001
0x0130044c: 0x00000000: PS001
0x01300450: 0x15200000: PS002: TEXLD oC, S0, T0
0x01300454: 0x01000000: PS002
0x01300458: 0x00000000: PS002

0xf6010000 is far larger than your aperture size so won't correspond to a relocation. So the GPU went nuts (or we have a pathological cache coherency issue; GPU went nuts is more likely).
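The relocation argument can be sketched as a plain range check: a relocated dword holds a GTT offset, so it has to fall inside the aperture. This is only an illustration, not driver code. `value_in_aperture` is a hypothetical helper, and the 256 MiB aperture is an assumed figure, since the actual aperture size of the reporter's i945GM is not stated in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed aperture size, for illustration only (not taken from this report). */
#define ASSUMED_APERTURE_SIZE (256u * 1024u * 1024u)

/*
 * Hypothetical helper: a dword written by a relocation is a GTT offset,
 * so a plausible relocation target must lie inside the aperture.
 */
bool value_in_aperture(uint32_t value, uint32_t aperture_size)
{
	assert(aperture_size > 0);
	return value < aperture_size;
}
```

Under that assumption, 0xf6010000 is rejected while an ordinary batch address such as 0x01300368 is accepted, which is why the stray dword does not look like a misplaced relocation.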
0xf6010000 is quite a strange value, and there doesn't appear to be any other corruption in those pages. Can you grab a few more error-states to see if there are any more patterns?
While trying to generate a few error-states, I hit the following assertion:

#3 0x426af6a5 in __assert_fail_base (fmt=0x427f0e48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x3291cc "!too_large(dst->drawable.width, dst->drawable.height)", file=0x329055 "gen3_render.c", line=4252, function=0x329590 "gen3_render_fill_boxes_try_blt") at assert.c:94
#4 0x426af757 in __GI___assert_fail (assertion=0x3291cc "!too_large(dst->drawable.width, dst->drawable.height)", file=0x329055 "gen3_render.c", line=4252, function=0x329590 "gen3_render_fill_boxes_try_blt") at assert.c:103
#5 0x002c2044 in gen3_render_fill_boxes_try_blt (sna=0x9395270, op=<optimized out>, format=537004168, color=0x96aa1ec, dst=0x96b56b0, dst_bo=0x96b5b38, box=0xbfe3dd2c, n=1) at gen3_render.c:4252
#6 0x002cb781 in gen3_render_fill_boxes (sna=0x9395270, op=1 '\001', format=537004168, color=0x96aa1ec, dst=0x96b56b0, dst_bo=0x96b5b38, box=0xbfe3dd2c, n=1) at gen3_render.c:4318
#7 0x0028a096 in sna_composite_rectangles (op=1 '\001', dst=0x967c710, color=0x96aa1ec, num_rects=1, rects=0x96aa1f4) at sna_composite.c:795
#8 0x08149a41 in CompositeRects (op=3 '\003', pDst=0x967c710, color=0x96aa1ec, nRect=1, rects=0x96aa1f4) at picture.c:1671
#9 0x0814ee9a in ProcRenderFillRectangles (client=0x9589160) at render.c:1481
#10 0x0814a064 in ProcRenderDispatch (client=0x9589160) at render.c:2063
#11 0x08076195 in Dispatch () at dispatch.c:439
#12 0x0806439a in main (argc=7, argv=0xbfe3df94, envp=0xbfe3dfb4) at main.c:287
Created attachment 59905 [details] another error state capture
Created attachment 59906 [details] yet another error state capture
Have you been running with assertions enabled during the error-state capture? Or did you need to turn those off? (Just optimistically hoping that the assertions predict the error.)
The repeating pattern is stray writes across the batch-buffer.
If you do get bored, seeing if you can hit the hang with enable-debug=full would also help. For the moment, I'll concentrate on the assertion failure, wonder what other assertions are missing. It seems you have an interesting arrangement of monitors. The framebuffer pitch is 8192 (so 2048 pixels wide or thereabout presumably) yet much taller.
Unfortunately, the GPU hang occurred with assertions enabled; I'll try to get a hang with debug=full. Regarding "The framebuffer pitch is 8192 (so 2048 pixels wide or thereabout presumably) yet much taller": currently, the only active display is the 1280x800 built-in notebook panel. I hit the issue by accidentally resizing a window "too large".
Ok, that assertion is more of a warning that I don't want to hit that path rather than a predictor of an error.
Strange, with debug=full I don't get any of those rendering corruptions. However, I seem to have hit an assertion (unfortunately no debugger attached). These were the last few lines of the log:

[ 222.090] glyphs_via_mask: small mask [format=28888, depth=32, size=30464], rendering glyphs to upload buffer
[ 222.090] sna_pixmap_create_upload(68, 14, 32)
[ 222.091] kgem_create_buffer_2d: 68x14, 32 bpp, stride=272
[ 222.091] kgem_create_buffer: size=3808, flags=1 [write?=1, inplace?=0, last?=0]
[ 222.091] kgem_create_buffer: skip write 3 buffer, need 1
[ 222.091] kgem_create_buffer: too small (960 < 3808)
[ 222.091] kgem_retire
[ 222.091] kgem_retire -- need_retire=0
[ 222.091] search_linear_cache: num_pages=1, flags=2, use_active? 0
[ 222.091] search_linear_cache: found handle=86 (num_pages=1) in linear inactive cache
[ 222.092] kgem_create_buffer: reusing ordinary handle 86 for io
[ 222.092] kgem_create_buffer(pages=1) new handle=86
[ 222.092] kgem_create_buffer: this=256, right=208896
[ 222.092] kgem_create_buffer: this=256, right=960
[ 222.092] kgem_create_buffer: this=256, right=960
[ 222.092] kgem_create_buffer: this=256, right=960
[ 222.092] kgem_create_buffer: this=256, right=320
[ 222.092] kgem_create_proxy: target handle=86, offset=0, length=3808, io=1
[ 222.092] __sna_damage_all(68, 14)
[ 222.093] __sna_damage_all(68, 14)
[ 222.093] sna_pixmap_create_upload: serial=6945, usage=0
[ 222.093] glyphs_via_mask: glyph to mask (0, 4)x(5, 7)
[ 222.093] glyphs_via_mask: glyph to mask (7, 4)x(4, 7)
[ 222.093] glyphs_via_mask: glyph to mask (11, 4)x(8, 10)
[ 222.093] glyphs_via_mask: glyph to mask (19, 4)x(6, 7)
[ 222.093] glyphs_via_mask: glyph to mask (25, 2)x(4, 9)
[ 222.094] glyphs_via_mask: glyph to mask (31, 4)x(6, 7)
[ 222.094] glyphs_via_mask: glyph to mask (39, 0)x(1, 11)
[ 222.094] glyphs_via_mask: glyph to mask (41, 13)x(7, 1)
[ 222.094] glyphs_via_mask: glyph to mask (49, 4)x(6, 10)
[ 222.094] glyphs_via_mask: glyph to mask (57, 4)x(4, 7)
[ 222.094] glyphs_via_mask: glyph to mask (62, 4)x(6, 7)
[ 222.094] sna_composite(3 src=(11, 52), mask=(0, 0), dst=(11, 52)+(0, 0), size=(68, 14)
[ 222.094] sna_compute_composite_region: dst=(11, 52)x(68, 14)
[ 222.094] sna_compute_composite_region: initial clip against dst->pDrawable: (11, 52), (79, 66)
[ 222.095] clip_to_dst: region: 1x[(11, 52), (79, 66)], clip: 1x[(0, 0), (3609, 5275)]
[ 222.095] sna_compute_composite_region: clip against dst->pCompositeClip: (11, 52), (79, 66)
[ 222.095] sna_compute_composite_region: clip against src: (11, 52), (79, 66)
[ 222.095] sna_compute_composite_region: clip against mask: (11, 52), (79, 66)
[ 222.095] sna_composite: composite region extents:+(0, 0) -> (11, 52), (79, 66) + (0, 0)
[ 222.095] gen3_render_composite()
[ 222.095] kgem_bo_is_busy: domain: 1 exec? 0, rq? 0
[ 222.095] gen3_composite_fallback: src is already on the GPU, try to use GPU
[ 222.095] sna_pixmap_force_to_gpu(pixmap=0x8dcac40)
[ 222.096] sna_pixmap_force_to_gpu: forcing creation of gpu bo (3609x5275@32, flags=1)
[ 222.096] default_tiling: entire source is damaged, using Y-tiling
[ 222.096] kgem_choose_tiling: pitch too large for tliing [14436]
[ 222.096] kgem_choose_tiling: 3609x5275 -> 0
[ 222.096] kgem_create_2d(3609x5275, bpp=32, tiling=0, exact=0, inactive=1, cpu-mapping=0, gtt-mapping=0, scanout?=0, temp?=0)
[ 222.096] from inactive: pitch=14464, tiling=0: handle=16, id=3736
[ 222.096] sna_pixmap_force_to_gpu: created gpu bo
[ 222.096] sna_pixmap_move_to_gpu(pixmap=6943, usage=0)
[ 222.097] sna_pixmap_move_to_gpu: CPU damage? 1
[ 222.097] sna_pixmap_move_to_gpu: uploading 1 damage boxes
[ 222.097] sna_replace(handle=16, 3609x5275, bpp=32, tiling=0)
[ 222.097] kgem_bo_mapped: map=(nil), tiling=0
[ 222.097] indirect_replace: size=18591 vs 512
[ 222.097] kgem_bo_is_busy: domain: 0 exec? 0, rq? 0
[ 222.097] kgem_bo_is_mappable: domain=0, offset: 8400896 size: 81362944
[ 222.097] kgem_bo_map: handle=16, offset=8400896, tiling=0, map=(nil), domain=0
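The tiling decision in this log follows from pitch arithmetic, sketched below. The function names are hypothetical and this is not the kgem API. The 8192-byte limit on tiled pitch and the 64-byte pitch alignment are assumptions, inferred from the logged values (14436 bytes rejected for tiling, pitch=14464 allocated).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_MAX_TILED_PITCH 8192u /* bytes; assumption for gen3 */
#define ASSUMED_PITCH_ALIGN     64u   /* bytes; inferred from 14436 -> 14464 */

/* Smallest pitch (bytes per row) that holds `width` pixels at `bpp` bits each. */
uint32_t min_pitch(uint32_t width, uint32_t bpp)
{
	assert(bpp % 8 == 0);
	return width * (bpp / 8);
}

/* Round `value` up to a power-of-two `alignment`. */
uint32_t align_up(uint32_t value, uint32_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return (value + alignment - 1) & ~(alignment - 1);
}

/* Tiling is only usable while the row pitch stays within the assumed limit. */
bool pitch_allows_tiling(uint32_t width, uint32_t bpp)
{
	return min_pitch(width, bpp) <= ASSUMED_MAX_TILED_PITCH;
}
```

For the 3609x5275 pixmap at 32 bpp this gives a minimum pitch of 14436 bytes, which exceeds the assumed limit (so tiling falls back to 0) and rounds up to the 14464 bytes reported by the allocator. A 2048-pixel-wide surface lands exactly on 8192 bytes.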
I expected that to blow up eventually, here you go:

commit 90e2740e7e459c56205fa65bab1ae3dbfd5d3945
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 13 13:51:57 2012 +0100

    sna: Remove the conflicting assertion during GTT map

    Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=48636
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
commit 89f2b09b1e5be9842747998ea4fe32a6f1ede4cc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 13 16:37:43 2012 +0100

    sna: Avoid using TILING_Y for large objects on gen2/3

    References: https://bugs.freedesktop.org/show_bug.cgi?id=48636
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That should prevent the gen3 assertion.
Since the hang seems to occur with the window < 2048 px wide and > 2048 px tall, can you try this:

diff --git a/src/sna/sna_render.c b/src/sna/sna_render.c
index 8af80f2..732c991 100644
--- a/src/sna/sna_render.c
+++ b/src/sna/sna_render.c
@@ -891,6 +891,8 @@ sna_render_picture_partial(struct sna *sna,
 	BoxRec box;
 	int offset;
+	return 0;
+
 	DBG(("%s (%d, %d)x(%d, %d) [dst=(%d, %d)]\n",
 	     __FUNCTION__, x, y, w, h, dst_x, dst_y));
@@ -1741,7 +1743,7 @@ sna_render_composite_redirect(struct sna *sna,
 	    height > sna->render.max_3d_size)
 		return FALSE;
-	if (op->dst.bo->pitch <= sna->render.max_3d_pitch) {
+	if (op->dst.bo->pitch <= sna->render.max_3d_pitch && 0) {
 		BoxRec box;
 		int w, h, offset;
Still happens sometimes with "sna: Avoid using TILING_Y for large objects on gen2/3" applied. I also noticed very high CPU load while resizing the window (works well with other applications): according to sysprof, 62% of total system time ends up in pixman_fill_sse2, which spends about 50% of its time in kernel space handling page faults.
The problem is that the GPU can only handle surfaces up to 2048x2048 in size, and we have to play tricks to keep as much of the display accelerated as possible. This includes rendering to only subsurfaces (copying to and from using a different engine within the GPU before using the 3D pipeline) and tiling. And then there are the paths where it looked like the complexity wasn't worth it, or which are normally used differently, where it looks to make more sense just to use the CPU than fight with the GPU. For pixman_fill() to end up dominating in the profiles suggests that an earlier fallback was taken, or the surface is too large to be created on the GPU by default, and just being a nuisance. Next item on the wishlist would be a low impact means of analysing why. The debug log will tell in great detail, but finding it remains a challenge.
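To make the 2048x2048 limit concrete, here is a minimal sketch of the size check and of how an oversized surface could be cut into pieces that each fit the 3D pipeline. It is an illustration only: `too_large` borrows its name from the assertion text earlier in this report, but its signature here is assumed, `pieces_along_axis` and `piece_extent` are hypothetical helpers, and the driver's real subsurface handling (redirection, offset alignment, use of the blitter) is more involved.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_3D_SIZE 2048 /* per-axis limit described above */

/* True if the surface cannot be bound to the 3D pipeline directly. */
bool too_large(int width, int height)
{
	return width > MAX_3D_SIZE || height > MAX_3D_SIZE;
}

/* Number of pieces needed along one axis so that each piece fits the limit. */
int pieces_along_axis(int extent)
{
	assert(extent > 0);
	return (extent + MAX_3D_SIZE - 1) / MAX_3D_SIZE;
}

/* Extent of the index-th piece along one axis; the last piece may be shorter. */
int piece_extent(int extent, int index)
{
	int start = index * MAX_3D_SIZE;

	assert(index >= 0 && start < extent);
	return extent - start < MAX_3D_SIZE ? extent - start : MAX_3D_SIZE;
}
```

With the 3609x5275 pixmap seen in the debug log, this yields 2 pieces horizontally and 3 vertically, whereas the 1280x800 panel itself is well inside the limit.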
Thanks for the detailed explanation. I also tried with your short-circuit patch and I no longer get CPU crashes or hangs, only very poor performance when resizing the window. I also noticed problems with Java2Demo, with that patch applied - should I investigate further or are a few problems to be expected?
No, that patch disables a fast path (whereby we simply create a pointer inside the original surface) and causes a fallback. That should never trigger corruption. :( Ok, can you keep using the patch for a while to be sure that the hang doesn't reoccur and I'll scour the docs for a detail I've missed regarding the 945gm. :(
Can you try the 3 patches in http://cgit.freedesktop.org/~ickle/xf86-video-intel/log/?h=for-clemens ? They just tweak the alignment of the subsurface, but you never know...
Didn't see any ill-effects of those patches so I pushed them to master in the hope they help stability in some extreme cases, and here!
I believe I have this nailed with the commits up to:

commit 61cac5c265279d45677262216a0ba56f548cd898
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 3 22:33:59 2012 +0100

    sna: Maintain a reference to the chain of proxies

    Rather than attempt to flatten the chain to the last link, we may need to hold a reference to the intermediate links in case of batch buffer submission.

    Fixes http://tnsp.org/~ccr/intel-gfx/test.html

    Reported-by: Matti Hamalainen <ccr@tnsp.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=49436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

in particular,

commit 19fd24a4db994bb5c5ce4a73f06d9394a758ea91
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 3 17:35:10 2012 +0100

    sna: Fix offset for combining damage

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Unfortunately I still get those artifacts with the latest git snapshot :/
Is it possible yet to get a debug=full Xorg.0.log?
As one of the few SNA bugs left, can I have a quick status update? :)
I'll try to get that debug=full log uploaded today.
Please find the debug=full log at http://93.83.133.214/debug-log.7z , testing started right after the first VT switch and led to a GPU hang right before testing ended.
Thanks Clemens, let's hope all will be revealed...
I was tracking through the code to see why we tried to create a large glyph mask, failed and fell back, and noticed:

[ 16.829] (II) intel(0): SNA compiled from 2.18.0-175-g42a8461

which explains why you are not hitting the code to prevent that, added in

commit 0b81bafb802bb86454739ed46cf45571bccef735
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 6 15:14:45 2012 +0100

    sna/glyphs: Prefer a temporary upload mask for large glyph masks

    If the required temporary mask is larger than the 3D pipeline can handle, just render to a CPU buffer rather than redirect every glyph composition.

Mind updating? Because you're missing the code to fixup some crashes with large windows... *g*
Strange, could it be your browser-cache is playing tricks with you? (I already uploaded a file with exactly the same name). I downloaded and extracted the file just to make sure, and mine tells me: [ 1842.826] (II) intel(0): SNA compiled from 2.19.0-54-g3c9759e
Never underestimate my stupidity. I suspect I downloaded it on one machine and read the old file on another...
Can you do me another favour, and recompile with #define DEBUG_FLUSH_SYNC 1 in src/sna/kgem.c and grab another full debug log? Adding that define will cause the driver to check after each batch to find the culprit causing the GPU hang. (Sorry for not thinking of that earlier.)
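Why syncing after every batch pinpoints the culprit can be shown with a toy model. This is not the driver's implementation of DEBUG_FLUSH_SYNC: the `struct batch` type, its `hangs_gpu` flag and `first_detected_hang` are all hypothetical, and the model simply assumes that without syncing a hang is only noticed once the whole queue has been submitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct batch {
	int id;
	bool hangs_gpu; /* hypothetical flag: this batch wedges the GPU */
};

/*
 * Returns the id of the batch at which the hang is first noticed, or -1.
 * With flush_sync the GPU is checked after every submission, so the
 * offending batch is identified directly. Without it, the check only
 * happens after the last submission, so the blame lands on the wrong batch.
 */
int first_detected_hang(const struct batch *batches, size_t count, bool flush_sync)
{
	bool hung = false;

	assert(batches != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		hung = hung || batches[i].hangs_gpu;
		if (flush_sync && hung)
			return batches[i].id;
	}
	return (hung && count > 0) ? batches[count - 1].id : -1;
}
```

In this model, a queue of three batches where the second one hangs is reported as batch 2 with syncing and as batch 3 without, which is the difference the extra debug log is meant to expose.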
It's not like those things never happen to me ;) Here's the log with flush=1: http://93.83.133.214/debug-log2.7z
Hmm, can you please test this theory:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index df69b90..2819b3c 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -767,7 +767,7 @@ void kgem_get_tile_size(struct kgem *kgem, int tiling,
 {
 	if (kgem->gen <= 30) {
 		if (tiling) {
-			*tile_width = 512;
+			*tile_width = 8192; /* prevent sub-row offsets */
 			if (kgem->gen < 30) {
 				*tile_height = 16;
 				*tile_size = 2048;
@@ -780,6 +780,16 @@ void kgem_get_tile_size(struct kgem *kgem, int tiling,
 			*tile_height = 1;
 			*tile_size = 1;
 		}
+	} else if (kgem->gen < 33) {
+		if (tiling) {
+			*tile_width = 8192; /* prevent sub-row offsets */
+			*tile_height = tiling == I915_TILING_X ? 8 : 32;
+			*tile_size = 4096;
+		} else {
+			*tile_width = 1;
+			*tile_height = 1;
+			*tile_size = 1;
+		}
 	} else switch (tiling) {
 	default:
 	case I915_TILING_NONE:
Created attachment 61811 [details] [review] Disable sub-row offsets
Unfortunately it didn't help, here's its debug-log: http://93.83.133.214/debug-log3.7z
btw testing was done with icewm (because xfce doesn't work with the monitor setting I have at home caused by the xorg-server bug you mentioned), so it seems to be wm independent.
Oh this is actually more insidious than I realised.... And rather obvious when looking at the batch buffers - I'm shocked that I failed to hit this when testing on gen3...

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index df69b90..27ec0b9 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -3512,6 +3512,7 @@ struct kgem_bo *kgem_create_proxy(struct kgem_bo *target,
 	if (bo == NULL)
 		return NULL;
+	bo->unique_id = kgem_get_unique_id(kgem);
 	bo->reusable = false;
 	bo->size.bytes = length;
Works now perfectly fine :) btw - I changed kgem_get_unique_id(kgem) to kgem_get_unique_id(target) to make it compile.
Whoops, right you are, ended up passing in kgem instead...

commit f91dcc44dcc15850f82666b1bcdd27182400e7dc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 18 20:09:41 2012 +0100

    sna: Give the proxy a unique name

    So that if we cache the current destination bo (for example, gen3) then a new proxy (or even just a new batchbuffer) will indeed cause the destination buffer to be updated.

    Reported-and-tested-by: Clemens Eisserer <linuxhippy@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48636
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
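The failure mode described in that commit message can be reproduced in a small model of a destination cache keyed on a buffer's unique id. Everything here is a hypothetical stand-in rather than the SNA code: `struct bo`, `struct render_state`, `make_proxy` and `set_target` are invented for illustration, and the pre-fix behaviour is modelled as every proxy sharing id 0, which is an assumption about how the stale state arose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bo {
	unsigned unique_id;
	unsigned offset; /* where rendering into this bo lands */
};

struct render_state {
	unsigned last_target_id;    /* id of the destination last programmed */
	unsigned programmed_offset; /* offset the hardware was last told about */
	int emits;                  /* how often destination state was emitted */
};

unsigned next_id = 1;

struct bo make_bo(unsigned offset)
{
	struct bo bo = { next_id++, offset };
	return bo;
}

/* A proxy is a window into `target`; fresh_id selects fixed vs. pre-fix naming. */
struct bo make_proxy(const struct bo *target, unsigned delta, bool fresh_id)
{
	struct bo proxy;

	assert(target != NULL);
	proxy.offset = target->offset + delta;
	proxy.unique_id = fresh_id ? next_id++ : 0; /* 0: assumed pre-fix value */
	return proxy;
}

/* Re-emit destination state only when the cached id changes. */
void set_target(struct render_state *state, const struct bo *dst)
{
	if (state->last_target_id == dst->unique_id)
		return; /* cache hit: hardware keeps its old offset */
	state->last_target_id = dst->unique_id;
	state->programmed_offset = dst->offset;
	state->emits++;
}
```

In this model, two proxies created without a fresh id look identical to the cache, so switching from the first to the second leaves the hardware pointing at the first proxy's offset and rendering lands in the wrong place. Giving each proxy its own id forces the destination to be re-emitted.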