Created attachment 44123 [details] Xorg.0.log Forwarding from launchpad user Kamal Mostafa https://bugs.launchpad.net/ubuntu/+source/linux/+bug/728672 Original report: Running my own build of the natty 'master-next' kernel branch (2.6.38-6.33 at SHA1-ID "57545cb"; based on linux 2.6.38-rc7), the default unity desktop session makes compiz crash immediately upon login (100% reproducible): compiz[1388]: segfault at 20 ip 00007ff0cee9702a sp 00007fffc50a9580 error 4 in i965_dri.so[7ff0cee6b000+ac000] The crash does NOT occur when running the natty main kernel (2.6.38-5.32), so this appears to be a regression caused by the master-next branch. Note that I the gnome-panel (classic) session works fine. Also, after unity/compiz crashes, I can then switch to a VT and launch gnome-panel and compiz manually... In that state, "unity --replace" does make the same crash occur immediately. (Effectively, only unity seems to provoke this compiz crash, and it always does so). uname: Linux marconi 2.6.38-6-generic #33~kamal~57545cb (my build from master-next) unity: Installed: 3.6.0-0ubuntu2 (from natty/main) compiz: Installed: 1:0.9.4-0ubuntu3 (from natty/main) libgl1-mesa-dri: Installed: 7.10.1~git20110215.cc1636b6-0ubuntu2 (from natty/main) xf86-video-intel: 2.14.0 libdrm: 2.4.23
Created attachment 44125 [details] ThreadStacktrace.txt StacktraceTop: intel_miptree_image_copy (intel=0x1592fc0, dst=0x0, face=0, level=0, src=0x2e0a2a0) at intel_mipmap_tree.c:435 copy_image_data_to_tree (intel=0x1592fc0, unit=<value optimized out>) at intel_tex_validate.c:91 intel_finalize_mipmap_tree (intel=0x1592fc0, unit=<value optimized out>) at intel_tex_validate.c:232 brw_validate_textures (brw=0x1592fc0) at brw_tex.c:56 brw_try_draw_prims (ctx=0x1592fc0, arrays=0x15e3b60, prim=0x7fffc50a97b0, nr_prims=1, ib=0x0, index_bounds_valid=<value optimized out>, min_index=0, max_index=3) at brw_draw.c:319
Created attachment 44126 [details] BootDmesg.txt
Created attachment 44127 [details] CurrentDmesg.txt
Reverting c2e0eb167070a6e9dcb49c84c13c79a30d672431 (drm/i915: fix corruptions on i8xx due to relaxed fencing) fixed the issue for him. Unfortunately bea96046b4245e9abd65ed7acfed9adfd5f6c639 (drm/i915: Gen4+ has non-power-of-two strides) from drm-intel-staging did not
If you are in a position to build your own kernel, can you print out the failing values in the i915_tiling_ok() function?
Created attachment 44154 [details] [review] FOR TESTING ONLY: copious debug tracing to i915_tiling_ok() I added printk's to log all paths through i915_tiling_ok(). Attaching the patch here just for reference.
Created attachment 44155 [details] dmesg log of the i915_tiling_ok() tracing through the compiz crash Chris, here's the log of all calls to i915_tiling_ok(). Look for "return=0" which first occurs when I log in, just before compiz crashes, e.g: [ 46.181700] i915_tiling_ok(dev,1536,40960,1)+line 45 return=0 tile_width=512 tile_height=8 ... [ 46.328162] compiz[1647]: segfault at 20 ip 00007fb4b583702a sp 00007fff0f20eda0 error 4 in i965_dri.so[7fb4b580b000+ac000] Note that the "+line NN" line numbers are from the top of i915_tiling_ok(), marking the return from that function.
Chris, your patch bea96046b4245e9abd65ed7acfed9adfd5f6c639 (drm/i915: Gen4+ has non-power-of-two strides) from drm-intel-staging moved this chunk up higher in i915_tiling_ok(), but did you perhaps miss a line?: + if (INTEL_INFO(dev)->gen >= 4) { + /* 965+ just needs multiples of tile width */ + if (stride & (tile_width - 1)) + return false; >>>>> return true; <<<< SHOULD THIS BE ADDED? + } else { That return true; was present in the chunk before you moved it: - /* 965+ just needs multiples of tile width */ - if (INTEL_INFO(dev)->gen >= 4) { - if (stride & (tile_width - 1)) - return false; - return true; - } ... and without it, we'll still end up returning false for the particular problem values when we hit this down at the very bottom: + if (size % (tile_height * stride)) return false; I'll try adding that return true; and see what happens, but please advise whether it makes sense to do so.
Created attachment 44156 [details] [review] drm/i915: fix Gen4+ has non-power-of-two strides I confirm that adding the missing "return true;" does resolve the crash. The attached patch (on top of "drm/i915: Gen4+ has non-power-of-two strides") implements such.
In this case, the kernel is correct. Userspace has requested tiling on too small a buffer, so there is indeed the potential for a random hang depending upon the allocation of the buffer in the GTT (since the sampler will read attempt to access memory beyond the allocation). So we need to head up through the stack and nail the culprit. If you add diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c index f5ab0a6..08e7d98 100644 --- a/intel/intel_bufmgr_gem.c +++ b/intel/intel_bufmgr_gem.c @@ -1741,6 +1741,7 @@ drm_intel_gem_bo_set_tiling_internal(drm_intel_bo *bo, DRM_IOCTL_I915_GEM_SET_TILING, &set_tiling); } while (ret == -1 && (errno == EINTR || errno == EAGAIN)); + assert(ret == 0); if (ret == -1) return -errno; to libdrm and attach gdb to compiz and print the bt and locals for the next crash. That should find the root cause.
Looking closely at callers of set_tiling, we do indeed overallocate bo (rounding up to the next cache-size) and so the kernel test is overzealous. commit 0ee537abbd10a9abf11e1c22ee32a68e8c12ed4a Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sun Mar 6 09:03:16 2011 +0000 Revert "drm/i915: fix corruptions on i8xx due to relaxed fencing" This reverts commit c2e0eb167070a6e9dcb49c84c13c79a30d672431. As it turns out, userspace already depends upon being able to enable tiling on existing bo which it promises to be large enough for its purposes i.e. it will not access beyond the end of the last full-tile row. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35016 Reported-and-tested-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
(In reply to comment #11) > Looking closely at callers of set_tiling, we do indeed overallocate bo > (rounding up to the next cache-size) and so the kernel test is overzealous. > > commit 0ee537abbd10a9abf11e1c22ee32a68e8c12ed4a > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Sun Mar 6 09:03:16 2011 +0000 > > Revert "drm/i915: fix corruptions on i8xx due to relaxed fencing" > > This reverts commit c2e0eb167070a6e9dcb49c84c13c79a30d672431. > > As it turns out, userspace already depends upon being able to enable > tiling on existing bo which it promises to be large enough for its > purposes i.e. it will not access beyond the end of the last full-tile > row. > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35016 > Reported-and-tested-by: Kamal Mostafa <kamal@canonical.com> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Without c2e0e gen 2 hardware gets screen corruption! So I guess it's this check that causes this bug: + /* Size needs to be aligned to a full tile row */ + if (size & (tile_height * stride - 1)) + return false; Perhaps remove only that bit, or instead add a IS_GEN2(dev) check to it?
Never mind, Daniel told me that it's fixed in userspace with a newer xf86-video-intel + libdrm. This bug can be closed again, sorry for the noise.
I am still getting the screen corruption with recent xubuntu on gen2 hardware GM855. using these packages xserver-xorg-video-intel 2.14.0-4ubuntu4 libdrm 2.4.23-1ubuntu3 Which versions are supposed to fix the issue?
xf86-video-intel commit d21d781466785c317131a8a57606925867265dc8 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Tue Feb 22 18:31:44 2011 +0100 Fix relaxed tiling on gen2 Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> in 2.14.901
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.