Bug 35016

Summary: [Arrandale] Compiz segfault after upgrading from 2.6.38-rc6 to 2.6.38-rc7
Product: xorg Reporter: roberth <sarvatt>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: indan, kamal
Version: 7.6 (2010.12)   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Xorg.0.log
none
ThreadStacktrace.txt
none
BootDmesg.txt
none
CurrentDmesg.txt
none
FOR TESTING ONLY: copious debug tracing to i915_tiling_ok()
none
dmesg log of the i915_tiling_ok() tracing through the compiz crash
none
drm/i915: fix Gen4+ has non-power-of-two strides none

Description roberth 2011-03-04 06:46:01 UTC
Created attachment 44123 [details]
Xorg.0.log

Forwarding from launchpad user Kamal Mostafa 

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/728672

Original report:

Running my own build of the natty 'master-next' kernel branch (2.6.38-6.33 at SHA1-ID "57545cb"; based on linux 2.6.38-rc7), the default unity desktop session makes compiz crash immediately upon login (100% reproducible):

  compiz[1388]: segfault at 20 ip 00007ff0cee9702a sp 00007fffc50a9580 error 4 in i965_dri.so[7ff0cee6b000+ac000]

The crash does NOT occur when running the natty main kernel (2.6.38-5.32), so this appears to be a regression caused by the master-next branch.

Note that I the gnome-panel (classic) session works fine. Also, after unity/compiz crashes, I can then switch to a VT and launch gnome-panel and compiz manually... In that state, "unity --replace" does make the same crash occur immediately. (Effectively, only unity seems to provoke this compiz crash, and it always does so).

uname: Linux marconi 2.6.38-6-generic #33~kamal~57545cb (my build from master-next)
unity: Installed: 3.6.0-0ubuntu2 (from natty/main)
compiz: Installed: 1:0.9.4-0ubuntu3 (from natty/main)
libgl1-mesa-dri: Installed: 7.10.1~git20110215.cc1636b6-0ubuntu2 (from natty/main)
xf86-video-intel: 2.14.0
libdrm: 2.4.23
Comment 1 roberth 2011-03-04 06:48:00 UTC
Created attachment 44125 [details]
ThreadStacktrace.txt

StacktraceTop:
 intel_miptree_image_copy (intel=0x1592fc0, dst=0x0, face=0, level=0, src=0x2e0a2a0) at intel_mipmap_tree.c:435
 copy_image_data_to_tree (intel=0x1592fc0, unit=<value optimized out>) at intel_tex_validate.c:91
 intel_finalize_mipmap_tree (intel=0x1592fc0, unit=<value optimized out>) at intel_tex_validate.c:232
 brw_validate_textures (brw=0x1592fc0) at brw_tex.c:56
 brw_try_draw_prims (ctx=0x1592fc0, arrays=0x15e3b60, prim=0x7fffc50a97b0, nr_prims=1, ib=0x0, index_bounds_valid=<value optimized out>, min_index=0, max_index=3) at brw_draw.c:319
Comment 2 roberth 2011-03-04 06:48:21 UTC
Created attachment 44126 [details]
BootDmesg.txt
Comment 3 roberth 2011-03-04 06:48:38 UTC
Created attachment 44127 [details]
CurrentDmesg.txt
Comment 4 roberth 2011-03-04 06:50:00 UTC
Reverting c2e0eb167070a6e9dcb49c84c13c79a30d672431 (drm/i915: fix corruptions on i8xx due to relaxed fencing) fixed the issue for him. Unfortunately bea96046b4245e9abd65ed7acfed9adfd5f6c639 (drm/i915: Gen4+ has non-power-of-two strides) from drm-intel-staging did not
Comment 5 Chris Wilson 2011-03-04 12:14:57 UTC
If you are in a position to build your own kernel, can you print out the failing values in the i915_tiling_ok() function?
Comment 6 Kamal Mostafa 2011-03-05 12:31:40 UTC
Created attachment 44154 [details] [review]
FOR TESTING ONLY: copious debug tracing to i915_tiling_ok()

I added printk's to log all paths through i915_tiling_ok().  Attaching the patch here just for reference.
Comment 7 Kamal Mostafa 2011-03-05 12:41:27 UTC
Created attachment 44155 [details]
dmesg log of the i915_tiling_ok() tracing through the compiz crash

Chris, here's the log of all calls to i915_tiling_ok().  Look for "return=0" which first occurs when I log in, just before compiz crashes, e.g:

[   46.181700] i915_tiling_ok(dev,1536,40960,1)+line 45 return=0 tile_width=512 tile_height=8
...
[   46.328162] compiz[1647]: segfault at 20 ip 00007fb4b583702a sp 00007fff0f20eda0 error 4 in i965_dri.so[7fb4b580b000+ac000]

Note that the "+line NN" line numbers are from the top of i915_tiling_ok(), marking the return from that function.
Comment 8 Kamal Mostafa 2011-03-05 13:11:09 UTC
Chris, your patch bea96046b4245e9abd65ed7acfed9adfd5f6c639 (drm/i915: Gen4+ has non-power-of-two strides) from drm-intel-staging moved this chunk up higher in i915_tiling_ok(), but did you perhaps miss a line?:

+	if (INTEL_INFO(dev)->gen >= 4) {
+		/* 965+ just needs multiples of tile width */
+		if (stride & (tile_width - 1))
+			return false;
>>>>>           return true;            <<<<  SHOULD THIS BE ADDED?
+	} else {

That return true; was present in the chunk before you moved it:

-	/* 965+ just needs multiples of tile width */
-	if (INTEL_INFO(dev)->gen >= 4) {
-		if (stride & (tile_width - 1))
-			return false;
-		return true;
-	}

... and without it, we'll still end up returning false for the particular problem values when we hit this down at the very bottom:

+	if (size % (tile_height * stride))
 		return false;

I'll try adding that return true; and see what happens, but please advise whether it makes sense to do so.
Comment 9 Kamal Mostafa 2011-03-05 13:43:15 UTC
Created attachment 44156 [details] [review]
drm/i915: fix Gen4+ has non-power-of-two strides

I confirm that adding the missing "return true;" does resolve the crash.

The attached patch (on top of "drm/i915: Gen4+ has non-power-of-two strides") implements such.
Comment 10 Chris Wilson 2011-03-06 00:55:35 UTC
In this case, the kernel is correct. Userspace has requested tiling on too small a buffer, so there is indeed the potential for a random hang depending upon the allocation of the buffer in the GTT (since the sampler will read attempt to access memory beyond the allocation).

So we need to head up through the stack and nail the culprit.

If you add

diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
index f5ab0a6..08e7d98 100644
--- a/intel/intel_bufmgr_gem.c
+++ b/intel/intel_bufmgr_gem.c
@@ -1741,6 +1741,7 @@ drm_intel_gem_bo_set_tiling_internal(drm_intel_bo *bo,
                            DRM_IOCTL_I915_GEM_SET_TILING,
                            &set_tiling);
        } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
+       assert(ret == 0);
        if (ret == -1)
                return -errno;
 
to libdrm and attach gdb to compiz and print the bt and locals for the next crash. That should find the root cause.
Comment 11 Chris Wilson 2011-03-06 01:08:50 UTC
Looking closely at callers of set_tiling, we do indeed overallocate bo (rounding up to the next cache-size) and so the kernel test is overzealous.

commit 0ee537abbd10a9abf11e1c22ee32a68e8c12ed4a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Mar 6 09:03:16 2011 +0000

    Revert "drm/i915: fix corruptions on i8xx due to relaxed fencing"
    
    This reverts commit c2e0eb167070a6e9dcb49c84c13c79a30d672431.
    
    As it turns out, userspace already depends upon being able to enable
    tiling on existing bo which it promises to be large enough for its
    purposes i.e. it will not access beyond the end of the last full-tile
    row.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35016
    Reported-and-tested-by: Kamal Mostafa <kamal@canonical.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 12 Indan Zupancic 2011-03-09 22:16:41 UTC
(In reply to comment #11)
> Looking closely at callers of set_tiling, we do indeed overallocate bo
> (rounding up to the next cache-size) and so the kernel test is overzealous.
> 
> commit 0ee537abbd10a9abf11e1c22ee32a68e8c12ed4a
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sun Mar 6 09:03:16 2011 +0000
> 
>     Revert "drm/i915: fix corruptions on i8xx due to relaxed fencing"
> 
>     This reverts commit c2e0eb167070a6e9dcb49c84c13c79a30d672431.
> 
>     As it turns out, userspace already depends upon being able to enable
>     tiling on existing bo which it promises to be large enough for its
>     purposes i.e. it will not access beyond the end of the last full-tile
>     row.
> 
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35016
>     Reported-and-tested-by: Kamal Mostafa <kamal@canonical.com>
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Without c2e0e gen 2 hardware gets screen corruption!

So I guess it's this check that causes this bug:

+	/* Size needs to be aligned to a full tile row */
+	if (size & (tile_height * stride - 1))
+		return false;

Perhaps remove only that bit, or instead add a IS_GEN2(dev) check to it?
Comment 13 Indan Zupancic 2011-03-10 02:37:55 UTC
Never mind, Daniel told me that it's fixed in userspace with
a newer xf86-video-intel + libdrm.

This bug can be closed again, sorry for the noise.
Comment 14 Felix Braun 2011-03-20 13:00:34 UTC
I am still getting the screen corruption with recent xubuntu on gen2 hardware GM855.

using these packages
xserver-xorg-video-intel 2.14.0-4ubuntu4
libdrm 2.4.23-1ubuntu3

Which versions are supposed to fix the issue?
Comment 15 Chris Wilson 2011-03-20 13:03:59 UTC
xf86-video-intel commit d21d781466785c317131a8a57606925867265dc8
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Feb 22 18:31:44 2011 +0100

    Fix relaxed tiling on gen2
    
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

in 2.14.901

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.