Bug 71605 - 2.99.906 causes XOrg to die quite frequently
Summary: 2.99.906 causes XOrg to die quite frequently
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-14 11:50 UTC by Jay Little
Modified: 2013-11-14 16:28 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
DMESG output with extended drm logging turned on (161.53 KB, text/plain)
2013-11-14 11:50 UTC, Jay Little
no flags Details
XOrg log file that was created when the crash occurred (18.60 KB, text/plain)
2013-11-14 11:51 UTC, Jay Little
no flags Details
XOrg log of crash with debugging enabled (not full debugging) (30.38 KB, text/plain)
2013-11-14 15:24 UTC, Jay Little
no flags Details
GDM log of crash with debugging enabled (not full debugging) (26.90 KB, text/plain)
2013-11-14 15:25 UTC, Jay Little
no flags Details

Description Jay Little 2013-11-14 11:50:48 UTC
Created attachment 89185 [details]
DMESG output with extended drm logging turned on

So with the recent release of 2.99.906, I've run into a serious issue.  Namely XOrg is crashing - alot.  Downgrading to 2.99.904 resolves the issue though I wasn't having this problem a few weeks ago with an interim build.

The easiest way for me to replicate this is:

[1] Log into GDM
[2] Wait for Cinnamon to start
[3] Click on the Thunderbird icon in my docky dock

Around 50% of the time, this will cause XOrg to die and gdm will restart it.  If this doesn't cause XOrg to die, goofing around for a few minutes in other apps will eventually cause it.  In any event when Thunderbird dies this way it's always in the process of "zooming the window in" from the center of the screen.

I've attached the relevant XOrg log (though it's pretty useless).  I've also attached my dmesg dump as I had turned on extended drm logging in preparation for replicating and opening a bug on the issue this morning.
Comment 1 Jay Little 2013-11-14 11:51:30 UTC
Created attachment 89186 [details]
XOrg log file that was created when the crash occurred
Comment 2 Chris Wilson 2013-11-14 12:02:11 UTC
Yeah, no indication of failure there. Try attaching gdb (a second computer is best).
Comment 3 Jay Little 2013-11-14 12:06:24 UTC
Unfortunately I'm not in a position to do that :(  I guess I'll have to hope somebody else can replicate the bug.  If I could narrow down to a specific commit, would that help?
Comment 4 Chris Wilson 2013-11-14 12:10:49 UTC
That would be a reasonable approach. Also look if anything is captured in other logs like /var/log/gdm/:0.log*
Comment 5 Jay Little 2013-11-14 12:49:24 UTC
There is nothing worth seeing in the GDM logs.

I've narrowed it down to a specific commit.  The issue starts with the following commit:

[f3225fcb38686f3b9701725bf3a11ecf1c100c3f] sna: Be move conservative with tiling sizes for older fenced gen: Crash

http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=f3225fcb38686f3b9701725bf3a11ecf1c100c3f

I'm testing the commit made directly prior to make absolutely sure it does not have the issue. So far so good.
Comment 6 Jay Little 2013-11-14 13:25:24 UTC
Yeah so far the previous commit seems rock solid.  So I'll be sticking with that one for the time being.
Comment 7 Chris Wilson 2013-11-14 14:29:11 UTC
False positive. That code is not even executed for your system and cannot cause the behaviour you suggest.
Comment 8 Chris Wilson 2013-11-14 14:30:53 UTC
If it is that frequent, please compile with --enable-debug=full and attach the debug log leading up to the crash.
Comment 9 Jay Little 2013-11-14 14:43:05 UTC
It's definitely something in that commit.  I have no doubts about it.  The commit immediately prior has been rock solid.  In the bugged versions getting X to crash required virtually no effort.  If opening up Thunderbird after initially logging in didn't do it, opening a few tabs in the browser, scrolling around a bit and then opening thunderbird again would do it.

I will recompile the bugged version with debugging and provide the results.
Comment 10 Jay Little 2013-11-14 15:02:05 UTC
I am unable to replicate the issue when the driver is compiled in debug mode.  X runs quite sluggish but absolutely refuses to crash.  As soon as I switch to the same driver not compiled in debug mode I can replicate the issue.
Comment 11 Chris Wilson 2013-11-14 15:03:31 UTC
Just try with --enable-debug then and look for any assertion failure in /var/log/gdm/ (or friends).
Comment 12 Jay Little 2013-11-14 15:10:48 UTC
I'm curious - in regards to:

if (kgem->gen < 033)

Would that evaluate to true or false for my hardware?  I'm assuming based on your statements here it would evaluate to false.  What would the value of kgem->gen be for my hardware?
Comment 13 Jay Little 2013-11-14 15:14:35 UTC
I think based on what I'm seeing in intel_module.c, the gen for my haswell hardware is going to be 075.

If that's true it stands to reason that a patch like the following would change the behavior of the driver for my hardware:

-			size *= 2;
+			if (kgem->gen < 033)
+				size *= 2;

Instead of size always being doubled, it would now only be doubled for earlier generations of the hardware.
Comment 14 Chris Wilson 2013-11-14 15:18:24 UTC
(In reply to comment #13)
> I think based on what I'm seeing in intel_module.c, the gen for my haswell
> hardware is going to be 075.
> 
> If that's true it stands to reason that a patch like the following would
> change the behavior of the driver for my hardware:
> 
> -			size *= 2;
> +			if (kgem->gen < 033)
> +				size *= 2;
> 
> Instead of size always being doubled, it would now only be doubled for
> earlier generations of the hardware.

Apart from that entire block is only evaluate if gen < 040.
Comment 15 Jay Little 2013-11-14 15:24:45 UTC
Created attachment 89197 [details]
XOrg log of crash with debugging enabled (not full debugging)
Comment 16 Jay Little 2013-11-14 15:25:19 UTC
Created attachment 89198 [details]
GDM log of crash with debugging enabled (not full debugging)
Comment 17 Chris Wilson 2013-11-14 15:41:18 UTC
It would help immensely to have debug symbols, addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0x239f1 0x26205 0xb5971 0xb5949
Comment 18 Jay Little 2013-11-14 15:45:27 UTC
I will provide that momentarily ;)  Note: I'm switching between drivers so that my laptop will remain at least somewhat useable during this process.

This code in the patch looked a bit odd:

-		size = 3*kgem->aperture_fenced;
-		if (kgem->aperture_total == kgem->aperture_mappable)
-			size += kgem->aperture;
-		if (size > kgem->aperture_mappable &&
-		    kgem_ring_is_idle(kgem, kgem->ring))
-			return false;
+		if (kgem->aperture_fenced) {
+			size = 3*kgem->aperture_fenced;
+			if (kgem->aperture_total == kgem->aperture_mappable)
+				size += kgem->aperture;
+			if (size > kgem->aperture_mappable &&
+			    kgem_ring_is_idle(kgem, kgem->ring)) {
+				DBG(("%s: opportunistic fence flush\n", __FUNCTION__));
+				return false;
+			}
+		}
 
 		size = kgem->aperture_fenced;

It almost feels like that last line ought to have been deleted. Prior to this revision if execution got here and kgem->aperture_fenced == 0 then I'm guessing "return false" would've been executed.  In this revision this situation would be allowed to continue.
Comment 19 Chris Wilson 2013-11-14 15:51:24 UTC
It's a flush at a low watermark. If the batch has no other fences, then we need to be careful that we only evaluate this command (and its fenced bo) against the high watermark. Otherwise we reject operations that could actually be handled by an early flush and starting with an empty batch.
Comment 20 Jay Little 2013-11-14 15:55:45 UTC
The command you gave me doesn't seem to produce any useful output:

addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0x239f1 0x26205 0xb5971 0xb5949
??:0
??:0
??:0
??:0
Comment 21 Chris Wilson 2013-11-14 15:59:36 UTC
Your installation stripped all of the debug symbols from the drivers. Either correct the install process to preserve the symbols (avoid stripping), or point addr2line to the original intel_drv.so in the build path.
Comment 22 Jay Little 2013-11-14 16:04:00 UTC
Yes it did - running it against the binary in the build directory did the trick.  I added the -f option as well:

addr2line -e ./intel_drv.so -i -f 0x239f1 0x26205 0xb5971 0xb5949
kgem_bo_set_purgeable
kgem.c:?
__kgem_bo_destroy
kgem.c:?
sna_tiling_blt_copy_boxes
:?
sna_tiling_blt_copy_boxes
:?
Comment 23 Chris Wilson 2013-11-14 16:15:47 UTC
You win.

commit 6e9a8c5ae2883ca21d117ac672dd8a55b3429dc1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 14 16:14:09 2013 +0000

    sna: Add the missing braces around the conditional block
    
    Fixes regression from
    commit f3225fcb38686f3b9701725bf3a11ecf1c100c3f
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Tue Nov 5 08:38:22 2013 +0000
    
        sna: Be move conservative with tiling sizes for older fenced gen
    
    Reported-by: Jay Little <jaylittle@jaylittle.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=71605
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 24 Jay Little 2013-11-14 16:28:45 UTC
My initial testing results are very promising.  I'll continue to use the driver over the next few hours and see how it goes.  Also, thanks a lot for your help here.  It's a real pleasure working with you Chris!


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.