Bug 73351

Summary: xf86-video-intel-2.99.907 segfaults while the machine is unattended
Product: xorg Reporter: Mike Auty <mike.auty>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: highest CC: contrib, marti, novaslicences
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Xorg.0.log for the crashed session
none
GDB "bt full" results for the segfault
none
Xorg.0.log for 1.14.5 crashed session
none
GDB "bt full" results for the 1.14.5 segfault
none
gdb-info.txt
none
valgrind.log
none
Xorg.1.log from the valgrind run
none
GDB "bt full" results for the assertion failure with git 6db99169 and assertions
none
Xorg.log with --enable-debug=full
none
GDB "bt full" results for the assertion failure with git 6db99169 and assertions - take 2 none

Description Mike Auty 2014-01-07 10:36:55 UTC
Created attachment 91591 [details]
Xorg.0.log for the crashed session

I'm running into the following segfault every morning, after I've left my machine unattended at night.  The monitor is switched off (which also powers the graphics tablet), and when I return the next morning I'm met with the segfault, which I've attached gdb to and recorded a "bt full" of the crash.  The timings on this log will be wrong, because gdb trapped the segfault and didn't start again until this morning, but I can get the right timings tonight, if that would be useful?

This happened with 2.99.906 as well as 2.99.907, however I believe it did not fail with 2.99.905.  I should also note that I had been using 2.99.906 since late November, but only started to experience problems after I upgraded to gnome-3.10.2 over the Christmas period.

I'm attaching my Xorg.0.log.old (from after the crash and X restarted), and the gdb "bt full" output.  Please let me know if there's any further patches or tests I can conduct, but understand I can only reliably reproduce this overnight.
Comment 1 Mike Auty 2014-01-07 10:37:29 UTC
Created attachment 91592 [details]
GDB "bt full" results for the segfault
Comment 2 Chris Wilson 2014-01-07 10:56:15 UTC
Do you see the same issue with xorg-1.14?
Comment 3 Chris Wilson 2014-01-07 11:09:50 UTC
Also can you please run with ./configure --enable-debug?
Comment 4 Mike Auty 2014-01-08 02:33:56 UTC
Created attachment 91634 [details]
Xorg.0.log for 1.14.5 crashed session

Yes, this still happens with 1.14.5, here's the new Xorg.0.log with --enable-debug used to compile the 2.99.907 drivers.  Very similar gdb "bt full" to follow...
Comment 5 Mike Auty 2014-01-08 02:34:40 UTC
Created attachment 91635 [details]
GDB "bt full" results for the 1.14.5 segfault
Comment 6 Chris Wilson 2014-01-08 09:23:27 UTC
If it manages to get that far with assertions disabled, it is an extremely subtle bug. :|

Can I ask if you can p *priv, p *priv->cpu_bo, p *priv->gpu_bo in frame #2 (_sna_pixmap_move_to_cpu) ?
Comment 7 Chris Wilson 2014-01-08 09:23:45 UTC
*assertions enabled.
Comment 8 Mike Auty 2014-01-08 09:25:07 UTC
I'm a bit of a beginner when it comes to gdb, could you let me know how I should do that please?
Comment 9 Chris Wilson 2014-01-08 09:29:46 UTC
From within gdb, after the break due to the segfault, either type "frame 2" or "up" twice. Then type "p *priv", "p *priv->cpu_bo", "p *priv->gpu_bo". And hope gcc hasn't optimised away priv. If it complains that priv is not available, try "p *sna_pixmap(pixmap)", "p *sna_pixmap(pixmap)->cpu_bo", "p *sna_pixmap(pixmap)->gpu_bo". (And cross your fingers.)

A last resort would be --enable-debug=full and hoping the error still occurs. The Xorg.0.log will be massive (may even be greater than 2GiB) but only the last 1000 lines will be enough to check to see if the bug exposes itself.
Comment 10 Marti Raudsepp 2014-01-09 15:23:57 UTC
*** Bug 73439 has been marked as a duplicate of this bug. ***
Comment 11 Chris Wilson 2014-01-10 20:00:40 UTC
If you do manage to catch this in gdb, and have irc available, please ping me (ickle on #intel-gfx irc.freenode.net) as there are quite a few things that I want to inspect - and I'm not quite sure where to start digging.
Comment 12 Mike Auty 2014-01-10 20:31:22 UTC
Sure, I'm afraid I haven't been able to recreate it the last two nights, and from tomorrow I'm going to be away until Wednesday, but I'm hoping to be able to recreate it soon.  I'll run through the lists of all the variables you've left me, and I'll try to catch you on IRC if I can...
Comment 13 Marti Raudsepp 2014-01-14 23:37:22 UTC
Finally managed to catch this in a functioning gdb. Note that for me, this bug very rarely occurs "while the machine is unattended", but usually crashes in the middle of normal usage -- like now.

This time it's my laptop with Intel i5-560M (I believe HD Graphics 3000)

I only ever had 2 Firefox windows, a few terminals and Skype open in this session, as I had just started up Xorg recently.

I saved a core file and backed up the binaries just in case you need more data.

"bt full" and all this is also available in the attached file.

(gdb) up
#1  0x00007f833010953b in _sna_pixmap_move_to_cpu (pixmap=0x1a03de0, flags=3) at sna_accel.c:2118
2118                            memset(pixmap->devPrivate.ptr, priv->clear_color,
(gdb) up
#2  0x00007f833010a80b in sna_drawable_move_region_to_cpu (drawable=0x19f39b0, region=0x7fffa41c2fa0, flags=3) at sna_accel.c:2487
2487                    return _sna_pixmap_move_to_cpu(pixmap, flags | MOVE_READ);
(gdb) p *priv
$1 = {pixmap = 0x1a03de0, gpu_bo = 0x1e1bb70, cpu_bo = 0x0, gpu_damage = 0x1f43e21, cpu_damage = 0x0, cow = 0x0, ptr = 0x0, move_to_gpu = 0x0, move_to_gpu_data = 0x0, flush_list = {next = 0x1a03ea8, prev = 0x1a03ea8}, 
  cow_list = {next = 0x0, prev = 0x0}, stride = 192, clear_color = 0, source_count = 4, pinned = 2 '\002', create = 11 '\v', mapped = 0 '\000', flush = 1 '\001', shm = 0 '\000', clear = 1 '\001', header = 1 '\001', 
  cpu = 0 '\000'}
(gdb) p *priv->cpu_bo
Cannot access memory at address 0x0
(gdb) p *priv->gpu_bo
$2 = {rq = 0x2391820, exec = 0x0, proxy = 0x0, list = {next = 0x1e1bb88, prev = 0x1e1bb88}, request = {next = 0x2391838, prev = 0x1524a48}, vma = {next = 0x1e1bba8, prev = 0x1e1bba8}, map__cpu = 0x7f832b2dd000, 
  map__gtt = 0x7f832bdce000, binding = {next = 0x20a1a50, format = 324, offset = 0}, presumed_offset = 215392256, unique_id = 62201, refcnt = 2, handle = 221, target_handle = 4294967295, delta = 0, size = {pages = {
      count = 6, bucket = 2}, bytes = 268435462}, pitch = 512, tiling = 1, reusable = 0, gpu_dirty = 0, gtt_dirty = 0, domain = 3, needs_flush = 1, snoop = 0, io = 0, flush = 1, scanout = 0, purged = 0}
(gdb) p *sna_pixmap(pixmap)
$1 = {pixmap = 0x1a03de0, gpu_bo = 0x1e1bb70, cpu_bo = 0x0, gpu_damage = 0x1f43e21, cpu_damage = 0x0, cow = 0x0, ptr = 0x0, move_to_gpu = 0x0, move_to_gpu_data = 0x0, 
  flush_list = {next = 0x1a03ea8, prev = 0x1a03ea8}, cow_list = {next = 0x0, prev = 0x0}, stride = 192, clear_color = 0, source_count = 4, pinned = 2 '\002', create = 11 '\v', 
  mapped = 0 '\000', flush = 1 '\001', shm = 0 '\000', clear = 1 '\001', header = 1 '\001', cpu = 0 '\000'}
(gdb) p *sna_pixmap(pixmap)->cpu_bo
Cannot access memory at address 0x0
(gdb) p *sna_pixmap(pixmap)->gpu_bo
$2 = {rq = 0x2391820, exec = 0x0, proxy = 0x0, list = {next = 0x1e1bb88, prev = 0x1e1bb88}, request = {next = 0x2391838, prev = 0x1524a48}, vma = {next = 0x1e1bba8, 
    prev = 0x1e1bba8}, map__cpu = 0x7f832b2dd000, map__gtt = 0x7f832bdce000, binding = {next = 0x20a1a50, format = 324, offset = 0}, presumed_offset = 215392256, 
  unique_id = 62201, refcnt = 2, handle = 221, target_handle = 4294967295, delta = 0, size = {pages = {count = 6, bucket = 2}, bytes = 268435462}, pitch = 512, tiling = 1, 
  reusable = 0, gpu_dirty = 0, gtt_dirty = 0, domain = 3, needs_flush = 1, snoop = 0, io = 0, flush = 1, scanout = 0, purged = 0}
Comment 14 Marti Raudsepp 2014-01-14 23:38:04 UTC
Created attachment 92102 [details]
gdb-info.txt
Comment 15 Chris Wilson 2014-01-15 10:24:46 UTC
Interesting... Should have asked for p *pixmap as well :|

Afaics, there are two possibilities here. Either some memory corruption surgically cleared out parts of *priv, or pixmap->devPrivate.ptr has been erroneously set somewhere. Both should be impossible.

To be sure I've added some more asserts in

commit 40beee995e71b1fa4cd146a986f53e03595a4378
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 15 10:07:32 2014 +0000

    sna: Add some more paranoia around dangling pointers
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=73351
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

please do keep on running with ./configure --enable-debug and keep gdb handy.
Comment 16 Chris Wilson 2014-01-15 12:05:02 UTC
*** Bug 73397 has been marked as a duplicate of this bug. ***
Comment 17 Marti Raudsepp 2014-01-15 23:56:07 UTC
(In reply to comment #15)
> Interesting... Should have asked for p *pixmap as well :|

D'oh I missed this before. But as I said, I still have the core file, so here's the pixmap:

(gdb) p *pixmap
$3 = {drawable = {type = 1 '\001', class = 0 '\000', depth = 32 ' ', bitsPerPixel = 32 ' ', id = 0, x = 0, y = 0, width = 48, height = 48, pScreen = 0x15219b0, 
    serialNumber = 265}, devPrivates = 0x1a03e28, refcnt = 4, devKind = 192, devPrivate = {ptr = 0x7f832ca99000, val = 140201366753280, uval = 140201366753280, 
    fptr = 0x7f832ca99000}, screen_x = 1380, screen_y = 912, usage_hint = 0, master_pixmap = 0x7f832d111280}
Comment 18 Marti Raudsepp 2014-01-16 00:33:37 UTC
Created attachment 92197 [details]
valgrind.log

Ok, it crashed again under valgrind (log attached). This time I have a lot more warnings, including many invalid writes. Using xf86-video-intel-git 2.99.907.16.g40beee9

Interestingly, it didn't print a stack trace into Xorg's log again; does valgrind somehow override Xorg's SIGSEGV handler?
Comment 19 Marti Raudsepp 2014-01-16 00:34:43 UTC
Created attachment 92198 [details]
Xorg.1.log from the valgrind run
Comment 20 Mike Auty 2014-01-16 04:02:03 UTC
So I managed to catch it again, but before I'd managed to apply that patch, unfortunately.

All the "p *priv[...]" ones said they'd been optimized out, and the first "p *sna_pixmap(pixmap)" one I tried bombed out.  5:(

(gdb) p *sna_pixmap(pixmap)

Program received signal SIGABRT, Aborted.
[Thread 0x7f9532b4b700 (LWP 14081) exited]
[Thread 0x7f9533b4d700 (LWP 14079) exited]
[Thread 0x7f953334c700 (LWP 14080) exited]
[Thread 0x7f953234a700 (LWP 14082) exited]


I'll rebuild with the extra assertions and see if I can get it to trigger again.  Sorry it's taking such a long time to recreate...
Comment 21 Chris Wilson 2014-01-16 08:59:00 UTC
(In reply to comment #18)
> Created attachment 92197 [details]
> valgrind.log
> 
> Ok, it crashed again under valgrind (log attached). This time I have a lot
> more warnings, including many invalid writes. Using xf86-video-intel-git
> 2.99.907.16.g40beee9

Apologies, it appears you compiled it without the valgrind headers and it hasn't included the magic to suppress all the false positives. 

> Interestingly, it didn't print a stack trace into Xorg's log again; does
> valgrind somehow override Xorg's SIGSEGV handler?

No. More likely it hit an assertion in that case...
Comment 22 Chris Wilson 2014-01-16 10:17:26 UTC
The *pixmap has what I feared - pixmap->devPrivate.ptr is non-zero, but we apparently have never allocated CPU space for it (priv->ptr being NULL), nor is it mapped. So where did that dangling pointer come from?
Comment 23 Chris Wilson 2014-01-16 12:20:40 UTC
Another layer of asserts pushed to xf86-video-intel.git. I'd recommend using gdb as it should at least catch the issue identified thus far earlier than valgrind.
Comment 24 Marti Raudsepp 2014-01-16 23:35:18 UTC
Caught an assertion failure...

Xorg.valgrind: sna.h:514: sna_pixmap_unmap: Assertion `pixmap->devPrivate.ptr == ((void*)((uintptr_t)(priv->ptr) & ~1))' failed.

(I'm not running in valgrind any more, but I still have the diversion)

I backed up the core file, stderr and Xorg log, let me know if you need any more info.

(gdb) thread apply all bt full

Thread 3 (Thread 0x7f72d8a94700 (LWP 12032)):
#0  0x00007f72deab603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#1  0x00007f72d939e1f9 in __run__ (arg=0x1433450) at sna_threads.c:61
        t = 0x1433450
        signals = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}
        __PRETTY_FUNCTION__ = "__run__"
#2  0x00007f72deab20a2 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#3  0x00007f72dd7e132d in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x7f72d8293700 (LWP 12033)):
#0  0x00007f72deab603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
No symbol table info available.
#1  0x00007f72d939e1f9 in __run__ (arg=0x14334c0) at sna_threads.c:61
        t = 0x14334c0
        signals = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}
        __PRETTY_FUNCTION__ = "__run__"
#2  0x00007f72deab20a2 in start_thread () from /usr/lib/libpthread.so.0
No symbol table info available.
#3  0x00007f72dd7e132d in clone () from /usr/lib/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7f72df7a4880 (LWP 12030)):
#0  0x00007f72dd731369 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007f72dd732768 in abort () from /usr/lib/libc.so.6
No symbol table info available.
#2  0x00007f72dd72a456 in __assert_fail_base () from /usr/lib/libc.so.6
No symbol table info available.
#3  0x00007f72dd72a502 in __assert_fail () from /usr/lib/libc.so.6
No symbol table info available.
#4  0x00007f72d9307700 in sna_pixmap_unmap (pixmap=0x1b365d0, priv=0x1b37e90) at sna.h:514
        __PRETTY_FUNCTION__ = "sna_pixmap_unmap"
#5  0x00007f72d9308a2e in sna_pixmap_change_tiling (pixmap=0x1b365d0, tiling=1) at sna_accel.c:676
        priv = 0x1b37e90
        sna = 0x7f72df6cd000
        bo = 0x1b9aed0
        box = {x1 = 0, y1 = 0, x2 = 48, y2 = 48}
        __PRETTY_FUNCTION__ = "sna_pixmap_change_tiling"
#6  0x00007f72d9407a96 in sna_pixmap_set_dri (sna=0x7f72df6cd000, pixmap=0x1b365d0) at sna_dri.c:164
        priv = 0x1b37e90
        tiling = 1
        __PRETTY_FUNCTION__ = "sna_pixmap_set_dri"
#7  0x00007f72d9407f27 in sna_dri_create_buffer (draw=0x1b365d0, attachment=0, format=32) at sna_dri.c:259
        sna = 0x7f72df6cd000
        buffer = 0x0
        private = 0x1c063e0
        pixmap = 0x1b365d0
        bo = 0x1b9a2b0
        flags = 0
        size = 3145776
---Type <return> to continue, or q <return> to quit---
        bpp = 32
        __PRETTY_FUNCTION__ = "sna_dri_create_buffer"
#8  0x00000000005554f3 in ?? ()
No symbol table info available.
#9  0x00000000005563db in ?? ()
No symbol table info available.
#10 0x0000000000556630 in DRI2GetBuffersWithFormat ()
No symbol table info available.
#11 0x0000000000557fe0 in ?? ()
No symbol table info available.
#12 0x0000000000435d1e in ?? ()
No symbol table info available.
#13 0x0000000000439b2a in ?? ()
No symbol table info available.
#14 0x00007f72dd71db05 in __libc_start_main () from /usr/lib/libc.so.6
No symbol table info available.
#15 0x00000000004250ce in _start ()
No symbol table info available.


Here's something weird... Is this supposed to occur?
(gdb) gcore
warning: target file /proc/12030/cmdline contained unexpected null characters
[...]

% hexdump -C /proc/12030/cmdline
00000000  58 6f 72 67 2e 76 61 6c  67 72 69 6e 64 00 3a 32  |Xorg.valgrind.:2|
00000010  00 2d 62 61 63 6b 67 72  6f 75 6e 64 00 6e 6f 6e  |.-background.non|
00000020  65 00 2d 76 65 72 62 6f  73 65 00 2d 61 75 74 68  |e.-verbose.-auth|
00000030  00 2f 76 61 72 2f 72 75  6e 2f 67 64 6d 2f 61 75  |./var/run/gdm/au|
00000040  74 68 2d 66 6f 72 2d 67  64 6d 2d 52 73 74 66 58  |th-for-gdm-RstfX|
00000050  5a 2f 64 61 74 61 62 61  73 65 00 2d 73 65 61 74  |Z/database.-seat|
00000060  00 73 65 61 74 30 00 2d  6e 6f 6c 69 73 74 65 6e  |.seat0.-nolisten|
00000070  00 74 63 70 00 76 74 31  00                       |.tcp.vt1.|
Comment 25 Marti Raudsepp 2014-01-17 00:38:13 UTC
I can now reproduce the assertion failure on my laptop with >75% chance by: logging in, starting up Firefox, holding down Ctrl+Tab to cycle through my open tabs, and then choosing "Quit" from the menu. Xorg crashes quickly after the Firefox window disappears.

Also, it seems that using the address bar history to complete URLs has a high chance of causing this.
Comment 26 Mike Auty 2014-01-17 01:34:18 UTC
Created attachment 92251 [details]
GDB "bt full" results for the assertion failure with git 6db99169 and assertions

Bah, I got another gdb bt full (I believe from git commit 6db99169), but typed "u" rather than "up", and the whole thing restarted before I could get any of the "p *" commands run.  I think this was the same assertion that Marti got, and I can't recreate it as effectively as he can, but it only seems to happen after using the machine with visually intensive programs (so lots of different web pages, viewing videos, etc).

The only unusual output I saw in the Xorg.0.log was a "setversion 1.4: permission denied" message which I haven't seen before?

[ 15062.962] (II) AIGLX: Suspending AIGLX clients for VT switch
[ 15127.364] (II) AIGLX: Resuming AIGLX clients after VT switch
[ 15127.364] (II) intel(0): switch to mode 1920x1200@60.0 on HDMI1 using pipe 0, position (0, 0), rotation normal, reflection none
[ 15127.415] setversion 1.4 failed: Permission denied
[ 15127.628] (II) intel(0): switch to mode 1920x1200@60.0 on HDMI1 using pipe 0, position (0, 0), rotation normal, reflection none
[ 15140.860] (II) AIGLX: Suspending AIGLX clients for VT switch

I'll keep trying to hit it again, just let me know what data you'd like if I do (at the moment I'm still trying for that list of "p *" commands).
Comment 27 Marti Raudsepp 2014-01-17 01:51:40 UTC
Created attachment 92253 [details]
Xorg.log with --enable-debug=full

Perhaps this is useful, here's the last 1000 lines of Xorg.log output from a run built with --enable-debug=full

Full log here: https://www.voicecom.ee/~marti/.wuPIxhbvLq/Xorg.fulldebug.gz (3.7MB compressed)
Comment 28 Mike Auty 2014-01-17 02:57:54 UTC
Created attachment 92256 [details]
GDB "bt full" results for the assertion failure with git 6db99169 and assertions - take 2

And then like buses...!  Here's another one, this time I did manage to get the "p *" commands, and I checked out the memory around the devPrivate.ptr (it was all zeroed).
Comment 29 Chris Wilson 2014-01-17 09:26:34 UTC
commit 5f3ee21a307a4ff4db189bd53e58a70ec01ee6bc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 17 08:40:34 2014 +0000

    sna: Nullify pixmap->devPrivate.ptr after promoting CPU bo to GPU
    
    When we convert a CPU bo into a GPU bo, we need to remove any dangling
    shadow pointers we use for devPrivate.ptr. Whilst the bo remains alive
    these are incoherent, but if we ever replace the GPU bo (for example to
    change tiling for DRI2) then the dangling pointer becomes invalid and
    will explode on next use.
    
    Reported-by: Mike Aury <mike.auty@gmail.com>
    Reported-by:  Marti Raudsepp <marti@juffo.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73351
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 30 Chris Wilson 2014-01-30 08:32:29 UTC
*** Bug 74208 has been marked as a duplicate of this bug. ***

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.