Summary: | xf86-video-intel-2.99.907 segfaults while the machine is unattended | ||
---|---|---|---|
Product: | xorg | Reporter: | Mike Auty <mike.auty> |
Component: | Driver/intel | Assignee: | Chris Wilson <chris> |
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | highest | CC: | contrib, marti, novaslicences |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Created attachment 91592 [details]
GDB "bt full" results for the segfault
Do you see the same issue with xorg-1.14? Also can you please run with ./configure --enable-debug? Created attachment 91634 [details]
Xorg.0.log for 1.14.5 crashed session
Yes, this still happens with 1.14.5, here's the new Xorg.0.log with --enable-debug used to compile the 2.99.907 drivers. Very similar gdb "bt full" to follow...
Created attachment 91635 [details]
GDB "bt full" results for the 1.14.5 segfault
If it manages to get that far with assertions disabled, it is an extremely subtle bug. :| Can I ask if you can p *priv, p *priv->cpu_bo, p *priv->gpu_bo in frame #2 (_sna_pixmap_move_to_cpu) ? *assertions enabled. I'm a bit of a beginner when it comes to gdb, could you let me know how I should do that please? From within gdb, after the break due to the segfault, either type "frame 2" or "up" twice. Then type "p *priv", "p *priv->cpu_bo", "p *priv->gpu_bo". And hope gcc hasn't optimised away priv. If it complains that priv is not available, try "p *sna_pixmap(pixmap)", "p *sna_pixmap(pixmap)->cpu_bo", "p *sna_pixmap(pixmap)->gpu_bo". (And cross your fingers.) A last resort would be --enable-debug=full and hoping the error still occurs. The Xorg.0.log will be massive (may even be greater than 2GiB) but only the last 1000 lines will be enough to check to see if the bug exposes itself. *** Bug 73439 has been marked as a duplicate of this bug. *** If you do manage to catch this in gdb, and have irc available, please ping me (ickle on #intel-gfx irc.freenode.net) as there are quite a few things that I want to inspect - and I'm not quite sure where to start digging. Sure, I'm afraid I haven't been able to recreate it the last two nights, and from tomorrow I'm going to be away until Wednesday, but I'm hoping to be able to recreate it soon. I'll run through the lists of all the variables you've left me, and I'll try to catch you on IRC if I can... Finally managed to catch this in a functioning gdb. Note that for me, this bug very rarely occurs "while the machine is unattended", but usually crashes in the middle of normal usage -- like now. This time it's my laptop with Intel i5-560M (I believe HD Graphics 3000) I only ever had 2 Firefox windows, a few terminals and Skype open in this session, as I had just started up Xorg recently. I saved a core file and backed up the binaries just in case you need more data. "bt full" and all this is also available in the attached file. (gdb) up #1 0x00007f833010953b in _sna_pixmap_move_to_cpu (pixmap=0x1a03de0, flags=3) at sna_accel.c:2118 2118 memset(pixmap->devPrivate.ptr, priv->clear_color, (gdb) up #2 0x00007f833010a80b in sna_drawable_move_region_to_cpu (drawable=0x19f39b0, region=0x7fffa41c2fa0, flags=3) at sna_accel.c:2487 2487 return _sna_pixmap_move_to_cpu(pixmap, flags | MOVE_READ); (gdb) p *priv $1 = {pixmap = 0x1a03de0, gpu_bo = 0x1e1bb70, cpu_bo = 0x0, gpu_damage = 0x1f43e21, cpu_damage = 0x0, cow = 0x0, ptr = 0x0, move_to_gpu = 0x0, move_to_gpu_data = 0x0, flush_list = {next = 0x1a03ea8, prev = 0x1a03ea8}, cow_list = {next = 0x0, prev = 0x0}, stride = 192, clear_color = 0, source_count = 4, pinned = 2 '\002', create = 11 '\v', mapped = 0 '\000', flush = 1 '\001', shm = 0 '\000', clear = 1 '\001', header = 1 '\001', cpu = 0 '\000'} (gdb) p *priv->cpu_bo Cannot access memory at address 0x0 (gdb) p *priv->gpu_bo $2 = {rq = 0x2391820, exec = 0x0, proxy = 0x0, list = {next = 0x1e1bb88, prev = 0x1e1bb88}, request = {next = 0x2391838, prev = 0x1524a48}, vma = {next = 0x1e1bba8, prev = 0x1e1bba8}, map__cpu = 0x7f832b2dd000, map__gtt = 0x7f832bdce000, binding = {next = 0x20a1a50, format = 324, offset = 0}, presumed_offset = 215392256, unique_id = 62201, refcnt = 2, handle = 221, target_handle = 4294967295, delta = 0, size = {pages = { count = 6, bucket = 2}, bytes = 268435462}, pitch = 512, tiling = 1, reusable = 0, gpu_dirty = 0, gtt_dirty = 0, domain = 3, needs_flush = 1, snoop = 0, io = 0, flush = 1, scanout = 0, purged = 0} (gdb) p *sna_pixmap(pixmap) $1 = {pixmap = 0x1a03de0, gpu_bo = 0x1e1bb70, cpu_bo = 0x0, gpu_damage = 0x1f43e21, cpu_damage = 0x0, cow = 0x0, ptr = 0x0, move_to_gpu = 0x0, move_to_gpu_data = 0x0, flush_list = {next = 0x1a03ea8, prev = 0x1a03ea8}, cow_list = {next = 0x0, prev = 0x0}, stride = 192, clear_color = 0, source_count = 4, pinned = 2 '\002', create = 11 '\v', mapped = 0 '\000', flush = 1 '\001', shm = 0 '\000', clear = 1 '\001', header = 1 '\001', cpu = 0 '\000'} (gdb) p *sna_pixmap(pixmap)->cpu_bo Cannot access memory at address 0x0 (gdb) p *sna_pixmap(pixmap)->gpu_bo $2 = {rq = 0x2391820, exec = 0x0, proxy = 0x0, list = {next = 0x1e1bb88, prev = 0x1e1bb88}, request = {next = 0x2391838, prev = 0x1524a48}, vma = {next = 0x1e1bba8, prev = 0x1e1bba8}, map__cpu = 0x7f832b2dd000, map__gtt = 0x7f832bdce000, binding = {next = 0x20a1a50, format = 324, offset = 0}, presumed_offset = 215392256, unique_id = 62201, refcnt = 2, handle = 221, target_handle = 4294967295, delta = 0, size = {pages = {count = 6, bucket = 2}, bytes = 268435462}, pitch = 512, tiling = 1, reusable = 0, gpu_dirty = 0, gtt_dirty = 0, domain = 3, needs_flush = 1, snoop = 0, io = 0, flush = 1, scanout = 0, purged = 0} Created attachment 92102 [details]
gdb-info.txt
Interesting... Should have asked for p *pixmap as well :| Afaics, there are two possibilities here. Either some memory corruption surgically cleared out parts of *priv, or pixmap->devPrivate.ptr has been erroneously set somewhere. Both should be impossible. To be sure I've added some more asserts in commit 40beee995e71b1fa4cd146a986f53e03595a4378 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Jan 15 10:07:32 2014 +0000 sna: Add some more paranoia around dangling pointers References: https://bugs.freedesktop.org/show_bug.cgi?id=73351 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> please do keep on running with ./configure --enable-debug and keep gdb handy. *** Bug 73397 has been marked as a duplicate of this bug. *** (In reply to comment #15) > Interesting... Should have asked for p *pixmap as well :| D'oh I missed this before. But as I said, I still have the core file, so here's the pixmap: (gdb) p *pixmap $3 = {drawable = {type = 1 '\001', class = 0 '\000', depth = 32 ' ', bitsPerPixel = 32 ' ', id = 0, x = 0, y = 0, width = 48, height = 48, pScreen = 0x15219b0, serialNumber = 265}, devPrivates = 0x1a03e28, refcnt = 4, devKind = 192, devPrivate = {ptr = 0x7f832ca99000, val = 140201366753280, uval = 140201366753280, fptr = 0x7f832ca99000}, screen_x = 1380, screen_y = 912, usage_hint = 0, master_pixmap = 0x7f832d111280} Created attachment 92197 [details]
valgrind.log
Ok, it crashed again under valgrind (log attached). This time I have a lot more warnings, including many invalid writes. Using xf86-video-intel-git 2.99.907.16.g40beee9
Interestingly, it didn't print a stack trace into Xorg's log again; does valgrind somehow override Xorg's SIGSEGV handler?
Created attachment 92198 [details]
Xorg.1.log from the valgrind run
So I managed to catch it again, but before I'd managed to apply that patch, unfortunately. All the "p *priv[...]" ones said they'd been optimized out, and the first "p *sna_pixmap(pixmap)" one I tried bombed out. 5:( (gdb) p *sna_pixmap(pixmap) Program received signal SIGABRT, Aborted. [Thread 0x7f9532b4b700 (LWP 14081) exited] [Thread 0x7f9533b4d700 (LWP 14079) exited] [Thread 0x7f953334c700 (LWP 14080) exited] [Thread 0x7f953234a700 (LWP 14082) exited] I'll rebuild with the extra assertions and see if I can get it to trigger again. Sorry it's taking such a long time to recreate... (In reply to comment #18) > Created attachment 92197 [details] > valgrind.log > > Ok, it crashed again under valgrind (log attached). This time I have a lot > more warnings, including many invalid writes. Using xf86-video-intel-git > 2.99.907.16.g40beee9 Apologies, it appears you compiled it without the valgrind headers and it hasn't included the magic to suppress all the false positives. > Interestingly, it didn't print a stack trace into Xorg's log again; does > valgrind somehow override Xorg's SIGSEGV handler? No. More likely it hit an assertion in that case... The *pixmap has what I feared - pixmap->devPrivate.ptr is non-zero, but we apparently have never allocated CPU space for it (priv->ptr being NULL), nor is it mapped. So where did that dangling pointer come from? Another layer of asserts pushed to xf86-video-intel.git. I'd recommend using gdb as it should at least catch the issue identified thus far earlier than valgrind. Caught an assertion failure... Xorg.valgrind: sna.h:514: sna_pixmap_unmap: Assertion `pixmap->devPrivate.ptr == ((void*)((uintptr_t)(priv->ptr) & ~1))' failed. (I'm not running in valgrind any more, but I still have the diversion) I backed up the core file, stderr and Xorg log, let me know if you need any more info. (gdb) thread apply all bt full Thread 3 (Thread 0x7f72d8a94700 (LWP 12032)): #0 0x00007f72deab603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0 No symbol table info available. #1 0x00007f72d939e1f9 in __run__ (arg=0x1433450) at sna_threads.c:61 t = 0x1433450 signals = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}} __PRETTY_FUNCTION__ = "__run__" #2 0x00007f72deab20a2 in start_thread () from /usr/lib/libpthread.so.0 No symbol table info available. #3 0x00007f72dd7e132d in clone () from /usr/lib/libc.so.6 No symbol table info available. Thread 2 (Thread 0x7f72d8293700 (LWP 12033)): #0 0x00007f72deab603f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0 No symbol table info available. #1 0x00007f72d939e1f9 in __run__ (arg=0x14334c0) at sna_threads.c:61 t = 0x14334c0 signals = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}} __PRETTY_FUNCTION__ = "__run__" #2 0x00007f72deab20a2 in start_thread () from /usr/lib/libpthread.so.0 No symbol table info available. #3 0x00007f72dd7e132d in clone () from /usr/lib/libc.so.6 No symbol table info available. Thread 1 (Thread 0x7f72df7a4880 (LWP 12030)): #0 0x00007f72dd731369 in raise () from /usr/lib/libc.so.6 No symbol table info available. #1 0x00007f72dd732768 in abort () from /usr/lib/libc.so.6 No symbol table info available. #2 0x00007f72dd72a456 in __assert_fail_base () from /usr/lib/libc.so.6 No symbol table info available. #3 0x00007f72dd72a502 in __assert_fail () from /usr/lib/libc.so.6 No symbol table info available. #4 0x00007f72d9307700 in sna_pixmap_unmap (pixmap=0x1b365d0, priv=0x1b37e90) at sna.h:514 __PRETTY_FUNCTION__ = "sna_pixmap_unmap" #5 0x00007f72d9308a2e in sna_pixmap_change_tiling (pixmap=0x1b365d0, tiling=1) at sna_accel.c:676 priv = 0x1b37e90 sna = 0x7f72df6cd000 bo = 0x1b9aed0 box = {x1 = 0, y1 = 0, x2 = 48, y2 = 48} __PRETTY_FUNCTION__ = "sna_pixmap_change_tiling" #6 0x00007f72d9407a96 in sna_pixmap_set_dri (sna=0x7f72df6cd000, pixmap=0x1b365d0) at sna_dri.c:164 priv = 0x1b37e90 tiling = 1 __PRETTY_FUNCTION__ = "sna_pixmap_set_dri" #7 0x00007f72d9407f27 in sna_dri_create_buffer (draw=0x1b365d0, attachment=0, format=32) at sna_dri.c:259 sna = 0x7f72df6cd000 buffer = 0x0 private = 0x1c063e0 pixmap = 0x1b365d0 bo = 0x1b9a2b0 flags = 0 size = 3145776 ---Type <return> to continue, or q <return> to quit--- bpp = 32 __PRETTY_FUNCTION__ = "sna_dri_create_buffer" #8 0x00000000005554f3 in ?? () No symbol table info available. #9 0x00000000005563db in ?? () No symbol table info available. #10 0x0000000000556630 in DRI2GetBuffersWithFormat () No symbol table info available. #11 0x0000000000557fe0 in ?? () No symbol table info available. #12 0x0000000000435d1e in ?? () No symbol table info available. #13 0x0000000000439b2a in ?? () No symbol table info available. #14 0x00007f72dd71db05 in __libc_start_main () from /usr/lib/libc.so.6 No symbol table info available. #15 0x00000000004250ce in _start () No symbol table info available. Here's something weird... Is this supposed to occur? (gdb) gcore warning: target file /proc/12030/cmdline contained unexpected null characters [...] % hexdump -C /proc/12030/cmdline 00000000 58 6f 72 67 2e 76 61 6c 67 72 69 6e 64 00 3a 32 |Xorg.valgrind.:2| 00000010 00 2d 62 61 63 6b 67 72 6f 75 6e 64 00 6e 6f 6e |.-background.non| 00000020 65 00 2d 76 65 72 62 6f 73 65 00 2d 61 75 74 68 |e.-verbose.-auth| 00000030 00 2f 76 61 72 2f 72 75 6e 2f 67 64 6d 2f 61 75 |./var/run/gdm/au| 00000040 74 68 2d 66 6f 72 2d 67 64 6d 2d 52 73 74 66 58 |th-for-gdm-RstfX| 00000050 5a 2f 64 61 74 61 62 61 73 65 00 2d 73 65 61 74 |Z/database.-seat| 00000060 00 73 65 61 74 30 00 2d 6e 6f 6c 69 73 74 65 6e |.seat0.-nolisten| 00000070 00 74 63 70 00 76 74 31 00 |.tcp.vt1.| I can now reproduce the assertion failure on my laptop with >75% chance by: logging in, starting up Firefox, holding down Ctrl+Tab to cycle through my open tabs, and then choosing "Quit" from the menu. Xorg crashes quickly after the Firefox window disappears. Also, it seems that using the address bar history to complete URLs has a high chance of causing this. Created attachment 92251 [details] GDB "bt full" results for the assertion failure with git 6db99169 and assertions Bah, I got another gdb bt full (I believe from git commit 6db99169), but typed "u" rather than "up", and the whole thing restarted before I could get any of the "p *" commands run. I think this was the same assertion that Marti got, and I can't recreate it as effectively as he can, but it only seems to happen after using the machine with visually intensive programs (so lots of different web pages, viewing videos, etc). The only unusual output I saw in the Xorg.0.log was a "setversion 1.4: permission denied" message which I haven't seen before? [ 15062.962] (II) AIGLX: Suspending AIGLX clients for VT switch [ 15127.364] (II) AIGLX: Resuming AIGLX clients after VT switch [ 15127.364] (II) intel(0): switch to mode 1920x1200@60.0 on HDMI1 using pipe 0, position (0, 0), rotation normal, reflection none [ 15127.415] setversion 1.4 failed: Permission denied [ 15127.628] (II) intel(0): switch to mode 1920x1200@60.0 on HDMI1 using pipe 0, position (0, 0), rotation normal, reflection none [ 15140.860] (II) AIGLX: Suspending AIGLX clients for VT switch I'll keep trying to hit it again, just let me know what data you'd like if I do (at the moment I'm still trying for that list of "p *" commands). Created attachment 92253 [details] Xorg.log with --enable-debug=full Perhaps this is useful, here's the last 1000 lines of Xorg.log output from a run built with --enable-debug=full Full log here: https://www.voicecom.ee/~marti/.wuPIxhbvLq/Xorg.fulldebug.gz (3.7MB compressed) Created attachment 92256 [details]
GDB "bt full" results for the assertion failure with git 6db99169 and assertions - take 2
And then like buses...! Here's another one, this time I did manage to get the "p *" commands, and I checked out the memory around the devPrivate.ptr (it was all zeroed).
commit 5f3ee21a307a4ff4db189bd53e58a70ec01ee6bc Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Jan 17 08:40:34 2014 +0000 sna: Nullify pixmap->devPrivate.ptr after promoting CPU bo to GPU When we convert a CPU bo into a GPU bo, we need to remove any dangling shadow pointers we use for devPrivate.ptr. Whilst the bo remains alive these are incoherent, but if we ever replace the GPU bo (for example to change tiling for DRI2) then the dangling pointer becomes invalid and will explode on next use. Reported-by: Mike Aury <mike.auty@gmail.com> Reported-by: Marti Raudsepp <marti@juffo.org> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73351 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> *** Bug 74208 has been marked as a duplicate of this bug. *** |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 91591 [details] Xorg.0.log for the crashed session I'm running into the following segfault every morning, after I've left my machine unattended at night. The monitor is switched off (which also powers the graphics tablet), and when I return the next morning I'm met with the segfault, which I've attached gdb to and recorded a "bt full" of the crash. The timings on this log will be wrong, because gdb trapped the segfault and didn't start again until this morning, but I can get the right timings tonight, if that would be useful? This happened with 2.99.906 as well as 2.99.907, however I believe it did not fail with 2.99.905. I should also note that I had been using 2.99.906 since late November, but only started to experience problems after I upgraded to gnome-3.10.2 over the Christmas period. I'm attaching my Xorg.0.log.old (from after the crash and X restarted), and the gdb "bt full" output. Please let me know if there's any further patches or tests I can conduct, but understand I can only reliably reproduce this overnight.