Bug 56591

Summary: [SNA] Xorg crashes with SNA, Kicad
Product: xorg Reporter: Matti Ruohonen <kiesus>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium    
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Xorg log of crash
none
Xorg log of crash w debug symbols none

Description Matti Ruohonen 2012-10-30 18:43:40 UTC
Created attachment 69325 [details]
Xorg log of crash

My X crashes every time I try to open the module editor from the Kicad suite's pcbnew layout editor.
I have narrowed the problem down to SNA. I recompiled the xf86-video-intel driver with both SNA and UXA support (I'm on Funtoo, so USE flags sna and uxa enabled).
When I change the AccelMethod to "SNA", my X crashes immediately when I try to launch the module editor.
When I change the AccelMethod to "UXA", the module editor launches correctly. So I'm forced to use UXA for now, if I want to use Kicad.

I have Intel Core-i5 3570K (Ivy Bridge) CPU and I'm using the integrated HD graphics. (Asus P8Z77-V LE motherboard)
kernel: 3.6.1-gentoo
xorg-server: 1.12.4 (USE flags: ipv6 nptl udev xorg -dmx -doc -kdrive -minimal -selinux -static-libs -tslib -xnest -xvfb)
libdrm: from git ebuild this morning (libkms video_cards_intel, everything else disabled)
xf86-video-intel: from git ebuild today (2012-10-30) (USE: dri sna udev uxa xvmc -glamor)
(mesa: from git ebuild)
DE: Gnome 3.4.1

I'm using these commands to compile Kicad version 2011-07-08 from sources:

unzip kicad_sources-2011-07-08-BZR3044.zip
cd kicad_sources-2011-07-08_BZR3044
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local -DKICAD_KEEPCASE=ON -DKICAD_MINIZIP=OFF -DCMAKE_CXX_FLAGS=-m64 -DKICAD_STABLE_VERSION=ON ../
make
sudo make install

The same crash also happens with Kicad version 20120119_p3256 from portage tree.
(Kicad downloads can be found here: http://iut-tice.ujf-grenoble.fr/cao/ )

Xorg.0.log.old with some backtrace info:
http://pastebin.com/K1da4bn4

emerge --info output:
http://pastebin.com/T6J3GTTN
Comment 1 Chris Wilson 2012-10-30 19:14:06 UTC
Can you please grab a symbolic backtrace or run 'addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x2dfdd'? Also is this with current git?
Comment 2 Matti Ruohonen 2012-10-30 22:44:18 UTC
Sorry, I'm a real noob when it comes to debugging and traces and such. How should I get the backtrace?
Also, 'addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x2dfdd' gives just:
??:0

So I'm assuming I have to compile some stuff with debug symbols enabled? Is recompiling xorg-server, xf86-video-intel and maybe libdrm with debug symbols enough? How do I grab the useful backtrace after that?

What do you mean by 'current git'? I have synced my x11 overlay with layman to get the latest git ebuilds, and the ebuilds fetch stuff directly from the upstream git master as far as I know.
Comment 3 Matti Ruohonen 2012-10-31 08:09:28 UTC
Created attachment 69343 [details]
Xorg log of crash w debug symbols
Comment 4 Matti Ruohonen 2012-10-31 08:16:57 UTC
Okay, so I managed to compile the driver with debug symbols. Now the addr2line gives something that looks more useful. I have attached the new Xorg.0.log of the crash with the debug symbols in the xf86-video-intel driver.

Here are the adrr2line outputs for the driver addresses in the log:
[1509387.236] 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x4c36d) [0x7f3da545536d]
[1509387.236] 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x4f736) [0x7f3da5458736]
[1509387.236] 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0xc248f) [0x7f3da54cb48f]
[1509387.236] 6: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0xc70bd) [0x7f3da54d00bd]
[1509387.236] 7: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x6f29c) [0x7f3da547829c]
[1509387.236] 8: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x7204c) [0x7f3da547b04c]

ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x4c36d
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna.h:488
ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x4f736
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna_accel.c:2764
ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0xc248f
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/gen7_render.c:2452
ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0xc70bd
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/gen7_render.c:2806
ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x6f29c
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna_glyphs.c:552
ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x7204c
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna_glyphs.c:1664

The driver is from:
GIT update -->
   repository:               git://anongit.freedesktop.org/xorg/driver/xf86-video-intel
   at the commit:            31eb704b2ad7c861ec4e61fb9de0e9592fc6d269
   branch:                   master
Comment 5 Chris Wilson 2012-10-31 08:57:48 UTC
Hmm from that backtrace I'm not sure what happened. My first guess is that there is a large integer offset causing 16-bit overflow, so I applied:

commit bf81d552c4be039fbcf3272387828b1a8b3fbdb8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Oct 31 08:50:44 2012 +0000

    sna: Clamp the drawable box to prevent int16 overflow
    
    And assert that the box is valid when migrating.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=56591
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I'm installing kicad and will try to reproduce. If you could test with the above patch and if it still fails, compile with USE=full-debug (or something like that) and the attach the log file, that would be great. (If you compress it with xz, it should just about fit into the file size limits -- so long as you quickly trigger the crash!)
Comment 6 Chris Wilson 2012-10-31 09:01:03 UTC
Kicad -> pcbnew -> "Open module editor" toolbar works here. So there would appear to be another environmental factor. Please do try to grab the debug log, thanks.
Comment 7 Matti Ruohonen 2012-10-31 09:33:21 UTC
It did still crash. I recompiled the intel driver so that it had --enable-debug during configure, is that what you meant, or is there something else too? Where should the debug log be? At least Xorg.0.log doesn't seem to have any more information than before.
Comment 8 Chris Wilson 2012-10-31 09:46:43 UTC
Did it crash in the same place with '--enable-debug'? For the extra logging (which will be quite slow and quite voluminous) use --enable-debug=full.
Comment 9 Matti Ruohonen 2012-10-31 10:51:42 UTC
It did crash pretty much in the same place. Not sure about the exact line numbers anymore, but the ones I have checked have been within like 5 lines in every driver version (different compile/debug options) I have tried. I've gotten pretty confused already with this stuff, going back and forth with the driver compilation and options.

Anyway, after some fiddling I managed to get the full debug log, but I had to compile and install it manually instead of through portage, because there is a typo in sna/sna_render.c:1337 '__FUNCION__'. I hope my doing that manually did not mess something up with this debugging.

The full debug log is at: http://jumi.lut.fi/~maruohon/Xorg.0.log.tar.xz
Comment 10 Chris Wilson 2012-10-31 11:21:12 UTC
Thanks for the debug log, that's nasty. Can you please repeat the addr2line on the last crash?
Comment 11 Matti Ruohonen 2012-10-31 18:09:39 UTC
I have generated some more debug logs and stuff for you.
They can be found at http://jumi.lut.fi/~maruohon/lisatila/sna/

There is a readme.txt that explains some of it. But basically I installed the latest version from git again and there is a fresh debug=full Xorg.0.log and the addr2line output associated with it, as well as the intel_drv.so file itself. I hope it helps...
Comment 12 Chris Wilson 2012-11-01 09:38:57 UTC
I keep looking at it and the only implication I can see is memory corruption in mempcy_to_tiled_x(). Could I kindly ask you to try valgrind (with either --enable-debug or --enable-debug=full)?

I've also put together a few patches to hopefully improve the kicad experience, can you test with -intel.git after trying to reproduce the issue with valgrind? Thanks.
Comment 13 Matti Ruohonen 2012-11-01 19:00:13 UTC
I added some more logs to the previous url. I don't know how I should have ran valgrind, included is the dot_fluxbox_startup.txt that has the command I used to start fluxbox under valgrind.

Also, while testing, I ran into a case of it not crashing in some cases. Also, a couple of times pcbnew experienced some major freezes while starting it up and the visibles menu on the right got corrupted until it refreshed (I think the board got drawn over it). I noticed there were some GPU hang notices in dmesg after the testing I did, I'm not sure at what point the GPU hang and if the not crashing occured after that (it said something like "disabling acceleration"). There is the Xorg.0.log_no_crash_but_something log from one of the not crashing cases. Sadly I didn't save the dmesg output and I did reboot after those tests. (there were about 5-8 lines of GPU hang notices and a bug reporting guide about getting i915_something_something from debugfs, which I don't think I have enabled in the kernel) After the reboot it kept crashing every time, while I tried to reproduce the condition for it not to crash.

So, can this just be a case of something in my software stack being messed up, me doing something wrong, my CPU/GPU/RAM/MB being faulty or is it really a bug somewhere? I don't want to waste any more of your time if it is an isolated problem on my OS installation or something like that. Unfortunately I don't have the time to do a fresh installation right now to try that out. Maybe in a week or two...

-matti
Comment 14 Chris Wilson 2012-11-01 21:01:40 UTC
Don't worry it is definitely a bug in the code, just looks like it requires a particular set of steps (including a particular configuration of DE) to trigger. The goal is to reproduce it locally so I can stop annoying you with debugging requests. :)
Comment 15 Matti Ruohonen 2012-11-01 21:34:14 UTC
Well in that case an extreme solution would be to just dump my whole root partition into a tar ball with dd and upload it to you ;) Or maybe tarring just the fs would save some unnecessary randomness from a full partition dump, which is around 13GB in .tar.gz atm if I remember correctly from the last backup.

Which pieces of software can affect the situation?
I uploaded a list of all the packages (I think...) with versions, that I have currently installed:
http://jumi.lut.fi/~maruohon/lisatila/sna/eix_cI_2012-11-01.txt
Comment 16 Matti Ruohonen 2012-11-02 07:42:42 UTC
Okay, so I did some more digging and had the overly brilliant idea of running git bisect ;) There is new stuff at http://jumi.lut.fi/~maruohon/sna/bisect/

Summary:
d4f7c58186849374cd929e20fa49ea2e93939a69 is the bad commit

There is also a new debug=full Xorg.0.log once again and the addr2line output that goes with it, but they are basically the same they have always been.

I hope this helps to narrow down the problem.
Comment 17 Chris Wilson 2012-11-02 12:58:00 UTC
Hmm, that bisect points away from memcpy_to_tiled_x as being the source of the trouble, and the backtrace is still indicative of it being the same problem. Double hmm.
Comment 18 Chris Wilson 2012-11-03 09:43:15 UTC
I'm forced to resort to suggesting valgrind again. :|

To run X under valgrind, we need to do a few tricks. First make sure you compile -intel with at least --enable-debug (or --enable-debug=full). Then, preferably from a ssh,

$ cp /usr/bin/Xorg /tmp/Xorg # drops setuid bit
$ sudo valgrind --trace-children=yes --logfile=/tmp/xorg.txt /tmp/Xorg -ac -noreset

And from a second ssh, start your session (e.g. DISPLAY=:0 xfce4-session). Reproduce, attach.

Thanks.
Comment 19 Matti Ruohonen 2012-11-05 21:43:36 UTC
New logs at http://jumi.lut.fi/~maruohon/lisatila/sna/2012-11-05/

I did three runs, all with the latest e62b0c from git, with debug=full:
- over ssh like you described, with valgrind: no crash
- locally from VT, without valgrind: crashed
- locally with your commands, with valgrind: no crash

So it seems that my previous valgrind run with the improvised startup didn't work since that one crashed if I remember correctly.
Comment 20 Chris Wilson 2012-11-05 21:47:34 UTC
Thanks. Do you mind checking the permissions on http://jumi.lut.fi/~maruohon/lisatila/sna/2012-11-05/valgrind_xorg_no_crash_ssh.txt ?
Comment 21 Matti Ruohonen 2012-11-05 22:01:25 UTC
Fixed the permissions, sorry about that.

I just tried one more thing that came to mind: I ran the "manual start up" command (from VT: sudo /tmp/Xorg -ac -noreset and then DISPLAY=:0 startfluxbox from another VT) without the valgring part and it crashed. So it seems that valgrind is the one thing that prevents it from crashing at this point.
Comment 22 Chris Wilson 2012-11-05 22:29:32 UTC
Yes, it looks like valgrind is detecting the error:

==1315== 
==1315== Invalid read of size 2
==1315==    at 0x928B091: box_inplace (sna.h:506)
==1315==    by 0x9292278: sna_pixmap_move_area_to_gpu (sna_accel.c:2554)
==1315==    by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774)
==1315==    by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448)
==1315==    by 0x9357AA2: gen7_render_composite (gen7_render.c:2800)
==1315==    by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552)
==1315==    by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664)
==1315==    by 0x4F920E: damageGlyphs (in /tmp/Xorg)
==1315==    by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg)
==1315==    by 0x437260: Dispatch (in /tmp/Xorg)
==1315==    by 0x426466: main (in /tmp/Xorg)
==1315==  Address 0xd637054 is 20 bytes inside a block of size 208,464 free'd
==1315==    at 0x4C2A2FC: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==1315==    by 0x92CCFCD: __sna_damage_destroy (sna_damage.c:1469)
==1315==    by 0x928AD74: sna_damage_destroy (sna_damage.h:284)
==1315==    by 0x9291CB2: sna_pixmap_move_area_to_gpu (sna_accel.c:2470)
==1315==    by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774)
==1315==    by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448)
==1315==    by 0x9357AA2: gen7_render_composite (gen7_render.c:2800)
==1315==    by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552)
==1315==    by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664)
==1315==    by 0x4F920E: damageGlyphs (in /tmp/Xorg)
==1315==    by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg)
==1315==    by 0x437260: Dispatch (in /tmp/Xorg)
Comment 23 Chris Wilson 2012-11-05 22:36:40 UTC
Oh boy, that was a silly one. Many thanks for sticking with it and getting valgrind up and running.

commit 28bda6707d979bca29dbea04e932819de204d920
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 5 22:30:29 2012 +0000

    sna: Prevent use-after-free during partial move-to-gpu
    
    As we reuse the input parameter 'box' to hold the array of boxes that
    need to be migrated, we need to be careful that we do not later confuse
    it with the original input parameter. Otherwise,
    
    ==1315== Invalid read of size 2
    ==1315==    at 0x928B091: box_inplace (sna.h:506)
    ==1315==    by 0x9292278: sna_pixmap_move_area_to_gpu (sna_accel.c:2554)
    ==1315==    by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774)
    ==1315==    by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448)
    ==1315==    by 0x9357AA2: gen7_render_composite (gen7_render.c:2800)
    ==1315==    by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552)
    ==1315==    by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664)
    ==1315==    by 0x4F920E: damageGlyphs (in /tmp/Xorg)
    ==1315==    by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg)
    ==1315==    by 0x437260: Dispatch (in /tmp/Xorg)
    ==1315==    by 0x426466: main (in /tmp/Xorg)
    ==1315==  Address 0xd637054 is 20 bytes inside a block of size 208,464 free'd
    ==1315==    at 0x4C2A2FC: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==1315==    by 0x92CCFCD: __sna_damage_destroy (sna_damage.c:1469)
    ==1315==    by 0x928AD74: sna_damage_destroy (sna_damage.h:284)
    ==1315==    by 0x9291CB2: sna_pixmap_move_area_to_gpu (sna_accel.c:2470)
    ==1315==    by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774)
    ==1315==    by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448)
    ==1315==    by 0x9357AA2: gen7_render_composite (gen7_render.c:2800)
    ==1315==    by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552)
    ==1315==    by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664)
    ==1315==    by 0x4F920E: damageGlyphs (in /tmp/Xorg)
    ==1315==    by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg)
    ==1315==    by 0x437260: Dispatch (in /tmp/Xorg)
    
    Reported-by: Matti Ruohonen <kiesus@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56591
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 24 Matti Ruohonen 2012-11-05 22:56:55 UTC
Oh, you actually found the issue, great :)
I'm glad that I could help.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.