Summary: | [SNA] Xorg crashes with SNA, Kicad | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | Matti Ruohonen <kiesus> | ||||||
Component: | Driver/intel | Assignee: | Chris Wilson <chris> | ||||||
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> | ||||||
Severity: | normal | ||||||||
Priority: | medium | ||||||||
Version: | git | ||||||||
Hardware: | x86-64 (AMD64) | ||||||||
OS: | Linux (All) | ||||||||
Whiteboard: | |||||||||
i915 platform: | i915 features: | ||||||||
Attachments: |
|
Description
Matti Ruohonen
2012-10-30 18:43:40 UTC
Can you please grab a symbolic backtrace or run 'addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x2dfdd'? Also is this with current git? Sorry, I'm a real noob when it comes to debugging and traces and such. How should I get the backtrace? Also, 'addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x2dfdd' gives just: ??:0 So I'm assuming I have to compile some stuff with debug symbols enabled? Is recompiling xorg-server, xf86-video-intel and maybe libdrm with debug symbols enough? How do I grab the useful backtrace after that? What do you mean by 'current git'? I have synced my x11 overlay with layman to get the latest git ebuilds, and the ebuilds fetch stuff directly from the upstream git master as far as I know. Created attachment 69343 [details]
Xorg log of crash w debug symbols
Okay, so I managed to compile the driver with debug symbols. Now the addr2line gives something that looks more useful. I have attached the new Xorg.0.log of the crash with the debug symbols in the xf86-video-intel driver. Here are the adrr2line outputs for the driver addresses in the log: [1509387.236] 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x4c36d) [0x7f3da545536d] [1509387.236] 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x4f736) [0x7f3da5458736] [1509387.236] 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0xc248f) [0x7f3da54cb48f] [1509387.236] 6: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0xc70bd) [0x7f3da54d00bd] [1509387.236] 7: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x6f29c) [0x7f3da547829c] [1509387.236] 8: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f3da5409000+0x7204c) [0x7f3da547b04c] ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x4c36d /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna.h:488 ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x4f736 /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna_accel.c:2764 ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0xc248f /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/gen7_render.c:2452 ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0xc70bd /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/gen7_render.c:2806 ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x6f29c /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna_glyphs.c:552 ivy masa # addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so 0x7204c /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna_glyphs.c:1664 The driver is from: GIT update --> repository: git://anongit.freedesktop.org/xorg/driver/xf86-video-intel at the commit: 31eb704b2ad7c861ec4e61fb9de0e9592fc6d269 branch: master Hmm from that backtrace I'm not sure what happened. My first guess is that there is a large integer offset causing 16-bit overflow, so I applied: commit bf81d552c4be039fbcf3272387828b1a8b3fbdb8 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Oct 31 08:50:44 2012 +0000 sna: Clamp the drawable box to prevent int16 overflow And assert that the box is valid when migrating. References: https://bugs.freedesktop.org/show_bug.cgi?id=56591 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> I'm installing kicad and will try to reproduce. If you could test with the above patch and if it still fails, compile with USE=full-debug (or something like that) and the attach the log file, that would be great. (If you compress it with xz, it should just about fit into the file size limits -- so long as you quickly trigger the crash!) Kicad -> pcbnew -> "Open module editor" toolbar works here. So there would appear to be another environmental factor. Please do try to grab the debug log, thanks. It did still crash. I recompiled the intel driver so that it had --enable-debug during configure, is that what you meant, or is there something else too? Where should the debug log be? At least Xorg.0.log doesn't seem to have any more information than before. Did it crash in the same place with '--enable-debug'? For the extra logging (which will be quite slow and quite voluminous) use --enable-debug=full. It did crash pretty much in the same place. Not sure about the exact line numbers anymore, but the ones I have checked have been within like 5 lines in every driver version (different compile/debug options) I have tried. I've gotten pretty confused already with this stuff, going back and forth with the driver compilation and options. Anyway, after some fiddling I managed to get the full debug log, but I had to compile and install it manually instead of through portage, because there is a typo in sna/sna_render.c:1337 '__FUNCION__'. I hope my doing that manually did not mess something up with this debugging. The full debug log is at: http://jumi.lut.fi/~maruohon/Xorg.0.log.tar.xz Thanks for the debug log, that's nasty. Can you please repeat the addr2line on the last crash? I have generated some more debug logs and stuff for you. They can be found at http://jumi.lut.fi/~maruohon/lisatila/sna/ There is a readme.txt that explains some of it. But basically I installed the latest version from git again and there is a fresh debug=full Xorg.0.log and the addr2line output associated with it, as well as the intel_drv.so file itself. I hope it helps... I keep looking at it and the only implication I can see is memory corruption in mempcy_to_tiled_x(). Could I kindly ask you to try valgrind (with either --enable-debug or --enable-debug=full)? I've also put together a few patches to hopefully improve the kicad experience, can you test with -intel.git after trying to reproduce the issue with valgrind? Thanks. I added some more logs to the previous url. I don't know how I should have ran valgrind, included is the dot_fluxbox_startup.txt that has the command I used to start fluxbox under valgrind. Also, while testing, I ran into a case of it not crashing in some cases. Also, a couple of times pcbnew experienced some major freezes while starting it up and the visibles menu on the right got corrupted until it refreshed (I think the board got drawn over it). I noticed there were some GPU hang notices in dmesg after the testing I did, I'm not sure at what point the GPU hang and if the not crashing occured after that (it said something like "disabling acceleration"). There is the Xorg.0.log_no_crash_but_something log from one of the not crashing cases. Sadly I didn't save the dmesg output and I did reboot after those tests. (there were about 5-8 lines of GPU hang notices and a bug reporting guide about getting i915_something_something from debugfs, which I don't think I have enabled in the kernel) After the reboot it kept crashing every time, while I tried to reproduce the condition for it not to crash. So, can this just be a case of something in my software stack being messed up, me doing something wrong, my CPU/GPU/RAM/MB being faulty or is it really a bug somewhere? I don't want to waste any more of your time if it is an isolated problem on my OS installation or something like that. Unfortunately I don't have the time to do a fresh installation right now to try that out. Maybe in a week or two... -matti Don't worry it is definitely a bug in the code, just looks like it requires a particular set of steps (including a particular configuration of DE) to trigger. The goal is to reproduce it locally so I can stop annoying you with debugging requests. :) Well in that case an extreme solution would be to just dump my whole root partition into a tar ball with dd and upload it to you ;) Or maybe tarring just the fs would save some unnecessary randomness from a full partition dump, which is around 13GB in .tar.gz atm if I remember correctly from the last backup. Which pieces of software can affect the situation? I uploaded a list of all the packages (I think...) with versions, that I have currently installed: http://jumi.lut.fi/~maruohon/lisatila/sna/eix_cI_2012-11-01.txt Okay, so I did some more digging and had the overly brilliant idea of running git bisect ;) There is new stuff at http://jumi.lut.fi/~maruohon/sna/bisect/ Summary: d4f7c58186849374cd929e20fa49ea2e93939a69 is the bad commit There is also a new debug=full Xorg.0.log once again and the addr2line output that goes with it, but they are basically the same they have always been. I hope this helps to narrow down the problem. Hmm, that bisect points away from memcpy_to_tiled_x as being the source of the trouble, and the backtrace is still indicative of it being the same problem. Double hmm. I'm forced to resort to suggesting valgrind again. :| To run X under valgrind, we need to do a few tricks. First make sure you compile -intel with at least --enable-debug (or --enable-debug=full). Then, preferably from a ssh, $ cp /usr/bin/Xorg /tmp/Xorg # drops setuid bit $ sudo valgrind --trace-children=yes --logfile=/tmp/xorg.txt /tmp/Xorg -ac -noreset And from a second ssh, start your session (e.g. DISPLAY=:0 xfce4-session). Reproduce, attach. Thanks. New logs at http://jumi.lut.fi/~maruohon/lisatila/sna/2012-11-05/ I did three runs, all with the latest e62b0c from git, with debug=full: - over ssh like you described, with valgrind: no crash - locally from VT, without valgrind: crashed - locally with your commands, with valgrind: no crash So it seems that my previous valgrind run with the improvised startup didn't work since that one crashed if I remember correctly. Thanks. Do you mind checking the permissions on http://jumi.lut.fi/~maruohon/lisatila/sna/2012-11-05/valgrind_xorg_no_crash_ssh.txt ? Fixed the permissions, sorry about that. I just tried one more thing that came to mind: I ran the "manual start up" command (from VT: sudo /tmp/Xorg -ac -noreset and then DISPLAY=:0 startfluxbox from another VT) without the valgring part and it crashed. So it seems that valgrind is the one thing that prevents it from crashing at this point. Yes, it looks like valgrind is detecting the error: ==1315== ==1315== Invalid read of size 2 ==1315== at 0x928B091: box_inplace (sna.h:506) ==1315== by 0x9292278: sna_pixmap_move_area_to_gpu (sna_accel.c:2554) ==1315== by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774) ==1315== by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448) ==1315== by 0x9357AA2: gen7_render_composite (gen7_render.c:2800) ==1315== by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552) ==1315== by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664) ==1315== by 0x4F920E: damageGlyphs (in /tmp/Xorg) ==1315== by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg) ==1315== by 0x437260: Dispatch (in /tmp/Xorg) ==1315== by 0x426466: main (in /tmp/Xorg) ==1315== Address 0xd637054 is 20 bytes inside a block of size 208,464 free'd ==1315== at 0x4C2A2FC: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==1315== by 0x92CCFCD: __sna_damage_destroy (sna_damage.c:1469) ==1315== by 0x928AD74: sna_damage_destroy (sna_damage.h:284) ==1315== by 0x9291CB2: sna_pixmap_move_area_to_gpu (sna_accel.c:2470) ==1315== by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774) ==1315== by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448) ==1315== by 0x9357AA2: gen7_render_composite (gen7_render.c:2800) ==1315== by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552) ==1315== by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664) ==1315== by 0x4F920E: damageGlyphs (in /tmp/Xorg) ==1315== by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg) ==1315== by 0x437260: Dispatch (in /tmp/Xorg) Oh boy, that was a silly one. Many thanks for sticking with it and getting valgrind up and running. commit 28bda6707d979bca29dbea04e932819de204d920 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Nov 5 22:30:29 2012 +0000 sna: Prevent use-after-free during partial move-to-gpu As we reuse the input parameter 'box' to hold the array of boxes that need to be migrated, we need to be careful that we do not later confuse it with the original input parameter. Otherwise, ==1315== Invalid read of size 2 ==1315== at 0x928B091: box_inplace (sna.h:506) ==1315== by 0x9292278: sna_pixmap_move_area_to_gpu (sna_accel.c:2554) ==1315== by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774) ==1315== by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448) ==1315== by 0x9357AA2: gen7_render_composite (gen7_render.c:2800) ==1315== by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552) ==1315== by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664) ==1315== by 0x4F920E: damageGlyphs (in /tmp/Xorg) ==1315== by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg) ==1315== by 0x437260: Dispatch (in /tmp/Xorg) ==1315== by 0x426466: main (in /tmp/Xorg) ==1315== Address 0xd637054 is 20 bytes inside a block of size 208,464 free'd ==1315== at 0x4C2A2FC: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==1315== by 0x92CCFCD: __sna_damage_destroy (sna_damage.c:1469) ==1315== by 0x928AD74: sna_damage_destroy (sna_damage.h:284) ==1315== by 0x9291CB2: sna_pixmap_move_area_to_gpu (sna_accel.c:2470) ==1315== by 0x9292C14: sna_drawable_use_bo (sna_accel.c:2774) ==1315== by 0x9356C01: gen7_composite_set_target (gen7_render.c:2448) ==1315== by 0x9357AA2: gen7_render_composite (gen7_render.c:2800) ==1315== by 0x92DB12E: glyphs_to_dst (sna_glyphs.c:552) ==1315== by 0x92DEA8D: sna_glyphs (sna_glyphs.c:1664) ==1315== by 0x4F920E: damageGlyphs (in /tmp/Xorg) ==1315== by 0x4F2FF6: ProcRenderCompositeGlyphs (in /tmp/Xorg) ==1315== by 0x437260: Dispatch (in /tmp/Xorg) Reported-by: Matti Ruohonen <kiesus@gmail.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56591 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Oh, you actually found the issue, great :) I'm glad that I could help. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.