Created attachment 121049 [details] Xorg.log Got several times a crash of the XServer with the following error: Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) Jan 14 03:46:34 kxslot26 Xorg[1910]: Fatal server error: Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) TouchListenerGone: couldn't allocate events Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) Jan 14 03:46:34 kxslot26 Xorg[1910]: Please consult the The X.Org Foundation support Jan 14 03:46:34 kxslot26 Xorg[1910]: at http://wiki.x.org Jan 14 03:46:34 kxslot26 Xorg[1910]: for help. Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information. Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) Jan 14 03:46:34 kxslot26 Xorg[1910]: (II) AIGLX: Suspending AIGLX clients for VT switch This bug seems very similar to bug #92735 (https://bugs.freedesktop.org/show_bug.cgi?id=92735). However the xf86-video-intel version used here is the git 20151217, which should contain the supposed fix of this crash. System details: Kernel: Linux 3.18.13 #1 SMP Fri Jan 8 09:51:53 EST 2016 i686 GNU/Linux Graphic chipset: Happened on 915GM and 945GSE Distro: Custom debian based distro Xorg: 7.7 Tell me if you need any additional details
Can you monitor /sys/kernel/debug/dri/0/i915_gem_objects ? That's the usual giveaway.
Thanks for the info. I will monitor /sys/kernel/debug/dri/0/i915_gem_objects and report the results here. In the mean time, I had another crash, but this time with a segmentation fault: [156120.642] (EE) [156120.642] (EE) Backtrace: [156120.652] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x52) [0xb7646612] [156120.652] (EE) 1: /usr/bin/Xorg (0xb74a0000+0x1aa8b2) [0xb764a8b2] [156120.652] (EE) 2: linux-gate.so.1 (__kernel_rt_sigreturn+0x0) [0xb747ec78] [156120.652] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x60630) [0xb69f6630] [156120.652] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x61d0c) [0xb69f7d0c] [156120.652] (EE) 5: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x52d5c) [0xb69e8d5c] [156120.652] (EE) 6: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x52ed9) [0xb69e8ed9] [156120.652] (EE) 7: /usr/bin/Xorg (0xb74a0000+0x127832) [0xb75c7832] [156120.652] (EE) 8: /usr/bin/Xorg (0xb74a0000+0xe1f85) [0xb7581f85] [156120.652] (EE) 9: /usr/bin/Xorg (0xb74a0000+0xe055f) [0xb758055f] [156120.652] (EE) 10: /usr/bin/Xorg (dixDestroyPixmap+0x11) [0xb74d84b1] [156120.652] (EE) 11: /usr/bin/Xorg (0xb74a0000+0x60fde) [0xb7500fde] [156120.652] (EE) 12: /usr/bin/Xorg (FreeResource+0xb9) [0xb7501be9] [156120.652] (EE) 13: /usr/bin/Xorg (0xb74a0000+0x3873d) [0xb74d873d] [156120.653] (EE) 14: /usr/bin/Xorg (0xb74a0000+0x3d086) [0xb74dd086] [156120.653] (EE) 15: /usr/bin/Xorg (0xb74a0000+0x4126a) [0xb74e126a] [156120.653] (EE) 16: /usr/bin/Xorg (0xb74a0000+0x2af2a) [0xb74caf2a] [156120.653] (EE) 17: /lib/i386-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0xb70aa723] [156120.653] (EE) 18: /usr/bin/Xorg (0xb74a0000+0x2af68) [0xb74caf68] [156120.653] (EE) [156120.653] (EE) Segmentation fault at address 0x24 [156120.653] (EE) Fatal server error: [156120.653] (EE) Caught signal 11 (Segmentation fault). Server aborting [156120.653] (EE) [156120.653] (EE) Please consult the The X.Org Foundation support at http://wiki.x.org for help. [156120.653] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information. [156120.653] (EE) [156120.653] (II) AIGLX: Suspending AIGLX clients for VT switch [156121.191] (EE) Server terminated with error (1). Closing log file. I thought I had configured X to generate a core file, but cannot find one. Should I put the NoTrapSignals option in xorg.conf for a core file to be generated?
(In reply to Franck Delache from comment #2) > I thought I had configured X to generate a core file, but cannot find one. > Should I put the NoTrapSignals option in xorg.conf for a core file to be > generated? I believe so, based on googling!
Created attachment 121268 [details] Xorg crash core file Core file generated by the Xorg segmentation fault
I got a Segmentation fault when closing an XClient application (see attached core file). Message displayed in the journal: Jan 23 00:59:00 kxslot26 kernel: Xorg[1913]: segfault at 24 ip b6a0e700 sp bfd402fc error 4 in intel_drv.so[b69ae000+1c7000] When opening the core file in gdb, I get the following stack trace: (gdb) bt full #0 list_is_empty (head=head@entry=0x24) at ../../../src/intel_list.h:269 No locals. #1 free_list (head=head@entry=0x24) at ../../../src/sna/sna_damage.c:206 No locals. #2 0xb6a0fddc in __sna_damage_destroy (damage=0x0) at ../../../src/sna/sna_damage.c:1605 No locals. #3 0xb6a00e2c in sna_damage_destroy (damage=0xba581ec8) at ../../../src/sna/sna_damage.h:325 No locals. #4 sna_destroy_pixmap (pixmap=pixmap@entry=0xba581e28) at ../../../src/sna/sna_accel.c:1480 No locals. #5 0xb6a00fa9 in sna_destroy_pixmap (pixmap=0xba581e28) at ../../../src/sna/sna_accel.c:1461 No locals. #6 0xb75df832 in ?? () No symbol table info available. #7 0xb7599f85 in ?? () No symbol table info available. #8 0xb759855f in ?? () No symbol table info available. #9 0xb75cea8a in FreePicture () No symbol table info available. #10 0xb7518fde in ?? () No symbol table info available. #11 0xb751a19c in FreeClientResources () No symbol table info available. #12 0xb74f4453 in CloseDownClient () No symbol table info available. #13 0xb74f50bf in ?? () No symbol table info available. #14 0xb74f926a in ?? () No symbol table info available. #15 0xb74e2f2a in ?? () No symbol table info available. #16 0xb70c2723 in __libc_start_main (main=0xb74e2f00, argc=10, argv=0xbfd40634, init=0xb766f870 <__libc_csu_init>, fini=0xb766f8e0 <__libc_csu_fini>, rtld_fini=0xb74a5c90 <_dl_fini>, stack_end=0xbfd4062c) at libc-start.c:287 result = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-1222549504, 0, 0, 0, -956705371, 1992249780}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0xa, 0xb74e2f37 <_start>}, data = {prev = 0x0, cleanup = 0x0, canceltype = 10}}} not_first_call = <optimized out> #17 0xb74e2f68 in _start () I also monitored the i915_gem_objects: ==== i915_gem_objects 2 minutes before the crash ==== 41101 objects, 325840896 bytes 28483 [28483] objects, 240922624 [240922624] bytes in gtt 0 [0] active objects, 0 [0] bytes 28483 [28483] inactive objects, 240922624 [240922624] bytes 12614 unbound objects, 101982208 bytes 0 purgeable objects, 0 bytes 893 pinned mappable objects, 72069120 bytes 17302 fault mappable objects, 180113408 bytes 268435456 [268435456] gtt total Xorg: 41099 objects, 325705728 bytes (0 active, 219521024 inactive, 219521024 global, 9216000 shared, 101982208 unbound) ====================================================== ==== i915_gem_objects after a fresh restart of Xorg ==== 327 objects, 74973184 bytes 322 [322] objects, 169943040 [169943040] bytes in gtt 6 [6] active objects, 85413888 [85413888] bytes 316 [316] inactive objects, 84529152 [84529152] bytes 0 unbound objects, 0 bytes 0 purgeable objects, 0 bytes 82 pinned mappable objects, 161804288 bytes 124 fault mappable objects, 52559872 bytes 268435456 [268435456] gtt total Xorg: 325 objects, 74838016 bytes (56397824 active, 13975552 inactive, 70373376 global, 9216000 shared, 0 unbound) ======================================================== Fell free to ask if you need more information.
Hmm, a damage allocation failed and subsequent NULL dereference. Looks like it was marked as "all-damaged" so it would have worked fairly normally - just should have failed to do partial migrations. commit 21fa92a9ba254527a8661ddde837c7a4c5ff5493 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Jan 25 15:37:01 2016 +0000 sna/damage: Prevent freeing a NULL damage marked as all-damage After an allocation failure, we may mark a NULL damage pointer as being all-damaged. This works on the whole as most places we only ask whether it is all or not damaged, and if we do a partial fallback we should just then migrate the whole pixmap. However, we tried to destroy the NULL damage pointer, and crashed. Reported-by: Franck Delache <franck.delache@grassvalley.com> References: https://bugs.freedesktop.org/show_bug.cgi?id=93720 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> That should prevent this specific crash. So given the exorbitant number of objects for Xorg - we definitely have a leak somewhere along the driver stack. Next question, what does xrestop report after a high uptime?
Here is the xrestop output on an Xorg running for 10 days (no crash yet on this one): xrestop - Display: localhost Monitoring 3 clients. XErrors: 0 Pixmaps: 3874K total, Other: 1957K total, All: 5832K total res-base Wins GCs Fnts Pxms Misc Pxm mem Other Total PID Identifier 0200000 25 41638 1 41634 41648 3874K 1953K 5828K ? com-miranda-executive-service-ServiceLauncherThread 0000000 1 0 2 0 74 0B 3K 3K ? <unknown> 0400000 1 1 0 0 0 0B 48B 48B ? xrestop And its belonging i915_gem_objects: 41648 objects, 347963392 bytes 3188 [3188] objects, 198594560 [198594560] bytes in gtt 0 [0] active objects, 0 [0] bytes 3188 [3188] inactive objects, 198594560 [198594560] bytes 1791 unbound objects, 11800576 bytes 0 purgeable objects, 0 bytes 11 pinned mappable objects, 73113600 bytes 7 fault mappable objects, 54583296 bytes 268435456 [268435456] gtt total Xorg: 41646 objects, 347828224 bytes (0 active, 75116544 inactive, 75116544 global, 9216000 shared, 268247040 unbound) And thanks for the commit, I will try that.
Phew. I don't have to look for a driver leak of buffer objects! Now to decide if that is a client of pixmaps or if someone else is taking a ref on the pixmap.
For your information, there is only one X Client running which is a Java application. JRE used is 1.7.0_76 with XRender pipeline activated, if that can help.
(In reply to Franck Delache from comment #9) > For your information, there is only one X Client running which is a Java > application. JRE used is 1.7.0_76 with XRender pipeline activated, if that > can help. It is sitting on 40+ thousand pixmaps, and accounts for all but a few of the buffers allocated (which are used as backing storage for pixmaps) for the system, although that appears to be only ~350MiB each buffer object takes up precious kernel resource, which may be exhausted even before we run out of client space. That xrestop still reports that the client owns the pixmap should be a good indicator that the leak is clientside.
For the record, my Java program renders a lot of images on the screen. It seems that every time the JVM wants to draw an image it allocates a pixmap and links it to the Java XRSurfaceData object (class used internally by the JVM). Once the Java XRSurfaceData object is not used anymore (i.e. cannot be strongly reached) AND is garbage collected then the pixmap is freed (one can look at the sun.java2d.Disposer class for more details). The problem is that even if the Java XRSurfaceData object is not referenced anymore, it does not mean it is immediately freed. One must wait for the Garbage Collector to do its job and free those objects. Thus the XServer and kernel resources are freed only when the JVM decides to cleanup its own heap. My application does not keep any reference over those XRSurfaceData objects and using a profiler, I can see that there are a lot of weakly references XRSurfaceData objects (i.e. objects ready to be garbage collected), but unfortunately the Garbage Collector does not run yet since the JVM still has enough free heap. Which goes to a point where the host resources are exhausted before the JVM consider it is time to do some cleanup in its heap. I am still scratching my head on how to resolve this issue, but would like some clarification on the resources involved. Could you answer some questions? - Does pixmaps are allocated in user space? - Does buffers (backing the pixmaps) are allocated in user space or kernel space? - What kind of kernel resources are used for each buffer? - Can we tune the amount of memory reserved by the kernel for those resources? That would help me understand the system behavior and see what could I tune to avoid that bad effect.
(In reply to Franck Delache from comment #11) > I am still scratching my head on how to resolve this issue, but would like > some clarification on the resources involved. Could you answer some > questions? > > - Does pixmaps are allocated in user space? The pixmap consists of a header (a userspace struct) plus some backing storage. That backing storage can just be normal system RAM or a set of kernel buffers. > - Does buffers (backing the pixmaps) are allocated in user space or kernel > space? Both depending on the use case. Not everything that can be rendered by X can be rendered by the GPU, for those we try to avoid using the GPU at all (and so just use a regular malloc()). Similarly, there can be times where we haven't decided to commit to using the GPU, though typically they get allocated backing storage by the kernel anyway (just used like regular RAM, and converted into a GPU buffer on demand). > - What kind of kernel resources are used for each buffer? Substantial, I assume around 512 bytes of overhead for each buffer. (It's probably closer to "just" 160 bytes in most cases.) The buffers are allocated from swappable storage (i.e. we can move the contents of the GPU buffers out to swap and return the pages back to the system), and only instantiated upon first use. > - Can we tune the amount of memory reserved by the kernel for those > resources? We don't reserve any memory ourselves (other than what is tied up by internal driver allocations from unswappable kernel memory, i.e. the structs we use to track the buffers and usage). That memory should be only allocated on demand, and returned to the system when no longer used. There is "stolen memory" which is what the BIOS sets aside from the system and reserved for igfx - which we do not fully utilize and you can recover some memory by tuning that in the BIOS. The majority of the memory we use should be for the buffers, and that is swappable or even discarded upon demand (we can't discard a buffer unless the user says so!). Modulo bugs in the memory management code, of course.
Since on this system there is no swap, it might explain my behavior (crash when resource is exhausted due to resources not freed up and not swappable). Thanks for the detailed information.
Marking this bug as fixed since the segmentation fault has been corrected by commit 21fa92. The (EE) TouchListenerGone: couldn't allocate events problem was caused by pixmaps not freed by my Java application, and thus is not a bug per se.
Created attachment 121441 [details] [review] Avoid overallocating event list for TouchOwnershipEvent Though this would remove that particular crash on out-of-memory...
Couldn't reproduce the (EE) TouchListenerGone: couldn't allocate events problem with the patch proposed in comment #15. Considering this bug completely fixed. Thanks!
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.