93720 – Random crashes on (EE) TouchListenerGone: couldn't allocate events

Bug 93720 - Random crashes on (EE) TouchListenerGone: couldn't allocate events

Summary: Random crashes on (EE) TouchListenerGone: couldn't allocate events

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel (show other bugs)
Version:	7.7 (2012.06)
Hardware:	x86 (IA32) Linux (All)

Importance:	medium critical
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-01-14 21:11 UTC by Franck Delache
Modified:	2016-02-03 19:53 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
Xorg.log (23.27 KB, text/plain) 2016-01-14 21:11 UTC, Franck Delache	no flags	Details
Xorg crash core file (9.98 MB, application/octet-stream) 2016-01-25 15:09 UTC, Franck Delache	no flags	Details
Avoid overallocating event list for TouchOwnershipEvent (6.19 KB, patch) 2016-02-01 16:06 UTC, Chris Wilson	no flags	Details \| Splinter Review
View All

Description Franck Delache 2016-01-14 21:11:44 UTC

Created attachment 121049 [details]
Xorg.log

Got several times a crash of the XServer with the following error:

Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE)
Jan 14 03:46:34 kxslot26 Xorg[1910]: Fatal server error:
Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) TouchListenerGone: couldn't allocate events
Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE)
Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE)
Jan 14 03:46:34 kxslot26 Xorg[1910]: Please consult the The X.Org Foundation support
Jan 14 03:46:34 kxslot26 Xorg[1910]: at http://wiki.x.org
Jan 14 03:46:34 kxslot26 Xorg[1910]: for help.
Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
Jan 14 03:46:34 kxslot26 Xorg[1910]: (EE)
Jan 14 03:46:34 kxslot26 Xorg[1910]: (II) AIGLX: Suspending AIGLX clients for VT switch

This bug seems very similar to bug #92735 (https://bugs.freedesktop.org/show_bug.cgi?id=92735). However the xf86-video-intel version used here is the git 20151217, which should contain the supposed fix of this crash.

System details:
Kernel: Linux 3.18.13 #1 SMP Fri Jan 8 09:51:53 EST 2016 i686 GNU/Linux
Graphic chipset: Happened on 915GM and 945GSE
Distro: Custom debian based distro
Xorg: 7.7

Tell me if you need any additional details

Comment 1 Chris Wilson 2016-01-15 09:45:36 UTC

Can you monitor /sys/kernel/debug/dri/0/i915_gem_objects ?
That's the usual giveaway.

Comment 2 Franck Delache 2016-01-15 14:38:38 UTC

Thanks for the info. I will monitor /sys/kernel/debug/dri/0/i915_gem_objects and report the results here.

In the mean time, I had another crash, but this time with a segmentation fault:
[156120.642] (EE) 
[156120.642] (EE) Backtrace:
[156120.652] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x52) [0xb7646612]
[156120.652] (EE) 1: /usr/bin/Xorg (0xb74a0000+0x1aa8b2) [0xb764a8b2]
[156120.652] (EE) 2: linux-gate.so.1 (__kernel_rt_sigreturn+0x0) [0xb747ec78]
[156120.652] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x60630) [0xb69f6630]
[156120.652] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x61d0c) [0xb69f7d0c]
[156120.652] (EE) 5: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x52d5c) [0xb69e8d5c]
[156120.652] (EE) 6: /usr/lib/xorg/modules/drivers/intel_drv.so (0xb6996000+0x52ed9) [0xb69e8ed9]
[156120.652] (EE) 7: /usr/bin/Xorg (0xb74a0000+0x127832) [0xb75c7832]
[156120.652] (EE) 8: /usr/bin/Xorg (0xb74a0000+0xe1f85) [0xb7581f85]
[156120.652] (EE) 9: /usr/bin/Xorg (0xb74a0000+0xe055f) [0xb758055f]
[156120.652] (EE) 10: /usr/bin/Xorg (dixDestroyPixmap+0x11) [0xb74d84b1]
[156120.652] (EE) 11: /usr/bin/Xorg (0xb74a0000+0x60fde) [0xb7500fde]
[156120.652] (EE) 12: /usr/bin/Xorg (FreeResource+0xb9) [0xb7501be9]
[156120.652] (EE) 13: /usr/bin/Xorg (0xb74a0000+0x3873d) [0xb74d873d]
[156120.653] (EE) 14: /usr/bin/Xorg (0xb74a0000+0x3d086) [0xb74dd086]
[156120.653] (EE) 15: /usr/bin/Xorg (0xb74a0000+0x4126a) [0xb74e126a]
[156120.653] (EE) 16: /usr/bin/Xorg (0xb74a0000+0x2af2a) [0xb74caf2a]
[156120.653] (EE) 17: /lib/i386-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0xb70aa723]
[156120.653] (EE) 18: /usr/bin/Xorg (0xb74a0000+0x2af68) [0xb74caf68]
[156120.653] (EE) 
[156120.653] (EE) Segmentation fault at address 0x24
[156120.653] (EE) 
Fatal server error:
[156120.653] (EE) Caught signal 11 (Segmentation fault). Server aborting
[156120.653] (EE) 
[156120.653] (EE) 
Please consult the The X.Org Foundation support 
	 at http://wiki.x.org
 for help. 
[156120.653] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[156120.653] (EE) 
[156120.653] (II) AIGLX: Suspending AIGLX clients for VT switch
[156121.191] (EE) Server terminated with error (1). Closing log file.

I thought I had configured X to generate a core file, but cannot find one. Should I put the NoTrapSignals option in xorg.conf for a core file to be generated?

Comment 3 Chris Wilson 2016-01-15 14:45:03 UTC

(In reply to Franck Delache from comment #2)
> I thought I had configured X to generate a core file, but cannot find one.
> Should I put the NoTrapSignals option in xorg.conf for a core file to be
> generated?

I believe so, based on googling!

Comment 4 Franck Delache 2016-01-25 15:09:07 UTC

Created attachment 121268 [details]
Xorg crash core file

Core file generated by the Xorg segmentation fault

Comment 5 Franck Delache 2016-01-25 15:13:25 UTC

I got a Segmentation fault when closing an XClient application (see attached core file).
Message displayed in the journal:
Jan 23 00:59:00 kxslot26 kernel: Xorg[1913]: segfault at 24 ip b6a0e700 sp bfd402fc error 4 in intel_drv.so[b69ae000+1c7000]

When opening the core file in gdb, I get the following stack trace:

(gdb) bt full
#0  list_is_empty (head=head@entry=0x24) at ../../../src/intel_list.h:269
No locals.
#1  free_list (head=head@entry=0x24) at ../../../src/sna/sna_damage.c:206
No locals.
#2  0xb6a0fddc in __sna_damage_destroy (damage=0x0) at ../../../src/sna/sna_damage.c:1605
No locals.
#3  0xb6a00e2c in sna_damage_destroy (damage=0xba581ec8) at ../../../src/sna/sna_damage.h:325
No locals.
#4  sna_destroy_pixmap (pixmap=pixmap@entry=0xba581e28) at ../../../src/sna/sna_accel.c:1480
No locals.
#5  0xb6a00fa9 in sna_destroy_pixmap (pixmap=0xba581e28) at ../../../src/sna/sna_accel.c:1461
No locals.
#6  0xb75df832 in ?? ()
No symbol table info available.
#7  0xb7599f85 in ?? ()
No symbol table info available.
#8  0xb759855f in ?? ()
No symbol table info available.
#9  0xb75cea8a in FreePicture ()
No symbol table info available.
#10 0xb7518fde in ?? ()
No symbol table info available.
#11 0xb751a19c in FreeClientResources ()
No symbol table info available.
#12 0xb74f4453 in CloseDownClient ()
No symbol table info available.
#13 0xb74f50bf in ?? ()
No symbol table info available.
#14 0xb74f926a in ?? ()
No symbol table info available.
#15 0xb74e2f2a in ?? ()
No symbol table info available.
#16 0xb70c2723 in __libc_start_main (main=0xb74e2f00, argc=10, argv=0xbfd40634, init=0xb766f870 <__libc_csu_init>, fini=0xb766f8e0 <__libc_csu_fini>, rtld_fini=0xb74a5c90 <_dl_fini>, stack_end=0xbfd4062c) at libc-start.c:287
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-1222549504, 0, 0, 0, -956705371, 1992249780}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0xa, 0xb74e2f37 <_start>}, data = {prev = 0x0, cleanup = 0x0, 
              canceltype = 10}}}
        not_first_call = <optimized out>
#17 0xb74e2f68 in _start ()


I also monitored the i915_gem_objects:
==== i915_gem_objects 2 minutes before the crash ====
41101 objects, 325840896 bytes
28483 [28483] objects, 240922624 [240922624] bytes in gtt
  0 [0] active objects, 0 [0] bytes
  28483 [28483] inactive objects, 240922624 [240922624] bytes
12614 unbound objects, 101982208 bytes
0 purgeable objects, 0 bytes
893 pinned mappable objects, 72069120 bytes
17302 fault mappable objects, 180113408 bytes
268435456 [268435456] gtt total

Xorg: 41099 objects, 325705728 bytes (0 active, 219521024 inactive, 219521024 global, 9216000 shared, 101982208 unbound)
======================================================

==== i915_gem_objects after a fresh restart of Xorg ====
327 objects, 74973184 bytes
322 [322] objects, 169943040 [169943040] bytes in gtt
  6 [6] active objects, 85413888 [85413888] bytes
  316 [316] inactive objects, 84529152 [84529152] bytes
0 unbound objects, 0 bytes
0 purgeable objects, 0 bytes
82 pinned mappable objects, 161804288 bytes
124 fault mappable objects, 52559872 bytes
268435456 [268435456] gtt total

Xorg: 325 objects, 74838016 bytes (56397824 active, 13975552 inactive, 70373376 global, 9216000 shared, 0 unbound)
========================================================

Fell free to ask if you need more information.

Comment 6 Chris Wilson 2016-01-25 15:40:34 UTC

Hmm, a damage allocation failed and subsequent NULL dereference. Looks like it was marked as "all-damaged" so it would have worked fairly normally - just should have failed to do partial migrations.

commit 21fa92a9ba254527a8661ddde837c7a4c5ff5493
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 25 15:37:01 2016 +0000

    sna/damage: Prevent freeing a NULL damage marked as all-damage
    
    After an allocation failure, we may mark a NULL damage pointer as being
    all-damaged. This works on the whole as most places we only ask whether it
    is all or not damaged, and if we do a partial fallback we should just
    then migrate the whole pixmap. However, we tried to destroy the NULL
    damage pointer, and crashed.
    
    Reported-by: Franck Delache <franck.delache@grassvalley.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=93720
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That should prevent this specific crash.

So given the exorbitant number of objects for Xorg - we definitely have a leak somewhere along the driver stack. Next question, what does xrestop report after a high uptime?

Comment 7 Franck Delache 2016-01-25 16:46:21 UTC

Here is the xrestop output on an Xorg running for 10 days (no crash yet on this one):

xrestop - Display: localhost
          Monitoring 3 clients. XErrors: 0
          Pixmaps:    3874K total, Other:    1957K total, All:    5832K total

res-base Wins  GCs Fnts Pxms Misc   Pxm mem  Other   Total   PID Identifier    
0200000    25 41638    1 41634 41648     3874K   1953K   5828K   ?   com-miranda-executive-service-ServiceLauncherThread
0000000     1    0    2    0   74        0B      3K      3K   ?   <unknown>
0400000     1    1    0    0    0        0B     48B     48B   ?   xrestop

And its belonging i915_gem_objects:
41648 objects, 347963392 bytes
3188 [3188] objects, 198594560 [198594560] bytes in gtt
  0 [0] active objects, 0 [0] bytes
  3188 [3188] inactive objects, 198594560 [198594560] bytes
1791 unbound objects, 11800576 bytes
0 purgeable objects, 0 bytes
11 pinned mappable objects, 73113600 bytes
7 fault mappable objects, 54583296 bytes
268435456 [268435456] gtt total

Xorg: 41646 objects, 347828224 bytes (0 active, 75116544 inactive, 75116544 global, 9216000 shared, 268247040 unbound)

And thanks for the commit, I will try that.

Comment 8 Chris Wilson 2016-01-25 16:55:33 UTC

Phew. I don't have to look for a driver leak of buffer objects! Now to decide if that is a client of pixmaps or if someone else is taking a ref on the pixmap.

Comment 9 Franck Delache 2016-01-25 17:14:12 UTC

For your information, there is only one X Client running which is a Java application. JRE used is 1.7.0_76 with XRender pipeline activated, if that can help.

Comment 10 Chris Wilson 2016-01-25 21:37:47 UTC

(In reply to Franck Delache from comment #9)
> For your information, there is only one X Client running which is a Java
> application. JRE used is 1.7.0_76 with XRender pipeline activated, if that
> can help.

It is sitting on 40+ thousand pixmaps, and accounts for all but a few of the buffers allocated (which are used as backing storage for pixmaps) for the system, although that appears to be only ~350MiB each buffer object takes up precious kernel resource, which may be exhausted even before we run out of client space.

That xrestop still reports that the client owns the pixmap should be a good indicator that the leak is clientside.

Comment 11 Franck Delache 2016-01-28 22:07:13 UTC

For the record, my Java program renders a lot of images on the screen.
It seems that every time the JVM wants to draw an image it allocates a pixmap and links it to the Java XRSurfaceData object (class used internally by the JVM).

Once the Java XRSurfaceData object is not used anymore (i.e. cannot be strongly reached) AND is garbage collected then the pixmap is freed (one can look at the sun.java2d.Disposer class for more details).

The problem is that even if the Java XRSurfaceData object is not referenced anymore, it does not mean it is immediately freed. One must wait for the Garbage Collector to do its job and free those objects.

Thus the XServer and kernel resources are freed only when the JVM decides to cleanup its own heap.

My application does not keep any reference over those XRSurfaceData objects and using a profiler, I can see that there are a lot of weakly references XRSurfaceData objects (i.e. objects ready to be garbage collected), but unfortunately the Garbage Collector does not run yet since the JVM still has enough free heap.
Which goes to a point where the host resources are exhausted before the JVM consider it is time to do some cleanup in its heap.

I am still scratching my head on how to resolve this issue, but would like some clarification on the resources involved. Could you answer some questions?

 - Does pixmaps are allocated in user space?
 - Does buffers (backing the pixmaps) are allocated in user space or kernel space?
 - What kind of kernel resources are used for each buffer?
 - Can we tune the amount of memory reserved by the kernel for those resources?

That would help me understand the system behavior and see what could I tune to avoid that bad effect.

Comment 12 Chris Wilson 2016-01-29 13:39:46 UTC

(In reply to Franck Delache from comment #11)
> I am still scratching my head on how to resolve this issue, but would like
> some clarification on the resources involved. Could you answer some
> questions?
> 
>  - Does pixmaps are allocated in user space?

The pixmap consists of a header (a userspace struct) plus some backing storage. That backing storage can just be normal system RAM or a set of kernel buffers.

>  - Does buffers (backing the pixmaps) are allocated in user space or kernel
> space?

Both depending on the use case. Not everything that can be rendered by X can be rendered by the GPU, for those we try to avoid using the GPU at all (and so just use a regular malloc()). Similarly, there can be times where we haven't decided to commit to using the GPU, though typically they get allocated backing storage by the kernel anyway (just used like regular RAM, and converted into a GPU buffer on demand).

>  - What kind of kernel resources are used for each buffer?

Substantial, I assume around 512 bytes of overhead for each buffer. (It's probably closer to "just" 160 bytes in most cases.) The buffers are allocated from swappable storage (i.e. we can move the contents of the GPU buffers out to swap and return the pages back to the system), and only instantiated upon first use.

>  - Can we tune the amount of memory reserved by the kernel for those
> resources?

We don't reserve any memory ourselves (other than what is tied up by internal driver allocations from unswappable kernel memory, i.e. the structs we use to track the buffers and usage). That memory should be only allocated on demand, and returned to the system when no longer used. There is "stolen memory" which is what the BIOS sets aside from the system and reserved for igfx - which we do not fully utilize and you can recover some memory by tuning that in the BIOS. The majority of the memory we use should be for the buffers, and that is swappable or even discarded upon demand (we can't discard a buffer unless the user says so!). Modulo bugs in the memory management code, of course.

Comment 13 Franck Delache 2016-01-29 17:14:27 UTC

Since on this system there is no swap, it might explain my behavior (crash when resource is exhausted due to resources not freed up and not swappable).

Thanks for the detailed information.

Comment 14 Franck Delache 2016-02-01 15:40:11 UTC

Marking this bug as fixed since the segmentation fault has been corrected by commit 21fa92.
The (EE) TouchListenerGone: couldn't allocate events problem was caused by pixmaps not freed by my Java application, and thus is not a bug per se.

Comment 15 Chris Wilson 2016-02-01 16:06:47 UTC

Created attachment 121441 [details] [review]
Avoid overallocating event list for TouchOwnershipEvent

Though this would remove that particular crash on out-of-memory...

Comment 16 Franck Delache 2016-02-03 19:53:58 UTC

Couldn't reproduce the (EE) TouchListenerGone: couldn't allocate events problem with the patch proposed in comment #15.
Considering this bug completely fixed. Thanks!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.