Bug 91577 - [HD Graphics 4000] Segfault accessing rq->bo->handle
Summary: [HD Graphics 4000] Segfault accessing rq->bo->handle
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 92062
Depends on:
Blocks:
 
Reported: 2015-08-06 16:05 UTC by Dominic
Modified: 2015-11-15 19:20 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
last 2000 lines of Xorg.0.log (138.69 KB, text/plain)
2015-08-06 16:07 UTC, Dominic
no flags Details
first 2000 lines of Xorg.0.log (128.07 KB, text/plain)
2015-08-06 16:08 UTC, Dominic
no flags Details
first 2000 lines of Xorg.0.log (non-stripped binary) (128.05 KB, text/plain)
2015-08-06 16:35 UTC, Dominic
no flags Details
last 2000 lines of Xorg.0.log (non-stripped binary) (145.90 KB, text/plain)
2015-08-06 16:36 UTC, Dominic
no flags Details
last 2000 lines of Xorg.0.log for the first segfault (142.31 KB, text/plain)
2015-08-06 16:52 UTC, Dominic
no flags Details
differences in /proc/<Xorg>/maps after scrolling once (1.59 MB, text/plain)
2015-08-07 09:36 UTC, Dominic
no flags Details
last 20k lines of Xorg.0.log (with the patches) (57.16 KB, text/plain)
2015-08-07 10:50 UTC, Dominic
no flags Details
last 20k lines of Xorg.0.log (at 672436efd5b69fb9007cb80804a351b1e1572b60) (63.89 KB, text/plain)
2015-08-07 12:32 UTC, Dominic
no flags Details
last 20k lines of Xorg.0.log (at 6bab82b91d64af2e31d9f5fe7705c68caf09004b, no crash) (2.61 MB, text/plain)
2015-08-07 13:09 UTC, Dominic
no flags Details
second to last 1M lines of Xorg.0.log (2.52 MB, text/plain)
2015-08-07 13:32 UTC, Dominic
no flags Details
from 3M to 2M before end of Xorg.0.log (2.81 MB, text/plain)
2015-08-07 13:38 UTC, Dominic
no flags Details
Chunk 1/3 (2.63 MB, text/plain)
2015-08-07 15:15 UTC, Dominic
no flags Details
Chunk 2/3 (2.62 MB, text/plain)
2015-08-07 15:20 UTC, Dominic
no flags Details
Chunk 3/3 (2.58 MB, text/plain)
2015-08-07 15:22 UTC, Dominic
no flags Details

Description Dominic 2015-08-06 16:05:00 UTC
I experience reproducible crashes with trunk (bc063c9f4af90542297877281b1b9f4692dafde4).

Steps to reproduce on my machine:
1. Open Links (on X)
2. Navigate to http://www.x.org/wiki/Development/Documentation/ServerDebugging/
3. Scroll down to the script "Debugging with one machine", "Version 1"
4. Select/deselect it using pointer repeatedly

After a short while, intel_drv.so crashes:

[119917.095] (EE) 
[119917.095] (EE) Backtrace:
[119917.102] (EE) 0: /usr/bin/Xorg (OsSigHandler+0x29) [0x5a34a9]
[119917.173] (EE) 1: /lib64/libc.so.6 (killpg+0x40) [0x7f60c968252f]
[119917.241] (EE) 2: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x166e6) [0x7f60c7645786]
[119917.307] (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x108094) [0x7f60c7828974]
[119917.371] (EE) 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x2d925) [0x7f60c7673b25]
[119917.435] (EE) 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x5169c) [0x7f60c76bb0bc]
[119917.435] (EE) 6: /usr/bin/Xorg (damagePolyFillRect+0x7b) [0x52451b]
[119917.435] (EE) 7: /usr/bin/Xorg (ProcPolyFillRectangle+0xe0) [0x436630]
[119917.435] (EE) 8: /usr/bin/Xorg (Dispatch+0x287) [0x439ed7]
[119917.435] (EE) 9: /usr/bin/Xorg (dix_main+0x3f5) [0x43e0a5]
[119917.499] (EE) 10: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7f60c966f730]
[119917.500] (EE) 11: /usr/bin/Xorg (_start+0x29) [0x427e59]
[119917.568] (EE) 12: ? (?+0x29) [0x29]
[119917.568] (EE) 
[119917.568] (EE) Segmentation fault at address 0x80
[119917.568] (EE) 
Fatal server error:
[119917.568] (EE) Caught signal 11 (Segmentation fault). Server aborting
Comment 1 Dominic 2015-08-06 16:07:12 UTC
Created attachment 117564 [details]
last 2000 lines of Xorg.0.log
Comment 2 Dominic 2015-08-06 16:08:09 UTC
Created attachment 117565 [details]
first 2000 lines of Xorg.0.log
Comment 3 Dominic 2015-08-06 16:34:54 UTC
Sorry for accidentally using an intel_drv.so with its debug symbols split out.
Here is a backtrace from a non-stripped intel_drv.so:

[   217.465] (EE) 
[   217.466] (EE) Backtrace:
[   217.472] (EE) 0: /usr/bin/Xorg (OsSigHandler+0x29) [0x5a34a9]
[   217.535] (EE) 1: /lib64/libc.so.6 (killpg+0x40) [0x7f070117652f]
[   217.614] (EE) 2: /usr/lib64/xorg/modules/drivers/intel_drv.so (kgem_retire__requests_ring+0x1a) [0x7f06ff119490]
[   217.673] (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (kgem_retire__requests+0x1e) [0x7f06ff11955e]
[   217.735] (EE) 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (kgem_retire+0x50) [0x7f06ff119abb]
[   217.794] (EE) 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (__kgem_throttle_retire+0x3f) [0x7f06ff119b37]
[   217.854] (EE) 6: /usr/lib64/xorg/modules/drivers/intel_drv.so (search_linear_cache+0x26b) [0x7f06ff119fd0]
[   217.914] (EE) 7: /usr/lib64/xorg/modules/drivers/intel_drv.so (kgem_create_buffer+0x5fa) [0x7f06ff124470]
[   217.973] (EE) 8: /usr/lib64/xorg/modules/drivers/intel_drv.so (sna_write_boxes+0x13c6) [0x7f06ff18dc18]
[   218.033] (EE) 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (sna_copy_boxes+0x14dc) [0x7f06ff1595b1]
[   218.093] (EE) 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (sna_do_copy+0x7c8) [0x7f06ff12fd07]
[   218.152] (EE) 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (sna_copy_area+0x1e2) [0x7f06ff12ff9a]
[   218.153] (EE) 12: /usr/bin/Xorg (damageCopyArea+0xd3) [0x526123]
[   218.153] (EE) 13: /usr/bin/Xorg (ProcCopyArea+0x161) [0x435cb1]
[   218.153] (EE) 14: /usr/bin/Xorg (Dispatch+0x287) [0x439ed7]
[   218.153] (EE) 15: /usr/bin/Xorg (dix_main+0x3f5) [0x43e0a5]
[   218.213] (EE) 16: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7f0701163730]
[   218.213] (EE) 17: /usr/bin/Xorg (_start+0x29) [0x427e59]
[   218.278] (EE) 18: ? (?+0x29) [0x29]
[   218.278] (EE) 
[   218.278] (EE) Segmentation fault at address 0x80
[   218.278] (EE) 
Fatal server error:
[   218.278] (EE) Caught signal 11 (Segmentation fault). Server aborting
Comment 4 Dominic 2015-08-06 16:35:36 UTC
Created attachment 117566 [details]
first 2000 lines of Xorg.0.log (non-stripped binary)
Comment 5 Dominic 2015-08-06 16:36:07 UTC
Created attachment 117567 [details]
last 2000 lines of Xorg.0.log (non-stripped binary)
Comment 6 Dominic 2015-08-06 16:51:00 UTC
Looking at the two traces, they are different (although both crashes are produced by the same "recipe").
Managed to reproduce the first one again:

[  1333.095] (EE) 
[  1333.096] (EE) Backtrace:
[  1333.102] (EE) 0: /usr/bin/Xorg (OsSigHandler+0x29) [0x5a34a9]
[  1333.171] (EE) 1: /lib64/libc.so.6 (killpg+0x40) [0x7f33c628452f]
[  1333.247] (EE) 2: /usr/lib64/xorg/modules/drivers/intel_drv.so (__kgem_ring_is_idle+0x24) [0x7f33c42286cb]
[  1333.311] (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (gen7_render_fill+0x234) [0x7f33c431a392]
[  1333.374] (EE) 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (sna_poly_fill_rect_blt+0x442) [0x7f33c4242e16]
[  1333.438] (EE) 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (sna_poly_fill_rect+0x8f6) [0x7f33c4265781]
[  1333.438] (EE) 6: /usr/bin/Xorg (damagePolyFillRect+0x7b) [0x52451b]
[  1333.438] (EE) 7: /usr/bin/Xorg (ProcPolyFillRectangle+0xe0) [0x436630]
[  1333.438] (EE) 8: /usr/bin/Xorg (Dispatch+0x287) [0x439ed7]
[  1333.438] (EE) 9: /usr/bin/Xorg (dix_main+0x3f5) [0x43e0a5]
[  1333.502] (EE) 10: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7f33c6271730]
[  1333.502] (EE) 11: /usr/bin/Xorg (_start+0x29) [0x427e59]
[  1333.571] (EE) 12: ? (?+0x29) [0x29]
[  1333.571] (EE) 
[  1333.571] (EE) Segmentation fault at address 0x80
[  1333.571] (EE) 
Fatal server error:
[  1333.571] (EE) Caught signal 11 (Segmentation fault). Server aborting
Comment 7 Dominic 2015-08-06 16:52:02 UTC
Created attachment 117568 [details]
last 2000 lines of Xorg.0.log for the first segfault
Comment 8 Dominic 2015-08-06 17:10:21 UTC
More info:

First crash:
__kgem_ring_is_idle+0x24 is in src/sna/kgem.c:3132
gen7_render_fill+0x234 is in src/sna/kgem.h:359
sna_poly_fill_rect_blt+0x442 is in src/sna/sna_accel.c:465

Second crash:
kgem_retire__requests_ring+0x1a is in src/sna/kgem.c:3061
kgem_retire__requests+0x1e is in src/sna/kgem.c:3095
kgem_retire+0x50 is in src/sna/kgem.c:3111
__kgem_throttle_retire+0x3f is in src/sna/kgem.c:549
search_linear_cache+0x26b is in src/sna/kgem.c:4438
kgem_create_buffer+0x5fa is in src/sna/kgem.c:7396
sna_write_boxes+0x13c6 is in src/sna/sna_io.c:1174
sna_copy_boxes+0x14dc is in src/sna/sna_accel.c:6857
sna_do_copy+0x7c8 is in src/sna/sna_accel.c:7176
sna_copy_area+0x1e2 is in src/sna/sna_accel.c:7267


So both segfault addresses are in calls to
 __kgem_busy(kgem, rq->bo->handle)

Unfortunately, I cannot catch it in gdb to see whether rq->bo or rq->bo->handle fails.
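
For illustration only (the struct layout below is an assumption, not the real definitions in src/sna/kgem.h): if rq->bo were NULL, the rq->bo->handle load would become a read at a small fixed offset from address 0, which would fit the reported "Segmentation fault at address 0x80".

struct fake_bo {
        char pad[0x80];         /* assume 'handle' happens to sit at offset 0x80 */
        unsigned int handle;
};

struct fake_request {
        struct fake_bo *bo;     /* left NULL if the batch bo allocation failed */
};

static int request_is_busy(struct fake_request *rq)
{
        /* With rq->bo == NULL this loads from (char *)0 + 0x80, i.e. it
         * faults at address 0x80. A defensive check would be:
         *   if (rq->bo == NULL) return 0; */
        return rq->bo->handle != 0;
}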
Comment 9 Chris Wilson 2015-08-07 09:01:04 UTC
Looks like you ran out of memory, and we then started failing to create rq->bo.
Comment 10 Dominic 2015-08-07 09:22:51 UTC
Possibly. I experience a slow-down prior to the crashes.

I am running xf86-video-intel trunk because of a memory leak in the last released version (on my machine, just opening specific applications (like java/Arduino-IDE) brought Xorg memory usage to >2GiB).
This led to X being OOM-killed.
Valgrind pointed to intel_drv.so. I did not investigate this further as this seems fixed on trunk.

--

Just tested again. Indeed, Xorg is running out of memory:
Just scrolling through the aforementioned page in Links drives Xorg memory usage up a lot. The memory footprint does not go down after Links is closed, and valgrind considers the memory still reachable.

I don't suspect Links; I just chose it because it reliably (and quickly) exhibits this segfault. I also see it while working in gnome-terminal (less likely to crash) and Emacs (more so), but also "randomly".
Comment 11 Dominic 2015-08-07 09:36:31 UTC
Created attachment 117572 [details]
differences in /proc/<Xorg>/maps after scrolling once

Scrolling the page once from top to bottom makes /proc/<Xorg>/maps grow by many "/drm mm object (deleted)" entries (see the attached diff).
Comment 12 Dominic 2015-08-07 09:55:29 UTC
The references *are* dropped... after a while.
But apparently not quickly enough?

And should we fail more gracefully on oom? Or free cached objects (that's what they are, right?) before reaching a threshold?
Comment 13 Chris Wilson 2015-08-07 09:55:52 UTC
The cursor doesn't do anything for me with links in the terminal. Do you mind sending me the complete debug log? I want to see how we end up with 64k buffers. In the meantime,

commit 69d8edc11173df021aa2e158b2530257113141fd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 7 10:08:17 2015 +0100

    sna: Handle batch allocation failure
    
    Whilst we currently do not try and submit a failed batch buffer
    allocation, we still treat it as a valid request. This explodes much
    later when we inspect the NULL rq->bo.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=91577
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 8c465d0fbf84b1d29c54d620f09063d2b7ccfeb8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 7 10:15:42 2015 +0100

    sna: Fallback after a bo allocation failure for the batch
    
    If we fail to allocate the next bo to use for the next_request, we can
    just fallback to the delayed allocation used by !llc.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91577
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit ca71199679cac9cc161c84cb09d12f133abf8a64
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 7 10:40:30 2015 +0100

    sna: Harden batch allocation against resource starvation
    
    If the batch allocation fails, retire the oldest request and try again.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=91577
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should prevent the allocation failure, or if we can't prevent it, handle it correctly.
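
For reference, the retry pattern the last commit describes, as a minimal sketch (the helper names below are hypothetical, not the actual kgem API):

struct kgem_like;               /* stand-in types, not the real ones */
struct bo_like;

extern struct bo_like *try_alloc_batch_bo(struct kgem_like *k);  /* hypothetical */
extern int retire_oldest_request(struct kgem_like *k);           /* hypothetical: returns 0 when nothing is left to retire */

static struct bo_like *acquire_batch_bo(struct kgem_like *k)
{
        struct bo_like *bo = try_alloc_batch_bo(k);

        while (bo == NULL && retire_oldest_request(k)) {
                /* Retiring the oldest outstanding request may free enough
                 * memory for the allocation to succeed on a retry. */
                bo = try_alloc_batch_bo(k);
        }

        return bo;      /* may still be NULL; per the first two commits the
                         * caller now copes with that instead of exploding later */
}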
Comment 14 Chris Wilson 2015-08-07 09:58:16 UTC
By Links, do you mean the terminal browser? Or do you have some fancy GUI?
Comment 15 Chris Wilson 2015-08-07 10:12:43 UTC
(In reply to Dominic from comment #12)
> The references *are* dropped... after a while.
> But apparently not quickly enough?
> 
> And should we fail more gracefully on oom? Or free cached objects (that's
> what they are, right?) before reaching a threshold?

We do (or at least are supposed to). That's the silly part: we reap our caches and try again before propagating a failure.
Comment 16 Dominic 2015-08-07 10:16:42 UTC
No, not links in the terminal. That one is fine. I mean "links -g" (I am using Links 2.10).

By debug log, do you mean the full Xorg.0.log with --enable-debug=full? (I'll have to think about how; it's still 23MiB bzip2'ed.)

I'll try your patches in a minute.
Comment 17 Chris Wilson 2015-08-07 10:24:34 UTC
Try xz, that may shave off a few more bytes. And say the last 20,000 lines? I just need enough to see what the allocation pattern is.
Comment 18 Chris Wilson 2015-08-07 10:40:05 UTC
Ah, here it is: links2 -g. Messing around following your instructions, my object count hits 2k and holds. That is likely to be machine-dependent though. :|
Comment 19 Dominic 2015-08-07 10:43:08 UTC
The patches change the segfault but cannot prevent it. The segfault I just got did not yield a meaningful backtrace, however. (heap corruption?)

[   365.632] kgem_cleanup_cache
[   365.632] kgem_cleanup_cache: sync on cleanup
[   365.632] __kgem_retire_requests_upto(handle=2)
[   365.632] __kgem_retire_rq: request 2 complete
[   365.632] kgem_retire, need_retire?=1
[   365.632] __kgem_busy: handle=17, busy=1, wedged=0
[   365.632] kgem_retire__flushing: 18 bo on flushing list
[   365.632] kgem_retire__requests_ring: ring=0, 0 outstanding requests, oldest=0
[   365.632] __kgem_busy: handle=3, busy=1, wedged=0
[   365.632] kgem_retire__requests_ring: ring=1, 2 outstanding requests, oldest=3
[   365.632] kgem_retire -- retired=0, need_retire=1
[   365.632] kgem_bo_is_busy: handle=64390, domain: 0 exec? 0, rq? 0
[   365.632] gen6_render_retire: resetting idle vbo
[   365.632] __kgem_bo_clear_busy: handle=3
[   365.632] __kgem_bo_clear_busy: handle=65024
[   365.632] kgem_cleanup_cache: need_expire?=0
[   365.632] __kgem_bo_map__cpu: failed to mmap handle=64326 (prime? 0), 65536 bytes, into CPU domain: 12
[   365.632] kgem_new_batch: unable to map batch bo, mallocing(size=65536)
[   365.632] _kgem_bo_destroy: handle=64326, proxy? 0
[   365.632] __kgem_bo_destroy: handle=64326, size=65536
[   365.632] __kgem_bo_destroy: handle=64326, purged
[   365.632] kgem_bo_move_to_inactive: moving handle=64326 to inactive
[   365.632] sigtrap_handler(sig=11) sigtrap=0
[   365.633] (EE) 
[   365.633] (EE) Backtrace:
[   365.634] (EE) 0: ? (?+0x190a720) [0x1eadba0]
[   365.638] (EE) 1: ? (?+0x190a720) [0x7f9638c6dc0f]
[   365.643] (EE) 2: /usr/lib64/libdw.so.1 (?+0x190a720) [0x7f9636c12086]
[   365.643] (EE) 3: /usr/lib64/libdw.so.1 (?+0x190a720) [0x7f9636c18cb4]
[   365.643] (EE) 4: /usr/lib64/libdw.so.1 (?+0x190a720) [0x7f9636c19c19]
[   365.644] (EE) 5: /usr/lib64/libdw.so.1 (?+0x190a720) [0x7f9636c1a637]
[   365.644] (EE) 6: /usr/lib64/libdw.so.1 (?+0x190a720) [0x7f9636c52a7b]
[   365.644] (EE) 7: /usr/lib64/libdw.so.1 (?+0x190a720) [0x7f9636c52c4a]
[   365.644] (EE) 8: ? (?+0x190a720) [0x1d49ac0]
[   365.644] (EE) 9: ? (?+0x190a720) [0x1ead290]
[   365.644] (EE) 10: ? (?+0x190a720) [0x1d4f900]
[   365.644] (EE) 11: ? (?+0x190a720) [0x1d40080]
[   365.644] (EE) 12: ? (?+0x190a720) [0x1d40270]
[   365.644] (EE) 13: ? (?+0x190a720) [0x1d44370]
[   365.644] (EE) 14: ? (?+0x190a720) [0x1d483d0]
[   365.644] (EE) 15: ? (?+0x190a720) [0x7f9638c5ad60]
[   365.644] (EE) 16: ? (?+0x190a720) [0x1d32550]
[   365.647] (EE) 17: ? (?+0x190a720) [0x190a720]
[   365.647] (EE) 
[   365.647] (EE) Segmentation fault at address 0x0
[   365.647] (EE) 
Fatal server error:
[   365.647] (EE) Caught signal 11 (Segmentation fault). Server aborting
Comment 20 Dominic 2015-08-07 10:50:50 UTC
Created attachment 117573 [details]
last 20k lines of Xorg.0.log (with the patches)

Here is the log with your modifications (20,000 lines).
From what it looks like with logging enabled (the driver is then sufficiently slow), the chunks could be individual characters. Links seems to draw them one at a time.

Xorg reaches ~20% MEM at the time of the crash - that should be about 4GiB. How can I see the object count reliably?
Comment 21 Chris Wilson 2015-08-07 11:13:25 UTC
(In reply to Dominic from comment #19)
> The patches change the segfault but cannot prevent it. The segfault I just
> got did not yield a meaningful backtrace, however. (heap corruption?)
> 
> [   365.632] kgem_cleanup_cache
> [   365.632] kgem_cleanup_cache: sync on cleanup
> [   365.632] __kgem_retire_requests_upto(handle=2)
> [   365.632] __kgem_retire_rq: request 2 complete
> [   365.632] kgem_retire, need_retire?=1
> [   365.632] __kgem_busy: handle=17, busy=1, wedged=0
> [   365.632] kgem_retire__flushing: 18 bo on flushing list
> [   365.632] kgem_retire__requests_ring: ring=0, 0 outstanding requests,
> oldest=0
> [   365.632] __kgem_busy: handle=3, busy=1, wedged=0
> [   365.632] kgem_retire__requests_ring: ring=1, 2 outstanding requests,
> oldest=3
> [   365.632] kgem_retire -- retired=0, need_retire=1
> [   365.632] kgem_bo_is_busy: handle=64390, domain: 0 exec? 0, rq? 0
> [   365.632] gen6_render_retire: resetting idle vbo
> [   365.632] __kgem_bo_clear_busy: handle=3
> [   365.632] __kgem_bo_clear_busy: handle=65024
> [   365.632] kgem_cleanup_cache: need_expire?=0
> [   365.632] __kgem_bo_map__cpu: failed to mmap handle=64326 (prime? 0),
> 65536 bytes, into CPU domain: 12
> [   365.632] kgem_new_batch: unable to map batch bo, mallocing(size=65536)
> [   365.632] _kgem_bo_destroy: handle=64326, proxy? 0
> [   365.632] __kgem_bo_destroy: handle=64326, size=65536
> [   365.632] __kgem_bo_destroy: handle=64326, purged
> [   365.632] kgem_bo_move_to_inactive: moving handle=64326 to inactive

Hmm. That's an odd place for the log to stop. I've tweaked the DBG statements, maybe that will be clearer?

(In reply to Dominic from comment #20)
> Xorg reaches ~20% MEM at the time of crash - that should be about 4GiB. How
> can I see the object count reliably?

cat /sys/kernel/debug/dri/0/i915_gem_objects
Comment 22 Chris Wilson 2015-08-07 12:01:50 UTC
The plot thickens: it is not even the first malloc/map failure for the batch that dies.

Still picking out symptoms:
commit d78685787a595ca68aec08074744fa23ed4f3d93
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 7 12:59:57 2015 +0100

    sna: Fix sync'ing to the most recent request on forced cache cleanup
Comment 23 Chris Wilson 2015-08-07 12:12:11 UTC
(In reply to Dominic from comment #19)
> The patches change the segfault but cannot prevent it. The segfault I just
> got did not yield a meaningful backtrace, however. (heap corruption?)

I suspect it is another malloc failure inside the unwinder. Though I still haven't found either the segfault or the sloppy cache.
Comment 24 Dominic 2015-08-07 12:31:09 UTC
Sorry for the delay. Here is another trace (without your last modifications, i.e. at 672436efd5b69fb9007cb80804a351b1e1572b60). Looks like before:

[   555.361] gen6_render_retire: resetting idle vbo
[   555.361] kgem_cleanup_cache
[   555.361] kgem_cleanup_cache: sync on cleanup
[   555.361] kgem_bo_wait: waiting for handle=2
[   555.361] __kgem_retire_requests_upto(handle=2)
[   555.361] __kgem_retire_rq: request 2 complete
[   555.361] kgem_retire, need_retire?=1
[   555.361] __kgem_busy: handle=10, busy=1, wedged=0
[   555.361] kgem_retire__flushing: 15 bo on flushing list
[   555.361] kgem_retire__requests_ring: ring=0, 0 outstanding requests, oldest=0
[   555.361] __kgem_busy: handle=3, busy=1, wedged=0
[   555.361] kgem_retire__requests_ring: ring=1, 2 outstanding requests, oldest=3
[   555.361] kgem_retire -- retired=0, need_retire=1
[   555.361] kgem_bo_is_busy: handle=16694, domain: 0 exec? 0, rq? 0
[   555.361] gen6_render_retire: resetting idle vbo
[   555.361] __kgem_bo_clear_busy: handle=3
[   555.361] __kgem_bo_clear_busy: handle=65003
[   555.361] kgem_cleanup_cache: need_expire?=0
[   555.363] __kgem_bo_map__cpu: failed to mmap handle=65127 (prime? 0), 65536 bytes, into CPU domain: 12
[   555.363] _kgem_bo_destroy: handle=65127, proxy? 0
[   555.363] __kgem_bo_destroy: handle=65127, size=65536
[   555.363] __kgem_bo_destroy: handle=65127, purged
[   555.363] kgem_bo_move_to_inactive: moving handle=65127 to inactive
[   555.363] sigtrap_handler(sig=11) sigtrap=0
[   555.365] (EE) 
[   555.366] (EE) Backtrace:
[   555.368] (EE) 0: ? (?+0x1c7e720) [0x2221ba0]
[   555.372] (EE) 1: ? (?+0x1c7e720) [0x7f143827dc0f]
[   555.380] (EE) 2: /lib64/librt.so.1 (?+0x1c7e720) [0x7f1436222094]
[   555.380] (EE) 3: /lib64/librt.so.1 (?+0x1c7e720) [0x7f1436228cde]
[   555.380] (EE) 4: /lib64/librt.so.1 (?+0x1c7e720) [0x7f1436229c43]
[   555.380] (EE) 5: /lib64/librt.so.1 (?+0x1c7e720) [0x7f143622a661]
[   555.380] (EE) 6: /lib64/librt.so.1 (?+0x1c7e720) [0x7f1436262aa5]
[   555.380] (EE) 7: /lib64/librt.so.1 (?+0x1c7e720) [0x7f1436262c74]
[   555.380] (EE) 8: ? (?+0x1c7e720) [0x20bdac0]
[   555.380] (EE) 9: ? (?+0x1c7e720) [0x2221290]
[   555.381] (EE) 10: ? (?+0x1c7e720) [0x20c3900]
[   555.381] (EE) 11: ? (?+0x1c7e720) [0x219aed0]
[   555.381] (EE) 12: ? (?+0x1c7e720) [0x21a0910]
[   555.381] (EE) 13: ? (?+0x1c7e720) [0x21a1270]
[   555.381] (EE) 14: ? (?+0x1c7e720) [0x21a4770]
[   555.381] (EE) 15: ? (?+0x1c7e720) [0x20b4270]
[   555.381] (EE) 16: ? (?+0x1c7e720) [0x20b8370]
[   555.381] (EE) 17: ? (?+0x1c7e720) [0x20bc3d0]
[   555.381] (EE) 18: ? (?+0x1c7e720) [0x7f143826ad60]
[   555.381] (EE) 19: ? (?+0x1c7e720) [0x20a6550]
[   555.385] (EE) 20: ? (?+0x1c7e720) [0x1c7e720]
[   555.385] (EE) 
[   555.385] (EE) Segmentation fault at address 0x80
[   555.385] (EE) 
Fatal server error:
[   555.385] (EE) Caught signal 11 (Segmentation fault). Server aborting

There were somewhere over 47k objects in use at that time.
Comment 25 Dominic 2015-08-07 12:32:05 UTC
Created attachment 117578 [details]
last 20k lines of Xorg.0.log (at 672436efd5b69fb9007cb80804a351b1e1572b60)
Comment 26 Dominic 2015-08-07 12:32:58 UTC
Will try the most recent changes now.
Comment 27 Dominic 2015-08-07 13:07:37 UTC
I was not able to trigger any crashes so far. The memory footprint drops as soon as I stop selecting/deselecting text.

However, I was able to trigger the OOM condition by *not* stopping ((de)selecting every 2-3s or so on my machine is enough to keep Xorg busy on its CPU core and accumulating more and more objects). After about 65k objects, using a tad more than 4GiB, it starts freeing.

-----
65707 objects, 4687548416 bytes
24237 [15] objects, 1699127296 [8998912] bytes in gtt
  0 [0] active objects, 0 [0] bytes
  24237 [15] inactive objects, 1699127296 [8998912] bytes
0 unbound objects, 0 bytes
0 purgeable objects, 0 bytes
6 pinned mappable objects, 8962048 bytes
7 fault mappable objects, 40960 bytes
2147483648 [268435456] gtt total

batch pool: 1 objects, 8192 bytes (0 active, 0 inactive, 0 global, 0 shared, 8192 unbound)

systemd-logind: 83 objects, 37683200 bytes (0 active, 0 inactive, 0 global, 5242880 shared, 37601280 unbound)
Xwayland: 21 objects, 7675904 bytes (0 active, 0 inactive, 0 global, 0 shared, 7598080 unbound)
Xwayland: 0 objects, 0 bytes (0 active, 0 inactive, 0 global, 0 shared, 0 unbound)
systemd-logind: 65333 objects, 4518924288 bytes (0 active, 1677623296 inactive, 1677623296 global, 35459072 shared, 2835603456 unbound)
gnome-shell: 262 objects, 149680128 bytes (0 active, 29831168 inactive, 29831168 global, 25890816 shared, 100548608 unbound)
-----

I collected a huge log containing two "garbage collections", terminating Xorg via SysRq as soon as I saw the second one happening.
I'll try to attach the last 1M lines of it.
Comment 28 Dominic 2015-08-07 13:09:47 UTC
Created attachment 117579 [details]
last 20k lines of Xorg.0.log (at 6bab82b91d64af2e31d9f5fe7705c68caf09004b, no crash)
Comment 29 Chris Wilson 2015-08-07 13:29:53 UTC
Hmm, could do with the million lines before that :) In that log it looks like we never managed to hit the oom and so didn't trigger the bug.
Comment 30 Dominic 2015-08-07 13:32:01 UTC
Created attachment 117580 [details]
second to last 1M lines of Xorg.0.log

The OOM condition did not happen in the last million lines; attaching the 1M chunk before that, which contains it:

...
[   511.122] gem_userptr: failed to map 0x7f42b7290000 + 3887104 bytes: 19
[   511.122] kgem_create_map: import failed, errno=19
...
Comment 31 Dominic 2015-08-07 13:32:37 UTC
Yeah, I just noticed :)
Comment 32 Dominic 2015-08-07 13:38:06 UTC
Created attachment 117581 [details]
from 3M to 2M before end of Xorg.0.log

And for what it's worth, the chunk before that as well.
Comment 33 Chris Wilson 2015-08-07 14:30:46 UTC
I think this should reduce the stress:

commit 07bcd3f0c715a37ac24a630173e0e28056567421
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 7 15:26:30 2015 +0100

    sna: Use direct uploads into a Damage tracked for v.small operations
    
    If the batch is being frequently flushed due to Damage+Composite
    tracking, we cannot amalgamate the small copies into the destination and
    so end up sending thousands and thousands of tiny batches with tiny
    render copies hogging the GPU.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=91577
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
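
The idea, as a simplified sketch (the threshold and function names are illustrative, not the actual sna code): when the batch is being flushed constantly, very small copies go straight to a CPU upload instead of queuing yet another tiny GPU batch.

struct pixmap_like;                     /* stand-in types, not the real sna ones */
struct box_like { int x1, y1, x2, y2; };

extern int  batch_is_thrashing(void);                               /* hypothetical */
extern void cpu_upload(struct pixmap_like *dst,
                       const struct box_like *b, const void *src);  /* hypothetical */
extern void gpu_render_copy(struct pixmap_like *dst,
                            const struct box_like *b, const void *src);

#define TINY_COPY_BYTES 4096            /* illustrative cutoff, not the real one */

static void copy_box(struct pixmap_like *dst,
                     const struct box_like *b, const void *src, int cpp)
{
        long bytes = (long)(b->x2 - b->x1) * (b->y2 - b->y1) * cpp;

        if (batch_is_thrashing() && bytes < TINY_COPY_BYTES)
                cpu_upload(dst, b, src);        /* skip yet another tiny batch */
        else
                gpu_render_copy(dst, b, src);   /* normal accelerated path */
}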
Comment 34 Dominic 2015-08-07 14:58:20 UTC
Yes. I can no longer push Xorg's memory footprint above 300MiB using Links alone. That is way less than it used to be while just idling after starting GNOME, before...

Would you still like me to upload chunks from Xorg.0.log?
Comment 35 Chris Wilson 2015-08-07 15:06:03 UTC
(In reply to Dominic from comment #34)
> Would you still like me to upload chunks from Xorg.0.log?

For completeness, please do. Maybe something else will stand out now.
Comment 36 Dominic 2015-08-07 15:15:46 UTC
Created attachment 117582 [details]
Chunk 1/3
Comment 37 Dominic 2015-08-07 15:20:35 UTC
Created attachment 117583 [details]
Chunk 2/3
Comment 38 Dominic 2015-08-07 15:22:59 UTC
Created attachment 117584 [details]
Chunk 3/3

As before, in reverse order: Chunk 1 is the last 1M lines, Chunk 2 the 1M before it, and Chunk 3 the 1M before that.
Comment 39 Chris Wilson 2015-08-07 16:30:40 UTC
Ok, made another tweak based on the new logs.

I think we have addressed the buggy failure handling in submission and the buggy throttling when hitting resource starvation - but I haven't worked out where that last segfault was coming from. So there is a crash lurking in there somewhere with resource starvation. :(

In the meantime, however, we have stopped spamming the GPU and stopped running out of memory in the first place. Overall, I think I can treat the bug as resolved...

Thanks for the bug report, all the testing and feedback. Please reopen if you can find a way to trigger it once more.
Comment 40 Chris Wilson 2015-10-16 16:16:30 UTC
*** Bug 92062 has been marked as a duplicate of this bug. ***
Comment 41 Joe Peterson 2015-10-30 16:24:39 UTC
Oops, I commented on the dup... here's the info again. Chris, can you verify that the version I am using is older than your fix and that you believe that a newer version should fix this? I can then try a newer build of xf86-video-intel...

I am getting similar crashes. I am at these versions:

xorg-server 1.17.3-1
xf86-video-intel 1:2.99.917+478+gdf72bc5-2

[  4653.592] (EE) Backtrace:
[  4653.592] (EE) 0: /usr/lib/xorg-server/Xorg (OsLookupColor+0x139) [0x596d09]
[  4653.593] (EE) 1: /usr/lib/libc.so.6 (__restore_rt+0x0) [0x7f5ce091c67f]
[  4653.594] (EE) 2: /usr/lib/xorg/modules/drivers/intel_drv.so (_init+0x144c0) [0x7f5cdb4c8ee0]
[  4653.595] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (_init+0x1589f) [0x7f5cdb4cb6bf]
[  4653.595] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (_init+0x49879) [0x7f5cdb5332b9]
[  4653.595] (EE) 5: /usr/lib/xorg/modules/drivers/intel_drv.so (_init+0x49ae2) [0x7f5cdb533932]
[  4653.596] (EE) 6: /usr/lib/xorg-server/Xorg (DamageRegionAppend+0x3783) [0x5216c3]
[  4653.596] (EE) 7: /usr/lib/xorg-server/Xorg (SyncVerifyFence+0x2205) [0x4d8ec5]
[  4653.596] (EE) 8: /usr/lib/xorg-server/Xorg (SyncVerifyFence+0x3645) [0x4dbb85]
[  4653.596] (EE) 9: /usr/lib/xorg-server/Xorg (SendErrorToClient+0x2ff) [0x438fbf]
[  4653.596] (EE) 10: /usr/lib/xorg-server/Xorg (remove_fs_handlers+0x41b) [0x43d09b]
[  4653.598] (EE) 11: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7f5ce0909610]
[  4653.598] (EE) 12: /usr/lib/xorg-server/Xorg (_start+0x29) [0x427489]
[  4653.598] (EE) 13: ? (?+0x29) [0x29]
[  4653.598] (EE) 
[  4653.598] (EE) Segmentation fault at address 0x0
[  4653.598] (EE) 
Fatal server error:
[  4653.598] (EE) Caught signal 11 (Segmentation fault). Server aborting
Comment 42 Joe Peterson 2015-10-30 22:25:52 UTC
(In reply to Joe Peterson from comment #41)
> Oops, I commented on the dup... here's the info again. Chris, can you verify
> that the version I am using is older than your fix and that you believe that
> a newer version should fix this? I can then try a newer build of
> xf86-video-intel...
> 
> I am getting similar crashes. I am at these versions:
> 
> xorg-server 1.17.3-1
> xf86-video-intel 1:2.99.917+478+gdf72bc5-2


I had not realized that OsLookupColor is a red herring. My issue is likely unrelated - opening a new bug with a real backtrace...
Comment 43 vukasin0 2015-11-15 18:22:31 UTC
This issue still persists on Fedora 22 (KDE && GNOME 3).


Installed versions: 
> xorg-x11-server-Xorg-1.17.4-1.fc22.x86_64
> xorg-x11-drv-intel-2.99.917-15.20150729.fc22.x86_64


Log:

> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE)
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) Backtrace:
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 0: /usr/libexec/Xorg (OsLookupColor+0x139) [0x59a6a9]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 1: /lib64/libc.so.6 (__restore_rt+0x0) [0x7fc72725fa4f]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 2: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0xa9dc) [0x7fc721e7008c]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0xbe07) [0x7fc721e728c7]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x1aab7) [0x7fc721e901a7]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x1ad11) [0x7fc721e90701]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 6: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x28d22) [0x7fc721eac282]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 7: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x2b993) [0x7fc721eb1d03]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 8: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x3f8ea) [0x7fc721ed9d8a]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x3fe0a) [0x7fc721eda6ba]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 10: /usr/libexec/Xorg (DamageRegionAppend+0x3783) [0x5255f3]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 11: /usr/libexec/Xorg (SendGraphicsExpose+0xeb3) [0x437ea3]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 12: /usr/libexec/Xorg (SendErrorToClient+0x2ff) [0x43a64f]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 13: /usr/libexec/Xorg (remove_fs_handlers+0x41b) [0x43e72b]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 14: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7fc72724b700]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 15: /usr/libexec/Xorg (_start+0x29) [0x428b19]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) 16: ? (?+0x29) [0x29]
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE)
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) Segmentation fault at address 0x7fc744d04de0
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE)
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: Fatal server error:
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) Caught signal 11 (Segmentation fault). Server aborting
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE)
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE)
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: Please consult the Fedora Project support
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: at http://wiki.x.org
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: for help.
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE) Please also check the log file at "/home/vukasin/.local/share/xorg/Xorg.0.log" for additional information.
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (EE)
> Nov 15 21:55:32 localhost /usr/libexec/gdm-x-session: (II) AIGLX: Suspending AIGLX clients for VT switch
> Nov 15 21:55:33 localhost /usr/libexec/gdm-x-session: (EE) Server terminated with error (1). Closing log file.
Comment 44 Chris Wilson 2015-11-15 19:20:58 UTC
Please have the courtesy of testing upstream before claiming the bug still exists.

