67504 – intel driver crash in has_offload_slaves

Summary: intel driver crash in has_offload_slaves

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-07-29 20:27 UTC by Danny
Modified:	2013-08-05 15:40 UTC
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
gdb output (59.60 KB, text/plain) 2013-08-01 23:01 UTC, Danny
File name glitch (94.85 KB, image/png) 2013-08-05 14:35 UTC, Tadas

Description Danny 2013-07-29 20:27:39 UTC

This is on a ThinkPad T410 with Intel HD Graphics 5700MHD.

Since updating again to xorg-x11-drv-intel-2.21.12-1.fc18.x86_64 on Fedora 18 (.8 also showed this same bug; 2.20 did not), I can reliably reproduce a segfault by minimizing and maximizing windows of some specific apps (it sometimes also occurs under other conditions). This is with SNA enabled.

Xorg log backtrace seems bogus:

[220957.400] (EE)
[220957.400] (EE) Backtrace:
[220957.425] (EE) 0: /usr/bin/X (OsLookupColor+0x139) [0x472509]
[220957.427] (EE) 1: /lib64/libpthread.so.0 (__restore_rt+0x0) [0x3d9d00efff]
[220957.435] (EE) 2: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x2cad6) [0x7f6737f4ed96]
[220957.440] (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (_init+0x4befe) [0x7f6737f8d30e]
[220957.441] (EE) 4: /usr/bin/X (BlockHandler+0x44) [0x43d9c4]
[220957.442] (EE) 5: /usr/bin/X (WaitForSomething+0x114) [0x469e84]
[220957.442] (EE) 6: /usr/bin/X (SendErrorToClient+0xe1) [0x4395d1]
[220957.443] (EE) 7: /usr/bin/X (_init+0x3a7a) [0x42b98a]
[220957.446] (EE) 8: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x3d9c821a05]
[220957.447] (EE) 9: /usr/bin/X (_start+0x29) [0x428621]
[220957.447] (EE)
[220957.447] (EE) Segmentation fault at address 0x26
[220957.447]
Fatal server error:
[220957.448] Caught signal 11 (Segmentation fault). Server aborting
[220957.448]
[220957.448] (EE)

gdb backtrace shows:
Program received signal SIGSEGV, Segmentation fault.
has_offload_slaves (sna=0x7f6736c30000) at sna_accel.c:14700
14700           ScreenPtr screen = sna->scrn->pScreen;
(gdb) bt
#0  has_offload_slaves (sna=0x7f6736c30000) at sna_accel.c:14700
#1  0x00007f6737f416ee in stop_flush (scanout=0x10d7ab0, sna=0x7f6736c30000)
    at sna_accel.c:14763
#2  sna_accel_flush (sna=0x7f6736c30000) at sna_accel.c:14985
#3  sna_accel_block_handler (sna=0x7f6736c30000, tv=0x7fffda2846d8)
    at sna_accel.c:15425
#4  0x000000000043d9c4 in BlockHandler (
    pTimeout=pTimeout@entry=0x7fffda2846d8,
    pReadmask=pReadmask@entry=0x81c340 <LastSelectMask>) at dixutils.c:387
#5  0x0000000000469e84 in WaitForSomething (
    pClientsReady=pClientsReady@entry=0x133fa20) at WaitFor.c:210
#6  0x0000000000439581 in Dispatch () at dispatch.c:357
#7  0x00000000004282da in main (argc=9, argv=0x7fffda284ae8,
    envp=<optimized out>) at main.c:298

Comment 1 Chris Wilson 2013-07-29 20:37:35 UTC

Can you recompile with --enable-debug=full, then grab a gdb "bt full" and attach the Xorg.0.log?

Comment 2 Chris Wilson 2013-07-30 15:10:44 UTC

Note that sna->scrn being NULL there is an indication of memory corruption, as X cannot start unless it sets that. An alternate possibility is that the BlockHandler is running after we shut down... Again, I think that is impossible.

Comment 3 Danny 2013-07-30 20:18:13 UTC

I did recompile with full debug, but unfortunately I cannot crash it with the same procedure. Everything is a lot slower, however, so maybe timing is important.

I suspected memory corruption as well, as triggering the crash seems to rely on there being considerable memory load on the system.

Would valgrind help (you'd need to tell me how to run that with X though)?

After reinstalling the non-debug version, I can reproduce it quite easily again:

(gdb) bt full
#0  has_offload_slaves (sna=0x7f0eb6568000) at sna_accel.c:14700
        screen = <optimized out>
        dirty = <optimized out>
#1  0x00007f0eb78796ee in stop_flush (scanout=0x214a660, sna=0x7f0eb6568000)
    at sna_accel.c:14763
No locals.
#2  sna_accel_flush (sna=0x7f0eb6568000) at sna_accel.c:14985
        priv = 0x214a660
        busy = false
#3  sna_accel_block_handler (sna=0x7f0eb6568000, tv=0x7fff2aed2c58)
    at sna_accel.c:15425
No locals.
#4  0x000000000043d9c4 in BlockHandler (
    pTimeout=pTimeout@entry=0x7fff2aed2c58, 
    pReadmask=pReadmask@entry=0x81c340 <LastSelectMask>) at dixutils.c:387
        i = 0
        j = <optimized out>
#5  0x0000000000469e84 in WaitForSomething (
    pClientsReady=pClientsReady@entry=0x23a7a30) at WaitFor.c:210
        i = <optimized out>
        waittime = {tv_sec = 153, tv_usec = 65000}
        wt = 0x7fff2aed2c60
        timeout = <optimized out>
        clientsReadable = {fds_bits = {0 <repeats 16 times>}}
        clientsWritable = {fds_bits = {1, 1, 4294967295, 5415558, 34730504, 
            140733913574944, 34712528, 0, 34712528, 34730504, 0, 206158430224, 
            140733913574960, 140733913574752, 16, 264619602646}}
        selecterr = <optimized out>
        nready = 0
        devicesReadable = {fds_bits = {55, 1, 140733913574912, 48, 43118368, 
            4689382, 48, 43118368, 50831424, 4651885, 1, 48, 0, 0, 0, 
            46284816}}
        now = <optimized out>
        someReady = 0
#6  0x0000000000439581 in Dispatch () at dispatch.c:357
        clientReady = 0x23a7a30
        result = <optimized out>
        client = <optimized out>
        nready = <optimized out>
        icheck = 0x8163f0 <checkForInput>
        start_tick = <optimized out>
#7  0x00000000004282da in main (argc=9, argv=0x7fff2aed3068, 
    envp=<optimized out>) at main.c:298
        i = <optimized out>
        alwaysCheckForInput = {0, 1}

Comment 4 Danny 2013-07-30 20:55:09 UTC

Just compiled 2.21.13, as it is supposed to fix some memory corruption, but it still crashes...

Comment 5 Chris Wilson 2013-07-30 21:12:35 UTC

Can you try CFLAGS="-O0 -g3" ./configure <blah>?

Comment 6 Chris Wilson 2013-07-30 21:16:29 UTC

To use valgrind, do ./configure --enable-debug, then I find it easier to launch X by hand, so something like:

$ sudo valgrind --trace-children=yes /usr/bin/Xorg -ac -noreset 2>&1 | tee /tmp/xorg.txt

switch back to a second VT, or log in remotely, then
$ DISPLAY=:0 gnome-session

switch back to X

Running under valgrind, you will notice a slowdown, but not quite as much as perhaps you would imagine.

Comment 7 Danny 2013-07-30 21:51:37 UTC

With -O0 -g3 it was not so easy to trigger, but:

Program received signal SIGSEGV, Segmentation fault.
0x00007f1472b8b88c in has_offload_slaves (sna=0x7f1471868000)
    at sna_accel.c:14747
14747	ScreenPtr screen = sna->scrn->pScreen;
(gdb) bt full
#0  0x00007f1472b8b88c in has_offload_slaves (sna=0x7f1471868000)
    at sna_accel.c:14747
        screen = 0x2085620
        dirty = 0x7fff0733c890
#1  0x00007f1472b8ba3b in stop_flush (sna=0x7f1471868000, scanout=0x207aa60)
    at sna_accel.c:14810
No locals.
#2  0x00007f1472b8c1bf in sna_accel_flush (sna=0x7f1471868000)
    at sna_accel.c:15032
        priv = 0x207aa60
        busy = false
#3  0x00007f1472b8cf29 in sna_accel_block_handler (sna=0x7f1471868000, 
    tv=0x7fff0733c998) at sna_accel.c:15472
No locals.
#4  0x00007f1472ba67d8 in sna_block_handler (arg=0x2054420, 
    timeout=0x7fff0733c998, read_mask=0x81c340 <LastSelectMask>)
    at sna_driver.c:557
        sna = 0x7f1471868000
        tv = 0x7fff0733c998
#5  0x000000000043d9c4 in BlockHandler (
    pTimeout=pTimeout@entry=0x7fff0733c998, 
    pReadmask=pReadmask@entry=0x81c340 <LastSelectMask>) at dixutils.c:387
        i = 0
        j = <optimized out>
#6  0x0000000000469e84 in WaitForSomething (
    pClientsReady=pClientsReady@entry=0x22e29d0) at WaitFor.c:210
        i = <optimized out>
        waittime = {tv_sec = 296, tv_usec = 432000}
        wt = 0x7fff0733c9a0
        timeout = <optimized out>
        clientsReadable = {fds_bits = {0 <repeats 16 times>}}
        clientsWritable = {fds_bits = {1, 1, 0, 34103520, 0, 0, 33899552, 
            5413855, 33899552, 46103504, 77595664, 206158430224, 
            140733314222960, 140733314222752, 46041312, 264619602646}}
        selecterr = <optimized out>
        nready = 0
        devicesReadable = {fds_bits = {57, 1, 140733314222912, 32, 46103344, 
            4689382, 32, 46103344, 51602000, 4651885, 0, 32, 0, 0, 0, 
            46103440}}
        now = <optimized out>
        someReady = 0
#7  0x0000000000439581 in Dispatch () at dispatch.c:357
        clientReady = 0x22e29d0
        result = <optimized out>
        client = <optimized out>
        nready = <optimized out>
        icheck = 0x8163f0 <checkForInput>
        start_tick = <optimized out>
#8  0x00000000004282da in main (argc=9, argv=0x7fff0733cda8, 
    envp=<optimized out>) at main.c:298
        i = <optimized out>
        alwaysCheckForInput = {0, 1}
(gdb) 

Valgrind will have to wait until tomorrow at least...

Comment 8 Chris Wilson 2013-07-30 22:08:28 UTC

Next time you see a crash, please p *sna and p *sna->scrn

Comment 9 Danny 2013-08-01 23:01:29 UTC

Created attachment 83493 [details]
gdb output

Work is keeping me too busy, so it took some time. Anyway, I finally managed to get the gdb output at least.

Comment 10 Chris Wilson 2013-08-02 07:12:44 UTC

Ok, this is starting to make sense. Buffer overflow in the relocation array. If you run with --enable-debug this should trigger an assertion. So if you could recompile and run under gdb, that would be invaluable. Meanwhile I'll look for paths where I've made an incorrect check.
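
As an illustration of the kind of overrun described here, the following is a minimal C sketch, not the driver's code. It assumes a hypothetical, tiny relocation array (NRELOC) followed directly in memory by one word standing in for a neighbouring pointer such as sna->scrn; the real structure layout is not shown in this report. All names (struct model, add_reloc_unchecked, add_reloc_checked) are invented for the example. The unchecked append lets one entry too many land on the neighbour, while the checked append refuses it, which is roughly the condition an assertion in a debug build would be expected to catch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NRELOC    4        /* hypothetical capacity, tiny for illustration */
#define SCRN_SLOT NRELOC   /* word that sits right after the array */

/* Flat model of memory: NRELOC relocation slots followed by one word
 * standing in for a neighbouring pointer (assumed layout, not the real one). */
struct model {
    uint64_t words[NRELOC + 1];
    size_t nreloc;
};

static void model_init(struct model *m, uint64_t scrn)
{
    for (size_t i = 0; i < NRELOC; i++)
        m->words[i] = 0;
    m->words[SCRN_SLOT] = scrn;
    m->nreloc = 0;
}

/* Release-style append: no capacity check, so the entry written at
 * index NRELOC overwrites the neighbouring word. */
static void add_reloc_unchecked(struct model *m, uint64_t value)
{
    assert(m->nreloc <= NRELOC); /* keeps the model itself within words[] */
    m->words[m->nreloc++] = value;
}

/* Debug-style append: report failure instead of overrunning the array. */
static int add_reloc_checked(struct model *m, uint64_t value)
{
    if (m->nreloc >= NRELOC)
        return 0;
    m->words[m->nreloc++] = value;
    return 1;
}
```

Appending NRELOC + 1 entries with add_reloc_unchecked replaces the value in SCRN_SLOT, whereas add_reloc_checked returns 0 on the last append and leaves that word intact.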

Comment 11 Chris Wilson 2013-08-02 12:21:31 UTC

I think I understand it:

commit 5287660aafe45859c07874c22dca99c1ff5e555a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 13:18:12 2013 +0100

    sna: Reserve relocation entries for the deferred VBO
    
    Whilst we reserved exec entry slots for the deferred VBO, there were no
    relocation spaces reserved. So if we submitted a render command followed
    by a multitude of BLT copies, we could then overrun the relocation array
    when adding the deferred vbo to the batch.
    
    Reported-by: Danny <moondrake@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=67504
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Please do check that this indeed fixes the problem, thanks!
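
The accounting problem in the commit message can be sketched in C as follows. This is a simplified model under stated assumptions, not the actual patch: the capacity (MAX_RELOC), the per-command relocation counts, and every function name here are hypothetical. It shows a render command that defers its VBO relocation until submit, followed by BLT copies that fill whatever room the check reports, with and without entries held in reserve.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RELOC     8  /* hypothetical capacity of the relocation array */
#define RENDER_RELOCS 2  /* hypothetical: entries a render command uses now */
#define BLT_RELOCS    2  /* hypothetical: source and destination of one copy */
#define VBO_RELOCS    1  /* hypothetical: entries the deferred VBO needs */

struct batch {
    int nreloc;        /* relocation entries used so far */
    int reserved;      /* entries held back for the deferred VBO */
    bool vbo_pending;  /* a VBO relocation is still owed at submit */
};

/* Room check used when adding commands: the reserve is never handed out. */
static bool batch_has_room(const struct batch *b, int n)
{
    return b->nreloc + n <= MAX_RELOC - b->reserved;
}

/* Render command: takes its entries now, defers the VBO to submit time. */
static bool batch_add_render(struct batch *b)
{
    if (!batch_has_room(b, RENDER_RELOCS))
        return false;
    b->nreloc += RENDER_RELOCS;
    b->vbo_pending = true;
    return true;
}

static bool batch_add_blt(struct batch *b)
{
    if (!batch_has_room(b, BLT_RELOCS))
        return false;
    b->nreloc += BLT_RELOCS;
    return true;
}

/* Submit: the deferred VBO is added last, without a room check.
 * Returns the final count; a value above MAX_RELOC models an overrun. */
static int batch_submit(struct batch *b)
{
    if (b->vbo_pending)
        b->nreloc += VBO_RELOCS;
    return b->nreloc;
}

/* One render command followed by as many BLT copies as the check allows. */
static int simulate(int reserved)
{
    struct batch b = { 0, reserved, false };
    batch_add_render(&b);
    while (batch_add_blt(&b))
        ;
    return batch_submit(&b);
}
```

In this model simulate(0) ends with 9 entries, one past the limit of 8, while simulate(VBO_RELOCS) ends with 7 because the room check stops the copies early enough to leave space for the VBO.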

Comment 12 Danny 2013-08-02 20:17:08 UTC

You were quicker than I could recompile with debug. As far as I can tell, this has fixed the issue. Should that change, I will let you know. Thanks!

d.

Comment 13 Tadas 2013-08-05 14:35:26 UTC

Created attachment 83666 [details]
File name glitch

If I have a file or folder whose name wraps onto more than one line, and I select part of the name across two or more lines, then the rest of the text on the lines containing the selection is rendered white instead of black.

Comment 14 Chris Wilson 2013-08-05 15:40:49 UTC

glitch? Behaves the same for sna/uxa/fb, so I presume the bug is in the rendering commands i.e. higher up the stack.
