Bug 70944 - 2.99.905 sigbus frequently when running "x11perf -copyplane500" with SNA
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: All All
Importance: medium critical
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
 
Reported: 2013-10-28 07:34 UTC by intelgraphics7
Modified: 2013-10-31 08:19 UTC

Attachments
Fix for SNA segfault (2.37 KB, text/plain)
2013-10-28 07:34 UTC, intelgraphics7
Related xorg.conf (2.87 KB, text/plain)
2013-10-28 09:35 UTC, intelgraphics7
Extensive debug log output (2.53 MB, text/plain)
2013-10-28 12:43 UTC, intelgraphics7
Extensive debug log output (945.61 KB, application/octet-stream)
2013-10-29 10:59 UTC, intelgraphics7

Description intelgraphics7 2013-10-28 07:34:15 UTC
Created attachment 88206 [details]
Fix for SNA segfault

Running "x11perf -copyplane500" crashes the Xserver in the intel driver code with a segfault in sna_accel.c:7454 (sna_copy_bitmap_blt()).

The attached patch fixes that for me.
It makes sure that the line length variable bw is rounded up correctly.
Otherwise the line length is too short and not enough graphics memory is allocated for the given operation.

The patch applies this fix to all calculations of bw to make sure that functions other than sna_copy_bitmap_blt() don't cause segfaults as well.

Please comment on the patch and include it in future releases if it is correct from your point of view.
Comment 1 Chris Wilson 2013-10-28 07:55:39 UTC
The input (bx1, bx2) are always multiples of 8, so (bx2-bx1)/8 == (bx2-bx1+7)/8. Besides which that should not cause a crash but misrendering...
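
The arithmetic behind that statement can be checked in isolation. The following is only a sketch: the helper names are hypothetical and are not taken from sna_accel.c, and it assumes non-negative coordinates that are aligned to multiples of 8 before the divide, as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers for illustration; not the driver's code. */

/* Round a non-negative x down or up to a multiple of 8. */
static int align_down8(int x) { return x & ~7; }
static int align_up8(int x)   { return (x + 7) & ~7; }

/* Byte width of a bit span using truncating division. */
static int bytes_truncating(int bx1, int bx2) { return (bx2 - bx1) / 8; }

/* Byte width of a bit span using round-up division. */
static int bytes_rounded_up(int bx1, int bx2) { return (bx2 - bx1 + 7) / 8; }

/* True if both divisions give the same result once the span is aligned. */
static bool divisions_agree(int x1, int x2)
{
    int bx1 = align_down8(x1);
    int bx2 = align_up8(x2);

    assert(bx1 % 8 == 0 && bx2 % 8 == 0);
    return bytes_truncating(bx1, bx2) == bytes_rounded_up(bx1, bx2);
}
```

For an unaligned span such as 74..574 the two divisions differ (62 versus 63 bytes), but once the endpoints are aligned to 72..576 both give 63, so adding 7 before the divide changes nothing.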
Comment 2 intelgraphics7 2013-10-28 08:10:43 UTC
But it seems that there are chances that (bx2-bx1) is not a multiple of 8.
Maybe it is related to the "ZaphodHeads" option. My xorg.conf has defined two separate screens and uses the "ZaphodHeads" option in the driver sections.
Without my patch the driver crashes after a few cycles of running the copyplane500 test; with my patch it does not.
Maybe bx2 or bx1 are not rounded correctly somewhere else?
Comment 3 Chris Wilson 2013-10-28 08:25:45 UTC
They all seem to be rounded correctly prior to the divide by 8. Please do attach your Xorg.0.log and if you can, the bt from gdb.
Comment 4 intelgraphics7 2013-10-28 09:33:06 UTC
Here is the GDB backtrace:

Program received signal SIGBUS, Bus error.
sna_copy_bitmap_blt (_bitmap=0x80564a00, drawable=0x80575c60, gc=0x80577f80, region=0xbffff784, sx=74, sy=73, bitplane=1, closure=0xbffff808)
    at sna_accel.c:7454
7454    sna_accel.c: No such file or directory.
(gdb) bt
#0  sna_copy_bitmap_blt (_bitmap=0x80564a00, drawable=0x80575c60, gc=0x80577f80, region=0xbffff784, sx=74, sy=73, bitplane=1, closure=0xbffff808)
    at sna_accel.c:7454
#1  0x0067c5e3 in sna_do_copy (src=0x80564a00, dst=0x80575c60, gc=0x80577f80, sx=91, sy=92, width=500, height=500, dx=17, dy=19, 
    copy=0x676630 <sna_copy_bitmap_blt>, bitPlane=1, closure=0xbffff808) at sna_accel.c:6166
#2  0x00691758 in sna_copy_plane (src=0x80564a00, dst=0x80575c60, gc=0x80577f80, src_x=91, src_y=92, w=500, h=500, dst_x=14, dst_y=16, bit=1)
    at sna_accel.c:7773
#3  0x801115cd in ?? ()
#4  0x800e54ac in ?? ()
#5  0x800379ed in ?? ()
#6  0x800253ea in ?? ()
#7  0x003204d3 in __libc_start_main () from /lib/i386-linux-gnu/libc.so.6
#8  0x80025729 in _start ()
Comment 5 intelgraphics7 2013-10-28 09:35:56 UTC
Created attachment 88222 [details]
Related xorg.conf
Comment 6 intelgraphics7 2013-10-28 09:39:52 UTC
To reproduce the issue I just use "./x11perf -copyplane500".
The issue does not always happen on the first run of x11perf but on the second or third I usually get the issue. Maybe that helps.
Comment 7 Chris Wilson 2013-10-28 09:42:14 UTC
So the likely cause there is bstride == 0, which should only happen if box->x1 == box->x2.

Can you please compile with ./configure --enable-debug=full and see if it still reproduces the error? If so, please compress the Xorg.0.log and attach.
Comment 8 intelgraphics7 2013-10-28 09:47:44 UTC
bstride seems not to be 0 as you can see in the info below. But dst is out of bounds.
Unfortunately bx1 and bx2 were optimized out by the compiler:-(
Nevertheless I'll compile as you suggested and let you know the results.

(gdb) info locals
i = 64
upload = 0x8058ff20
ptr = 0x8a7efe00
bx1 = <optimized out>
bh = 44
bstride = 64
src_stride = <optimized out>
dst = 0x8a7f7000 <Address 0x8a7f7000 out of bounds>
bx2 = <optimized out>
src = 0x80573d60 "\002"
b = 0xb6f25100
pixmap = 0x80573d5f
sna = 0xb6f24000
arg = 0xbffff808
bitmap = 0x80569aa0
br00 = 3147776
br13 = 63702912
dx = 0
dy = 0
box = 0xbffff784
n = 1
Comment 9 intelgraphics7 2013-10-28 09:48:42 UTC
Note that the "info locals" are from a different run than the backtrace above. So addresses may differ.
Comment 10 intelgraphics7 2013-10-28 12:43:15 UTC
Created attachment 88228 [details]
Extensive debug log output

Skipped lots of content in the middle of the log file due to the upload size limit (see the "SNIP" marker in the file).
Comment 11 Chris Wilson 2013-10-28 12:53:00 UTC
Missing vital information between 6409s and 6411s. Can you please upload say the last 1000 lines?
Comment 12 intelgraphics7 2013-10-28 13:19:06 UTC
I did not cut away anything there. That is how it was in the log file.
But you're right, it seems there is something missing. I'll try to do the test once again to see if I get more output there this time.
Comment 13 intelgraphics7 2013-10-29 10:59:39 UTC
Created attachment 88283 [details]
Extensive debug log output

This now should include the crucial info.
Comment 14 Chris Wilson 2013-10-29 14:08:06 UTC
The upload still looks ok. I've placed assertions around the pointer access to verify that they are within bounds:

commit 0e6aca90c7b0b9edd5873034bcf0f3d8b2a9f065
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 29 13:50:51 2013 +0000

    sna: asserts bitmap uploads are correct
    
    Place guards around the pointer accesses to verify that they are within
    the bitmap.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

but my current suspicion is that this is a spurious SIGBUS thrown by the kernel.
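
A guard of the kind that commit message describes could look roughly like the sketch below. This is an assumption about the general technique, not the contents of the commit: range_within() and copy_row_checked() are hypothetical names, and the real assertions in the driver may be written quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: is [ptr, ptr + len) inside [base, base + size)? */
static bool range_within(const void *base, size_t size,
                         const void *ptr, size_t len)
{
    uintptr_t start = (uintptr_t)base;
    uintptr_t p = (uintptr_t)ptr;
    size_t offset;

    if (p < start)
        return false;

    offset = (size_t)(p - start);
    if (offset > size)
        return false;

    /* Written this way so that offset + len cannot overflow. */
    return len <= size - offset;
}

/* Copy one row, asserting first that the destination stays in bounds. */
static void copy_row_checked(uint8_t *dst_base, size_t dst_size,
                             uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(range_within(dst_base, dst_size, dst, len));
    memcpy(dst, src, len);
}
```

With assertions compiled in, an out-of-bounds row would abort at the assert rather than fault inside the copy, which helps separate a driver-side addressing bug from a fault raised by the kernel on a valid address.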
Comment 15 Chris Wilson 2013-10-29 15:43:58 UTC
One issue that I spotted that could lead to a resource issue in the kernel was that during the run the active buffers were not being retired - only at the end.

This should be improved by:

commit d86b36dc41f2e6744a4dea9286634cdf7989fa71
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 29 15:12:17 2013 +0000

    sna: Check for retired upload buffers after checking for an idle ring
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
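
Why retiring only at the end can exhaust resources is easy to see in a toy model. The sketch below is purely illustrative: struct toy_pool and the toy_* functions are hypothetical, they do not mirror the driver's buffer cache, and the model simply assumes that every submitted buffer is already idle whenever a retire pass runs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the driver. */
struct toy_pool {
    size_t active; /* buffers submitted and not yet retired */
    size_t peak;   /* high-water mark of active */
};

/* Submit one upload buffer and track the high-water mark. */
static void toy_submit(struct toy_pool *p)
{
    assert(p != NULL);
    p->active++;
    if (p->active > p->peak)
        p->peak = p->active;
}

/* Retire every buffer; the model assumes all of them are idle. */
static void toy_retire(struct toy_pool *p)
{
    assert(p != NULL);
    p->active = 0;
}

/*
 * Run `ops` uploads, retiring every `interval` operations
 * (interval == 0 means "only at the end"). Returns the peak
 * number of buffers that were held at the same time.
 */
static size_t toy_run(size_t ops, size_t interval)
{
    struct toy_pool p = { 0, 0 };
    size_t i;

    for (i = 1; i <= ops; i++) {
        toy_submit(&p);
        if (interval != 0 && i % interval == 0)
            toy_retire(&p);
    }
    toy_retire(&p);
    return p.peak;
}
```

In this model toy_run(1000, 0) holds 1000 buffers at its peak, while toy_run(1000, 16) never holds more than 16, which is the difference between memory use that grows with the length of the run and memory use that stays flat.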
Comment 16 Chris Wilson 2013-10-29 21:06:13 UTC
Do you mind running with assertions enabled (--enable-debug) and seeing if that falls over in the out-of-bounds checks around sna_copy_plane_blt?
Comment 17 intelgraphics7 2013-10-30 08:47:03 UTC
I tried the assertions but never hit one.

But I've just noticed something else. Your suspicion about retired upload buffers might hit the mark.
My test case seems to make the X server allocate memory over and over without releasing all of it. So the amount of free memory goes down constantly once I start "x11perf -copyplane500", until it is really low, and then the SIGBUS happens. So I'm going to try your "Check for retired upload buffers after checking for an idle ring" fix and see if that helps with my issue.
Comment 18 intelgraphics7 2013-10-30 09:11:46 UTC
Your "Check for retired upload buffers after checking for an idle ring" fix looks very promising. It has already run for several loops and the amount of free memory stays quite constant.

I've also double-checked with the "fix" that I posted initially. It consumed more and more memory as well and, unsurprisingly, crashed the X server too. I have no idea why I was able to run the X server over the last weekend with my "fix".
After thinking it through, I totally agree with you that (bx2-bx1)/8 == (bx2-bx1+7)/8. So my "fix" does not change anything, as you pointed out at first.
Please excuse any inconvenience I may have caused with that.

I'll let my test run for some hours to make sure it works but, as I said already, it looks really promising.

Btw, is there a schedule for releasing 2.99.906 or even 3.00.000 ?
Comment 19 Chris Wilson 2013-10-30 09:18:33 UTC
Cool - I'm glad we've found the underlying problem. I'll do 2.99.906 this w/e as there are already substantial changes, and 3.0 will land as soon as I have a snapshot that can last a couple of weeks without a serious bug report/fix!
Comment 20 intelgraphics7 2013-10-30 09:45:31 UTC
Cool indeed. And thanks a lot for your help:-)
I'm going to do some further stress tests overnight and I would close the bug tomorrow if I'm allowed to do so.
Thanks again. Have a nice day.
Comment 21 intelgraphics7 2013-10-31 08:19:01 UTC
Ok, the test ran all night long without any issues so far. No memory leak anymore as far as I can see.
I'll close the bug and look forward to the upcoming release of 2.99.906.
Thanks a lot for your great help.

