Bug 82683

Summary:	Segfault in glBufferSubData
Product:	Mesa	Reporter:	Ben Foppa <benjamin.foppa>
Component:	Drivers/DRI/i965	Assignee:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Status:	RESOLVED WONTFIX	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	eero.t.tamminen
Version:	10.2
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	code example consistently demonstrating the segfault; patch "fixing" the issue by removing an optimization block; workaround

Description Ben Foppa 2014-08-15 22:24:12 UTC

Created attachment 104697
code example consistently demonstrating the segfault

(code attached) I'm getting a segfault in glBufferSubData that I can't seem to narrow down any further. I can glGetBufferSubData from the memory location, but I can't write to it.

I'm on x64 Arch Linux, using

00:02.0 VGA compatible controller: Intel Corporation Crystal Well Integrated Graphics Controller (rev 08)

Here's the log:

initializing..
[New Thread 0x7ffff4796700 (LWP 20204)]
[New Thread 0x7fffeb8d1700 (LWP 20205)]
[Thread 0x7fffeb8d1700 (LWP 20205) exited]
GL_VENDOR: Intel Open Source Technology Center
GL_RENDERER: Mesa DRI Intel(R) Haswell Mobile 

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff754c859 in __memcpy_sse2_unaligned () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff754c859 in __memcpy_sse2_unaligned () from /usr/lib/libc.so.6
#1  0x00007ffff2ea154d in ?? () from /usr/lib/xorg/modules/dri/i965_dri.so
#2  0x0000000000400ee4 in main (nargs=1, args=0x7fffffffe8d8) at opengl-segfault.c:148

Comment 1 Ben Foppa 2014-08-17 17:12:21 UTC

Created attachment 104766
patch "fixing" the issue by removing an optimization block

Comment 2 Ben Foppa 2014-08-17 17:14:19 UTC

intel_obj->buffer->virtual is NULL when the memcpy takes place

Comment 3 Iago Toral 2014-09-11 14:32:07 UTC

This is a bug in the attached example. You have a huge loop where each iteration allocates a new buffer that is never freed... it simply runs out of memory/buffers.

Adding these deletes at the bottom of the while loop fixes the crash:

glDeleteVertexArrays(1, &vertex_array);
glDeleteBuffers(1, &vertex_buffer);
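
The effect of those two deletes can be illustrated without a GL context. What follows is only a minimal sketch in plain C, assuming a hypothetical fixed-size handle table; `POOL_SLOTS`, `pool_alloc`, `pool_free` and `run_loop` are made-up names, not Mesa or kernel APIs. A loop that never releases its handles stops once the table is full, while the same loop with a release at the bottom of each iteration runs to completion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: a table with a fixed number of handle slots.
 * This is NOT how Mesa or the kernel track buffers; it only illustrates
 * why a per-iteration release keeps a long loop within a finite limit. */
#define POOL_SLOTS 1024

struct pool {
    bool used[POOL_SLOTS];
    size_t in_use;
};

/* Returns a slot index, or -1 when every slot is taken. */
int pool_alloc(struct pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            p->in_use++;
            return i;
        }
    }
    return -1;
}

/* Releases a slot; out-of-range or already-free slots are ignored. */
void pool_free(struct pool *p, int slot)
{
    if (slot >= 0 && slot < POOL_SLOTS && p->used[slot]) {
        p->used[slot] = false;
        p->in_use--;
    }
}

/* Attempts `iterations` allocations and returns how many succeeded.
 * With release_each_iteration set, each handle is freed at the bottom
 * of the loop, mirroring the glDelete* calls suggested above. */
size_t run_loop(struct pool *p, size_t iterations, bool release_each_iteration)
{
    size_t ok = 0;
    for (size_t i = 0; i < iterations; i++) {
        int slot = pool_alloc(p);
        if (slot < 0)
            break;              /* table full: stop instead of crashing */
        ok++;
        if (release_each_iteration)
            pool_free(p, slot);
    }
    return ok;
}
```

Starting from a zero-initialized `struct pool`, `run_loop(&p, 5000, false)` stops after `POOL_SLOTS` allocations, whereas `run_loop(&p, 5000, true)` completes all 5000 and leaves nothing in use.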

Comment 4 Ben Foppa 2014-09-12 01:47:16 UTC

The intent *was* to allocate many buffers simultaneously without freeing. While this certainly isn't a good use of VBOs, it does demonstrate some underlying incorrectness in the code. This behavior should NOT happen, and doesn't happen if I switch graphics drivers.

This is almost certainly not an OOM issue. I can cut the number of buffers in half and vastly increase the capacity of each without getting a segfault. It also starts slowing down once the memory consumption goes past a certain point (which is expected, since it starts to just use my RAM).

I'm specifically checking for OpenGL errors, including OOM. The error check code in my sample is wrong, but it's irrelevant anyway because no OpenGL errors are generated. Moreover, one would expect an out-of-memory error to occur when the buffer is actually allocated, and if not then, then certainly when it is read! The segfault only occurs once we're in glBufferSubData.

Like I said, the "patch" I uploaded fixes this issue - the patch only removes a block of code that, based on the comment surrounding it, I assume is only there for performance purposes, not correctness. But apparently, it reduces correctness, because it's causing very ungraceful errors.

Again, the use case isn't good, but it's demonstrative of underlying issues and false programmer assumptions, and I believe it to be important because I've had other SIGSEGVs and SIGBUSs show up in glBufferSubData for no real reason. I'm very inclined to suspect the same tricky optimization block, but that's just my bias.

Comment 5 Tapani Pälli 2014-09-12 06:22:02 UTC

Created attachment 106164
workaround

The error is that intel_bufferobj_subdata does not take into account that drm_intel_gem_bo_map_unsynchronized can fail. Attaching a workaround/fix for the problem. With this patch applied, the test fails appropriately:

--- 8< ----
initializing..
GL_VENDOR: Intel Open Source Technology Center
GL_RENDERER: Mesa DRI Intel(R) Ivybridge Mobile 
Mesa: User error: GL_OUT_OF_MEMORY in glGenVertexArrays
GL_OUT_OF_MEMORY gen/binding

Comment 6 Kenneth Graunke 2014-09-12 06:59:51 UTC

There's no need to delete the code you highlighted. It's simply failing to check whether the mapping succeeded, which is a ~2 line fix, which I've posted to the mailing list for review:

http://lists.freedesktop.org/archives/mesa-dev/2014-September/067665.html
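
The general pattern here is "check the map result before touching the pointer". A minimal sketch of it in plain C follows; it is not the patch posted to the list. `struct toy_bo`, `toy_map`, `toy_unmap` and `toy_subdata` are hypothetical stand-ins (the real code works on libdrm buffer objects), and the mapping failure is simulated with a flag.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a driver buffer object. */
struct toy_bo {
    unsigned char storage[64];
    void *virt;        /* CPU mapping; NULL while unmapped */
    int map_fails;     /* knob to simulate the kernel refusing the map */
};

/* Simulated map call: returns 0 on success, -ENOMEM on failure.
 * On failure the mapping pointer stays NULL, as observed in comment 2. */
int toy_map(struct toy_bo *bo)
{
    if (bo->map_fails) {
        bo->virt = NULL;
        return -ENOMEM;
    }
    bo->virt = bo->storage;
    return 0;
}

void toy_unmap(struct toy_bo *bo)
{
    bo->virt = NULL;
}

/* Copies `size` bytes into the buffer at `offset`.
 * Returns 0 on success or a negative errno value; the mapping is only
 * dereferenced after the map call has reported success. */
int toy_subdata(struct toy_bo *bo, size_t offset, const void *data, size_t size)
{
    if (offset > sizeof bo->storage || size > sizeof bo->storage - offset)
        return -EINVAL;

    int ret = toy_map(bo);
    if (ret != 0)
        return ret;    /* report the failure instead of memcpy to NULL */

    memcpy((unsigned char *)bo->virt + offset, data, size);
    toy_unmap(bo);
    return 0;
}
```

With `map_fails` set, `toy_subdata` returns `-ENOMEM` instead of crashing; a driver would then be free to turn that into a GL error for the caller.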

That said, this is a dangerous game to play - there are a variety of other places where we rely on the ability to map buffers for writing (including GPU command submission). If you keep looking, you're likely to find more similar breakage, and not all of them are as easy to fix.

Frankly, I'm a bit loath to fix these, as it adds overhead (even if not much) to well-behaved applications, to try and provide a better error report to applications which are clearly broken. Most applications don't even try to handle GL_OUT_OF_MEMORY, because in order to notice, they'd have to call glGetError() after virtually every API call...which is a widely known way to make your application horribly slow. Without doing that, you don't know what API calls actually failed, so how can you recover from it? And even if you know exactly what failed...you still have to be able to stop what you're doing, delete a bunch of things, and recover sensibly...which is a pretty heroic task.

On top of that, the most likely reason the application ran out of memory in the first place is due to a resource leak like this. If an application can't properly delete buffers under normal circumstances, I have a hard time believing that it properly cleans them up in error handling code triggered when it receives GL_OUT_OF_MEMORY at a random point in time.

I haven't seen a single non-trivial application get this right. Even if we properly report errors, they either fail to notice, or crash when trying to handle it. At which point, I wonder why we bother...

Comment 7 Ben Foppa 2014-09-13 03:43:44 UTC

Sorry, can someone help me understand how this did turn out to be an OOM issue?

At 128000 buffers x 2^7 bytes per buffer, I get segfaults.
At 64000 buffers x 2^14 bytes per buffer, it completes fine.
At 64000 buffers x 2^15 bytes per buffer, the OS kills the program for using too much RAM. It says "Killed". It doesn't segfault.
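
For reference, the total payload in each of the three cases can be worked out directly. The sketch below is plain C arithmetic on the figures quoted above; `total_bytes` is a made-up helper name, and any per-buffer overhead in the driver or kernel is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Total payload in bytes for `count` buffers of 2^log2_size bytes each.
 * Ignores any per-buffer overhead (page rounding, bookkeeping). */
uint64_t total_bytes(uint64_t count, unsigned log2_size)
{
    assert(log2_size < 64);   /* keep the shift well defined */
    return count * (UINT64_C(1) << log2_size);
}
```

`total_bytes(128000, 7)` is 16,384,000 bytes (about 15.6 MiB), `total_bytes(64000, 14)` is 1,048,576,000 bytes (1000 MiB), and `total_bytes(64000, 15)` is 2,097,152,000 bytes (2000 MiB). The case that segfaults therefore requests by far the smallest payload. Even if each small buffer were rounded up to a 4 KiB page (an assumption, not something measured here), the first case would come to 524,288,000 bytes, half of the second.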

Comment 8 Tapani Pälli 2014-09-15 05:43:59 UTC

I know this is not a proper answer to the question, but ENOMEM is what we get from the kernel driver, so that's how we should treat it. We run out of memory that is simultaneously mappable through the GTT; I don't understand why that would not happen in case 2.

I've been reading through these to understand more:

http://blog.ffwll.ch/2012/10/i915gem-crashcourse.html
https://bwidawsk.net/blog/index.php/2014/06/the-global-gtt-part-1/

Maybe intel-gpu-tools has some tool to catch this.

Comment 9 Iago Toral 2014-09-15 07:18:29 UTC

(In reply to comment #6)
(...)
> Frankly, I'm a bit loath to fix these, as it adds overhead (even if not
> much) to well behaved applications, to try and provide a better error report
> to applications which are clearly broken.  Most applications don't even try
> to handle GL_OUT_OF_MEMORY, because in order to notice, they'd have to call
> glGetError() after virtually every API call...which is a widely known way to
> make your application horribly slow.  Without doing that, you don't know
> what API calls actually failed, so how can you recover from it?  And even if
> you know exactly what failed...you still have to be able to stop what you're
> doing, delete a bunch of things, and recover sensibly...which is a pretty
> heroic task.

FWIW, I totally agree with this.

(In reply to comment #7)
> Sorry, can someone help me understand how this did turn out to be an OOM
> issue?
> 
> At 128000 buffers x 2^7 bytes per buffer, I get segfaults.
> At 64000 buffers x 2^14 bytes per buffer, it completes fine.
> At 64000 buffers x 2^15 bytes per buffer, the OS kills the program for using
> too much RAM. It says "Killed". It doesn't segfault.

From the results, it seems like the kernel drm module has some limit on the number of buffers it can allocate. Indeed, if I reduce the size of the data to a single float per buffer it still segfaults at 128000 buffers.

At least on my SNB, the limit is close to 2^16 (65536 buffers). It starts to segfault a bit before that (around 65300), but I guess that is probably because other apps on my desktop and the desktop itself are also allocating buffers from the driver.

Comment 10 Matt Turner 2016-11-03 22:14:45 UTC

I think we're going to mark as WONTFIX.

Comment 11 Ben Foppa 2016-11-03 23:16:51 UTC

SGTM, I was being way too pedantic. If this issue doesn't show up except in ridiculous circumstances, it's not much of an issue!
