Created attachment 104697 [details]
code example consistently demonstrating the segfault
(code attached) I'm getting a segfault in glBufferSubData that I can't seem to narrow down any further. I can glGetBufferSubData from the memory location, but I can't write to it.
I'm on x64 Arch Linux, using
00:02.0 VGA compatible controller: Intel Corporation Crystal Well Integrated Graphics Controller (rev 08)
Here's the log:
[New Thread 0x7ffff4796700 (LWP 20204)]
[New Thread 0x7fffeb8d1700 (LWP 20205)]
[Thread 0x7fffeb8d1700 (LWP 20205) exited]
GL_VENDOR: Intel Open Source Technology Center
GL_RENDERER: Mesa DRI Intel(R) Haswell Mobile
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff754c859 in __memcpy_sse2_unaligned () from /usr/lib/libc.so.6
#0 0x00007ffff754c859 in __memcpy_sse2_unaligned () from /usr/lib/libc.so.6
#1 0x00007ffff2ea154d in ?? () from /usr/lib/xorg/modules/dri/i965_dri.so
#2 0x0000000000400ee4 in main (nargs=1, args=0x7fffffffe8d8) at opengl-segfault.c:148
Created attachment 104766 [details] [review]
patch "fixing" the issue by removing an optimization block
intel_obj->buffer->virtual is NULL when the memcpy takes place
This is a bug in the attached example. You have a huge loop where each iteration allocates a new buffer that is never freed... it simply runs out of memory/buffers.
Adding these deletes at the bottom of the while loop fixes the crash:
The intent *was* to allocate many buffers simultaneously without freeing. While this certainly isn't a good use of VBOs, it does demonstrate some underlying incorrectness in the code. This behavior should NOT happen, and doesn't happen if I switch graphics drivers.
This is almost certainly not an OOM issue. I can cut the number of buffers in half and vastly increase the capacity of each without getting a segfault. It also starts slowing down once the memory consumption goes past a certain point (which is expected, since it starts to just use my RAM).
I'm specifically checking for OpenGL errors, including OOM. The error check code in my sample is wrong, but it's irrelevant anyway because no OpenGL errors are generated. Moreover, one would expect an out-of-memory error to occur when the buffer is actually allocated, and failing that, certainly when it is read! The segfault only occurs once we're inside glBufferSubData.
Like I said, the "patch" I uploaded fixes this issue - it only removes a block of code that, judging by the surrounding comment, I assume is there purely for performance, not correctness. But apparently it affects correctness too, because it causes very ungraceful failures.
Again, the use case isn't good, but it's demonstrative of underlying issues and false programmer assumptions, and I believe it to be important because I've had other SIGSEGVs and SIGBUSs show up in glBufferSubData for no real reason. I'm very inclined to suspect the same tricky optimization block, but that's just my bias.
Created attachment 106164 [details] [review]
The error is that intel_bufferobj_subdata does not take into account that drm_intel_gem_bo_map_unsynchronized can fail. Attaching a workaround/fix for the problem. With this patch applied, the test fails appropriately:
--- 8< ----
GL_VENDOR: Intel Open Source Technology Center
GL_RENDERER: Mesa DRI Intel(R) Ivybridge Mobile
Mesa: User error: GL_OUT_OF_MEMORY in glGenVertexArrays
There's no need to delete the code you highlighted. It's simply failing to check whether the mapping succeeded; that's a ~2 line fix, which I've posted to the mailing list for review:
That said, this is a dangerous game to play - there are a variety of other places where we rely on the ability to map buffers for writing (including GPU command submission). If you keep looking, you're likely to find more similar breakage, and not all of it is as easy to fix.
Frankly, I'm a bit loath to fix these, as it adds overhead (even if not much) to well behaved applications, to try and provide a better error report to applications which are clearly broken. Most applications don't even try to handle GL_OUT_OF_MEMORY, because in order to notice, they'd have to call glGetError() after virtually every API call...which is a widely known way to make your application horribly slow. Without doing that, you don't know what API calls actually failed, so how can you recover from it? And even if you know exactly what failed...you still have to be able to stop what you're doing, delete a bunch of things, and recover sensibly...which is a pretty heroic task.
On top of that, the most likely reason the application ran out of memory in the first place is due to a resource leak like this. If an application can't properly delete buffers under normal circumstances, I have a hard time believing that it properly cleans them up in error handling code triggered when it receives GL_OUT_OF_MEMORY at a random point in time.
I haven't seen a single non-trivial application get this right. Even if we properly report errors, they either fail to notice, or crash when trying to handle it. At which point, I wonder why we bother...
Sorry, can someone help me understand how this turned out to be an OOM issue?
At 128000 buffers x 2^7 bytes per buffer, I get segfaults.
At 64000 buffers x 2^14 bytes per buffer, it completes fine.
At 64000 buffers x 2^15 bytes per buffer, the OS kills the program for using too much RAM. It says "Killed". It doesn't segfault.
I know this is not a proper answer to the question, but ENOMEM is what we get from the kernel driver, so that's how we should treat it. We run out of the memory simultaneously mappable through the GTT; I don't understand why it would not happen in case 2.
I've been reading through these to understand more:
Maybe intel-gpu-tools has some tool to catch this.
(In reply to comment #6)
> Frankly, I'm a bit loath to fix these, as it adds overhead (even if not
> much) to well behaved applications, to try and provide a better error report
> to applications which are clearly broken. Most applications don't even try
> to handle GL_OUT_OF_MEMORY, because in order to notice, they'd have to call
> glGetError() after virtually every API call...which is a widely known way to
> make your application horribly slow. Without doing that, you don't know
> what API calls actually failed, so how can you recover from it? And even if
> you know exactly what failed...you still have to be able to stop what you're
> doing, delete a bunch of things, and recover sensibly...which is a pretty
> heroic task.
FWIW, I totally agree with this.
(In reply to comment #7)
> Sorry, can someone help me understand how this turned out to be an OOM
> At 128000 buffers x 2^7 bytes per buffer, I get segfaults.
> At 64000 buffers x 2^14 bytes per buffer, it completes fine.
> At 64000 buffers x 2^15 bytes per buffer, the OS kills the program for using
> too much RAM. It says "Killed". It doesn't segfault.
From the results, it seems like the kernel drm module has some limit on the number of buffers it can allocate. Indeed, if I reduce the size of the data to a single float per buffer, it still segfaults at 128000 buffers.
At least on my SNB, the limit is close to 2^16 (65536 buffers). It starts to segfault a bit before that (around 65300), but I guess that's probably because other apps on my desktop, and the desktop itself, are also allocating buffers from the driver.
I think we're going to mark this as WONTFIX.
SGTM, I was being way too pedantic. If this issue doesn't show up except in ridiculous circumstances, it's not much of an issue!