I discovered something weird today. It is a somewhat pathological use case, but it nevertheless ends in catastrophe.

I have a key binding defined in my window manager to "fullscreen" a window, resizing the window so that the client area fills the entire screen. The same key also restores the window to its original shape. The problem occurs when I run glxgears (or anything else) and then _hold down_ this key. This causes the window to switch back and forth between large and small sizes extremely rapidly. After some time of doing this (anywhere from ~5 seconds to a minute or so), the GPU hangs, requiring a reboot to recover. Incidentally, if I stop before the GPU hangs, closing X does not restore the console correctly.

The kernel log contains the usual spam:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 160037 at 160033)
... ad infinitum

Occasionally there is visible corruption on the screen before and/or after the hang.

If I ssh into the machine and take a look at memory usage, I can see the page cache continuously increasing in size until all available memory is consumed, at which point the memory is suddenly released and the process repeats (until it eventually hangs).

I'm using: server 1.8.0, mesa 7.8.1 (also happens with git master), xf86-video-intel-2.11 on a ThinkPad T500 with a GM45, linux 2.6.34-rc6 (also happens with 2.6.33).

I'm not sure if it's useful, but I've also attached the contents of i915_error_state from debugfs after a hang.
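The page cache growth described above can be watched over ssh while the test runs. The following is a minimal sketch, assuming a Linux /proc/meminfo with the usual `Cached:` line reported in kB. The function names and the sample text are made up for illustration and are not taken from the affected machine.

```python
import time


def cached_kb(meminfo_text):
    """Return the 'Cached:' value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            # Line format: "Cached:   123456 kB"
            return int(line.split()[1])
    raise ValueError("no 'Cached:' line found")


def watch(samples=5, interval=1.0, path="/proc/meminfo"):
    """Print the page cache size a few times while the test case runs."""
    for _ in range(samples):
        with open(path) as f:
            print("page cache: %d kB" % cached_kb(f.read()))
        time.sleep(interval)


# Hypothetical sample text, only used to show the parsing.
SAMPLE = (
    "MemTotal: 2048000 kB\n"
    "MemFree: 512000 kB\n"
    "Cached: 734003 kB\n"
    "SwapCached: 0 kB\n"
)
```

Calling `watch()` in an ssh session while the window is being resized should show the value climbing and then dropping, if the behaviour matches the description.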
Created attachment 35380
Demo program which rapidly resizes an opengl window.

So here's a demo program which doesn't involve any window manager interaction. GDK and GtkGLExt are required. With the appropriate packages installed,

  gcc -o glresize glresize.c `pkg-config --cflags --libs gdk-2.0 gtkglext-1.0`

should build the program.

It seems that GPU hangs as in the initial report are relatively rare, but I have been able to trigger one. Nevertheless, this program is capable of wreaking untold destruction upon a system. The most severe problems appear to only happen with double buffering. The program will use double buffering unless you pass --single to it.

Depending on how I run it, I can cause various kinds of horrible effects:
* prevent X from restoring the console correctly.
  This happens in every case except when the server segfaults or (trivially) the system locks up.
* invoke an OOM killer rampage.
  This happens with direct rendering and vm overcommit enabled.
* hard-lock the kernel.
  This happens with direct rendering and overcommit disabled; it sometimes takes a few tries...
* cause the X server to segfault.
  This happens instantly with indirect rendering.

The program works perfectly (meaning it does not negatively affect the system) on an r600 card (direct and indirect rendering), or with the software rasterizer.
Created attachment 35381
netconsole log from kernel panic/lockup

Here's the netconsole output from the kernel lockup mentioned above. This is with Linus's latest git, but similar behaviour occurs with 2.6.33. The output was truncated, presumably due to the lockup. The system ceases to respond to sysrq.

Sometimes the program segfaults instead of locking up, so it can take multiple tries to actually trigger the panic. Also, the system occasionally gets into a state where the program segfaults immediately, at which point the X server needs to be restarted. Nevertheless, the lockup will occur eventually.
Oops, I meant to mention in the last comment -- disable vm overcommit to trigger the lockup:

  echo 2 > /proc/sys/vm/overcommit_memory

otherwise you get the "OOM killer rampage" instead of a panic.
Created attachment 35382
Demo program which rapidly resizes an opengl window.

I added a new option to the demo program, --offset, which adjusts the window position by one pixel in each direction. Apparently the window position is important: the kernel does not panic if you use this option (it's still possible to segfault the server or cause the OOM killer rampage).

I also noticed that running xrandr to change the screen size after or during execution of the test case will cause the server to immediately die with an assertion failure:

  X: intel_bufmgr_gem.c:900: drm_intel_gem_bo_unreference_locked_timed: Assertion `((&bo_gem->refcount)->atomic) > 0' failed.

This assertion failure also does not occur if the --offset option is given.
Thanks a lot for the test cases, we'll take a look.
Created attachment 35554
Resize, plain X

Hmm, because most of my headless boxes don't have gtk+ available (just compiling mesa takes up most of their local hard drive space! ;-) this is a pure X variant of glresize.

Nick, does this still exercise the bug you are seeing? On my systems it rapidly allocates lots of memory that is reclaimed when the cache is reaped, but I haven't seen any anomalous behaviour (even with the original glresize).
glresize + xrandr == death.

GPU hang:
batchbuffer at 0x02c68000:
0x02c68000: 0x09000000: MI_LOAD_SCAN_LINES_INCL
0x02c68004: 0x00000258: dword 1
0x02c68008: 0x09000000: MI_LOAD_SCAN_LINES_INCL
0x02c6800c: 0x00000258: dword 1
0x02c68010: 0x01820000: MI_WAIT_FOR_EVENT
0x02c68014: HEAD 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x02c68018: 0x03cc0380: format 8888, dst pitch 896, clipping disabled
0x02c6801c: 0x00000000: dst (0,0)
0x02c68020: 0x02580320: dst (800,600)
0x02c68024: 0x098d0000: dst offset 0x098d0000
0x02c68028: 0x00000000: src (0,0)
0x02c6802c: 0x00000400: src pitch 1024
0x02c68030: 0x094d0000: src offset 0x094d0000
0x02c68034: 0x05000000: MI_BATCH_BUFFER_END

The blit at least looks valid, so what might be up with the WAIT?
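As a sanity check on "the blit at least looks valid", the raw dwords can be unpacked by hand. This is only a sketch: the field layout (x in the low 16 bits and y in the high 16 bits of a coordinate dword, pitch in the low 16 bits of the second blit dword) is inferred from the decoder's own annotations in the dump above, not from the hardware documentation, and the helper names are hypothetical.

```python
def unpack_xy(dword):
    """Split a packed coordinate dword into (x, y): x low 16 bits, y high 16 bits."""
    return dword & 0xFFFF, (dword >> 16) & 0xFFFF


def dst_pitch(dword):
    """Low 16 bits of the dword that the decoder labels 'dst pitch'."""
    return dword & 0xFFFF


# Values copied from the dump above.
bottom_right = unpack_xy(0x02580320)   # decoder prints "dst (800,600)"
pitch = dst_pitch(0x03CC0380)          # decoder prints "dst pitch 896"
scan_line_dword = 0x00000258           # "dword 1" of MI_LOAD_SCAN_LINES_INCL

print(bottom_right, pitch, scan_line_dword)
```

The values agree with the decoder's text, and the scan-line dword (600) equals the height of the blit rectangle, so nothing in the blit parameters themselves stands out.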
Cool, I can try out the plain X version later tonight. It might not be important, but I noticed that your version does not release the GLX context in the loop, whereas the GDK version does.
On Mon, May 10, 2010 at 01:25:52PM -0700, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=27922
>
> --- Comment #7 from Chris Wilson <chris@chris-wilson.co.uk> 2010-05-10 13:25:51 PDT ---
> glresize + xrandr == death.
>
> GPU hang:
> batchbuffer at 0x02c68000:
> 0x02c68000: 0x09000000: MI_LOAD_SCAN_LINES_INCL
> 0x02c68004: 0x00000258: dword 1
> 0x02c68008: 0x09000000: MI_LOAD_SCAN_LINES_INCL
> 0x02c6800c: 0x00000258: dword 1
> 0x02c68010: 0x01820000: MI_WAIT_FOR_EVENT
> 0x02c68014: HEAD 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
> 0x02c68018: 0x03cc0380: format 8888, dst pitch 896, clipping disabled
> 0x02c6801c: 0x00000000: dst (0,0)
> 0x02c68020: 0x02580320: dst (800,600)
> 0x02c68024: 0x098d0000: dst offset 0x098d0000
> 0x02c68028: 0x00000000: src (0,0)
> 0x02c6802c: 0x00000400: src pitch 1024
> 0x02c68030: 0x094d0000: src offset 0x094d0000
> 0x02c68034: 0x05000000: MI_BATCH_BUFFER_END
>
> The blit at least looks valid, so what might be up with the WAIT?

MI_WAIT_FOR_EVENT on a disabled pipe is supposed to hang your chip. At least it worked that way when I developed the overlay. Dunno what you're doing with xrandr, but I suspect abusing it (in conjunction with tons of MI_WAIT_FOR_EVENT commands in the bb due to dri2 swap requests) could cause strange things to happen.

I've always been rather uneasy with this whole "userspace can submit bb commands that depend upon current kms state" thing. Which is why I've put the core overlay code into the kernel.

Just my 2 cents.
Yeah yuck. So we submit a batch with a wait in it, then mess with the output config, then the batch runs. Seems like a bad race. Maybe we do need a kernel way of communicating this instead.
Not so simple. Moving the glXMakeCurrent(dpy, ctx, win);...glXMakeCurrent(dpy, 0, 0); into the loop causes the hang to trigger much faster and without xrandr interfering.

The hang is nearly identical:
batchbuffer at 0x028f5000:
0x028f5000: 0x09000000: MI_LOAD_SCAN_LINES_INCL
0x028f5004: 0x00000300: dword 1
0x028f5008: 0x09000000: MI_LOAD_SCAN_LINES_INCL
0x028f500c: 0x00000300: dword 1
0x028f5010: 0x01820000: MI_WAIT_FOR_EVENT
0x028f5014: HEAD 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x028f5018: 0x03cc0400: format 8888, dst pitch 1024, clipping disabled
0x028f501c: 0x00000000: dst (0,0)
0x028f5020: 0x03000400: dst (1024,768)
0x028f5024: 0x0202e000: dst offset 0x0202e000
0x028f5028: 0x00000000: src (0,0)
0x028f502c: 0x00000400: src pitch 1024
0x028f5030: 0x024f5000: src offset 0x024f5000
0x028f5034: 0x05000000: MI_BATCH_BUFFER_END

The dst is the X front buffer, but we should be under complete control of the dri2 buffers here.
Created attachment 35555
glresize
> --- Comment #10 from Jesse Barnes <jbarnes@virtuousgeek.org> 2010-05-10 14:30:44 PDT ---
> Yeah yuck. So we submit a batch with a wait in it, then mess with the output
> config, then the batch runs. Seems like a bad race. Maybe we do need a kernel
> way of communicating this instead.

Well, I was actually under the impression that a mix of
a) wrestling all this stuff through single-threaded X and
b) quiescent gem execution before modesetting in the kernel
should prevent such races (with the current code). Looks like I was wrong.

Compositing window managers should get rid of this problem, anyway ;)
I don't know if this is still useful information, but I finally got around to testing the non-GDK version of my test case (I didn't bother with the original that doesn't call glXMakeCurrent every iteration): every failure mode listed in earlier comments is reproducible with this test case on my machine.
*** Bug 27601 has been marked as a duplicate of this bug. ***
(In reply to comment #14)
> I don't know if this is still useful information, but I finally got around to
> testing the non-GDK version of my test case (I didn't bother with the original
> that doesn't call glXMakeCurrent every iteration): every failure mode listed in
> earlier comments is reproducible with this test case on my machine.

Thanks Nick for confirming the test case hits the same bugs. So far on g45 it causes the WAIT hang and on i945 I hit the assertion failure in dri_bo_unreference. What fun!
Regarding whether bug 27601 is a duplicate: it looks like it is this bug. However, it is on i915, not i965. I had the same messages on the console.

It also looks like https://bugzilla.kernel.org/show_bug.cgi?id=15463 is the same problem, and possibly https://bugzilla.kernel.org/show_bug.cgi?id=15737 (however, I couldn't even ping the computer, and it depends on the kernel version).
The reference counting error with xrandr + activity is generic, so I expect that the assertion could be hit on any Intel chipset. The GPU hang looks related - the theory is that userspace has issued a WAIT on a non-existent buffer (presumably due to the reference counting error), however this symptom will vary between chipsets as they use different mechanisms + code.
Created attachment 35567
Handle reference counting across page flipping.

This seems to survive the beating on my i945 box.
g45 still hangs.
(In reply to comment #20)
> g45 still hangs.

That hang appears to be due to it being a headless system, and waiting on the disabled pipe. On my gm45 laptop I hit a completely different assertion.
(In reply to comment #19)
> Created an attachment (id=35567)
> Handle reference counting across page flipping.
>
> This seems to survive the beating on my i945 box.

It crashes my xorg-server (gdm enters a loop of X restarts). I'll post full details as soon as possible.
Fixed the assertion, survives on my gm45. Maciej, xf86-video-intel.git requires tip of libdrm.git as well (in case you updated one without the other...)
(In reply to comment #23)
> Fixed the assertion, survives on my gm45.
>
> Maciej, xf86-video-intel.git requires tip of libdrm.git as well (in case you
> updated one without the other...)

OK. I managed to start X (I rebuilt the server against libdrm from git just in case), but if I start openttd the screen is black except for a small orange rectangle at the center, with a mouse pointer that does not move. If I kill openttd from the console, all is OK.

The compositing WM works OK; I haven't run into problems so far.
Probable duplicate: bug 27285. Possible duplicate: bug 24884?
Created attachment 35569
Handle reference counting across page flipping.

Talking the code through with Jesse, I discovered the separate event handling for FLIP complete notification, and so we do need to call ExchangeBuffers when successfully scheduling the page flip as well.
Pushed:

commit 9f54107f866a25cf670f81f7c52b8c108728c6a5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue May 11 14:55:16 2010 +0100

    dri2: Handle reference counting across page flipping

    1. Instead of swapping bos, swap the entire private structure.
    2. If we update the pixmap bo for the Screen, make sure we update the
       reference inside intel->front_buffer so that xrandr still functions.

    Fixes:
    Bug 27922 - i965: Rapidly resizing OpenGL window causes GPU to hang.
    https://bugs.freedesktop.org/show_bug.cgi?id=27922

My testing seems to indicate this is stable and the leak has gone. I think the leak was a separate issue that was fixed a while ago.
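To illustrate why a second pointer to the front buffer matters for reference counting, here is a toy model. It is not the driver's code and makes no claim about the exact failure path in xf86-video-intel. It only shows how a stale alias left behind by a flip can trip a "refcount > 0" check when an xrandr-style teardown later releases it. All class and function names are hypothetical.

```python
class Bo:
    """Toy buffer object with a manual reference count."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1  # the creator holds the first reference

    def reference(self):
        self.refcount += 1
        return self

    def unreference(self):
        # Same kind of check as the assertion quoted earlier in this bug.
        if self.refcount <= 0:
            raise AssertionError("unreference of dead bo " + self.name)
        self.refcount -= 1


class Screen:
    def __init__(self, bo):
        # One reference, reachable through two pointers.
        self.pixmap_bo = bo
        self.front_buffer = bo


def flip(screen, back, update_front_buffer):
    """Make `back` the scanout buffer and release the old one."""
    old = screen.pixmap_bo
    screen.pixmap_bo = back.reference()
    if update_front_buffer:
        # Keep the second pointer in step with the pixmap.
        screen.front_buffer = screen.pixmap_bo
    old.unreference()


def resize(screen):
    """xrandr-style teardown: release whatever front_buffer points at."""
    screen.front_buffer.unreference()


def survives(update_front_buffer):
    screen = Screen(Bo("front"))
    flip(screen, Bo("back"), update_front_buffer)
    try:
        resize(screen)
    except AssertionError:
        return False
    return True
```

In this model `survives(False)` fails the refcount check because `front_buffer` still points at the released object, while `survives(True)` does not, which is the general shape of point 2 in the commit message.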
Going to reopen this bug because "glresize" still causes horrible things to happen. If you'd rather I file different bugs instead, I can do that.

I updated libdrm/xf86-video-intel to latest git.

The good news is that the following issues appear to be fixed:
* it is possible to use xrandr with glresize without killing the server
* the "OOM killer rampage" no longer occurs, so I see the same behaviours with and without overcommit.

The bad news is that I can still:
* Segfault the server, by setting LIBGL_ALWAYS_INDIRECT=1 and running glresize (I can post the backtrace if necessary).
* Panic the kernel (as described earlier) by running glresize enough times. This now occurs regardless of the vm_overcommit setting.
(In reply to comment #28)
> Going to reopen this bug because "glresize" still causes horrible things to
> happen. If you'd rather I file different bugs instead, I can do that.

Please file separate bugs, since the original bug has been fixed, and there are already many comments accumulated here.

> I updated libdrm/xf86-video-intel to latest git.
>
> The good news is that the following issues appear to be fixed:
> * it is possible to use xrandr with glresize without killing the server
> * the "OOM killer rampage" no longer occurs, so I see the same behaviours
> with and without overcommit.
>
> The bad news is that I can still:
> * Segfault the server, by setting LIBGL_ALWAYS_INDIRECT=1 and running
> glresize (I can post the backtrace if necessary).

Maybe bug#27842?

> * Panic the kernel (as described earlier) by running glresize enough times.
> This now occurs regardless of the vm_overcommit setting.

Please file a new bug with log messages.
(In reply to comment #29)
> > The bad news is that I can still:
> > * Segfault the server, by setting LIBGL_ALWAYS_INDIRECT=1 and running
> > glresize (I can post the backtrace if necessary).
>
> Maybe bug#27842?

That bug doesn't describe a server segfault, so probably not.

OK, I have filed both remaining issues as bug 28079 and bug 28080.
What about the OpenTTD 'orange box'? Separate bug as well?

PS. The patch seems to fix several 'bugs' I've been experiencing.
(In reply to comment #27)
> Pushed:
>
> commit 9f54107f866a25cf670f81f7c52b8c108728c6a5
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue May 11 14:55:16 2010 +0100
>

This commit causes a severe regression with the compositing manager in kwin (KDE SC 4.4.3). Any composited redraw leads to severe flickering in the redrawn area and sometimes remnants of previous frames remain visible.
I filed bug 28097 for the page flipping regression.
This should cover the page flip regression introduced in the earlier patch:

commit 030d56279bf14d9ddd42d8fdbeaa66ef3f557b4d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri May 14 16:53:40 2010 +0100

    drm: don't overwrite the old intel->front_buffer

    It's now handled in the common ExchangeBuffers() path.
(In reply to comment #34)
> This should cover the page flip regression introduced in the earlier path:
>
> commit 030d56279bf14d9ddd42d8fdbeaa66ef3f557b4d
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Fri May 14 16:53:40 2010 +0100

Fixed. Marking bug as resolved again.