Bug 27922 - i965: Rapidly resizing OpenGL window causes GPU to hang.
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Jesse Barnes
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 27601
Depends on:
Blocks:
 
Reported: 2010-04-30 18:09 UTC by Nick Bowler
Modified: 2017-07-24 23:08 UTC
CC List: 5 users

See Also:
i915 platform:
i915 features:


Attachments
Demo program which rapidly resizes an opengl window. (2.41 KB, text/plain)
2010-05-02 16:26 UTC, Nick Bowler
netconsole log from kernel panic/lockup (2.71 KB, text/plain)
2010-05-02 16:31 UTC, Nick Bowler
Demo program which rapidly resizes an opengl window. (2.52 KB, text/plain)
2010-05-02 18:36 UTC, Nick Bowler
Resize, plain X (2.36 KB, text/plain)
2010-05-10 13:10 UTC, Chris Wilson
glresize (2.39 KB, patch)
2010-05-10 14:35 UTC, Chris Wilson
Handle reference counting across page flipping. (5.25 KB, patch)
2010-05-11 07:04 UTC, Chris Wilson
Handle reference counting across page flipping. (6.39 KB, patch)
2010-05-11 11:13 UTC, Chris Wilson

Description Nick Bowler 2010-04-30 18:09:41 UTC
I discovered something weird today which is a somewhat pathological use case but
nevertheless ends in catastrophe.  I have a key binding defined in my window
manager to "fullscreen" a window, resizing the window so that the client area
fills the entire screen.  The same key also restores the window to its original
shape.

The problem occurs when I run glxgears (or anything else) and then _hold down_
this button.  This causes the window to switch back and forth between large and
small sizes extremely rapidly.  After some time of doing this (anywhere from ~5
seconds to a minute or so), the GPU hangs, requiring a reboot to recover.
Incidentally, if I stop before the GPU hangs, closing X does not restore the
console correctly.  Kernel log contains the usual spam:

  [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
  render error detected, EIR: 0x00000000
  [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 160037 at 160033)
  ... ad infinitum

Occasionally there is visible corruption on the screen before and/or after the
hang.

If I ssh into the machine and take a look at memory usage, I can see the page
cache continuously increasing in size until all available memory is consumed,
at which point the memory is suddenly released and the process repeats (until
it eventually hangs).

I'm using: server 1.8.0, mesa 7.8.1 (also happens with git master),
xf86-video-intel-2.11 on a ThinkPad T500 with a GM45, linux 2.6.34-rc6 (also
happens with 2.6.33).  I'm not sure if it's useful, but I've also attached the
contents of i915_error_state from debugfs after a hang.
Comment 1 Nick Bowler 2010-05-02 16:26:07 UTC
Created attachment 35380 [details]
Demo program which rapidly resizes an opengl window.

So here's a demo program which doesn't involve any window manager interaction.
GDK and GtkGLExt are required.  With the appropriate packages installed,

 gcc -o glresize glresize.c `pkg-config --cflags --libs gdk-2.0 gtkglext-1.0`

should build the program.  It seems that GPU hangs as in the initial report are
relatively rare, but I have been able to trigger one.  Nevertheless, this
program is capable of wreaking untold destruction upon a system.

The most severe problems appear to only happen with double buffering.  The
program will use double buffering unless you pass --single to it.  Depending
on how I run it, I can cause various kinds of horrible effects:

  * prevent X from restoring the console correctly
      this happens in every case except when the server segfaults or
      (trivially) the system locks up.

  * invoke an OOM killer rampage.
      happens with direct rendering and vm overcommit enabled.

  * hard-lock the kernel.
      happens with direct rendering, and overcommit disabled, sometimes takes
      a few tries...

  * cause the X server to segfault.
      happens instantly with indirect rendering.

The program works perfectly (meaning it does not negatively affect the system)
on an r600 card (direct and indirect rendering), or with the software
rasterizer.
Comment 2 Nick Bowler 2010-05-02 16:31:04 UTC
Created attachment 35381 [details]
netconsole log from kernel panic/lockup

Here's the netconsole output from the kernel lockup mentioned above.  This is with latest Linus' git, but similar behaviour occurs with 2.6.33.  The output was truncated, presumably due to the lockup.  The system ceases to respond to sysrq.

Sometimes the program segfaults instead of locking up, so it can take multiple tries to actually trigger the panic.  Also, the system occasionally gets into a state where the program segfaults immediately, at which point the X server needs to be restarted.  Nevertheless, the lockup will occur eventually.
Comment 3 Nick Bowler 2010-05-02 16:33:46 UTC
Oops, I meant to mention in the last comment -- disable vm overcommit to trigger the lockup:

  echo 2 > /proc/sys/vm/overcommit_memory

otherwise you get the "OOM killer rampage" instead of a panic.
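To check which overcommit policy a test machine is actually running under before trying to reproduce, a minimal C sketch follows. The function names are hypothetical; the three mode values (0, 1, 2) are the ones the Linux kernel documents for vm.overcommit_memory.

```c
#include <stdio.h>

/* Map a vm.overcommit_memory value to a short description.
 * Function name is hypothetical; the modes are the kernel's documented ones. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit";
    case 1:  return "always overcommit";
    case 2:  return "strict accounting, no overcommit";
    default: return "unknown";
    }
}

/* Read the current mode from the given procfs path.
 * Returns -1 if the file cannot be opened or parsed. */
int read_overcommit_mode(const char *path)
{
    FILE *f = fopen(path, "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Calling read_overcommit_mode("/proc/sys/vm/overcommit_memory") and passing the result to overcommit_mode_name() should report "strict accounting, no overcommit" after the echo above has been run.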
Comment 4 Nick Bowler 2010-05-02 18:36:37 UTC
Created attachment 35382 [details]
Demo program which rapidly resizes an opengl window.

I added a new option to the demo program, --offset, which adjusts the window
position by one pixel in each direction.  Apparently the window position is
important: the kernel does not panic if you use this option (it's still
possible to segfault the server or cause the oom killer rampage).

I also noticed that running xrandr to change the screen size after or during
execution of the test case will cause the server to immediately die with an
assertion failure:

  X: intel_bufmgr_gem.c:900: drm_intel_gem_bo_unreference_locked_timed: Assertion `((&bo_gem->refcount)->atomic) > 0' failed.

This assertion failure also does not occur if the --offset option is given.
Comment 5 Jesse Barnes 2010-05-10 11:44:49 UTC
Thanks a lot for the test cases, we'll take a look.
Comment 6 Chris Wilson 2010-05-10 13:10:17 UTC
Created attachment 35554 [details]
Resize, plain X

Hmm, because most of my headless boxes don't have gtk+ available (just compiling mesa takes up most of their local hard drive space! ;-) this is a pure X variant of glresize.

Nick, does this still exercise the bug you are seeing?

On my systems it rapidly allocates lots of memory that is reclaimed when the cache is reaped, but I haven't seen any anomalous behaviour (even with the original glresize).
Comment 7 Chris Wilson 2010-05-10 13:25:51 UTC
glresize + xrandr == death.

GPU hang:
batchbuffer at 0x02c68000:
0x02c68000:      0x09000000: MI_LOAD_SCAN_LINES_INCL
0x02c68004:      0x00000258:    dword 1
0x02c68008:      0x09000000: MI_LOAD_SCAN_LINES_INCL
0x02c6800c:      0x00000258:    dword 1
0x02c68010:      0x01820000: MI_WAIT_FOR_EVENT
0x02c68014: HEAD 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x02c68018:      0x03cc0380:    format 8888, dst pitch 896, clipping disabled
0x02c6801c:      0x00000000:    dst (0,0)
0x02c68020:      0x02580320:    dst (800,600)
0x02c68024:      0x098d0000:    dst offset 0x098d0000
0x02c68028:      0x00000000:    src (0,0)
0x02c6802c:      0x00000400:    src pitch 1024
0x02c68030:      0x094d0000:    src offset 0x094d0000
0x02c68034:      0x05000000: MI_BATCH_BUFFER_END

The blit at least looks valid, so what might be up with the WAIT?
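For readers following the dump, here is a minimal C sketch of how the raw dwords map to the decoded coordinates and pitch shown beside them. The field layout is inferred only from the values printed in this dump (x and pitch in the low 16 bits, y in the high 16 bits), not taken from hardware documentation. The type and function names are hypothetical, and the fields are treated as unsigned for simplicity.

```c
#include <stdint.h>

/* A coordinate pair as printed by the decoder, e.g. "dst (800,600)". */
struct blt_point {
    unsigned x;
    unsigned y;
};

/* Assumed packing, inferred from the dump: x in bits 0-15, y in bits 16-31. */
struct blt_point blt_decode_point(uint32_t dword)
{
    struct blt_point p;

    p.x = dword & 0xffff;
    p.y = dword >> 16;
    return p;
}

/* Assumed packing, inferred from the dump: pitch in bits 0-15. */
unsigned blt_decode_pitch(uint32_t dword)
{
    return dword & 0xffff;
}
```

With this layout, 0x02580320 decodes to (800,600) and the low half of 0x03cc0380 gives a pitch of 896, matching the decoded text in the dump.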
Comment 8 Nick Bowler 2010-05-10 13:30:39 UTC
Cool, I can try out the plain X version later tonight.  It might not be important, but I noticed that your version does not release the GLX context in the loop, whereas the GDK version does.
Comment 9 Daniel Vetter 2010-05-10 13:52:44 UTC
On Mon, May 10, 2010 at 01:25:52PM -0700, bugzilla-daemon@freedesktop.org wrote:
> https://bugs.freedesktop.org/show_bug.cgi?id=27922
> 
> --- Comment #7 from Chris Wilson <chris@chris-wilson.co.uk> 2010-05-10 13:25:51 PDT ---
> glresize + xrandr == death.
> 
> GPU hang:
> batchbuffer at 0x02c68000:
> 0x02c68000:      0x09000000: MI_LOAD_SCAN_LINES_INCL
> 0x02c68004:      0x00000258:    dword 1
> 0x02c68008:      0x09000000: MI_LOAD_SCAN_LINES_INCL
> 0x02c6800c:      0x00000258:    dword 1
> 0x02c68010:      0x01820000: MI_WAIT_FOR_EVENT
> 0x02c68014: HEAD 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src
> tile 1, dst tile 1)
> 0x02c68018:      0x03cc0380:    format 8888, dst pitch 896, clipping disabled
> 0x02c6801c:      0x00000000:    dst (0,0)
> 0x02c68020:      0x02580320:    dst (800,600)
> 0x02c68024:      0x098d0000:    dst offset 0x098d0000
> 0x02c68028:      0x00000000:    src (0,0)
> 0x02c6802c:      0x00000400:    src pitch 1024
> 0x02c68030:      0x094d0000:    src offset 0x094d0000
> 0x02c68034:      0x05000000: MI_BATCH_BUFFER_END
> 
> The blit at least looks valid, so what might be up with the WAIT?

MI_WAIT_FOR_EVENT on a disabled pipe is supposed to hang your chip. At
least it worked that way when I developed the overlay. Dunno what
you're doing with xrandr but I suspect abusing it (in conjunction with tons
of MI_WAIT_FOR_EVENT commands in the bb due to dri2 swap requests) could
result in strange things happening.

I've always been rather uneasy with this whole "userspace can submit
bb commands that depend upon current kms state" thing. Which is why I've
put the core overlay code into the kernel.

Just my 2 cents.
Comment 10 Jesse Barnes 2010-05-10 14:30:44 UTC
Yeah yuck.  So we submit a batch with a wait in it, then mess with the output config, then the batch runs.  Seems like a bad race.  Maybe we do need a kernel way of communicating this instead.
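The ordering described here can be replayed in a toy model. This is only a sketch under stated assumptions: the structures and functions below are hypothetical stand-ins, not driver or kernel code. It encodes just the point from the preceding comments, that a wait on a disabled pipe never completes and that the wait is evaluated when the batch runs rather than when it was built.

```c
#include <stdbool.h>

#define NUM_PIPES 2

/* Hypothetical display state: which pipes are currently enabled. */
struct display {
    bool pipe_enabled[NUM_PIPES];
};

/* Hypothetical batch: waits for an event on one pipe before blitting. */
struct batch {
    int wait_pipe;
};

enum batch_result { BATCH_DONE, BATCH_HUNG };

/* The wait is resolved against the display state at execution time. */
enum batch_result run_batch(const struct display *d, const struct batch *b)
{
    if (!d->pipe_enabled[b->wait_pipe])
        return BATCH_HUNG;      /* the awaited event never arrives */
    return BATCH_DONE;
}

/* Replay the sequence: build the batch, change the outputs, run the batch. */
enum batch_result replay_race(void)
{
    struct display d = { { true, false } };
    struct batch b = { 0 };     /* built while pipe 0 is still enabled */

    d.pipe_enabled[0] = false;  /* output configuration changes */
    return run_batch(&d, &b);   /* batch executes against the new state */
}
```

In this model replay_race() returns BATCH_HUNG, while running the same batch before the configuration change returns BATCH_DONE.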
Comment 11 Chris Wilson 2010-05-10 14:34:17 UTC
Not so simple. Moving the glXMakeCurrent(dpy, ctx, win);...glXMakeCurrent(dpy, 0, 0); into the loop causes the hang to trigger much faster and without xrandr interfering.

The hang is nearly identical:
batchbuffer at 0x028f5000:
0x028f5000:      0x09000000: MI_LOAD_SCAN_LINES_INCL
0x028f5004:      0x00000300:    dword 1
0x028f5008:      0x09000000: MI_LOAD_SCAN_LINES_INCL
0x028f500c:      0x00000300:    dword 1
0x028f5010:      0x01820000: MI_WAIT_FOR_EVENT
0x028f5014: HEAD 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x028f5018:      0x03cc0400:    format 8888, dst pitch 1024, clipping disabled
0x028f501c:      0x00000000:    dst (0,0)
0x028f5020:      0x03000400:    dst (1024,768)
0x028f5024:      0x0202e000:    dst offset 0x0202e000
0x028f5028:      0x00000000:    src (0,0)
0x028f502c:      0x00000400:    src pitch 1024
0x028f5030:      0x024f5000:    src offset 0x024f5000
0x028f5034:      0x05000000: MI_BATCH_BUFFER_END

The dst is the X front buffer, but we should be under complete control of the dri2 buffers here.
Comment 12 Chris Wilson 2010-05-10 14:35:00 UTC
Created attachment 35555 [details] [review]
glresize
Comment 13 Daniel Vetter 2010-05-10 14:38:30 UTC
> --- Comment #10 from Jesse Barnes <jbarnes@virtuousgeek.org> 2010-05-10 14:30:44 PDT ---
> Yeah yuck.  So we submit a batch with a wait in it, then mess with the output
> config, then the batch runs.  Seems like a bad race.  Maybe we do need a kernel
> way of communicating this instead.

Well, I was actually under the impression that a mix of
a) wrestling all this stuff through single-threaded X and
b) quiescent gem execution before modesetting in the kernel
should prevent such races (with the current code). Looks like I was wrong.

Compositing window managers should get rid of this problem, anyway ;)
Comment 14 Nick Bowler 2010-05-10 19:21:15 UTC
I don't know if this is still useful information, but I finally got around to testing the non-GDK version of my test case (I didn't bother with the original that doesn't call glXMakeCurrent every iteration): every failure mode listed in earlier comments is reproducible with this test case on my machine.
Comment 15 Chris Wilson 2010-05-11 02:08:47 UTC
*** Bug 27601 has been marked as a duplicate of this bug. ***
Comment 16 Chris Wilson 2010-05-11 02:32:37 UTC
(In reply to comment #14)
> I don't know if this is still useful information, but I finally got around to
> testing the non-GDK version of my test case (I didn't bother with the original
> that doesn't call glXMakeCurrent every iteration): every failure mode listed in
> earlier comments is reproducible with this test case on my machine.

Thanks Nick for confirming the test case hits the same bugs. So far on g45 it causes the WAIT hang and on i945 I hit the assertion failure in dri_bo_unreference. What fun!
Comment 17 Maciej Piechotka 2010-05-11 02:41:51 UTC
Regarding whether bug 27601 is a duplicate: it looks like it is this bug. However, it is on i915, not i965. I had such messages on the console. It also looks like https://bugzilla.kernel.org/show_bug.cgi?id=15463 is the same problem, and possibly https://bugzilla.kernel.org/show_bug.cgi?id=15737 (however, I couldn't even ping the computer, and it depends on the kernel version).
Comment 18 Chris Wilson 2010-05-11 03:00:23 UTC
The reference counting error with xrandr + activity is generic, so I expect that the assertion could be hit on any Intel chipset. The GPU hang looks related - the theory is that userspace has issued a WAIT on a non-existent buffer (presumably due to the reference counting error), however this symptom will vary between chipsets as they use different mechanisms + code.
Comment 19 Chris Wilson 2010-05-11 07:04:09 UTC
Created attachment 35567 [details] [review]
Handle reference counting across page flipping.

This seems to survive the beating on my i945 box.
Comment 20 Chris Wilson 2010-05-11 07:09:03 UTC
g45 still hangs.
Comment 21 Chris Wilson 2010-05-11 07:36:10 UTC
(In reply to comment #20)
> g45 still hangs.

That hang appears to be due to it being a headless system, and waiting on the disabled pipe.

On my gm45 laptop I hit a completely different assertion.
Comment 22 Maciej Piechotka 2010-05-11 07:53:01 UTC
(In reply to comment #19)
> Created an attachment (id=35567) [details]
> Handle reference counting across page flipping.
> 
> This seems to survive the beating on my i945 box.

It crashes my xorg-server (gdm enters a loop of X restarts). I'll post full details as soon as possible.
Comment 23 Chris Wilson 2010-05-11 07:56:58 UTC
Fixed the assertion, survives on my gm45.

Maciej, xf86-video-intel.git requires tip of libdrm.git as well (in case you updated one without the other...)
Comment 24 Maciej Piechotka 2010-05-11 09:14:32 UTC
(In reply to comment #23)
> Fixed the assertion, survives on my gm45.
> 
> Maciej, xf86-video-intel.git requires tip of libdrm.git as well (in case you
> updated one without the other...)

OK. I managed to start X (I rebuilt the server against libdrm from git just in case), but if I start openttd the screen is black except for a small orange rectangle at the center, with a mouse pointer that does not move. If I kill openttd from the console, all is OK.

The compositing WM works OK; I haven't run into problems so far.
Comment 25 Chris Wilson 2010-05-11 10:30:51 UTC
Probable duplicate: bug 27285.
Possibly duplicate: bug 24884?
Comment 26 Chris Wilson 2010-05-11 11:13:10 UTC
Created attachment 35569 [details] [review]
Handle reference counting across page flipping.

Talking the code through with Jesse, I discovered the separate event handling for FLIP complete notification, so we do need to call ExchangeBuffers when successfully scheduling the page flip as well.
Comment 27 Chris Wilson 2010-05-12 13:41:24 UTC
Pushed:

commit 9f54107f866a25cf670f81f7c52b8c108728c6a5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 11 14:55:16 2010 +0100

    dri2: Handle reference counting across page flipping
    
    1. Instead of swapping bos, swap the entire private structure.
    
    2. If we update the pixmap bo for the Screen, make sure we update the
    reference inside intel->front_buffer so that xrandr still functions.
    
    Fixes:
    
      Bug 27922 - i965: Rapidly resizing OpenGL window causes GPU to hang.
      https://bugs.freedesktop.org/show_bug.cgi?id=27922
    
My testing seems to indicate this is stable and the leak has gone; I think the leak was a separate issue fixed a while ago.
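The second point of the commit message can be illustrated with a toy reference-counting model. This is a sketch only: the structures and function names are hypothetical and greatly simplified, not the actual xf86-video-intel or libdrm code. It shows the general rule that when the screen pixmap is pointed at a new buffer object, any cached pointer holding its own reference must be moved along with it, otherwise a later unreference can hit an assertion like the one quoted in comment 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GEM buffer object. */
struct bo {
    int refcount;
};

void bo_reference(struct bo *bo)
{
    bo->refcount++;
}

void bo_unreference(struct bo *bo)
{
    assert(bo->refcount > 0);   /* same shape as the libdrm assertion */
    bo->refcount--;
}

/* Hypothetical driver state: the screen pixmap's bo plus a cached
 * front_buffer pointer that holds its own reference. */
struct screen {
    struct bo *pixmap_bo;
    struct bo *front_buffer;
};

/* Point the screen at a new bo and move the cached reference with it.
 * Taking the new reference first keeps this safe when new_bo is already
 * the current front buffer. */
void screen_set_front(struct screen *s, struct bo *new_bo)
{
    bo_reference(new_bo);
    if (s->front_buffer != NULL)
        bo_unreference(s->front_buffer);
    s->pixmap_bo = new_bo;
    s->front_buffer = new_bo;
}
```

If only pixmap_bo were updated, front_buffer would keep pointing at the old buffer, and a later mode change that releases it could drop a reference that was never taken.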
Comment 28 Nick Bowler 2010-05-12 15:38:44 UTC
Going to reopen this bug because "glresize" still causes horrible things to happen.  If you'd rather I file different bugs instead, I can do that.

I updated libdrm/xf86-video-intel to latest git.

The good news is that the following issues appear to be fixed:
  * it is possible to use xrandr with glresize without killing the server
  * the "OOM killer rampage" no longer occurs, so I see the same behaviours with and without overcommit.

The bad news is that I can still:
  * Segfault the server, by setting LIBGL_ALWAYS_INDIRECT=1 and running glresize (I can post the backtrace if necessary).
  * Panic the kernel (as described earlier) by running glresize enough times.  This now occurs regardless of the vm_overcommit setting.
Comment 29 Gordon Jin 2010-05-12 19:22:14 UTC
(In reply to comment #28)
> Going to reopen this bug because "glresize" still causes horrible things to
> happen.  If you'd rather I file different bugs instead, I can do that.

Please file separate bugs, since the original bug has been fixed, and there're already many comments accumulated here.
 
> I updated libdrm/xf86-video-intel to latest git.
> 
> The good news is that the following issues appear to be fixed:
>   * it is possible to use xrandr with glresize without killing the server
>   * the "OOM killer rampage" no longer occurs, so I see the same behaviours
> with and without overcommit.
> 
> The bad news is that I can still:
>   * Segfault the server, by setting LIBGL_ALWAYS_INDIRECT=1 and running
> glresize (I can post the backtrace if necessary).

Maybe bug#27842?

>   * Panic the kernel (as described earlier) by running glresize enough times. 
> This now occurs regardless of the vm_overcommit setting.

Please file a new bug with log messages.
Comment 30 Nick Bowler 2010-05-12 20:31:19 UTC
(In reply to comment #29)
> > The bad news is that I can still:
> >   * Segfault the server, by setting LIBGL_ALWAYS_INDIRECT=1 and running
> > glresize (I can post the backtrace if necessary).
> 
> Maybe bug#27842?

That bug doesn't describe a server segfault, so probably not.

OK, I have filed both remaining issues as bug 28079 and bug 28080.
Comment 31 Maciej Piechotka 2010-05-12 22:45:52 UTC
What about the OpenTTD 'orange box'? Separate bug as well?

PS. The patch seemed to fix several 'bugs' I've been experiencing.
Comment 32 Magnus Kessler 2010-05-12 23:44:05 UTC
(In reply to comment #27)
> Pushed:
> 
> commit 9f54107f866a25cf670f81f7c52b8c108728c6a5
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue May 11 14:55:16 2010 +0100
> 

This commit causes a severe regression with the compositing manager in kwin (KDE SC 4.4.3). Any composited redraw leads to severe flickering in the redrawn area and sometimes remnants of previous frames remain visible.
Comment 33 Brian Rogers 2010-05-13 22:56:08 UTC
I filed bug 28097 for the page flipping regression.
Comment 34 Chris Wilson 2010-05-14 09:32:45 UTC
This should cover the page flip regression introduced in the earlier patch:

commit 030d56279bf14d9ddd42d8fdbeaa66ef3f557b4d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 14 16:53:40 2010 +0100

    drm: don't overwrite the old intel->front_buffer
    
    It's now handled in the common ExchangeBuffers() path.
Comment 35 Magnus Kessler 2010-05-15 00:02:59 UTC
(In reply to comment #34)
> This should cover the page flip regression introduced in the earlier patch:
> 
> commit 030d56279bf14d9ddd42d8fdbeaa66ef3f557b4d
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri May 14 16:53:40 2010 +0100

Fixed. Marking bug as resolved again.

