Bug 29007

Summary: GPU hang on video playback with overlay
Product: xorg Reporter: Marcel Beister <kingofthehill3>
Component: Driver/intel Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: daniel, tmezzadra
Version: 7.5 (2009.10)   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
dmesg.log after a gpu hang | none
Xorg.log after a gpu hang | none
i915_error_state from debugfs after gpu hang | none
output of intel_gpu_dump after gpu hang | none
output of intel_reg_dumper after gpu hang | none
Capture overlay status for GPU hangs | none
i915_error_state with overlay registers patch after gpu hang | none
i915_error_state with overlay registers and stride patch after gpu hang | none
i915_error_state with patched driver (see comment 22) | none

Description Marcel Beister 2010-07-11 09:01:17 UTC
Created attachment 36947 [details]
dmesg.log after a gpu hang

I'm trying to play videos on my Thinkpad X30 with 830M chipset and XV option enabled. Sometimes this works without any problems, but often it fails with a gpu hang message. To reproduce this behavior, I just have to start mplayer -vo xv multiple times, or I can minimize and maximize the mplayer window multiple times.

I'm using Debian with the intel-xorg package 2.12.0 from experimental. I had the same problem with 2.9 and 2.11 of the intel driver. The libdrm version is 2.4.21.
Comment 1 Marcel Beister 2010-07-11 09:02:16 UTC
Created attachment 36948 [details]
Xorg.log after a gpu hang
Comment 2 Marcel Beister 2010-07-11 09:04:45 UTC
Created attachment 36949 [details]
i915_error_state from debugfs after gpu hang
Comment 3 Marcel Beister 2010-07-11 09:07:09 UTC
This problem seems to be related to the KMS options, since 2.9 without KMS seems to work fine with XV video playback.
Comment 4 Chris Wilson 2010-07-11 10:25:21 UTC
0x01810000 MI_WAIT_FOR_EVENT strikes again on the ringbuffer.
Comment 5 Marcel Beister 2010-07-11 10:29:51 UTC
If there is anything I can do to help fix this bug, please let me know. I'm using kernel 2.6.34.1 btw.
Comment 6 Chris Wilson 2010-07-12 02:44:26 UTC
Marcel, what is the timing of the hang? I think I can see a potential race condition between the overlay being turned off and torn-down whilst it is still referenced by the ring and so the GPU might wander off into invalid memory. Does the GPU hang during normal playback, or after an event [such as closing the video]?
Comment 7 Marcel Beister 2010-07-12 02:56:02 UTC
The gpu hang CAN happen either when I start the video or if I do something like minimize/maximize the mplayer window. Also moving another window on top of the mplayer window or switching between full screen and windowed mode CAN trigger the error. It NEVER happened on closing the mplayer window or during normal playback without any user interaction.
Comment 8 Chris Wilson 2010-07-12 05:09:10 UTC
*** Bug 26818 has been marked as a duplicate of this bug. ***
Comment 9 Chris Wilson 2010-07-12 15:07:43 UTC
Can you grab an intel_reg_dump after an overlay hang? I'll also put that onto my i915_error_state improvement TODO list, but for the time being you'll need to run it by hand (and hope the registers have not been clobbered...)
Comment 10 Marcel Beister 2010-07-13 02:23:51 UTC
Created attachment 36988 [details]
output of intel_gpu_dump after gpu hang
Comment 11 Marcel Beister 2010-07-13 02:26:32 UTC
Hopefully the output of intel_gpu_dump is the thing you asked for. The guide for an intel_reg_dump seems to be out of date (http://intellinuxgraphics.org/intel_reg_dumper.html) since there is no intel_reg_dumper tool in the current intel-gpu-tools (1.0.2).
Comment 12 Chris Wilson 2010-07-13 02:51:16 UTC
In this case I do need intel_reg_dumper as I need to check the overlay registers to work out what the gpu is waiting on.
Comment 13 Marcel Beister 2010-07-13 03:03:51 UTC
Created attachment 36990 [details]
output of intel_reg_dumper after gpu hang
Comment 14 Chris Wilson 2010-07-13 03:11:00 UTC
/me swears.

Sorry about that, it appears not even intel_reg_dumper includes the overlay registers. Today is not going well.
Comment 15 Chris Wilson 2010-07-14 04:05:48 UTC
Created attachment 37034 [details] [review]
Capture overlay status for GPU hangs

Can you please apply this patch (applies cleanly to 2.6.35-rc5 at least) and see if we capture more information in /sys/kernel/debug/dri/0/i915_error_state following a GPU hang?
Comment 16 Marcel Beister 2010-07-14 12:57:38 UTC
Created attachment 37053 [details]
i915_error_state with overlay registers patch after gpu hang
Comment 17 Marcel Beister 2010-07-14 13:01:16 UTC
I could apply the patch to the 2.6.34.1 kernel without problems and attached the new i915_error_state output from debugfs.
Comment 18 Chris Wilson 2010-07-14 13:25:02 UTC
Hmm, Marcel that error state indicates that we used a Y stride of 640 bytes. The docs mention an erratum for i830 and i845 that strides should be a multiple of 256 bytes. This is fixed in xf86-video-intel.git. However, I do not think that this is the only bug at play here...

commit 3a7c25ff8ddd45c9d9eca5cc2228552847ca9e7d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 12 19:47:46 2010 +0100

    video: Apply overlay stride errata for i830 and i845
    
    Due to an erratum on these chipsets, the overlay stride must be a
    multiple of 256 bytes.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
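
A minimal sketch of the rounding this erratum implies, assuming a hypothetical helper name `overlay_align_stride`; it is not the code from the commit above, only an illustration of rounding a stride up to a multiple of 256 bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Value quoted in the commit message: on i830 and i845 the overlay stride
 * must be a multiple of 256 bytes. */
#define OVERLAY_STRIDE_ALIGN 256u

/* Hypothetical helper (not the driver's function): round a stride in bytes
 * up to the next multiple of `align`, which must be a power of two. */
static uint32_t overlay_align_stride(uint32_t stride, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (stride + align - 1) & ~(align - 1);
}
```

With the 640-byte Y stride seen in the error state, `overlay_align_stride(640, OVERLAY_STRIDE_ALIGN)` gives 768, while a stride that is already a multiple of 256 is left unchanged.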
Comment 19 Marcel Beister 2010-07-15 02:13:33 UTC
Created attachment 37064 [details]
i915_error_state with overlay registers and stride patch after gpu hang

I patched the 2.12.0 driver from Debian with the changes from "Apply overlay stride errata for i830 and i845" and rebuilt the driver (the patch was applied to i830_video.c instead of intel_video.c; I hope this was the right place!). Unfortunately the gpu still hangs, although this time I needed 5 attempts to trigger the error. Attached is the i915_error_state log after the crash with the patched driver.
Comment 20 Chris Wilson 2010-07-15 03:13:00 UTC
Probably just a coincidence... ;-)

The hangs seem to occur at the start of an OVERLAY sequence. Maybe we are missing some wait or writing registers in the wrong sequence.
Comment 21 Chris Wilson 2010-07-16 02:51:04 UTC
Not a fix, but a substantial performance improvement:

commit 24bdfe0d5eb4e890e9c63bbb4617efaa0768ab7f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 15 13:54:04 2010 +0100

    video: Reuse the old buffers.
    
    After passing the new buffer to the kernel, the old buffer is unpinned
    and becomes available for re-use. So keep hold of the old buffer and
    swap after a PutImage. This greatly reduces the amount of CPU time
    consumed by the kernel on behalf of the video overlay -- by only
    allocating two buffers for an entire sequence, we avoid clflushing and
    page allocation on every frame.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

And you never know, it may help! :) So far, I haven't been able to reproduce this hang on i9xx.
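
The buffer swap described in that commit message could look roughly like the sketch below. The struct and function names are hypothetical stand-ins, not the driver's own types; the point is only that two buffers alternate instead of a fresh one being allocated per frame:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a GEM buffer object. */
struct video_buf { int id; };

/* Hypothetical per-port state: the buffer used for the current frame and
 * the one released by the previous frame. */
struct video_port {
    struct video_buf *buf;      /* handed to the kernel for the current frame */
    struct video_buf *old_buf;  /* unpinned after the last frame, free for reuse */
};

/* After a PutImage has been submitted, exchange the two buffers so the next
 * frame is written into the one that has just been unpinned. */
static void video_port_swap(struct video_port *port)
{
    struct video_buf *tmp;

    assert(port != NULL);
    tmp = port->old_buf;
    port->old_buf = port->buf;
    port->buf = tmp;
}
```

Calling `video_port_swap()` twice returns the port to its starting state, so an entire playback sequence only ever touches the same two buffers.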
Comment 22 Marcel Beister 2010-07-16 06:24:10 UTC
I patched 2.12.0-1 from Debian with
- video: Free the buffers immediately after turning off.
- video: Reuse the old buffers.
- video: Apply overlay stride errata for i830 and i845

Unfortunately this did not improve my problem. Should I attach a new i915_error_state.log from the patched driver?

If I find some time on this weekend I'll try to build a driver directly from the git repo to simplify testing.
Comment 23 Marcel Beister 2010-07-16 06:29:01 UTC
I currently have another old notebook on my desk (Dell Latitude C400) which is also i830 based. I will try to reproduce the error on that machine to rule out video-bios-or-whatever problems.

Btw. the Thinkpad X30 also seems to have some problems with the monitor detection, but I will put this in a separate bug report ;-)
Comment 24 Chris Wilson 2010-07-16 06:32:22 UTC
Yes, let's have a look at the recent error state and see if anything changed.

I've also reviewed the kernel overlay code, just a couple of niggles but I haven't found anything that is in glaring conflict with the docs. If you do have the time, I'd appreciate it if you could try:

http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fair-eviction
Comment 25 Marcel Beister 2010-07-16 06:41:09 UTC
Created attachment 37125 [details]
i915_error_state with patched driver (see comment 22)
Comment 26 Chris Wilson 2010-07-16 07:05:00 UTC
The error is the same, so at least the new intel_video.c code doesn't regress! :)

Looking at the error, I've realised that the wait there is superfluous. The overlay will switch-on on the next vblank, and waiting for that makes no difference and is not required by the API. So we can proceed as normal and stall on the next upload if the buffer is still busy. This still means that we will detect gpu hangs as normal and not block indefinitely.

So, I am quite optimistic that the tweaked overlay driver will avoid this... Of course, the hang may just be detected later...
Comment 27 Chris Wilson 2010-07-16 07:52:16 UTC
Look what I found in the old ums overlay code:

/*
 * On I830, if pipe A is off when the overlay is enabled, it will fail to
 * turn on and blank the entire screen or lock up the ring. Light up pipe
 * A in this case to provide a clock for the overlay hardware
 */

Hmm.
Comment 28 Marcel Beister 2010-07-16 08:51:15 UTC
That sounds promising ;-)
Comment 29 Chris Wilson 2010-07-16 09:17:15 UTC
I've put my first attempt at applying the ums workaround up in the fair-eviction branch:

commit 5a9dda875800a4702f45e9365bdb692c3a5c18ea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 16 17:13:01 2010 +0100

    drm/i915/overlay: Workaround i830 overlay activation bug.
    
    On i830, there exists a bug where an overlay on pipe B requires the mode
    clock on pipe A in order to activate. So workaround this by activating
    pipe A when trying to enable the overlay on pipe B.
    
    References:
    
      [Bug 29007] GPU hang on video playback with overlay
      https://bugs.freedesktop.org/show_bug.cgi?id=29007
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That branch requires quite a bit of work to get it ready to upstream, but I'd appreciate it if you do get a chance to test it. Thanks.
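
As a rough model of the workaround described in that commit message (not the kernel code itself), the logic might be sketched as follows. Every name here is hypothetical and the display state is reduced to a few flags:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pipe { PIPE_A = 0, PIPE_B = 1 };

/* Hypothetical model of the display state; these are not the i915 structures. */
struct display {
    bool is_i830;
    bool pipe_on[2];
    bool overlay_on;
    bool pipe_a_forced;  /* set when pipe A was lit only for the overlay */
};

/* Enable the overlay on pipe `which`. On i830, if the overlay goes onto
 * pipe B while pipe A is off, light up pipe A first so that its clock is
 * available to the overlay hardware. */
static void overlay_enable(struct display *d, enum pipe which)
{
    assert(d != NULL);

    if (d->is_i830 && which == PIPE_B && !d->pipe_on[PIPE_A]) {
        d->pipe_on[PIPE_A] = true;
        d->pipe_a_forced = true;
    }
    d->overlay_on = true;
}
```

Recording `pipe_a_forced` is an assumption of this sketch; it would let a matching disable path switch pipe A off again once the overlay is turned off.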
Comment 30 Marcel Beister 2010-07-18 05:54:06 UTC
I compiled and tested the following combination:
- master branch of git://anongit.freedesktop.org/xorg/driver/xf86-video-intel
- fair-eviction branch of git://anongit.freedesktop.org/~ickle/linux-2.6

With this I noticed:
- when the kernel has finished and the init-scripts are started (and the resolution has already changed), the display turns off for about 3 seconds, but comes back
- when starting X, the driver no longer seems to detect a non-existing monitor on the external vga-connector
- the overlay bug is still there, but this time the system freezes completely :-(

Any suggestions on further debugging with my system? I'll test the same configuration on the Dell laptop asap, maybe things are different there.
Comment 31 Chris Wilson 2010-07-18 06:00:44 UTC
Ouch. Can you reproduce any of these symptoms on a vanilla 2.6.35-rc5? I think somewhere in that branch is a bug which hangs the system when X closes, but that miraculously disappears when I try to bisect.

If any are reproducible in 2.6.35-rc5, then we need to bisect quickly in order to make sure the regression doesn't get into 2.6.35. Thanks.
Comment 32 Marcel Beister 2010-07-19 02:58:14 UTC
Vanilla 2.6.35-rc5 behaves similarly to the fair-eviction branch BUT overlay playback does NOT result in a complete system freeze. I already checked if the "drm/i915/overlay: Workaround i830 overlay activation bug" fix causes this, but that's not the case. I'll try to bisect the commit which is responsible for the freeze on my system.
Comment 33 Chris Wilson 2010-07-19 03:14:36 UTC
At this moment in time, I am more concerned about the changes in behaviour for 2.6.35-rc5, in particular the 3 second delay and the loss of external monitor detection.

Finding the bug in the overlay code is a bonus. :)
Comment 34 Marcel Beister 2010-07-19 04:22:58 UTC
Concerning the external monitor: 
This is actually something good, because with 2.6.34 the driver often detects an external monitor, which does NOT exist.
Comment 35 Chris Wilson 2010-07-19 04:27:32 UTC
On Mon, 19 Jul 2010 04:22:59 -0700 (PDT), Marcel Beister <kingofthehill3@gmx.de>
> Concerning the external monitor: 
> This is actually something good, because with 2.6.34 the driver often detects
> an external monitor, which does NOT exist.

Thanks for the clarification. One less thing for me to worry about. :)

(I wonder whether replying via email actually works now... Would make my
life much easier, with a touch more integration.)
Comment 36 Marcel Beister 2010-07-20 01:27:36 UTC
Yesterday I went back to commit 12de1f6d6957f3240a6e29fb16d6417a951136b2, but the system still freezes on overlay playback. How far am I supposed to go back in the fair-eviction branch or in other words, what's the difference between that branch and vanilla 2.6.35-rc5?

I also tested overlay on the Dell laptop btw. and the behavior is exactly identical (some display on-off flickers, freeze on overlay playback, ...).
Comment 37 Chris Wilson 2010-07-20 01:50:45 UTC
> --- Comment #36 from Marcel Beister <kingofthehill3@gmx.de> 2010-07-20 01:27:36 PDT ---
> Yesterday I went back to commit 12de1f6d6957f3240a6e29fb16d6417a951136b2, but
> the system still freezes on overlay playback. How far am I supposed to go back
> in the fair-eviction branch or in other words, what's the difference between
> that branch and vanilla 2.6.35-rc5?

The trick is to let git figure it out for you:
# git bisect start
# git bisect good v2.6.35-rc5
# git bisect bad 12de1f6d6957f3240a6e29fb16d6417a951136b2

Every time you get a hang, git bisect bad, and recompile. If it passes,
git bisect good and recompile.

So it sounds like the regression is very early in the sequence. Useful to
know, thanks.
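
The search that git performs here is essentially a binary search between the known-good and known-bad commits. A minimal sketch of that idea, assuming a linear history and a hypothetical test callback (real git bisect also copes with branching history):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test callback: returns nonzero if the build at index `i`
 * hangs. Index 0 is the known-good commit, index n-1 the known-bad one. */
typedef int (*is_bad_fn)(size_t i, void *ctx);

/* Return the index of the first bad commit, assuming that every commit
 * after the first bad one is also bad. */
static size_t bisect_first_bad(size_t n, is_bad_fn is_bad, void *ctx)
{
    size_t good = 0, bad;

    assert(n >= 2 && is_bad != NULL);
    bad = n - 1;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;   /* corresponds to "git bisect bad" */
        else
            good = mid;  /* corresponds to "git bisect good" */
    }
    return bad;
}

/* Example callback for illustration: everything from index *ctx onwards is bad. */
static int bad_from(size_t i, void *ctx)
{
    return i >= *(size_t *)ctx;
}
```

Each build-and-test round halves the remaining range, so the number of kernels to compile grows only with the logarithm of the number of commits between the two endpoints.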
Comment 38 Chris Wilson 2010-07-24 05:08:05 UTC
I've found one nasty regression in the fair-eviction branch and have updated it. Might be worth retesting.
Comment 39 Marcel Beister 2010-07-24 07:54:09 UTC
I compiled the current version of the fair-eviction branch this morning, but the result was not very nice :-(. Instead of some display-on-off switching during the bootup, the image was moved about 1 cm to the right with parts of the right border visible on the left border. When X started, the screen stayed black and the system froze completely.

Unfortunately I didn't have the time for bisecting the "lockup-on-overlay-failure"-error during the week... I hope I'll get this done in the next days.
Comment 40 Chris Wilson 2010-08-20 03:18:10 UTC
The overlay branch [same location] is now rebased on top of 2.6.36-rc1 and boots on my i845. Due to various other breakage on that system [some nasty errata we don't workaround yet] that is as far as I've been able to test.

Marcel, if you find some time can you please try booting http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=overlay and seeing how it fares?
Comment 41 Marcel Beister 2010-08-31 01:30:51 UTC
I compiled the rebased overlay branch and tested it with my Thinkpad X30. The result is very bad: As soon as the kernel tries to switch the mode, the complete system freezes (reboot with sysreq-key is not working) and the monitor stays off.

I also tried to bisect the flickering problem which was introduced from 2.6.34 to 2.6.35, but each revision shows other problems or freezes. This makes it nearly impossible to find the commit that introduces the flickering. Sorry.
Comment 42 Marcel Beister 2010-09-23 16:48:28 UTC
Good news from the i830 front... I've switched to 2.6.36-rc5 and reverted back to the new intel drivers in debian (xserver-xorg-video-intel 2:2.12.0+shadow-1), and the results are very promising:

- the flickering and corruptions during bootup are gone
- no freezes during xserver start or shutdown
- no freezes during console <=> xserver switching
- video output with overlay is much more stable

but:
- switching to console has led to corruptions from time to time (not freezing)
- in rare cases video output with overlay can still lead to a system freeze
- sometimes the complete picture is shifted to the right by approx. 20-30 pixels (actually wrapped, the missing right part of the picture is visible on the left)

But still... nice work!!!

I'm relieved that 2.6.36 with the current drivers from debian might be usable again (2.6.35 was definitely not)
Comment 43 Chris Wilson 2011-01-08 04:34:02 UTC
(In reply to comment #42)
> Good news from the i830 front... I've switched to 2.6.36-rc5 and reverted back
> to the new intel drivers in debian (xserver-xorg-video-intel
> 2:2.12.0+shadow-1), and the results are very promising:
> 
> - the flickering and corruptions during bootup are gone
> - no freezes during xserver start or shutdown
> - no freezes during console <=> xserver switching
> - video output with overlay is much more stable
> 
> but:
> - switching to console has led to corruptions from time to time (not freezing)
> - in rare cases video output with overlay can still lead to a system freeze
> - sometimes the complete picture is shifted to the right by approx. 20-30
> pixels (actually wrapped, the missing right part of the picture is visible on
> the left)

I think these two are:

commit 897493504addc5609f04a2c4f73c37ab972c29b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Sep 12 18:25:19 2010 +0100

    drm/i915: Ensure that the crtcinfo is populated during mode_fixup()
    
    This should fix the mysterious mode setting failures reported during
    boot up and after resume, generally for i8xx class machines.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16478
    Reported-and-tested-by: Xavier Chantry <chantry.xavier@gmail.com>
    Buzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29413
    Tested-by: Daniel Vetter <daniel@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org

I think that this setup should be stable now (shadow + overlay + kms on i830). Please do file fresh bugs for any issues you encounter. Thanks!
