Created attachment 36947 [details] dmesg.log after a gpu hang

I'm trying to play videos on my Thinkpad X30 with 830M chipset and XV option enabled. Sometimes this works without any problems, but often it fails with a gpu hang message. To reproduce this behavior, I just have to start mplayer -vo xv multiple times, or I can minimize and maximize the mplayer window multiple times. I'm using Debian with the intel-xorg package 2.12.0 from experimental. I had the same problem with 2.9 and 2.11 of the intel driver. The libdrm version is 2.4.21.
Created attachment 36948 [details] Xorg.log after a gpu hang
Created attachment 36949 [details] i915_error_state from debugfs after gpu hang
This problem seems to be related to the KMS options, since 2.9 without KMS seems to work fine with XV video playback.
0x01810000 MI_WAIT_FOR_EVENT strikes again on the ringbuffer.
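For anyone following along, the dword above can be picked apart by hand. What follows is only a minimal sketch: the command layout (client in bits 31:29, opcode in bits 28:23) and the flag values are assumptions taken from my recollection of the kernel's i915_reg.h, and `decode_mi` is a hypothetical helper, not part of intel-gpu-tools.

```python
# Assumed values, recalled from the kernel's i915_reg.h; verify before relying on them.
MI_WAIT_FOR_EVENT_OPCODE = 0x03
MI_WAIT_FLAGS = {
    1 << 16: "MI_WAIT_FOR_OVERLAY_FLIP",
    1 << 6: "MI_WAIT_FOR_PLANE_B_FLIP",
    1 << 2: "MI_WAIT_FOR_PLANE_A_FLIP",
    1 << 1: "MI_WAIT_FOR_PLANE_A_SCANLINES",
}


def decode_mi(dword):
    """Split a ring dword into (client, opcode, wait-flag names).

    The flag names are only meaningful when the opcode is MI_WAIT_FOR_EVENT.
    """
    client = dword >> 29            # 0 means an MI (memory interface) command
    opcode = (dword >> 23) & 0x3F   # 6-bit MI opcode
    flags = [name for bit, name in MI_WAIT_FLAGS.items() if dword & bit]
    return client, opcode, flags


client, opcode, flags = decode_mi(0x01810000)
print(client, hex(opcode), flags)
```

Under those assumptions the dword decodes as MI_WAIT_FOR_EVENT with the overlay-flip bit set, i.e. the ring is waiting on the overlay.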
If there is anything I can do in order to fix this bug, please let me know. I'm using kernel 2.6.34.1 btw.
Marcel, what is the timing of the hang? I think I can see a potential race condition between the overlay being turned off and torn-down whilst it is still referenced by the ring and so the GPU might wander off into invalid memory. Does the GPU hang during normal playback, or after an event [such as closing the video]?
The gpu hang CAN happen either when I start the video or if I do something like minimize/maximize the mplayer window. Also moving another window on top of the mplayer window or switching between full screen and windowed mode CAN trigger the error. It NEVER happened on closing the mplayer window or during normal playback without any user interaction.
*** Bug 26818 has been marked as a duplicate of this bug. ***
Can you grab an intel_reg_dump after an overlay hang? I'll put that also onto my i915_error_state improvement TODO list, but for the time being you'll need to run it by hand (and hope the registers have not been clobbered...)
Created attachment 36988 [details] output of intel_gpu_dump after gpu hang
Hopefully the output of intel_gpu_dump is the thing you asked for. The guide for an intel_reg_dump seems to be out of date (http://intellinuxgraphics.org/intel_reg_dumper.html) since there is no intel_reg_dumper tool in the current intel-gpu-tools (1.0.2).
In this case I do need intel_reg_dumper as I need to check the overlay registers to work out what the gpu is waiting on.
Created attachment 36990 [details] output of intel_reg_dumper after gpu hang
/me swears. Sorry about that, it appears not even intel_reg_dumper includes the overlay registers. Today is not going well.
Created attachment 37034 [details] [review] Capture overlay status for GPU hangs

Can you please apply this patch (applies cleanly to 2.6.35-rc5 at least) and see if we capture more information in /sys/kernel/debug/dri/0/i915_error_state following a GPU hang?
Created attachment 37053 [details] i915_error_state with overlay registers patch after gpu hang
I could apply the patch to the 2.6.34.1 kernel without problems and attached the new i915_error_state output from debugfs.
Hmm, Marcel that error state indicates that we used a Y stride of 640 bytes. The docs mention an erratum for i830 and i845 that strides should be a multiple of 256 bytes. This is fixed in xf86-video-intel.git. However, I do not think that this is the only bug at play here...

commit 3a7c25ff8ddd45c9d9eca5cc2228552847ca9e7d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 12 19:47:46 2010 +0100

    video: Apply overlay stride errata for i830 and i845

    Due to an erratum on these chipsets, the overlay stride must be a
    multiple of 256 bytes.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
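To make the erratum concrete: rounding a stride up to the next multiple of 256 bytes can be sketched as below. This is an illustration only and not the code from the commit; `align_stride` and `OVERLAY_STRIDE_ALIGN` are hypothetical names, and the bit trick assumes the alignment is a power of two.

```python
OVERLAY_STRIDE_ALIGN = 256  # bytes, per the i830/i845 erratum quoted above


def align_stride(stride, alignment=OVERLAY_STRIDE_ALIGN):
    """Round stride up to the next multiple of alignment (a power of two)."""
    return (stride + alignment - 1) & ~(alignment - 1)


# The Y stride from the error state is not a multiple of 256,
# so it would be padded up to the next aligned value.
print(align_stride(640))  # 768
```

A stride that is already aligned is returned unchanged, so applying the rounding unconditionally is harmless on those two chipsets.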
Created attachment 37064 [details] i915_error_state with overlay registers and stride patch after gpu hang

I patched the 2.12.0 driver from debian with the changes from "Apply overlay stride errata for i830 and i845" and rebuilt the driver (the patch was applied to i830_video.c instead of intel_video.c; I hope this was the right place!). Unfortunately the gpu still hangs, although this time I needed 5 attempts to trigger the error. Attached is the i915_error_state log after the crash with the patched driver.
Probably just a coincidence... ;-) The hangs seem to occur at the start of an OVERLAY sequence. Maybe we are missing some wait or writing registers in the wrong sequence.
Not a fix, but a substantial performance improvement:

commit 24bdfe0d5eb4e890e9c63bbb4617efaa0768ab7f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 15 13:54:04 2010 +0100

    video: Reuse the old buffers.

    After passing the new buffer to the kernel, the old buffer is unpinned
    and becomes available for re-use. So keep hold of the old buffer and
    swap after a PutImage.

    This greatly reduces the amount of CPU time consumed by the kernel on
    behalf of the video overlay -- by only allocating two buffers for an
    entire sequence, we avoid clflushing and page allocation on every frame.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

And you never know, it may help! :) So far, I haven't been able to reproduce this hang on i9xx.
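The buffer reuse described in that commit message is essentially a two-buffer swap. The following is a toy model of the idea, not the driver code: `OverlayPort`, `put_image` and the `bytearray` standing in for a GEM buffer are all hypothetical.

```python
class OverlayPort:
    """Toy model of the overlay buffer handling (hypothetical, not driver code)."""

    def __init__(self, size):
        self.size = size
        self.buf = None        # buffer to fill with the next frame
        self.old_buf = None    # buffer most recently handed to the kernel
        self.allocations = 0   # how many buffers were ever allocated

    def _alloc(self):
        self.allocations += 1
        return bytearray(self.size)  # stand-in for a GEM buffer object

    def put_image(self, frame):
        if self.buf is None:
            self.buf = self._alloc()
        self.buf[:len(frame)] = frame
        shown = self.buf  # this is what would be passed to the kernel
        # The previously shown buffer is unpinned now, so keep it for the
        # next frame instead of freeing it and allocating a fresh one.
        self.buf, self.old_buf = self.old_buf, self.buf
        return shown


port = OverlayPort(16)
for i in range(10):
    port.put_image(bytes([i]) * 16)
print(port.allocations)  # 2
```

However many frames are pushed, only two buffers are ever allocated, which is where the saving in page allocation and clflush work would come from.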
I patched 2.12.0-1 from Debian with

- video: Free the buffers immediately after turning off.
- video: Reuse the old buffers.
- video: Apply overlay stride errata for i830 and i845

Unfortunately this did not improve my problem. Should I attach a new i915_error_state.log from the patched driver? If I find some time this weekend I'll try to build a driver directly from the git repo to simplify testing.
I currently have another old notebook on my desk (Dell Latitude C400) which is also i830 based. I will try to reproduce the error on that machine to rule out video-bios-or-whatever problems. Btw. the Thinkpad X30 also seems to have some problems with the monitor detection, but I will put this in a separate bug report ;-)
Yes, let's have a look at the recent error state and see if anything changed. I've also reviewed the kernel overlay code, just a couple of niggles but I haven't found anything that is in glaring conflict with the docs. If you do have the time, I'd appreciate it if you could try: http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fair-eviction
Created attachment 37125 [details] i915_error_state with patched driver (see comment 22)
The error is the same, so at least the new intel_video.c code doesn't regress! :) Looking at the error, I've realised that the wait there is superfluous. The overlay will switch-on on the next vblank, and waiting for that makes no difference and is not required by the API. So we can proceed as normal and stall on the next upload if the buffer is still busy. This still means that we will detect gpu hangs as normal and not block indefinitely. So, I am quite optimistic that the tweaked overlay driver will avoid this... Of course, the hang may just be detected later...
Look what I found in the old ums overlay code:

    /*
     * On I830, if pipe A is off when the overlay is enabled, it will fail to
     * turn on and blank the entire screen or lock up the ring. Light up pipe
     * A in this case to provide a clock for the overlay hardware
     */

Hmm.
That sounds promising ;-)
I've put my first attempt at applying the ums workaround up in the fair-eviction branch:

commit 5a9dda875800a4702f45e9365bdb692c3a5c18ea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 16 17:13:01 2010 +0100

    drm/i915/overlay: Workaround i830 overlay activation bug.

    On i830, there exists a bug where an overlay on pipe B requires the mode
    clock on pipe A in order to activate. So workaround this by activating
    pipe A when trying to enable the overlay on pipe B.

    References:

      [Bug 29007] GPU hang on video playback with overlay
      https://bugs.freedesktop.org/show_bug.cgi?id=29007

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

That branch requires quite a bit of work to get it ready to upstream, but I'd appreciate it if you do get a chance to test it. Thanks.
I compiled and tested the following combination:

- master branch of git://anongit.freedesktop.org/xorg/driver/xf86-video-intel
- fair-eviction branch of git://anongit.freedesktop.org/~ickle/linux-2.6

With this I noticed:

- when the kernel has finished and the init-scripts are started (and the resolution has already changed), the display turns off for about 3 seconds, but comes back
- when starting X, the driver no longer seems to detect a non-existing monitor on the external vga-connector
- the overlay bug is still there, but this time the system freezes completely :-(

Any suggestions on further debugging with my system? I'll test the same configuration on the Dell laptop asap, maybe things are different there.
Ouch. Can you reproduce any of these symptoms on a vanilla 2.6.35-rc5? I think somewhere in that branch is a bug which hangs the system when X closes, but that miraculously disappears when I try to bisect. If any are reproducible in 2.6.35-rc5, then we need to bisect quickly in order to make sure the regression doesn't get into 2.6.35. Thanks.
Vanilla 2.6.35-rc5 behaves similarly to the fair-eviction branch BUT overlay playback does NOT result in a complete system freeze. I already checked if the "drm/i915/overlay: Workaround i830 overlay activation bug" fix causes this, but that's not the case. I'll try to bisect the commit which is responsible for the freeze on my system.
At this moment in time, I am more concerned about the changes in behaviour for 2.6.35-rc5, in particular the 3 second delay and the loss of external monitor detection. Finding the bug in the overlay code is a bonus. :)
Concerning the external monitor: This is actually something good, because with 2.6.34 the driver often detects an external monitor, which does NOT exist.
On Mon, 19 Jul 2010 04:22:59 -0700 (PDT), Marcel Beister <kingofthehill3@gmx.de>
> Concerning the external monitor:
> This is actually something good, because with 2.6.34 the driver often detects
> an external monitor, which does NOT exist.

Thanks for the clarification. One less thing for me to worry about. :)

(I wonder whether replying via email actually works now... Would make my life much easier, with a touch more integration.)
Yesterday I went back to commit 12de1f6d6957f3240a6e29fb16d6417a951136b2, but the system still freezes on overlay playback. How far am I supposed to go back in the fair-eviction branch or in other words, what's the difference between that branch and vanilla 2.6.35-rc5? I also tested overlay on the Dell laptop btw. and the behavior is exactly identical (some display on-off flickers, freeze on overlay playback, ...).
> --- Comment #36 from Marcel Beister <kingofthehill3@gmx.de> 2010-07-20 01:27:36 PDT ---
> Yesterday I went back to commit 12de1f6d6957f3240a6e29fb16d6417a951136b2, but
> the system still freezes on overlay playback. How far am I supposed to go back
> in the fair-eviction branch or in other words, what's the difference between
> that branch and vanilla 2.6.35-rc5?

The trick is to let git figure it out for you:

# git bisect start
# git bisect good v2.6.35-rc5
# git bisect bad 12de1f6d6957f3240a6e29fb16d6417a951136b2

Every time you get a hang, git bisect bad, and recompile. If it passes, git bisect good and recompile.

So it sounds like the regression is very early in the sequence. Useful to know, thanks.
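As a rough picture of why bisecting stays manageable even across many commits: the procedure is a binary search between a known-good and a known-bad revision. The sketch below assumes a simple linear history and a hypothetical `is_bad` callback standing in for "build, boot, try to trigger the hang"; real git bisect works on the full commit graph, so this is only an approximation of what it does.

```python
def bisect_first_bad(commits, is_bad):
    """Return the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. is_bad(commit) is the manual test.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # like: git bisect bad
        else:
            good = mid   # like: git bisect good
    return commits[bad]


# Pretend history of 100 commits where commit 37 introduced the hang.
tested = []


def hangs(commit):
    tested.append(commit)
    return commit >= 37


print(bisect_first_bad(list(range(100)), hangs), len(tested))
```

Each test halves the remaining range, so roughly log2(N) kernel builds are needed for N commits between the good and bad revisions.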
I've found one nasty regression in the fair-eviction branch and have updated it. Might be worth retesting.
I compiled the current version of the fair-eviction branch this morning, but the result was not very nice :-(. Instead of some display-on-off switching during the bootup, the image was moved about 1 cm to the right with parts of the right border visible on the left border. When X started, the screen stayed black and the system froze completely. Unfortunately I didn't have the time to bisect the "lockup-on-overlay-failure" error during the week... I hope I'll get this done in the next few days.
The overlay branch [same location] is now rebased on top of 2.6.36-rc1 and boots on my i845. Due to various other breakage on that system [some nasty errata we don't workaround yet] that is as far as I've been able to test. Marcel, if you find some time can you please try booting http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=overlay and seeing how it fares?
I compiled the rebased overlay branch and tested it with my Thinkpad X30. The result is very bad: As soon as the kernel tries to switch the mode, the complete system freezes (reboot with sysreq-key is not working) and the monitor stays off. I also tried to bisect the flickering problem which was introduced from 2.6.34 to 2.6.35, but each revision shows other problems or freezes. This makes it nearly impossible to find the commit that introduces the flickering. Sorry.
Good news from the i830 front... I've switched to 2.6.36-rc5 and reverted back to the new intel drivers in debian (xserver-xorg-video-intel 2:2.12.0+shadow-1), and the results are very promising:

- the flickering and corruptions during bootup are gone
- no freezes during xserver start or shutdown
- no freezes during console <=> xserver switching
- video output with overlay is much more stable

but:

- switching to console has led to corruptions from time to time (not freezing)
- in rare cases video output with overlay can still lead to a system freeze
- sometimes the complete picture is shifted to the right by approx. 20-30 pixels (actually wrapped, the missing right part of the picture is visible on the left)

But still... nice work!!! I'm relieved that 2.6.36 with the current drivers from debian might be usable again (2.6.35 was definitely not)
(In reply to comment #42)
> Good news from the i830 front... I've switched to 2.6.36-rc5 and reverted back
> to the new intel drivers in debian (xserver-xorg-video-intel
> 2:2.12.0+shadow-1), and the results are very promising:
>
> - the flickering and corruptions during bootup are gone
> - no freezes during xserver start or shutdown
> - no freezes during console <=> xserver switching
> - video output with overlay is much more stable
>
> but:
> - switching to console has led to corruptions from time to time (not freezing)
> - in rare cases video output with overlay can still lead to a system freeze
> - sometimes the complete picture is shifted to the right by approx. 20-30
> pixels (actually wrapped, the missing right part of the picture is visible on
> the left)

I think these two are:

commit 897493504addc5609f04a2c4f73c37ab972c29b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Sep 12 18:25:19 2010 +0100

    drm/i915: Ensure that the crtcinfo is populated during mode_fixup()

    This should fix the mysterious mode setting failures reported during
    boot up and after resume, generally for i8xx class machines.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16478
    Reported-and-tested-by: Xavier Chantry <chantry.xavier@gmail.com>
    Buzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29413
    Tested-by: Daniel Vetter <daniel@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org

I think that this setup should be stable now (shadow + overlay + kms on i830). Please do file fresh bugs for any issues you encounter. Thanks!