Bug 71432 - [ILK] GPU Hang with h.264 video decoding
Status: CLOSED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2013-11-09 16:43 UTC by Mihail Kasadjikov
Modified: 2017-07-24 22:56 UTC
i915 platform: ILK
i915 features: GPU hang


Attachments
Information collected by intel_gpu_abrt tool. (659.12 KB, application/x-gzip)
2013-11-09 16:46 UTC, Mihail Kasadjikov
fresh collected error state (721.92 KB, application/x-gzip)
2013-12-18 22:21 UTC, Mihail Kasadjikov

Description Mihail Kasadjikov 2013-11-09 16:43:07 UTC
I have a Lenovo ThinkPad X201 Tablet laptop with an Intel Core i7 L-620 CPU.
I use Debian jessie (testing) 64-bit with KDE 4.10.5.

The problem appears after a few minutes of playback when I use hardware video decoding via VA-API (I use the VLC player). The error appears on kernel 3.11.7 but not on 3.11.6. I don't understand why, because there are no changes in the GPU driver between these versions (if I read the kernel changelog correctly)...
When the GPU has hung, I connect to the laptop via ssh and collect the errors.

Please let me know if you need more information.

$ uname -a
Linux h13 3.11.7-zen+ #1 ZEN SMP Wed Nov 6 16:42:14 FET 2013 x86_64 GNU/Linux

The version of packages:
i965-va-driver:amd64            1.2.1-2.1
libdrm-intel1:amd64             2.4.46-3
libdrm-nouveau2:amd64           2.4.46-3
libdrm-radeon1:amd64            2.4.46-3
libdrm2:amd64                   2.4.46-3
libva-drm1:amd64                1.2.1-2.1
libva-glx1:amd64                1.2.1-2.1
libva-x11-1:amd64               1.2.1-2.1
libva1:amd64                    1.2.1-2.1
xserver-xorg-core               2:1.14.3-4
xserver-xorg-video-intel        2:2.99.905+git1383858000.b46d0d3

The syslog messages:
Nov  9 18:23:37 localhost kernel: [ 2623.487215] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
Nov  9 18:23:37 localhost kernel: [ 2623.487225] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
Nov  9 18:23:43 localhost kernel: [ 2629.459913] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
Nov  9 18:23:49 localhost kernel: [ 2635.468525] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
Nov  9 18:23:55 localhost kernel: [ 2641.441165] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
Nov  9 18:24:01 localhost kernel: [ 2647.461769] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
Nov  9 18:24:07 localhost kernel: [ 2653.446329] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
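
When a log contains many repeated hangcheck lines like the ones above, a short summary can make it easier to see which ring is affected and how often. The following is a minimal sketch, not part of the original report; the function name `summarize_hangchecks` is hypothetical, and it assumes the `[drm:i915_hangcheck_elapsed] *ERROR*` prefix shown in these messages.

```shell
#!/bin/sh
# Summarise i915 hangcheck errors from kernel log text on stdin.
# Prints one "<count> <message>" line per distinct hangcheck message.
# (summarize_hangchecks is a hypothetical helper name.)
summarize_hangchecks() {
    # Keep only hangcheck error lines, drop everything up to "*ERROR* ",
    # then count identical messages.
    sed -n 's/.*\[drm:i915_hangcheck_elapsed\] \*ERROR\* //p' |
        sort | uniq -c | sed 's/^ *//'
}

# Usage (the log path is an assumption and varies by distribution):
#   summarize_hangchecks < /var/log/syslog
```

Fed the six syslog lines above, this should print a single line with the count followed by "stuck on render ring".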

The intel_gpu_top output when GPU has hung (in ssh session):
    render busy: 100%: ████████████████████       render space: 898/131072
 bitstream busy:   1%: ▎                       bitstream space: 0/131072

           task  percent busy
             CS: 100%: ████████████████████    vert fetch: 0 (0/sec)
            URB: 100%: ████████████████████    prim fetch: 0 (0/sec)
            VFE: 100%: ████████████████████ VS invocations: 0 (0/sec)
            BCS:   1%: ▎                    GS invocations: 0 (0/sec)
             AI:   1%: ▎                         GS prims: 0 (0/sec)
             AC:   1%: ▎                    CL invocations: 0 (0/sec)
             AM:   1%: ▎                         CL prims: 0 (0/sec)
                                            PS invocations: 0 (0/sec)
                                            PS depth pass: 0 (0/sec)
Comment 1 Mihail Kasadjikov 2013-11-09 16:46:33 UTC
Created attachment 88940
Information collected by intel_gpu_abrt tool.
Comment 2 Mihail Kasadjikov 2013-11-09 18:13:34 UTC
Hmmm. The bug appears on 3.11.6 too...

Possible steps to reproduce:
1. Download a video "Planet" (file "h264_720p_hp_5.1_6mbps_ac3_unstyled_subs_planet.mkv") from this page http://www.auby.no/files/video_tests/
2. Check these settings in VLC:
 [postproc] # Video post processing filter
 postproc-q=0
 [avcodec] # FFmpeg audio/video decoder
 ffmpeg-hw=1
 [main] # main program
 fullscreen=0
 skip-frames=0
 vout=xcb_x11
3. Open the video file and start playback. You may need to switch to fullscreen after a few seconds.
4. The GPU hangs.

I use this script to collect errors automatically (it is run by cron every 2 minutes):
--- cut ---
# cat get_i915_error_state.sh
#!/bin/sh
TS=$(date +%Y-%m-%d_%H-%M-%S)
cd /opt/intel_bugs

[ -f /tmp/intel_gpu_abrt.lock ] && exit 0

[ -f /sys/kernel/debug/dri/0/i915_error_state ] && \
        grep -q "no error state collected" /sys/kernel/debug/dri/0/i915_error_state

if [ 0 -ne $? ]; then
        echo "i915 error detected at $TS"
        intel_gpu_abrt
        [ -f intel_gpu_abrt.tar ] && \
                mv intel_gpu_abrt.tar intel_gpu_abrt_${TS}.tar && \
                gzip intel_gpu_abrt_${TS}.tar
        echo "Trying to reset GPU..."
        echo 1 > /sys/kernel/debug/dri/0/i915_wedged
        touch /tmp/intel_gpu_abrt.lock
fi
--- cut ---
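
One detail of the script above: when the debugfs file does not exist (for example, if debugfs is not mounted), the `grep` is skipped and `$?` is non-zero, so the script treats a missing file the same way as a captured error. A minimal sketch of a check that separates the three cases follows; the function name `error_state_present` and its return codes are assumptions for illustration, not something the reporter used.

```shell
#!/bin/sh
# Return 0 if the given i915 error-state file holds a captured error,
# 1 if it reports "no error state collected",
# 2 if the file is missing or unreadable.
# (error_state_present is a hypothetical helper name.)
error_state_present() {
    state_file=$1
    [ -r "$state_file" ] || return 2
    if grep -q "no error state collected" "$state_file"; then
        return 1
    fi
    return 0
}

# Usage, with the debugfs path from the script above:
#   if error_state_present /sys/kernel/debug/dri/0/i915_error_state; then
#       echo "i915 error detected"
#   fi
```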

System environment:
-- chipset: Intel Corporation 5 Series/3400 Series Chipset
-- system architecture: x86_64
-- xf86-video-intel: 2.99.905 git b46d0d3
-- xserver: 1.14.3
-- mesa: 10.0.0-devel
-- libdrm: 2.4.46
-- kernel: 3.11.6-zen+
-- Linux distribution: Debian/jessie
-- Machine or mobo model: ThinkPad X201t
-- Display connector: LVDS
Comment 3 haihao 2013-11-11 02:30:54 UTC
Can you reproduce this issue with other video files ?
Comment 4 Mihail Kasadjikov 2013-11-11 09:24:05 UTC
Yes. I have a lot of videos in .flv and .mp4 format downloaded from
YouTube. These videos have different resolutions such as 720p, 480p and
lower.
Is this bug similar to bug 59050? The symptoms are similar.
Comment 5 Mihail Kasadjikov 2013-11-11 13:11:35 UTC
So.
What I tried:
switched to the previous kernel 3.10.12 - no improvement.
changed the kwin render backend from OpenGL to XRender - no improvement.
changed mesa from 10.0-dev to 9.2 - no improvement.
switched off the desktop environment (disabled kdm at startup) and ran pure X with xterm - no improvement.
Comment 6 Mihail Kasadjikov 2013-11-11 13:26:35 UTC
One more observation:
If I use the XVideo output in vlc, I get the bug immediately.
If I use the OpenGL output in vlc, I get the bug after switching to fullscreen mode.
Comment 7 haihao 2013-11-12 01:07:58 UTC
> Does this bug is similar to bug 59050? The symptoms are similar.

For 59050, the issue is only reproduced with some videos with specific resolutions, so you mean you can play back some videos without a GPU hang, right?
Comment 8 Mihail Kasadjikov 2013-11-12 22:44:16 UTC
No.
I just suggested that the problem may be in the buffer size.
Sorry if that was a mistake.
Comment 9 haihao 2013-12-03 05:34:41 UTC
Some ILK-related fixes are included in the following tarball; could you give it a try?
http://www.freedesktop.org/software/vaapi/testing/libva-intel-driver/libva-intel-driver-1.2.2.pre1.tar.bz2
Comment 10 Mihail Kasadjikov 2013-12-03 23:37:04 UTC
It works.
And I've built this driver from git - it works too.
Thanks.
Comment 11 Mihail Kasadjikov 2013-12-18 22:18:57 UTC
Unfortunately the problem has appeared again.
I am not sure that the cause of the bug is only in the libva component. I suspect the bug appears when OpenGL and libva are used simultaneously. For example, the probability of the bug increases when I run a simple 3D game in one window and vlc in another.
Comment 12 Mihail Kasadjikov 2013-12-18 22:21:21 UTC
Created attachment 90950
fresh collected error state
Comment 13 zhixinx.liu 2014-07-10 03:29:22 UTC
Hi Mihail Kasadjikov,
can you still reproduce this issue?
On ILK I ran a 3D game in one window and mplayer in another, and could not reproduce the problem.
Comment 14 Mihail Kasadjikov 2014-07-10 08:31:18 UTC
Hi.

I noticed the problem appears when I use "i915.i915_enable_rc6=1" in kernel cmdline. Because of this I can't use the power saving for Intel's GPU.
Please see this bug: https://bugzilla.kernel.org/show_bug.cgi?id=77691#c1

Generally I don't play 3D games on my laptop but modern desktop environments like KDE or Unity use OpenGL acceleration.

I usually get a "GPU hang" when I watch YouTube in Flash Player with video acceleration and with RC6 power saving enabled. VLC hits this error too.

Now I'm not sure that this error is in libva. Maybe it is in the kernel module...
Comment 15 Mihail Kasadjikov 2014-07-10 08:33:17 UTC
Can you please escalate this bug to the kernel developers?
Comment 16 zhixinx.liu 2014-07-10 09:12:00 UTC
If you want to escalate this bug to the kernel developers, you can change the Product from libva to DRI.
Comment 17 Jani Nikula 2014-09-12 13:26:53 UTC
(In reply to comment #14)
> I noticed the problem appears when I use "i915.i915_enable_rc6=1" in kernel
> cmdline. Because of this I can't use the power saving for Intel's GPU.
> Please see this bug: https://bugzilla.kernel.org/show_bug.cgi?id=77691#c1

Let's track your bug here.

First, do not change the enable_rc6 module parameter from its platform specific defaults, or all bets are off. Please see if you can reproduce the bug without (though AFAICT what you set there should be the same as the default).

Please also try a more recent kernel.
Comment 18 Mihail Kasadjikov 2014-09-13 15:47:45 UTC
So. I tested kernel 3.15.10.
I can't use 3.16 because of some issues related to reiserfs.

I found that RC6 is disabled by default for Ironlake (gen 5).
In file «drivers/gpu/drm/i915/intel_pm.c»:
int intel_enable_rc6(const struct drm_device *dev)
{
…
        /* Disable RC6 on Ironlake */
        if (INTEL_INFO(dev)->gen == 5)
                return 0;
…
}

When I try to force enable_rc6 to 1, it still falls back to the default (disabled) value:
$ cat /proc/cmdline 
root=/dev/mapper/h13ssd-root ro ipv6.disable=1 elevator=deadline no_console_suspend=1 pcie_aspm=powersave video=inteldrmfb:1280x800R-8 quiet intel_iommu=igfx_off resume=/dev/mapper/h13ssd-swap zswap.enabled=1 drm.debug=0x04 i915.enable_rc6=1

$ dmesg | egrep -i "rc6"
[    1.913863] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
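
The check above can be turned into a small filter so the result is easy to record alongside other diagnostics. This is only a sketch: the function name `rc6_status` is hypothetical, and it assumes that an enabled state is reported with the word "on" in the same position where this log shows "off".

```shell
#!/bin/sh
# Read kernel log text on stdin and print:
#   "enabled"  if the last RC6 status line reports any RC6 state as "on",
#   "disabled" if all states are reported as "off",
#   "unknown"  if no RC6 status line is present.
# (rc6_status is a hypothetical helper; the "on" wording is an assumption.)
rc6_status() {
    line=$(grep 'Enabling RC6 states:' | tail -n 1)
    case $line in
        '') echo unknown ;;
        *'RC6 on'*|*'RC6p on'*|*'RC6pp on'*) echo enabled ;;
        *) echo disabled ;;
    esac
}

# Usage:
#   dmesg | rc6_status
```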

And now I got one error, but without an overall system hang.
$ dmesg | egrep -i "hangcheck"
[  350.800043] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... bsd ring idle
Comment 19 Mihail Kasadjikov 2014-09-18 19:21:21 UTC
I found and removed «semaphores=0» from «/etc/modprobe.d/options.conf». It had been added as a workaround according to bug #54226.

Right now I have no parameters set for the i915 kernel module:
$ cat /proc/cmdline 
root=/dev/mapper/h13ssd-root ro ipv6.disable=1 elevator=deadline no_console_suspend=1 pcie_aspm=powersave video=inteldrmfb:1280x800R-8 quiet intel_iommu=igfx_off resume=/dev/mapper/h13ssd-swap zswap.enabled=1
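
Checking a long command line by eye is error-prone, so a filter that prints only the i915 options can help. The sketch below uses a hypothetical function name, `i915_cmdline_params`. It only covers the kernel command line: options set through files in /etc/modprobe.d (like the «semaphores=0» entry mentioned above) do not show up there, and the values the loaded module actually uses are normally visible under /sys/module/i915/parameters/.

```shell
#!/bin/sh
# Print every i915.* option found in a kernel command line, one per line.
# With no argument, the running kernel's /proc/cmdline is read.
# (i915_cmdline_params is a hypothetical helper name.)
i915_cmdline_params() {
    cmdline=${1-$(cat /proc/cmdline)}
    # Split on spaces and tabs, keep only tokens that start with "i915.".
    # "|| true" keeps the exit status zero when nothing matches.
    printf '%s\n' "$cmdline" | tr -s ' \t' '\n\n' | grep '^i915\.' || true
}

# Usage:
#   i915_cmdline_params
#   i915_cmdline_params "root=/dev/sda1 ro quiet i915.enable_rc6=1"
```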

But after some videos on YouTube using Flash Player I got the «Hangcheck timer elapsed» message, without an overall system hang:
$ dmesg | grep Hang
[ 4121.662268] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... bsd ring idle

Interestingly, after this message in dmesg and a short pause in video playback, other videos on YouTube play normally with the HW decoder and without errors for a long time...

And this is what I see after the message in dmesg:
# cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected
Comment 20 Rodrigo Vivi 2014-10-15 20:14:15 UTC
Could you please try latest drm-intel-nightly from cgit.freedesktop.org/drm-intel?
Comment 21 Mihail Kasadjikov 2014-10-16 19:35:53 UTC
I can't use the latest kernel because I use the Reiser4 filesystem for my /home.

$ git clone --depth=1 --branch="drm-intel-nightly" git://anongit.freedesktop.org/drm-intel drm-intel-nightly
Cloning into 'drm-intel-nightly'...
remote: Counting objects: 50283, done.
remote: Compressing objects: 100% (47765/47765), done.
remote: Total 50283 (delta 4384), reused 12857 (delta 1911)
Receiving objects: 100% (50283/50283), 134.94 MiB | 692.00 KiB/s, done.
Resolving deltas: 100% (4384/4384), done.
Checking connectivity... done.
Checking out files: 100% (47560/47560), done.
$ cd drm-intel-nightly/
$ zcat ~/dev/kernel/Reiser4/reiser4-for-3.16.2.patch.gz | patch -p 1 --dry-run | grep -B 1 ^Hunk
checking file fs/fs-writeback.c
Hunk #2 succeeded at 575 (offset 1 line).
Hunk #3 succeeded at 618 (offset 1 line).
Hunk #4 succeeded at 651 (offset 1 line).
Hunk #5 succeeded at 675 (offset 1 line).
Hunk #6 succeeded at 683 (offset 1 line).
Hunk #7 succeeded at 1012 (offset 1 line).
Hunk #8 succeeded at 1048 (offset 1 line).
--
checking file include/linux/fs.h
Hunk #4 succeeded at 1578 (offset 26 lines).
Hunk #5 succeeded at 2250 (offset 26 lines).
Hunk #6 succeeded at 2380 with fuzz 2 (offset 27 lines).
Hunk #7 succeeded at 2388 (offset 27 lines).
--
checking file include/linux/sched.h
Hunk #1 succeeded at 1881 (offset -11 lines).
checking file include/linux/writeback.h
Hunk #2 succeeded at 93 with fuzz 2.
checking file mm/filemap.c
Hunk #1 succeeded at 1441 (offset 6 lines).
checking file mm/page-writeback.c
Hunk #1 succeeded at 2196 (offset -3 lines).
checking file mm/vmscan.c
Hunk #1 FAILED at 2490.
Hunk #2 succeeded at 2538 with fuzz 2 (offset 8 lines).
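
In a dry run like this, the only hunk that blocks the patch is the one marked FAILED; the "offset" and "fuzz" hunks still apply. A small filter can list just the failures per file. This is a sketch under the assumption that the output has the "checking file ..." and "Hunk #N FAILED ..." shape shown above; the function name `failed_hunks` is hypothetical.

```shell
#!/bin/sh
# Read `patch --dry-run` output on stdin and print "<file>: hunk #N"
# for each hunk that failed to apply.
# (failed_hunks is a hypothetical helper name.)
failed_hunks() {
    awk '
        /^checking file /       { file = $3 }               # remember current file
        /^Hunk #[0-9]+ FAILED/  { print file ": hunk " $2 } # report failed hunk
    '
}

# Usage, with the patch file from the command above:
#   zcat ~/dev/kernel/Reiser4/reiser4-for-3.16.2.patch.gz |
#       patch -p 1 --dry-run | failed_hunks
```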

Sorry. I'm not a software developer and don't know how to backport the new drm module into kernel 3.16.
Comment 22 Mihail Kasadjikov 2014-10-16 20:43:17 UTC
On kernel 3.16.6 the behavior is as described in comment 19.

So. Now it works almost ideally, except for this one error in dmesg and one short pause after some minutes of video playback.
Comment 23 Rodrigo Vivi 2015-01-15 18:49:33 UTC
What about a newer kernel? It is strange to have the error check fire and no error state collected. Could you please try to reproduce and grab the error state from the latest setup, where it works properly but you still see warnings?

Also, it would be good if you could try a newer kernel.
If you use Ubuntu you can try the nightly deb from the PPA: http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/current/
Comment 24 Mihail Kasadjikov 2015-01-21 09:43:32 UTC
Hello.

$ uname -a
Linux h13 3.17.7 #1 SMP Sun Jan 18 01:51:42 MSK 2015 x86_64 GNU/Linux

Test with vlc 1.

config:
[avcodec] # FFmpeg audio/video decoder
avcodec-hw=vaapi_x11
[postproc] # Video post processing filter
postproc-q=0
[core] # core program
skip-frames=0
quiet-synchro=1
deinterlace=-1
vout=xcb_xv

stderr:
$ vlc Kizomba\ Isabelle\ and\ Felicien\ Asty\ -\ Curti\ ma\ mi.mp4 
VLC media player 2.2.0-rc2 Weatherwax (revision 2.2.0-rc1-118-g22fda39)
[0000000000a67118] core libvlc: Running vlc with the default interface. Use 'cvlc' to use vlc without interface.
libva info: VA-API version 0.36.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_36
libva info: va_openDriver() returns 0
[00007f7becc328d8] avcodec decoder: Using Intel i965 driver for Intel(R) Ironlake Mobile - 1.4.1 for hardware decoding.
libva info: VA-API version 0.36.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_36
libva info: va_openDriver() returns 0
[00007f7becc328d8] avcodec decoder: Using Intel i965 driver for Intel(R) Ironlake Mobile - 1.4.1 for hardware decoding.

Vlc hung after about 10-20 seconds and ignored the TERM signal. It was killed with "kill -9". No errors in dmesg.

Test with vlc 2.

config:
[avcodec] # FFmpeg audio/video decoder
avcodec-hw=vaapi_drm
[postproc] # Video post processing filter
postproc-q=0
[core] # core program
skip-frames=0
quiet-synchro=1
deinterlace=-1
vout=xcb_xv

stderr:
$ vlc Kizomba\ Isabelle\ and\ Felicien\ Asty\ -\ Curti\ ma\ mi.mp4 
VLC media player 2.2.0-rc2 Weatherwax (revision 2.2.0-rc1-118-g22fda39)
[00000000015b5118] core libvlc: Running vlc with the default interface. Use 'cvlc' to use vlc without interface.
libva info: VA-API version 0.36.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_36
libva info: va_openDriver() returns 0
[00007f0110c32938] avcodec decoder: Using Intel i965 driver for Intel(R) Ironlake Mobile - 1.4.1 for hardware decoding.
libva info: VA-API version 0.36.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_36
libva info: va_openDriver() returns 0
[00007f0110c32938] avcodec decoder: Using Intel i965 driver for Intel(R) Ironlake Mobile - 1.4.1 for hardware decoding.

The video file played normally.

Flash video works normally.

So. Right now it seems OK. The problem with "avcodec-hw=vaapi_x11" may be in the X11 driver or in vlc, but I think the DRM driver works normally.
Comment 25 Mihail Kasadjikov 2015-02-01 19:06:15 UTC
Kernel 3.18.4.

Another test with vlc.

config:
[avcodec] # FFmpeg audio/video decoder
avcodec-hw=vaapi_drm
[postproc] # Video post processing filter
postproc-q=0
[core] # core program
skip-frames=0
quiet-synchro=1
deinterlace=-1
vout=xcb_xv

After many stops and rewinds by a few seconds, vlc hung as in "test 1" in the previous post. No errors in dmesg.
But in htop I saw 99% IOwait on one CPU core.
Vlc was killed with "-9" and I can observe no side effects.
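
For a hang like this, it can help to record the scheduler state of the stuck process before killing it. In `ps` output the first letter of the state column is, for example, D for uninterruptible sleep, S for interruptible sleep and R for running. The sketch below filters `ps` rows for one command name; the function name `rows_for_command` is hypothetical, and nothing in this report confirms which state vlc was actually in.

```shell
#!/bin/sh
# Read "PID STAT COMMAND" rows on stdin and print "PID STAT" for every row
# whose command name equals the first argument.
# (rows_for_command is a hypothetical helper name.)
rows_for_command() {
    awk -v name="$1" '$3 == name { print $1, $2 }'
}

# Usage on Linux (procps ps; "=" suppresses the column headers):
#   ps -eo pid=,stat=,comm= | rows_for_command vlc
```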
Comment 26 Mihail Kasadjikov 2015-02-01 20:02:06 UTC
I have switched the HW decoder in vlc to VDPAU via the libvdpau-va-gl1 library:

[avcodec] # FFmpeg audio/video decoder
avcodec-hw=vdpau_avcodec

And I get some additional errors:
libva info: VA-API version 0.36.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_36
libva info: va_openDriver() returns 0
[00007fc750f78278] avcodec decoder: Using OpenGL/VAAPI/libswscale backend for VDPAU for hardware decoding.
libva info: VA-API version 0.36.0
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_36
libva info: va_openDriver() returns 0
[00007fc750f78278] avcodec decoder: Using OpenGL/VAAPI/libswscale backend for VDPAU for hardware decoding.
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpDecoderRender_h264): no surfaces left in buffer
[VS] error (vdpVideoSurfaceGetBitsYCbCr): not implemented conversion VA FOURCC � -> VDP_YCBCR_FORMAT_YV12
[00007fc721499778] vdpau_chroma filter error: video surface export failure: VDP_STATUS_INVALID_Y_CB_CR_FORMAT

$ echo -n "FOURCC � ->" | hd
00000000  46 4f 55 52 43 43 20 ef  bf bd 20 2d 3e           |FOURCC ... ->|
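
The bytes ef bf bd in this dump are the UTF-8 encoding of U+FFFD, the replacement character, so the original FOURCC bytes were already lost by the time the message was captured. A FOURCC is a 32-bit value whose four bytes are usually ASCII characters, and printing it as hex plus a sanitized text form avoids this ambiguity. The sketch below shows one way to do that; the function name `fourcc_to_text` is hypothetical, and this is not how libvdpau-va-gl formats the value. The NV12 code 0x3231564e is used purely as an example input.

```shell
#!/bin/sh
# Print a 32-bit FOURCC value as hex, followed by its four bytes
# (least significant byte first) with non-printable bytes shown as "?".
# (fourcc_to_text is a hypothetical helper name.)
fourcc_to_text() {
    value=$(( $1 ))
    text=
    i=0
    while [ "$i" -lt 4 ]; do
        byte=$(( (value >> (8 * i)) & 255 ))
        if [ "$byte" -ge 32 ] && [ "$byte" -le 126 ]; then
            # Emit the byte through an octal escape in the printf format.
            text="${text}$(printf "\\$(printf '%03o' "$byte")")"
        else
            text="${text}?"
        fi
        i=$(( i + 1 ))
    done
    printf '0x%08x %s\n' "$value" "$text"
}

# Usage:
#   fourcc_to_text 0x3231564e    # prints: 0x3231564e NV12
```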

After about 2 minutes vlc hung. No errors in dmesg.
In htop I saw 99% IOwait on one CPU core.
Vlc was killed with "-9" and I can observe no side effects on the overall system.

This behaviour does not occur with all video files.

So. It looks like some video files (and YouTube streams) contain frames that the HW decoder can't understand, and this is fatal for the HW decoder. Or does the decoder generate some error code that libva can't parse? At this stage I can't tell which component is broken.
Comment 27 Mihail Kasadjikov 2015-02-01 21:39:28 UTC
The mpv player works fine, without freezes.

So. I think the cause of the freezes is not in the hardware but in libva/X11/mesa/vlc.
Comment 28 Jani Nikula 2015-10-23 09:37:47 UTC
(In reply to Mihail Kasadjikov from comment #27)
> The mpv player works fine and without freezes.
> 
> So. I think the freezes is not in hardware but in libva/X11/mesa/vlc.

Thanks, closing. Please reopen or file a new bug if the problem reappears.

