Bug 69330 - [SNB] vaapi h.264 encoding causes drm stuck on render ring
Status: RESOLVED INVALID
Alias: None
Product: libva
Classification: Unclassified
Component: intel
Version: unspecified
Hardware: Other All
Importance: high major
Assignee: Pengfei
QA Contact: Sean V Kelley
 
Reported: 2013-09-13 17:47 UTC by Brian Lovin
Modified: 2016-01-06 02:23 UTC



Attachments
gpu crash dump (570.67 KB, application/x-compressed-tar)
2014-05-14 17:10 UTC, U. Artie Eoff
Dmesg of GPU hang when doing mediasdk encoding (815.92 KB, text/plain)
2015-05-25 08:33 UTC, trolldev

Description Brian Lovin 2013-09-13 17:47:26 UTC
System Environment:
--------------------------
Distro: Fedora 17
Arch: x86_64
wayland (master) heads/master-0-g9a5ed78
drm (master) heads/master-0-gb6da447
mesa (master) heads/master-0-g2487324
libva (master) heads/master-0-g3003999
intel-driver (master) heads/master-0-gd22d367
weston (master) heads/master-0-g62942ad


Detailed Description:
-----------------------------
When recording using LibVA capture on Sandybridge-class Intel machines (i915 driver), Weston will lock up and become unresponsive. This is also accompanied by visible corruption on the display.

This is reportedly related to the RC6 settings. I will post a comment on this bug once I have tested the LibVA recording with i915's RC6 disabled (by appending the boot parameter i915.i915_enable_rc6=0).

Steps to Reproduce:
----------------------------
1. Start Weston under DRM mode
2. Press `mod-shift-space q`
3. Observe that Weston freezes
Comment 2 U. Artie Eoff 2013-09-13 18:46:35 UTC
For the record, I tested on "Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz" IvyBridge, Fedora 19 x86_64 with RC6 enabled and did not encounter this issue.

"dmesg | grep RC6" --> [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp off

and powertop shows it in use too.
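
As an illustration of the dmesg check above, here is a minimal Python sketch that turns the quoted "Enabling RC6 states" line into a dictionary. The pattern is derived only from the single line quoted in this comment, so the wording on other kernel versions is an assumption, and `parse_rc6_states` is a hypothetical helper name.

```python
import re

# Pattern based on the one dmesg line quoted above; other kernel
# versions may word this message differently (assumption).
RC6_LINE = re.compile(r"Enabling RC6 states:\s*(.*)")


def parse_rc6_states(dmesg_text):
    """Return a dict such as {'RC6': True, 'RC6p': True, 'RC6pp': False}.

    Returns an empty dict if no matching line is present.
    """
    for line in dmesg_text.splitlines():
        match = RC6_LINE.search(line)
        if not match:
            continue
        states = {}
        # Each comma-separated item is expected to look like "RC6p on".
        for item in match.group(1).split(","):
            parts = item.split()
            if len(parts) == 2 and parts[1] in ("on", "off"):
                states[parts[0]] = parts[1] == "on"
        return states
    return {}
```

The function takes the dmesg text as a string, so it can be fed captured `dmesg` output or a saved log file.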
Comment 3 Brian Lovin 2013-09-13 19:18:06 UTC
First, this was tested on Fedora 19, NOT 17.

I also tested this on another machine (same commit IDs as the previous machine):
ThinkPad T420 running on Ubuntu 12.04.

This issue is STILL OBSERVED when the RC6 states are disabled (Verified via dmesg), on both machines.
Comment 4 Sean V Kelley 2013-10-01 06:07:41 UTC
Why is the experimental - proof of concept - Weston screen capture problem/bug being assigned to libva?  Can Ander look at it first and triage?  Or is this not using the proof of concept screen capture program from Ander?
Comment 5 Sean V Kelley 2013-10-01 06:13:02 UTC
Ander, please triage this issue and give some feedback first.  Per our discussion in Oregon, this is a proof of concept that should be moving to potentially libvacodec or gstreamer based.
Comment 6 U. Artie Eoff 2013-10-01 14:43:06 UTC
(In reply to comment #4)
> Why is the experimental - proof of concept - Weston screen capture
> problem/bug being assigned to libva?

Well, that's the first I've heard about it being an experimental POC.  Nonetheless, it's a bug and can be further triaged as you see fit.  LibVA h.264 capture is about to be released as a feature in Weston.  IIRC, using the feature causes a GPU hang too, only on Sandybridge, which likely rules out Weston itself.  So libva or the libva intel-driver seems like the next logical component to investigate/assign until evidence shows otherwise.  Who knows, maybe it's a kernel-level bug like some of the decode hangs we've encountered (Brian did rule out RC6)??

> Can Ander look at it first and triage?
> Or is this not using the proof of concept screen capture program from Ander?

Yeh, seems like a logical [co-]owner since Ander enabled the feature in Weston.  Not sure if there is another "vehicle" (apart from Weston) to test the libva h.264 encoding pipeline.  If there is, then that might be a good place to look at the bug.
Comment 7 U. Artie Eoff 2014-04-15 19:59:41 UTC
Any progress on this issue?
Comment 8 Ander Conselvan de Oliveira 2014-05-09 13:01:07 UTC
It is possible that I wasn't able to reproduce this because I had too much disk space. The screen capture code didn't handle an out-of-disk-space condition properly and ended up in a loop that caused the compositor to block. I sent a patch to the list that fixes this:

http://lists.freedesktop.org/archives/wayland-devel/2014-May/014761.html
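
To illustrate the class of problem described above (this is not the code from the linked patch, and Weston itself is written in C), here is a minimal Python sketch of a write loop that stops on a full disk instead of retrying forever. `write_all` and `record_frame` are hypothetical names chosen for the example.

```python
import errno
import os


def write_all(fd, data):
    """Write all of `data` to `fd`, or raise OSError.

    Hypothetical helper for illustration only. Errors such as ENOSPC
    propagate to the caller instead of being retried in a loop.
    """
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)  # raises OSError(ENOSPC) on a full disk
        if written == 0:
            # A zero-byte write makes no progress; retrying could spin forever.
            raise OSError(errno.ENOSPC, "no progress while writing")
        view = view[written:]  # continue after a short write


def record_frame(fd, frame):
    """Return True if the frame was stored, False if recording must stop."""
    try:
        write_all(fd, frame)
        return True
    except OSError as err:
        if err.errno == errno.ENOSPC:
            return False  # disk full: stop recording rather than block
        raise
```

The point of the sketch is that the caller gets a clear "stop recording" signal, so the main loop of the compositor is never held up by a write that cannot succeed.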


---

On a side note, I noticed that the change in the kernel that fixed (or worked around) the problem on SandyBridge was reverted. So I am seeing an RC6-related freeze on my machine. It seems the only kernel that works properly is 3.12, while 3.13 and 3.14 have the "fix" reverted.

This was the "fix":

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=351aa5666d02062b52329bcfe4bcf9d1f882fba9

And this is the commit that reverted it:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29c78f609e661e663a239a37923adb1d61f6386c
Comment 9 U. Artie Eoff 2014-05-14 17:10:03 UTC
I tested this again on the following s/w stack on an NDIS166 w/ Intel(R) Celeron(R) CPU B810 @ 1.60GHz:

kernel 3.14.3-200.fc20.x86_64
wayland (HEAD) remotes/origin/HEAD-0-g8511544
drm (HEAD) libdrm-2.4.52-0-g46d451c
mesa (HEAD) heads/10.1-0-g9e1eb6f
libva (HEAD) libva-1.2.1-0-g88ed1eb
intel-driver (HEAD) 1.2.2-0-g121e70d
cairo (HEAD) heads/1.12-0-g59e2a93
libinput (HEAD) heads/master-0-g97af5c3
weston (HEAD) remotes/origin/HEAD-0-gbe803ad

Weston still hangs, but I'm now getting some different log messages (kernel: [drm] stuck on render ring) and a GPU crash dump... perhaps they'll be useful.
Comment 10 U. Artie Eoff 2014-05-14 17:10:40 UTC
Created attachment 99037
gpu crash dump
Comment 11 Ander Conselvan de Oliveira 2014-05-15 07:47:16 UTC
"Kernel: 3.14.3-200.fc20.x86_64"

Do you have RC6 enabled? If you do, you would hit a lockup due to what I mentioned in the side note in my previous comment.

Brian mentioned that the issue still happened with RC6 disabled on IvyBridge. Can you confirm that is still the case?
Comment 12 U. Artie Eoff 2014-05-15 14:09:40 UTC
(In reply to comment #11)
> "Kernel: 3.14.3-200.fc20.x86_64"
> 
> Do you have RC6 enabled? If you do, you would hit a lockup due to what I
> mentioned in the side note in my previous comment.
> 

Yes, RC6 was enabled when I tested on Sandybridge.  I'll confirm with RC6 disabled.

The first kernel patch you mentioned did fix this issue, IIRC.  And I see why it had to be reverted.  I wonder if there's something else that can be changed to fix this issue without introducing other problems (like in the revert commit).

> Brian mentioned that the issue still happened with RC6 disabled on
> IvyBridge. Can you confirm that is still the case?

Eh?  I don't see where he mentioned that.  I only see a comment (#c3) where he tested on "Sandybridge" with RC6 "disabled" and still saw it.  Again, I'll reconfirm that.

I tested on Ivybridge (#c2) with RC6 "enabled" and no issue.
Comment 13 U. Artie Eoff 2014-05-15 16:14:45 UTC
I confirmed that disabling RC6 does not solve this issue on Sandybridge.

Thus, regardless of RC6 enabled or disabled, this issue persists.
Comment 14 Chris Wilson 2014-05-15 20:42:31 UTC
In Uartie's dump, it looks like the GPU kept on running after the batch hung. Nevertheless it is the libva batch that made the GPU stop processing commands (even basic ones such as LRI and SDW).
Comment 15 Anu Reddy 2014-08-26 18:40:54 UTC
This issue still causes Weston to freeze on Sandybridge with the software stack below.

Software Stack
wayland (HEAD) remotes/origin/HEAD-0-g6d0f298 
drm (HEAD) heads/master-16-gd686160 
mesa (HEAD) remotes/origin/10.2-0-gd82ca4e 
libva (HEAD) libva-1.3.1-0-g053f70f 
intel-driver (HEAD) 1.3.1-0-ga720bc8 
cairo (HEAD) heads/1.12-0-g59e2a93 
libinput (HEAD) heads/master-163-gbb10ec8 
weston (HEAD) remotes/origin/master-0-g652c794 

Kernel: 3.14.4-200.fc20.x86_64
Comment 16 trolldev 2015-05-25 08:32:46 UTC
Hi all,

I am having a very similar issue.
My CPU is a Haswell 4790K running kernel 3.9.14.
I use the latest Intel Media SDK releases, SDK2015ProductionEvaluation16.4.1 and SDK2015ProductionEvaluation16.4.2.

When I try to do an accelerated encoding with ffmpeg or the provided h264encode app, the GPU gets stuck.

For the record, these apps use libva, here is the output of the libva init:

libva info: VA-API version 0.35.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0


I get the following dmesg:

[   85.952746] [drm] stuck on bsd ring
[   85.953293] [drm] GPU HANG: ecode 7:1:0xa8ffdfbe, in ffmpeg [7536], reason: Ring hung, action: reset
[   85.953294] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   85.953294] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   85.953295] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   85.953295] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   85.953296] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   85.960746] drm/i915: Resetting chip after gpu hang

Please find attached the error reports.

I tried disabling RC6, but it did not help.

Thanks
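
For anyone collecting similar reports, here is a minimal Python sketch that pulls the fields out of a "GPU HANG" line like the one in the dmesg output above. The pattern is derived only from that single quoted line, so the message format on other kernel versions is an assumption, and `find_gpu_hangs` is a hypothetical helper name.

```python
import re

# Pattern derived from the one "GPU HANG" line quoted above; the exact
# wording may vary between kernel versions (assumption).
GPU_HANG = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\S+)"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in the given dmesg text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = GPU_HANG.search(line)
        if match:
            info = match.groupdict()
            info["pid"] = int(info["pid"])  # pid is numeric in the log
            hangs.append(info)
    return hangs
```

As the log itself says, the crash dump saved to /sys/class/drm/card0/error is still what is needed to analyze the hang; the parsed fields only help with sorting reports.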
Comment 17 trolldev 2015-05-25 08:33:30 UTC
Created attachment 116020
Dmesg of GPU hang when doing mediasdk encoding
Comment 18 haihao 2015-11-23 14:40:13 UTC
Recently we found that some cases work well with GTT but hang with PPGTT on SNB. Could you disable PPGTT or try a new kernel on SNB if you still experience this issue?
Comment 19 ykzhao 2016-01-06 02:23:10 UTC
Since there has been no response for more than six months, this bug will be closed.

If the issue still exists and can be reproduced, please reopen it or file a new bug. Of course, it would be better to first try the mentioned option "i915.enable_ppgtt=0" to see whether the hang goes away.

Thanks
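
When retesting, it may help to confirm which i915 options were actually passed at boot. Here is a minimal Python sketch that extracts `i915.*` options from a kernel command line string (for example, the contents of /proc/cmdline). `i915_cmdline_options` is a hypothetical helper name, and the sketch only reports what was requested on the command line, not what the driver ended up using.

```python
def i915_cmdline_options(cmdline):
    """Return the i915.* options found in a kernel command line as a dict.

    Options given without a value map to None.
    """
    options = {}
    for token in cmdline.split():
        if not token.startswith("i915."):
            continue
        name, sep, value = token[len("i915."):].partition("=")
        options[name] = value if sep else None
    return options


# Example usage on a running Linux system:
#     with open("/proc/cmdline") as f:
#         print(i915_cmdline_options(f.read()))
```

With the options mentioned in this bug, a command line containing `i915.enable_ppgtt=0 i915.i915_enable_rc6=0` yields `{"enable_ppgtt": "0", "i915_enable_rc6": "0"}`.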

