Distro: Fedora 17
wayland (master) heads/master-0-g9a5ed78
drm (master) heads/master-0-gb6da447
mesa (master) heads/master-0-g2487324
libva (master) heads/master-0-g3003999
intel-driver (master) heads/master-0-gd22d367
weston (master) heads/master-0-g62942ad
When recording using LibVA capture on Sandybridge-class Intel machines (i915 driver), Weston will lock up and become unresponsive. This is also accompanied by visible corruption on the display.
This is supposedly related to the RC6 settings. I will post a comment to this bug once I've tested the LibVA recording with i915's RC6 disabled (by appending the boot parameter i915.i915_enable_rc6=0).
Steps to Reproduce:
1. Start Weston in DRM mode
2. Press `mod-shift-space q`
3. Observe freeze of Weston
For the record, I tested on "Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz" IvyBridge, Fedora 19 x86_64 with RC6 enabled and did not encounter this issue.
"dmesg | grep RC6" --> [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp off
and powertop shows it in use too.
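For anyone comparing machines, the RC6 line from dmesg can be turned into a small state table. This is only a minimal sketch, assuming the message keeps the exact format quoted above; `parse_rc6_states` is a hypothetical helper name, not part of any existing tool.

```python
import re

def parse_rc6_states(line):
    """Parse an i915 'Enabling RC6 states' dmesg line into {state: bool}.

    Assumes the wording quoted in this report; other kernels may differ.
    """
    match = re.search(r"Enabling RC6 states:\s*(.*)", line)
    if match is None:
        return {}
    states = {}
    for item in match.group(1).split(","):
        parts = item.split()
        # Each item is expected to look like "RC6p on" or "RC6pp off".
        if len(parts) == 2 and parts[1] in ("on", "off"):
            states[parts[0]] = parts[1] == "on"
    return states

sample = "[drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp off"
print(parse_rc6_states(sample))
```

A line that does not contain the RC6 message yields an empty dict, so the helper can be run over the whole dmesg output.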
First, this was tested on Fedora 19, NOT 17.
I also tested this on another machine (same commit IDs as the previous machine):
ThinkPad T420 running on Ubuntu 12.04.
This issue is STILL OBSERVED when the RC6 states are disabled (Verified via dmesg), on both machines.
Why is the experimental - proof of concept - Weston screen capture problem/bug being assigned to libva? Can Ander look at it first and triage? Or is this not using the proof of concept screen capture program from Ander?
Ander, please triage this issue and give some feedback first. Per our discussion in Oregon, this is a proof of concept that should be moving to potentially libvacodec or gstreamer based.
(In reply to comment #4)
> Why is the experimental - proof of concept - Weston screen capture
> problem/bug being assigned to libva?
Well, that's the first I've heard about it being an experimental POC. Nonetheless, it's a bug and can be further triaged as you see fit. LibVA h.264 capture is about to be released as a feature in Weston. IIRC, using the feature causes a GPU hang too, only on Sandybridge, which likely rules out Weston itself. So libva or the libva intel-driver seems like the next logical component to investigate/assign until evidence shows otherwise. Who knows, maybe it's a kernel-level bug like some of the decode hangs we've encountered (Brian did rule out RC6)?
> Can Ander look at it first and triage?
> Or is this not using the proof of concept screen capture program from Ander?
Yeh, seems like a logical [co-]owner since Ander enabled the feature in Weston. Not sure if there is another "vehicle" (apart from Weston) to test the libva h.264 encoding pipeline. If there is, then that might be a good place to look at the bug.
Any progress on this issue?
It is possible that I wasn't able to reproduce this because I had too much free disk space. The screen capture code didn't handle an out-of-disk-space condition properly and ended up in a loop that caused the compositor to block. I sent a patch to the list that fixes this:
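To illustrate the failure mode only (this is not the patch referred to above, and not Weston's code): a write loop that retries without checking for errors or lack of progress can spin forever once the disk is full. The sketch below is a generic Python version of a loop that gives up instead; the name `write_all` and the decision to report a no-progress write as ENOSPC are assumptions made for this example.

```python
import errno
import os
import tempfile

def write_all(fd, data):
    """Write all of data to fd, raising instead of retrying forever.

    Returns the number of bytes written.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        # os.write raises OSError (for example ENOSPC) when the write fails.
        n = os.write(fd, view[written:])
        if n == 0:
            # No progress: bail out rather than loop (ENOSPC chosen arbitrarily).
            raise OSError(errno.ENOSPC, "write made no progress")
        written += n
    return written

# Usage on a temporary file standing in for the capture output.
fd, path = tempfile.mkstemp()
try:
    total = write_all(fd, b"frame" * 1000)
finally:
    os.close(fd)
size = os.path.getsize(path)
os.remove(path)
print(total, size)
```

The caller is expected to catch `OSError` and stop recording, so the process doing the capture is never left blocked in the loop.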
On a side note, I noticed that the kernel change that fixed (or worked around) the problem on SandyBridge was reverted, so I am seeing an RC6-related freeze on my machine. It seems the only kernel that works properly is 3.12, while 3.13 and 3.14 have the "fix" reverted.
This was the "fix":
And this is the commit that reverted it:
I tested this again on the following s/w stack on an NDIS166 w/ Intel(R) Celeron(R) CPU B810 @ 1.60GHz:
wayland (HEAD) remotes/origin/HEAD-0-g8511544
drm (HEAD) libdrm-2.4.52-0-g46d451c
mesa (HEAD) heads/10.1-0-g9e1eb6f
libva (HEAD) libva-1.2.1-0-g88ed1eb
intel-driver (HEAD) 1.2.2-0-g121e70d
cairo (HEAD) heads/1.12-0-g59e2a93
libinput (HEAD) heads/master-0-g97af5c3
weston (HEAD) remotes/origin/HEAD-0-gbe803ad
Weston still hangs, but I'm now getting some different log messages (kernel: [drm] stuck on render ring) and a GPU crash dump... perhaps they'll be useful.
Created attachment 99037 [details]
gpu crash dump
Do you have RC6 enabled? If you do, you would hit a lockup due to what I mentioned in the side note in my previous comment.
Brian mentioned that the issue still happened with RC6 disabled on IvyBridge. Can you confirm that is still the case?
(In reply to comment #11)
> "Kernel: 3.14.3-200.fc20.x86_64"
> Do you have RC6 enabled. If you do you would hit a lock up due to what I
> mentioned on the side note on my previous comment.
Yes, RC6 was enabled when I tested on Sandybridge. I'll confirm with RC6 disabled.
The first kernel patch you mentioned did fix this issue, IIRC. And I see why it had to be reverted. I wonder if there's something else that can be changed to fix this issue without introducing other problems (like in the revert commit).
> Brian mentioned that the issue still happened with RC6 disabled on
> IvyBridge. Can you confirm that is still the case?
Eh? I don't see where he mentioned that. I only see a comment (#c3) where he tested on "Sandybridge" with RC6 "disabled" and still saw it. Again, I'll reconfirm that.
I tested on Ivybridge (#c2) with RC6 "enabled" and no issue.
I confirmed that disabling RC6 does not solve this issue on Sandybridge.
Thus, regardless of RC6 enabled or disabled, this issue persists.
In Uartie's dump, it looks like the GPU kept on running after the batch hung. Nevertheless it is the libva batch that made the GPU stop processing commands (even basic ones such as LRI and SDW).
This issue still causes Weston to freeze on Sandybridge with the software stack below.
wayland (HEAD) remotes/origin/HEAD-0-g6d0f298
drm (HEAD) heads/master-16-gd686160
mesa (HEAD) remotes/origin/10.2-0-gd82ca4e
libva (HEAD) libva-1.3.1-0-g053f70f
intel-driver (HEAD) 1.3.1-0-ga720bc8
cairo (HEAD) heads/1.12-0-g59e2a93
libinput (HEAD) heads/master-163-gbb10ec8
weston (HEAD) remotes/origin/master-0-g652c794
I am having a very similar issue.
My CPU is a Haswell 4790K running kernel 3.9.14.
I use the latest Intel Media SDK releases, SDK2015ProductionEvaluation16.4.1 and SDK2015ProductionEvaluation16.4.2.
When I try to do accelerated encoding with ffmpeg or the provided h264encode app, the GPU gets stuck.
For the record, these apps use libva, here is the output of the libva init:
libva info: VA-API version 0.35.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
I get the following dmesg:
[ 85.952746] [drm] stuck on bsd ring
[ 85.953293] [drm] GPU HANG: ecode 7:1:0xa8ffdfbe, in ffmpeg , reason: Ring hung, action: reset
[ 85.953294] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 85.953294] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 85.953295] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 85.953295] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 85.953296] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 85.960746] drm/i915: Resetting chip after gpu hang
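When comparing hangs across machines, it can help to pull the key fields out of these messages. The following is a rough sketch, assuming the message wording shown in this log; the regular expressions and the `summarize_hang` name are made up for this example and may not match other kernel versions.

```python
import re

# Patterns written against the messages quoted in this report only.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<process>.*?)\s*, "
    r"reason: (?P<reason>.*?), action: (?P<action>\S+)"
)
RING_RE = re.compile(r"stuck on (?P<ring>\w+) ring")

def summarize_hang(dmesg_lines):
    """Collect ring, ecode, process, reason and action from dmesg lines."""
    summary = {}
    for line in dmesg_lines:
        ring = RING_RE.search(line)
        if ring:
            summary["ring"] = ring.group("ring")
        hang = HANG_RE.search(line)
        if hang:
            summary.update(hang.groupdict())
    return summary

log = [
    "[ 85.952746] [drm] stuck on bsd ring",
    "[ 85.953293] [drm] GPU HANG: ecode 7:1:0xa8ffdfbe, in ffmpeg , "
    "reason: Ring hung, action: reset",
]
print(summarize_hang(log))
```

The same helper also picks up the ring name from the earlier "stuck on render ring" message, which makes it easy to see that the two reports hang on different rings.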
Please find attached the error reports.
I tried disabling RC6, but it did not help.
Created attachment 116020 [details]
Dmesg of GPU hang when doing mediasdk encoding
Recently we found that some cases work well with GTT but hang with PPGTT on SNB. Could you disable PPGTT, or try a new kernel on SNB, if you still experience this issue?
Since there is no response for more than six months, this bug will be closed.
If the issue still exists and can be reproduced, please reopen this bug or file a new one. It would be best to first try the PPGTT-disabling option "i915.enable_ppgtt=0" to see whether the hang goes away.
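Before retesting, it is worth confirming that the i915 options discussed in this bug actually reached the kernel. A minimal sketch, assuming a Linux system where the boot command line is readable from /proc/cmdline; the helper names and the example command line are hypothetical, and the option names are the ones used in this thread (they may differ between kernel versions).

```python
def i915_params(cmdline):
    """Return the i915.* options on a kernel command line as {name: value}."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            name, value = token[len("i915."):].split("=", 1)
            params[name] = value
    return params

def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()

# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro i915.enable_ppgtt=0 i915.i915_enable_rc6=0"
print(i915_params(example))
```

On a real machine, `i915_params(read_cmdline())` should show `enable_ppgtt` set to `0` if the option was applied at boot.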