Bug 83207

Summary: [BDW rc6] GPU hang playing a video
Product: xorg Reporter: Timo Aaltonen <tjaalton>
Component: Driver/intel Assignee: Rodrigo Vivi <rodrigo.vivi>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: gary.c.wang, intel-gfx-bugs, xiong.y.zhang, yk
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
patch 1/2 (flags: none)
patch 2/2 (flags: none)
GPU HANG: ecode 0:0xf5dffffe (flags: none)
Enable using BCS for pageflips in gen7/7+/8 (flags: none)
rc6 disabled in BDW d-step CPU with kernel 3.18-rc1/Xubuntu 14.10 beta (flags: none)
The latest xf86-video-intel built for commit d08a5f555a0c47ae23c0f9a890b512cb23e74feb (flags: none)
xserver-xorg-video-intel_2.99.999-0ubuntu1.1_amd64.deb (flags: none)
Use Vmask for 3DSTATE_PS (flags: none)
"Use Vmask for 3DSTATE_PS" patch applied xserver-xorg-video deb (flags: none)

Description Timo Aaltonen 2014-08-28 18:46:52 UTC
Playing the Ubuntu welcome video causes GPU hangs on BDW. It's launched with

gst-launch-0.10 playbin uri=file:///UbuntuBoot.ogv video-sink=xvimagesink

and then boom. Tried disabling rc6 which helps with some steppings but not the latest ones (as on Wilson Beach).

error states (and the video) at http://koti.kapsi.fi/~tjaalton/bdw


Tested on ubuntu 14.10 with:

kernel v3.17-rc2
libdrm 2.4.56
mesa 10.2.6
xdrv 2.99.914

same with an earlier stack on ubuntu 14.04..
Comment 1 Timo Aaltonen 2014-08-28 19:03:04 UTC
using 'ximagesink' instead works fine
Comment 2 Chris Wilson 2014-08-29 08:06:35 UTC
That's a planar YUV video (as opposed to packed YUV), do you have other videos that work? Just hoping that the failure is in the planar video path...
Comment 3 Chris Wilson 2014-08-29 09:45:55 UTC
Timo, can you please try with

commit 2086965e5c0781e0a3996de89e4dda03c5d42610
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 29 10:37:09 2014 +0100

    gen8: Refresh video render programs

?
Comment 4 Timo Aaltonen 2014-09-01 13:43:25 UTC
So that patch didn't change things as discussed on irc, any further ideas? :)
Comment 5 Chris Wilson 2014-09-05 07:50:02 UTC
Timo mentioned that if he disables rc6 from the BIOS, all is fine.
Comment 6 Chris Wilson 2014-09-05 10:11:52 UTC
We await results from testing with recent -nightly and the per-context reg w/a.
Comment 7 Timo Aaltonen 2014-09-05 11:41:31 UTC
nightly build from last night should have that commit and it still has the bug
Comment 8 Yang Kun (YK) 2014-09-05 15:43:10 UTC
this issue is blocking us from shipping. bumping the importance to high+critical. please let me know if this is not appropriate.

thanks
-YK
Comment 9 Timo Aaltonen 2014-09-23 16:21:22 UTC
Here's a recent error state with drm-intel-nightly build from Sep 17th:

http://koti.kapsi.fi/~tjaalton/bdw/i915_error_state_b36_intel
Comment 10 Rodrigo Vivi 2014-09-23 20:28:54 UTC
Created attachment 106758 [details] [review]
patch 1/2
Comment 11 Rodrigo Vivi 2014-09-23 20:29:27 UTC
Created attachment 106759 [details] [review]
patch 2/2

WaCsStallBeforeStateCacheInvalidate
Comment 12 Rodrigo Vivi 2014-09-23 20:30:39 UTC
Could you please test -nightly with 2 patches attached?

Also the original equivalent of them on your kernel?
Comment 13 Timo Aaltonen 2014-09-24 09:25:05 UTC
no luck with them on -nightly, error state looks identical to the old
Comment 14 Rodrigo Vivi 2014-09-24 21:28:14 UTC
That is odd. Maybe I was looking your error state though...

But regardless of the -nightly result with those patches, your kernel really needs those original patches.
Comment 15 Gary Wang 2014-10-01 09:39:54 UTC
Created attachment 107170 [details]
GPU HANG: ecode 0:0xf5dffffe

Ran "UbuntuBoot.ogv" playback on Ubuntu 14.04 (3.13.0-36-generic #63+hwe3-Ubuntu); the GPU hang shows up in the kernel log.
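
A hang like this can be picked out of a saved kernel log with a short script. The sketch below is in Python and assumes the log line carries the same "GPU HANG: ecode <n>:<hex>" text as the attachment title. The function name is hypothetical, and the meaning of the number before the colon is not asserted here, so it is simply returned as `prefix`.

```python
import re

# Matches the text seen in the attachment title, e.g.
# "GPU HANG: ecode 0:0xf5dffffe". Surrounding log text is ignored.
HANG_RE = re.compile(r"GPU HANG: ecode (\d+):(0x[0-9a-fA-F]+)")

def find_gpu_hangs(log_text):
    """Return a list of (prefix, ecode) tuples found in kernel log text."""
    return [(int(prefix), ecode.lower())
            for prefix, ecode in HANG_RE.findall(log_text)]
```

Feeding it the contents of a saved kernel log returns one tuple per reported hang, or an empty list if none were logged.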
Comment 16 Gary Wang 2014-10-02 05:09:09 UTC
Created attachment 107215 [details]
Enable using BCS for pageflips in gen7/7+/8

I verified this issue on an HP Stag BW C2 SKU device (BIOS B.38) with kernel 3.13 (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tag/?id=v3.13) and the "drm-i915-always-enable-BCS-to-gen7-later" patch applied.

The patch forces the use of BCS pageflips on Ivybridge and later.

The GPU hang with ecode 0xf5dffffe went away with kernel 3.13 plus the patch, and UbuntuBoot.ogv playback worked well (via firstboot-video provided by Canonical).
test #1, 1345 cycles overnight, pass
test #2, 100 cycles, pass
test #3, 123 cycles, pass
test #4, 132 cycles, pass
test #5, 50 cycles, pass
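
A repeat-playback test of this kind can be scripted. The sketch below is not the script used for the cycles above. It reuses the gst-launch-0.10 command line from the bug description, and the function names and the injectable `run` parameter are assumptions made for illustration.

```python
import subprocess

def build_command(uri, sink="xvimagesink"):
    """Build the gst-launch-0.10 command line from the bug description."""
    return ["gst-launch-0.10", "playbin", "uri=" + uri, "video-sink=" + sink]

def run_cycles(count, uri, run=subprocess.call):
    """Play the clip `count` times; return how many runs exited non-zero.

    `run` is injectable so the loop can be exercised without GStreamer.
    A GPU hang does not necessarily make gst-launch exit with an error,
    so the kernel log still has to be checked separately after the loop.
    """
    failures = 0
    for _ in range(count):
        if run(build_command(uri)) != 0:
            failures += 1
    return failures
```

Calling `run_cycles(100, "file:///UbuntuBoot.ogv")` would play the clip 100 times on a machine that has gst-launch-0.10 installed.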

The random UI freeze (like bug #77104, https://bugs.freedesktop.org/show_bug.cgi?id=77104) has not happened again so far.

Gary
Comment 17 Gary Wang 2014-10-02 05:12:30 UTC
With the fix patch (https://bugs.freedesktop.org/attachment.cgi?id=100213) from issue #77104 (https://www.libreoffice.org/bugzilla/show_bug.cgi?id=77104), there is still some failure rate ending in a UI freeze/GPU hang.
Comment 18 Gary Wang 2014-10-02 05:13:21 UTC
Comments #16 and #17 are based on the BDW platform.
Comment 19 Gavin Hindman 2014-10-16 14:08:55 UTC
Does disabling RC6 really eliminate the error?  The original comments indicated it helped some early steppings, but not later steppings.
Comment 20 Rodrigo Vivi 2014-10-16 16:02:39 UTC
I have to admit I'm lost here. This patch looks correct because it forces a behaviour that is already the one used upstream, and also the one used in the Canonical backport for BDW. So I have no idea which kernel is in question here.

Did Canonical apply the kernel patches I had pointed out, to include this W/A: WaCsStallBeforeStateCacheInvalidate?

#77104 doesn't make sense here. If you are facing a similar issue this is another bug. Please reproduce it with -nightly and open a new bug.
Comment 21 Rodrigo Vivi 2014-10-22 00:39:04 UTC
Hi Timo,

I got a clean ubuntu 14.04-1 here, installed the versions you had mentioned in the first report from launchpad, and tried to reproduce the bug locally, but I couldn't.

With your 3.13.0-36 it hangs on boot.

With 3.17 everything works fine, including the video.

Is there anything I'm missing? Any other change in your environment that you didn't mention?
Comment 22 Timo Aaltonen 2014-10-22 06:40:03 UTC
You need to install the matching linux-image-extra package too, which has all of drm/*.. that'd explain the boot hang.

I think the problem that OEM1&2 are seeing (and not OEM3) is due to the fact that their first-stage installer uses a slightly older kernel (-34) which then might(?) leave the hw in some state that after a reboot to the latest kernel it'll fail with this issue. The gpu hang can't be reproduced after the second reboot.. I'll try to synthesize that on my hw.

And I'll double-check if this is the diff the images have.
Comment 23 Gary Wang 2014-10-22 09:52:47 UTC
It can be reproduced in XUbuntu 14.10 beta-1 (http://cdimage.ubuntu.com/xubuntu/releases/14.10/beta-1/xubuntu-14.10-beta1-desktop-amd64.iso) with its resolution more than 1920x1080 (in WSB SDS).
Comment 24 Timo Aaltonen 2014-10-22 16:37:41 UTC
if 14.10beta fails it could be because its 3.16-based kernel doesn't have all the workarounds..
Comment 25 Gary Wang 2014-10-23 01:35:57 UTC
I upgraded the kernel from 3.16 to 3.18rc1 in Xubuntu 14.10 beta-1/-2 and still got the same GPU hang, error code “0x85dffffb”, on the BDW platform (WSB SDS).
Comment 26 Gary Wang 2014-10-24 05:52:28 UTC
Created attachment 108334 [details]
rc6 disabled in BDW d-step CPU with kernel 3.18-rc1/Xubuntu 14.10 beta

For comment #25 (drm-intel-nightly-10/22):
With rc6 disabled via i915.enable_rc6=0 on drm-intel-nightly-10/22, the GPU hang only occurred on the first run, and the following test cycles worked well. It appears to be related to GPU rc6.

intel@intel-Broadwell-Client-platform:~$ ./play.sh 
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstPulseSinkClock
WARNING: from element /GstPlayBin:playbin0/GstBin:vbin/GstBin:bin0/GstXvImageSink:xvimagesink0: A lot of buffers are being dropped.
Additional debug info:
gstbasesink.c(2875): gst_base_sink_is_too_late (): /GstPlayBin:playbin0/GstBin:vbin/GstBin:bin0/GstXvImageSink:xvimagesink0:
There may be a timestamping problem, or this computer is too slow.
WARNING: from element /GstPlayBin:playbin0/GstBin:vbin/GstBin:bin0/GstXvImageSink:xvimagesink0: A lot of buffers are being dropped.
Additional debug info:
gstbasesink.c(2875): gst_base_sink_is_too_late (): /GstPlayBin:playbin0/GstBin:vbin/GstBin:bin0/GstXvImageSink:xvimagesink0:
There may be a timestamping problem, or this computer is too slow.
Got EOS from element "playbin0".
Execution ended after 33048083158 ns.
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
intel@intel-Broadwell-Client-platform:~$ ./play.sh 
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstPulseSinkClock
Got EOS from element "playbin0".
Execution ended after 33047536912 ns.
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
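
The two runs above differ in the dropped-buffer warnings, which makes it possible to compare runs mechanically. The sketch below is one way to summarize a captured gst-launch output. The function name is hypothetical, and it only looks for the two strings that appear in the log above.

```python
import re

DROP_WARNING = "A lot of buffers are being dropped."
ENDED_RE = re.compile(r"Execution ended after (\d+) ns\.")

def summarize_run(output):
    """Return (dropped_warnings, seconds) for one gst-launch run's output.

    `seconds` is None if no "Execution ended after ... ns." line is found.
    """
    warnings = output.count(DROP_WARNING)
    match = ENDED_RE.search(output)
    seconds = int(match.group(1)) / 1e9 if match else None
    return warnings, seconds
```

For the first run above this reports two warnings and roughly 33.05 seconds; for the second run it reports no warnings and about the same duration.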
Comment 27 Timo Aaltonen 2014-10-24 07:38:52 UTC
The theory that I had was wrong, it doesn't matter if the first-stage installer kernel is old or not, still happens with a newer kernel.

And in fact I can reproduce this on 14.10 with WB and newer kernel.. at least sometimes. Same ecode 0x85dffffb.
Comment 28 Rodrigo Vivi 2014-10-24 21:21:37 UTC
Hi Timo and Gary,

This seems to be a duplicate of: https://bugs.freedesktop.org/show_bug.cgi?id=85389

Can you please verify the xf86-video-intel SNA fix listed there?
Comment 29 Timo Aaltonen 2014-10-24 21:31:58 UTC
ddx on 14.10 is 2.99.914, so it doesn't have that regression
Comment 30 Rodrigo Vivi 2014-10-27 17:19:58 UTC
I tried again to reproduce here and everything ran fine.

Now I have Xubuntu 14.10, but the latest one already contains Mesa 10.3, so you probably want to give it a try.

Other differences are the silicon stepping and the BIOS. I would recommend testing your images on the latest available silicon/BIOS.
Comment 31 Yang Kun (YK) 2014-10-28 03:30:58 UTC
(In reply to Rodrigo Vivi from comment #30)
> I tried again to reproduce here and everything ran fine.
> 
> Now I have Xubuntu 14.10, but the latest one already contains Mesa 10.3, so
> you probably want to give it a try.
> 
> Other differences are the silicon stepping and the BIOS. I would recommend
> testing your images on the latest available silicon/BIOS.

Hi Rodrigo, 

can you share what silicon stepping and BIOS/vBIOS version you're using?

thank you
-YK
Comment 32 Gary Wang 2014-10-28 06:21:41 UTC
Created attachment 108553 [details]
The latest xf86-video-intel built for commit d08a5f555a0c47ae23c0f9a890b512cb23e74feb

Hi Rodrigo,
I used the latest snapshot of xf86-video-intel (including your patch http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=4df0052a21efd744c4b8cb2409139ded6e45f5c8) to verify this issue in Xubuntu 14.10 beta-1 (because my build host has xserver-xorg-core v1.15),

commit d08a5f555a0c47ae23c0f9a890b512cb23e74feb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 24 09:53:29 2014 +0100

    sna/trapezoids: Prevent overflow of edge gradient in mono rasteriser

    References: https://bugs.freedesktop.org/show_bug.cgi?id=70461#c76
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

This issue can still be reproduced with the same GPU error code on a d-step CPU BDW machine.
Comment 33 Gary Wang 2014-10-28 06:24:32 UTC
Created attachment 108554 [details]
xserver-xorg-video-intel_2.99.999-0ubuntu1.1_amd64.deb

For comment #32, The latest code of xf86-video-intel built for commit 

commit d08a5f555a0c47ae23c0f9a890b512cb23e74feb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 24 09:53:29 2014 +0100

    sna/trapezoids: Prevent overflow of edge gradient in mono rasteriser

    References: https://bugs.freedesktop.org/show_bug.cgi?id=70461#c76
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 34 Gary Wang 2014-10-28 06:42:58 UTC
I got the same GPU error code with the Xubuntu 14.10 formal release on the same BDW device (BIOS: BDW-E2R1.86C.0095.R08.1410190256, 10/19/2014, d-step CPU) at 3200x1800; the test passes at 1920x1080.

I will try to get a newer BDW CPU/device for verification (I don't have one yet).
Comment 35 Gary Wang 2014-10-28 06:51:42 UTC
The version of MESA in Xubuntu 14.10 formal release is v10.3.0, xdrv is 2.99.914, libdrm is 2.4.56
Comment 36 Timo Aaltonen 2014-10-29 09:24:59 UTC
One way to trigger this is to bump the scale factor on Unity to 1.5, then I can reproduce it on the BDW ULX box too. It should have the latest stepping (4), while my Wilson Beach is still on beta.

You can find the scale factor from display settings. It's set by default on the OEM machines in question. After disabling it the gpu hang is not seen.
Comment 37 XiongZhang 2014-10-30 09:38:48 UTC
On Wilson Beach, I can reproduce this issue as Timo suggested, by setting scale > 1 and resolution > 1920x1080.
Comment 38 XiongZhang 2014-10-31 08:02:28 UTC
If I add the i915.enable_rc6=0 boot option on Wilson Beach, the GPU hangs the first time I run gst-launch. Once the GPU has finished the reset that results from the hang, running gst-launch has no problem.
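
Before a run like this it can help to confirm that the boot option actually took effect. The sketch below reads the i915 module parameter from sysfs. It assumes the parameter is exported at /sys/module/i915/parameters/enable_rc6 on the kernel under test and that the caller has permission to read it (this may require root). The function name and the overridable directory argument are hypothetical.

```python
import os

def read_enable_rc6(params_dir="/sys/module/i915/parameters"):
    """Return the i915 enable_rc6 module parameter as an int, or None.

    None means the file is missing, unreadable, or not an integer.
    """
    path = os.path.join(params_dir, "enable_rc6")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        return None
```

A return value of 0 would match booting with i915.enable_rc6=0; None means the value could not be determined and should not be read as "disabled".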
Comment 39 Rodrigo Vivi 2014-10-31 19:47:21 UTC
Could you please try reverting this patch and see if you can still reproduce the issue:

git show 0d68b25e9ceb344fe2f93373b1c0311d33814265
commit 0d68b25e9ceb344fe2f93373b1c0311d33814265
Author: Tom O'Rourke <Tom.O'Rourke@intel.com>
Date:   Wed Apr 9 11:44:06 2014 -0700

    drm/i915/bdw: Use timeout mode for RC6 on bdw
Comment 40 XiongZhang 2014-11-03 02:11:37 UTC
(In reply to Rodrigo Vivi from comment #39)
> Could you please try reverting this patch and see if you can still reproduce
> the issue:
> 
> git show 0d68b25e9ceb344fe2f93373b1c0311d33814265
> commit 0d68b25e9ceb344fe2f93373b1c0311d33814265
> Author: Tom O'Rourke <Tom.O'Rourke@intel.com>
> Date:   Wed Apr 9 11:44:06 2014 -0700
> 
>     drm/i915/bdw: Use timeout mode for RC6 on BDW

After reverting this commit, the issue still exists.
Comment 41 Rodrigo Vivi 2014-11-05 23:49:40 UTC
Created attachment 108997 [details] [review]
Use Vmask for 3DSTATE_PS

Please confirm attached xf86-video-intel patch fixes the issue for you.
Comment 42 Gary Wang 2014-11-06 02:23:17 UTC
Created attachment 109004 [details]
"Use Vmask for 3DSTATE_PS" patch applied xserver-xorg-video deb

I verified the "Use Vmask for 3DSTATE_PS" patch on top of the following latest xf86-video-intel snapshot (without that patch, the test fails with a GPU hang):

commit ba408bf21c4b65f19c7b581e4c88c92805184334
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 4 13:39:52 2014 +0000

    sna: Correct units for videoRam

It appears to work well on WSB SDS now (rc6_enabled). Thanks Rodrigo!

Timo, can you help verify it on your customized system environment?
Comment 43 XiongZhang 2014-11-06 02:24:51 UTC
(In reply to Rodrigo Vivi from comment #41)
> Created attachment 108997 [details] [review]
> Use Vmask for 3DSTATE_PS
> 
> Please confirm attached xf86-video-intel patch fixes the issue for you.

This patch fixes the issue reproduced on Wilson Beach.

thanks
Comment 44 Timo Aaltonen 2014-11-06 08:08:13 UTC
the patch applied on 2.99.910 works fine, but on current master (the driver you provided) it causes corrupted video output on the window
Comment 45 Gary Wang 2014-11-06 09:29:54 UTC
Hi Timo, 
I only built it based on 2.99.216+ as an experiment on WSB SDS/Xubuntu 14.10 beta-1 (the original one is 2.99.214).
Comment 46 Chris Wilson 2014-11-06 10:44:44 UTC
commit 97fe3c1c860978c7a649cba93a55fa497010ccc1
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date:   Wed Nov 5 15:48:14 2014 -0800

    sna: Use VMask in 3DSTATE_PS
    
    Using dispatch mask cause hangs waiting PS Done on some cases like bug #83207,
    with larger screen or when scaling it.
    
    Also mesa uses VMask instead of Dmask for 3DSTATE_PS because in some cases
    they were getting incorrect derivatives for subspans.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=83207
    Cc: Timo Aaltonen <tjaalton@ubuntu.com>
    Cc: Gary Wang <gary.c.wang@intel.com>
    Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Tested-by: Timo Aaltonen <tjaalton@ubuntu.com>
