Bug 110394

Summary:

Skylake GPU HANG while gstreamer H264 vaapi encoding from MJPEG vaapi decode on drm-tip

Product:

DRI

Reporter:

Andy Nicholas <andy.nicholas>

Component:

DRM/Intel

Assignee:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Status:

RESOLVED MOVED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

critical

Priority:

medium

CC:

intel-gfx-bugs

Version:

DRI git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

Triaged, ReadyForDev

i915 platform:

SKL

i915 features:

GPU hang

Attachments:

Description	Flags
GPU crash info plus dmesg log	none
Crash #2 of the second test configuration. This one happened in 4 iterations on the same hardware.	none
3rd GPU hang on same hardware using same test loop. Within 2 iterations.	none
4th GPU hang on same hardware	none
5th GPU hang on same hardware, within 1 iteration	none
GPU hang #6 after replacing power supply and cable to same compute stick.	none
GPU hang #7 - on different compute stick running exact same test	none
7th GPU hang, this time with always disabling coarse-grain power	none
8th GPU hang, but this time without a GPU crashlog. Was using coarse power gating disabled.	none
GPU hang #9 after coarse power gating disabled, machine .36	none
MJPEG - hand-carry video camera outside my house	none
GPU hang #9	none
GPU Hang #10 after enabling coarse grain power gating	none
GPU hang #11 - after enabling coarse power gating	none
GPU hang 12	none
GPU hang 13 - after media encoder power gating ONLY enabled	none
mjpeg-outside-640x480.mkv.aa	none
mjpeg-outside-640x480.mkv.ab	none
mjpeg-outside-640x480.mkv.ac	none
mjpeg-outside-640x480.mkv.ad	none
mjpeg-outside-640x480.mkv.ae	none
mjpeg-outside-640x480.mkv.af	none
mjpeg-outside-640x480.mkv.ag	none
mjpeg-outside-640x480.mkv.ah	none
mjpeg-outside-640x480.mkv.aa	none
x86_64_defconfig	none
GPU crash #14 - with drm-tip from today	none
GPU crash #15 - with drm-tip from today	none
GPU hang #16 when "silly test" is enabled	none
attachment-18046-0.html	none

Description Andy Nicholas 2019-04-10 17:41:17 UTC

Created attachment 143919 [details]
GPU crash info plus dmesg log

Similar to Bug #110297 which I filed.

Skylake GPU hang when encoding video stream to H.264 using VAAPI. The stream is decoded from a VAAPI MJPEG stream from a file. We run test loops where we transcode this stream over and over, thousands of times. This GPU hang happened on iteration 1026.

Running on Intel Compute Stick STK2mV64CC. We have locked the minimum and maximum clock speeds of the GPU to 500 MHz to attempt to avoid... this issue.

We are running this test because one of our products needs to have this exact configuration: read an MJPEG stream from a V4L camera and transcode into H264. This configuration needs to be super stable. Crashing once in 1026 iterations is not considered "stable".

Using Ubuntu 18.04 plus DRM-TIP kernel from about 3 weeks ago which corresponds with 5.1-rc1.

Using GStreamer 1.14.1:

shield@tobeprovisioned1804:~$ gst-launch-1.0 --version
gst-launch-1.0 version 1.14.1
GStreamer 1.14.1
https://launchpad.net/distros/ubuntu/+source/gstreamer1.0

Full GPU hang log and dmesg enclosed. This is related to a similar bug which I previously filed.

Especially concerning is that the machine is usable (but the GPU seems dead) after this crash. We would like to figure out a way of determining that the GPU has died and to kernel panic so that we can, eventually, reboot. Modifying the kernel is A-OK to avoid this issue, so if Intel doesn't have a mechanism then I will try to add something myself.

Leaving the machine in this "half dead" state is bad. We can't use the gstreamer process termination as the "reboot the machine" trigger as we may have other, less severe, bugs where we simply want to restart the gstreamer process.

Comment 1 Chris Wilson 2019-04-10 17:47:56 UTC

(In reply to Andy Nicholas from comment #0) 
> Especially concerning is that the machine is usable (but the GPU seems dead)
> after this crash. We would like to figure out a way of determining that the
> GPU has died and to kernel panic so that we can, eventually, reboot.
> Modifying the kernel is A-OK to avoid this issue, so if Intel doesn't have a
> mechanism then I will try to add something myself.

Watch /proc/sys/kernel/tainted. Currently we set TAINT_WARN, but you can change that to TAINT_DIE if you fancy something less likely to be set by others.

diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 68875ba43b8d..11795677bf6e 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -1088,7 +1088,7 @@ void i915_reset(struct drm_i915_private *i915,
         * rather than continue on into oblivion. For everyone else,
         * the system should still plod along, but they have been warned!
         */
-       add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+       add_taint(TAINT_DIE, LOCKDEP_STILL_OK);
 error:
        __i915_gem_set_wedged(i915);
        goto finish;
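
For the user-space side of this suggestion, a minimal sketch of decoding the taint mask follows. It assumes the bit positions from the kernel's tainted-kernels documentation (bit 7 for TAINT_DIE, bit 9 for TAINT_WARN); the function names are hypothetical and not part of any existing tool. Note that TAINT_WARN can be set by any WARN in the kernel, so this only says that something warned, not that i915 did.

```python
from pathlib import Path

# Bit positions as listed in the kernel's tainted-kernels documentation.
TAINT_DIE_BIT = 7    # 'D': the kernel has died recently (OOPS or BUG)
TAINT_WARN_BIT = 9   # 'W': the kernel issued a warning


def taint_bits(mask):
    """Return the set of bit positions that are set in a taint mask."""
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


def taint_flag_set(mask, bit=TAINT_WARN_BIT):
    """True if the given taint bit is set in the mask."""
    return bit in taint_bits(mask)


def read_taint_mask(path="/proc/sys/kernel/tainted"):
    """Read the current taint mask as an integer."""
    return int(Path(path).read_text().strip())
```

A supervisor could poll `taint_flag_set(read_taint_mask())` between transcode iterations and trigger a reboot when it flips; with the patch above applied it would pass `TAINT_DIE_BIT` instead.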

Comment 2 Andy Nicholas 2019-04-10 18:05:13 UTC

Also --

No additional strenuous load is running on the compute stick, only this transcode. We have not modified the standard cstates and the freq scaling governor is set to "powersave".

shield@tobeprovisioned1804:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave
 
Please let me know if there is some different configuration which is less likely to exhibit this issue. We previously put stress-ng loads on the CPUs to better simulate what we typically would utilize, and that configuration seemed especially stable... but depending on periodic CPU activity to prevent GPU hangs is just playing Russian roulette with the GPU video encoder.

Comment 3 Andy Nicholas 2019-04-10 19:18:00 UTC

My DRM-TIP kernel is from:

commit 00cb3798a5d008c3f824fe7c89c663dba66155c3 (HEAD -> drm-tip, origin/drm-tip, origin/HEAD)
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date:   Fri Mar 22 12:52:43 2019 -0700


These config switches were ADDED to DRM-TIP so I could boot from eMMC and configure for lower kernel latency and see serial output when the GPU goes bonkers:

CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_FTDI_SIO=y
CONFIG_USB_PL2303=y
CONFIG_FRAME_POINTER=y
CONFIG_LATENCYTOP=y
CONFIG_MMC=y
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PCI=y
CONFIG_MMC_RICOH_MMC=y
CONFIG_MMC_SDHCI_ACPI=y
CONFIG_DEBUG_INFO=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
CONFIG_KEXEC_JUMP=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_DRM_I915_DEBUG=y
CONFIG_DRM_I915_DEBUG_RUNTIME_PM=y
CONFIG_USB_RTL8152=y
CONFIG_USB_NET_DRIVERS=y
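
When rebuilding a kernel for this test, it may help to confirm that the options above actually ended up in the built configuration. The sketch below parses `.config`-style text; the helper names are hypothetical, and where the config file lives (for example `/boot/config-<release>` on Ubuntu) is an assumption about the install.

```python
def parse_kconfig(text):
    """Map CONFIG_* names to values from .config-style text.

    Comment lines, including '# CONFIG_X is not set', are skipped.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def missing_options(config_text, required):
    """Return the required option names whose value differs or is absent."""
    have = parse_kconfig(config_text)
    return [name for name, value in required.items() if have.get(name) != value]
```

For example, `missing_options(text, {"CONFIG_DRM_I915_DEBUG": "y", "CONFIG_PREEMPT": "y"})` returns an empty list when both options are set.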

transcoding script using gstreamer is:


#!/usr/bin/env bash

set -ex

tcount=0
while true; do
	echo "Transcode: iteration $tcount" | tee tcount.txt
 
	# remove old output (the same path the filesink writes below)
	rm -f /tmp/gst-output.mp4

	time gst-launch-1.0 filesrc location=andy-movies/mjpeg-outside-640x480.mkv ! matroskademux ! vaapijpegdec ! vaapih264enc ! qtmux ! filesink location=/tmp/gst-output.mp4

	tcount=$((tcount+1))	
done

Comment 4 Andy Nicholas 2019-04-10 19:21:57 UTC

Created attachment 143920 [details]
Crash #2 of the second test configuration. This one happened in 4 iterations on the same hardware.

This is another crash on the same test configuration. I restarted my test case and moments later, the same machine had crashed again after 4 iterations.

dmesg and GPU crashlog enclosed.

Comment 5 Andy Nicholas 2019-04-10 19:48:24 UTC

Created attachment 143921 [details]
3rd GPU hang on same hardware using same test loop. Within 2 iterations.

This time my transcoder script seemed to have terminated... but actually didn't -- it was just asleep, waiting. The transcoder script RESUMED RUNNING after 14 minutes and completed the transcoding task. No further information was emitted into the dmesg log.

This is bad because our usage case can't tolerate a "hang" where the transcoder does not emit anything for 14 minutes... or even a few milliseconds.
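
One way to turn a silent stall like this into a detectable event is to run each transcode under a deadline. This is a sketch only: `run_with_deadline` is a hypothetical helper, the timeout value is up to the caller, and a process stuck in uninterruptible sleep inside the kernel may not die when killed, in which case this call would itself block.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed' or 'stalled'."""
    try:
        # subprocess.run kills the child and waits for it when the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"stalled after {timeout_s}s: {cmd}", file=sys.stderr)
        return "stalled"
    return "ok" if result.returncode == 0 else "failed"
```

With a pipeline that normally finishes in about a minute, a deadline of a few minutes would separate "gstreamer exited with an error" (restart the process) from "the pipeline stopped making progress" (suspect the GPU).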

Comment 6 Andy Nicholas 2019-04-10 19:52:52 UTC

Created attachment 143922 [details]
4th GPU hang on same hardware

This was a full GPU crash which happened a few minutes after the initial "hang" of 14 minutes. The GPU crashlog is enclosed.
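
Collecting the crashlog after each hang can be scripted. The sketch below assumes the i915 error state is exposed at `/sys/class/drm/card0/error` and that an empty state reads as "No error state collected"; both the card index and that marker text are assumptions to check on the target kernel, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed text returned by the error node when no hang has been recorded.
NO_ERROR_MARKER = "No error state collected"


def has_error_state(text):
    """True if the text looks like a captured GPU error state."""
    return bool(text.strip()) and not text.startswith(NO_ERROR_MARKER)


def save_error_state(src, dest):
    """Copy the error state to dest if one exists; return whether it was saved."""
    text = Path(src).read_text(errors="replace")
    if not has_error_state(text):
        return False
    Path(dest).write_text(text)
    return True
```

Calling `save_error_state("/sys/class/drm/card0/error", "/tmp/gpu-hang.txt")` after a failed iteration would keep the dump alongside the iteration count.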

Comment 7 Andy Nicholas 2019-04-10 19:54:52 UTC

(In reply to Andy Nicholas from comment #5)
> Created attachment 143921 [details]
> 3rd GPU hang on same hardware using same test loop. Within 2 iterations.
> 
> This time my transcoder script seemed to have terminated... but actually
> didn't -- it was just asleep, waiting. The transcoder script RESUMED RUNNING
> after 14 minutes and completed the transcoding task. No further information
> was emitted into the dmesg log.
> 
> This is bad because our usage case can't tolerate a "hang" where the
> transcoder does not emit anything for 14 minutes... or even a few
> milliseconds.

Btw, while the transcoder was "hung" I was able to log in over SSH and use the machine as normal, so most of the kernel subsystems seemed to be performing normally. The GPU was definitely not functioning properly.

Btw, this is using Ubuntu Server, so we do not have an X server running.

Comment 8 Andy Nicholas 2019-04-10 20:04:38 UTC

Created attachment 143923 [details]
5th GPU hang on same hardware, within 1 iteration

GPU crash within 1 iteration after rebooting into clean system.

I'm going to stop testing now. If you guys need this compute stick in order to reproduce the crash, we can FedEx you this system. It's a compute stick and a USB network dongle -- not big.

Comment 9 Andy Nicholas 2019-04-10 20:28:08 UTC

We have 5 compute sticks running this exact same test simultaneously. These compute sticks should have identical configurations running identical test cases, but none of the other compute sticks are crashing.

To try to narrow this down, we will see if we can create a CloneZilla image of the compute stick which crashes. This image will be used on other compute sticks to see if we are able to narrow down the GPU crashing problem.

Comment 10 Chris Wilson 2019-04-10 20:32:14 UTC

If you can capture an input stream that reproduces the hang, that would be fantastic. Even if it is just one mjpeg frame in a loop, that plus the reproduction command will be invaluable.

Comment 11 Andy Nicholas 2019-04-10 21:12:21 UTC

Created attachment 143924 [details]
GPU hang #6 after replacing power supply and cable to same compute stick.

GPU hang #6 after replacing the power supply and cable to the same compute stick. GPU hung within 1 iteration of this test case.

Comment 12 Andy Nicholas 2019-04-10 21:15:56 UTC

(In reply to Andy Nicholas from comment #11)
> Created attachment 143924 [details]
> GPU hang #6 after replacing power supply and cable to same compute stick.
> 
> GPU hang #6 after replacing the power supply and cable to the same compute
> stick. GPU hung within 1 iteration of this test case.

The replacement power supply was plugged into a different power strip from the original power supply feeding this compute stick, so we are reasonably sure that the power supply and power feeding systems are working properly. We have lots and lots of compute stick power supplies and feed cables, so this test was simple.

Comment 13 Andy Nicholas 2019-04-10 21:33:26 UTC

Created attachment 143925 [details]
GPU hang #7 - on different compute stick running exact same test

This is from a different compute stick running the exact same test with the exact same movie file. It dies around iteration #58. Some stacks were captured this time; not sure if the extra info is helpful.

Btw, our output directory is using /tmp which is using tmpfs, so writing to the local flash filesystem is not happening as part of this test.

Comment 14 Chris Wilson 2019-04-10 22:02:36 UTC

Try

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 1a6f36e08a60..d2e075e54e89 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -7202,7 +7202,7 @@ static void gen9_enable_rc6(struct drm_i915_private *dev_priv)
         * 3b: Enable Coarse Power Gating only when RC6 is enabled.
         * WaRsDisableCoarsePowerGating:skl,cnl - Render/Media PG need to be disabled with RC6.
         */
-       if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv))
+       if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv) || 1)
                I915_WRITE(GEN9_PG_ENABLE, 0);
        else
                I915_WRITE(GEN9_PG_ENABLE,

Comment 15 Andy Nicholas 2019-04-10 22:16:17 UTC

(In reply to Chris Wilson from comment #14)
> Try
> 
> diff --git a/drivers/gpu/drm/i915/intel_pm.c
> b/drivers/gpu/drm/i915/intel_pm.c
> index 1a6f36e08a60..d2e075e54e89 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -7202,7 +7202,7 @@ static void gen9_enable_rc6(struct drm_i915_private
> *dev_priv)
>          * 3b: Enable Coarse Power Gating only when RC6 is enabled.
>          * WaRsDisableCoarsePowerGating:skl,cnl - Render/Media PG need to be
> disabled with RC6.
>          */
> -       if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv))
> +       if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv) || 1)
>                 I915_WRITE(GEN9_PG_ENABLE, 0);
>         else
>                 I915_WRITE(GEN9_PG_ENABLE,

Sure, I can try this.

According to the comment, doesn't your change just force the same path of disabling coarse power gating? 

If so, is the power consumption now gigantic while encoding or decoding media?

Comment 16 Chris Wilson 2019-04-10 22:24:51 UTC

Busy power consumption will be unaffected, semi-active power consumption will be unaffected (that uses rc6 for short sleeps while active). Idle (GPU) power consumption will be affected (off the top of my head, it prevents saving the last 100mW). The rapl interface provides power consumption information (see https://gitlab.freedesktop.org/drm/igt-gpu-tools/blob/master/lib/igt_gpu_power.c for an example)

All 7 GPU error states indicate that it failed before reloading the same context after a short idling -- the IPEHR (last command parsed) is the last command in the retiring context. RING_TAIL (every time) is garbage -- whether that is just the forcewake failing... probably.
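
To put a number on the idle-power cost mentioned above, energy counters can be sampled twice and differenced. The linked igt code reads RAPL through perf events; the sketch below swaps in the powercap sysfs files (`energy_uj`, `max_energy_range_uj`) instead, and which powercap zone corresponds to the GPU is an assumption to verify from the zone's `name` file. Helper names are hypothetical.

```python
from pathlib import Path


def read_counter(zone_dir, name="energy_uj"):
    """Read an integer counter (microjoules) from a powercap zone directory."""
    return int((Path(zone_dir) / name).read_text())


def average_power_watts(e0_uj, e1_uj, dt_s, max_range_uj):
    """Average power between two energy samples taken dt_s seconds apart.

    A negative difference is treated as a single counter wraparound.
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:
        delta_uj += max_range_uj
    return delta_uj / 1e6 / dt_s
```

Sampling `read_counter(zone)` before and after an idle period, with and without the patch, would show whether the difference is in the range quoted above. Reading `energy_uj` may require root on some kernels.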

Comment 17 Andy Nicholas 2019-04-12 01:08:06 UTC

Forcing the coarse power stuff off did not work. Crashlog enclosed. Took about 95 iterations (95 minutes) to reproduce. 2 other compute sticks have shown the same issue with the previous kernel.

Comment 18 Andy Nicholas 2019-04-12 01:09:02 UTC

Created attachment 143944 [details]
7th GPU hang, this time with always disabling coarse-grain power

Comment 19 Andy Nicholas 2019-04-12 01:45:42 UTC

Created attachment 143945 [details]
8th GPU hang, but this time without a GPU crashlog. Was using coarse power gating disabled.

8th GPU hang, but without a GPU crashlog.

Comment 20 Chris Wilson 2019-04-12 06:09:40 UTC

So to test the earlier observation that it is rc6-related,

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 8e826a6ab62e..acce2574228b 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -8671,7 +8671,7 @@ static void intel_enable_rc6(struct drm_i915_private *dev_priv)
        else if (INTEL_GEN(dev_priv) >= 11)
                gen11_enable_rc6(dev_priv);
        else if (INTEL_GEN(dev_priv) >= 9)
-               gen9_enable_rc6(dev_priv);
+               gen9_disable_rc6(dev_priv);
        else if (IS_BROADWELL(dev_priv))
                gen8_enable_rc6(dev_priv);
        else if (INTEL_GEN(dev_priv) >= 6)

Comment 21 Andy Nicholas 2019-04-12 17:37:52 UTC

Created attachment 143954 [details]
GPU hang #9 after coarse power gating disabled, machine .36

Ok, I will try the RC6 disable suggestion shortly. It will take a few hours to run and cause trouble.

This is a hang from the 2nd compute stick which hung after 257 transcode iterations using the same source MJPEG movie.

We have 2 compute sticks which seem to reliably show these problems. One is in San Diego at my work. The other is here in San Jose, Bay Area with me. We made a CloneZilla image of the compute stick to see if it would hang and it did after about 650 iterations (e.g., 650 minutes).

I am told that CloneZilla can only really restore to a device of similar size, so if you had an m5 compute stick (STK2mv64cc) we could put the 5GB image on a server for you. There's nothing confidential from Shield. It's a stock 18.04 image with my home-made MJPEG movie. If necessary we can FedEx a compute stick which demonstrates the problem.

Comment 22 Chris Wilson 2019-04-12 18:32:55 UTC

(In reply to Andy Nicholas from comment #21)
> I am told that CloneZilla can only really restore to device of similar size,
> so if you had an m5 compute stick (STK2mv64cc) we could put the 5GB image on
> a server for you.

I've put one on order.

Even just a simple disk image should be sufficient for a chroot, but whatever works :)

Comment 23 Andy Nicholas 2019-04-15 02:44:59 UTC

Ok, I ran the kernel with the patch to disable RC6 on 8x compute sticks running the same kernel and same user-mode libraries. As per the previous tests, the minimum and maximum GPU frequencies were locked at 500 MHz.

We observed zero crashes or warnings emitted into the dmesg log (or anywhere else). None of the compute sticks locked-up or otherwise misbehaved.

This was our expectation, and after running the transcode loop more than 28,000 times (3500x8) this continues to be the case.
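
A quick cross-check that the patched kernel really keeps the GPU out of RC6 is to watch the residency counter. This sketch assumes the counter is exposed at `/sys/class/drm/card0/power/rc6_residency_ms` (card index and availability on a patched kernel are assumptions); the helper names are hypothetical.

```python
from pathlib import Path

RC6_RESIDENCY = "/sys/class/drm/card0/power/rc6_residency_ms"  # assumed card index


def read_residency_ms(path=RC6_RESIDENCY):
    """Read the cumulative RC6 residency in milliseconds."""
    return int(Path(path).read_text())


def rc6_fraction(before_ms, after_ms, interval_s):
    """Fraction of the sampling interval that the GPU spent in RC6."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (after_ms - before_ms) / (interval_s * 1000.0)
```

On an idle machine with RC6 disabled, two samples taken a few seconds apart should give a fraction of zero; on the stock kernel the fraction should be well above zero.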

Comment 24 Andy Nicholas 2019-04-15 02:54:26 UTC

(In reply to Andy Nicholas from comment #23)
> Ok, I ran the kernel with the patch to disable RC6 on 8x compute sticks
> running the same kernel and same user-mode libraries. As per the previous
> tests, the minimum and maximum frequency was locked at 500 MHz.
> 
> We observed zero crashes or warnings emitted into the dmesg log (or anywhere
> else). None of the compute sticks locked-up or otherwise misbehaved.
> 
> This was our expectation, and after running the transcode loop more than
> 28,000 times (3500x8) this continues to be the case.

Every previous test hung within about 1050 iterations at most, and often much sooner, so I believe having all the compute sticks survive to 3500 iterations is statistically significant.
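
The significance claim can be made concrete with a rough model. Assuming each iteration hangs independently with a fixed probability p, the chance of N clean iterations is (1 - p)^N. Taking p = 1/1026 from the longest run before a hang (the most favorable rate seen, since most runs hung sooner) is an assumption, as is treating all sticks as equally prone to hang; comment 9 notes that some sticks were not hanging at the time.

```python
import math


def prob_no_hang(p_hang, iterations):
    """Probability of zero hangs in `iterations` independent trials."""
    if not 0.0 <= p_hang < 1.0:
        raise ValueError("p_hang must be in [0, 1)")
    # log1p keeps precision when p_hang is small.
    return math.exp(iterations * math.log1p(-p_hang))
```

Under these assumptions a single stick surviving 3500 iterations by chance comes out at roughly 3%, and all 28,000 iterations passing by chance is on the order of 1e-12, which supports reading the clean run as a real effect of the patch rather than luck.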

Comment 25 Chris Wilson 2019-04-15 08:20:35 UTC

You can try something silly like

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 4e0a351bfbca..e5feb0f5a5fe 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2115,10 +2115,16 @@ static int gen9_emit_bb_start(struct i915_request *rq,
 {
        u32 *cs;
 
-       cs = intel_ring_begin(rq, 6);
+       cs = intel_ring_begin(rq, 12);
        if (IS_ERR(cs))
                return PTR_ERR(cs);
 
+       *cs++ = MI_LOAD_REGISTER_IMM(1);
+       *cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL);
+       *cs++ = GEN6_RC_CTL_HW_ENABLE |
+               GEN6_RC_CTL_RC6_ENABLE |
+               GEN6_RC_CTL_EI_MODE(1);
+
        *cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
 
        *cs++ = MI_BATCH_BUFFER_START_GEN8 |
@@ -2129,6 +2135,10 @@ static int gen9_emit_bb_start(struct i915_request *rq,
        *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
        *cs++ = MI_NOOP;
 
+       *cs++ = MI_LOAD_REGISTER_IMM(1);
+       *cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL);
+       *cs++ = 0;
+
        intel_ring_advance(rq, cs);
 
        return 0;

to see if it is just an rc6 event on idling that is the culprit, while keeping rc6 active for the encoder.

Comment 26 Chris Wilson 2019-04-15 08:21:54 UTC

The other step would be to disable rc6 only and keep the powergate in gen9_enable_rc6() -- to confirm which of the two is the risk.

Comment 27 Chris Wilson 2019-04-15 08:23:30 UTC

I do now have a STK2mv64cc and it survived a w/e with non-media workloads. If you have a recipe for me to run mjpeg through it, that would be useful.

Comment 28 Andy Nicholas 2019-04-15 19:53:37 UTC

(In reply to Chris Wilson from comment #26)
> The other step would be to disable rc6 only and keep the powergate in
> gen9_enable_rc6() -- to confirm which of the two is the risk.

To be clear - I believe this is the state of Shield's testing, all of which use the same drm-tip kernel from 3 weeks ago, all of which are transcoding from MJPEG --> H264:

(1) Only disable RC6, no other changes == no GPU issues

(2) Only disable coarse power-gating [in gen9_enable_rc6()], no other changes == GPU hangs.

(3) No changes to drm-tip kernel == GPU hangs.


I'm guessing that there's something that RC6 allows which can cause the transcoder GPU hangs to occur, but whatever that is, it's not related to coarse power gating.

I will proceed with placing both changes (RC6 disable + Coarse power gating disable) into a kernel. I'm expecting the GPU not to hang/crash.

Comment 29 Chris Wilson 2019-04-15 19:57:52 UTC

The combination that I don't think has been tested is with

GEN6_RC_CONTROL == 0
GEN9_PG_ENABLE == GEN9_RENDER_PG_ENABLE | GEN9_MEDIA_PG_ENABLE

as so far both have been disabled when disabling rc6.

Comment 30 Andy Nicholas 2019-04-15 20:06:57 UTC

Created attachment 143979 [details]
MJPEG - hand-carry video camera outside my house

Enclosed is 1 file for the recipe. This file needs to be transcoded from its format of MJPEG into H.264 on the compute stick using vaapi using gstreamer on Ubuntu 18.04 Server.

Complete recipe coming after this large upload.

Comment 31 Andy Nicholas 2019-04-15 20:48:57 UTC

(In reply to Andy Nicholas from comment #30)
> Created attachment 143979 [details]
> MJPEG - hand-carry video camera outside my house
> 
> Enclosed is 1 file for the recipe. This file needs to be transcoded from its
> format of MJPEG into H.264 on the compute stick using vaapi using gstreamer
> on Ubuntu 18.04 Server.
> 
> Complete recipe coming after this large upload.


Complete recipe from "No OS":


(1) Install Ubuntu 18.04.2 Server onto compute stick. I used a USB stick to do this. Select volume manager in settings and SSH enable. Choose to make 52GB of the compute stick into the large boot partition.


(2) Install these to get current gstreamer and ffmpeg test facilities:

    sudo apt install -y libgstreamer1.0-0 gstreamer1.0-plugins-base gstreamer1.0-plugins-good
    sudo apt install -y gstreamer1.0-plugins-bad gstreamer1.0-plugins-ugly gstreamer1.0-libav
    sudo apt install -y gstreamer1.0-doc gstreamer1.0-tools gstreamer1.0-x gstreamer1.0-alsa
    sudo apt install -y gstreamer1.0-gl gstreamer1.0-gtk3 gstreamer1.0-qt5 gstreamer1.0-pulseaudio

    sudo apt install -y ffmpeg
    sudo apt install -y gstreamer1.0-vaapi
    sudo apt install -y vainfo
    sudo apt install -y xserver-xorg-hwe-18.04
    sudo apt-get update
    sudo apt-get upgrade



(3) Add the line below to /etc/fstab to point /tmp to tmpfs so we are routinely saving to ram instead of flash:

tmpfs /tmp tmpfs defaults,noatime,nodiratime 0 0



(4) Save the following script to ~/setup-gpu.sh


#!/usr/bin/env bash

echo "This script must be run from sudo su prompt #"

set -ex

echo 500 | tee /sys/class/drm/card0/gt_min_freq_mhz
echo 500 | tee /sys/class/drm/card0/gt_max_freq_mhz
echo 500 | tee /sys/class/drm/card0/gt_boost_freq_mhz

cat /sys/class/drm/card0/gt_min_freq_mhz
cat /sys/class/drm/card0/gt_max_freq_mhz
cat /sys/class/drm/card0/gt_boost_freq_mhz

echo "Intel GPU Frequencies Locked"

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

echo "Intel CPUs set to performance governor"



(5) Save the following script to ~/mjpeg-to-h264.sh:

#!/usr/bin/env bash

set -ex

tcount=0
while true; do
	echo "Transcode: iteration $tcount" | tee tcount.txt
 
	# remove old output (the same path the filesink writes below)
	rm -f /tmp/gst-output.mp4

	# transcode the MJPEG sample to H.264 using gstreamer
	time gst-launch-1.0 filesrc location=andy-movies/mjpeg-outside-640x480.mkv ! matroskademux ! vaapijpegdec ! vaapih264enc ! qtmux ! filesink location=/tmp/gst-output.mp4

	tcount=$((tcount+1))	
done



(6) Save the MJPEG movie attached in comment #30 of this bug to:

    ~/andy-movies/mjpeg-outside-640x480.mkv



(7) Switch to root: 

    $ sudo su
    # ./setup-gpu.sh
    <ctrl-D> to exit from root
    $ ./mjpeg-to-h264.sh


Once the compute stick is configured, the last Step #7 is what I do over and over after rebooting.

NOTE: It's highly possible that this configuration will not reproduce the issue, even when updated with the drm-tip kernel. We have only been successful when using the image which was running on compute stick .38 and .36.
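
Before starting the loop in step (7), the state set up in steps (3) and (4) can be checked programmatically. The sketch below is a hypothetical preflight helper: it parses `/proc/mounts`-style text to confirm that /tmp is tmpfs and compares the three `gt_*_freq_mhz` values against the intended lock; the card path is the one used in the script above.

```python
from pathlib import Path

GT_FREQ_FILES = ("gt_min_freq_mhz", "gt_max_freq_mhz", "gt_boost_freq_mhz")


def tmp_is_tmpfs(mounts_text, mount_point="/tmp"):
    """True if the last mount listed for mount_point is a tmpfs."""
    fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fstype = fields[2]  # later mounts shadow earlier ones
    return fstype == "tmpfs"


def read_gt_freqs(card_dir="/sys/class/drm/card0"):
    """Read the GPU frequency limits as a dict of file name to raw text."""
    return {name: (Path(card_dir) / name).read_text() for name in GT_FREQ_FILES}


def gpu_freq_locked(values, mhz=500):
    """True if min, max and boost frequency all equal mhz."""
    return all(int(values[name]) == mhz for name in GT_FREQ_FILES)
```

A run would then be started only if `tmp_is_tmpfs(Path("/proc/mounts").read_text())` and `gpu_freq_locked(read_gt_freqs())` both hold, which rules out a forgotten setup step as a difference between sticks.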

andy

Comment 32 Andy Nicholas 2019-04-16 04:05:19 UTC

This is the diff for the powergating test I'm running now. No other changes are made to the kernel, just this one patch:

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index eaf0793ebf60..29260ba32529 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -8609,8 +8609,12 @@ static void intel_enable_rc6(struct drm_i915_private *dev_priv)
                cherryview_enable_rc6(dev_priv);
        else if (IS_VALLEYVIEW(dev_priv))
                valleyview_enable_rc6(dev_priv);
-       else if (INTEL_GEN(dev_priv) >= 9)
-               gen9_enable_rc6(dev_priv);
+       else if (INTEL_GEN(dev_priv) >= 9) {
+               gen9_disable_rc6(dev_priv);
+
+               I915_WRITE(GEN9_PG_ENABLE,
+                           GEN9_RENDER_PG_ENABLE | GEN9_MEDIA_PG_ENABLE);
+       }
        else if (IS_BROADWELL(dev_priv))
                gen8_enable_rc6(dev_priv);
        else if (INTEL_GEN(dev_priv) >= 6)


Btw -

I did read through the gen9_enable_rc6() code and the "hysteresis" stuff jumped out at me. Does anything bad happen if those calculations are not accurate?

Comment 33 Andy Nicholas 2019-04-16 06:44:48 UTC

Created attachment 143984 [details]
GPU hang #9

Enclosed is a GPU hang after running with power gating enabled, but with the gen9_disable_rc6() call preceding that. I was not really expecting that, but does it narrow down the issue?

Comment 34 Andy Nicholas 2019-04-16 06:47:36 UTC

Created attachment 143985 [details]
GPU Hang #10 after enabling coarse grain power gating

Crash on 2nd compute stick after enabling coarse grain power gating. No GPU crashlog recovered.

Comment 35 Andy Nicholas 2019-04-16 06:48:38 UTC

(In reply to Andy Nicholas from comment #33)
> Created attachment 143984 [details]
> GPU hang #9
> 
> Enclosed is a GPU hang after running with power gating enabled, but the
> gen9_disable_rc6() code preceding that. Was not really expecting that, but
> does it narrow down the issue?

This was after around 150 iterations, so this did not take long to cause trouble. The other crash happened on a compute stick which, as far as I know, had never crashed previously.

Comment 36 Andy Nicholas 2019-04-16 06:59:09 UTC

Created attachment 143986 [details]
GPU hang #11 - after enabling coarse power gating

One more hang after 67 iterations after enabling coarse power gating, but disabling RC6 using gen9_disable_rc6().

Comment 37 Andy Nicholas 2019-04-16 07:02:05 UTC

I did not try the "silly" suggestion above, but I can do that in the morning. I'm in San Jose CA, Bay Area.

Comment 38 Chris Wilson 2019-04-16 07:46:37 UTC

(In reply to Andy Nicholas from comment #32)
> This is the diff for the powergating test I'm running now. No other changes
> are made to the kernel, just this one patch:
> 
> diff --git a/drivers/gpu/drm/i915/intel_pm.c
> b/drivers/gpu/drm/i915/intel_pm.c
> index eaf0793ebf60..29260ba32529 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -8609,8 +8609,12 @@ static void intel_enable_rc6(struct drm_i915_private
> *dev_priv)
>                 cherryview_enable_rc6(dev_priv);
>         else if (IS_VALLEYVIEW(dev_priv))
>                 valleyview_enable_rc6(dev_priv);
> -       else if (INTEL_GEN(dev_priv) >= 9)
> -               gen9_enable_rc6(dev_priv);
> +       else if (INTEL_GEN(dev_priv) >= 9) {
> +               gen9_disable_rc6(dev_priv);
> +
> +               I915_WRITE(GEN9_PG_ENABLE,
> +                           GEN9_RENDER_PG_ENABLE | GEN9_MEDIA_PG_ENABLE);
> +       }
>         else if (IS_BROADWELL(dev_priv))
>                 gen8_enable_rc6(dev_priv);
>         else if (INTEL_GEN(dev_priv) >= 6)
> 
> 
> Btw -
> 
> I did read through the gen9_enable_rc6() code and the "hysteresis" stuff
> jumped out at me. does anything bad happen if those calculations are not
> accurate?

It will just use whatever the default values were. The important point here is that with RC6_CONTROL=0 and PG_ENABLE!=0, we see exactly the same hang as before. That is at least a nudge back towards the powergating as being the culprit.

Comment 39 Andy Nicholas 2019-04-16 16:32:34 UTC

(In reply to Chris Wilson from comment #25)
> You can try something silly like
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> b/drivers/gpu/drm/i915/intel_lrc.c
> index 4e0a351bfbca..e5feb0f5a5fe 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2115,10 +2115,16 @@ static int gen9_emit_bb_start(struct i915_request
> *rq,
>  {
>         u32 *cs;
>  
> -       cs = intel_ring_begin(rq, 6);
> +       cs = intel_ring_begin(rq, 12);
>         if (IS_ERR(cs))
>                 return PTR_ERR(cs);
>  
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL);
> +       *cs++ = GEN6_RC_CTL_HW_ENABLE |
> +               GEN6_RC_CTL_RC6_ENABLE |
> +               GEN6_RC_CTL_EI_MODE(1);
> +
>         *cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
>  
>         *cs++ = MI_BATCH_BUFFER_START_GEN8 |
> @@ -2129,6 +2135,10 @@ static int gen9_emit_bb_start(struct i915_request *rq,
>         *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
>         *cs++ = MI_NOOP;
>  
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL);
> +       *cs++ = 0;
> +
>         intel_ring_advance(rq, cs);
>  
>         return 0;
> 
> to see if it is just an rc6 event on idling that is the culprit, while
> keeping rc6 active for the encoder.

gen9_emit_bb_start() does not exist in my drm-tip sources from 4 weeks ago (March 22). There is a gen8_emit_bb_start(), but no "gen9" or anything which seems similar.

andy@kbuild:/build/kernel/drm-tip/drm-tip/drivers/gpu/drm/i915$ git log
commit 00cb3798a5d008c3f824fe7c89c663dba66155c3 (HEAD -> drm-tip, origin/drm-tip, origin/HEAD)
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date:   Fri Mar 22 12:52:43 2019 -0700

    drm-tip: 2019y-03m-22d-19h-51m-23s UTC integration manifest

Comment 40 Andy Nicholas 2019-04-16 23:28:33 UTC

Created attachment 144003 [details]
GPU hang 12

Crash while checking if the media power gating enable, alone, causes trouble. Answer: yes, but I should run the other configuration also.

index eaf0793ebf60..800ed263c626 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -8609,8 +8609,12 @@ static void intel_enable_rc6(struct drm_i915_private *dev_priv)
                cherryview_enable_rc6(dev_priv);
        else if (IS_VALLEYVIEW(dev_priv))
                valleyview_enable_rc6(dev_priv);
-       else if (INTEL_GEN(dev_priv) >= 9)
-               gen9_enable_rc6(dev_priv);
+       else if (INTEL_GEN(dev_priv) >= 9) {
+               gen9_disable_rc6(dev_priv);
+
+               I915_WRITE(GEN9_PG_ENABLE,
+                          GEN9_MEDIA_PG_ENABLE);
+       }
        else if (IS_BROADWELL(dev_priv))
                gen8_enable_rc6(dev_priv);
        else if (INTEL_GEN(dev_priv) >= 6)

Comment 41 Andy Nicholas 2019-04-17 05:51:18 UTC

Created attachment 144005 [details]
GPU hang 13 - after media encoder power gating ONLY enabled

At least 3 compute sticks died when run in this configuration; I uploaded #12 previously.

+               gen9_disable_rc6(dev_priv);
+
+               I915_WRITE(GEN9_PG_ENABLE,
+                          GEN9_MEDIA_PG_ENABLE);

This particular crash had a huge amount of information dumped, so I figured I should upload it also.

Interestingly, the compute stick which usually has its GPU hang for all these tests is still running strong after 839 iterations.

I am about to test the opposite configuration to see whether having the render power gate ON while the media power gate is OFF has any effect.

Comment 42 Andy Nicholas 2019-04-17 07:08:25 UTC

I'm going to keep my old kernel around and create a newer kernel based on existing drm-tip from this morning. You seem to be actively pulling in merges from upstream sources and hopefully everything will be OK.

I will probably start a test of this kernel later in the morning (my time) to see whether the problem still occurs, then I can try the "silly test".

Comment 43 Chris Wilson 2019-04-17 10:21:03 UTC

After some cursing at ubuntu installers and gstreamer-vaapi packaging, it looks like I have the pipeline working^Wsetup. Do you mind trimming your mjpeg sample as bugs.fd.o doesn't like the current attachment -- and I want to try and stick to your sample to keep differences to a minimum.

Comment 44 Andy Nicholas 2019-04-17 16:15:48 UTC

(In reply to Andy Nicholas from comment #41)
> Created attachment 144005 [details]
> GPU hang 13 - after media encoder power gating ONLY enabled
> 
> At least 3 compute sticks died when run in this configuration and I uploaded
> #12 previously. 
> 
> +               gen9_disable_rc6(dev_priv);
> +
> +               I915_WRITE(GEN9_PG_ENABLE,
> +                          GEN9_MEDIA_PG_ENABLE);
> 
> This particular crash had a huge amount of information dumped, so I figured
> I should upload it also.
> 
> Interestingly, the compute stick which usually has its GPU hang for all
> these tests is still running strong after 839 iterations.
> 
> I am about to test the opposite power gate and see if the render power gate
> ON while media power gate is OFF has any effect.

Yes, 4 of 8 compute sticks crashed by this morning, so it appears that having either media or render power gating enabled ends up with the GPU hanging.

Comment 45 Andy Nicholas 2019-04-17 17:04:22 UTC

Created attachment 144013 [details]
mjpeg-outside-640x480.mkv.aa

Using split tool, part 1 of 8 (aa of ah)

Comment 46 Andy Nicholas 2019-04-17 17:07:25 UTC

Created attachment 144014 [details]
mjpeg-outside-640x480.mkv.ab

Part 2 of 8

Comment 47 Andy Nicholas 2019-04-17 17:09:46 UTC

Created attachment 144015 [details]
mjpeg-outside-640x480.mkv.ac

Part 3 of 8

Comment 48 Andy Nicholas 2019-04-17 17:11:54 UTC

Created attachment 144016 [details]
mjpeg-outside-640x480.mkv.ad

Part 4 of 8

Comment 49 Andy Nicholas 2019-04-17 17:14:02 UTC

Created attachment 144017 [details]
mjpeg-outside-640x480.mkv.ae

Part 5 of 8

Comment 50 Andy Nicholas 2019-04-17 17:16:06 UTC

Created attachment 144018 [details]
mjpeg-outside-640x480.mkv.af

Part 6 of 8

Comment 51 Andy Nicholas 2019-04-17 17:18:23 UTC

Created attachment 144019 [details]
mjpeg-outside-640x480.mkv.ag

Part 7 of 8

Comment 52 Andy Nicholas 2019-04-17 17:19:26 UTC

Created attachment 144020 [details]
mjpeg-outside-640x480.mkv.ah

Part 8 of 8

Comment 53 Andy Nicholas 2019-04-17 17:49:40 UTC

Created attachment 144021 [details]
mjpeg-outside-640x480.mkv.aa

Part 1 of 8 (properly identified as a binary file)
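For anyone downloading the eight parts: assuming they were produced by split with its default two-letter suffixes (aa through ah, as the attachment names suggest), something like the following should join them back into the original file. The function name is made up for illustration, and no checksum was posted, so the result cannot be verified against the original.

```shell
#!/usr/bin/env bash

# Hypothetical helper: join split(1) parts <base>.aa, <base>.ab, ... into <base>.
# Assumes default split suffixes, so the sorted glob order is the part order.
reassemble_parts() {
    local base=$1
    local parts=("$base".[a-z][a-z])
    # With no match the glob stays literal, so check that the first entry exists.
    [ -e "${parts[0]}" ] || { echo "no parts found for $base" >&2; return 1; }
    cat "${parts[@]}" > "$base"
}

# Usage:
#   reassemble_parts mjpeg-outside-640x480.mkv
```

Use the re-uploaded copy of part 1 (attachment 144021) rather than the first one, since that is the one flagged as binary.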

Comment 54 Andy Nicholas 2019-04-17 18:41:56 UTC

Created attachment 144025 [details]
x86_64_defconfig

Building with the latest drm-tip made no difference. If anything, it may have made the problem show up faster: 2 of 8 compute sticks crashed within 5 minutes of starting my test(!).

My x86_64_defconfig file (for arch/x86/configs/x86_64_defconfig) is enclosed, since it causes the kernel to be built with PREEMPT, which might matter.

My drm-tip log starts with this:

commit 739f9bd5ae97972c9eebf3fe3574a59f286719ff (HEAD -> drm-tip, origin/drm-tip, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 17 07:26:57 2019 +0100

    drm-tip: 2019y-04m-17d-06h-25m-43s UTC integration manifest

Comment 55 Andy Nicholas 2019-04-18 01:06:40 UTC

Created attachment 144027 [details]
GPU crash #14 - with drm-tip from today

Enclosed. This crashlog seems to have more detailed information than previous ones. Maybe it's because I updated my kernel to be current.

Comment 56 Andy Nicholas 2019-04-18 01:11:54 UTC

Created attachment 144028 [details]
GPU crash #15 - with drm-tip from today

Lots of interesting data in this one. Crashed very soon after I started the transcoding test.

Comment 57 Andy Nicholas 2019-04-18 04:45:18 UTC

Created attachment 144029 [details]
GPU hang #16 when "silly test" is enabled

1 of 2 GPU crashes (out of 8 running compute sticks) that happened with the "silly test" code from above included in a build of drm-tip from yesterday.

Comment 58 Andy Nicholas 2019-04-18 05:44:13 UTC

Is the issue related to one of the GPU blocks just not waking up "soon enough" after being powered-down? Or not waking up ever?

Any suggestions for good tests to run to help narrow down the problem?

Comment 59 Andy Nicholas 2019-04-19 05:06:43 UTC

Ideally, I would be able to set up a video pipeline so that the JPEG frames could be decoded using VAAPI into main memory, modified, then encoded using VAAPI to H264. That is what the previous gstreamer script accomplishes.

At this point I am looking for some kind of fallback position where I could take the JPEG frames and invoke the GPU less to reduce the chance of crashing. So I'm willing to accept the performance cost of having the CPU decode the JPEG frames, but the GPU would still need to encode the H264 stream.

I did this with the script below and, unfortunately, the crashes still occur. Software JPEG decode is being paired with hardware H264 encode. I've been running for less than 1 hour and 2 compute sticks have already crashed.

Not sure how I'm going to proceed because it looks like the only way to work around this issue is to disable RC6. Maybe I can try to reduce the size of the movie and see if a smaller sample can still reproduce the problem.


#!/usr/bin/env bash

set -ex

tcount=0
while true; do
	echo "Transcode: iteration $tcount" | tee tcount.txt
 
	# remove old output (same path the filesink below writes to)
	rm -f /tmp/gst-output.mp4

	# transcode using gstreamer
	time gst-launch-1.0 filesrc location=mjpeg-outside-640x480.mkv ! \
		matroskademux ! \
		queue ! \
		jpegdec ! \
		videoconvert ! \
		videoscale ! \
		"video/x-raw,width=640,height=480,framerate=(fraction)30/1" ! \
		queue ! \
		vaapih264enc ! \
		mp4mux ! \
		filesink location=/tmp/gst-output.mp4

	tcount=$((tcount+1))	
done
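
To stop the loop at the first hang and keep the evidence, a check along these lines could run after each iteration. This is only a sketch: both function names are made up, and it assumes the kernel log contains the string "GPU HANG" after a hang and that the error state is readable at /sys/class/drm/card0/error (the card index may differ).

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the saved kernel log contains an i915 hang report.
log_has_gpu_hang() {
    grep -q 'GPU HANG' "$1"
}

# Hypothetical helper: if the log shows a hang, copy the error state to a file.
# Arguments: kernel log file, error-state node, output file.
save_gpu_error_state() {
    local log=$1 error_node=$2 out=$3
    log_has_gpu_hang "$log" || return 1
    cat "$error_node" > "$out"
}

# Usage inside the loop (assumed paths; reading dmesg may need root):
#   dmesg > /tmp/dmesg.txt
#   if save_gpu_error_state /tmp/dmesg.txt /sys/class/drm/card0/error "crash-$tcount.txt"; then
#       break
#   fi
```

Because the loop runs under set -e, the call is wrapped in an if so that the "no hang" return value does not abort the script.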

Comment 60 Chris Wilson 2019-05-28 08:20:26 UTC

My current run is up to 12 days (25428 iterations) without incident on my STK2mV64CC. That has been the longest I've managed to leave it alone over the last few weeks and not run other tests! I think this compute stick isn't showing the symptoms. However, given the hypothesis that this is from poking at the HW at the same time as it is powergating (or transitioning into a deep rc6), it should have been only a matter of time for it to strike, given the same hw and sw. Of course, I'd like to assume the reason I haven't seen it is because it is unreproducible on drm-tip!
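
One way to check whether a given stick is actually entering rc6 during the run is to sample the residency counter before and after a fixed interval. A rough sketch, assuming the counter is exposed in milliseconds at /sys/class/drm/card0/power/rc6_residency_ms (the card index may differ); rc6_percent is a made-up helper name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: percentage of an interval spent in rc6, computed from
# two residency samples (in ms) and the interval length (in ms).
# Uses integer arithmetic, so the result is truncated.
rc6_percent() {
    local before=$1 after=$2 interval_ms=$3
    echo $(( (after - before) * 100 / interval_ms ))
}

# Usage (assumed sysfs path):
#   node=/sys/class/drm/card0/power/rc6_residency_ms
#   a=$(cat "$node"); sleep 10; b=$(cat "$node")
#   rc6_percent "$a" "$b" 10000
```

A value near zero while the transcode loop is running would suggest the GPU never idles long enough to enter rc6 on that machine, which would be one difference worth comparing between sticks.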

Comment 61 Andy Nicholas 2019-05-28 16:05:52 UTC

Created attachment 144364 [details]
attachment-18046-0.html

Thanks for looking at this. Which version of drm-tip are you using?

Comment 62 Francesco Balestrieri 2019-06-03 18:14:48 UTC

*** Bug 110297 has been marked as a duplicate of this bug. ***

Comment 63 Lakshmi 2019-06-17 09:43:54 UTC

(In reply to Andy Nicholas from comment #61)
> Created attachment 144364 [details]
> attachment-18046-0.html
> 
> Thanks for looking at this. Which version of drm-tip are you using?

You can always verify with the latest drm-tip (https://cgit.freedesktop.org/drm-tip); if the issue still occurs, please report back.

Comment 64 Chris Wilson 2019-07-14 14:08:18 UTC

So I was playing around with disabling GEN9_PG_ENABLE while an engine was active, and in doing so reproduced exactly the same symptoms with forcewake dying, and with different patterns killing mmio entirely.

(Reminder for my future self.)

Comment 65 Martin Peres 2019-11-29 19:03:33 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/264.
