Created attachment 143919 [details]
GPU crash info plus dmesg log

Similar to Bug #110297, which I filed.

Skylake GPU hang when encoding a video stream to H.264 using VAAPI. The source is an MJPEG stream read from a file and decoded with VAAPI. We run test loops where we transcode this stream over and over, thousands of times. This GPU hang happened on iteration 1026.

Running on Intel Compute Stick STK2mV64CC. We have locked the minimum and maximum clock speeds of the GPU to 500 MHz in an attempt to avoid this issue.

We are running this test because one of our products needs to have this exact configuration: read an MJPEG stream from a V4L camera and transcode it into H.264. This configuration needs to be super stable. Crashing once in 1026 iterations is not considered "stable".

Using Ubuntu 18.04 plus a DRM-TIP kernel from about 3 weeks ago, which corresponds to 5.1-rc1.

Using GStreamer 1.14.1:

shield@tobeprovisioned1804:~$ gst-launch-1.0 --version
gst-launch-1.0 version 1.14.1
GStreamer 1.14.1
https://launchpad.net/distros/ubuntu/+source/gstreamer1.0

Full GPU hang log and dmesg enclosed. This is related to a similar bug which I previously filed.

Especially concerning is that the machine is usable (but the GPU seems dead) after this crash. We would like to figure out a way of determining that the GPU has died and to kernel panic so that we can, eventually, reboot. Modifying the kernel is A-OK to avoid this issue, so if Intel doesn't have a mechanism then I will try to add something myself.

Leaving the machine in this "half dead" state is bad. We can't use the gstreamer process termination as the "reboot the machine" trigger, as we may have other, less severe, bugs where we simply want to restart the gstreamer process.
(In reply to Andy Nicholas from comment #0)
> Especially concerning is that the machine is usable (but the GPU seems dead)
> after this crash. We would like to figure out a way of determining that the
> GPU has died and to kernel panic so that we can, eventually, reboot.
> Modifying the kernel is A-OK to avoid this issue, so if Intel doesn't have a
> mechanism then I will try to add something myself.

Watch /proc/sys/kernel/tainted. Currently we set TAINT_WARN, but you can change that to TAINT_DIE if you fancy something less likely to be set by others.

diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 68875ba43b8d..11795677bf6e 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -1088,7 +1088,7 @@ void i915_reset(struct drm_i915_private *i915,
 	 * rather than continue on into oblivion. For everyone else,
 	 * the system should still plod along, but they have been warned!
 	 */
-	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	add_taint(TAINT_DIE, LOCKDEP_STILL_OK);
 error:
 	__i915_gem_set_wedged(i915);
 	goto finish;
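For the userspace side of that suggestion, here is a minimal sketch of a watchdog that polls the taint mask. The bit numbers are the kernel's documented taint flags (7 for TAINT_DIE, 9 for TAINT_WARN); the polling interval, the `--watch` switch and the choice to force a crash through sysrq are assumptions for illustration, not anything i915 provides. Rebooting after the forced crash additionally relies on the `kernel.panic` sysctl (or kdump) being configured.

```shell
#!/usr/bin/env bash
# Sketch of a userspace watchdog keyed on the kernel taint mask.
# Bit numbers follow the kernel's taint flags: 7 = TAINT_DIE, 9 = TAINT_WARN.
TAINT_FILE=${TAINT_FILE:-/proc/sys/kernel/tainted}
TAINT_DIE_BIT=7
TAINT_WARN_BIT=9

# taint_bit_set MASK BIT -> exit status 0 if BIT is set in MASK
taint_bit_set() {
    local mask=$1 bit=$2
    (( (mask >> bit) & 1 ))
}

# watch_taint BIT [INTERVAL_SECONDS]: poll until BIT appears, then crash the
# machine via sysrq so that panic/kdump/reboot handling takes over.
# (Hypothetical policy; adjust to whatever recovery the product needs.)
watch_taint() {
    local bit=$1 interval=${2:-5} mask
    while true; do
        mask=$(cat "$TAINT_FILE")
        if taint_bit_set "$mask" "$bit"; then
            echo "taint bit $bit set (mask=$mask); forcing a crash" >&2
            echo c > /proc/sysrq-trigger   # needs root
        fi
        sleep "$interval"
    done
}

# Only start polling when asked to, so the file can also be sourced.
if [[ "${1:-}" == "--watch" ]]; then
    watch_taint "${2:-$TAINT_WARN_BIT}"
fi
```

Run as root with `--watch 9` on an unmodified kernel, or `--watch 7` with the TAINT_DIE patch applied. Note that TAINT_WARN is also set by any other kernel WARN, which is the reason for preferring TAINT_DIE.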
Also -- no additional strenuous load is running on the compute stick, only this transcode. We have not modified the standard C-states, and the frequency scaling governor is set to "powersave".

shield@tobeprovisioned1804:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave

Please let me know if there is some different configuration which is less likely to exhibit this issue. We previously have put stress-ng loads on the CPUs to better simulate what we typically would utilize, and that configuration seemed especially stable... but depending on periodic CPU activity to prevent GPU hangs is just playing Russian roulette with the GPU video encoder.
My DRM-TIP kernel is from:

commit 00cb3798a5d008c3f824fe7c89c663dba66155c3 (HEAD -> drm-tip, origin/drm-tip, origin/HEAD)
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date: Fri Mar 22 12:52:43 2019 -0700

These config switches were ADDED to DRM-TIP so I could boot from eMMC, configure for lower kernel latency, and see serial output when the GPU goes bonkers:

CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_FTDI_SIO=y
CONFIG_USB_PL2303=y
CONFIG_FRAME_POINTER=y
CONFIG_LATENCYTOP=y
CONFIG_MMC=y
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PCI=y
CONFIG_MMC_RICOH_MMC=y
CONFIG_MMC_SDHCI_ACPI=y
CONFIG_DEBUG_INFO=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
CONFIG_KEXEC_JUMP=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_DRM_I915_DEBUG=y
CONFIG_DRM_I915_DEBUG_RUNTIME_PM=y
CONFIG_USB_RTL8152=y
CONFIG_USB_NET_DRIVERS=y

The transcoding script using gstreamer is:

#!/usr/bin/env bash
set -ex
tcount=0
while true; do
    echo "Transcode: iteration $tcount" | tee tcount.txt
    # remove old output
    rm -f /tmp/gst-output.mp4
    time gst-launch-1.0 filesrc location=andy-movies/mjpeg-outside-640x480.mkv ! matroskademux ! vaapijpegdec ! vaapih264enc ! qtmux ! filesink location=/tmp/gst-output.mp4
    tcount=$((tcount+1))
done
Created attachment 143920 [details] Crash #2 of the second test configuration. This one happened in 4 iterations on the same hardware. This is another crash on the same test configuration. I restarted my test case and moments later, the same machine had crashed again after 4 iterations. dmesg and GPU crashlog enclosed.
Created attachment 143921 [details] 3rd GPU hang on same hardware using same test loop. Within 2 iterations. This time my transcoder script seemed to have terminated... but actually didn't -- it was just asleep, waiting. The transcoder script RESUMED RUNNING after 14 minutes and completed the transcoding task. No further information was emitted into the dmesg log. This is bad because our usage case can't tolerate a "hang" where the transcoder does not emit anything for 14 minutes... or even a few milliseconds.
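A stall like this one can at least be detected from the test loop by putting a deadline on each iteration. The sketch below wraps a command in coreutils `timeout`, which exits with status 124 when the deadline expires. The function name and its return codes are made up for illustration, and a process stuck in uninterruptible sleep inside the driver may not respond to the signal at all, so treat it as a detector rather than a fix.

```shell
#!/usr/bin/env bash
# run_with_deadline SECONDS CMD [ARGS...]
# Returns 0 on success, 2 if CMD overran the deadline, 1 on any other failure.
run_with_deadline() {
    local limit=$1; shift
    local status=0
    timeout "$limit" "$@" || status=$?
    case $status in
        0)   return 0 ;;
        124) return 2 ;;   # coreutils timeout: the deadline expired
        *)   return 1 ;;
    esac
}
```

In the transcode loop this would look like `run_with_deadline 120 gst-launch-1.0 ...`, where the 120 second limit is an assumed value that should be set from the normal duration of one iteration.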
Created attachment 143922 [details] 4th GPU hang on same hardware This was a full GPU crash which happened a few minutes after the initial "hang" of 14 minutes. The GPU crashlog is enclosed.
(In reply to Andy Nicholas from comment #5) > Created attachment 143921 [details] > 3rd GPU hang on same hardware using same test loop. Within 2 iterations. > > This time my transcoder script seemed to have terminated... but actually > didn't -- it was just asleep, waiting. The transcoder script RESUMED RUNNING > after 14 minutes and completed the transcoding task. No further information > was emitted into the dmesg log. > > This is bad because our usage case can't tolerate a "hang" where the > transcoder does not emit anything for 14 minutes... or even a few > milliseconds. Btw, while the transcoder was "hung" I was able to login and use the kernel as normal using SSH so most of the kernel subsystems seemed to be performing normally. The GPU was definitely not functioning properly. Btw, This is using Ubuntu server so we do not have Xserver running.
Created attachment 143923 [details] 5th GPU hang on same hardware, within 1 iteration GPU crash within 1 iteration after rebooting into clean system. I'm going to stop testing now. If you guys need this compute stick in order to reproduce the crash, we can FedEx you this system. It's a compute stick and a USB network dongle -- not big.
We have 5 compute sticks running this exact same test simultaneously. These compute sticks should have identical configurations running identical test cases, but none of the other compute sticks are crashing. To try to narrow this down, we will see if we can create a CloneZilla image of the compute stick which crashes. This image will be used on other compute sticks to see if we are able to narrow down the GPU crashing problem.
If you can capture an input stream that reproduces the hang, that would be fantastic. Even if it is just one mjpeg frame in a loop, that plus the reproduction command will be invaluable.
Created attachment 143924 [details] GPU hang #6 after replacing power supply and cable to same compute stick. GPU hang #6 after replacing the power supply and cable to the same compute stick. GPU hung within 1 iteration of this test case.
(In reply to Andy Nicholas from comment #11) > Created attachment 143924 [details] > GPU hang #6 after replacing power supply and cable to same compute stick. > > GPU hang #6 after replacing the power supply and cable to the same compute > stick. GPU hung within 1 iteration of this test case. Differing power supply was plugged into a different power strip from the original power supply feeding this compute stick, so we are reasonably sure that the power supply and power feeding systems are working properly. We have lots and lots of compute stick power supplies and feed cables, so this test was simple.
Created attachment 143925 [details] GPU hang #7 - on different compute stick running exact same test This is from a different compute stick running the exact same test with exact same binary file for movie. Dies around iteration #58. Some stacks captured this time, not sure if the extra info is helpful. Btw, our output directory is using /tmp which is using tmpfs, so writing to the local flash filesystem is not happening as part of this test.
Try

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 1a6f36e08a60..d2e075e54e89 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -7202,7 +7202,7 @@ static void gen9_enable_rc6(struct drm_i915_private *dev_priv)
 	 * 3b: Enable Coarse Power Gating only when RC6 is enabled.
 	 * WaRsDisableCoarsePowerGating:skl,cnl - Render/Media PG need to be disabled with RC6.
 	 */
-	if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv))
+	if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv) || 1)
 		I915_WRITE(GEN9_PG_ENABLE, 0);
 	else
 		I915_WRITE(GEN9_PG_ENABLE,
(In reply to Chris Wilson from comment #14) > Try > > diff --git a/drivers/gpu/drm/i915/intel_pm.c > b/drivers/gpu/drm/i915/intel_pm.c > index 1a6f36e08a60..d2e075e54e89 100644 > --- a/drivers/gpu/drm/i915/intel_pm.c > +++ b/drivers/gpu/drm/i915/intel_pm.c > @@ -7202,7 +7202,7 @@ static void gen9_enable_rc6(struct drm_i915_private > *dev_priv) > * 3b: Enable Coarse Power Gating only when RC6 is enabled. > * WaRsDisableCoarsePowerGating:skl,cnl - Render/Media PG need to be > disabled with RC6. > */ > - if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv)) > + if (NEEDS_WaRsDisableCoarsePowerGating(dev_priv) || 1) > I915_WRITE(GEN9_PG_ENABLE, 0); > else > I915_WRITE(GEN9_PG_ENABLE, Sure, I can try this. According to the comment, doesn't your change just force the same path of disabling coarse power gating? If so, is the power consumption now gigantic while encoding or decoding media?
Busy power consumption will be unaffected, semi-active power consumption will be unaffected (that uses rc6 for short sleeps while active). Idle (GPU) power consumption will be affected (off the top of my head, it prevents saving the last 100mW). The rapl interface provides power consumption information (see https://gitlab.freedesktop.org/drm/igt-gpu-tools/blob/master/lib/igt_gpu_power.c for an example) All 7 GPU error states indicate that it failed before reloading the same context after a short idling -- the IPEHR (last command parsed) is the last command in the retiring context. RING_TAIL (every time) is garbage -- whether that is just the forcewake failing... probably.
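For a quick measurement without building igt, the RAPL counters are also exposed through the powercap sysfs interface. The sketch below turns two `energy_uj` samples into an average power figure. The zone path in the usage note is an assumption: zone numbering differs between machines, so check each zone's `name` file (the GPU is normally the one called "uncore"), and reading `energy_uj` may need root.

```shell
#!/usr/bin/env bash
# avg_power_mw START_UJ END_UJ SECONDS [MAX_RANGE_UJ]
# Average power in milliwatts between two energy_uj samples. MAX_RANGE_UJ
# (from max_energy_range_uj) is used to undo a single counter wraparound.
avg_power_mw() {
    local start=$1 end=$2 secs=$3 max=${4:-0}
    local delta=$(( end - start ))
    if (( delta < 0 )); then
        delta=$(( delta + max ))
    fi
    # microjoules per second is microwatts; divide by 1000 for milliwatts
    echo $(( delta / (secs * 1000) ))
}

# sample_zone ZONE_DIR [SECONDS]: print the zone name and its average power.
sample_zone() {
    local zone=$1 secs=${2:-10} a b
    a=$(cat "$zone/energy_uj")
    sleep "$secs"
    b=$(cat "$zone/energy_uj")
    echo "$(cat "$zone/name"): $(avg_power_mw "$a" "$b" "$secs" \
        "$(cat "$zone/max_energy_range_uj")") mW"
}
```

For example, `sample_zone /sys/class/powercap/intel-rapl:0:1 10` on an idle machine, once with and once without the power-gating patch, should show whether the idle difference is measurable.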
Forcing the coarse power stuff off did not work. Crashlog enclosed. Took about 95 iterations (95 minutes) to reproduce. 2 other compute sticks have shown the same issue with the previous kernel.
Created attachment 143944 [details] 7th GPU hang, this time with always disabling coarse-grain power
Created attachment 143945 [details] 8th GPU hang, but this time without a GPU crashlog. Was using coarse power gating disabled. 8th GPU hang, but without a GPU crashlog.
So to test the earlier observation that it is rc6-related,

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 8e826a6ab62e..acce2574228b 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -8671,7 +8671,7 @@ static void intel_enable_rc6(struct drm_i915_private *dev_priv)
 	else if (INTEL_GEN(dev_priv) >= 11)
 		gen11_enable_rc6(dev_priv);
 	else if (INTEL_GEN(dev_priv) >= 9)
-		gen9_enable_rc6(dev_priv);
+		gen9_disable_rc6(dev_priv);
 	else if (IS_BROADWELL(dev_priv))
 		gen8_enable_rc6(dev_priv);
 	else if (INTEL_GEN(dev_priv) >= 6)
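Before starting a long run on a patched kernel, it is worth confirming that RC6 really is off. A rough sketch, assuming the i915 sysfs counter `power/rc6_residency_ms` is present on the card (it may be missing on some configurations, in which case the `cat` fails): sample it twice while the GPU is idle and see whether it advances. The function names and the card path default are illustrative.

```shell
#!/usr/bin/env bash
# rc6_advanced BEFORE_MS AFTER_MS -> exit status 0 if the counter moved,
# i.e. the GPU spent time in RC6 between the two samples.
rc6_advanced() {
    (( $2 > $1 ))
}

# check_rc6 [CARD_DIR] [SECONDS]: sample rc6_residency_ms twice on an idle GPU.
check_rc6() {
    local card=${1:-/sys/class/drm/card0} secs=${2:-5} before after
    before=$(cat "$card/power/rc6_residency_ms")
    sleep "$secs"
    after=$(cat "$card/power/rc6_residency_ms")
    if rc6_advanced "$before" "$after"; then
        echo "RC6 active: residency grew by $(( after - before )) ms"
    else
        echo "RC6 inactive: residency unchanged at $after ms"
    fi
}
```

With the patch above applied, `check_rc6` should report an unchanged counter; on a stock kernel with an idle GPU it should grow by roughly the sampling interval.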
Created attachment 143954 [details]
GPU hang #9 after coarse power gating disabled, machine .36

Ok, I will try the rc6 disable suggestion shortly. It will take a few hours to run and cause trouble.

This is a hang from the 2nd compute stick, which hung after 257 transcode iterations using the same source MJPEG movie.

We have 2 compute sticks which seem to reliably show these problems. One is in San Diego at my work. The other is here in San Jose, Bay Area, with me. We made a CloneZilla image of the compute stick to see if it would hang, and it did after about 650 iterations (i.e., about 650 minutes).

I am told that CloneZilla can only really restore to a device of similar size, so if you had an m5 compute stick (STK2mv64cc) we could put the 5GB image on a server for you. There's nothing confidential from shield. It's a stock 18.04 image with my home-made MJPEG movie.

If necessary we can FedEx a compute stick which demonstrates the problem.
(In reply to Andy Nicholas from comment #21) > I am told that CloneZilla can only really restore to device of similar size, > so if you had an m5 compute stick (STK2mv64cc) we could put the 5GB image on > a server for you. I've put one on order. Even just a simple disk image should be sufficient for a chroot, but whatever works :)
Ok, I ran the kernel with the patch to disable RC6 on 8x compute sticks running the same kernel and same user-mode libraries. As per the previous tests, the minimum and maximum frequency was locked at 500 MHz.

We observed zero crashes or warnings emitted into the dmesg log (or anywhere else). None of the compute sticks locked up or otherwise misbehaved.

This was our expectation, and after running the transcode loop more than 28,000 times (3500x8) this continues to be the case.
(In reply to Andy Nicholas from comment #23)
> Ok, I ran the kernel with the patch to disable RC6 on 8x compute sticks
> running the same kernel and same user-mode libraries. As per the previous
> tests, the minimum and maximum frequency was locked at 500 MHz.
>
> We observed zero crashes or warnings emitted into the dmesg log (or anywhere
> else). None of the compute sticks locked up or otherwise misbehaved.
>
> This was our expectation, and after running the transcode loop more than
> 28,000 times (3500x8) this continues to be the case.

Every previous test hung by around iteration 1050 at the latest, and often much sooner, so I believe having all the compute sticks survive to 3500 iterations is statistically significant.
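To put a rough number on that, assume (and this is a simplification, not something measured) that every iteration hangs independently with the same probability, taken from the most favourable pre-patch run of about one hang per 1050 iterations. The chance of 28,000 clean iterations in a row is then (1 - 1/1050)^28000, which this sketch evaluates with awk:

```shell
#!/usr/bin/env bash
# survive_prob ITERATIONS_PER_HANG N: probability of N consecutive hang-free
# iterations when each iteration hangs independently with probability
# 1/ITERATIONS_PER_HANG, i.e. (1 - 1/ITERATIONS_PER_HANG)^N.
survive_prob() {
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%.3g\n", exp(n * log(1 - 1 / m)) }'
}

# Assumed rate: one hang per 1050 iterations; 8 sticks x 3500 iterations.
survive_prob 1050 28000
```

Under those assumptions the result is far below one in a billion, so the clean run is very unlikely to be luck. The real hang rate clearly varies from stick to stick, so this is only an order-of-magnitude argument.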
You can try something silly like

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 4e0a351bfbca..e5feb0f5a5fe 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2115,10 +2115,16 @@ static int gen9_emit_bb_start(struct i915_request *rq,
 {
 	u32 *cs;
 
-	cs = intel_ring_begin(rq, 6);
+	cs = intel_ring_begin(rq, 12);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL);
+	*cs++ = GEN6_RC_CTL_HW_ENABLE |
+		GEN6_RC_CTL_RC6_ENABLE |
+		GEN6_RC_CTL_EI_MODE(1);
+
 	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
 
 	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
@@ -2129,6 +2135,10 @@ static int gen9_emit_bb_start(struct i915_request *rq,
 	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
 	*cs++ = MI_NOOP;
 
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL);
+	*cs++ = 0;
+
 	intel_ring_advance(rq, cs);
 
 	return 0;

to see if it is just an rc6 event on idling that is the culprit, while keeping rc6 active for the encoder.
The other step would be to disable rc6 only and keep the powergate in gen9_enable_rc6() -- to confirm which of the two is the risk.
I do now have a STK2mv64cc and it survived a w/e with non-media workloads. If you have a recipe for me to run mjpeg through it, that would be useful.
(In reply to Chris Wilson from comment #26) > The other step would be to disable rc6 only and keep the powergate in > gen9_enable_rc6() -- to confirm which of the two is the risk. To be clear - I believe this is the state of Shield's testing, all of which use the same drm-tip kernel from 3 weeks ago, all of which are transcoding from MJPEG --> H264: (1) Only disable RC6, no other changes == no GPU issues (2) Only disable coarse power-gating [in gen9_enable_rc6()], no other changes == GPU hangs. (3) No changes to drm-tip kernel == GPU hangs. I'm guessing that there's something that RC6 allows which can cause the transcoder GPU hangs to occur, but whatever that is, it's not related to coarse power gating. I will proceed with placing both changes (RC6 disable + Coarse power gating disable) into a kernel. I'm expecting the GPU not to hang/crash.
The combination that I don't think has been tested is

GEN6_RC_CONTROL == 0
GEN9_PG_ENABLE == GEN9_RENDER_PG_ENABLE | GEN9_MEDIA_PG_ENABLE

as so far both have been disabled when disabling rc6.
Created attachment 143979 [details] MJPEG - hand-carry video camera outside my house Enclosed is 1 file for the recipe. This file needs to be transcoded from its format of MJPEG into H.264 on the compute stick using vaapi using gstreamer on Ubuntu 18.04 Server. Complete recipe coming after this large upload.
(In reply to Andy Nicholas from comment #30)
> Created attachment 143979 [details]
> MJPEG - hand-carry video camera outside my house
>
> Enclosed is 1 file for the recipe. This file needs to be transcoded from its
> format of MJPEG into H.264 on the compute stick using vaapi using gstreamer
> on Ubuntu 18.04 Server.
>
> Complete recipe coming after this large upload.

Complete recipe from "No OS":

(1) Install Ubuntu 18.04.2 Server onto the compute stick. I used a USB stick to do this. Select volume manager in settings and enable SSH. Choose to make 52GB of the compute stick into the large boot partition.

(2) Install these to get current gstreamer and ffmpeg test facilities:

sudo apt install -y libgstreamer1.0-0 gstreamer1.0-plugins-base gstreamer1.0-plugins-good
sudo apt install -y gstreamer1.0-plugins-bad gstreamer1.0-plugins-ugly gstreamer1.0-libav
sudo apt install -y gstreamer1.0-doc gstreamer1.0-tools gstreamer1.0-x gstreamer1.0-alsa
sudo apt install -y gstreamer1.0-gl gstreamer1.0-gtk3 gstreamer1.0-qt5 gstreamer1.0-pulseaudio
sudo apt install -y ffmpeg
sudo apt install -y gstreamer1.0-vaapi
sudo apt install -y vainfo
sudo apt install -y xserver-xorg-hwe-18.04
sudo apt-get update
sudo apt-get upgrade

(3) Add the line below to /etc/fstab to point /tmp to tmpfs so we are routinely saving to RAM instead of flash:

tmpfs /tmp tmpfs defaults,noatime,nodiratime 0 0

(4) Save the following script to ~/setup-gpu.sh:

#!/usr/bin/env bash
echo "This script must be run from sudo su prompt #"
set -ex
echo 500 | tee /sys/class/drm/card0/gt_min_freq_mhz
echo 500 | tee /sys/class/drm/card0/gt_max_freq_mhz
echo 500 | tee /sys/class/drm/card0/gt_boost_freq_mhz
cat /sys/class/drm/card0/gt_min_freq_mhz
cat /sys/class/drm/card0/gt_max_freq_mhz
cat /sys/class/drm/card0/gt_boost_freq_mhz
echo "Intel GPU Frequencies Locked"
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
echo "Intel CPUs set to performance governor"

(5) Save the following script to ~/mjpeg-to-h264.sh:

#!/usr/bin/env bash
set -ex
tcount=0
while true; do
    echo "Transcode: iteration $tcount" | tee tcount.txt
    # remove old output
    rm -f /tmp/gst-output.mp4
    # transcode the MJPEG movie using gstreamer
    time gst-launch-1.0 filesrc location=andy-movies/mjpeg-outside-640x480.mkv ! matroskademux ! vaapijpegdec ! vaapih264enc ! qtmux ! filesink location=/tmp/gst-output.mp4
    tcount=$((tcount+1))
done

(6) Save the MJPEG movie from this bug reply #30 to:

~/andy-movies/mjpeg-outside-640x480.mkv

(7) Switch to root, run the setup script, then run the transcode loop as a normal user:

$ sudo su
# ./setup-gpu.sh
<ctrl-D> to exit from root
$ ./mjpeg-to-h264.sh

Once the compute stick is configured, the last step (#7) is what I do over and over after rebooting.

NOTE: It's highly possible that this configuration will not reproduce the issue, even when updated with the drm-tip kernel. We have only been successful when using the image which was running on compute sticks .38 and .36.

andy
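The four hard-coded governor writes in setup-gpu.sh can be replaced by a loop over whatever CPUs the machine exposes, which helps if the recipe is reused on hardware with a different core count. This is only a sketch: the function name and the optional second argument (an alternative sysfs root, handy for dry runs) are made up here.

```shell
#!/usr/bin/env bash
# set_governor GOVERNOR [SYSFS_CPU_ROOT]
# Writes GOVERNOR to every cpuN/cpufreq/scaling_governor under the root and
# prints how many CPUs were updated. Needs root on a real system.
set_governor() {
    local gov=$1 root=${2:-/sys/devices/system/cpu} count=0 f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [[ -e $f ]] || continue      # glob matched nothing
        echo "$gov" > "$f"
        count=$(( count + 1 ))
    done
    echo "$count"
}
```

From the root prompt in step 7, `set_governor performance` should print 4 on the compute stick described in this report, since the script above writes to cpu0 through cpu3.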
This is the diff for the powergating test I'm running now. No other changes are made to the kernel, just this one patch:

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index eaf0793ebf60..29260ba32529 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -8609,8 +8609,12 @@ static void intel_enable_rc6(struct drm_i915_private *dev_priv)
 		cherryview_enable_rc6(dev_priv);
 	else if (IS_VALLEYVIEW(dev_priv))
 		valleyview_enable_rc6(dev_priv);
-	else if (INTEL_GEN(dev_priv) >= 9)
-		gen9_enable_rc6(dev_priv);
+	else if (INTEL_GEN(dev_priv) >= 9) {
+		gen9_disable_rc6(dev_priv);
+
+		I915_WRITE(GEN9_PG_ENABLE,
+			   GEN9_RENDER_PG_ENABLE | GEN9_MEDIA_PG_ENABLE);
+	}
 	else if (IS_BROADWELL(dev_priv))
 		gen8_enable_rc6(dev_priv);
 	else if (INTEL_GEN(dev_priv) >= 6)

Btw - I did read through the gen9_enable_rc6() code and the "hysteresis" stuff jumped out at me. Does anything bad happen if those calculations are not accurate?
Created attachment 143984 [details] GPU hang #9 Enclosed is a GPU hang after running with power gating enabled, but the gen9_disable_rc6() code preceding that. Was not really expecting that, but does it narrow down the issue?
Created attachment 143985 [details] GPU Hang #10 after enabling coarse grain power gating Crash on 2nd compute stick after enabling coarse grain power gating. No GPU crashlog recovered.
(In reply to Andy Nicholas from comment #33)
> Created attachment 143984 [details]
> GPU hang #9
>
> Enclosed is a GPU hang after running with power gating enabled, but the
> gen9_disable_rc6() code preceding that. Was not really expecting that, but
> does it narrow down the issue?

This was after around 150 iterations, so this did not take long to cause trouble. The other crash happened on a compute stick which, as far as I know, had never crashed previously.
Created attachment 143986 [details] GPU hang #11 - after enabling coarse power gating One more hang after 67 iterations after enabling coarse power gating, but disabling RC6 using gen9_disable_rc6().
I did not try the "silly" suggestion above, but I can do that in the morning. I'm in San Jose CA, Bay Area.
(In reply to Andy Nicholas from comment #32) > This is the diff for the powergating test I'm running now. No other changes > are made to the kernel, just this one patch: > > diff --git a/drivers/gpu/drm/i915/intel_pm.c > b/drivers/gpu/drm/i915/intel_pm.c > index eaf0793ebf60..29260ba32529 100644 > --- a/drivers/gpu/drm/i915/intel_pm.c > +++ b/drivers/gpu/drm/i915/intel_pm.c > @@ -8609,8 +8609,12 @@ static void intel_enable_rc6(struct drm_i915_private > *dev_priv) > cherryview_enable_rc6(dev_priv); > else if (IS_VALLEYVIEW(dev_priv)) > valleyview_enable_rc6(dev_priv); > - else if (INTEL_GEN(dev_priv) >= 9) > - gen9_enable_rc6(dev_priv); > + else if (INTEL_GEN(dev_priv) >= 9) { > + gen9_disable_rc6(dev_priv); > + > + I915_WRITE(GEN9_PG_ENABLE, > + GEN9_RENDER_PG_ENABLE | GEN9_MEDIA_PG_ENABLE); > + } > else if (IS_BROADWELL(dev_priv)) > gen8_enable_rc6(dev_priv); > else if (INTEL_GEN(dev_priv) >= 6) > > > Btw - > > I did read through the gen9_enable_rc6() code and the "hysteresis" stuff > jumped out at me. does anything bad happen if those calculations are not > accurate? It will just use whatever the default values were. The important point here is that with RC6_CONTROL=0 and PG_ENABLE!=0, we see exactly the same hang as before. That is at least a nudge back towards the powergating as being the culprit.
(In reply to Chris Wilson from comment #25) > You can try something silly like > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c > b/drivers/gpu/drm/i915/intel_lrc.c > index 4e0a351bfbca..e5feb0f5a5fe 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.c > +++ b/drivers/gpu/drm/i915/intel_lrc.c > @@ -2115,10 +2115,16 @@ static int gen9_emit_bb_start(struct i915_request > *rq, > { > u32 *cs; > > - cs = intel_ring_begin(rq, 6); > + cs = intel_ring_begin(rq, 12); > if (IS_ERR(cs)) > return PTR_ERR(cs); > > + *cs++ = MI_LOAD_REGISTER_IMM(1); > + *cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL); > + *cs++ = GEN6_RC_CTL_HW_ENABLE | > + GEN6_RC_CTL_RC6_ENABLE | > + GEN6_RC_CTL_EI_MODE(1); > + > *cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE; > > *cs++ = MI_BATCH_BUFFER_START_GEN8 | > @@ -2129,6 +2135,10 @@ static int gen9_emit_bb_start(struct i915_request *rq, > *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE; > *cs++ = MI_NOOP; > > + *cs++ = MI_LOAD_REGISTER_IMM(1); > + *cs++ = i915_mmio_reg_offset(GEN6_RC_CONTROL); > + *cs++ = 0; > + > intel_ring_advance(rq, cs); > > return 0; > > to see if it is just an rc6 event on idling that is the culprit, while > keeping rc6 active for the encoder. gen9_emit_bb_start() does not exist in my drm-tip sources from 4 weeks ago (March 22). There is a gen8_emit_bb_start(), but no "gen9" or anything which seems similar. andy@kbuild:/build/kernel/drm-tip/drm-tip/drivers/gpu/drm/i915$ git log commit 00cb3798a5d008c3f824fe7c89c663dba66155c3 (HEAD -> drm-tip, origin/drm-tip, origin/HEAD) Author: Rodrigo Vivi <rodrigo.vivi@intel.com> Date: Fri Mar 22 12:52:43 2019 -0700 drm-tip: 2019y-03m-22d-19h-51m-23s UTC integration manifest
Created attachment 144003 [details]
GPU hang 12

Crash while checking if the media power gating enable, alone, causes trouble. Answer: yes, but I should run the other configuration also.

index eaf0793ebf60..800ed263c626 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -8609,8 +8609,12 @@ static void intel_enable_rc6(struct drm_i915_private *dev_priv)
 		cherryview_enable_rc6(dev_priv);
 	else if (IS_VALLEYVIEW(dev_priv))
 		valleyview_enable_rc6(dev_priv);
-	else if (INTEL_GEN(dev_priv) >= 9)
-		gen9_enable_rc6(dev_priv);
+	else if (INTEL_GEN(dev_priv) >= 9) {
+		gen9_disable_rc6(dev_priv);
+
+		I915_WRITE(GEN9_PG_ENABLE,
+			   GEN9_MEDIA_PG_ENABLE);
+	}
 	else if (IS_BROADWELL(dev_priv))
 		gen8_enable_rc6(dev_priv);
 	else if (INTEL_GEN(dev_priv) >= 6)
Created attachment 144005 [details] GPU hang 13 - after media encoder power gating ONLY enabled At least 3 compute sticks died when run in this configuration and I uploaded #12 previously. + gen9_disable_rc6(dev_priv); + + I915_WRITE(GEN9_PG_ENABLE, + GEN9_MEDIA_PG_ENABLE); This particular crash had a huge amount of information dumped, so I figured I should upload it also. Interestingly, the compute stick which usually has its GPU hang for all these tests is still running strong after 839 iterations. I am about to test the opposite power gate and see if the render power gate ON while media power gate is OFF has any effect.
I'm going to keep my old kernel around and create a newer kernel based on existing drm-tip from this morning. You seem to be actively pulling in merges from upstream sources and hopefully everything will be OK. I will start a test probably later in the morning (my time) of this kernel to understand whether the problem still occurs, then I can try the "silly test".
After some cursing at ubuntu installers and gstreamer-vaapi packaging, it looks like I have the pipeline working^Wsetup. Do you mind trimming your mjpeg sample as bugs.fd.o doesn't like the current attachment -- and I want to try and stick to your sample to keep differences to a minimum.
(In reply to Andy Nicholas from comment #41) > Created attachment 144005 [details] > GPU hang 13 - after media encoder power gating ONLY enabled > > At least 3 compute sticks died when run in this configuration and I uploaded > #12 previously. > > + gen9_disable_rc6(dev_priv); > + > + I915_WRITE(GEN9_PG_ENABLE, > + GEN9_MEDIA_PG_ENABLE); > > This particular crash had a huge amount of information dumped, so I figured > I should upload it also. > > Interestingly, the compute stick which usually has its GPU hang for all > these tests is still running strong after 839 iterations. > > I am about to test the opposite power gate and see if the render power gate > ON while media power gate is OFF has any effect. Yes, 4 of 8 compute sticks crashed by this morning, so it appears that having either media or render power gating enabled ends up with GPU hanging.
Created attachment 144013 [details] mjpeg-outside-640x480.mkv.aa Using split tool, part 1 of 8 (aa of ah)
Created attachment 144014 [details] mjpeg-outside-640x480.mkv.ab Part 2 of 8
Created attachment 144015 [details] mjpeg-outside-640x480.mkv.ac Part 3 of 8
Created attachment 144016 [details] mjpeg-outside-640x480.mkv.ad Part 4 of 8
Created attachment 144017 [details] mjpeg-outside-640x480.mkv.ae Part 5 of 8
Created attachment 144018 [details] mjpeg-outside-640x480.mkv.af Part 6 of 8
Created attachment 144019 [details] mjpeg-outside-640x480.mkv.ag Part 7 of 8
Created attachment 144020 [details] mjpeg-outside-640x480.mkv.ah Part 8 of 8
Created attachment 144021 [details] mjpeg-outside-640x480.mkv.aa Part 1 of 8 (properly identified as a binary file)
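To rebuild the movie from the split attachments, download all eight parts (using this re-upload for part aa) into one directory and concatenate them in suffix order. A small sketch, with a made-up helper name:

```shell
#!/usr/bin/env bash
# join_parts PREFIX OUTPUT: concatenate PREFIX.aa, PREFIX.ab, ... in order.
# The shell expands the glob in sorted order, which matches split's suffixes.
join_parts() {
    local prefix=$1 out=$2
    cat "$prefix".[a-z][a-z] > "$out"
}
```

For example, `join_parts mjpeg-outside-640x480.mkv mjpeg-outside-640x480.mkv` run in the download directory recreates the original file next to its parts.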
Created attachment 144025 [details] x86_64_defconfig Building with latest drm-tip made no difference. Possibly made problem show up faster because I've had 2 of 8 compute sticks crash within 5 minutes of starting my test(!). My x86_64_defconfig file for arch/x86/configs/x86_64_defconfig is enclosed since this will cause the kernel to be built for PREEMPT which might matter. My drm-tip log starts with this: commit 739f9bd5ae97972c9eebf3fe3574a59f286719ff (HEAD -> drm-tip, origin/drm-tip, origin/HEAD) Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Apr 17 07:26:57 2019 +0100 drm-tip: 2019y-04m-17d-06h-25m-43s UTC integration manifest
Created attachment 144027 [details]
GPU crash #14 - with drm-tip from today

Enclosed. This crashlog seems to have more detailed information than previous ones. Maybe it's because I updated my kernel to be current.
Created attachment 144028 [details] GPU crash #15 - with drm-tip from today Lots of interesting data in this one. Crashed very soon after I started the transcoding test.
Created attachment 144029 [details] GPU hang #16 when "silly test" is enabled 1 of 2 GPU crashes (of 8 running compute sticks) which happened when "silly test" code from above is included in a build of drm-tip from yesterday.
Is the issue related to one of the GPU blocks just not waking up "soon enough" after being powered-down? Or not waking up ever? Any suggestions for good tests to run to help narrow down the problem?
Ideally, I would be able to set up a video pipeline so that the JPEG frames could be decoded using VAAPI into main memory, modified, then encoded using VAAPI to H.264. That is what the previous gstreamer script accomplishes.

At this point I am looking for some kind of fallback position where I could take the JPEG frames and invoke the GPU less, to reduce the chance of crashing. So I'm willing to accept the extra CPU load of decoding the JPEG frames in software, but the GPU would still need to encode the H.264 stream.

I did this with the script below and, unfortunately, the crashes still occur. Software JPEG decode is being paired with hardware H.264 encode. I've been running for less than 1 hour and 2 compute sticks have already crashed.

Not sure how I'm going to proceed, because it looks like the only way to work around this issue is to disable RC6. Maybe I can try to reduce the size of the movie and see if a more minimal size can reproduce the problem.

#!/usr/bin/env bash
set -ex
tcount=0
while true; do
    echo "Transcode: iteration $tcount" | tee tcount.txt
    # remove old output
    rm -f /tmp/gst-output.mp4
    # transcode using gstreamer
    time gst-launch-1.0 filesrc location=mjpeg-outside-640x480.mkv ! \
        matroskademux ! \
        queue ! \
        jpegdec ! \
        videoconvert ! \
        videoscale ! \
        "video/x-raw,width=640,height=480,framerate=(fraction)30/1" ! \
        queue ! \
        vaapih264enc ! \
        mp4mux ! \
        filesink location=/tmp/gst-output.mp4
    tcount=$((tcount+1))
done
My current run is up to 12 days (25428 iterations) without incident on my STK2mV64CC. That has been the longest I've managed to leave it alone over the last few weeks and not run other tests! I think this compute stick isn't showing the symptoms. However, given the hypothesis that this is from poking at the HW at the same time as it is powergating (or transitioning into a deep rc6), it should have been only a matter of time for it to strike, given the same hw and sw. Of course, I'd like to assume the reason I haven't seen it is because it is unreproducible on drm-tip!
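To put a rough number on "only a matter of time": the sketch below assumes, purely for illustration, that hangs are independent per iteration and occur about once per 1026 iterations (the single data point from the original report), then estimates the chance of 25428 clean iterations in a row. Both the rate and the independence assumption are guesses, and `survival_probability` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# survival_probability EVERY N
# Prints the probability of N consecutive clean iterations when each iteration
# hangs independently with probability 1/EVERY (an assumed, illustrative model).
# Computed as exp(N * log(1 - 1/EVERY)); EVERY must be greater than 1.
survival_probability() {
    awk -v every="$1" -v n="$2" \
        'BEGIN { printf "%.2e\n", exp(n * log(1 - 1 / every)) }'
}

survival_probability 1026 25428
```

Under those assumptions the result is on the order of 1e-11, which is why a clean 25428-iteration run suggests the two setups differ in some way rather than that the run was merely lucky.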
Created attachment 144364 [details] attachment-18046-0.html Thanks for looking at this. Which version of drm-tip are you using?
*** Bug 110297 has been marked as a duplicate of this bug. ***
(In reply to Andy Nicholas from comment #61) > Thanks for looking at this. Which version of drm-tip are you using? You can always verify with the latest drm-tip (https://cgit.freedesktop.org/drm-tip) and if the issue occurs please report back.
So I was playing around with disabling GEN9_PG_ENABLE while an engine was active, and in doing so reproduced exactly the same symptoms with forcewake dying, and with different patterns killing mmio entirely. (Reminder for my future self.)
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/264.