Bug 107945

Summary: System crashes seconds after a GPU hang with kernel newer than 4.18
Product: DRI
Reporter: leozinho29_eu
Component: DRM/Intel
Assignee: Tvrtko Ursulin <tvrtko.ursulin>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: highest
CC: intel-gfx-bugs, tvrtko.ursulin
Version: unspecified
Hardware: Other
OS: All
Whiteboard: Triaged, ReadyForDev
i915 platform: SKL
i915 features: GPU hang
Attachments (description, flags):
Dmesg and GPU hangs (flags: none)
netconsole (flags: none)
Relevant files when using intel_iommu=igfx_off (flags: none)
dmesg and card0/error (flags: none)
The error when building the kernel with the commit reverted (flags: none)
dmesg from crash with proposed Ubuntu HWE kernel (flags: none)
Bring back GT workaround application on the engine reset path (flags: none)

Description leozinho29_eu 2018-09-15 17:47:26 UTC
Initially I reported https://bugs.freedesktop.org/show_bug.cgi?id=107941 but I'm reporting the system crash here to keep it separate.

The attached file at https://bugs.freedesktop.org/attachment.cgi?id=141570 has the relevant files from the GPU hang that triggered the crash but, unfortunately, has no data from the crash itself.

With kernel 4.18 and newer (including the -rc releases), the system crashes after a GPU hang. There appears to be no way to recover, and I have to hold the power button to power off the computer.

4.17.19 is OK: a GPU hang may still happen, but the system does not crash afterwards, and I can keep using the computer normally.

I'll try to set up netconsole to see if I can get any useful information from the crashed system with this new drm-tip, but in my previous attempts there was no output on the receiving machine.

Processor: Intel Core i3-6100U;
Video: Intel HD Graphics 520;
Architecture: amd64;
Mesa: 18.3.0-devel (git-914bd3014f);
Kernel version: drm-tip (feeccde66999c5e87be3550f2159e5d7eeb61c67)
Distribution: Xubuntu 18.04.1 amd64.

Notes:

I'm sorry for reporting so many bugs at once, it's the first time I'm using Vulkan seriously on my computer and could keep the applications open for enough time to detect problems.

There is a multitude of regressions affecting my computer when using kernel versions at least 4.18 (this is the one related to i915), so I have to use 4.17 for now.
Comment 1 Chris Wilson 2018-09-15 19:57:36 UTC
Please be specific about which kernel you tested, and please check upstream from https://cgit.freedesktop.org/drm/drm-tip
Comment 2 leozinho29_eu 2018-09-15 23:38:21 UTC
Created attachment 141579 [details]
Dmesg and GPU hangs

The first kernel that has this problem is 4.18-rc1. The last kernel with this problem is drm-tip (4.19-rc3, 75bb460b367a614d10b0fba220143bee42657d7e).

Dmesg from 4.18-rc1 is corrupted, but there are parts still readable.

I set netconsole and, as I expected, there was no output. One thing I noticed is that I connected an optical USB mouse and, after 10 minutes, its LED still did not turn on.

Last good kernel: 4.17.19
First bad kernel: 4.18-rc1
Latest bad kernel: drm-tip (4.19-rc3, 75bb460b367a614d10b0fba220143bee42657d7e).
Comment 3 Chris Wilson 2018-09-16 10:55:23 UTC
The dmesg from drm-tip goes on for 300+s after the GPU hang, with no indication of subsequent GPU malaise. As you have netconsole set up, one thing you can try is config DRM_I915_DEBUG_GEM to see if any sanity checks fail; make sure you also have reboot on panic set.

As to a possible cause, could you try with intel_iommu=igfx_off?
Comment 4 leozinho29_eu 2018-09-16 15:30:44 UTC
Created attachment 141583 [details]
netconsole

I was writing an answer when the system froze. The notebook is frozen right now, so I am writing this from my desktop. The desktop is receiving the netconsole messages.

I set the debug options in the kernel build (I have not yet tested with the boot option intel_iommu=igfx_off) and set reboot on panic. The notebook needed more than 10 minutes to freeze this time, but the reboot on panic did not work: it has been frozen for a few minutes already (I set kernel.panic = 5 in sysctl.conf).

I think I found a pattern of the system crash. 

1) A GPU hang must happen (for example, the hangs caused by Vulkan);
2) A task with significant video load is closed (watching a video and then closing the media player window, for example);
3) When the focus of the window is changed, the system crashes.

It's important to say that with 4.17.19, after the hang I can open and close demanding tasks with no issues.

I have attached the netconsole, which seems to not have relevant data again. I will try the boot option later, at least it seems I found some pattern now.
Comment 5 leozinho29_eu 2018-09-16 16:56:07 UTC
Created attachment 141585 [details]
Relevant files when using intel_iommu=igfx_off

Setting intel_iommu=igfx_off did not work, it crashed in the same way.

It looks like, whatever this issue is, it is not causing a kernel panic, as the system does not reboot after 5 seconds, even with kernel.panic = 5 and panic=5 set.

The attached file has the dmesg (corrupted but readable), the GPU hang card0/error and a recording of the sound being played when the crash happened.

Is there another way to make the system power off after this? I don't like doing these tests because the HDD stops suddenly, producing a strong click noise.
Comment 6 Chris Wilson 2018-09-17 10:47:55 UTC
Sadly if the system is not panicking, nor is responding to local or remote input, the only way to reset is via the big button.
Comment 7 Lakshmi 2018-09-20 15:04:56 UTC
Reporter, with the current pattern were you able to reproduce the issue every time?
Comment 8 leozinho29_eu 2018-09-20 15:17:13 UTC
Considering the steps:

1) A GPU hang must happen (for example, the hangs caused by Vulkan);
2) A task with significant video load is closed (watching a video and then closing the media player window, for example);
3) When the focus of the window is changed, the system crashes.

To trigger the crash, a GPU hang MUST happen, which is step 1.

Beyond that, no: doing steps 2 and 3 does not trigger the crash every time. I can do steps 2 and 3 a few times with no crash, and then on another try it may crash.

As a rough estimate from my experience, the crash happens about 20% of the times I do steps 2 and 3. I would say that, after step 1, the system will certainly crash eventually; the question is how much later.
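
As a rough illustration of the 20% estimate, the sketch below computes the chance of at least one crash after several attempts at steps 2 and 3. It assumes each attempt is independent with the same crash probability, which the report does not establish; the function name is hypothetical.

```python
def crash_probability(attempts, per_attempt=0.2):
    """Chance of at least one crash in `attempts` tries.

    Assumes every try is independent and crashes with probability
    `per_attempt` (0.2 is the reporter's rough estimate, not a measurement).
    """
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    if not 0.0 <= per_attempt <= 1.0:
        raise ValueError("per_attempt must be between 0 and 1")
    # P(at least one crash) = 1 - P(no crash in any attempt)
    return 1.0 - (1.0 - per_attempt) ** attempts


for n in (1, 5, 10, 20):
    print(f"{n:2d} attempts: {crash_probability(n):.1%}")
```

Under that assumption, a crash becomes very likely within a couple of dozen attempts, which is consistent with the impression that the system always crashes eventually after a hang.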

I will see if I can get a 32 GB pen-drive to install Xubuntu on it to test, as then the HDD will be powered off and I will be able to test with no risks of damaging it.
Comment 9 leozinho29_eu 2018-10-05 01:24:51 UTC
As I was helped on Launchpad with a regression I was facing, and all the Mesa issues I had were solved (thank you), I am now able to use kernel versions newer than 4.17.19.

I installed 4.19-rc6 today and noticed some important details which I believe may be helpful to share.

Using the system normally (office, web browser, mail client), even without triggering a GPU hang or really heavy loads, makes the system show problems similar to this system crash, but only for a very brief time.

Basically, the step:

2) A task with significant video load is closed (watching a video and then closing the media player window, for example)

Sometimes makes the system go through a very brief crash-like situation, where the sound loops once and then the system continues normally.

I've noticed events like that 7 times today already. The last one happened when I opened GNOME Software, a few minutes ago.

It's important to say that there is absolutely nothing that can be seen when that happens, as there is no screen corruption or anything wrong on the screen. The only way to notice those tiny freezes is if there is sound playing, because the sound loops.

Which leads to an interesting thing: whatever causes the system crash is happening often, but without a GPU hang there is no crash. It may be possible to get information about this tiny freeze that, only after a GPU hang, causes a system crash.

Which debug options should be used to try to get useful information, if any?
Comment 10 Lakshmi 2018-10-05 07:24:29 UTC
> Which debug options should be used to try to get useful information, if any?

dmesg log from boot with kernel parameters drm.debug=0x1e log_buf_len=4M.
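
The drm.debug value is a bitmask of message categories. The sketch below decodes 0x1e into category names. The bit assignments are taken from the kernel's include/drm/drm_print.h as I understand it for kernels of this era; treat them as an assumption and check the header of the kernel you actually run. The function and table names are hypothetical.

```python
# Bit values assumed from include/drm/drm_print.h (verify for your kernel).
DRM_DEBUG_CATEGORIES = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled in `mask`."""
    return [
        name
        for bit, name in sorted(DRM_DEBUG_CATEGORIES.items())
        if mask & bit
    ]


print(decode_drm_debug(0x1E))
```

With these assumed bit values, 0x1e enables driver, KMS, PRIME and atomic messages but leaves out the core category. log_buf_len=4M enlarges the kernel log buffer so the extra output is not overwritten before it can be saved.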
Comment 11 leozinho29_eu 2018-10-05 17:53:32 UTC
I have booted the system with those settings and obtained the dmesg. As the file is too big even compressed, I uploaded it to Megasync. The link is:

https://mega.nz/#!EooyWYqb!cbNA6qea85_uxKj_lvimZ0-Awnrv6i-FDTBcy1TfpwQ

I noticed the events in three moments, at approximately: 9350 seconds, 9473 seconds and 9714 seconds.

I tried reading the dmesg but I can't point to anything more specific than the time when that happened.
Comment 12 Lakshmi 2018-10-23 10:58:55 UTC
> https://mega.nz/#!EooyWYqb!cbNA6qea85_uxKj_lvimZ0-Awnrv6i-FDTBcy1TfpwQ
I cannot find a GPU hang in this file. Can you attach the GPU crash dump /sys/class/drm/card$N/error in this case?
Comment 13 leozinho29_eu 2018-10-23 13:52:25 UTC
Created attachment 142149 [details]
dmesg and card0/error

That log had no GPU hang because there really was no GPU hang. The times I specified in the comment were when I noticed the sound hiccups, similar to this crash, but with no crash. Sorry if I was not clear.

I did the tests now with drm-tip HEAD 9510f8e44127260f92b5b6c3127aafa22b15f741:

The attached compressed file has a corrupted but readable dmesg and the GPU hang log. A sysrq emergency sync saved the logs this time.
Comment 14 leozinho29_eu 2018-10-29 00:17:48 UTC
I have bisected between 4.17 and 4.18-rc1 (nearly 6000 commits) and the first bad commit is:

commit 59b449d5c82af03acdfc3f9a343c9d085ab5568f (refs/bisect/bad)
Author: Oscar Mateo <oscar.mateo@intel.com>
Date:   Tue Apr 10 09:12:47 2018 -0700

    drm/i915: Split out functions for different kinds of workarounds
    
    There are different kind of workarounds (those that modify registers that
    live in the context image, those that modify global registers, those that
    whitelist registers, etc...) and they have different requirements in terms
    of where they are applied and how. Also, by splitting them apart, it should
    be easier to decide where a new workaround should go.
    
    v2:
      - Add multiple MISSING_CASE
      - Rebased
    
    v3:
      - Rename mmio_workarounds to gt_workarounds (Chris, Mika)
      - Create empty placeholders for BDW and CHV GT WAs
      - Rebased
    
    v4: Rebased
    
    v5:
     - Rebased
     - FORCE_TO_NONPRIV register exists since BDW, so make a path
       for it to achieve universality, even if empty (Chris)
    
    Signed-off-by: Oscar Mateo <oscar.mateo@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    [ickle: appease checkpatch]
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/1523376767-18480-2-git-send-email-oscar.mateo@intel.com



The bisect log:

git bisect start
# good: [29dcea88779c856c7dc92040a0c01233263101d4] Linux 4.17
git bisect good 29dcea88779c856c7dc92040a0c01233263101d4
# bad: [ce397d215ccd07b8ae3f71db689aedb85d56ab40] Linux 4.18-rc1
git bisect bad ce397d215ccd07b8ae3f71db689aedb85d56ab40
# bad: [1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 1c8c5a9d38f607c0b6fd12c91cbe1a4418762a21
# bad: [135c5504a600ff9b06e321694fbcac78a9530cd4] Merge tag 'drm-next-2018-06-06-1' of git://anongit.freedesktop.org/drm/drm
git bisect bad 135c5504a600ff9b06e321694fbcac78a9530cd4
# good: [5231804cf9e584f3e7e763a0d6d2fffe011c1bce] Merge tag 'leds_for_4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
git bisect good 5231804cf9e584f3e7e763a0d6d2fffe011c1bce
# bad: [315852b422972e6ebb1dfddaadada09e46a2681a] drm: rcar-du: Fix build failure
git bisect bad 315852b422972e6ebb1dfddaadada09e46a2681a
# bad: [e71a82d8c1fa28ab048227df929e4f07d98f1656] Revert "drm/i915/cnl: Use mmio access to context status buffer"
git bisect bad e71a82d8c1fa28ab048227df929e4f07d98f1656
# bad: [672e314b21dc614894e69bb56a2b55cc7d256810] drm/i915/kbl: Add KBL GT2 sku
git bisect bad 672e314b21dc614894e69bb56a2b55cc7d256810
# good: [72f775fa284886893bec4a189ed38ac30e2535aa] drm/i915: use name from intel_shared_dpll.info
git bisect good 72f775fa284886893bec4a189ed38ac30e2535aa
# good: [bba0869b18e44ff2f713c98575ddad8c7c5e9b10] drm/i915: Treat i915_reset_engine() as guilty until proven innocent
git bisect good bba0869b18e44ff2f713c98575ddad8c7c5e9b10
# good: [7d3c425fefb91da7e984a43ba27dff6cdd53758a] drm/i915: Move a bunch of workaround-related code to its own file
git bisect good 7d3c425fefb91da7e984a43ba27dff6cdd53758a
# bad: [e307126a2c8e792a4b426ee3ab827d1285544e12] drm/i915/dsi: improve dphy param limits logging
git bisect bad e307126a2c8e792a4b426ee3ab827d1285544e12
# bad: [f4ecfbfc32ed0cb502374164638d14c4fb03e916] drm/i915: Check whitelist registers across resets
git bisect bad f4ecfbfc32ed0cb502374164638d14c4fb03e916
# bad: [e53a1058395435b8801591361b2be18adda869ff] drm/i915/bios: reduce the scope of some local variables in parse_ddi_port()
git bisect bad e53a1058395435b8801591361b2be18adda869ff
# bad: [f212bf9abe5de9f938fecea7df07046e74052dde] drm/i915/bios: filter out invalid DDC pins from VBT child devices
git bisect bad f212bf9abe5de9f938fecea7df07046e74052dde
# bad: [59b449d5c82af03acdfc3f9a343c9d085ab5568f] drm/i915: Split out functions for different kinds of workarounds
git bisect bad 59b449d5c82af03acdfc3f9a343c9d085ab5568f
# first bad commit: [59b449d5c82af03acdfc3f9a343c9d085ab5568f] drm/i915: Split out functions for different kinds of workarounds
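
For a sense of scale, bisection roughly halves the candidate range at every step, so the number of kernels to build and test grows with the base-2 logarithm of the commit count. The sketch below assumes an idealized binary search over the "nearly 6000 commits" figure from this comment; git's real algorithm works on a merge-heavy history and can need a step or two more or fewer (the log above records 14 verdicts after the two endpoints). The function name is hypothetical.

```python
import math


def expected_bisect_steps(commits):
    """Idealized number of test steps to isolate one commit out of `commits`.

    Assumes a plain binary search; real `git bisect` runs on a commit graph
    with merges, so the actual count can differ slightly.
    """
    if commits < 1:
        raise ValueError("commits must be at least 1")
    return math.ceil(math.log2(commits))


print(expected_bisect_steps(6000))
```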
Comment 15 Mika Kuoppala 2018-11-08 12:53:02 UTC
Thanks for bisect. There isn't anything obvious in the commit
tho. Have you tried to produce the hang with and without the commit?

Also noticed from your dumps that you have iommu on, so could you also
try with intel_iommu=igfx_off? Thanks
Comment 16 leozinho29_eu 2018-11-08 15:39:52 UTC
Created attachment 142413 [details]
The error when building the kernel with the commit reverted

Reverting that commit makes the kernel impossible to build. I would probably need to revert multiple commits to make the kernel build again.

Every kernel from the bisect was tested as follows: I would reproduce a GPU hang (https://bugs.freedesktop.org/show_bug.cgi?id=108531 was a reliable way to cause one) and then use the computer normally. It could crash after a few seconds or after a few minutes. If it did not crash after 1 hour, I would start an iGVT-g guest to put a significant load on the computer.

As the GPU hang was a different bug, which seems fixed (thank you), the criteria were:

GPU hang and crash later: bad;
GPU hang and don't crash: good

It seems:

[7d3c425fefb91da7e984a43ba27dff6cdd53758a] drm/i915: Move a bunch of workaround-related code to its own file

Is the last commit before the bad commit.

In comment 5 I have test results with intel_iommu=igfx_off, and using this option the system crashes too.
Comment 17 Tvrtko Ursulin 2018-11-23 13:18:41 UTC
Could you try removing all i915 options from the command line, and also the iommu ones? Just have intel_iommu=igfx_off.

Then if still hard hang, add i915.reset=1 and see if that helps.
Comment 18 leozinho29_eu 2018-11-23 21:27:49 UTC
Created attachment 142602 [details]
dmesg from crash with proposed Ubuntu HWE kernel

The option i915.reset=1 seems to work: with it, there is no system crash after the GPU hang, at least for now (more than 20 minutes). Setting intel_iommu=igfx_off and removing the other i915 options had no effect; only i915.reset=1 worked.

The attached file has dmesg from a system crash with 4.18.0-42, the HWE kernel for Ubuntu Bionic Beaver.

At least for now, i915.reset=1 seems to solve the system crash.
Comment 19 Tvrtko Ursulin 2018-11-26 14:18:28 UTC
Created attachment 142614 [details] [review]
Bring back GT workaround application on the engine reset path

That i915.reset=1 helps suggests your bisect result might be valid, since the indicated commit did remove some workaround application from the engine reset path (which i915.reset=1 disables).

Would you be able to test this patch on top of a known bad (for you) drm-tip kernel? It is only an exploratory patch at this stage, which puts back a class of workaround application on the engine reset path. If so, this patch should be tested with the default i915.reset=2 modparam.
Comment 20 leozinho29_eu 2018-11-26 18:23:52 UTC
I have tested kernel commit 59b449d5c82af03acdfc3f9a343c9d085ab5568f, which is the first bad commit. I did not set i915.reset=1 on the kernel command line, and the output of `sudo cat reset` is 2. I built one kernel without the patch and another with the patch.

The kernel without the patch crashed some time after a GPU hang.

The kernel with the patch is still working after the GPU hang after 2 hours and multiple GPU loads applied to it: games, videos and iGVT-g virtual machines.

For now (who knows if it will crash even later) the patch seems to solve the problem.
Comment 21 Tvrtko Ursulin 2018-11-28 10:21:18 UTC
Thanks for testing this.

I am working on a proper fix and until then i915.reset=1 is the safest workaround to use.
Comment 22 Tvrtko Ursulin 2018-12-03 17:54:53 UTC
Would you be able to test the first two patches from this series: https://patchwork.freedesktop.org/series/53313/ ? Those two applied together and in sequence should fix the issue you are seeing.
Comment 23 leozinho29_eu 2018-12-03 23:07:01 UTC
Four hours after a GPU hang, drm-tip HEAD 3eebbff19df5f58ddc16a132567dd45717f0753a with the first two patches applied is still working normally, while drm-tip HEAD 3eebbff19df5f58ddc16a132567dd45717f0753a without the patches is crashing.

These two patches seem to solve the issue for me.

The kernel command line of the current boot, which is still working well 4 hours after the GPU hang:

BOOT_IMAGE=/boot/vmlinuz-4.20.0-rc5-1-2-ursulin+ root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 intel_iommu=on iommu=pt i915.enable_gvt=1 i915.fastboot=1 resume=UUID=d8cfc834-ef4f-4a15-9c4e-403c9cbd0685 mtrr_gran_size=2M mtrr_chunk_size=64M cgroup_enable=memory swapaccount=1 zswap.enabled=1 log_buf_len=64M usbhid.quirks=0x0079:0x0006:0x100000 config_scsi_mq_default=y scsi_mod.use_blk_mq=1
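
To see at a glance which i915 and IOMMU options a boot used, and whether the i915.reset workaround was in effect, the command line can be split into key/value pairs. This is a minimal sketch: the function name is hypothetical, the command line is a shortened copy of the one quoted above, and the fallback of "2" for i915.reset is the default mentioned in comment 19. On a live system the string would typically come from /proc/cmdline. Quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {parameter: value} dict.

    Flags without '=' (for example 'quiet') map to None. Only the first
    '=' separates key and value, so 'root=UUID=abc' keeps 'UUID=abc'.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Shortened copy of the command line quoted in this comment.
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-4.20.0-rc5-1-2-ursulin+ ro quiet "
    "intel_iommu=on iommu=pt i915.enable_gvt=1 i915.fastboot=1 "
    "log_buf_len=64M"
)

params = parse_cmdline(CMDLINE)
i915_opts = {k: v for k, v in params.items() if k.startswith("i915.")}
# i915.reset is absent here, so the assumed default of 2 applies.
reset_mode = params.get("i915.reset", "2")
print(i915_opts, reset_mode)
```

Because i915.reset does not appear on this command line, the test ran with engine resets enabled, which is the configuration the patches were meant to fix.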
Comment 24 Tvrtko Ursulin 2018-12-04 13:08:57 UTC
Excellent, thank you very much for testing this throughout! The fixes are now in drm-tip and should find their way to stable kernels in due time:

commit 25d140faaa25f728159eb8c304eae53d88a7f14e
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Dec 3 13:33:19 2018 +0000

    drm/i915: Record GT workarounds in a list

commit 4a15c75c42460252a63d30f03b4766a52945fb47
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Dec 3 13:33:41 2018 +0000

    drm/i915: Introduce per-engine workarounds
Comment 25 leozinho29_eu 2019-01-21 00:26:07 UTC
I don't know if this is the right place to ask, but both 4.19 and Ubuntu 4.18 kernels (4.19.16 and 4.18.0-13, respectively) still have this bug.

I reported this against the Ubuntu Bionic HWE kernel https://bugs.launchpad.net/ubuntu/+source/linux-meta-hwe-edge/+bug/1804898 but it has received no attention yet, which is unsurprising as the fix was not backported to 4.19.

The patches needed to fix this bug are pretty large, but the commit message says they were made to be as easy as possible to backport. Will they be backported to 4.19? If not, i915.reset=1 seems to be an OK workaround, but it taints the kernel.

Thank you.
