109949 – [skl] Hard lockup (Freeze) after reset failure: GPU HANG: ecode 9:0:0x85dffffb

Summary: [skl] Hard lockup (Freeze) after reset failure: GPU HANG: ecode 9:0:0x85dffffb

Status:	CLOSED NOTOURBUG

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged
Keywords:

Depends on:
Blocks:

Reported:	2019-03-10 02:15 UTC by Paul
Modified:	2019-06-12 05:58 UTC
CC List:	3 users

See Also:
i915 platform:	SKL
i915 features:	GPU hang

Attachments
kernel dmesg before freeze (83.33 KB, text/plain) 2019-03-10 02:15 UTC, Paul
kernel log with drm.debug=0x1e log_buf_len=1M (through netconsole) / 32M log (2.44 MB, application/x-gzip) 2019-03-10 02:49 UTC, Paul
drm-tip fac89f79a boot dmesg with drm.debug=0x1e log_buf_len=4M (3.26 MB, text/x-log) 2019-05-07 07:17 UTC, Arcadiy Ivanov
2019-05-08 card0 error dump (55.50 KB, text/plain) 2019-05-08 08:26 UTC, Arcadiy Ivanov
2019-05-10 drm-tip GPU HANG: ecode 9:0:0x00000000, hang on vcs0 error dump (57.51 KB, text/plain) 2019-05-11 16:11 UTC, Arcadiy Ivanov
2019-05-11 5.0.13 GPU Hang vcs0 (56.57 KB, text/plain) 2019-05-11 17:42 UTC, Arcadiy Ivanov

Description Paul 2019-03-10 02:15:38 UTC

Created attachment 143604
kernel dmesg before freeze

I am not sure how to reproduce this 100% of the time, but in my case opening ~30 Chromium tabs with YouTube videos playing should cause a hard lockup: the system and Xorg freeze, audio hangs on a repeating chunk, and sysrq/ping/network/netconsole/ssh stop working.

The only dmesg I could recover contained something like this:

[  174.026397] i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in  [0], hang on rcs0
[  174.026403] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  174.026406] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  174.026407] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  174.026409] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  174.026411] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  174.027425] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  174.028043] i915 0000:00:02.0: Resetting chip for hang on rcs0

Well, I am not exactly sure how I should continue debugging this because it's basically a hard-lockup.

Note: I am not sure when it started but I've been having this issue for quite some time (This also occurred with 4.18.x, and perhaps even before that, although I can't quite remember when it started).

System info:

CPU: Intel i7-6700K (Not overclocked)
Kernel: Linux - 5.0.0-arch1-2-ARCH-02093-g2988dab1bd13 #1 SMP PREEMPT Sat Mar 9 22:56:40 CET 2019 x86_64 GNU/Linux (drm-tip/24962f1aef49db97e09c7942157a2dbc973b546b)
Distro: Arch Linux
Display Connector: eDP
chromium: 72.0.3626.121 (with hw accel enabled + intel-hybrid-driver)

Comment 1 Paul 2019-03-10 02:49:01 UTC

Created attachment 143605
kernel log with drm.debug=0x1e log_buf_len=1M (through netconsole) / 32M log
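
For anyone trying to reproduce this capture setup: netconsole takes a single parameter of the form `src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac`. The following is only a minimal sketch of assembling that string; the helper name `netconsole_param` and every address in the example are hypothetical placeholders, not values from this report.

```shell
#!/bin/sh
# Build the value for the netconsole= module/boot parameter.
# Arguments: src_port src_ip dev tgt_port tgt_ip tgt_mac
# (helper name and all example addresses are hypothetical)
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example, as root on the machine that freezes:
#   modprobe netconsole \
#     "netconsole=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 aa:bb:cc:dd:ee:ff)"
# The receiver then needs something listening on the target UDP port,
# for instance a syslog daemon or a netcat variant.
```

Sending to a second machine matters here because nothing written locally can be read back after the freeze.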

Comment 2 Paul 2019-03-10 02:54:58 UTC

Weird observation: just when the hang happens, my netconsole receiver is no longer reachable over the network (ssh). It seems to be flooded by network packets coming from the main computer, which is unresponsive. After disconnecting the main computer from the switch, the receiver is reachable again.

I found this in the kernel log (netconsole receiver side):

[203009.876199] NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out
[203009.876248] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x21a/0x220
[203009.876252] Modules linked in: veth ip6t_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_filter ip6_tables ipt_MASQUERADE xt_CHECKSUM xt_comment xt_tcpudp iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_filter bridge stp llc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core intel_pmc_ipc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 nls_iso8859_1 nls_cp437 crct10dif_pclmul vfat crc32_pclmul fat ghash_clmulni_intel kvmgt vfio_mdev mdev vfio_iommu_type1 vfio snd_soc_skl kvm snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress irqbypass ac97_bus i2c_algo_bit snd_pcm_dmaengine drm_kms_helper snd_hda_intel snd_hda_codec snd_hda_core drm ppdev snd_hwdep aesni_intel snd_pcm snd_timer aes_x86_64 crypto_simd snd intel_gtt cryptd glue_helper
[203009.876338]  intel_cstate intel_rapl_perf r8169 agpgart wdat_wdt mei_me processor_thermal_device pcspkr tpm_crb soundcore realtek intel_soc_dts_iosf syscopyarea libphy sysfillrect mei sysimgblt i2c_i801 fb_sys_fops evdev tpm_tis parport_pc mac_hid parport tpm_tis_core int3400_thermal tpm acpi_thermal_rel int3406_thermal pcc_cpufreq rng_core dptf_power int3403_thermal int340x_thermal_zone ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod ahci libahci libata crc32c_intel xhci_pci xhci_hcd scsi_mod
[203009.876398] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.0.0-arch1-1-ARCH #1
[203009.876401] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105B-ITX, BIOS P1.30 05/04/2018
[203009.876407] RIP: 0010:dev_watchdog+0x21a/0x220
[203009.876411] Code: 49 63 4c 24 e0 eb 8c 4c 89 ef c6 05 d2 5f bf 00 01 e8 0a 72 fc ff 89 d9 4c 89 ee 48 c7 c7 38 62 73 98 48 89 c2 e8 20 ec 96 ff <0f> 0b eb be 66 90 0f 1f 44 00 00 48 c7 47 08 00 00 00 00 48 c7 07
[203009.876414] RSP: 0018:ffff9e2c37f03e78 EFLAGS: 00010286
[203009.876418] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[203009.876421] RDX: 0000000000000103 RSI: 00000000000000f6 RDI: 00000000ffffffff
[203009.876424] RBP: ffff9e2c3597245c R08: 0000000000000001 R09: 0000000000000366
[203009.876426] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9e2c35972480
[203009.876429] R13: ffff9e2c35972000 R14: 0000000000000001 R15: ffff9e2c349ee680
[203009.876433] FS:  0000000000000000(0000) GS:ffff9e2c37f00000(0000) knlGS:0000000000000000
[203009.876436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203009.876439] CR2: 00007f81eee7cd0f CR3: 000000000920e000 CR4: 0000000000340ee0
[203009.876442] Call Trace:
[203009.876449]  <IRQ>
[203009.876458]  ? qdisc_put_unlocked+0x30/0x30
[203009.876462]  ? qdisc_put_unlocked+0x30/0x30
[203009.876471]  call_timer_fn+0x2b/0x160
[203009.876477]  ? qdisc_put_unlocked+0x30/0x30
[203009.876482]  expire_timers+0x99/0x110
[203009.876489]  run_timer_softirq+0x8a/0x160
[203009.876497]  ? sched_clock+0x5/0x10
[203009.876503]  ? sched_clock_cpu+0xe/0xd0
[203009.876511]  __do_softirq+0x112/0x356
[203009.876520]  irq_exit+0xd9/0xf0
[203009.876526]  smp_apic_timer_interrupt+0x87/0x180
[203009.876531]  apic_timer_interrupt+0xf/0x20
[203009.876535]  </IRQ>
[203009.876543] RIP: 0010:cpuidle_enter_state+0xbc/0x480
[203009.876547] Code: e8 19 08 a3 ff 80 7c 24 13 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 99 03 00 00 31 ff e8 0b 2b a9 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 c4 02 00 00 49 63 cc 4c 8b 3c 24 4c 2b 7c 24 08 48
[203009.876549] RSP: 0018:ffffb9dd40d03e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[203009.876553] RAX: ffff9e2c37f00000 RBX: ffffffff988b8180 RCX: 000000000000001f
[203009.876556] RDX: 0000b8a2eb8d9a10 RSI: ffffffff986aa7d8 RDI: ffffffff986b2e68
[203009.876559] RBP: ffff9e2c37f2ad00 R08: 0000000000000000 R09: 0000000000021500
[203009.876561] R10: 000114870be2a1fa R11: ffff9e2c37f20be4 R12: 0000000000000007
[203009.876564] R13: ffffffff988b8438 R14: 0000000000000007 R15: 0000000000000000
[203009.876574]  ? cpuidle_enter_state+0x97/0x480
[203009.876582]  do_idle+0x217/0x250
[203009.876589]  cpu_startup_entry+0x19/0x20
[203009.876595]  start_secondary+0x1aa/0x200
[203009.876603]  secondary_startup_64+0xa4/0xb0
[203009.876611] ---[ end trace 5d229bf56124be76 ]---
[203009.971055] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
[203030.027038] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
[203050.080567] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).
[203069.934074] r8169 0000:02:00.0 enp2s0: rtl_txcfg_empty_cond == 0 (loop: 666, delay: 100).

I don't even know what this means, in general. How can the frozen system spam the other system?

Comment 3 Chris Wilson 2019-03-10 10:07:06 UTC

[  378.776909] i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in  [0], hang on rcs0
[  378.776915] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  378.776917] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  378.776919] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  378.776921] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  378.776923] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  378.777957] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  378.779069] [drm:gen6_hw_domain_reset [i915]] Wait for 0x00000002 engines reset failed
[  378.779087] [drm:i915_reset_engine [i915]] Failed to reset rcs0, ret=-110
[  378.779140] [drm:i915_reset_device [i915]] resetting chip
[  378.779158] i915 0000:00:02.0: Resetting chip for hang on rcs0
[  378.780690] [drm:gen6_hw_domain_reset [i915]] Wait for 0x00000001 engines reset failed
[  378.782230] [drm:gen6_hw_domain_reset [i915]] Wait for 0x00000001 engines reset failed
[  378.783769] [drm:gen6_hw_domain_reset [i915]] Wait for 0x00000001 engines reset failed

Comment 4 Lakshmi 2019-03-11 12:16:28 UTC

Paul, can you please attach the GPU crash dump, which is saved at /sys/class/drm/card0/error?

Comment 5 Paul 2019-03-11 21:12:01 UTC

(In reply to Lakshmi from comment #4)
> Paul, can you please attach GPU crash dump which is saved at
> /sys/class/drm/card0/error

I am not able to retrieve a GPU crash dump because the machine freezes hard and nothing works (no networking/ping/ssh/sysrq/...). After a hard reset the system is usable again, but /sys/class/drm/card0/error just contains: "No error state collected".

Comment 6 Lakshmi 2019-04-12 09:34:45 UTC

(In reply to Paul from comment #5)
> (In reply to Lakshmi from comment #4)
> > Paul, can you please attach GPU crash dump which is saved at
> > /sys/class/drm/card0/error
> 
> I am not able to retrieve a gpu crash dump because it's hard freezing,
> nothing really works (no networking/ping/ssh/sysrq/...). After a hard reset
> the system is reusable again but /sys/class/drm/card0/error just contains:
> "No error state collected".

Are you still having the issue with no error file? Have you tried updating the kernel to the latest, e.g. drm-tip?

Comment 7 Paul 2019-04-18 17:30:59 UTC

Sorry about the delay. I am currently unable to test drm-tip because I don't have access to the affected system right now. I will report back in a few days and see if I can still reproduce this.

Comment 8 Arcadiy Ivanov 2019-05-02 23:52:08 UTC

I'm having the same issue with a Dell Precision Mobile 7510, Fedora 30, kernel 5.0.9.

Details of kernel configuration are here: https://www.ivanov.biz/2019/howto-optimize-intel-graphics-performance-fedora-kde-linux-laptop/

The hard freeze prevents me from getting any crash information whatsoever, but it too occurs with many Chromium tabs, albeit only with one video playing. The failure is sporadic.

[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.0.9-301.fc30.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rootflags=discard rhgb rd.driver.blacklist=nouveau i915.enable_dc=2 i915.disable_power_well=0 i915.enable_fbc=1 i915.enable_guc=3 i915.enable_dpcd_backlight=1 l1tf=flush
[    0.639085] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.0.9-301.fc30.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rootflags=discard rhgb rd.driver.blacklist=nouveau i915.enable_dc=2 i915.disable_power_well=0 i915.enable_fbc=1 i915.enable_guc=3 i915.enable_dpcd_backlight=1 l1tf=flush
[    3.736491] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.736778] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)
[    3.748478] [drm] HuC: Loaded firmware i915/skl_huc_ver01_07_1398.bin (version 1.7)
[    3.758925] [drm] GuC: Loaded firmware i915/skl_guc_ver9_33.bin (version 9.33)
[    3.770134] i915 0000:00:02.0: GuC firmware version 9.33
[    3.770138] i915 0000:00:02.0: GuC submission enabled
[    3.770139] i915 0000:00:02.0: HuC enabled
[    3.771606] [drm] Initialized i915 1.6.0 20181204 for 0000:00:02.0 on minor 0
[    3.815092] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    5.483490] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])

Comment 9 Lakshmi 2019-05-03 05:52:19 UTC

(In reply to Arcadiy Ivanov from comment #8)
> ...

Can you verify this issue on the latest drm-tip (https://cgit.freedesktop.org/drm-tip) and send the dmesg from boot with drm.debug=0x1e log_buf_len=4M?
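
As a side note on what `drm.debug=0x1e` selects: the value is a bitmask of debug categories. The sketch below decodes it, assuming the bit order of the kernel's `DRM_UT_*` definitions (CORE = 0x01, DRIVER = 0x02, KMS = 0x04, and so on); the function name `decode_drm_debug` is made up for illustration, and the category list may differ between kernel versions.

```shell
#!/bin/sh
# Decode a drm.debug bitmask into category names.
# Bit order is an assumption based on the kernel's DRM_UT_* definitions.
decode_drm_debug() {
    mask=$(( $1 ))
    bit=1
    out=
    for name in CORE DRIVER KMS PRIME ATOMIC VBL STATE LEASE DP; do
        if [ $(( mask & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit << 1 ))
    done
    echo "$out"
}

decode_drm_debug 0x1e   # prints: DRIVER KMS PRIME ATOMIC
```

Under that assumption, 0x1e leaves out the very chatty CORE and VBL categories, though the remaining ones can still fill a 4M log buffer quickly.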

Comment 10 Arcadiy Ivanov 2019-05-05 09:07:15 UTC

I could build drm-tip and enable the kernel options mentioned above, but there is no chance I'll be able to retrieve the logs if the system locks up.

The system experiences a full hard freeze. I won't be able to show you the dmesg content because I can't drop into a TTY: the keyboard is frozen, there is no disk activity, no response to suspend, and no initialization of new devices when they are plugged in. PulseAudio, depending on what was playing at the time, is stuck repeating a buffer loop. The only way out is a power-off.

Are you still interested, or won't that be enough?

Comment 11 Arcadiy Ivanov 2019-05-05 21:48:02 UTC

(In reply to Lakshmi from comment #9)
> (In reply to Arcadiy Ivanov from comment #8)
> ...
> Can you verify this issue on latest drm-tip:
> https://cgit.freedesktop.org/drm-tip and send dmesg from boot with
> drm.debug=0x1e log_buf_len=4M?

@lakshmi Do you by any chance have a 5.0.x-based drm-tip?

Comment 12 Jani Saarinen 2019-05-06 05:49:53 UTC

Hi,
Why would you need that if we are asking for testing on our pre-upstream tree?
If it is still needed, you can get it from git://anongit.freedesktop.org/gfx-ci/linux
and search for the tag you need to build, e.g. CI_DRM_570x should be quite close.

Comment 13 Arcadiy Ivanov 2019-05-07 06:45:36 UTC

I wanted to repro with Fedora-based patches. But anyway, I reproduced with fac89f79a454771f without debug and will post debug once available.

Comment 14 Arcadiy Ivanov 2019-05-07 07:17:13 UTC

Created attachment 144185
drm-tip fac89f79a boot dmesg with drm.debug=0x1e log_buf_len=4M

Comment 15 Arcadiy Ivanov 2019-05-07 14:49:44 UTC

I've noticed that the rate of lockups with debug info is higher than without, but that could be a random fluke.

Comment 16 Arcadiy Ivanov 2019-05-08 03:20:57 UTC

After running drm-tip 5.1.0 I can confirm that lockups are much more frequent with or without debug logging vs Fedora 5.0.11.

Comment 17 Lakshmi 2019-05-08 06:47:30 UTC

(In reply to Arcadiy Ivanov from comment #16)
> After running drm-tip 5.1.0 I can confirm that lockups are much more
> frequent with or without debug logging vs Fedora 5.0.11.

The original issue is related to GPU hang. But in the attached log I can't find any information about the hang.

Can you check if you have the crash dump file /sys/class/drm/card0/error?

Comment 18 Arcadiy Ivanov 2019-05-08 07:36:34 UTC

> The original issue is related to GPU hang. But in the attached log I can't find any information about the hang.

>> [...] and send dmesg from boot with drm.debug=0x1e log_buf_len=4M?

@lakshmi

Yes, it's not in the logs. 

Firstly, the volume of log output is such that the dmesg buffer overflows within 1 minute of boot.

Secondly, there is no way for me to retrieve them even if they were somehow generated. Once the GPU lock happens, everything stops. And I do mean EVERYTHING - the machine is hard-locked with absolutely no possibility of access until a hard power-off:
- X is completely frozen
- there is no disk activity
- keyboard is completely unresponsive 
- no ctrl-alt-delete or switching to tty or ctrl-alt-backspace
- plugging in external devices or unplugging them produces no response
- closing laptop doesn't work and the laptop does not suspend

The machine is completely frozen.

> Can you check if you have the crash dump file /sys/class/drm/card0/error?

The contents of `error` do not survive the power-off, so I'm not able to retrieve the data. I'll try to see if I can collect it somehow with inotify on `/sys/class/drm/card0/error` and dump it.

The reason I suspect this is a GPU hang is that video playback freezes while the hardware mouse cursor stays active for a few seconds, until the full freeze takes over the machine.

Is there any reliable way to make the card crash dump survive the reboot?

Comment 19 Arcadiy Ivanov 2019-05-08 07:57:47 UTC

My current i915 module settings:

Module: i915
Parameter: alpha_support --> N
Parameter: disable_display --> N
Parameter: disable_power_well --> 0
Parameter: dmc_firmware_path --> (null)
Parameter: edp_vswing --> 0
Parameter: enable_dc --> 2
Parameter: enable_dpcd_backlight --> Y
Parameter: enable_dp_mst --> Y
Parameter: enable_fbc --> 1
Parameter: enable_guc --> 3
Parameter: enable_gvt --> N
Parameter: enable_hangcheck --> Y
Parameter: enable_ips --> 1
Parameter: enable_psr --> 0
Parameter: error_capture --> Y
Parameter: fastboot --> -1
Parameter: force_reset_modeset_test --> N
Parameter: guc_firmware_path --> (null)
Parameter: guc_log_level --> 1
Parameter: huc_firmware_path --> (null)
Parameter: invert_brightness --> 0
Parameter: load_detect_test --> N
Parameter: lvds_channel_mode --> 0
Parameter: mmio_debug --> 0
Parameter: modeset --> -1
Parameter: nuclear_pageflip --> N
Parameter: panel_use_ssc --> -1
Parameter: prefault_disable --> N
Parameter: reset --> 2
Parameter: vbt_firmware --> (null)
Parameter: vbt_sdvo_panel_type --> -1
Parameter: verbose_state_checks --> Y
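
A listing in this format can be generated from sysfs. The following is a minimal sketch rather than the command used for the output above; the function name `dump_module_params` and its optional second argument (an alternative sysfs root, there only so the loop can be tried on a scratch directory) are assumptions made for illustration.

```shell
#!/bin/sh
# Print every parameter of a loaded module as "Parameter: name --> value".
# $1: module name; $2: sysfs module root (defaults to /sys/module)
dump_module_params() {
    module=$1
    root=${2:-/sys/module}
    echo "Module: $module"
    for f in "$root/$module/parameters"/*; do
        [ -f "$f" ] || continue
        # Some parameters are readable by root only; show a placeholder then.
        value=$(cat "$f" 2>/dev/null) || value='(unreadable)'
        echo "Parameter: $(basename "$f") --> $value"
    done
}

# Usage: dump_module_params i915
```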

Comment 20 Arcadiy Ivanov 2019-05-08 08:26:08 UTC

Created attachment 144194
2019-05-08 card0 error dump

Comment 21 Arcadiy Ivanov 2019-05-08 08:29:10 UTC

Got the GPU hang dump via the following script:

```
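#!/bin/bash -eEu

# Poll the i915 error state until a GPU hang has been recorded.
while grep -q '^No error' /sys/class/drm/card0/error; do
    sleep 0.3
done

# Copy the dump to disk and flush it before the machine locks up.
cat /sys/class/drm/card0/error > /card0_error.dump
sync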
```

running it as 

`sudo nohup ./monitor_gpu_error.sh &`

Comment 22 Arcadiy Ivanov 2019-05-08 09:39:46 UTC

Just experienced a lockup on drm-tip as well (the captured one is on 5.0.11), but no actual dump was captured before the machine completely froze.

Comment 23 Arcadiy Ivanov 2019-05-11 10:39:15 UTC

@lakshmi Do you require any additional info, or are you satisfied with the dump?

Comment 24 Arcadiy Ivanov 2019-05-11 16:11:37 UTC

Created attachment 144234
2019-05-10 drm-tip GPU HANG: ecode 9:0:0x00000000, hang on vcs0 error dump

This happened on drm-tip. I had to drop the monitoring script's recheck interval down to 100ms to capture the error.
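
A parameterized variant of that polling loop might look like the sketch below. The function name `capture_error_state` and its argument order are assumptions for illustration; the defaults mirror the paths used earlier in this thread and the 100ms interval mentioned here. Fractional `sleep` values work with GNU coreutils but are not guaranteed by POSIX.

```shell
#!/bin/sh
# Wait until the error-state file stops reporting "No error", then save it.
# $1: error-state file, $2: output file, $3: poll interval in seconds
capture_error_state() {
    src=${1:-/sys/class/drm/card0/error}
    dst=${2:-/card0_error.dump}
    interval=${3:-0.1}
    while grep -q '^No error' "$src"; do
        sleep "$interval"
    done
    # Copy the dump and flush it to disk right away.
    cat "$src" > "$dst"
    sync
}

# Usage (as root): capture_error_state
```

A shorter interval narrows the window between the hang being recorded and the machine freezing, at the cost of more frequent wakeups.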

Comment 25 Arcadiy Ivanov 2019-05-11 16:18:30 UTC

Looking at the 2019-05-08 and 2019-05-10 dumps, we're talking about two different causes:

GPU HANG: ecode 9:0:0x85dffffb, in chromium-vaapi [4523], reason: hang on rcs0, action: reset
Kernel: 5.0.11-300.fc30.x86_64

vs

GPU HANG: ecode 9:0:0x00000000, hang on vcs0
Kernel: 5.1.0+ x86_64
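
To compare hang signatures across dumps or dmesg lines, the three `ecode` fields can be pulled out with a small helper. This is only a sketch: the name `parse_ecode` is hypothetical, and the meaning of each field (apparently GPU generation, an engine identifier, and a signature of the hang state) is an assumption rather than something stated in this thread.

```shell
#!/bin/sh
# Extract the three colon-separated ecode fields from a "GPU HANG" line.
# Prints them space-separated; prints nothing if the line does not match.
parse_ecode() {
    printf '%s\n' "$1" |
        sed -n 's/.*GPU HANG: ecode \([0-9]*\):\([0-9a-fA-F]*\):\(0x[0-9a-fA-F]*\).*/\1 \2 \3/p'
}

# Usage: parse_ecode "$(grep 'GPU HANG' /card0_error.dump)"
```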

Comment 26 Arcadiy Ivanov 2019-05-11 17:38:01 UTC

My machine just survived a GPU hang on F30 5.0.13 and remained operational. 

[ 4860.674981] [drm] GPU HANG: ecode 9:2:0xa8dfbffd, in chromium-vaapi [4647], reason: hang on vcs0, action: reset
[ 4860.674984] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 4860.674985] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 4860.674986] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 4860.674986] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 4860.674987] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 4860.675052] i915 0000:00:02.0: Resetting vcs0 for hang on vcs0
[ 4868.658707] i915 0000:00:02.0: Resetting vcs0 for hang on vcs0
[ 4876.658545] i915 0000:00:02.0: Resetting vcs0 for hang on vcs0
[ 4884.659343] i915 0000:00:02.0: Resetting vcs0 for hang on vcs0
[ 4892.658365] i915 0000:00:02.0: Resetting vcs0 for hang on vcs0

Comment 27 Arcadiy Ivanov 2019-05-11 17:42:32 UTC

Created attachment 144235
2019-05-11 5.0.13 GPU Hang vcs0

GPU HANG: ecode 9:2:0xa8dfbffd, in chromium-vaapi [4647], reason: hang on vcs0, action: reset

Comment 28 Arcadiy Ivanov 2019-05-19 04:13:22 UTC

Any news on this?

Comment 29 Lakshmi 2019-05-20 07:46:50 UTC

(In reply to Arcadiy Ivanov from comment #28)
> Any news on this?

Sorry for the delay. There is a high chance that this is a VAAPI driver bug.
Can you please file this bug against the VAAPI driver at https://github.com/intel/intel-vaapi-driver/issues/new

Closing this issue as NOTOURBUG.

Comment 30 Arcadiy Ivanov 2019-05-20 17:07:12 UTC

I'm terribly sorry, but are you suggesting that a full kernel-level lockup can be caused by a non-privileged userland API?

Comment 31 Arcadiy Ivanov 2019-05-20 17:14:01 UTC

Also, could you please document some reasoning and provide evidence from the attachments for why this should not be addressed in the i915 kernel module?

Comment 32 Paul 2019-05-25 13:56:51 UTC

(In reply to Arcadiy Ivanov from comment #30)
> I'm terribly sorry, but are you suggesting that a full kernel-level lockup
> can be caused by a non-privileged userland API?

I second that. The bug still exists even with the latest drm-tip, and I am unable to recover anything.

Comment 33 Chris Wilson 2019-05-27 08:12:00 UTC

Do what exactly? Userspace hangs the GPU, the kernel detects the hung GPU, resets the hung GPU and if userspace keeps on doing so, kills the userspace. We can't magically fix userspace.

Comment 34 Arcadiy Ivanov 2019-06-06 15:33:46 UTC

(In reply to Chris Wilson from comment #33)
> Do what exactly? Userspace hangs the GPU, the kernel detects the hung GPU,
> resets the hung GPU and if userspace keeps on doing so, kills the userspace.
> We can't magically fix userspace.

Chris, this is not what we're discussing here. If that were the issue, nobody would make a peep!

The issue isn't that userspace hangs the GPU and the GPU then kills userspace. The issue is that userspace hangs the GPU AND ***kills the kernel***: the GPU is hung, but it's not userspace that dies, it's the kernel that seizes up completely - no IO, no interrupts. Everything dies.

Comment 35 Arcadiy Ivanov 2019-06-10 17:08:43 UTC

@Chris any thoughts on the matter?

Comment 36 Chris Wilson 2019-06-11 08:29:14 UTC

The kernel didn't die, the machine did. The answer to that is don't hang the GPU, and the GPU won't kill the machine.

Comment 37 Arcadiy Ivanov 2019-06-11 12:20:55 UTC

Chris, I'm not sure I understand your position. Are you saying it's a normal state of affairs that userland can, via a series of unprivileged library calls that go through the i915 kernel driver, kill a machine?

I dare say that presents a wonderful exploit opportunity, where appropriately crafted **unprivileged** userland code would kill any Linux machine with an Intel GPU.
