Bug 111919 - Intel card (Coffeelake) short freezes (hang) after upgrade to kernel 5.3.4
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-07 14:33 UTC by Stanislav Ochotnicky
Modified: 2019-10-15 07:01 UTC

See Also:
i915 platform: CFL
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error output (16.31 KB, text/plain)
2019-10-07 14:34 UTC, Stanislav Ochotnicky
/sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null (16.26 KB, text/plain)
2019-10-13 15:52 UTC, Stanislav Ochotnicky
gpu crash dump (16.46 KB, text/plain)
2019-10-14 19:53 UTC, jakov.ivkovic

Description Stanislav Ochotnicky 2019-10-07 14:33:48 UTC
I updated my kernel to 5.3.4 today and have since had a few short UI freezes, each lasting a few seconds. The system recovered from each freeze, and I found the following in my dmesg:

[101184.000657] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[101184.000659] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[101184.000659] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[101184.000660] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[101184.000660] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[101184.000660] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[101184.001664] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[101394.022014] usb 1-11.3: reset high-speed USB device number 6 using xhci_hcd
[101752.002696] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0


I can provide a list of further system components/libraries, but as far as I can tell this is related to the 5.3.4 kernel update. I've seen the freezes a few times, mostly during web browser usage, it seems. I am attaching the card error output from sysfs. If more info or testing is needed, let me know.
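For anyone collecting the same data, the hang and reset lines can be pulled out of a dmesg capture automatically. The following is a minimal sketch, assuming the message format shown in the log excerpt above; the function name `parse_i915_events` and the regular expressions are illustrative and not part of any i915 tooling.

```python
import re

# Patterns modelled on the dmesg lines quoted in this report (assumed format).
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+ GPU HANG: "
    r"ecode (?P<ecode>[^,\s]+), hang on (?P<engine>\w+)"
)
RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+ Resetting (?P<engine>\w+) for "
)


def parse_i915_events(lines):
    """Return (timestamp_seconds, kind, engine) tuples for hang/reset lines."""
    events = []
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            events.append((float(match.group("ts")), "hang", match.group("engine")))
            continue
        match = RESET_RE.search(line)
        if match:
            events.append((float(match.group("ts")), "reset", match.group("engine")))
    return events
```

Feeding it the output of `dmesg` split into lines gives a compact timeline of hangs and engine resets that can be attached alongside the crash dump.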
Comment 1 Stanislav Ochotnicky 2019-10-07 14:34:34 UTC
Created attachment 145676 [details]
/sys/class/drm/card0/error output
Comment 2 Chris Wilson 2019-10-07 14:42:32 UTC
rcs0 command stream:
  IDLE?: no
  START: 0x00009000
  HEAD:  0x00400820 [0x00000000]
  TAIL:  0x00000820 [0x00000000, 0x00000000]
  CTL:   0x00003001
  MODE:  0x00000000
  HWS:   0xffffe000
  ACTHD: 0x00000000 00400820
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdfffff
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][1]: 0xffffffff
  ROW_INSTDONE[0][2]: 0xffffffff
  BBADDR: 0x0000fffe_ec2fca94
  BB_STATE: 0x00000020
  INSTPS: 0x00008840
  INSTPM: 0x00000000
  FADDR: 0x00000000 00009820
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x0000000b7e32f000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck timestamp: 0ms (4395844992; epoch)
  engine reset count: 0
  ELSP[0]:  pid 801, seqno       15:0011639e!, prio 2, emitted -960ms, start 00009000, head 00000780, tail 00000820
  ELSP[1]:  pid 0, seqno        5:0000178c, prio -4093, emitted -959ms, start 00001000, head 000008c0, tail 00000928
  Active context: [0] hw_id 0, prio 0, guilty 0 active 0

The GPU did not do a context switch at the end of ELSP[0].

It seems you are able to reproduce this fairly easily with your usage. Could you try setting i915.dmc_firmware_path=/dev/null on your kernel/grub command line?
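One possible way to read the HEAD and TAIL values in the dump above is sketched below. It assumes the ring HEAD register packs a wrap count in its upper bits and the byte offset in the lower bits, with mask values as I recall them from the i915 register definitions; treat those constants, and the helper names, as assumptions to verify rather than as the driver's actual decoding.

```python
# Assumed layout of the ring HEAD/TAIL registers (verify against the i915 headers).
HEAD_ADDR_MASK = 0x001FFFFC   # assumed: byte offset of the head within the ring
HEAD_WRAP_SHIFT = 21          # assumed: wrap counter lives in bits 31:21
TAIL_ADDR_MASK = 0x001FFFF8   # assumed: byte offset of the tail within the ring


def decode_ring_head(head):
    """Split a raw HEAD value into (wrap_count, offset)."""
    return head >> HEAD_WRAP_SHIFT, head & HEAD_ADDR_MASK


def ring_drained(head, tail):
    """True when the head offset has caught up with the tail offset."""
    _, offset = decode_ring_head(head)
    return offset == (tail & TAIL_ADDR_MASK)


if __name__ == "__main__":
    wraps, offset = decode_ring_head(0x00400820)
    print(f"wraps={wraps} offset={offset:#x} drained={ring_drained(0x00400820, 0x00000820)}")
```

Under these assumptions the head offset equals the tail recorded for ELSP[0], which would be consistent with the request having been consumed without the engine moving on to ELSP[1]. This is only one reading of the dump.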
Comment 3 Chris Wilson 2019-10-07 14:47:04 UTC
For the record, what is your last known good kernel version (what version did you upgrade from)?
Comment 4 Stanislav Ochotnicky 2019-10-07 15:02:11 UTC
As far as I can tell - 5.3.2 was OK but I am not 100% sure. I definitely skipped 5.3.3 during my updates so that could go either way. 

It's possible I used my PC mostly remotely during the time 5.3.2 was used so I would not have noticed any GFX issues. Let's say - 5.3.x might be affected.

I now have the system booted with i915.dmc_firmware_path=/dev/null on the kernel command line. I'll report back if I have more info (anything specific to look for?)

For now I'll just use it and see if I notice any weirdness...
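To confirm the option really reached the running kernel, the boot command line can be inspected. This is a minimal sketch, assuming the parameters are whitespace-separated `key=value` tokens as exposed in /proc/cmdline; the helper names are illustrative and quoted values containing spaces are not handled.

```python
def kernel_params(cmdline):
    """Parse a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def dmc_disabled(cmdline):
    """True when i915.dmc_firmware_path is set to /dev/null."""
    return kernel_params(cmdline).get("i915.dmc_firmware_path") == "/dev/null"
```

On a live system this could be called as `dmc_disabled(open("/proc/cmdline").read())`.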
Comment 5 Chris Wilson 2019-10-07 15:07:13 UTC
Disabling the DMC will prevent the system from reaching package C-state 8+; otherwise it should have no impact. So we are on the lookout to see if it hangs again.
Comment 6 Stanislav Ochotnicky 2019-10-07 16:32:28 UTC
FWIW, I've dug a bit more in the journal and around the same time I have these (presumably Chromium) logs:
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:shared_image_manager.cc(120)] SharedImageManager::ProduceGLTexture: Trying to produce a representation from a non-existent mailbox. 3E:FB:28:49:D0:7B:96:F0:6F:34:7A:9B:8C:07:C3:09
ERROR:gles2_cmd_decoder.cc(18508)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoCreateAndTexStorage2DSharedImageINTERNAL: invalid mailbox name
ERROR:gles2_cmd_decoder.cc(18529)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoBeginSharedImageAccessCHROMIUM: bound texture is not a shared image
ERROR:gles2_cmd_decoder.cc(18552)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoEndSharedImageAccessCHROMIUM: bound texture is not a shared image
ERROR:gles2_cmd_decoder.cc(18529)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoBeginSharedImageAccessCHROMIUM: bound texture is not a shared image


The GL ERRORS repeat for a few seconds until:
ERROR:logger.cc(46)] Too many GL errors, not reporting any more for this context. use --disable-gl-error-limit to see all errors

I'll continue running with i915.dmc_firmware_path=/dev/null and see if I can reproduce (so far I haven't been able to)
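Since the browser stops reporting after too many errors, it may help to summarise which GL calls fail most often before the limit is hit. The sketch below assumes the log line layout quoted above (`GL ERROR :<code> : <function>:`); the name `tally_gl_errors` is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the Chromium log lines quoted in this comment (assumed format).
GL_ERROR_RE = re.compile(r"GL ERROR :(?P<code>\w+) : (?P<func>\w+):")


def tally_gl_errors(lines):
    """Count occurrences of each (error code, GL function) pair."""
    counts = Counter()
    for line in lines:
        match = GL_ERROR_RE.search(line)
        if match:
            counts[(match.group("code"), match.group("func"))] += 1
    return counts
```

Lines without a `GL ERROR` marker, such as the SharedImageManager message, are simply skipped.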
Comment 7 Stanislav Ochotnicky 2019-10-13 15:52:12 UTC
Created attachment 145727 [details]
/sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null

I have managed to reproduce the hang again, even after upgrading the kernel to 5.3.5 and adding the i915.dmc_firmware_path=/dev/null kernel command line option.

Attaching new output of /sys/class/drm/card0/error

I haven't yet found an exact reproducer, but will try to dig. My current two leads/ideas are:
 * Related to suspend/resume (i.e. I don't remember seeing a hang after a fresh boot, only after a suspend/resume cycle)
 * Related to IOMMU/2nd video card being assigned to a VM

Both of the above might be wild goose chases at this point though.
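The suspend/resume lead could be checked by comparing hang timestamps against resume timestamps from the same log. The following is a rough sketch only: it takes both lists as already-extracted dmesg timestamps in seconds, and the function name and any resume values fed to it are hypothetical.

```python
from bisect import bisect_right


def seconds_since_resume(hang_times, resume_times):
    """For each hang, return seconds since the latest earlier resume, or None."""
    resumes = sorted(resume_times)
    result = []
    for hang in hang_times:
        index = bisect_right(resumes, hang)
        result.append(hang - resumes[index - 1] if index else None)
    return result
```

A `None` entry means the hang occurred with no prior resume in the log, i.e. after a fresh boot, which would argue against the suspend/resume theory.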
Comment 8 jakov.ivkovic 2019-10-14 19:53:34 UTC
Created attachment 145737 [details]
gpu crash dump

Same thing happening to me.

I can confirm that (at least in my case) the hang is not limited to the period after a suspend/resume cycle. This crash happened shortly after a reboot.
Comment 9 jakov.ivkovic 2019-10-14 19:59:01 UTC
Forgot to mention: it happens to me while running Chromium as well.
Comment 10 Lakshmi 2019-10-15 07:01:54 UTC
(In reply to Chris Wilson from comment #5)
> Disabling dmc will prevent reaching package c-state 8+, otherwise it should
> be no impact, so we are on the lookout to see if it hangs again.

(In reply to Stanislav Ochotnicky from comment #7)
> Created attachment 145727 [details]
> /sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null
> 
> I have managed to reproduce again - even after upgrading kernel to 5.3.5 and
> adding i915.dmc_firmware_path=/dev/null kernel command line option.
> 
> Attaching new output of /sys/class/drm/card0/error
> 
> I haven't yet found an exact reproducer, but will try to dig. My current two
> leads/ideas are:
>  * Related to suspend/resume (i.e. I don't remember seeing hang after fresh
> boot, only after suspend/resume cycle)
>  * Related to IOMMU/2nd video card being assigned to a VM
> 
> Both of the above might be wild goose chases at this point though.

CC'ing Chris.

