I updated my kernel to 5.3.4 today and had a few short (a few seconds long) UI freezes. The system recovered each time, and I found the following in my dmesg:
[101184.000657] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[101184.000659] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[101184.000659] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[101184.000660] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[101184.000660] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[101184.000660] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[101184.001664] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[101394.022014] usb 1-11.3: reset high-speed USB device number 6 using xhci_hcd
[101752.002696] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
I can provide a list of more system components/libraries, but as far as I can tell this is related to the 5.3.4 kernel update. I've seen the freezes in a few cases, mostly during web browser usage it seems. I am attaching the card error output from sysfs. If more info/testing etc. is needed, let me know.
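For anyone trying to line up freezes with kernel messages, here is a minimal sketch that pulls the i915 hang and reset events out of dmesg text. The sample lines are copied from the excerpt above; the function name `i915_hang_events` and the regular expressions are illustrative assumptions, not part of any i915 tooling, and they only cover the two message shapes seen in this report.

```python
import re

# Sample lines copied from the dmesg excerpt in this report.
SAMPLE = """\
[101184.000657] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
[101184.001664] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[101394.022014] usb 1-11.3: reset high-speed USB device number 6 using xhci_hcd
[101752.002696] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
"""

# "[timestamp] i915 <pci address>: <message>" (assumed shape, based on the lines above)
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+: (?P<msg>.*)$")
ENGINE_RE = re.compile(r"hang on (?P<engine>\w+)")


def i915_hang_events(text):
    """Return (timestamp, kind, engine) tuples for i915 hang/reset lines."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # not an i915 line (e.g. the usb reset above)
        msg = m.group("msg")
        if msg.startswith("GPU HANG"):
            kind = "hang"
        elif msg.startswith("Resetting"):
            kind = "reset"
        else:
            continue
        engine = ENGINE_RE.search(msg)
        events.append((float(m.group("ts")), kind,
                       engine.group("engine") if engine else None))
    return events


if __name__ == "__main__":
    for ts, kind, engine in i915_hang_events(SAMPLE):
        print(f"{ts:.6f} {kind} {engine}")
```

Feeding it the full output of `dmesg` instead of `SAMPLE` would give a list of timestamps to compare against the moments the UI froze.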
Created attachment 145676 [details] /sys/class/drm/card0/error output
rcs0 command stream:
  IDLE?: no
  START: 0x00009000
  HEAD: 0x00400820 [0x00000000]
  TAIL: 0x00000820 [0x00000000, 0x00000000]
  CTL: 0x00003001
  MODE: 0x00000000
  HWS: 0xffffe000
  ACTHD: 0x00000000 00400820
  IPEIR: 0x00000000
  IPEHR: 0x7a000004
  INSTDONE: 0xffdfffff
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  SAMPLER_INSTDONE[0][1]: 0xffffffff
  SAMPLER_INSTDONE[0][2]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][1]: 0xffffffff
  ROW_INSTDONE[0][2]: 0xffffffff
  BBADDR: 0x0000fffe_ec2fca94
  BB_STATE: 0x00000020
  INSTPS: 0x00008840
  INSTPM: 0x00000000
  FADDR: 0x00000000 00009820
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x0000000b7e32f000
  PDP1: 0x0000000000000000
  PDP2: 0x0000000000000000
  PDP3: 0x0000000000000000
  ring->head: 0x00000000
  ring->tail: 0x00000000
  hangcheck timestamp: 0ms (4395844992; epoch)
  engine reset count: 0
  ELSP[0]: pid 801, seqno 15:0011639e!, prio 2, emitted -960ms, start 00009000, head 00000780, tail 00000820
  ELSP[1]: pid 0, seqno 5:0000178c, prio -4093, emitted -959ms, start 00001000, head 000008c0, tail 00000928
  Active context: [0] hw_id 0, prio 0, guilty 0 active 0
The GPU did not do a context switch at the end of ELSP[0]. Seems like you are able to reproduce this fairly easily with your usage; could you try setting i915.dmc_firmware_path=/dev/null on your kernel/grub command line?
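To make the ELSP lines in the dump easier to compare across crash dumps, here is a rough sketch that parses them into fields. The sample text is copied from the dump quoted above; `parse_elsp` and `ring_tail` are hypothetical helper names, and the regular expressions assume the exact line shape shown in this dump, which may differ between kernel versions.

```python
import re

# Lines copied from the error dump quoted in this comment.
DUMP = """\
  TAIL: 0x00000820 [0x00000000, 0x00000000]
  ELSP[0]: pid 801, seqno 15:0011639e!, prio 2, emitted -960ms, start 00009000, head 00000780, tail 00000820
  ELSP[1]: pid 0, seqno 5:0000178c, prio -4093, emitted -959ms, start 00001000, head 000008c0, tail 00000928
"""

ELSP_RE = re.compile(
    r"ELSP\[(?P<port>\d+)\]: pid (?P<pid>\d+), seqno (?P<seqno>[^,]+), "
    r"prio (?P<prio>-?\d+), emitted (?P<emitted>-?\d+)ms, "
    r"start (?P<start>[0-9a-f]+), head (?P<head>[0-9a-f]+), "
    r"tail (?P<tail>[0-9a-f]+)"
)
# Matches the upper-case "TAIL:" register line only, not "ring->tail:".
TAIL_RE = re.compile(r"^\s*TAIL: 0x(?P<tail>[0-9a-f]+)", re.MULTILINE)


def parse_elsp(text):
    """Return one dict per ELSP[n] line, with numeric fields converted."""
    entries = []
    for m in ELSP_RE.finditer(text):
        entries.append({
            "port": int(m.group("port")),
            "pid": int(m.group("pid")),
            "seqno": m.group("seqno"),          # kept as text, e.g. "15:0011639e!"
            "prio": int(m.group("prio")),
            "emitted_ms": int(m.group("emitted")),
            "start": int(m.group("start"), 16),  # ring offsets are hex
            "head": int(m.group("head"), 16),
            "tail": int(m.group("tail"), 16),
        })
    return entries


def ring_tail(text):
    """Return the value of the TAIL register line, or None if absent."""
    m = TAIL_RE.search(text)
    return int(m.group("tail"), 16) if m else None


if __name__ == "__main__":
    tail = ring_tail(DUMP)
    for e in parse_elsp(DUMP):
        marker = "(tail matches TAIL register)" if e["tail"] == tail else ""
        print(f"ELSP[{e['port']}] pid={e['pid']} prio={e['prio']} "
              f"head={e['head']:#x} tail={e['tail']:#x} {marker}")
```

In this dump the TAIL register value equals the tail of ELSP[0]; the sketch only reports that coincidence and does not try to interpret it.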
For the record, what is your last known good kernel version (what version did you upgrade from)?
As far as I can tell, 5.3.2 was OK, but I am not 100% sure. I definitely skipped 5.3.3 during my updates, so that could go either way. It's possible I used my PC mostly remotely while 5.3.2 was installed, so I would not have noticed any GFX issues. Let's say 5.3.x might be affected. I now have the system booted with i915.dmc_firmware_path=/dev/null on the kernel command line. I'll report back if I have more info (anything specific to look for?). For now I'll just use it and see if I notice any weirdness...
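A quick way to confirm that the option actually reached the running kernel is to look for it in /proc/cmdline. The sketch below does that; `kernel_param` is a hypothetical helper, and it makes the simplifying assumptions that parameter names are compared literally and that the last occurrence is the one reported.

```python
import shlex


def kernel_param(cmdline, name):
    """Look up a parameter in a kernel command line string.

    Returns the value for "name=value", "" for a bare flag, and None if
    the parameter is absent. If it appears more than once, the last
    occurrence is returned (an assumption made for this sketch).
    """
    value = None
    for token in shlex.split(cmdline):
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""  # not on Linux, or /proc not mounted
    print(kernel_param(cmdline, "i915.dmc_firmware_path"))
```

On a system booted as described above, this would be expected to print `/dev/null`; it prints `None` when the option is missing.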
Disabling dmc will prevent reaching package c-state 8+; otherwise it should have no impact, so we are on the lookout to see if it hangs again.
FWIW, I've dug a bit more in the journal and around the same time I have these (presumably Chromium) logs:
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
ERROR:shared_image_manager.cc(120)] SharedImageManager::ProduceGLTexture: Trying to produce a representation from a non-existent mailbox. 3E:FB:28:49:D0:7B:96:F0:6F:34:7A:9B:8C:07:C3:09
ERROR:gles2_cmd_decoder.cc(18508)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoCreateAndTexStorage2DSharedImageINTERNAL: invalid mailbox name
ERROR:gles2_cmd_decoder.cc(18529)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoBeginSharedImageAccessCHROMIUM: bound texture is not a shared image
ERROR:gles2_cmd_decoder.cc(18552)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoEndSharedImageAccessCHROMIUM: bound texture is not a shared image
ERROR:gles2_cmd_decoder.cc(18529)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : DoBeginSharedImageAccessCHROMIUM: bound texture is not a shared image
The GL errors repeat for a few seconds until:
ERROR:logger.cc(46)] Too many GL errors, not reporting any more for this context. use --disable-gl-error-limit to see all errors
I'll continue running with i915.dmc_firmware_path=/dev/null and see if I can reproduce (so far I haven't been able to).
Created attachment 145727 [details]
/sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null
I have managed to reproduce it again, even after upgrading the kernel to 5.3.5 and adding the i915.dmc_firmware_path=/dev/null kernel command line option. Attaching new output of /sys/class/drm/card0/error.
I haven't yet found an exact reproducer, but will try to dig. My current two leads/ideas are:
* Related to suspend/resume (i.e. I don't remember seeing a hang after a fresh boot, only after a suspend/resume cycle)
* Related to IOMMU/2nd video card being assigned to a VM
Both of the above might be wild goose chases at this point though.
Created attachment 145737 [details]
gpu crash dump
The same thing is happening to me. I can confirm that (at least in my case) it doesn't happen only after a suspend/resume cycle. This crash happened shortly after a reboot.
Forgot to mention; it happens to me while running chromium as well.
(In reply to Chris Wilson from comment #5)
> Disabling dmc will prevent reaching package c-state 8+, otherwise it should
> be no impact, so we are on the lookout to see if it hangs again.
(In reply to Stanislav Ochotnicky from comment #7)
> Created attachment 145727 [details]
> /sys/class/drm/card0/error output with i915.dmc_firmware_path=/dev/null
>
> I have managed to reproduce again - even after upgrading kernel to 5.3.5 and
> adding i915.dmc_firmware_path=/dev/null kernel command line option.
>
> Attaching new output of /sys/class/drm/card0/error
>
> I haven't yet found an exact reproducer, but will try to dig. My current two
> leads/ideas are:
> * Related to suspend/resume (i.e. I don't remember seeing hang after fresh
> boot, only after suspend/resume cycle)
> * Related to IOMMU/2nd video card being assigned to a VM
>
> Both of the above might be wild goose chases at this point though.
CC'ing Chris.
I can still reproduce on 5.3.8. I can provide another dump of /sys/class/drm/card0/error if needed. I can also confirm this has nothing to do with suspend as I can reproduce after fresh restart. Overall it's not a big deal for me since it ends up just as a short UI freeze. But if I can provide any additional information let me know. I should be able to start git-bisect if that would help narrow things down.
I just started a VM which has a different graphics card assigned and I experienced another hang with i915. Perhaps this is not necessarily VM/IOMMU related, but that might be one of the triggers? In any case, I have a separate crash dump for this event.
> I should be able to start git-bisect if that would help narrow things down.
Yes, this would definitely help. Can you post the bad commit that caused the issue? Thanks!
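For readers unfamiliar with bisecting, the idea is a binary search over the commits between a known-good and a known-bad kernel. The sketch below illustrates only that search; it is not how `git bisect` is implemented. The commit names and the `is_bad` callback are hypothetical stand-ins for building and testing each candidate kernel.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    `commits` is ordered oldest to newest; commits[0] must be known good
    and commits[-1] known bad. `is_bad(commit)` stands in for building
    that kernel and checking whether the hang reproduces.
    Returns (first_bad_commit, list_of_commits_tested).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(commits[mid])
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi], tested


if __name__ == "__main__":
    # Hypothetical history: 16 commits, regression introduced at "c11".
    history = [f"c{i}" for i in range(16)]
    culprit, tested = first_bad(history, lambda c: int(c[1:]) >= 11)
    print(culprit, "found after testing", len(tested), "kernels:", tested)
```

The number of kernels to build grows only logarithmically with the number of commits in the range, which is why bisecting stays practical even across several stable releases.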
Not sure whether I should open a new report. I think I've been hit by the same bug, with similar characteristics. I started to notice short (a second, mostly less) freezes with the 5.3 kernel (I don't know the exact version; since I got it from Debian unstable, it surely wasn't the initial 5.3.0). With the 5.2 kernel everything was fine. I notice freezes only in Firefox and Thunderbird. I get the same dmesg output (mostly Resetting rcs0 for hang on rcs0). /sys/class/drm/card0/error got generated, I think, when I tried to close a message window in Thunderbird. If you want, I can upload my /sys/class/drm/card0/error. Not every freeze in Firefox/Thunderbird is accompanied by a "Resetting..." message, so I'm not 100% sure that these are actually related. I only connected these things today, so I will start monitoring. I haven't compiled a running kernel from source in years, so I can't currently do it. If you point me to good instructions, I can try to git-bisect this.
Created attachment 146045 [details]
Similar crash in my system
Anyway, I'll add my crash dump. I have monitored the relation between the Firefox/Thunderbird mini-freezes and these errors, but haven't found any correlation.
Created attachment 146046 [details] Similar crash in my system
(In reply to ilvez from comment #16)
> Created attachment 146046 [details]
> Similar crash in my system
Can you try to reproduce this issue using drm-tip (https://cgit.freedesktop.org/drm-tip)?
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/484.