Bug 105373

Summary: [kbl-h] GPU HANG: ecode 9:0:0xfedffffa, in Xorg [1345], reason: Hang on rcs0, action: reset
Product: DRI
Reporter: Vasil Kolev <vasil>
Component: DRM/Intel
Assignee: Mika Kuoppala <mika.kuoppala>
Status: RESOLVED MOVED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: bruno.cudini+freedesktop, chris, intel-gfx-bugs, jon.ewins, mika.kuoppala
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard: ReadyForDev
i915 platform: KBL
i915 features: GPU hang
Attachments (flags: none on all):
/sys/class/drm/card0/error
dmesg
/sys/class/drm/card0/error with 1.04
dmesg with 1.04
dmesg with drm.debug
/sys/class/drm/card0/error with drm-tip
dmesg with drm-tip
dmesg with drm-tip 4.17.0-rc2 (d04fd4f6d93cea918521059db8358ff9e7a4a03b)
/sys/class/drm/card0/error with drm-tip 4.17.0-rc2 (d04fd4f6d93cea918521059db8358ff9e7a4a03b)
dmesg with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10)
/sys/class/drm/card0/error with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10)
/sys/class/drm/card0/error with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10), second try
dmesg with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10), second try
results from intel-gpu-tools
results with disable_display
/sys/class/drm/card0/error with drm-tip 5.2.0-rc2 (aff09dc14e1d9f03f9e6c8c157d4abccf4ca2b14)
dmesg with drm-tip 5.2.0-rc2 (aff09dc14e1d9f03f9e6c8c157d4abccf4ca2b14)

Description Vasil Kolev 2018-03-06 22:44:46 UTC
Created attachment 137844 [details]
/sys/class/drm/card0/error

issue: GPU doesn't even start working

On every boot, as soon as I log in and compiz tries to start, the above message (the GPU HANG line from the summary) shows up in dmesg.
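
For anyone scanning a dmesg for this message, here is a minimal sketch that pulls the fields out of a GPU HANG line shaped like the one in the summary. The regular expression and the field names (ecode, process, pid, reason, action) are illustrative assumptions based only on this one line, not an official i915 log format.

```python
import re

# Pattern for the hang line as it appears in this bug's summary.
# The group names are illustrative labels, not official i915 terminology.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fx:]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG line as a dict, or None if no match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


summary = ("[kbl-h] GPU HANG: ecode 9:0:0xfedffffa, in Xorg [1345], "
           "reason: Hang on rcs0, action: reset")
info = parse_gpu_hang(summary)
print(info)
```

Lines that do not match return None, so the function can be applied to every line of a dmesg dump.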
Comment 1 Vasil Kolev 2018-03-06 22:45:39 UTC
Created attachment 137845 [details]
dmesg
Comment 2 Chris Wilson 2018-03-07 08:45:31 UTC
You need to update the dmc firmware:

commit 4f0aa1fa3e3849caee450ee5d14fcc289cf16703
Author: Anusha Srivatsa <anusha.srivatsa@intel.com>
Date:   Thu Nov 9 10:51:43 2017 -0800

    drm/i915/dmc: DMC 1.04 for Kabylake
    
    There is a new version of DMC available for KBL.
    
    The release notes mentions:
    1. Fix for the issue where DC_STATE was getting enabled even
    when disabled by driver causing data corruption.
    
    v2: Remove pull request from commit message (Rodrigo).
    
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Signed-off-by: Anusha Srivatsa <anusha.srivatsa@intel.com>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/1510253503-12634-1-git-send-email-anusha.srivatsa@intel.com
Comment 3 Vasil Kolev 2018-03-07 16:19:22 UTC
The same happens with DMC 1.04. Attaching dmesg and /sys/class/drm/card0/error.

Also, echo 1 > /sys/kernel/debug/dri/0/i915_wedged doesn't seem to have any effect; the GPU still cannot be reset.
Comment 4 Vasil Kolev 2018-03-07 16:20:09 UTC
Created attachment 137865 [details]
/sys/class/drm/card0/error with 1.04
Comment 5 Vasil Kolev 2018-03-07 16:20:35 UTC
Created attachment 137866 [details]
dmesg with 1.04
Comment 6 Elizabeth 2018-03-07 17:33:49 UTC
Hi, could you attach a dmesg captured with debug output enabled, i.e. with the drm.debug=0xe parameter added to the kernel command line in GRUB? Thanks.
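
As a side note on what drm.debug=0xe selects: the value is a bitmask of debug categories. The sketch below decodes it using the category bits as I understand them from the kernel's include/drm/drm_print.h; the table is an assumption that should be checked against the header of the kernel actually in use.

```python
# Debug category bits, as listed in the kernel's include/drm/drm_print.h
# (an assumption; verify against the header of your kernel version).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0xe))
```

Under that table, 0xe enables the DRIVER, KMS and PRIME categories and leaves the noisier CORE category off.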
Comment 7 Vasil Kolev 2018-03-07 18:15:06 UTC
Created attachment 137869 [details]
dmesg with drm.debug

Here's the dmesg with debug.
Comment 8 Jani Saarinen 2018-03-29 07:10:51 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.
Comment 9 Vasil Kolev 2018-03-29 07:42:59 UTC
Jani, yes, this is still valid - there doesn't seem to have been a new release of anything related to this, and the bug persists (my GPU is hung and I still use some very slow driver/access to my video card).

I'm currently running 4.15.0 with the patch for the 1.04 firmware. Is there any new work in drm-tip, and how do I fetch it?
Comment 10 Jani Saarinen 2018-03-29 08:09:31 UTC
You can get drm-tip from: https://cgit.freedesktop.org/drm-tip.
Comment 11 Vasil Kolev 2018-03-29 13:43:20 UTC
Created attachment 138427 [details]
/sys/class/drm/card0/error with drm-tip
Comment 12 Vasil Kolev 2018-03-29 13:43:57 UTC
Created attachment 138428 [details]
dmesg with drm-tip
Comment 13 Vasil Kolev 2018-03-29 13:45:33 UTC
(In reply to Jani Saarinen from comment #10)
> You can get drm-tip from: https://cgit.freedesktop.org/drm-tip.

Attached are the dmesg (with debug enabled) and the error from the card with drm-tip. The issue persists.
Comment 14 Jani Saarinen 2018-04-24 07:00:58 UTC
Mika, Chris, any advice here?
Comment 15 Mika Kuoppala 2018-04-24 14:07:24 UTC
I looked at this and discussed it with Chris on IRC; here are the findings:

HW RING START is not from the request that was queued to the hardware, and the
GPU is dormant at a previous request's tail.

Please retest after fetching an up-to-date drm-tip from https://cgit.freedesktop.org/drm-tip, and prevent the driver from loading the DMC firmware by moving the DMC firmware binaries out of /lib/firmware/i915.
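
One way to do the firmware move described above, sketched under assumptions: the "*dmc*" glob is a guess at how the DMC blobs are named, the backup directory is hypothetical, and the demonstration runs on a throwaway directory rather than on /lib/firmware/i915 (which would need root).

```python
import shutil
import tempfile
from pathlib import Path


def move_dmc_firmware(firmware_dir, backup_dir, pattern="*dmc*"):
    """Move files matching `pattern` from firmware_dir into backup_dir.

    The "*dmc*" glob is an assumption about how the DMC blobs are named.
    """
    firmware_dir, backup_dir = Path(firmware_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for blob in sorted(firmware_dir.glob(pattern)):
        shutil.move(str(blob), str(backup_dir / blob.name))
        moved.append(blob.name)
    return moved


# Demonstration on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    fw = Path(tmp) / "i915"
    fw.mkdir()
    (fw / "example_dmc_blob.bin").write_bytes(b"")
    (fw / "example_other_blob.bin").write_bytes(b"")
    moved = move_dmc_firmware(fw, Path(tmp) / "i915-dmc-backup")
    remaining = sorted(p.name for p in fw.iterdir())
    print(moved, remaining)
```

If the driver is loaded from an initramfs, a copy of the firmware may also live inside that image; whether that applies depends on the distribution's setup.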
Comment 16 Vasil Kolev 2018-04-24 15:54:30 UTC
Created attachment 139062 [details]
dmesg with drm-tip 4.17.0-rc2 (d04fd4f6d93cea918521059db8358ff9e7a4a03b)
Comment 17 Vasil Kolev 2018-04-24 15:55:06 UTC
Created attachment 139063 [details]
/sys/class/drm/card0/error with drm-tip 4.17.0-rc2 (d04fd4f6d93cea918521059db8358ff9e7a4a03b)
Comment 18 Vasil Kolev 2018-04-24 15:56:46 UTC
Retested with the latest drm-tip, the issue looks the same.

Is there anything else besides the dmesg and /sys/class/drm/card0/error I can help with? I can try to arrange access to the laptop in question.
Comment 19 Vasil Kolev 2018-08-19 15:41:27 UTC
Just a note: I've been using the latest drm-tip (updating it around once a month) and the issue has not changed. Any pointers on where I should look?
Comment 20 Mika Kuoppala 2018-08-21 07:43:44 UTC
Please update your drm-tip again and test with the following
kernel parameters:

drm.debug=0xe intel_iommu=off i915.enable_dc=0 i915.disable_power_well=0

Please upload the dmesg and error state. Thanks.
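
Before capturing logs it can be worth confirming that the parameters actually reached the kernel. A minimal sketch, assuming the booted command line is read from /proc/cmdline; the example string below is hypothetical and deliberately omits one parameter.

```python
# Parameters requested in this comment.
REQUESTED = [
    "drm.debug=0xe",
    "intel_iommu=off",
    "i915.enable_dc=0",
    "i915.disable_power_well=0",
]


def missing_params(cmdline, requested=REQUESTED):
    """Return the requested parameters that are absent from `cmdline`."""
    present = set(cmdline.split())
    return [param for param in requested if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical command line for illustration; on the test machine use
# missing_params(read_cmdline()) instead.
example = ("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro "
           "drm.debug=0xe intel_iommu=off i915.enable_dc=0")
print(missing_params(example))
```

An empty list means every requested parameter is present exactly as written.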
Comment 21 Vasil Kolev 2018-08-21 08:00:09 UTC
Rebuilding at the moment. Should I have the DMC firmware loaded or not?
Comment 22 Mika Kuoppala 2018-08-21 08:04:54 UTC
It should not matter, as the command line takes care of it.
Comment 23 Vasil Kolev 2018-08-21 09:05:58 UTC
Created attachment 141211 [details]
dmesg with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10)
Comment 24 Vasil Kolev 2018-08-21 09:06:37 UTC
Created attachment 141212 [details]
/sys/class/drm/card0/error with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10)
Comment 25 Vasil Kolev 2018-08-21 09:08:27 UTC
I've attached the new ones. I do see one difference: previously glxinfo just reported an I/O error, and now it produces proper output.
Comment 26 Mika Kuoppala 2018-08-21 12:59:29 UTC
Not much has changed. The GPU has not even picked up the work item submitted by Xorg for execution.

Can you ssh to the box after the boot has failed and run the following:

# intel_reg read 0x2358
# intel_reg read 0x2358

It is a timestamp register, so let's see if the GPU is completely dead at
this point.

Also please enable everything under "drm/i915 Debugging" kernel config
for more detailed traces.

You could also add the following to the previous kernel command line
options: intel_idle.max_cstate=1 intel_pstate=disable
Comment 27 Vasil Kolev 2018-08-21 14:25:07 UTC
Created attachment 141215 [details]
/sys/class/drm/card0/error with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10), second try
Comment 28 Vasil Kolev 2018-08-21 14:25:34 UTC
Created attachment 141216 [details]
dmesg with drm-tip 4.18.0+ (d53f119472fc7daa532e46ea77098e9e9db2ac10), second try
Comment 29 Vasil Kolev 2018-08-21 14:29:19 UTC
I've attached updated dmesg and error with all the debug options enabled in the kernel and the new parameters.

I've tried the register read, and the register changes between reads:

(0x00002358): 0x6b619e3d
(0x00002358): 0x6dddb91a
(0x00002358): 0x6f0d705a
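
To make the comparison between reads explicit, here is a small sketch that parses output lines in the format shown above and computes the difference between consecutive values. Treating the register as a 32-bit counter that may wrap is an assumption based on the width of the printed values.

```python
import re


def parse_readings(text):
    """Extract register values from lines like '(0x00002358): 0x6b619e3d'."""
    pattern = r"\(0x[0-9a-fA-F]+\):\s*(0x[0-9a-fA-F]+)"
    return [int(value, 16) for value in re.findall(pattern, text)]


def deltas_32bit(values):
    """Differences between consecutive reads, modulo 2**32 for wraparound."""
    return [(b - a) & 0xFFFFFFFF for a, b in zip(values, values[1:])]


output = """
(0x00002358): 0x6b619e3d
(0x00002358): 0x6dddb91a
(0x00002358): 0x6f0d705a
"""
values = parse_readings(output)
deltas = deltas_32bit(values)
print([hex(d) for d in deltas])
print("advancing" if all(d != 0 for d in deltas) else "stuck")
```

Non-zero deltas mean the counter is still ticking, so the GPU is not completely dead at this point.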

Also, it seems I haven't described the situation properly, so here it goes:

After boot, when the display manager starts, it takes some time before it reacts to anything other than cursor movement, and I can't even switch to the text console. After a while it starts working, but anything GPU-related either crashes or blocks (for example, I'm running chromium with --disable-gpu). Compiz was refusing to work, and glxinfo was saying "i/o error".

Now glxinfo shows a normal result, but anything GPU-related still blocks; otherwise I'm able to use X, if somewhat slowly.
Comment 30 Lakshmi 2018-09-10 13:50:21 UTC
Mika, any updates on this issue?
Comment 31 Mika Kuoppala 2018-10-15 14:01:47 UTC
Vasil, can you boot the machine without gui (adding '3' to kernel cmdline), and then run some gem tests from intel-gpu-tools?

For starters: gem_exec_basic, gem_exec_nop, gem_ctx_switch

And if they seem to work fine, then the full fastfeedback
suite with
'./scripts/run-tests.sh -F ./tests/intel-ci/fast-feedback.testlist'

The goal is to see whether the tests also encounter a nonresponsive GPU right from the start.
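
A minimal sketch of running the three tests in the order given and stopping at the first one that fails or times out. The timeout value is an arbitrary assumption, and the demo uses stand-in commands so it runs anywhere; on the test machine the commands would be the gem_exec_basic, gem_exec_nop and gem_ctx_switch binaries, whose location depends on how intel-gpu-tools was built.

```python
import subprocess
import sys


def run_in_order(commands, timeout=600):
    """Run each command in turn; stop at the first failure or timeout."""
    results = []
    for argv in commands:
        try:
            code = subprocess.run(argv, timeout=timeout).returncode
        except subprocess.TimeoutExpired:
            code = None  # no exit status; treated here as a hang
        results.append((argv[0], code))
        if code != 0:
            break
    return results


# Stand-in commands: the first succeeds, the second fails with status 3,
# so the third is never started.
demo = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
    [sys.executable, "-c", "raise SystemExit(0)"],
]
results = run_in_order(demo)
print(results)
```

Stopping early mirrors the request above: the full fast-feedback list is only worth running if the three basic tests pass.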
Comment 32 Vasil Kolev 2018-10-18 06:31:49 UTC
Hi Mika,

I've been a bit swamped in the last few days, but I'll rebuild with the latest drm-tip tonight and will run these tests.
Comment 33 Vasil Kolev 2018-10-18 20:11:25 UTC
Created attachment 142085 [details]
results from intel-gpu-tools

The generated results are in results/; res/ contains the output of the three tests and a dmesg file covering the period since the second one (I had to reboot after the third one, then ran the barrage).

I seem to be unable to turn off the frame buffer; could that be the reason for the issue?
Comment 34 Mika Kuoppala 2018-10-19 09:15:46 UTC
It seems that at least some GPU processing has been achieved
at some point (gem_ctx_switch).

Could you rerun with the kernel parameters 'drm.debug=0xe i915.disable_display=1'?
Thanks
Comment 35 Vasil Kolev 2018-10-19 09:28:14 UTC
Does disable_display mean that I should do this over ssh?
Comment 36 Mika Kuoppala 2018-10-19 12:22:46 UTC
Yes, you need to do it through ssh.

There is currently a bug on the display side that will oops the kernel
if you run the whole fastfeedback testlist.

But it would help if you could run the 3 specific gem tests over ssh, without the display enabled.
Comment 37 Vasil Kolev 2018-10-19 19:49:21 UTC
Created attachment 142107 [details]
results with disable_display
Comment 38 Francesco Balestrieri 2018-11-27 09:38:40 UTC
Mika, did the logs reveal anything?
Comment 39 Chris Wilson 2018-11-27 09:53:48 UTC
Only that every time the GPU goes to sleep it doesn't wake up until a reset is issued. That still strongly suggests the DMC.
Comment 40 Vasil Kolev 2018-12-23 20:09:42 UTC
FWIW, I did some more tests with drm-tip 3ac901085a9fae8699716ac44579dab1dec546c3 and the latest BIOS for this MB. The problem persists, but there's something strange: this time mpv -vo opengl blocks for 10-14 seconds and then starts showing the video. Running glxgears doesn't show the gears; it just reports FPS.

How does this look? If it could be a hardware issue, I can look into some workarounds...
Comment 41 Lakshmi 2019-06-04 10:06:03 UTC
Vasil, sorry for the delay. Do you still have the issue?
If so, can you reproduce the issue with current drm-tip? (https://cgit.freedesktop.org/drm-tip).

If the problem persists with current drm-tip, can you attach the dmesg and crash dump file?
Comment 42 Vasil Kolev 2019-06-04 10:16:54 UTC
Hi Lakshmi,

The problem existed with drm-tip from about two months ago. I'll rebuild and will let you know, as I'll need to reenable DRI.
Comment 43 Vasil Kolev 2019-06-04 18:57:16 UTC
Created attachment 144449 [details]
/sys/class/drm/card0/error with drm-tip 5.2.0-rc2 (aff09dc14e1d9f03f9e6c8c157d4abccf4ca2b14)
Comment 44 Vasil Kolev 2019-06-04 18:57:51 UTC
Created attachment 144451 [details]
dmesg with drm-tip 5.2.0-rc2 (aff09dc14e1d9f03f9e6c8c157d4abccf4ca2b14)
Comment 45 Vasil Kolev 2019-06-04 18:58:16 UTC
Lakshmi, the problem persists. I've attached dmesg and the error.
Comment 46 Lakshmi 2019-06-05 05:47:09 UTC
(In reply to Vasil Kolev from comment #45)
> Lakshimi, the problem persists, I've attached dmesg and the error.

Thanks, Vasil. I will come back to you soon.
Comment 47 Lakshmi 2019-07-13 18:41:03 UTC
@Mika, any updates here?
Comment 48 Lakshmi 2019-08-27 10:49:14 UTC
@Mika, since this is a high priority bug, it needs an update at least once a week. Any progress here?
Comment 49 Martin Peres 2019-11-29 17:41:19 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/80.