Bug 108093 - GPU HANG: ecode 8:1:0xd0bfc51f
Summary: GPU HANG: ecode 8:1:0xd0bfc51f
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86 (IA32) Linux (All)
Importance: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-28 09:14 UTC by Gerhard Hintermayer
Modified: 2019-05-02 12:40 UTC

See Also:
i915 platform: BSW/CHT
i915 features: GPU hang


Attachments
GPU crash dump (17.27 KB, text/plain)
2018-09-28 09:14 UTC, Gerhard Hintermayer
GPU crash dump with latest master git sources (24.95 KB, text/plain)
2018-09-28 11:35 UTC, Gerhard Hintermayer
GPU crash dump with kernel 4.19.0-rc5 (34.51 KB, text/plain)
2018-10-01 11:53 UTC, Gerhard Hintermayer
dmesg output with extended drm debugging (165.70 KB, text/plain)
2018-10-01 12:02 UTC, Gerhard Hintermayer

Description Gerhard Hintermayer 2018-09-28 09:14:38 UTC
Created attachment 141772 [details]
GPU crash dump

X is unusable on this machine. Most of the time the machine freezes after the second start of X, and I have to power cycle it.
There is no xorg.conf, but there are Intel-specific settings in xorg.conf.d:



Section "Device"
    Identifier  "card0"
    Driver  "intel"
    Option "AccelMethod" "sna"
    Option "ExaNoComposite" "false"
    Option "CacheLines" "1024"
    Option "XvMC" "true"
    Option "PreferredMode" "1280x1024"
EndSection
Comment 1 Gerhard Hintermayer 2018-09-28 09:16:35 UTC
output of lspci:
00:02.0 VGA compatible controller: Intel Corporation Atom/Celeron/Pentium Processor x5-E8000/J3xxx/N3xxx Integrated Graphics Controller (rev 21)
00:02.0 0300: 8086:22b1 (rev 21) (numeric)
Comment 2 Chris Wilson 2018-09-28 09:57:42 UTC
It walked off the end of the batch:

bcs0 command stream:
  IDLE?: no
  START: 0x00009000
  HEAD:  0x00000170 [0x00000150]
  TAIL:  0x000001a0 [0x00000188, 0x000001a8]
  CTL:   0x00003001
  MODE:  0x00000000
  HWS:   0x00004000
  ACTHD: 0x00000001 00000048
  IPEIR: 0x00000008
  IPEHR: 0x2f403ae8
  INSTDONE: 0xfffffff7
  batch: [0x00000000_00942000, 0x00000000_00946000]
  BBADDR: 0x00000001_00000045

The batch is just a single blit with a valid MI_BBE (MI_BATCH_BUFFER_END), suggesting a TLB error or some other incoherency.
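
A minimal sketch of the bounds check behind "walked off the end of the batch", using the values from the bcs0 dump above. The helper names are hypothetical, and treating BBADDR as a plain address (ignoring any flag bits the register may carry in its low bits) is an assumption made for illustration:

```python
# Values copied from the bcs0 register dump in this comment.
BATCH_START = 0x0000000000942000
BATCH_END = 0x0000000000946000
BBADDR = 0x0000000100000045


def parse_split_hex(text: str) -> int:
    """Parse the dump's split 64-bit notation, e.g. '0x00000001_00000045'."""
    return int(text.replace("_", ""), 16)


def inside_batch(addr: int, start: int, end: int) -> bool:
    """Return True if addr lies in the half-open range [start, end)."""
    return start <= addr < end


# BBADDR is far above the end of the batch, so the check fails.
print(inside_batch(BBADDR, BATCH_START, BATCH_END))  # False
```

With a batch ending at 0x00946000 and BBADDR at 0x1_00000045, the command streamer was reading well outside the buffer it was given.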
Comment 3 Gerhard Hintermayer 2018-09-28 10:43:35 UTC
Just checking my various installations:

The same hardware used to work with xorg-server-1.16.4.-r5, xf86-video-intel-2.99.917_p20160203 and kernel 4.3.3-gentoo (all Gentoo Portage version numbering). I was upgrading my standard installation to the latest versions when I ran into this issue.
Comment 4 Gerhard Hintermayer 2018-09-28 11:34:25 UTC
Compiled xf86-video-intel from git master; now I get GPU HANG: ecode 8:0:0x00072727
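
Since several different ecode strings appear in this report, here is a small sketch for pulling them apart when comparing hangs. The field names are an assumption: the three parts are read here as hardware generation, engine index and a hash of the error state, which is my understanding of the i915 message format and not something stated in this thread. `parse_ecode` is a hypothetical helper:

```python
import re

# Matches messages such as "GPU HANG: ecode 8:1:0xd0bfc51f".
ECODE_RE = re.compile(r"GPU HANG: ecode (\d+):(\d+):0x([0-9a-fA-F]{8})")


def parse_ecode(line):
    """Split an ecode message into its three numeric fields.

    Field names are assumptions (see the note above); returns None if the
    line does not contain an ecode message.
    """
    match = ECODE_RE.search(line)
    if match is None:
        return None
    return {
        "gen": int(match.group(1)),
        "engine": int(match.group(2)),
        "hash": int(match.group(3), 16),
    }


first = parse_ecode("GPU HANG: ecode 8:1:0xd0bfc51f")
second = parse_ecode("GPU HANG: ecode 8:0:0x00072727")
# The middle field differs between the two hangs reported so far.
print(first["engine"], second["engine"])
```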
Comment 5 Gerhard Hintermayer 2018-09-28 11:35:31 UTC
Created attachment 141773 [details]
GPU crash dump with latest master git sources
Comment 6 Chris Wilson 2018-09-28 11:38:22 UTC
It's not likely a userspace issue, so preferably check with drm-tip [https://github.com/freedesktop/drm-tip]
Comment 7 Gerhard Hintermayer 2018-09-28 11:54:29 UTC
(In reply to Chris Wilson from comment #6)
> It's not likely a userspace issue, so preferably check with drm-tip
> [https://github.com/freedesktop/drm-tip]

OK, I'll give it a try next week. I will have to check how to accomplish that within Gentoo...
Comment 8 Lakshmi 2018-09-30 18:17:16 UTC
> Ok, I'll give it a try next week, will have to check how to accomplish that
> within gentoo...

Thanks for giving it a try. If the issue persists, please attach dmesg from boot with the kernel parameters drm.debug=0x1e log_buf_len=4M.
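
For readers wondering what drm.debug=0x1e enables, a hedged sketch that decodes the bitmask. The bit-to-category table follows the kernel's DRM_UT_* debug categories as I understand them and is an assumption here; check it against the drm.debug parameter description for the kernel in use. `decode_drm_debug` is a hypothetical helper:

```python
# Assumed mapping of drm.debug bits to debug categories (DRM_UT_*).
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vbl",
}


def decode_drm_debug(mask):
    """Return the names of the categories enabled in a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(decode_drm_debug(0x1E))  # ['driver', 'kms', 'prime', 'atomic']
```

Under that mapping, 0x1e leaves out the noisy core and vblank messages while keeping driver, modesetting, PRIME and atomic output.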
Comment 9 Gerhard Hintermayer 2018-10-01 11:51:41 UTC
(In reply to Lakshmi from comment #8)
> > Ok, I'll give it a try next week, will have to check how to accomplish that
> > within gentoo...
> 
> Thanks for giving it a try. If issue persists,  attach dmesg from boot with
> kernel parameters drm.debug=0x1e log_buf_len=4M.

Now I get GPU HANG: ecode 8:0:0x85dffffb. I will reboot with the suggested kernel options and upload the dmesg output.
Comment 10 Gerhard Hintermayer 2018-10-01 11:53:01 UTC
Created attachment 141820 [details]
GPU crash dump with kernel 4.19.0-rc5
Comment 11 Gerhard Hintermayer 2018-10-01 12:02:43 UTC
Created attachment 141821 [details]
dmesg output with extended drm debugging

xdm is not started automatically; I started it manually at about 44 seconds of uptime.
Comment 12 Gerhard Hintermayer 2018-10-01 13:20:57 UTC
By the way, I tried my old kernel 4.3.3-gentoo together with all the recent userspace components (xorg-server, xf86-video-intel, ...) without touching them, so I basically just switched back to the old kernel. With it, no GPU hang occurs, X responds within normal time (whereas it is quite slow with recent kernel versions), and no artifacts are displayed on screen (which sometimes also happened together with the GPU hangs).

regards
Comment 13 Gerhard Hintermayer 2018-10-02 06:52:00 UTC
Tried kernel 4.14.65 (the last one officially marked stable on Gentoo): it crashes the machine at the first start of X. No entries in the log. Complete hang; I have to reset the machine :-(
Comment 14 Gerhard Hintermayer 2018-10-02 13:32:11 UTC
Kernel 4.9.122-gentoo (the next stable kernel in Gentoo below 4.14.65) works. So some change after this kernel must break DRI on this type of Intel graphics controller. I think I'll stick with this kernel. Unfortunately it has no Spectre/Meltdown protection yet, but that is better than X being completely broken...
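
The kernel results reported so far can be turned into a regression window for a later bisect. A minimal sketch, assuming the good/bad results as stated in this thread (4.3.3 and 4.9.122 work; 4.14.65 and 4.19.0-rc5 hang); the helper names are hypothetical and the version comparison deliberately ignores suffixes such as -gentoo or -rc5:

```python
def version_key(version):
    """Turn '4.19.0-rc5' into (4, 19, 0); any '-suffix' is ignored."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split("."))


def regression_window(results):
    """Return (last_good, first_bad), or None if the results do not
    form a clean good-then-bad sequence."""
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    bad = [v for v in ordered if not results[v]]
    if not good or not bad:
        return None
    last_good, first_bad = good[-1], bad[0]
    if version_key(last_good) >= version_key(first_bad):
        return None
    return last_good, first_bad


# True means X worked, False means a GPU hang or freeze was reported.
results = {
    "4.3.3-gentoo": True,
    "4.9.122-gentoo": True,
    "4.14.65": False,
    "4.19.0-rc5": False,
}
print(regression_window(results))  # ('4.9.122-gentoo', '4.14.65')
```

The window between 4.9 and 4.14 is still wide, so this only narrows where a git bisect would need to start and end.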
Comment 15 Lakshmi 2018-10-11 15:04:05 UTC
Gerhard, can you attach the output of cat /sys/class/drm/card$N/error from the latest kernel after a GPU hang? This will help in investigating the issue.
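
Because the card number is not always known in advance, the request above could be scripted along these lines. This is a sketch only: `collect_error_states` and the output file naming are hypothetical, and reading the error file usually needs root:

```python
import glob
import os


def collect_error_states(sysfs_root="/sys/class/drm", out_dir="."):
    """Copy every card*/error file under sysfs_root into out_dir.

    Returns the list of files written. Connector directories such as
    card0-HDMI-A-1 have no 'error' file and are skipped by the glob.
    """
    saved = []
    pattern = os.path.join(sysfs_root, "card*", "error")
    for path in sorted(glob.glob(pattern)):
        card = os.path.basename(os.path.dirname(path))
        with open(path, "rb") as src:
            data = src.read()
        dest = os.path.join(out_dir, card + "-error.txt")
        with open(dest, "wb") as dst:
            dst.write(data)
        saved.append(dest)
    return saved


if __name__ == "__main__":
    for name in collect_error_states():
        print("saved", name)
```

The resulting files can be attached to the bug as they are. This only helps when the machine survives the hang long enough to run the script.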
Comment 16 Chris Wilson 2018-10-22 09:55:13 UTC
That last error is most bizarre. It is complaining it hit an absent PTE for the logical context image.
Comment 17 Chris Wilson 2018-10-22 09:57:37 UTC
Sneaky suspicion:

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 7f308e713fae..2df5b8a1c988 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -111,10 +111,7 @@ i915_get_ggtt_vma_pages(struct i915_vma *vma);
 
 static void gen6_ggtt_invalidate(struct drm_i915_private *dev_priv)
 {
-       /*
-        * Note that as an uncached mmio write, this will flush the
-        * WCB of the writes into the GGTT before it triggers the invalidate.
-        */
+       wmb();
        I915_WRITE(GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
 
        /*
Comment 18 Gerhard Hintermayer 2018-10-22 15:35:38 UTC
(In reply to Lakshmi from comment #15)
> Gerhard, can attach cat /sys/class/drm/card$N/error from latest kernel after
> gpu hang? This will help in investigating the issue.

Sorry, this is still on my list. I was busy with other tasks recently, but if the machine hangs completely then there is no chance :-( ... Sometimes only X crashes, sometimes the complete machine is frozen.
Comment 19 Francesco Balestrieri 2018-11-28 08:27:56 UTC
Gerhard, could you try to reproduce the error using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot? Thanks!
Comment 20 Gerhard Hintermayer 2018-11-28 08:42:44 UTC
(In reply to Francesco Balestrieri from comment #19)
> Gerhard, could you try to reproduce the error using drm-tip
> (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
> Thanks!

Is the repo identical to https://github.com/freedesktop/drm-tip? I'm sitting behind a firewall without git/ssh pass-through.
Comment 21 Francesco Balestrieri 2018-11-28 09:12:25 UTC
I'm not completely sure, but judging from the commit logs they are the same at the moment at least, so it should be fine to try with https://github.com/freedesktop/drm-tip
Comment 22 Francesco Balestrieri 2018-11-28 09:28:22 UTC
Apparently you should also be able to clone from https://anongit.freedesktop.org/git/drm-tip.git but I haven't tried it myself.
Comment 23 Lakshmi 2018-12-18 06:07:01 UTC
Reporter, have you tried to verify with drm-tip?
Comment 24 Gerhard Hintermayer 2018-12-18 07:16:20 UTC
(In reply to Lakshmi from comment #23)
> Reporter, have you tried to verify with drmtip?

Sorry, not yet. I have been quite busy these last weeks...
Comment 25 Francesco Balestrieri 2019-02-06 08:06:41 UTC
Gerhard, sorry for the bother but did you have a chance to try?
Comment 26 Gerhard Hintermayer 2019-02-06 08:19:39 UTC
(In reply to Francesco Balestrieri from comment #25)
> Gerhard, sorry for the bother but did you have a chance to try?

Oops, sorry, not yet.
Comment 27 Lakshmi 2019-02-22 09:20:59 UTC
Gerhard, any updates?
Comment 28 Gerhard Hintermayer 2019-02-22 09:41:21 UTC
Sorry, still no updates. Meanwhile I put the machine into the factory, so I'll have to clone a new one for testing.
Comment 29 Lakshmi 2019-03-25 11:21:36 UTC
Gerhard, any update with the latest drm-tip? If this issue has not been seen lately, I can close this bug.
Comment 30 Lakshmi 2019-04-12 09:42:03 UTC
If you are not able to reproduce the issue, I can close this bug so that you can reopen it when the issue appears again. What do you think?
Comment 31 Lakshmi 2019-05-02 12:39:58 UTC
No feedback for more than 1 month. Closing this bug as WORKSFORME.

