Bug 106858 - [bsw] GPU hang on first bcs user batch (TLB)
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other Linux (All)
Importance: medium critical
Assignee: Abdiel Janulgue
QA Contact: Intel GFX Bugs mailing list
Whiteboard: Triaged, ReadyForDev
Depends on:
Reported: 2018-06-08 10:42 UTC by circle_chen
Modified: 2019-03-11 14:21 UTC

See Also:
i915 platform: BSW/CHT
i915 features: GPU hang

dmesg and /sys/class/drm/card0/error (18.85 KB, application/x-7z-compressed)
2018-06-08 10:42 UTC, circle_chen
dmesg, /sys/class/drm/card0/error, and build config (57.00 KB, application/x-gzip)
2018-06-08 12:50 UTC, circle_chen
Debug log after DRM's log enabled. (76.55 KB, application/x-gzip)
2018-06-12 09:53 UTC, circle_chen
Xorg.0.log (35.55 KB, text/plain)
2018-06-12 10:03 UTC, circle_chen
Add Xorg.0.log for disabled Intel driver. (37.42 KB, text/plain)
2018-10-30 10:35 UTC, circle_chen
attachment-21972-0.html (1.90 KB, text/html)
2018-11-23 11:23 UTC, circle_chen

Description circle_chen 2018-06-08 10:42:53 UTC
Created attachment 140080
dmesg and /sys/class/drm/card0/error

The UI on my GIGABYTE Q21B/Q21B is always broken. I found GPU hang errors in dmesg. The GPU is reset multiple times after the hang.

model name: GIGABYTE Q21B/Q21B
[   35.849634] [drm] GPU HANG: ecode 8:1:0x21204b77, in X [437], reason: Hang on bcs0, action: reset
[   35.849639] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   35.849640] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   35.849642] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   35.849643] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   35.849645] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   35.849663] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[   42.848209] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   51.840245] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   61.824125] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   70.848282] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   79.840180] i915 0000:00:02.0: Resetting rcs0 after gpu hang
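
A minimal sketch of how the dmesg excerpt above could be summarized when triaging, assuming the message formats are exactly as pasted. The names `HANG_RE`, `RESET_RE`, and `summarize` are hypothetical helpers, and the meaning given to the three `ecode` fields (generation, engine, code) is an assumption, not something stated in this report.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this report only.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>[0-9a-fA-F]+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)
RESET_RE = re.compile(r"Resetting (?P<engine>\w+) after gpu hang")


def summarize(dmesg_text):
    """Return (list of hang records, Counter of resets per engine)."""
    hangs = [m.groupdict() for m in HANG_RE.finditer(dmesg_text)]
    resets = Counter(m.group("engine") for m in RESET_RE.finditer(dmesg_text))
    return hangs, resets


SAMPLE = """\
[   35.849634] [drm] GPU HANG: ecode 8:1:0x21204b77, in X [437], reason: Hang on bcs0, action: reset
[   35.849663] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[   42.848209] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   51.840245] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   61.824125] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   70.848282] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[   79.840180] i915 0000:00:02.0: Resetting rcs0 after gpu hang
"""

hangs, resets = summarize(SAMPLE)
for hang in hangs:
    print(hang["comm"], hang["pid"], hang["reason"], hang["code"])
for engine, count in sorted(resets.items()):
    print(engine, count)
```

On this excerpt the summary shows one reported hang on bcs0 followed by repeated rcs0 resets, which is the pattern described above.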
Comment 1 Chris Wilson 2018-06-08 10:54:38 UTC
blt command stream:
  IDLE?: no
  START: 0x00009000
  HEAD:  0x00000068 [0x00000000]
    head = 0x00000068, wraps = 0
  TAIL:  0x00000130 [0x00000078, 0x00000098]
  CTL:   0x00003001
    len=16384, enabled
  MODE:  0x00000000
  HWS:   0x00004000
  ACTHD: 0x00000000 0008801c
    at ring: 0x00000000
  IPEIR: 0x00000008
  IPEHR: 0xdedfb480
  INSTDONE: 0xfffffff7
    busy: HS
  batch: [0x00000000_00000000, 0x00000000_00001000]
  BBADDR: 0x00000000_00088019
  BB_STATE: 0x00000020
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00000000 00088200
  RC PSMI: 0x00000010
  FAULT_REG: 0x000000c9
    Invalid PTE Fault
    Engine GFX
    Source ID 25
  SYNC_0: 0x00000000
  SYNC_1: 0x00000000
  SYNC_2: 0x00000000
  GFX_MODE: 0x00008000
  PDP0: 0x000000007bfd9000
  PDP1: 0x000000007bfe6000
  PDP2: 0x000000007bfe6000
  PDP3: 0x000000007bfe6000
  seqno: 0x00000002
  last_seqno: 0x00000005
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00000130
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4294899752, 4725752 ms ago
  engine reset count: 0
  ELSP[0]:  pid 437, ban score 0, seqno        2:00000005, prio 1024, emitted 4726568ms ago, head 000000e8, tail 00000130
  Active context: X[437] user_handle 0 hw_id 2, prio 0, ban score 0 guilty 0 active 0

Another hang on an early batch (the first user batch), with a garbage IPEHR and a page fault. Houston, we have a problem.
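
For readers following the register dump, here is a small sketch of how the raw `FAULT_REG` and `CTL` values relate to the decoded lines printed beneath them. The bit positions are assumptions chosen to reproduce the decoded output in this dump (valid bit 0, fault type in bits 2:1, source ID in bits 10:3, engine in bits 14:12, ring size in pages minus one in bits 20:12 of `CTL`); they are not taken from a hardware specification, and `decode_fault_reg` and `decode_ring_ctl` are hypothetical helper names.

```python
# Assumed gen8-style tables; the ordering is a guess that matches this dump.
GEN8_FAULT_TYPES = ("PTE", "PDE", "PDPE", "PML4E")
FAULT_ENGINES = ("GFX", "MFX0", "MFX1", "VEBX", "BLT")


def decode_fault_reg(reg):
    """Decode a fault register value; return None if the valid bit is clear."""
    if not reg & 0x1:
        return None
    engine_idx = (reg >> 12) & 0x7
    engine = FAULT_ENGINES[engine_idx] if engine_idx < len(FAULT_ENGINES) else "Unknown"
    return {
        "type": GEN8_FAULT_TYPES[(reg >> 1) & 0x3],
        "engine": engine,
        "source_id": (reg >> 3) & 0xFF,
    }


def decode_ring_ctl(ctl):
    """Decode ring length in bytes and the enable bit from a ring CTL value."""
    pages = ((ctl >> 12) & 0x1FF) + 1
    return {"len": pages * 4096, "enabled": bool(ctl & 0x1)}


fault = decode_fault_reg(0x000000C9)   # FAULT_REG from the dump
ctl = decode_ring_ctl(0x00003001)      # CTL from the dump
print("Invalid %s Fault, Engine %s, Source ID %d"
      % (fault["type"], fault["engine"], fault["source_id"]))
print("len=%d, %s" % (ctl["len"], "enabled" if ctl["enabled"] else "disabled"))
```

Under these assumptions the two values decode to the same text the dump shows: an invalid PTE fault from engine GFX with source ID 25, and an enabled ring of 16384 bytes.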
Comment 2 Chris Wilson 2018-06-08 10:56:44 UTC
Please build a kernel from https://cgit.freedesktop.org/drm-tip and test. It will be important later on for testing patches, anyway.
Comment 3 Chris Wilson 2018-06-08 11:01:24 UTC
See also #106828
Comment 4 circle_chen 2018-06-08 12:50:04 UTC
Created attachment 140086
dmesg, /sys/class/drm/card0/error, and build config

Thanks for your quick reply.
I built kernel 4.17.0-rc7+ from https://cgit.freedesktop.org/drm-tip.
Unfortunately, the issue still persists.

I also uploaded an attachment here. It includes dmesg, the error dump, and my build config. I hope it helps to resolve this issue.
Comment 5 Chris Wilson 2018-06-08 13:03:24 UTC
Hmm. Can you do a quick run with CONFIG_INTEL_IOMMU disabled (./scripts/config -d CONFIG_INTEL_IOMMU)?
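
After running `./scripts/config -d CONFIG_INTEL_IOMMU` and rebuilding, it can be worth confirming that the option really ended up disabled in the resulting `.config`. The sketch below assumes the usual kernel `.config` text format (`SYMBOL=value`, or `# SYMBOL is not set` for a disabled option); `config_state` is a hypothetical helper, not part of the kernel tree.

```python
import re


def config_state(config_text, symbol):
    """Return the symbol's value ('y', 'm', ...), 'n' if explicitly unset,
    or None if the symbol does not appear at all."""
    pattern = re.compile(r"%s=(.*)" % re.escape(symbol))
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "# %s is not set" % symbol:
            return "n"
        match = pattern.fullmatch(line)
        if match:
            return match.group(1)
    return None


# Illustrative fragments, not taken from the attached build config.
before = "CONFIG_IOMMU_SUPPORT=y\nCONFIG_INTEL_IOMMU=y\nCONFIG_INTEL_IOMMU_SVM=y\n"
after = "CONFIG_IOMMU_SUPPORT=y\n# CONFIG_INTEL_IOMMU is not set\n"

print(config_state(before, "CONFIG_INTEL_IOMMU"))
print(config_state(after, "CONFIG_INTEL_IOMMU"))
```

The exact match on the symbol name keeps `CONFIG_INTEL_IOMMU_SVM` from being mistaken for `CONFIG_INTEL_IOMMU`.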
Comment 6 Chris Wilson 2018-06-08 13:04:18 UTC
Oh, and just in case I forget later, this is no longer hanging on the first batch.
Comment 7 Chris Wilson 2018-06-08 13:05:20 UTC
(In reply to Chris Wilson from comment #6)
> Oh, and just in case I forget later, this is no longer hanging on the first
> batch.

Yes it is. Just rcs this time, and not a garbage IPEHR.
Comment 8 circle_chen 2018-06-11 07:55:01 UTC
With CONFIG_INTEL_IOMMU disabled, it still does not work.
Comment 9 James Ausmus 2018-06-11 16:56:11 UTC
Can you re-attach the dmesg log after adding the following kernel parameters:

drm.debug=0x1e log_buf_len=4M
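
As a rough illustration of what `drm.debug=0x1e` selects, the sketch below splits a kernel command line into parameters and maps the mask onto debug categories. The bit-to-category table is an assumption based on the DRM debug categories as I understand them for kernels of this era (core, driver, kms, prime, atomic, vbl, state, lease); check it against the kernel actually in use. `parse_cmdline` and `drm_debug_categories` are hypothetical helpers.

```python
# Assumed mapping of drm.debug bits to category names.
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vbl",
    0x40: "state",
    0x80: "lease",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value-or-None} dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def drm_debug_categories(value):
    """Return the category names enabled by a drm.debug value such as '0x1e'."""
    mask = int(value, 0)
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


# On a running system the string could be read from /proc/cmdline instead.
params = parse_cmdline("quiet drm.debug=0x1e log_buf_len=4M")
print(drm_debug_categories(params["drm.debug"]))
print(params["log_buf_len"])
```

Under this mapping, 0x1e enables driver, kms, prime, and atomic messages while leaving the core messages off, and `log_buf_len=4M` enlarges the kernel log buffer so the extra output is less likely to be overwritten before it is captured.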
Comment 10 James Ausmus 2018-06-11 17:00:07 UTC
Also, does this happen every time?
Comment 11 circle_chen 2018-06-12 09:53:47 UTC
Created attachment 140130
Debug log after DRM's log enabled.

I've uploaded the dmesg log captured with DRM debug logging enabled.
Comment 12 circle_chen 2018-06-12 09:55:25 UTC
(In reply to James Ausmus from comment #10)
> Also, does this happen every time?

Yes, the issue happens every time startx runs.
Comment 13 circle_chen 2018-06-12 10:03:14 UTC
Created attachment 140131
Xorg.0.log

Uploaded Xorg.0.log.
Comment 14 James Ausmus 2018-06-13 00:41:42 UTC
Thanks for the additional details and logs!
Comment 15 circle_chen 2018-06-29 07:03:27 UTC
Do you have any updates on it?
Comment 16 Lakshmi 2018-09-10 13:40:45 UTC
Francesco, any updates on this issue?
Comment 17 Abdiel Janulgue 2018-10-19 08:22:12 UTC
(In reply to circle_chen from comment #15)
> HI,
> Do you have any updates on it?

Can you try this again, but this time stick to the modesetting driver (e.g. remove xorg-x11-drv-intel on Fedora or xserver-xorg-video-intel on Debian)? Please attach the Xorg logs as well.
Comment 18 circle_chen 2018-10-30 10:35:11 UTC
Created attachment 142274
Add Xorg.0.log for disabled Intel driver.

I've added the log captured with the Intel driver disabled. (The screen hangs on a blank cursor.)
Comment 19 Abdiel Janulgue 2018-10-30 11:06:28 UTC
Hmm, it looks like our options for narrowing down the problem are getting slimmer now that the DDX is apparently innocent.

At this point I'd just like to verify that this is not a hardware problem of some sort in your system. Can you drop your run-level down to the command line and run the IGT[1] testdisplay test as root?

igt/tests$ ./testdisplay

Also, you reported the hang on a blank cursor? You can try running tests/kms_cursor_crc as well.

If one or both tests trigger a hang, please attach /sys/class/drm/card0/error.

[1] https://github.com/freedesktop/xorg-intel-gpu-tools
Comment 20 Francesco Balestrieri 2018-11-23 11:21:37 UTC
circle_chen, were you able to try Abdiel's test?
Comment 21 circle_chen 2018-11-23 11:23:05 UTC
Created attachment 142591
attachment-21972-0.html

Yes, sorry for the late reply.
I will do the test next week.

Yours sincerely,

陳至圓
Chih-Yuan Chen

Comment 22 Francesco Balestrieri 2018-12-04 07:34:28 UTC
Comment 23 circle_chen 2018-12-12 09:40:26 UTC
I have downloaded xorg-intel-gpu-tools-intel-gpu-tools-1.19 and compiled it successfully.

I then booted the kernel and installed the tools and libraries to the file system to run ./testdisplay.
But I got "syntax error: unexpected end of file".

I am not sure what is wrong in my environment.
Could you provide a boot image so that we can test more easily?
Comment 24 Abdiel Janulgue 2018-12-13 07:12:06 UTC
If you didn't manage to compile IGT, please use your distro's packaged binaries.
Comment 25 Lakshmi 2019-02-26 11:27:14 UTC
Circle Chen, any updates here? Were you able to run the tests as in Comment 19? Feedback is needed to proceed further with this bug.
Comment 26 Lakshmi 2019-03-11 14:20:48 UTC
No feedback for more than 2 months, closing this bug as WORKSFORME.
If you experience the same problem with drm-tip, please attach the dmesg log from boot with the kernel parameters drm.debug=0x1e log_buf_len=4M.
Remember to attach the error file and the Xorg log as well.
