101522 – [bsw] lite-restore failed on vcs

Status:	CLOSED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-06-20 16:00 UTC by Dmitry D
Modified:	2018-04-23 10:11 UTC
CC List:	1 user

See Also:
i915 platform:	BSW/CHT
i915 features:	GPU hang

Attachments
/sys/class/drm/card0/error (53.51 KB, text/plain) 2017-06-20 16:00 UTC, Dmitry D
dmesg (4.09 KB, text/plain) 2017-06-20 16:01 UTC, Dmitry D
dmesg with debug (119.30 KB, text/plain) 2017-06-20 20:40 UTC, Dmitry D
Full dmesg log with enabled debug. (219.91 KB, text/plain) 2017-07-22 14:25 UTC, Dmitry D
Another crash dmesg with debug (370.42 KB, text/plain) 2017-07-23 15:11 UTC, Dmitry D
/sys/class/drm/card0/error for another crash dmesg with debug (50.29 KB, text/plain) 2017-07-23 15:12 UTC, Dmitry D

Description Dmitry D 2017-06-20 16:00:35 UTC

Created attachment 132091
/sys/class/drm/card0/error

Hello,

I have an embedded system based on an Intel Atom Z8350 that runs Ubuntu 16.04 with an X11 server, an ffmpeg H264 transcoder, the mpv H264 video stream player, and ffvademo to show a web camera's H264 stream. Every hour it logs this error:
[drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe C (start=3143204 end=3143205) time 916 us, min 1073, max 1079, scanline start 1054, end 1116
Then, after about 24-48 hours, the GPU hangs:
[58796.141184] [drm] GPU HANG: ecode 8:2:0xbffffffe, in ffmpeg [1311], reason: Hang on vcs, action: reset
[58796.141188] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[58796.141190] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[58796.141191] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[58796.141192] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[58796.141194] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[58796.141380] drm/i915: Resetting chip after gpu hang
[58796.845062] [drm:gen8_reset_engines [i915]] *ERROR* vcs: reset request timeout
[58796.845208] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[58798.396642] asynchronous wait on fence i915:[global]:ad2acf timed out
[70725.643729] retire_capture_urb: 3 callbacks suppressed

Kernel:
Linux vg3 4.12.0-041200rc5-lowlatency #201706112031 SMP PREEMPT Mon Jun 12 00:38:08 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
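
When comparing several crashes, it can help to pull the key fields out of the "GPU HANG" line shown in the log above. The following is a minimal sketch only: the regular expression is inferred from the message format quoted in this report and may not match other kernel versions, and the function name `parse_gpu_hang` is hypothetical.

```python
import re

# Pattern inferred from the "GPU HANG" lines quoted in this report (an
# assumption); other kernel versions may word the message differently.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-Fx:]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a GPU HANG dmesg line, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])  # the other fields stay as strings
    return info
```

Applied to each line of a saved dmesg file, this shows whether successive hangs come from the same process and engine.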

Comment 1 Dmitry D 2017-06-20 16:01:45 UTC

Created attachment 132092
dmesg

Comment 2 Elizabeth 2017-06-20 20:32:10 UTC

Hello, 
Could you please boot with the parameter "drm.debug=0xe" on grub, and provide the full dmesg log?
Thank you.
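
Before collecting logs, it may be worth confirming that the parameter actually reached the running kernel. A minimal sketch, assuming the kernel command line is available as a string (on Linux it can be read from /proc/cmdline); the function name `drm_debug_from_cmdline` is hypothetical, and keeping the last occurrence is a choice made for this sketch.

```python
def drm_debug_from_cmdline(cmdline):
    """Return the drm.debug value as an int, or None if it is not set."""
    value = None
    for token in cmdline.split():
        if token.startswith("drm.debug="):
            # Base 0 accepts both hexadecimal (0xe) and decimal (14) forms.
            # If the parameter is repeated, this sketch keeps the last one.
            value = int(token.split("=", 1)[1], 0)
    return value
```

A result of 14 (0xe) for the contents of /proc/cmdline would indicate that the requested setting is present.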

Comment 3 Dmitry D 2017-06-20 20:40:17 UTC

Created attachment 132100
dmesg with debug

Comment 4 Chris Wilson 2017-06-20 21:50:56 UTC

Not an ffmpeg issue as it looks like the elsp load didn't take.

Comment 5 Chris Wilson 2017-06-20 21:53:37 UTC

Furthermore, it looks like the GPU went south entirely as the GPU reset failed as well:

[58796.845062] [drm:gen8_reset_engines [i915]] *ERROR* vcs: reset request timeout
[58796.845208] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

Erroring out at that point is probably overkill; we can just try the reset even if we fail to signal the hw, and then bail if the reset itself fails.

Comment 6 Dmitry D 2017-06-20 22:06:59 UTC

>Could you please boot with the parameter "drm.debug=0xe" on grub, and provide the 
>full dmesg log?
Do you need log at time of error or just after boot (attachment dmesg with debug)?

Comment 7 Dmitry D 2017-06-21 19:36:27 UTC

Another hang with debug enabled:

[27185.912135] [drm:drm_mode_addfb2 [drm]] [FB:86]
[27192.487171] [drm:missed_breadcrumb [i915]] vcs missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? yes
[27196.540538] [drm] GPU HANG: ecode 8:2:0xbffffffe, in ffmpeg [1229], reason: Hang on vcs, action: reset
[27196.540543] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[27196.540544] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[27196.540545] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[27196.540546] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[27196.540548] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[27196.540731] [drm:i915_reset_and_wakeup [i915]] resetting chip
[27196.540761] drm/i915: Resetting chip after gpu hang
[27197.248318] [drm:gen8_reset_engines [i915]] *ERROR* vcs: reset request timeout
[27197.248529] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[27198.151214] [drm:drm_mode_addfb2 [drm]] [FB:86]

The system is still on; if you need additional info after the crash, let me know.

Comment 8 Elizabeth 2017-06-22 15:48:54 UTC

(In reply to Dmitry D from comment #6)
> >Could you please boot with the parameter "drm.debug=0xe" on grub, and provide the 
> >full dmesg log?
> Do you need log at time of error or just after boot (attachment dmesg with
> debug)?
From boot until the bug reproduces, thank you.

Comment 9 Elizabeth 2017-06-22 15:51:06 UTC

Adding tag into "Whiteboard" field - ReadyForDev
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

Comment 10 Ricardo 2017-07-21 16:35:20 UTC

Dmitry, please add the logs required, as specified by Elizabeth. If there is no response soon, the bug will be closed due to lack of activity.

Comment 11 Dmitry D 2017-07-21 16:47:11 UTC

Hello, Ricardo.

I did it a month ago. "dmesg with debug" is the boot part of dmesg with debug enabled, and when the system hung I posted the end of dmesg in Comment #7 (2017-06-21 19:36:27 UTC). I can't post the whole dmesg from boot to crash because, by the time of the crash, the old boot messages have been cut from the log.

Comment 12 Dmitry D 2017-07-22 14:25:48 UTC

Created attachment 132830
Full dmesg log with enabled debug.

This is the full dmesg with debug enabled, from boot to crash. Could you confirm that this is what you need? Thanks.

Comment 13 Dmitry D 2017-07-23 15:11:56 UTC

Created attachment 132849
Another crash dmesg with debug

Comment 14 Dmitry D 2017-07-23 15:12:31 UTC

Created attachment 132850
/sys/class/drm/card0/error for another crash dmesg with debug

Comment 15 Dmitry D 2017-07-26 20:35:45 UTC

Hello, Ricardo and Elizabeth.

Do you need another logs or last logs is ok?

Comment 16 Elizabeth 2017-07-26 22:59:14 UTC

(In reply to Dmitry D from comment #15)
> Hello, Ricardo and Elizabeth.
> 
> Do you need another logs or last logs is ok?
Hello Dmitry,
Thank you for updating the logs. 
For now, that would be enough. If more info is needed it will be commented here. Thanks again.

Comment 17 Elizabeth 2017-10-23 14:44:45 UTC

From error state:

bsd command stream:

  HEAD:  0x52e01df0 [0x00001df0]
    head = 0x00001df0, wraps = 663
  ACTHD: 0x00000000 52e01df0
    at ring: 0x00000000
  ...
  IPEHR: 0x00000000
  ...
  seqno: 0x0090ea64
  last_seqno: 0x0090ea66
  waiting: yes
  ring->head: 0x00001dd0
  ring->tail: 0x00001f30
  hangcheck stall: yes
  hangcheck action: dead
  hangcheck action timestamp: 4314722752, 10059896 ms ago
  ELSP[0]:  pid 1307, ban score 0, seqno        5:0090ea65, emitted 10061652ms ago, head 00001df0, tail 00001e40
  ELSP[1]:  pid 1682, ban score 0, seqno        6:0090ea66, emitted 10061640ms ago, head 000015d0, tail 00001618
  Active context: ffmpeg[1307] user_handle 0 hw_id 5, ban score 0 guilty 0 active 0

batch (vcs (submitted by ffmpeg [1307], ctx 0 [5], score 0)) at 0x00000000_ff354000
0xff354000:      0x13000082: MI_FLUSH_DW post_sync_op='no write'  invalidate video state (BCS-only),
0xff354004:      0x00000000:    address
0xff354008:      0x00000000:    dword
0xff35400c:      0x00000000:    upper dword
0xff354010:      0x70000003: 3D UNKNOWN: 3d_965 opcode = 0x7000
0xff354014:      0x00020202: MI_NOOP
0xff354018:      0x00000000: MI_NOOP
0xff35401c:      0x00000000: MI_NOOP
0xff354020:      0x00000000: MI_NOOP
0xff354024:      0x70010004: 3D UNKNOWN: 3d_965 opcode = 0x7001
0xff354028:      0x00000000: MI_NOOP
Bad length (114) in MI_SEMAPHORE_MBOX, [3, 3]
0xff35402c:      0x0b3c4ff0: MI_SEMAPHORE_MBOX update semaphore, compare semaphore, use compare reg 0
0xff354030:      0x480027fb:    value
0xff354034:      0x000002e0:    address
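
The decoded values in the error state above can be reproduced by hand. The sketch below is illustrative only: the mask and shift are assumptions chosen because they reproduce the "head = 0x00001df0, wraps = 663" line of the dump, and the helper names are hypothetical. It also counts how many requests were outstanding, which matches the two ELSP entries (seqno 0090ea65 and 0090ea66) sitting beyond the last completed seqno 0x0090ea64.

```python
# Assumed layout of the raw HEAD value: low 21 bits (dword aligned) hold the
# ring offset, the bits above hold the wrap count. These constants are
# assumptions picked to match the decoded values in this dump; verify them
# against the i915 register definitions for your kernel.
HEAD_ADDR_MASK = 0x001FFFFC
HEAD_WRAP_SHIFT = 21


def decode_ring_head(raw):
    """Split a raw 32-bit HEAD value into (ring offset, wrap count)."""
    return raw & HEAD_ADDR_MASK, raw >> HEAD_WRAP_SHIFT


def outstanding_requests(seqno, last_seqno):
    """Requests submitted but not completed, allowing for 32-bit wraparound."""
    return (last_seqno - seqno) & 0xFFFFFFFF


head, wraps = decode_ring_head(0x52E01DF0)
print(hex(head), wraps)                               # 0x1df0 663
print(outstanding_requests(0x0090EA64, 0x0090EA66))   # 2
```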

Comment 18 Elizabeth 2018-01-25 23:05:03 UTC

Just adding these from dmesg:

[drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe C (start=1421471 end=1421472) time 370 us, min 1073, max 1079, scanline start 1056, end 1082
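
To compare this message with the one in the original description, the numeric fields can be extracted as follows. This is a sketch under stated assumptions: the pattern is inferred from the two "Atomic update failure" lines quoted in this report, the function name `parse_update_failure` is hypothetical, and treating start and end as frame counter samples is my reading of the message, not something stated in the thread.

```python
import re

# Pattern inferred from the lines quoted in this report (an assumption).
UPDATE_RE = re.compile(
    r"Atomic update failure on pipe (?P<pipe>\w+) "
    r"\(start=(?P<start>\d+) end=(?P<end>\d+)\) "
    r"time (?P<time_us>\d+) us, min (?P<min>\d+), max (?P<max>\d+), "
    r"scanline start (?P<scan_start>\d+), end (?P<scan_end>\d+)"
)


def parse_update_failure(line):
    """Return the fields of an atomic update failure line, or None."""
    match = UPDATE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    pipe = fields.pop("pipe")
    result = {key: int(value) for key, value in fields.items()}
    result["pipe"] = pipe
    # Difference between the two counter samples (assumed frame counters).
    result["frames_crossed"] = result["end"] - result["start"]
    return result
```

In both quoted lines the counters differ by one, while the reported update times (916 us and 370 us) differ considerably.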

Hello Dmitry, any changes with 4.14 or 4.15 https://www.kernel.org?

Comment 19 Dmitry D 2018-01-25 23:06:44 UTC

>any changes with 4.14 or 4.15 https://www.kernel.org?
No changes on 4.14.

Comment 20 Jani Saarinen 2018-03-29 07:11:31 UTC

First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.

Comment 21 Jani Saarinen 2018-04-23 10:11:15 UTC

Closing; please re-open if this still occurs.
