101235 – [BAT][HSW] igt@gem_exec_flush@basic-batch-kernel-default-cmd hard hanged

Summary: [BAT][HSW] igt@gem_exec_flush@basic-batch-kernel-default-cmd hard hanged

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	highest critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2017-05-30 15:15 UTC by Martin Peres
Modified:	2017-06-30 21:07 UTC
CC List:	1 user

See Also:
i915 platform:	HSW
i915 features:	GEM/Other

Description Martin Peres 2017-05-30 15:15:15 UTC

The machine fi-hsw-4770r hard-hung while running igt@gem_exec_flush@basic-batch-kernel-default-cmd on CI_DRM_2671. The only thing in the logs that looks even slightly suspicious is this:


[  104.040011] [IGT] gem_exec_fence: starting subtest await-hang-default
[  106.726854] [drm:missed_breadcrumb [i915]] vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no, current seqno=4f3d, last=4f3e
[  112.762486] [drm] GPU HANG: ecode 7:0:0xe757feff, in gem_exec_fence [1670], reason: Hang on rcs0, action: reset
[  112.762778] [drm:i915_reset_and_wakeup [i915]] resetting chip
[  112.762851] drm/i915: Resetting chip after gpu hang
[  112.763002] [drm:i915_gem_reset [i915]] context gem_exec_fence[1670]/0 marked guilty (score 10) banned? no
[  112.763035] [drm:i915_gem_reset [i915]] resetting rcs0 to restart from tail of request 0x5966c
[  112.763205] [drm:intel_print_rc6_info [i915]] Enabling RC6 states: RC6 on
[  112.763379] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 0
[  112.774296] [IGT] gem_exec_fence: exiting, ret=0

Full logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2671/fi-hsw-4770r/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html

Comment 1 Chris Wilson 2017-05-30 15:26:59 UTC

(In reply to Martin Peres from comment #0)
> The machine fi-hsw-4770r hard-hanged when running
> igt@gem_exec_flush@basic-batch-kernel-default-cmd on CI_DRM_2671. The only
> thing that looks a little suspicious in the logs is this:
> 
> 
> [  104.040011] [IGT] gem_exec_fence: starting subtest await-hang-default
> [  106.726854] [drm:missed_breadcrumb [i915]] vecs0 missed breadcrumb at
> intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no, current
> seqno=4f3d, last=4f3e
> [  112.762486] [drm] GPU HANG: ecode 7:0:0xe757feff, in gem_exec_fence
> [1670], reason: Hang on rcs0, action: reset
> [  112.762778] [drm:i915_reset_and_wakeup [i915]] resetting chip
> [  112.762851] drm/i915: Resetting chip after gpu hang
> [  112.763002] [drm:i915_gem_reset [i915]] context gem_exec_fence[1670]/0
> marked guilty (score 10) banned? no
> [  112.763035] [drm:i915_gem_reset [i915]] resetting rcs0 to restart from
> tail of request 0x5966c
> [  112.763205] [drm:intel_print_rc6_info [i915]] Enabling RC6 states: RC6 on
> [  112.763379] [drm:init_workarounds_ring [i915]] rcs0: Number of context
> specific w/a: 0
> [  112.774296] [IGT] gem_exec_fence: exiting, ret=0

That's not suspicious as it was a hang test. The test in question (gem_exec_flush) is not recorded as even starting. There's no information here regarding the disappearance of the machine.

Comment 2 Martin Peres 2017-05-30 15:34:25 UTC

(In reply to Chris Wilson from comment #1)
> That's not suspicious as it was a hang test. 

OK, I'll try to remember that this is also one of those hang tests.

> The test in question
> (gem_exec_flush) is not recorded as even starting. There's no information
> here regarding the disappearance of the machine.

Piglit probably only syncs after the execution of a test, not at the beginning of a test.
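
To illustrate why syncing only after a test loses the record of a test that hard-hangs the machine, here is a minimal Python sketch of a runner that forces a "start" marker to disk before launching each test. This is not Piglit's actual code; `record`, `run_test`, and the journal format are hypothetical names chosen for the example.

```python
import os
import subprocess


def record(journal_path, message):
    """Append one line to the journal and ask the OS to write it to disk."""
    with open(journal_path, "a") as journal:
        journal.write(message + "\n")
        journal.flush()             # push Python's buffer to the kernel
        os.fsync(journal.fileno())  # ask the kernel to flush the file to storage


def run_test(journal_path, name, argv):
    """Run one test, syncing a marker both before and after it."""
    # Synced before the test starts, so a hard hang during the test
    # still leaves "start <name>" in the journal.
    record(journal_path, f"start {name}")
    result = subprocess.run(argv)
    record(journal_path, f"end {name} ret={result.returncode}")
    return result.returncode
```

With this scheme, a journal that ends in a "start" line with no matching "end" line points at the test that was running when the machine disappeared.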

Comment 3 Ricardo 2017-05-30 17:08:21 UTC

Martin, does this mean that this is not a bug? Or is it a bug, but in the IGT test?

Comment 4 Martin Peres 2017-05-30 17:19:58 UTC

(In reply to Ricardo from comment #3)
> does this mean that this is not a bug? Martin or is a bug but for the IGT
> test?

No, it is likely a bug in the kernel... but we don't have enough data yet. Hopefully, we can get more logs next time it happens.

Comment 5 Ricardo 2017-05-31 13:55:35 UTC

OK, thanks. I will leave it open and set it to NEEDINFO.

Comment 6 Jani Saarinen 2017-06-07 10:38:32 UTC

Statistics: Failure rate 1/25 run(s) (4%)

Comment 7 Jari Tahvanainen 2017-06-19 09:56:00 UTC

Seen once 2017-05-30. Statistics: Failure rate 1/67 run(s) (1%)

Comment 8 Jari Tahvanainen 2017-06-26 10:59:30 UTC

Still only one failure, on 2017-05-30. Statistics: Failure rate 1/100 run(s) (1%) - seems to be going towards resolved+worksforme...
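
For reference, the failure rates quoted in comments 6 to 8 can be recomputed as follows. This is a small Python sketch; `failure_rate_percent` is a hypothetical helper, not part of the CI statistics tooling.

```python
def failure_rate_percent(failures, runs):
    """Return the failure rate as a percentage of runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return 100.0 * failures / runs


# (failures, runs) pairs taken from comments 6, 7 and 8.
for failures, runs in [(1, 25), (1, 67), (1, 100)]:
    rate = failure_rate_percent(failures, runs)
    print(f"{failures}/{runs} run(s): {rate:.1f}%")
```

This gives 4.0%, 1.5% and 1.0% respectively, so the 1/67 figure is reported above as 1% but is closer to 1.5%.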

Comment 9 Martin Peres 2017-06-26 16:20:43 UTC

Those hangs are the worst, because we have no idea what triggers them...

I understand your view, Jari, but I think we should keep this bug open until the end of the month and then close it. I can say that this has not been seen in either the patchwork runs or the IGT runs.

Comment 10 Elizabeth 2017-06-30 21:07:08 UTC

(In reply to Martin Peres from comment #9)
> Those hangs are the worst, because we have no idea about what trigger them...
> 
> I understand your view Jari, but I think we should keep this bug until the
> end of the month and then close it. I can say that this has not been seen in
> the patchwork runs nor the IGT runs.

Hello,
I'm closing this bug since the time mentioned above has passed. If the problem appears again, please share the information and change the status to REOPENED. Thank you.