Bug 101235 - [BAT][HSW] igt@gem_exec_flush@basic-batch-kernel-default-cmd hard hanged
Summary: [BAT][HSW] igt@gem_exec_flush@basic-batch-kernel-default-cmd hard hanged
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: highest critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-30 15:15 UTC by Martin Peres
Modified: 2017-06-30 21:07 UTC

See Also:
i915 platform: HSW
i915 features: GEM/Other



Description Martin Peres 2017-05-30 15:15:15 UTC
The machine fi-hsw-4770r hard-hanged when running igt@gem_exec_flush@basic-batch-kernel-default-cmd on CI_DRM_2671. The only thing that looks a little suspicious in the logs is this:


[  104.040011] [IGT] gem_exec_fence: starting subtest await-hang-default
[  106.726854] [drm:missed_breadcrumb [i915]] vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no, current seqno=4f3d, last=4f3e
[  112.762486] [drm] GPU HANG: ecode 7:0:0xe757feff, in gem_exec_fence [1670], reason: Hang on rcs0, action: reset
[  112.762778] [drm:i915_reset_and_wakeup [i915]] resetting chip
[  112.762851] drm/i915: Resetting chip after gpu hang
[  112.763002] [drm:i915_gem_reset [i915]] context gem_exec_fence[1670]/0 marked guilty (score 10) banned? no
[  112.763035] [drm:i915_gem_reset [i915]] resetting rcs0 to restart from tail of request 0x5966c
[  112.763205] [drm:intel_print_rc6_info [i915]] Enabling RC6 states: RC6 on
[  112.763379] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 0
[  112.774296] [IGT] gem_exec_fence: exiting, ret=0

Full logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2671/fi-hsw-4770r/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html
Comment 1 Chris Wilson 2017-05-30 15:26:59 UTC
(In reply to Martin Peres from comment #0)
> The machine fi-hsw-4770r hard-hanged when running
> igt@gem_exec_flush@basic-batch-kernel-default-cmd on CI_DRM_2671. The only
> thing that looks a little suspicious in the logs is this:
> 
> 
> [  104.040011] [IGT] gem_exec_fence: starting subtest await-hang-default
> [  106.726854] [drm:missed_breadcrumb [i915]] vecs0 missed breadcrumb at
> intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no, current
> seqno=4f3d, last=4f3e
> [  112.762486] [drm] GPU HANG: ecode 7:0:0xe757feff, in gem_exec_fence
> [1670], reason: Hang on rcs0, action: reset
> [  112.762778] [drm:i915_reset_and_wakeup [i915]] resetting chip
> [  112.762851] drm/i915: Resetting chip after gpu hang
> [  112.763002] [drm:i915_gem_reset [i915]] context gem_exec_fence[1670]/0
> marked guilty (score 10) banned? no
> [  112.763035] [drm:i915_gem_reset [i915]] resetting rcs0 to restart from
> tail of request 0x5966c
> [  112.763205] [drm:intel_print_rc6_info [i915]] Enabling RC6 states: RC6 on
> [  112.763379] [drm:init_workarounds_ring [i915]] rcs0: Number of context
> specific w/a: 0
> [  112.774296] [IGT] gem_exec_fence: exiting, ret=0

That's not suspicious as it was a hang test. The test in question (gem_exec_flush) is not recorded as even starting. There's no information here regarding the disappearance of the machine.
Comment 2 Martin Peres 2017-05-30 15:34:25 UTC
(In reply to Chris Wilson from comment #1)
> (In reply to Martin Peres from comment #0)
> > The machine fi-hsw-4770r hard-hanged when running
> > igt@gem_exec_flush@basic-batch-kernel-default-cmd on CI_DRM_2671. The only
> > thing that looks a little suspicious in the logs is this:
> > 
> > 
> > [  104.040011] [IGT] gem_exec_fence: starting subtest await-hang-default
> > [  106.726854] [drm:missed_breadcrumb [i915]] vecs0 missed breadcrumb at
> > intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no, current
> > seqno=4f3d, last=4f3e
> > [  112.762486] [drm] GPU HANG: ecode 7:0:0xe757feff, in gem_exec_fence
> > [1670], reason: Hang on rcs0, action: reset
> > [  112.762778] [drm:i915_reset_and_wakeup [i915]] resetting chip
> > [  112.762851] drm/i915: Resetting chip after gpu hang
> > [  112.763002] [drm:i915_gem_reset [i915]] context gem_exec_fence[1670]/0
> > marked guilty (score 10) banned? no
> > [  112.763035] [drm:i915_gem_reset [i915]] resetting rcs0 to restart from
> > tail of request 0x5966c
> > [  112.763205] [drm:intel_print_rc6_info [i915]] Enabling RC6 states: RC6 on
> > [  112.763379] [drm:init_workarounds_ring [i915]] rcs0: Number of context
> > specific w/a: 0
> > [  112.774296] [IGT] gem_exec_fence: exiting, ret=0
> 
> That's not suspicious as it was a hang test. 

OK, I'll try to remember that this is also one of those hang tests.

> The test in question
> (gem_exec_flush) is not recorded as even starting. There's no information
> here regarding the disappearance of the machine.

Piglit probably only syncs after the execution of a test, not at the beginning of a test.
Comment 3 Ricardo 2017-05-30 17:08:21 UTC
Martin, does this mean that this is not a bug? Or is it a bug, but in the IGT test?
Comment 4 Martin Peres 2017-05-30 17:19:58 UTC
(In reply to Ricardo from comment #3)
> Martin, does this mean that this is not a bug? Or is it a bug, but in the IGT
> test?

No, it is likely a bug in the kernel... but we don't have enough data yet. Hopefully, we can get more logs next time it happens.
Comment 5 Ricardo 2017-05-31 13:55:35 UTC
OK, thanks. I will leave it open and set it to NEEDINFO.
Comment 6 Jani Saarinen 2017-06-07 10:38:32 UTC
Statistics: Failure rate 1/25 run(s) (4%)
Comment 7 Jari Tahvanainen 2017-06-19 09:56:00 UTC
Seen once 2017-05-30. Statistics: Failure rate 1/67 run(s) (1%)
Comment 8 Jari Tahvanainen 2017-06-26 10:59:30 UTC
Still only one failure, on 2017-05-30. Statistics: Failure rate 1/100 run(s) (1%). This seems to be heading towards RESOLVED WORKSFORME.
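
For reference, the percentages in these statistics comments are consistent with dividing failures by runs and rounding to the nearest whole percent. The sketch below is an assumption about how the figure could be derived, not the actual CI script; `failure_rate_percent` is a hypothetical helper name.

```python
def failure_rate_percent(failures, runs):
    """Failure rate as a whole-number percentage.

    Rounding to the nearest integer is an assumption; the thread does not
    say how the CI statistics are rounded.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    return round(100 * failures / runs)


if __name__ == "__main__":
    # The (failures, runs) pairs reported in comments 6, 7 and 8.
    for failures, runs in [(1, 25), (1, 67), (1, 100)]:
        print(f"{failures}/{runs} run(s) ({failure_rate_percent(failures, runs)}%)")
```

With a single failure, the rate keeps shrinking as clean runs accumulate, which is why the later comments lean towards closing the bug.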
Comment 9 Martin Peres 2017-06-26 16:20:43 UTC
These hangs are the worst, because we have no idea what triggers them...

I understand your view, Jari, but I think we should keep this bug open until the end of the month and then close it. I can say that this has not been seen in either the patchwork runs or the IGT runs.
Comment 10 Elizabeth 2017-06-30 21:07:08 UTC
(In reply to Martin Peres from comment #9)
> These hangs are the worst, because we have no idea what triggers them...
> 
> I understand your view, Jari, but I think we should keep this bug open until the
> end of the month and then close it. I can say that this has not been seen in
> either the patchwork runs or the IGT runs.

Hello,
I'm closing this bug since the time mentioned above has passed. If the problem appears again, please share the information and change the status to REOPEN. Thank you.

