Bug 65387 - [HSW] vebox semaphores ignored
Summary: [HSW] vebox semaphores ignored
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Priority: high
Severity: major
Assignee: Ben Widawsky
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-05 03:22 UTC by lu hua
Modified: 2017-10-06 16:08 UTC
CC List: 3 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg (124.60 KB, text/plain)
2013-06-05 03:22 UTC, lu hua
better error handling at debugfs write (1.54 KB, patch)
2013-06-11 09:01 UTC, Mika Kuoppala
dmesg (123.58 KB, text/plain)
2013-06-13 06:03 UTC, lu hua
i915_error_state (2.76 MB, text/plain)
2013-06-13 06:04 UTC, lu hua
dmesg with tlb invalidation fix patch (508.97 KB, text/plain)
2013-08-07 07:01 UTC, lu hua
attachment-4239-0.html (2.01 KB, text/html)
2017-10-06 16:08 UTC, Lizhong

Description lu hua 2013-06-05 03:22:15 UTC
Created attachment 80319 [details]
dmesg

System Environment:
--------------------------
Arch:           x86_64
Platform:       Haswell
Kernel:         drm-intel-next-queued d7697eea3eec74c561d12887d892c53ac4380c00

Bug detailed description:
-------------------------
gem_seqno_wrap randomly aborts on Haswell with the drm-intel-next-queued kernel. It happens in about 1 of 5 runs. It doesn't happen on the drm-intel-fixes kernel. I ran it for 10 cycles on IVB and it works well.

output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.
Aborted (core dumped)

Reproduce steps:
----------------
1. ./gem_seqno_wrap
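
For context, gem_seqno_wrap exercises seqno wraparound by forcing the driver's next seqno through debugfs before submitting work. Below is a minimal sketch of what a write_seqno() helper along those lines could look like; the debugfs path, dri instance number and error handling are assumptions for illustration, not the test's actual code.

/* Hypothetical sketch of the failing helper: push a new "next seqno"
 * into the driver via debugfs and report 0 on success. */
#include <errno.h>
#include <stdio.h>

static int write_seqno(unsigned int seqno)
{
	/* i915_next_seqno is the debugfs knob assumed here; the
	 * dri instance (0) is likewise an assumption. */
	FILE *f = fopen("/sys/kernel/debug/dri/0/i915_next_seqno", "w");
	int ret = 0;

	if (!f)
		return -errno;
	if (fprintf(f, "0x%x", seqno) < 0)
		ret = -errno;
	if (fclose(f) == EOF && ret == 0)
		ret = -errno;
	return ret;	/* the assertion at gem_seqno_wrap.c:501 expects 0 */
}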
Comment 1 Mika Kuoppala 2013-06-11 09:01:22 UTC
Created attachment 80668 [details] [review]
better error handling at debugfs write
Comment 2 lu hua 2013-06-13 06:03:55 UTC
Created attachment 80752 [details]
dmesg

Tested with this patch. The abort goes away, but it causes a GPU hang.
output:
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
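
For reference, the repeated "11" above is EAGAIN: with the patch, the test retries the debugfs write instead of asserting, and the GPU hang then surfaces. A rough sketch of such a retry loop, reusing the hypothetical write_seqno() above (the retry bound and message format are assumptions, not the actual patch):

/* Sketch: retry the debugfs write on EAGAIN instead of aborting. */
#include <errno.h>
#include <stdio.h>

static int write_seqno_retrying(unsigned int seqno)
{
	int ret, retries = 8;

	do {
		ret = write_seqno(seqno);
		if (ret == -EAGAIN)
			fprintf(stderr, "retrying write to dfs: %d\n", EAGAIN);
	} while (ret == -EAGAIN && --retries > 0);

	return ret;
}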
Comment 3 lu hua 2013-06-13 06:04:50 UTC
Created attachment 80753 [details]
i915_error_state
Comment 4 Chris Wilson 2013-06-13 07:29:31 UTC
Mika, that error state is incorrect.
Comment 5 Chris Wilson 2013-06-13 07:34:54 UTC
On second thoughts, it may just be genuine (mistook the ringbuffer for a batchbuffer), and the bug is that the request is not getting flushed.
Comment 6 Chris Wilson 2013-06-13 09:14:01 UTC
And it is the BSD ring that died; it helps to read the dmesg. hangcheck 1, me 0.
Comment 7 Chris Wilson 2013-06-13 09:18:40 UTC
bsd command stream:
  HEAD: 0x000150b8
  TAIL: 0x00015108
  CTL: 0x0001f001
  ACTHD: 0xde49ece0
  IPEIR: 0x00000008
  IPEHR: 0x00000000
  INSTDONE: 0x3fffffff
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0xde49dd80
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x0000000a [last synced 0x00000000]
  SYNC_1: 0x00000009 [last synced 0x0000000b]
  seqno: 0x00000000
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00015108

Active [0]:

Where's the BSD batchbuffer gone?
Comment 8 Chris Wilson 2013-06-13 09:28:56 UTC
So, I think inter-ring serialisation is broken. The batch is passed from render -> bsd -> blt -> vecs. In order for that object to be removed from the active list, vecs must have completed its request early. Indeed we can tell that much by the RING registers (which is all we have) :(
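
For readers unfamiliar with the mechanism: inter-ring ordering on this hardware uses semaphores, where each ring signals the seqno it has reached into the other rings' sync registers (the SYNC_0/SYNC_1 values in the dump above) and a waiting ring stalls until the signalled value reaches its target. The sketch below is a software model of that check for illustration only; the real thing is semaphore commands emitted by the kernel into the rings, and the struct and names here are invented.

/* Illustrative model of a semaphore wait passing, not driver code. */
#include <stdbool.h>
#include <stdint.h>

struct ring_model {
	uint32_t sync[2];	/* seqnos signalled by the other rings,
				 * cf. SYNC_0/SYNC_1 above */
};

static bool semaphore_passed(const struct ring_model *waiter,
			     int signaller, uint32_t target)
{
	/* seqno comparison is wrap-safe: signed delta */
	return (int32_t)(waiter->sync[signaller] - target) >= 0;
}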
Comment 9 Mika Kuoppala 2013-06-14 11:18:19 UTC
(In reply to comment #8)
> So, I think inter-ring serialisation is broken. The batch is passed from
> render -> bsd -> blt -> vecs. In order for that object to be removed from
> the active list, vecs must have completed its request early. Indeed we can
> tell that much by the RING registers (which is all we have) :(

gem_seqno_wrap doesn't submit anything to the bsd or vecs rings.

Apparently gem_seqno_wrap doesn't even succeed in submitting any batches.
The first time it tries to alter the driver seqno, it gets -EAGAIN
back, most likely from __wait_seqno due to differing reset counts.

I suspect that the _previous_ test that was run has not completed
cleanly. It still has work in the rings when the test ends.

So either the previous test has injected something which hangs the bsd ring
and gem_seqno_wrap is a victim here, as it will find a hung gpu when it starts. Or calling gem_seqno_init while there is legitimate work in the rings causes a gpu hang.

Lu, could you find out what test is run before gem_seqno_wrap?
And perhaps alter the test order to see the effects.
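
The -EAGAIN mentioned above comes from the wait path noticing a GPU reset: the caller samples a reset counter before waiting, and the wait bails out with -EAGAIN if the counter has changed by the time it wakes. A rough sketch of that pattern (the struct and names are approximations of the 2013-era i915 code, not verbatim):

/* Sketch of the reset-count check in a __wait_seqno-style path. */
#include <errno.h>
#include <stdint.h>

struct gpu_error_model { uint32_t reset_counter; };

static int wait_seqno_sketch(struct gpu_error_model *error,
			     uint32_t pre_wait_reset_count)
{
	/* ... block until the seqno completes or hangcheck fires ... */

	/* A reset ran while we slept: the seqno space was
	 * reinitialised, so ask the caller to restart. */
	if (error->reset_counter != pre_wait_reset_count)
		return -EAGAIN;
	return 0;
}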
Comment 10 Chris Wilson 2013-06-14 11:55:04 UTC
It's not gem_seqno_wrap per se that is at fault; it is triggered by the quiescence done in the setup.
Comment 11 Daniel Vetter 2013-08-06 20:21:21 UTC
Ben just volunteered.
Comment 12 Daniel Vetter 2013-08-06 20:23:45 UTC
Please retest with Chris' tlb invalidation fix:

https://patchwork.kernel.org/patch/2839512/
Comment 13 Ben Widawsky 2013-08-07 02:32:40 UTC
Haihao, do you have some time to look at this one (or someone on your team), assuming the TLB fix doesn't solve it?
Comment 14 lu hua 2013-08-07 06:59:52 UTC
(In reply to comment #12)
> Please retest with Chris' tlb invalidation fix:
> 
> https://patchwork.kernel.org/patch/2839512/

Tested with this patch. The system randomly hangs.
output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.
gem_write:476 failed, ret=-1, errno=5

Call trace:
[  192.720248]  [<ffffffffa00a7e86>] ? i915_gem_reset+0xbd/0x224 [i915]
[  192.721426]  [<ffffffff8122c1b7>] ? kobject_uevent_env+0x3ec/0x413
[  192.722594]  [<ffffffffa00948b0>] ? i915_reset+0x36/0x1a6 [i915]
[  192.723782]  [<ffffffffa009938e>] ? i915_error_work_func+0xba/0x148 [i915]
[  192.724954]  [<ffffffff81045fb6>] ? process_one_work+0x1c1/0x2c8
[  192.726126]  [<ffffffff810441a3>] ? pwq_activate_delayed_work+0x23/0x2e
[  192.727291]  [<ffffffff81046502>] ? worker_thread+0x1d0/0x299
[  192.728455]  [<ffffffff81046332>] ? rescuer_thread+0x24e/0x24e
[  192.729606]  [<ffffffff8104b149>] ? kthread+0xad/0xb5
[  192.730747]  [<ffffffff8104b09c>] ? kthread_freezable_should_stop+0x46/0x46
[  192.731911]  [<ffffffff8145842c>] ? ret_from_fork+0x7c/0xb0
[  192.733078]  [<ffffffff8104b09c>] ? kthread_freezable_should_stop+0x46/0x46
[  192.734223] Code: 48 8b 80 30 03 00 00 48 3b b0 20 22 00 00 48 8d 90 c8 1f 00 00 48 8d 41 a8 48 0f 44 f2 48 8d 97 a0 00 00 00 48 39 d1 75 15 0f 0b <48> 39 70 50 75 05 48 8b 40 30 c3 48 8b 40 58 48 83 e8 58 48 8d
[  192.736800] RIP  [<ffffffffa00a6962>] i915_gem_obj_offset+0x36/0x57 [i915]
[  192.738029]  RSP <ffff880241a79d00>
[  192.739208] CR2: fffffffffffffff8
[  192.740402] ---[ end trace 2840c64b8122deca ]---
[  192.741650] BUG: unable to handle kernel paging request at ffffffffffffffd8
Comment 15 lu hua 2013-08-07 07:01:36 UTC
Created attachment 83757 [details]
dmesg with tlb invalidation fix patch
Comment 16 Ben Widawsky 2013-08-12 23:56:11 UTC
Please try:
https://patchwork.kernel.org/patch/2843323/
Comment 17 lu hua 2013-08-13 02:48:43 UTC
(In reply to comment #16)
> Please try:
> https://patchwork.kernel.org/patch/2843323/

Fixed by this patch.
Comment 18 Paulo Zanoni 2013-08-14 14:03:17 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > Please try:
> > https://patchwork.kernel.org/patch/2843323/
> 
> Fixed by this patch.

Patch merged, closing bug.

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=d6fc62c1699f4331c6b05a4b82a7796f8d281af3

Thanks everybody for reporting, testing and fixing,
Paulo
Comment 19 lu hua 2013-08-19 01:35:13 UTC
Verified. Fixed.
Comment 20 Elizabeth 2017-10-06 14:46:11 UTC
Closing old verified bug.
Comment 21 Lizhong 2017-10-06 16:08:12 UTC
Created attachment 134708 [details]
attachment-4239-0.html

PRC National Holiday Oct. 1 - Oct. 8. Email responses may be very slow.

