Bug 65387

Summary: [HSW] vebox semaphores ignored
Product: DRI Reporter: lu hua <huax.lu>
Component: DRM/IntelAssignee: Ben Widawsky <ben>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high CC: haihao.xiang, przanoni, yangweix.shui
Version: unspecified   
Hardware: All   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg
none
better error handling at debugfs write
none
dmesg
none
i915_error_state
none
dmesg with tlb invalidation fix patch
none
attachment-4239-0.html none

Description lu hua 2013-06-05 03:22:15 UTC
Created attachment 80319 [details]
dmesg

System Environment:
--------------------------
Arch:           x86_64
Platform:       Haswell
Kernel:     (drm-intel-next-queued)d7697eea3eec74c561d12887d892c53ac4380c00

Bug detailed description:
-------------------------
It randomly aborted on haswell with drm-intel-next-queued kernel.It happens 1 in 5 runs. It doesn't happens on drm-intel-fixes kernel. I run it 10 cycles on IVB, It works well.

output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.
Aborted (core dumped)

Reproduce steps:
----------------
1. ./gem_seqno_wrap
Comment 1 Mika Kuoppala 2013-06-11 09:01:22 UTC
Created attachment 80668 [details] [review]
better error handling at debugfs write
Comment 2 lu hua 2013-06-13 06:03:55 UTC
Created attachment 80752 [details]
dmesg

Test with this patch. The aborted goes away, It causes GPU hung.
output:
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
Comment 3 lu hua 2013-06-13 06:04:50 UTC
Created attachment 80753 [details]
i915_error_state
Comment 4 Chris Wilson 2013-06-13 07:29:31 UTC
Mika, that error state is incorrect.
Comment 5 Chris Wilson 2013-06-13 07:34:54 UTC
On second thoughts, it may just be genuine (mistook the ringbuffer for a batchbuffer), and the bug is that the request is not getting flushed.
Comment 6 Chris Wilson 2013-06-13 09:14:01 UTC
And it is the BSD ring that died, helps to read the dmesg. hangcheck 1, me 0.
Comment 7 Chris Wilson 2013-06-13 09:18:40 UTC
bsd command stream:
  HEAD: 0x000150b8
  TAIL: 0x00015108
  CTL: 0x0001f001
  ACTHD: 0xde49ece0
  IPEIR: 0x00000008
  IPEHR: 0x00000000
  INSTDONE: 0x3fffffff
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0xde49dd80
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x0000000a [last synced 0x00000000]
  SYNC_1: 0x00000009 [last synced 0x0000000b]
  seqno: 0x00000000
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00015108

Active [0]:

Where's the BSD batchbuffer gone?
Comment 8 Chris Wilson 2013-06-13 09:28:56 UTC
So, I think inter-ring serialisation is broken. The batch is passed from render -> bsd -> blt -> vecs. In order for that object to be removed from the active list, vecs must have completed its request early. Indeed we can tell that much by the RING registers )which is all we have :(
Comment 9 Mika Kuoppala 2013-06-14 11:18:19 UTC
(In reply to comment #8)
> So, I think inter-ring serialisation is broken. The batch is passed from
> render -> bsd -> blt -> vecs. In order for that object to be removed from
> the active list, vecs must have completed its request early. Indeed we can
> tell that much by the RING registers )which is all we have :(

gem_seqno_wrap doesn't submit anything to bsd nor vecs rings.

Apparently gem_seqno_wrap doesn't even success to submit any batches.
The first time it tries to alter the driver seqno, it get -EAGAIN
back, most likely from __wait_seqno due to differing reset counts.

I suspect that that the _previous_ test that was run has not completed
cleanly. It has still work in rings when the test ends.

So either previous test has injected something which hangs the bsd ring
and gem_seqno_wrap is a victim here as it will find hung gpu when it starts. Or calling gem_seqno_init while there are legimate work in the rings causes gpu hang.

Lu, could you find out what test is run before gem_seqno_wrap?
And perhaps alter the test order to see the effects.
Comment 10 Chris Wilson 2013-06-14 11:55:04 UTC
It's not the gem_seqno_wrap per-se that is at fault, it is triggered by the quiescence done in the setup.
Comment 11 Daniel Vetter 2013-08-06 20:21:21 UTC
Ben just volunteered.
Comment 12 Daniel Vetter 2013-08-06 20:23:45 UTC
Please retest with Chris' tlb invalidation fix:

https://patchwork.kernel.org/patch/2839512/
Comment 13 Ben Widawsky 2013-08-07 02:32:40 UTC
Haihao, do you have some time to look at this one (or someone on your team), assuming the TLB fix doesn't solve it?
Comment 14 lu hua 2013-08-07 06:59:52 UTC
(In reply to comment #12)
> Please retest with Chris' tlb invalidation fix:
> 
> https://patchwork.kernel.org/patch/2839512/

Test with this patch. System randomly hang.
output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.
gem_write:476 failed, ret=-1, errno=5

Call trace:
[  192.720248]  [<ffffffffa00a7e86>] ? i915_gem_reset+0xbd/0x224 [i915]
[  192.721426]  [<ffffffff8122c1b7>] ? kobject_uevent_env+0x3ec/0x413
[  192.722594]  [<ffffffffa00948b0>] ? i915_reset+0x36/0x1a6 [i915]
[  192.723782]  [<ffffffffa009938e>] ? i915_error_work_func+0xba/0x148 [i915]
[  192.724954]  [<ffffffff81045fb6>] ? process_one_work+0x1c1/0x2c8
[  192.726126]  [<ffffffff810441a3>] ? pwq_activate_delayed_work+0x23/0x2e
[  192.727291]  [<ffffffff81046502>] ? worker_thread+0x1d0/0x299
[  192.728455]  [<ffffffff81046332>] ? rescuer_thread+0x24e/0x24e
[  192.729606]  [<ffffffff8104b149>] ? kthread+0xad/0xb5
[  192.730747]  [<ffffffff8104b09c>] ? kthread_freezable_should_stop+0x46/0x46
[  192.731911]  [<ffffffff8145842c>] ? ret_from_fork+0x7c/0xb0
[  192.733078]  [<ffffffff8104b09c>] ? kthread_freezable_should_stop+0x46/0x46
[  192.734223] Code: 48 8b 80 30 03 00 00 48 3b b0 20 22 00 00 48 8d 90 c8 1f 00 00 48 8d 41 a8 48 0f 44 f2 48 8d 97 a0 00 00 00 48 39 d1 75 15 0f 0b <48> 39 70 50 75 05 48 8b 40 30 c3 48 8b 40 58 48 83 e8 58 48 8d
[  192.736800] RIP  [<ffffffffa00a6962>] i915_gem_obj_offset+0x36/0x57 [i915]
[  192.738029]  RSP <ffff880241a79d00>
[  192.739208] CR2: fffffffffffffff8
[  192.740402] ---[ end trace 2840c64b8122deca ]---
[  192.741650] BUG: unable to handle kernel paging request at ffffffffffffffd8
Comment 15 lu hua 2013-08-07 07:01:36 UTC
Created attachment 83757 [details]
dmesg with tlb invalidation fix patch
Comment 16 Ben Widawsky 2013-08-12 23:56:11 UTC
Please try:
https://patchwork.kernel.org/patch/2843323/
Comment 17 lu hua 2013-08-13 02:48:43 UTC
(In reply to comment #16)
> Please try:
> https://patchwork.kernel.org/patch/2843323/

Fixed by this patch.
Comment 18 Paulo Zanoni 2013-08-14 14:03:17 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > Please try:
> > https://patchwork.kernel.org/patch/2843323/
> 
> Fixed by this patch.

Patch merged, closing bug.

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=d6fc62c1699f4331c6b05a4b82a7796f8d281af3

Thanks everybody for reporting, testing and fixing,
Paulo
Comment 19 lu hua 2013-08-19 01:35:13 UTC
Verified.Fixed.
Comment 20 Elizabeth 2017-10-06 14:46:11 UTC
Closing old verified.
Comment 21 Lizhong 2017-10-06 16:08:12 UTC
Created attachment 134708 [details]
attachment-4239-0.html

PRC National Holiday Oct.1~Oct.8. Email response maybe be very slow

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.