Summary:     [HSW] vebox semaphores ignored
Product:     DRI
Component:   DRM/Intel
Status:      CLOSED FIXED
Severity:    major
Priority:    high
Version:     unspecified
Hardware:    All
OS:          Linux (All)
Reporter:    lu hua <huax.lu>
Assignee:    Ben Widawsky <ben>
QA Contact:  Intel GFX Bugs mailing list <intel-gfx-bugs>
CC:          haihao.xiang, przanoni, yangweix.shui
Created attachment 80668 [details] [review]
better error handling at debugfs write

Created attachment 80752 [details]
dmesg
Tested with this patch. The abort goes away, but it causes a GPU hang.
output:
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
retrying write to dfs: 11
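For context: errno 11 on Linux is EAGAIN, which the driver can return while a GPU reset is in flight. A minimal sketch of the retry-on-EAGAIN pattern implied by the output above; the debugfs path, retry cap, and helper name are assumptions for illustration, not the actual test code:

```c
/* Hypothetical sketch: retry a debugfs write that can fail with EAGAIN
 * while the GPU is resetting. Path and retry cap are assumptions. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_seqno_dfs(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	int ret = -EAGAIN;
	for (int tries = 0; tries < 100; tries++) {
		if (write(fd, val, strlen(val)) > 0) {
			ret = 0;
			break;
		}
		ret = -errno;
		if (errno != EAGAIN)
			break;
		/* Matches the "retrying write to dfs: 11" lines above. */
		fprintf(stderr, "retrying write to dfs: %d\n", errno);
		usleep(10000);
	}

	close(fd);
	return ret;
}

int main(void)
{
	/* The exact debugfs entry is an assumption here. */
	return write_seqno_dfs("/sys/kernel/debug/dri/0/i915_next_seqno",
			       "0x1") ? 1 : 0;
}
```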
Created attachment 80753 [details]
i915_error_state
Mika, that error state is incorrect.

On second thoughts, it may just be genuine (it mistook the ringbuffer for a batchbuffer), and the bug is that the request is not getting flushed.

And it is the BSD ring that died; it helps to read the dmesg. hangcheck 1, me 0.

bsd command stream:
  HEAD: 0x000150b8
  TAIL: 0x00015108
  CTL: 0x0001f001
  ACTHD: 0xde49ece0
  IPEIR: 0x00000008
  IPEHR: 0x00000000
  INSTDONE: 0x3fffffff
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0xde49dd80
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  SYNC_0: 0x0000000a [last synced 0x00000000]
  SYNC_1: 0x00000009 [last synced 0x0000000b]
  seqno: 0x00000000
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00015108
  Active [0]:

Where has the BSD batchbuffer gone?

So I think inter-ring serialisation is broken. The batch is passed from render -> bsd -> blt -> vecs. In order for that object to be removed from the active list, vecs must have completed its request early. Indeed, we can tell that much from the RING registers (which is all we have :().

(In reply to comment #8)
> So, I think inter-ring serialisation is broken. The batch is passed from
> render -> bsd -> blt -> vecs. In order for that object to be removed from
> the active list, vecs must have completed its request early. Indeed we can
> tell that much by the RING registers (which is all we have :()

gem_seqno_wrap doesn't submit anything to the bsd or vecs rings. Apparently gem_seqno_wrap doesn't even succeed in submitting any batches: the first time it tries to alter the driver seqno, it gets -EAGAIN back, most likely from __wait_seqno due to differing reset counts.

I suspect that the _previous_ test that was run did not complete cleanly and still had work in the rings when it ended. So either the previous test injected something which hangs the bsd ring and gem_seqno_wrap is merely a victim here, since it finds a hung GPU when it starts, or calling gem_seqno_init while there is legitimate work in the rings causes a GPU hang.

Lu, could you find out which test is run before gem_seqno_wrap? And perhaps alter the test order to see the effects.

It's not gem_seqno_wrap per se that is at fault; it is triggered by the quiescence done in the setup.

Ben just volunteered.

Please retest with Chris' tlb invalidation fix:

https://patchwork.kernel.org/patch/2839512/

Haihao, do you have some time to look at this one (or someone on your team), assuming the TLB fix doesn't solve it?

(In reply to comment #12)
> Please retest with Chris' tlb invalidation fix:
>
> https://patchwork.kernel.org/patch/2839512/

Tested with this patch. The system randomly hangs.

output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.
gem_write:476 failed, ret=-1, errno=5

Call trace:
[ 192.720248] [<ffffffffa00a7e86>] ? i915_gem_reset+0xbd/0x224 [i915]
[ 192.721426] [<ffffffff8122c1b7>] ? kobject_uevent_env+0x3ec/0x413
[ 192.722594] [<ffffffffa00948b0>] ? i915_reset+0x36/0x1a6 [i915]
[ 192.723782] [<ffffffffa009938e>] ? i915_error_work_func+0xba/0x148 [i915]
[ 192.724954] [<ffffffff81045fb6>] ? process_one_work+0x1c1/0x2c8
[ 192.726126] [<ffffffff810441a3>] ? pwq_activate_delayed_work+0x23/0x2e
[ 192.727291] [<ffffffff81046502>] ? worker_thread+0x1d0/0x299
[ 192.728455] [<ffffffff81046332>] ? rescuer_thread+0x24e/0x24e
[ 192.729606] [<ffffffff8104b149>] ? kthread+0xad/0xb5
[ 192.730747] [<ffffffff8104b09c>] ? kthread_freezable_should_stop+0x46/0x46
[ 192.731911] [<ffffffff8145842c>] ? ret_from_fork+0x7c/0xb0
[ 192.733078] [<ffffffff8104b09c>] ? kthread_freezable_should_stop+0x46/0x46
[ 192.734223] Code: 48 8b 80 30 03 00 00 48 3b b0 20 22 00 00 48 8d 90 c8 1f 00 00 48 8d 41 a8 48 0f 44 f2 48 8d 97 a0 00 00 00 48 39 d1 75 15 0f 0b <48> 39 70 50 75 05 48 8b 40 30 c3 48 8b 40 58 48 83 e8 58 48 8d
[ 192.736800] RIP [<ffffffffa00a6962>] i915_gem_obj_offset+0x36/0x57 [i915]
[ 192.738029] RSP <ffff880241a79d00>
[ 192.739208] CR2: fffffffffffffff8
[ 192.740402] ---[ end trace 2840c64b8122deca ]---
[ 192.741650] BUG: unable to handle kernel paging request at ffffffffffffffd8

Created attachment 83757 [details]
dmesg with tlb invalidation fix patch
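As an aside on the serialisation argument above (render -> bsd -> blt -> vecs): a toy model, not i915 code, of why a ring that ignores its semaphore wait can retire its request early and drop the shared object from the active list while earlier rings are still busy. All names here are illustrative:

```c
/* Toy model of inter-ring serialisation via semaphores. Each ring is
 * supposed to wait for the semaphore signalled by the previous ring in
 * the chain before executing the handed-over batch. */
#include <stdbool.h>
#include <stdio.h>

enum ring { RENDER, BSD, BLT, VECS, NUM_RINGS };

static bool signalled[NUM_RINGS];	/* semaphore signalled by each ring */

static void run_ring(enum ring r, bool honour_semaphore)
{
	if (r != RENDER && honour_semaphore && !signalled[r - 1]) {
		printf("ring %d: blocked waiting on semaphore from ring %d\n",
		       (int)r, (int)r - 1);
		return;
	}
	printf("ring %d: executed and retired its request\n", (int)r);
	signalled[r] = true;
}

int main(void)
{
	/* Only render has run; bsd and blt have not signalled yet. */
	run_ring(RENDER, true);

	/* vecs with its semaphore honoured: it must block here. */
	run_ring(VECS, true);

	/* vecs with the semaphore ignored (the suspected bug): it retires
	 * early, so the shared object drops off the active list even
	 * though bsd/blt never completed. */
	run_ring(VECS, false);

	return 0;
}
```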
Please try:

https://patchwork.kernel.org/patch/2843323/

(In reply to comment #16)
> Please try:
> https://patchwork.kernel.org/patch/2843323/

Fixed by this patch.

(In reply to comment #17)
> (In reply to comment #16)
> > Please try:
> > https://patchwork.kernel.org/patch/2843323/
>
> Fixed by this patch.

Patch merged, closing bug.

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=d6fc62c1699f4331c6b05a4b82a7796f8d281af3

Thanks everybody for reporting, testing and fixing,
Paulo

Verified. Fixed.

Closing old verified bug.

Created attachment 134708 [details]
attachment-4239-0.html
PRC National Holiday Oct. 1 ~ Oct. 8. Email response may be very slow.
Created attachment 80319 [details]
dmesg

System Environment:
--------------------------
Arch:     x86_64
Platform: Haswell
Kernel:   (drm-intel-next-queued) d7697eea3eec74c561d12887d892c53ac4380c00

Bug detailed description:
-------------------------
It randomly aborts on Haswell with the drm-intel-next-queued kernel; it happens in roughly 1 out of 5 runs. It does not happen on the drm-intel-fixes kernel. I ran it for 10 cycles on IVB and it works well.

output:
gem_seqno_wrap: gem_seqno_wrap.c:501: preset_run_once: Assertion `write_seqno(1) == 0' failed.
Aborted (core dumped)

Reproduce steps:
----------------
1. ./gem_seqno_wrap
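For background on what the test exercises: a minimal sketch, assuming the usual wrap-safe signed comparison that i915-style drivers use for seqnos; gem_seqno_wrap stresses exactly this by presetting the driver's next seqno close to the 32-bit limit and then submitting work across the wrap. The helper name below is illustrative, not the driver's:

```c
/* Conceptual sketch (not driver code): seqno comparisons done with
 * signed 32-bit arithmetic stay correct across a 32-bit wrap, which is
 * the property gem_seqno_wrap is designed to exercise. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static int seqno_passed(uint32_t seq1, uint32_t seq2)
{
	/* True if seq1 is at or after seq2, modulo wrap. */
	return (int32_t)(seq1 - seq2) >= 0;
}

int main(void)
{
	uint32_t before_wrap = 0xfffffff0;
	uint32_t after_wrap  = 0x00000010;	/* 0x20 requests later */

	assert(seqno_passed(after_wrap, before_wrap));
	assert(!seqno_passed(before_wrap, after_wrap));
	printf("wrap-safe comparison holds\n");
	return 0;
}
```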