Bug 73383

Summary:

[IVB/HSW ppgtt bisected] When run nightly piglit testing causes X no responsive

Product:

DRI

Reporter:

lu hua <huax.lu>

Component:

DRM/Intel

Assignee:

Daniel Vetter <daniel>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

blocker

Priority:

high

CC:

eero.t.tamminen, intel-gfx-bugs, mengmeng.meng

Version:

unspecified

Hardware:

All

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
dmesg(Ogles3conform)	none

Description lu hua 2014-01-08 06:19:30 UTC

System Environment:
--------------------------
Platform:       Ivybridge/Haswell
Libdrm:		(master)libdrm-2.4.50-5-ge8cbc579651ef55274763c67acb366dd4155e0ce
Mesa:		(master)a61ae2aa01c9b1976ba4107c1564e0c3f94aea4d
Xserver:	(master)xorg-server-1.15.0
Xf86_video_intel:(master)2.99.907-1-g4d8f78bc95f8dd36693f74365dbc3c442fbbf8a9
Cairo:		(master)2afc941d7ffad3bd37a915e82342640772d4df5b
Libva:		(staging)c02f36ba4eb0ab8da46a06419106eafceded6cb9
Libva_intel_driver:(staging)e2e4898bd60aed48b260a4bc380959d9ae2ec1d9
Kernel:	(drm-intel-nightly) 164a4cb4c1431a0689f85507868356fae24da638

Bug detailed description:
-----------------------------
When run nightly Piglit testing on Ivybridge and Haswell, It causes X no responsive. It's not replicable manually.
It happens on -nightly branch, works well on -fixes branch.

The latest known good commit: 02905525c3310ba673676f8041d0ff6be4da0e24
The latest known bad commit: 2c21a51be4b73fbde9d8dd330195fbb15dd4ed4b

dmesg:
[  275.573184] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  275.573226] IP: [<ffffffffa008c473>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
[  275.573272] PGD b410e067 PUD c7525067 PMD 0
[  275.573295] Oops: 0000 [#1] SMP
[  275.573313] Modules linked in: bnep bluetooth rfkill ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc ipv6 dm_mod dcdbas snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek serio_raw pcspkr snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc lpc_ich i2c_i801 mfd_core snd_timer snd soundcore acpi_cpufreq i915 video button drm_kms_helper drm
[  275.573563] CPU: 2 PID: 4309 Comm: arb_color_buffe Not tainted 3.13.0-rc4_drm-intel-nightly_2c21a5_20140106_debug+ #100
[  275.573607] Hardware name: Dell Inc. OptiPlex 9010/03JR84, BIOS A01 05/04/2012
[  275.573638] task: ffff8800c6c93fa0 ti: ffff8800b7568000 task.ti: ffff8800b7568000
[  275.573669] RIP: 0010:[<ffffffffa008c473>]  [<ffffffffa008c473>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
[  275.573718] RSP: 0018:ffff8800b7569d18  EFLAGS: 00010246
[  275.573740] RAX: 0000000000000000 RBX: ffff880003740000 RCX: ffff8800befe5000
[  275.573770] RDX: 0000000000000011 RSI: 0000000000000011 RDI: ffff8801156101c0
[  275.573800] RBP: ffff8800b7569d38 R08: 0000160000000000 R09: ffff880000000000
[  275.573830] R10: 0000000000000010 R11: ffff8801156101c0 R12: ffff8801156101c0
[  275.573860] R13: 0000000000000001 R14: ffff880118545000 R15: ffff880003740000
[  275.573890] FS:  00007f7e0cb30740(0000) GS:ffff88011e300000(0000) knlGS:0000000000000000
[  275.573923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  275.573948] CR2: 0000000000000008 CR3: 00000000b7443000 CR4: 00000000001407e0
[  275.573977] Stack:
[  275.573987]  ffff8801156101c0 ffff880003348d80 0000000000000000 ffff8801189e3268
[  275.574023]  ffff8800b7569d78 ffffffffa0082d37 ffff8800d4880898 ffff8800d4880898
[  275.574058]  ffff8800d48808f0 ffff880118545000 ffff8801189e3268 ffff8800b746f690
[  275.574094] Call Trace:
[  275.574114]  [<ffffffffa0082d37>] i915_vma_unbind+0xc3/0x180 [i915]
[  275.574147]  [<ffffffffa00873a0>] i915_gem_evict_vm+0x88/0xd2 [i915]
[  275.574180]  [<ffffffffa0085c93>] i915_gem_context_free+0xd7/0x134 [i915]
[  275.574215]  [<ffffffffa0086d1b>] i915_gem_context_close+0x84/0x8b [i915]
[  275.574248]  [<ffffffffa0073e47>] i915_driver_preclose+0x2e/0x49 [i915]
[  275.574281]  [<ffffffffa0073e19>] ? i915_driver_lastclose+0x40/0x40 [i915]
[  275.574316]  [<ffffffffa0003f23>] drm_release+0x77/0x4c4 [drm]
[  275.574343]  [<ffffffff8112b081>] __fput+0xed/0x1e3
[  275.574365]  [<ffffffff8112b1af>] ____fput+0xe/0x10
[  275.574388]  [<ffffffff81054ddc>] task_work_run+0x7e/0x96
[  275.574412]  [<ffffffff810025f8>] do_notify_resume+0x5a/0x6b
[  275.574439]  [<ffffffff813784ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  275.574467]  [<ffffffff8181a8da>] int_signal+0x12/0x17
[  275.574489] Code: 48 c7 c6 39 fa 0e a0 31 c0 e8 ef b8 f7 ff bf c6 a7 00 00 e8 0b ae 2e e1 41 f6 84 24 3b 01 00 00 01 75 44 49 8b 84 24 40 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 90 03 00 00 48 89 c7 48 81 c7 98 00
[  275.574651] RIP  [<ffffffffa008c473>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
[  275.574689]  RSP <ffff8800b7569d18>
[  275.574704] CR2: 0000000000000008
[  275.578714] ---[ end trace 34d8cec1bb111f7b ]---

Reproduce steps:
-------------------------
Run all piglit cases

Comment 1 Daniel Vetter 2014-01-08 16:30:13 UTC

Is this a regression compared to Ben's -nightly branch?

Also, what happens when you boot with i915.i915_enable_ppgtt=1 added to the kernel cmdline?

Comment 2 Chris Wilson 2014-01-08 16:38:28 UTC

It's not a new regression, but from ppgtt merge. It was first reported in the original performance regression report (bug 72906, see the first dmesg in the initial report) but is an independent issue. It might be bisectable, but I doubt it.

Comment 3 Daniel Vetter 2014-01-08 16:40:29 UTC

I haven't merged all the fixups in Ben's vacation-nightly branch yet since I want to stall a bit for review. Hence the question whether this is a regression compared to that tree.

Comment 4 Chris Wilson 2014-01-08 16:46:18 UTC

I was just saying the bug was reported before your vacation and afaik remains unfixed. :)

Comment 5 lu hua 2014-01-09 03:20:54 UTC

(In reply to comment #3)
> I haven't merged all the fixups in Ben's vacation-nightly branch yet since I
> want to stall a bit for review. Hence the question whether this is a
> regression compared to that tree.

It happens before Ben's vacation-nightly branch.

Comment 6 lu hua 2014-01-09 06:01:58 UTC

Bisect shows: 7e0d96bc03c140cb8183955ad6f0290caa731e64 is the first bad commit.
commit 7e0d96bc03c140cb8183955ad6f0290caa731e64
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Fri Dec 6 14:11:26 2013 -0800

    drm/i915: Use multiple VMs -- the point of no return

    As with processes which run on the CPU, the goal of multiple VMs is to
    provide process isolation. Specific to GEN, there is also the ability to
    map more objects per process (2GB each instead of 2Gb-2k total).

    For the most part, all the pipes have been laid, and all we need to do
    is remove asserts and actually start changing address spaces with the
    context switch. Since prior to this we've converted the setting of the
    page tables to a streamed version, this is quite easy.

    One important thing to point out (since it'd been hotly contested) is
    that with this patch, every context created will have it's own address
    space (provided the HW can do it).

    v2: Disable BDW on rebase

    NOTE: I tried to make this commit as small as possible. I needed one
    place where I could "turn everything on" and that is here. It could be
    split into finer commits, but I didn't really see much point.

    Cc: Eric Anholt <eric@anholt.net>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Comment 7 Ben Widawsky 2014-01-09 23:58:59 UTC

Are we stalled on a resubmit from Chris here?

Comment 8 Daniel Vetter 2014-01-10 07:34:11 UTC

(In reply to comment #7)
> Are we stalled on a resubmit from Chris here?

Only the vma->bind rework from Chris is stalled on a patch split-up, for the others I'm stalling on review. But tbh I'm not sure which patch applies to this situation here ...

Comment 9 Ben Widawsky 2014-01-10 17:11:04 UTC

lu hua, can you please confirm https://patchwork.kernel.org/patch/3423201/ fixes the bug?

Comment 10 lu hua 2014-01-13 05:49:45 UTC

(In reply to comment #9)
> lu hua, can you please confirm https://patchwork.kernel.org/patch/3423201/
> fixes the bug?

Patch failed.
Fail at:
@@ -544,19 +537,12 @@  i915_gem_execbuffer_reserve_vma(struct i915_vma *vma,
 	struct drm_i915_gem_object *obj = vma->obj;
 	struct drm_i915_gem_exec_object2 *entry = vma->exec_entry;
 	bool has_fenced_gpu_access = INTEL_INFO(ring->dev)->gen < 4;
-	bool need_fence;
 	unsigned flags;
 	int ret;
 
 	flags = 0;
-
-	need_fence =
-		has_fenced_gpu_access &&
-		entry->flags & EXEC_OBJECT_NEEDS_FENCE &&
-		obj->tiling_mode != I915_TILING_NONE;
-	if (need_fence || need_reloc_mappable(vma))
+	if (entry->flags & __EXEC_OBJECT_NEEDS_MAP)
 		flags |= PIN_MAPPABLE;
-
 	if (entry->flags & EXEC_OBJECT_NEEDS_GTT)
 		flags |= PIN_GLOBAL;

Comment 11 Ben Widawsky 2014-01-16 19:46:04 UTC

Assigning to Daniel to play chicken with Chris.

Comment 12 Chris Wilson 2014-01-28 22:40:22 UTC

Where's the full dmesg? This actually looks like the bug fixed by

commit 3d4e8c24090300d4f48199de643254a7b41c365c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 27 16:20:02 2014 +0000

    drm/i915: Prevent recursion by retiring requests when the ring is full

http://patchwork.freedesktop.org/patch/18857/

Comment 13 lu hua 2014-01-30 02:32:41 UTC

(In reply to comment #12)
> Where's the full dmesg? This actually looks like the bug fixed by
> 
> commit 3d4e8c24090300d4f48199de643254a7b41c365c
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Jan 27 16:20:02 2014 +0000
> 
>     drm/i915: Prevent recursion by retiring requests when the ring is full
> 
> http://patchwork.freedesktop.org/patch/18857/

Test on latest -nightly kernel(commit 8c56de968c0f3334d942df8deffc0fbada69d7d1),It still happens.

It only happens on nightly testing, I can't catch more dmesg.

Comment 14 Ville Syrjala 2014-02-06 18:44:35 UTC

I can reproduce this easily with 'GpuTest /test=plot3d'. Just run it and hit 'esc' -> boom.

I tried to analyze it a bit and I think what happens is something like this:

i915_vma_unbind(vma)
   ...
   i915_gem_object_finish_gpu()
       ...
           i915_vma_unbind(same_vma_here)
   <-------
   Oops. The current vma was freed already.

Comment 15 Chris Wilson 2014-02-07 11:34:51 UTC

(In reply to comment #14)
> I can reproduce this easily with 'GpuTest /test=plot3d'. Just run it and hit
> 'esc' -> boom.
> 
> I tried to analyze it a bit and I think what happens is something like this:
> 
> i915_vma_unbind(vma)
>    ...
>    i915_gem_object_finish_gpu()
>        ...
>            i915_vma_unbind(same_vma_here)
>    <-------
>    Oops. The current vma was freed already.

Right, this is somebody calling unbind() without holding their own reference. So who called unbind()?

P.S. finish_gpu() doesn't call unbind() in my tree...

Comment 16 Ville Syrjala 2014-02-07 12:40:43 UTC

If just follow the code around I see plenty of codepaths that look like they might lead to such recursion.

i915_vma_unbind
 i915_gem_object_finish_gpu
  i915_gem_object_wait_rendering
   i915_gem_object_wait_rendering__tail
   i915_gem_retire_requests_ring
    i915_gem_object_move_to_inactive
     drm_gem_object_unreference
      i915_gem_free_object
       i915_vma_unbind
        ...
    i915_gem_free_request
     i915_gem_context_unreference
      drm_gem_object_unreference
       i915_gem_free_object
        i915_vma_unbind
         ...
      ppgtt_release
       i915_gem_evict_vm
        i915_gem_retire_requests_ring
         ...
       i915_gem_retire_requests
        i915_gem_retire_requests_ring
         ...

I'm not really sure which paths actually are possible. I suppose I could try to trace it in action and see where it goes.

This is the backtrace from my last oops BTW. Seems it gets a signal and then tries to clean up, and ends up exploding. Eero showed me an apitrace from the test in question and it doesn't seem to be doing anything special. Just a big draw call w/ 6000000 vertices. I quickly tried to come up with an i-g-t test that would create a context and then do a big rendercopy and kill itself mid operation, but sadly I wasn't able to trigger the bug.

[   83.860988] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel snd_hda_codec_hdmi kvm snd_hda_codec_realtek hid_generic iTCO_wdt aesni_intel snd_hda_intel aes_x86_64 i915 glue_helper lrw snd_hda_codec gf128mul snd_hwdep ablk_helper snd_pcm cryptd psmouse pcspkr usbhid hid snd_page_alloc i2c_i801 snd_timer lpc_ich mfd_core snd cfbfillrect cfbimgblt soundcore cfbcopyarea drm_kms_helper wmi evdev
[   83.861640] CPU: 0 PID: 2807 Comm: GpuTest Not tainted 3.13.0-recurse+ #1
[   83.861667] Hardware name:                  /DZ77BH-55K, BIOS BHZ7710H.86A.0100.2013.0517.0942 05/17/2013
[   83.861701] task: ffff8800cf3b61e0 ti: ffff8800d014a000 task.ti: ffff8800d014a000
[   83.861728] RIP: 0010:[<ffffffffa015906d>]  [<ffffffffa015906d>] i915_vma_unbind+0x7d/0x220 [i915]
[   83.861780] RSP: 0018:ffff8800d014bd48  EFLAGS: 00010246
[   83.861803] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800cf11ce38 RCX: 0000000000000006
[   83.861830] RDX: 0000000000000006 RSI: 0000000000260013 RDI: ffff8800cf3b61e0
[   83.861856] RBP: ffff8800d014bd78 R08: 0000000000000002 R09: 0000000000000000
[   83.861882] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800d01400f0
[   83.861908] R13: ffff8800d30fc900 R14: 0000000000000000 R15: ffff8800d4fd8000
[   83.862251] FS:  00007f90e0e4a700(0000) GS:ffff88011f200000(0000) knlGS:0000000000000000
[   83.862279] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   83.862304] CR2: 00007f90e4afe380 CR3: 00000000cf1de000 CR4: 00000000001407f0
[   83.862332] Stack:
[   83.862349]  ffff8800d014bd98 ffff8800cf11e080 ffff8800d01400f0 0000000000000000
[   83.862420]  ffff8800cd0d9ce0 ffff88011a46d6a0 ffff8800d014bda8 ffffffffa015e375
[   83.862491]  ffff8800d27836a8 ffff8800d27836a8 ffff8800d0140000 ffff8800d44a22e8
[   83.862560] Call Trace:
[   83.862585]  [<ffffffffa015e375>] i915_gem_evict_vm+0x75/0x120 [i915]
[   83.862616]  [<ffffffffa015c8cb>] i915_gem_context_free+0x14b/0x150 [i915]
[   83.862647]  [<ffffffffa015dbc5>] i915_gem_context_close+0xb5/0xc0 [i915]
[   83.862678]  [<ffffffffa0146a38>] i915_driver_preclose+0x38/0x60 [i915]
[   83.862708]  [<ffffffff81354398>] drm_release+0x78/0x5c0
[   83.862734]  [<ffffffff81150026>] __fput+0xd6/0x230
[   83.862757]  [<ffffffff811501ce>] ____fput+0xe/0x10
[   83.862782]  [<ffffffff81066327>] task_work_run+0xa7/0xe0
[   83.862807]  [<ffffffff810028a9>] do_notify_resume+0x59/0x80
[   83.862833]  [<ffffffff8129bd0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[   83.862861]  [<ffffffff815150da>] int_signal+0x12/0x17
[   83.862884] Code: 0f 85 30 01 00 00 49 83 bd 40 01 00 00 00 0f 84 54 01 00 00 4c 89 ef e8 c2 d5 ff ff 85 c0 41 89 c6 0f 85 a9 00 00 00 48 8b 43 50 <48> 8b 90 a8 00 00 00 48 8b 92 b8 03 00 00 48 81 c2 a0 36 00 00
[   83.864309] RIP  [<ffffffffa015906d>] i915_vma_unbind+0x7d/0x220 [i915]

Comment 17 Chris Wilson 2014-02-07 13:42:09 UTC

Right, this is just the garbage reference counting of vm.

On this path, i915_gem_evict_vm() does not hold a reference to the object it is trying to unbind, so indeed runs afoul of not just that object, but any object in the list (including the next pointer) being freed. See the safeguards i915_gem_evict_something() and __i915_gem_shrink() have in place for exactly this reason.

The choice is either to rewrite i915_gem_evict_vm() along the same lines, drop the recursion from unbind() (patch on the list http://patchwork.freedesktop.org/patch/18896/) or fix the reference counting of vm (which is not that difficult and cleans up a chunk of code and concepts).

Comment 18 Chris Wilson 2014-02-12 19:40:21 UTC

You might as well test the theory that this system hang avoided by applying http://patchwork.freedesktop.org/patch/18896/

Comment 19 lu hua 2014-02-14 06:53:55 UTC

(In reply to comment #18)
> You might as well test the theory that this system hang avoided by applying
> http://patchwork.freedesktop.org/patch/18896/

Test this patch, It still has this issue.

Comment 20 meng 2014-02-18 09:17:16 UTC

With last kernel, “X no responsive” when run GpuTest v0.5 pixmark_volplosion.

Comment 21 meng 2014-02-19 05:36:33 UTC

(In reply to comment #20)
> With last kernel, “X no responsive” when run GpuTest v0.5 pixmark_volplosion.

The problem exits on SynMark2,too.

Comment 22 Chris Wilson 2014-02-24 10:23:57 UTC

Another bug worth testing with the tree http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742

Comment 23 lu hua 2014-02-26 01:07:56 UTC

(In reply to comment #22)
> Another bug worth testing with the tree
> http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742

Test this patch, this issue still exists.

Comment 24 lu hua 2014-03-05 02:17:20 UTC

Created attachment 95130 [details]
dmesg(Ogles3conform)

Ogles3conform case  GL3Tests/framebuffer_blit/framebuffer_blit_coverage_framebuffer_status.test also has this issue.

Reproduce steps:
-------------------------
1. xinit
2. ./GTF -width=64 -height=64 -run=GL3Tests/framebuffer_blit/framebuffer_blit_coverage_framebuffer_status.test

Comment 25 Daniel Vetter 2014-03-05 16:48:22 UTC

I guess we need to disable full ppgtt for now, too many ppl are blocked on this.

Comment 26 Chris Wilson 2014-03-05 17:06:36 UTC

I keep thinking once we have all the ppgtt fixes in d-i-n, we might be in a much better place to judge what remains broken in QA.

Comment 27 Daniel Vetter 2014-03-06 15:15:02 UTC

commit 2fa91fdd6a6916c04e069a857d8eedfb1d2d9b5c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Mar 6 09:40:43 2014 +0100

    drm/i915: Disable full ppgtt by default


Please reopen if there are any regressions left with just aliasing ppgtt. We need to handle those with top priority, given that the 3.15 merge window is approaching fast.

Comment 28 lu hua 2014-03-07 03:07:07 UTC

It works well on latest -nightly kernel.

Comment 29 Ville Syrjala 2014-03-12 17:44:39 UTC

I'll just add my latest observations here even though this bug is already closed.

I wanted to re-test this bug again, but I was unable to reproduce it. The reason turned out to be my Mesa tree which I'd moved forward since the last time I tested this bug.

Mesa HEAD as here
 commit 80c1b9349c861cc023c4f25c329a3c0ed4695b9e
 Author: Kenneth Graunke <kenneth@whitecape.org>
 Date:   Tue Feb 25 12:21:41 2014 -0800

    i965: Convert VUE map generation checks to if rather than switch.

when I was unable to reproduce the bug.

So I rolled Mesa back to
 commit 8c4a9f631d7438aeaf56785401891d0773792123
 Author: Kenneth Graunke <kenneth@whitecape.org>
 Date:   Mon Nov 11 18:30:32 2013 -0800

    i965: Emit 3DSTATE_VF on Broadwell too.

and then I was able to reproduce it again using GpuTest.

But the good news is that it looks like Chris has managed to fix the issue with this patch:
http://lists.freedesktop.org/archives/intel-gfx/2014-March/041576.html

Comment 30 Jari Tahvanainen 2017-07-03 13:54:39 UTC

Closing old verified+fixed.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.