Summary: | [IVB/HSW ppgtt bisected] When run nightly piglit testing causes X no responsive | ||||||
---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | lu hua <huax.lu> | ||||
Component: | DRM/Intel | Assignee: | Daniel Vetter <daniel> | ||||
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||
Severity: | blocker | ||||||
Priority: | high | CC: | eero.t.tamminen, intel-gfx-bugs, mengmeng.meng | ||||
Version: | unspecified | ||||||
Hardware: | All | ||||||
OS: | Linux (All) | ||||||
Whiteboard: | |||||||
i915 platform: | i915 features: | ||||||
Attachments: |
|
Description
lu hua
2014-01-08 06:19:30 UTC
Is this a regression compared to Ben's -nightly branch? Also, what happens when you boot with i915.i915_enable_ppgtt=1 added to the kernel cmdline? It's not a new regression, but from ppgtt merge. It was first reported in the original performance regression report (bug 72906, see the first dmesg in the initial report) but is an independent issue. It might be bisectable, but I doubt it. I haven't merged all the fixups in Ben's vacation-nightly branch yet since I want to stall a bit for review. Hence the question whether this is a regression compared to that tree. I was just saying the bug was reported before your vacation and afaik remains unfixed. :) (In reply to comment #3) > I haven't merged all the fixups in Ben's vacation-nightly branch yet since I > want to stall a bit for review. Hence the question whether this is a > regression compared to that tree. It happens before Ben's vacation-nightly branch. Bisect shows: 7e0d96bc03c140cb8183955ad6f0290caa731e64 is the first bad commit. commit 7e0d96bc03c140cb8183955ad6f0290caa731e64 Author: Ben Widawsky <ben@bwidawsk.net> Date: Fri Dec 6 14:11:26 2013 -0800 drm/i915: Use multiple VMs -- the point of no return As with processes which run on the CPU, the goal of multiple VMs is to provide process isolation. Specific to GEN, there is also the ability to map more objects per process (2GB each instead of 2Gb-2k total). For the most part, all the pipes have been laid, and all we need to do is remove asserts and actually start changing address spaces with the context switch. Since prior to this we've converted the setting of the page tables to a streamed version, this is quite easy. One important thing to point out (since it'd been hotly contested) is that with this patch, every context created will have it's own address space (provided the HW can do it). v2: Disable BDW on rebase NOTE: I tried to make this commit as small as possible. I needed one place where I could "turn everything on" and that is here. It could be split into finer commits, but I didn't really see much point. Cc: Eric Anholt <eric@anholt.net> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Ben Widawsky <ben@bwidawsk.net> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Are we stalled on a resubmit from Chris here? (In reply to comment #7) > Are we stalled on a resubmit from Chris here? Only the vma->bind rework from Chris is stalled on a patch split-up, for the others I'm stalling on review. But tbh I'm not sure which patch applies to this situation here ... lu hua, can you please confirm https://patchwork.kernel.org/patch/3423201/ fixes the bug? (In reply to comment #9) > lu hua, can you please confirm https://patchwork.kernel.org/patch/3423201/ > fixes the bug? Patch failed. Fail at: @@ -544,19 +537,12 @@ i915_gem_execbuffer_reserve_vma(struct i915_vma *vma, struct drm_i915_gem_object *obj = vma->obj; struct drm_i915_gem_exec_object2 *entry = vma->exec_entry; bool has_fenced_gpu_access = INTEL_INFO(ring->dev)->gen < 4; - bool need_fence; unsigned flags; int ret; flags = 0; - - need_fence = - has_fenced_gpu_access && - entry->flags & EXEC_OBJECT_NEEDS_FENCE && - obj->tiling_mode != I915_TILING_NONE; - if (need_fence || need_reloc_mappable(vma)) + if (entry->flags & __EXEC_OBJECT_NEEDS_MAP) flags |= PIN_MAPPABLE; - if (entry->flags & EXEC_OBJECT_NEEDS_GTT) flags |= PIN_GLOBAL; Assigning to Daniel to play chicken with Chris. Where's the full dmesg? This actually looks like the bug fixed by commit 3d4e8c24090300d4f48199de643254a7b41c365c Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Jan 27 16:20:02 2014 +0000 drm/i915: Prevent recursion by retiring requests when the ring is full http://patchwork.freedesktop.org/patch/18857/ (In reply to comment #12) > Where's the full dmesg? This actually looks like the bug fixed by > > commit 3d4e8c24090300d4f48199de643254a7b41c365c > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Mon Jan 27 16:20:02 2014 +0000 > > drm/i915: Prevent recursion by retiring requests when the ring is full > > http://patchwork.freedesktop.org/patch/18857/ Test on latest -nightly kernel(commit 8c56de968c0f3334d942df8deffc0fbada69d7d1),It still happens. It only happens on nightly testing, I can't catch more dmesg. I can reproduce this easily with 'GpuTest /test=plot3d'. Just run it and hit 'esc' -> boom. I tried to analyze it a bit and I think what happens is something like this: i915_vma_unbind(vma) ... i915_gem_object_finish_gpu() ... i915_vma_unbind(same_vma_here) <------- Oops. The current vma was freed already. (In reply to comment #14) > I can reproduce this easily with 'GpuTest /test=plot3d'. Just run it and hit > 'esc' -> boom. > > I tried to analyze it a bit and I think what happens is something like this: > > i915_vma_unbind(vma) > ... > i915_gem_object_finish_gpu() > ... > i915_vma_unbind(same_vma_here) > <------- > Oops. The current vma was freed already. Right, this is somebody calling unbind() without holding their own reference. So who called unbind()? P.S. finish_gpu() doesn't call unbind() in my tree... If just follow the code around I see plenty of codepaths that look like they might lead to such recursion. i915_vma_unbind i915_gem_object_finish_gpu i915_gem_object_wait_rendering i915_gem_object_wait_rendering__tail i915_gem_retire_requests_ring i915_gem_object_move_to_inactive drm_gem_object_unreference i915_gem_free_object i915_vma_unbind ... i915_gem_free_request i915_gem_context_unreference drm_gem_object_unreference i915_gem_free_object i915_vma_unbind ... ppgtt_release i915_gem_evict_vm i915_gem_retire_requests_ring ... i915_gem_retire_requests i915_gem_retire_requests_ring ... I'm not really sure which paths actually are possible. I suppose I could try to trace it in action and see where it goes. This is the backtrace from my last oops BTW. Seems it gets a signal and then tries to clean up, and ends up exploding. Eero showed me an apitrace from the test in question and it doesn't seem to be doing anything special. Just a big draw call w/ 6000000 vertices. I quickly tried to come up with an i-g-t test that would create a context and then do a big rendercopy and kill itself mid operation, but sadly I wasn't able to trigger the bug. [ 83.860988] Modules linked in: x86_pkg_temp_thermal coretemp kvm_intel snd_hda_codec_hdmi kvm snd_hda_codec_realtek hid_generic iTCO_wdt aesni_intel snd_hda_intel aes_x86_64 i915 glue_helper lrw snd_hda_codec gf128mul snd_hwdep ablk_helper snd_pcm cryptd psmouse pcspkr usbhid hid snd_page_alloc i2c_i801 snd_timer lpc_ich mfd_core snd cfbfillrect cfbimgblt soundcore cfbcopyarea drm_kms_helper wmi evdev [ 83.861640] CPU: 0 PID: 2807 Comm: GpuTest Not tainted 3.13.0-recurse+ #1 [ 83.861667] Hardware name: /DZ77BH-55K, BIOS BHZ7710H.86A.0100.2013.0517.0942 05/17/2013 [ 83.861701] task: ffff8800cf3b61e0 ti: ffff8800d014a000 task.ti: ffff8800d014a000 [ 83.861728] RIP: 0010:[<ffffffffa015906d>] [<ffffffffa015906d>] i915_vma_unbind+0x7d/0x220 [i915] [ 83.861780] RSP: 0018:ffff8800d014bd48 EFLAGS: 00010246 [ 83.861803] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8800cf11ce38 RCX: 0000000000000006 [ 83.861830] RDX: 0000000000000006 RSI: 0000000000260013 RDI: ffff8800cf3b61e0 [ 83.861856] RBP: ffff8800d014bd78 R08: 0000000000000002 R09: 0000000000000000 [ 83.861882] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800d01400f0 [ 83.861908] R13: ffff8800d30fc900 R14: 0000000000000000 R15: ffff8800d4fd8000 [ 83.862251] FS: 00007f90e0e4a700(0000) GS:ffff88011f200000(0000) knlGS:0000000000000000 [ 83.862279] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 83.862304] CR2: 00007f90e4afe380 CR3: 00000000cf1de000 CR4: 00000000001407f0 [ 83.862332] Stack: [ 83.862349] ffff8800d014bd98 ffff8800cf11e080 ffff8800d01400f0 0000000000000000 [ 83.862420] ffff8800cd0d9ce0 ffff88011a46d6a0 ffff8800d014bda8 ffffffffa015e375 [ 83.862491] ffff8800d27836a8 ffff8800d27836a8 ffff8800d0140000 ffff8800d44a22e8 [ 83.862560] Call Trace: [ 83.862585] [<ffffffffa015e375>] i915_gem_evict_vm+0x75/0x120 [i915] [ 83.862616] [<ffffffffa015c8cb>] i915_gem_context_free+0x14b/0x150 [i915] [ 83.862647] [<ffffffffa015dbc5>] i915_gem_context_close+0xb5/0xc0 [i915] [ 83.862678] [<ffffffffa0146a38>] i915_driver_preclose+0x38/0x60 [i915] [ 83.862708] [<ffffffff81354398>] drm_release+0x78/0x5c0 [ 83.862734] [<ffffffff81150026>] __fput+0xd6/0x230 [ 83.862757] [<ffffffff811501ce>] ____fput+0xe/0x10 [ 83.862782] [<ffffffff81066327>] task_work_run+0xa7/0xe0 [ 83.862807] [<ffffffff810028a9>] do_notify_resume+0x59/0x80 [ 83.862833] [<ffffffff8129bd0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 83.862861] [<ffffffff815150da>] int_signal+0x12/0x17 [ 83.862884] Code: 0f 85 30 01 00 00 49 83 bd 40 01 00 00 00 0f 84 54 01 00 00 4c 89 ef e8 c2 d5 ff ff 85 c0 41 89 c6 0f 85 a9 00 00 00 48 8b 43 50 <48> 8b 90 a8 00 00 00 48 8b 92 b8 03 00 00 48 81 c2 a0 36 00 00 [ 83.864309] RIP [<ffffffffa015906d>] i915_vma_unbind+0x7d/0x220 [i915] Right, this is just the garbage reference counting of vm. On this path, i915_gem_evict_vm() does not hold a reference to the object it is trying to unbind, so indeed runs afoul of not just that object, but any object in the list (including the next pointer) being freed. See the safeguards i915_gem_evict_something() and __i915_gem_shrink() have in place for exactly this reason. The choice is either to rewrite i915_gem_evict_vm() along the same lines, drop the recursion from unbind() (patch on the list http://patchwork.freedesktop.org/patch/18896/) or fix the reference counting of vm (which is not that difficult and cleans up a chunk of code and concepts). You might as well test the theory that this system hang avoided by applying http://patchwork.freedesktop.org/patch/18896/ (In reply to comment #18) > You might as well test the theory that this system hang avoided by applying > http://patchwork.freedesktop.org/patch/18896/ Test this patch, It still has this issue. With last kernel, “X no responsive” when run GpuTest v0.5 pixmark_volplosion. (In reply to comment #20) > With last kernel, “X no responsive” when run GpuTest v0.5 pixmark_volplosion. The problem exits on SynMark2,too. Another bug worth testing with the tree http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742 (In reply to comment #22) > Another bug worth testing with the tree > http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742 Test this patch, this issue still exists. Created attachment 95130 [details]
dmesg(Ogles3conform)
Ogles3conform case GL3Tests/framebuffer_blit/framebuffer_blit_coverage_framebuffer_status.test also has this issue.
Reproduce steps:
-------------------------
1. xinit
2. ./GTF -width=64 -height=64 -run=GL3Tests/framebuffer_blit/framebuffer_blit_coverage_framebuffer_status.test
I guess we need to disable full ppgtt for now, too many ppl are blocked on this. I keep thinking once we have all the ppgtt fixes in d-i-n, we might be in a much better place to judge what remains broken in QA. commit 2fa91fdd6a6916c04e069a857d8eedfb1d2d9b5c Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Mar 6 09:40:43 2014 +0100 drm/i915: Disable full ppgtt by default Please reopen if there are any regressions left with just aliasing ppgtt. We need to handle those with top priority, given that the 3.15 merge window is approaching fast. It works well on latest -nightly kernel. I'll just add my latest observations here even though this bug is already closed. I wanted to re-test this bug again, but I was unable to reproduce it. The reason turned out to be my Mesa tree which I'd moved forward since the last time I tested this bug. Mesa HEAD as here commit 80c1b9349c861cc023c4f25c329a3c0ed4695b9e Author: Kenneth Graunke <kenneth@whitecape.org> Date: Tue Feb 25 12:21:41 2014 -0800 i965: Convert VUE map generation checks to if rather than switch. when I was unable to reproduce the bug. So I rolled Mesa back to commit 8c4a9f631d7438aeaf56785401891d0773792123 Author: Kenneth Graunke <kenneth@whitecape.org> Date: Mon Nov 11 18:30:32 2013 -0800 i965: Emit 3DSTATE_VF on Broadwell too. and then I was able to reproduce it again using GpuTest. But the good news is that it looks like Chris has managed to fix the issue with this patch: http://lists.freedesktop.org/archives/intel-gfx/2014-March/041576.html Closing old verified+fixed. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.