Summary: | [SNB/IVB/HSW ULT regression] system hang when run nightly testing | |
---|---|---|---
Product: | DRI | Reporter: | lu hua <huax.lu>
Component: | DRM/Intel | Assignee: | Ben Widawsky <ben>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | critical | |
Priority: | highest | CC: | kenneth, przanoni
Version: | unspecified | |
Hardware: | All | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | | i915 features: |
Attachments: | | |

Hi

Is this a recent regression? Can it be bisected?

Thanks,
Paulo

Smells like ppgtt fallout, so one for Ben.

(In reply to comment #1)
> Hi
>
> Is this a recent regression? Can it be bisected?
>
> Thanks,
> Paulo

The latest good commit: 3477e5ea598c88d21f24c00f8fcdfd7f4e837b59(3f577573cd5 6d2b888569d3).
The failure is not reproducible manually. I can't find a good way to bisect it. Do you have any suggestions?

Can you reproduce the hangs when running e.g. the entire i-g-t testsuite? It'll blow through a big pile of cpu time, but if that works I think we should try it out ...

This one doesn't look the same to me as the one invoked by gem_evict_everything (and Ken just hit it too fwiw). The cause is memory pressure and being forced to hit the bound_list while doing execbuf. It's a similar cause to the other one, but from what I can gather this one fails while we're trying to unmap the gtt userspace mappings. Since I do not know much about i915_gem_release_mmap, it might take me a while to come up with some ideas. It is possible this is another pre-existing bug that's just uncovered by VMA.

Is that SHA the use VMAs in execbuffer commit?

I dug a bit. It looks to me the failure is that pages is null here:

    if (!obj->has_dma_mapping)
        dma_unmap_sg(&dev->pdev->dev,
                     obj->pages->sgl, obj->pages->nents,
                     PCI_DMA_BIDIRECTIONAL);

My disasm is a bit too complex to make sense out of at this hour.

Is this still broken with latest kernels?

Created attachment 84430 [details] [review]
quick test patch

Please retest with this patch and check in dmesg whether you're hitting the newly-added WARN anywhere ...

(In reply to comment #6)
> I dug a bit. It looks to me the failure is that pages is null here:
>
> if (!obj->has_dma_mapping)
> dma_unmap_sg(&dev->pdev->dev,
> obj->pages->sgl, obj->pages->nents,
> PCI_DMA_BIDIRECTIONAL);
>
>
> My disasm is a bit too complex to make sense out of at this hour.

I couldn't believe it before, but I did more digging, and it seems to fail on obj->pages->nents, which at least on my compiled obj is done first.

    218a0: 49 8b 84 24 08 01 00    mov 0x108(%r12),%rax    // obj->pages
    218a7: 00
    218a8: 8b 50 08                mov 0x8(%rax),%edx      // obj->pages->nents
    218ab: 48 8b 30                mov (%rax),%rsi
    218ae: 49 8b 85 90 04 00 00    mov 0x490(%r13),%rax    // dev->pdev
    218b5: 48 89 c7                mov %rax,%rdi
    218b8: 48 81 c7 98 00 00 00    add $0x98,%rdi          // pdev->dev

(In reply to comment #6)
Created attachment 84431 [details] [review]
Improved patch, now with a bugfix

Please disregard the earlier patch and test this one here instead.

Created attachment 84444 [details] [review]
More vma fixups

Updated patch to address a now bogus WARN.

(In reply to comment #7)
> Is this still broken with latest kernels?

It still happens on latest -nightly kernel.

Can you please test this patch?
https://patchwork.kernel.org/patch/2848475/

(In reply to comment #11)
> Created attachment 84444 [details] [review] [review]
> More vma fixups
>
> Updated patch to address a now bogus WARN.

Tested this patch on the latest -nightly branch; the issue goes away.

Fixed with

commit f833c65abf79c2456fe8e8c487e3d78b9c329daa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Mon Aug 26 11:23:47 2013 +0200

    drm/i915: More vma fixups around unbind/destroy

It still happens on latest -nightly kernel. It is a bit random. In recent testing, it happened once on SNB. It passed twice on IVB and HSW ULT (once with the patch).

Can you please attach an updated dmesg with the latest backtrace?

Call trace on latest -nightly kernel:

Call Trace:
Sep 3 00:12:19 x-hswu33 kernel: [22027.292350] [<ffffffffa007c854>] ? i915_vma_unbind+0xe2/0x1d1 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292410] [<ffffffffa007d183>] ? __i915_gem_shrink+0xf1/0x162 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292464] [<ffffffffa007d2ee>] ? i915_gem_object_get_pages_gtt+0xfa/0x303 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292527] [<ffffffffa00795f4>] ? i915_gem_object_get_pages+0x54/0x89 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292586] [<ffffffffa007cbda>] ? i915_gem_object_pin+0x238/0x5ce [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292638] [<ffffffff812cba5f>] ? __sg_page_iter_next+0x2b/0x58
Sep 3 00:12:19 x-hswu33 kernel: [22027.292694] [<ffffffffa0082056>] ? gen6_ppgtt_insert_entries+0xf2/0x114 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292754] [<ffffffffa007fe4b>] ? i915_gem_execbuffer_reserve_vma.isra.13+0x79/0x18d [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292820] [<ffffffffa008017c>] ? i915_gem_execbuffer_reserve+0x21d/0x347 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292881] [<ffffffffa0080bfb>] ? i915_gem_do_execbuffer.isra.17+0x4f3/0xe61 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.292943] [<ffffffffa00795f4>] ? i915_gem_object_get_pages+0x54/0x89 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.293002] [<ffffffffa007e405>] ? i915_gem_pwrite_ioctl+0x743/0x7a5 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.293060] [<ffffffffa0081a46>] ? i915_gem_execbuffer2+0x15e/0x1e4 [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.293123] [<ffffffffa000e20d>] ? drm_ioctl+0x2a5/0x3c4 [drm]
Sep 3 00:12:19 x-hswu33 kernel: [22027.293173] [<ffffffffa00818e8>] ? i915_gem_execbuffer+0x37f/0x37f [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.293224] [<ffffffff816f64c0>] ? __do_page_fault+0x3ab/0x449
Sep 3 00:12:19 x-hswu33 kernel: [22027.293269] [<ffffffff810be3da>] ? do_mmap_pgoff+0x2b2/0x341
Sep 3 00:12:19 x-hswu33 kernel: [22027.293317] [<ffffffff810e49be>] ? vfs_ioctl+0x1e/0x31
Sep 3 00:12:19 x-hswu33 kernel: [22027.293354] [<ffffffff810e5194>] ? do_vfs_ioctl+0x3ad/0x3ef
Sep 3 00:12:19 x-hswu33 kernel: [22027.293396] [<ffffffff810e5224>] ? SyS_ioctl+0x4e/0x7e
Sep 3 00:12:19 x-hswu33 kernel: [22027.293435] [<ffffffff816f88d2>] ? system_call_fastpath+0x16/0x1b
Sep 3 00:12:19 x-hswu33 kernel: [22027.293478] Code: 52 0c a0 48 c7 c6 22 30 0d a0 31 c0 e8 ef 00 f9 ff bf c6 a7 00 00 e8 90 5d 24 e1 f6 85 13 01 00 00 10 75 44 48 8b 85 18 01 00 00 <8b> 50 08 48 8b 30 49 8b 84 24 88 02 00 00 48 89 c7 48 81 c7 98
Sep 3 00:12:19 x-hswu33 kernel: [22027.293678] RIP [<ffffffffa0082892>] i915_gem_gtt_finish_object+0x68/0xbd [i915]
Sep 3 00:12:19 x-hswu33 kernel: [22027.293746] RSP <ffff880028e4b9e8>
Sep 3 00:12:19 x-hswu33 kernel: [22027.293773] CR2: 0000000000000008

Created attachment 85095 [details]
dmesg on nightly 8fdad4

Created attachment 85141 [details] [review]
An idea.

Can you try the attached patch to see if that makes the bug vanish, or if it catches anything?

Created attachment 85148 [details] [review]
01- Rename olr

Created attachment 85149 [details] [review]
02 - Preallocate request

Created attachment 85153 [details] [review]
Hold a reference whilst shrinking the objects

Third time's the charm.

For posterity... I've discussed this with Chris quite a bit, and thought about it myself. I think it's definitely feasible to end up freeing an object while going through i915_vma_unbind. I can't see how this problem is special to the vma addition though, since the theoretical issue of the shrinker being invoked due to an object being unbound (and the request being added) is not new. Either way, I have a few of my own hacks we can try.

I think the last patch from Chris (https://bugs.freedesktop.org/attachment.cgi?id=85153) is flawed in that it doesn't prevent invalid ptr access on return to i915_vma_unbind.

However, I think the two obsoleted patches before that should prevent the problem (though it's duct tape to be sure, since the next user of malloc can easily hit this).

The recent patch https://bugs.freedesktop.org/attachment.cgi?id=85153 seems to be useful to fix a bug we haven't hit with consecutive shrinker recursion (though I haven't charted out how that actually can happen).

My ratio of hitting the problem is quite low. I've been running the previous 2 patches for several hours and haven't hit the problem.

Created attachment 85158 [details] [review]
Hold a reference whilst shrinking the objects

I'm going to pretend that this was v3.

(In reply to comment #25)
> However, I think the two obsoleted patches before that should prevent the
> problem (though it's duct tape to be sure, since the next user of malloc can
> easily hit this).

I wouldn't have been motivated to write the third patch unless the first two failed... And now I really must get some sleep.

Created attachment 85194 [details] [review]
Hold a reference whilst shrinking the objects

Clean version rebased against -nightly

This is blocking Ben's big PPGTT work.

commit 57094f82465002fbde1447e2fd850e1179bf6d86
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Sep 4 10:45:50 2013 +0100

    drm/i915: Hold an object reference whilst we shrink it

Verified. Fixed.

Closing verified+fixed

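The idea behind the final fix (hold an object reference while the shrinker works on it) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every name below (`model_obj`, `model_unbind`, `model_shrink_one`, and so on) is hypothetical and none of it is i915 code. It assumes, as suggested in the discussion above, that unbinding may drop the last reference to an object and so free it while the caller is still using it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a refcounted object; not an i915 structure. */
struct model_obj {
    int refcount;
    bool bound;
    bool *freed_flag;   /* caller-owned flag, set when the object is freed */
};

static struct model_obj *model_obj_new(bool *freed_flag)
{
    struct model_obj *obj = malloc(sizeof(*obj));

    if (!obj)
        return NULL;
    obj->refcount = 1;  /* the single reference held by the binding */
    obj->bound = true;
    obj->freed_flag = freed_flag;
    *freed_flag = false;
    return obj;
}

static void model_obj_get(struct model_obj *obj)
{
    obj->refcount++;
}

static void model_obj_put(struct model_obj *obj)
{
    assert(obj->refcount > 0);
    if (--obj->refcount == 0) {
        *obj->freed_flag = true;
        free(obj);
    }
}

/* Assumption: unbinding releases the binding's reference, possibly the last. */
static void model_unbind(struct model_obj *obj)
{
    obj->bound = false;
    model_obj_put(obj);
}

/*
 * Shrink one object while holding an extra reference across the unbind.
 * Returns whether the object was still allocated right after unbinding,
 * i.e. whether touching it at that point was safe.
 */
static bool model_shrink_one(struct model_obj *obj)
{
    bool still_allocated;

    model_obj_get(obj);                    /* pin the object */
    model_unbind(obj);                     /* cannot free it: we hold a ref */
    still_allocated = !*obj->freed_flag;   /* safe access after unbind */
    model_obj_put(obj);                    /* now the object may be freed */
    return still_allocated;
}
```

Without the `model_obj_get()`/`model_obj_put()` pair, `model_unbind()` would free the object and the following access would be a use-after-free, which is the class of failure discussed in this thread.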
Created attachment 84120 [details]
messages log

System Environment:
--------------------------
Platform: Ivybridge/Haswell ULT
Kernel: (drm-intel-nightly)d93f59e86ae93066969fa8ae2a6c9ccc7fc4728d

Bug detailed description:
-----------------------------
When running nightly testing, the system hangs with a call trace. I can't reproduce it manually. It happens on Ivybridge and Haswell ULT with the -nightly kernel. It doesn't happen on the -fixed kernel.

BUG info in dmesg:
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

Call Trace:
[ 8081.751843]
[<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915]
[<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915]
[<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915]
[<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915]
[<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915]
[<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915]
[<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915]
[<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915]
[<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915]
[<ffffffff810fc823>] ? might_fault+0x40/0x90
[<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915]
[<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm]
[<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915]
[<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481
[<ffffffff8112fdba>] vfs_ioctl+0x26/0x39
[<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451
[<ffffffff817deda7>] ? sysret_check+0x1b/0x56
[<ffffffff8113073c>] SyS_ioctl+0x57/0x87
[<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff817ded82>] system_call_fastpath+0x16/0x1b
Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00
RIP [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
RSP <ffff88004bdf5958>
CR2: 0000000000000008
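
The fault address of 0x8 reported in CR2 is what one would expect if the `pages` pointer is NULL and the CPU then reads a field located 8 bytes into the structure it points to. The sketch below illustrates that arithmetic. It is an assumption-laden model, not kernel code: `sg_table_model` is a hypothetical stand-in whose layout (a pointer followed by two unsigned ints) is chosen to mirror the `mov 0x8(%rax),%edx` load of `nents` seen in the disassembly in this thread, and `nents_fault_address` is a made-up helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scatter/gather table behind obj->pages. */
struct sg_table_model {
    void *sgl;                /* first field, at offset 0 */
    unsigned int nents;       /* follows the pointer */
    unsigned int orig_nents;
};

/*
 * Compute the address that a read of pages->nents would touch,
 * without actually dereferencing the pointer.
 */
static uintptr_t nents_fault_address(const struct sg_table_model *pages)
{
    return (uintptr_t)pages + offsetof(struct sg_table_model, nents);
}
```

On a typical 64-bit build, where pointers are 8 bytes, `nents_fault_address(NULL)` evaluates to 8, which matches the `CR2: 0000000000000008` value in the report.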