Bug 68171

Summary:

[SNB/IVB/HSW ULT regression] System hang when running nightly testing

Product:

DRI

Reporter:

lu hua <huax.lu>

Component:

DRM/Intel

Assignee:

Ben Widawsky <ben>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

critical

Priority:

highest

CC:

kenneth, przanoni

Version:

unspecified

Hardware:

All

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
messages log	none
quick test patch	none
Improved patch, now with a bugfix	none
More vma fixups	none
dmesg on nightly 8fdad4	none
An idea.	none
01- Rename olr	none
02 - Preallocate request	none
Hold a reference whilst shrinking the objects	none
Hold a reference whilst shrinking the objects	none
Hold a reference whilst shrinking the objects	none

Description lu hua 2013-08-16 03:36:07 UTC

Created attachment 84120 [details]
messages log

System Environment:
--------------------------
Platform:    Ivybridge/Haswell ULT
Kernel:      (drm-intel-nightly)d93f59e86ae93066969fa8ae2a6c9ccc7fc4728d

Bug detailed description:
-----------------------------
When running nightly testing, the system hangs with a call trace. I can't reproduce it manually. It happens on Ivybridge and Haswell ULT with the -nightly kernel.
It doesn't happen on the -fixed kernel.

BUG info in dmesg:
Aug 15 20:12:37 x-ivb9 kernel: [ 8081.750976] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

 Call Trace:
[ 8081.751843]  [<ffffffffa0088e55>] i915_vma_unbind+0xdf/0x1ab [i915]
[<ffffffffa0089026>] __i915_gem_shrink+0x105/0x177 [i915]
[<ffffffffa0089452>] i915_gem_object_get_pages_gtt+0x108/0x309 [i915]
[<ffffffffa0085ba9>] i915_gem_object_get_pages+0x61/0x90 [i915]
[<ffffffffa008f22b>] ? gen6_ppgtt_insert_entries+0x103/0x125 [i915]
[<ffffffffa008a113>] i915_gem_object_pin+0x1fa/0x5df [i915]
[<ffffffffa008cdfe>] i915_gem_execbuffer_reserve_object.isra.6+0x8d/0x1bc [i915]
[<ffffffffa008d156>] i915_gem_execbuffer_reserve+0x229/0x367 [i915]
[<ffffffffa008dbf6>] i915_gem_do_execbuffer.isra.12+0x4dc/0xf3a [i915]
[<ffffffff810fc823>] ? might_fault+0x40/0x90
[<ffffffffa008eb89>] i915_gem_execbuffer2+0x187/0x222 [i915]
[<ffffffffa000971c>] drm_ioctl+0x308/0x442 [drm]
[<ffffffffa008ea02>] ? i915_gem_execbuffer+0x3ae/0x3ae [i915]
[<ffffffff817db156>] ? __do_page_fault+0x3dd/0x481
[<ffffffff8112fdba>] vfs_ioctl+0x26/0x39
[<ffffffff811306a2>] do_vfs_ioctl+0x40e/0x451
[<ffffffff817deda7>] ? sysret_check+0x1b/0x56
[<ffffffff8113073c>] SyS_ioctl+0x57/0x87
[<ffffffff8135bbfe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff817ded82>] system_call_fastpath+0x16/0x1b
 Code: 48 c7 c6 84 30 0e a0 31 c0 e8 d0 e9 f7 ff bf c6 a7 00 00 e8 07 af 2c e1 41 f6 84 24 03 01 00 00 10 75 44 49 8b 84 24 08 01 00 00 <8b> 50 08 48 8b 30 49 8b 86 b0 04 00 00 48 89 c7 48 81 c7 98 00
RIP  [<ffffffffa008fb37>] i915_gem_gtt_finish_object+0x73/0xc8 [i915]
 RSP <ffff88004bdf5958>
 CR2: 0000000000000008

Comment 1 Paulo Zanoni 2013-08-16 21:06:23 UTC

Hi

Is this a recent regression? Can it be bisected?

Thanks,
Paulo

Comment 2 Daniel Vetter 2013-08-18 17:41:06 UTC

Smells like ppgtt fallout, so one for Ben.

Comment 3 lu hua 2013-08-20 05:32:55 UTC

(In reply to comment #1)
> Hi
> 
> Is this a recent regression? Can it be bisected?
> 
> Thanks,
> Paulo

The latest good commit: 3477e5ea598c88d21f24c00f8fcdfd7f4e837b59(3f577573cd5 6d2b888569d3).
The failure is not reproducible manually. I can't find a good way to bisect it. Do you have any suggestions?

Comment 4 Daniel Vetter 2013-08-20 06:00:08 UTC

Can you reproduce the hangs when running e.g. the entire i-g-t testsuite? It'll blow through a big pile of CPU time, but if that works I think we should try it out ...

Comment 5 Ben Widawsky 2013-08-22 06:11:57 UTC

This one doesn't look the same to me as the one invoked by gem_evict_everything (and Ken just hit it too fwiw).

The cause is memory pressure and being forced to hit the bound_list while doing execbuf. It's a similar cause to the other one, but from what I can gather this one fails while we're trying to unmap the gtt userspace mappings. Since I do not know much about i915_gem_release_mmap, it might take me a while to come up with some ideas.

It is possible this is another pre-existing bug that's just uncovered by VMA.

Is that SHA the "use VMAs in execbuffer" commit?

Comment 6 Ben Widawsky 2013-08-22 06:42:19 UTC

I dug a bit. It looks to me the failure is that pages is null here:

if (!obj->has_dma_mapping)
        dma_unmap_sg(&dev->pdev->dev,
                     obj->pages->sgl, obj->pages->nents,
                     PCI_DMA_BIDIRECTIONAL);


My disasm is a bit too complex to make sense out of at this hour.

Comment 7 Daniel Vetter 2013-08-22 06:59:42 UTC

Is this still broken with latest kernels?

Comment 8 Daniel Vetter 2013-08-22 07:13:20 UTC

Created attachment 84430 [details] [review]
quick test patch

Please retest with this patch and check in dmesg whether you're hitting the newly-added WARN anywhere ...

Comment 9 Ben Widawsky 2013-08-22 07:15:16 UTC

(In reply to comment #6)
> I dug a bit. It looks to me the failure is that pages is null here:
> 
> if (!obj->has_dma_mapping)
>         dma_unmap_sg(&dev->pdev->dev,
>                      obj->pages->sgl, obj->pages->nents,
>                      PCI_DMA_BIDIRECTIONAL);
> 
> 
> My disasm is a bit too complex to make sense out of at this hour.

I couldn't believe it before, but after more digging it seems to fail on
obj->pages->nents, which, at least in my compiled object, is evaluated first.
 218a0:       49 8b 84 24 08 01 00    mov    0x108(%r12),%rax    // obj->pages
 218a7:       00
 218a8:       8b 50 08                mov    0x8(%rax),%edx      // obj->pages->nents
 218ab:       48 8b 30                mov    (%rax),%rsi
 218ae:       49 8b 85 90 04 00 00    mov    0x490(%r13),%rax    // dev->pdev
 218b5:       48 89 c7                mov    %rax,%rdi
 218b8:       48 81 c7 98 00 00 00    add    $0x98,%rdi          // pdev->dev
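
The oops output agrees with this reading: CR2 is 0x8, which is the address a load of ->nents would touch if obj->pages were NULL and nents sat 8 bytes into the table. A minimal userspace sketch to check that offset, assuming a layout that mirrors the leading fields of the kernel's struct sg_table (the names mock_sg_table and null_pages_fault_address are hypothetical and do not come from the driver):

```c
#include <stddef.h>

/* Opaque here; only a pointer to it is needed for the layout check. */
struct scatterlist;

/*
 * Hypothetical mirror of the leading fields of struct sg_table,
 * written out only so the field offsets can be inspected in userspace.
 */
struct mock_sg_table {
	struct scatterlist *sgl;   /* offset 0: first scatterlist entry */
	unsigned int nents;        /* directly after the pointer: offset 8 on x86-64 */
	unsigned int orig_nents;
};

/* Address a CPU would access when reading ->nents through a NULL table pointer. */
static size_t null_pages_fault_address(void)
{
	return offsetof(struct mock_sg_table, nents);
}
```

On an LP64 target such as x86-64 the pointer is 8 bytes wide, so the function returns 8, matching both the `mov 0x8(%rax),%edx` instruction and the reported CR2.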

Comment 10 Daniel Vetter 2013-08-22 07:17:16 UTC

Created attachment 84431 [details] [review]
Improved patch, now with a bugfix

Please disregard the earlier patch and test this one here instead.

Comment 11 Daniel Vetter 2013-08-22 10:26:36 UTC

Created attachment 84444 [details] [review]
More vma fixups

Updated patch to address a now bogus WARN.

Comment 12 lu hua 2013-08-23 01:49:45 UTC

(In reply to comment #7)
> Is this still broken with latest kernels?


It still happens on latest -nightly kernel.

Comment 13 Daniel Vetter 2013-08-23 20:36:04 UTC

Can you please test this patch?

https://patchwork.kernel.org/patch/2848475/

Comment 14 lu hua 2013-08-26 01:54:45 UTC

(In reply to comment #11)
> Created attachment 84444 [details] [review]
> More vma fixups
> 
> Updated patch to address a now bogus WARN.


Tested this patch on the latest -nightly branch; the issue goes away.

Comment 15 Daniel Vetter 2013-08-26 19:19:52 UTC

Fixed with

commit f833c65abf79c2456fe8e8c487e3d78b9c329daa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 11:23:47 2013 +0200

    drm/i915: More vma fixups around unbind/destroy

Comment 16 lu hua 2013-08-30 02:15:10 UTC

It still happens on the latest -nightly kernel.
It is a bit random.
In recent testing, it happened once on SNB. It passed twice on IVB and HSW ULT (once with the patch).

Comment 17 Daniel Vetter 2013-08-30 07:34:45 UTC

Can you please attach an updated dmesg with the latest backtrace?

Comment 18 lu hua 2013-09-03 01:45:19 UTC

Call trace on latest -nightly kernel:
Call Trace:
Sep  3 00:12:19 x-hswu33 kernel: [22027.292350]  [<ffffffffa007c854>] ? i915_vma_unbind+0xe2/0x1d1 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292410]  [<ffffffffa007d183>] ? __i915_gem_shrink+0xf1/0x162 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292464]  [<ffffffffa007d2ee>] ? i915_gem_object_get_pages_gtt+0xfa/0x303 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292527]  [<ffffffffa00795f4>] ? i915_gem_object_get_pages+0x54/0x89 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292586]  [<ffffffffa007cbda>] ? i915_gem_object_pin+0x238/0x5ce [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292638]  [<ffffffff812cba5f>] ? __sg_page_iter_next+0x2b/0x58
Sep  3 00:12:19 x-hswu33 kernel: [22027.292694]  [<ffffffffa0082056>] ? gen6_ppgtt_insert_entries+0xf2/0x114 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292754]  [<ffffffffa007fe4b>] ? i915_gem_execbuffer_reserve_vma.isra.13+0x79/0x18d [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292820]  [<ffffffffa008017c>] ? i915_gem_execbuffer_reserve+0x21d/0x347 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292881]  [<ffffffffa0080bfb>] ? i915_gem_do_execbuffer.isra.17+0x4f3/0xe61 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.292943]  [<ffffffffa00795f4>] ? i915_gem_object_get_pages+0x54/0x89 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.293002]  [<ffffffffa007e405>] ? i915_gem_pwrite_ioctl+0x743/0x7a5 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.293060]  [<ffffffffa0081a46>] ? i915_gem_execbuffer2+0x15e/0x1e4 [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.293123]  [<ffffffffa000e20d>] ? drm_ioctl+0x2a5/0x3c4 [drm]
Sep  3 00:12:19 x-hswu33 kernel: [22027.293173]  [<ffffffffa00818e8>] ? i915_gem_execbuffer+0x37f/0x37f [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.293224]  [<ffffffff816f64c0>] ? __do_page_fault+0x3ab/0x449
Sep  3 00:12:19 x-hswu33 kernel: [22027.293269]  [<ffffffff810be3da>] ? do_mmap_pgoff+0x2b2/0x341
Sep  3 00:12:19 x-hswu33 kernel: [22027.293317]  [<ffffffff810e49be>] ? vfs_ioctl+0x1e/0x31
Sep  3 00:12:19 x-hswu33 kernel: [22027.293354]  [<ffffffff810e5194>] ? do_vfs_ioctl+0x3ad/0x3ef
Sep  3 00:12:19 x-hswu33 kernel: [22027.293396]  [<ffffffff810e5224>] ? SyS_ioctl+0x4e/0x7e
Sep  3 00:12:19 x-hswu33 kernel: [22027.293435]  [<ffffffff816f88d2>] ? system_call_fastpath+0x16/0x1b
Sep  3 00:12:19 x-hswu33 kernel: [22027.293478] Code: 52 0c a0 48 c7 c6 22 30 0d a0 31 c0 e8 ef 00 f9 ff bf c6 a7 00 00 e8 90 5d 24 e1 f6 85 13 01 00 00 10 75 44 48 8b 85 18 01 00 00 <8b> 50 08 48 8b 30 49 8b 84 24 88 02 00 00 48 89 c7 48 81 c7 98
Sep  3 00:12:19 x-hswu33 kernel: [22027.293678] RIP  [<ffffffffa0082892>] i915_gem_gtt_finish_object+0x68/0xbd [i915]
Sep  3 00:12:19 x-hswu33 kernel: [22027.293746]  RSP <ffff880028e4b9e8>
Sep  3 00:12:19 x-hswu33 kernel: [22027.293773] CR2: 0000000000000008

Comment 19 lu hua 2013-09-03 01:47:29 UTC

Created attachment 85095 [details]
dmesg on nightly 8fdad4

Comment 20 Chris Wilson 2013-09-03 19:00:50 UTC

Created attachment 85141 [details] [review]
An idea.

Comment 21 Chris Wilson 2013-09-03 19:01:21 UTC

Can you try the attached patch to see if that makes the bug vanish, or if it catches anything?

Comment 22 Chris Wilson 2013-09-03 21:31:21 UTC

Created attachment 85148 [details] [review]
01-  Rename olr

Comment 23 Chris Wilson 2013-09-03 21:31:41 UTC

Created attachment 85149 [details] [review]
02 - Preallocate request

Comment 24 Chris Wilson 2013-09-03 23:22:25 UTC

Created attachment 85153 [details] [review]
Hold a reference whilst shrinking the objects

Third time's the charm.

Comment 25 Ben Widawsky 2013-09-03 23:48:30 UTC

For posterity... I've discussed this with Chris quite a bit, and thought about it myself.

I think it's definitely feasible to end up freeing an object while going through i915_vma_unbind. I can't see how this problem is special to the vma addition though since the theoretical issue of the shrinker being invoked due to an object being unbound (and the request being added) is not new.

Either way, I have a few of my own hacks we can try. I think the last patch from Chris (https://bugs.freedesktop.org/attachment.cgi?id=85153) is flawed in that it doesn't prevent invalid ptr access on return to i915_vma_unbind. However, I think the two obsoleted patches before that should prevent the problem (though it's duct tape to be sure, since the next user of malloc can easily hit this). The recent patch https://bugs.freedesktop.org/attachment.cgi?id=85153 seems to be useful to fix a bug we haven't hit with consecutive shrinker recursion (though I haven't charted out how that actually can happen).

My ratio of hitting the problem is quite low. I've been running the previous 2 patches for several hours and haven't hit the problem.

Comment 26 Chris Wilson 2013-09-04 00:02:30 UTC

Created attachment 85158 [details] [review]
Hold a reference whilst shrinking the objects

I'm going to pretend that this was v3.

Comment 27 Chris Wilson 2013-09-04 00:03:49 UTC

(In reply to comment #25)
> However, I think the two obsoleted patches before that should prevent the
> problem (though it's duct tape to be sure, since the next user of malloc can
> easily hit this).

I wouldn't have been motivated to write the third patch unless the first two failed... And now I really must get some sleep.

Comment 28 Chris Wilson 2013-09-04 12:16:49 UTC

Created attachment 85194 [details] [review]
Hold a reference whilst shrinking the objects

Clean version rebased against -nightly

Comment 29 Daniel Vetter 2013-09-04 15:57:42 UTC

This is blocking Ben's big PPGTT work.

Comment 30 Chris Wilson 2013-09-08 14:57:45 UTC

commit 57094f82465002fbde1447e2fd850e1179bf6d86
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 4 10:45:50 2013 +0100

    drm/i915: Hold an object reference whilst we shrink it
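
The idea in this commit subject can be illustrated with a toy model. This is not the actual patch: toy_obj, toy_get, toy_put, toy_unbind and the two shrink helpers are hypothetical names, and the rule that unbinding drops a reference is an assumption made only so the sketch has a path on which the last reference disappears mid-shrink. A `freed` flag stands in for a real free() so the outcome can be inspected safely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a GEM object; not the real driver structure. */
struct toy_obj {
	int refcount;
	bool bound;
	bool freed;   /* set instead of calling free(), so the sketch stays inspectable */
};

static void toy_get(struct toy_obj *obj)
{
	obj->refcount++;
}

static void toy_put(struct toy_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		obj->freed = true;
}

/* Assumption for this toy model: unbinding drops the reference the binding held. */
static void toy_unbind(struct toy_obj *obj)
{
	obj->bound = false;
	toy_put(obj);
}

/* Without an extra reference: returns true if the object died during the unbind. */
static bool shrink_unsafe(struct toy_obj *obj)
{
	toy_unbind(obj);
	return obj->freed;   /* in real code this read would be a use-after-free */
}

/* With a reference held across the unbind: the object outlives the whole step. */
static bool shrink_safe(struct toy_obj *obj)
{
	bool died_during_unbind;

	toy_get(obj);                     /* pin the object for the duration */
	toy_unbind(obj);
	died_during_unbind = obj->freed;  /* still valid: our reference keeps it alive */
	toy_put(obj);                     /* the object may be released only now */
	return died_during_unbind;
}
```

In this model the object is released only at the final toy_put(), after the shrinker has finished touching it, rather than somewhere inside the unbind.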

Comment 31 lu hua 2013-09-10 05:56:47 UTC

Verified. Fixed.

Comment 32 Jari Tahvanainen 2016-10-07 05:44:50 UTC

Closing verified+fixed
