Bug 68298

Summary: [PNV Regression] glxgears causes call trace and system hang
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Ben Widawsky <ben>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high    
Version: unspecified   
Hardware: All   
OS: Linux (All)   
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                            Flags
dmesg                                  none
dmesg(26b32d77b2469)                   none
Patch to rework the exec_list trick    none
More vma fixups                        none
dmesg                                  none

Description lu hua 2013-08-20 01:25:10 UTC
Created attachment 84297 [details]
dmesg

System Environment:
--------------------------
Platform:  Pineview
Kernel:    (drm-intel-next-nightly)e0fd293e471707a301fb9d8f83295265b9315161

Bug detailed description:
-----------------------------
Running glxgears causes a call trace and a system hang. It happens on the -nightly and -queued kernels. It works well on the -fixes kernel.
The latest known good commit: 030c22d0c4a7a98968db820d3a9759197d580217
The latest known bad commit: 815ca9817afc1c1a45799fb9d4a4f413b7b1517c

Call trace:
[  337.005193] Call Trace:
[  337.005209]  [<c0bf78e0>] ? start_kernel+0x318/0x31c
[  337.005216] Code: e8 4c e8 fa ff 83 3d e0 47 bf c0 00 74 10 e8 d6 4b 02 00 fb 89 e2 81 e2 00 e0 ff ff eb 0d e8 55 74 00 00 85 c0 75 e7 eb 10 f3 90 <8b> 42 08 a8 08 74 f7 e8 4a 4b 02 00 eb 55 89 e0 25 00 e0 ff ff
[  337.005009] NMI backtrace for cpu 1
[  337.005009] CPU: 1 PID: 3892 Comm: X Not tainted 3.11.0-rc2_drm-intel-next-queued_815ca9_20130818_+ #7128
[  337.005009] Hardware name: MICRO-STAR INTERNATIONAL CO., LTD MS-N014/MS-N014, BIOS EN014IMS.10B 11/30/2009
[  337.005009] task: f6146e10 ti: c37de000 task.ti: c37de000
[  337.005009] EIP: 0060:[<c04796ac>] EFLAGS: 00203006 CPU: 1
[  337.005009] EIP is at __const_udelay+0x0/0x1a
[  337.005009] EAX: 00418958 EBX: 00002710 ECX: c0a93665 EDX: fffff000
[  337.005009] ESI: f67e8d5c EDI: 000006e0 EBP: c0ba94a0 ESP: c37dfc8c
[  337.005009]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  337.005009] CR0: 80050033 CR2: b75a2e40 CR3: 363e6000 CR4: 000007d0
[  337.005009] Stack:
[  337.005009]  c021febf c0ba94a0 c027d79c c0a9c5c0 0000ea60 00000f8d 00000f8c 000006e0
[  337.005009]  00000000 00000000 00000001 f6146e10 f6146e10 00000000 00000001 c37dfd60
[  337.005009]  c02347a3 f67e8bf4 771044e8 0000004e c025fe2d c0260003 f67e8bf4 f67e899c
[  337.005009] Call Trace:
[  337.005009]  [<c021febf>] ? arch_trigger_all_cpu_backtrace+0x57/0x60
[  337.005009]  [<c027d79c>] ? rcu_check_callbacks+0x141/0x3d5
[  337.005009]  [<c02347a3>] ? update_process_times+0x2a/0x4e
[  337.005009]  [<c025fe2d>] ? tick_sched_handle+0x2d/0x37
[  337.005009]  [<c0260003>] ? tick_sched_timer+0x28/0x4b
[  337.005009]  [<c024227e>] ? __run_hrtimer.isra.23+0x3b/0x88
[  337.005009]  [<c02429d2>] ? hrtimer_interrupt+0xf0/0x1e7
[  337.005009]  [<c021e76a>] ? local_apic_timer_interrupt+0x3d/0x3f
[  337.005009]  [<c021ea78>] ? smp_apic_timer_interrupt+0x2b/0x39
[  337.005009]  [<c087535d>] ? apic_timer_interrupt+0x2d/0x34
[  337.005009]  [<f82af6a4>] ? i915_gem_execbuffer_reserve+0x1d5/0x2d6 [i915]
[  337.005009]  [<f82affee>] ? i915_gem_do_execbuffer.isra.17+0x48c/0xd69 [i915]
[  337.005009]  [<f82ab508>] ? i915_gem_obj_bound_any+0x28/0x43 [i915]
[  337.005009]  [<f82ab508>] ? i915_gem_obj_bound_any+0x28/0x43 [i915]
[  337.005009]  [<f82ab500>] ? i915_gem_obj_bound_any+0x20/0x43 [i915]
[  337.005009]  [<c02b1754>] ? __kmalloc+0xbe/0xe0
[  337.005009]  [<f82b0e3c>] ? i915_gem_execbuffer2+0x12e/0x1c2 [i915]
[  337.005009]  [<f82b0d0e>] ? i915_gem_execbuffer+0x443/0x443 [i915]
[  337.005009]  [<f8103c1e>] ? drm_ioctl+0x23d/0x323 [drm]
[  337.005009]  [<f82b0d0e>] ? i915_gem_execbuffer+0x443/0x443 [i915]
[  337.005009]  [<c02a0328>] ? handle_pte_fault+0x274/0x5e3
[  337.005009]  [<f81039e1>] ? drm_copy_field+0x47/0x47 [drm]
[  337.005009]  [<c02c04fc>] ? vfs_ioctl+0x18/0x21
[  337.005009]  [<c02c0ec8>] ? do_vfs_ioctl+0x3ec/0x42c
[  337.005009]  [<c08778c9>] ? __do_page_fault+0x400/0x43b
[  337.005009]  [<c087787d>] ? __do_page_fault+0x3b4/0x43b
[  337.005009]  [<c02b505a>] ? vfs_read+0xd4/0x11e
[  337.005009]  [<c02c0f51>] ? SyS_ioctl+0x49/0x74
[  337.005009]  [<c087945a>] ? sysenter_do_call+0x12/0x22
[  337.005009] Code: c0 74 1f eb 0a 8d 76 00 8d bc 27 00 00 00 00 eb 0e 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 48 75 fd 48 c3 ff 15 50 43 bb c0 c3 <64> 8b 15 fc 5b a3 c3 69 d2 fa 00 00 00 c1 e0 02 f7 e2 8d 42 01


Reproduce steps:
----------------------------
1. xinit
2. glxgears
Comment 1 Daniel Vetter 2013-08-20 05:13:12 UTC
Hm, the backtrace is a bit unhelpful ... Can you please attempt to bisect this regression?
Comment 2 lu hua 2013-08-20 07:15:15 UTC
Bisect shows: 04038a515d6eda6dd0857c0ade0b3950d372f4c0 is the first bad commit
commit 04038a515d6eda6dd0857c0ade0b3950d372f4c0
Author:     Ben Widawsky <ben@bwidawsk.net>
AuthorDate: Wed Aug 14 11:38:36 2013 +0200
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu Aug 15 15:45:45 2013 +0200

    drm/i915: Convert execbuf code to use vmas

    In order to transition more of our code over to using a VMA instead of
    an <OBJ, VM> pair - we must have the vma accessible at execbuf time. Up
    until now, we've only had a VMA when actually binding an object.

    The previous patch helped handle the distinction on bound vs. unbound.
    This patch will help us catch leaks, and other issues before we actually
    shuffle a bunch of stuff around.

    This attempts to convert all the execbuf code to speak in vmas. Since
    the execbuf code is very self contained it was a nice isolated
    conversion.

    The meat of the code is about turning eb_objects into eb_vma, and then
    wiring up the rest of the code to use vmas instead of obj, vm pairs.

    Unfortunately, to do this, we must move the exec_list link from the obj
    structure. This list is reused in the eviction code, so we must also
    modify the eviction code to make this work.

    WARNING: This patch makes an already hotly profiled path slower. The cost is
    unavoidable. In reply to this mail, I will attach the extra data.

    v2: Release table lock early, and two a 2 phase vma lookup to avoid
    having to use a GFP_ATOMIC. (Chris)

    v3: s/obj_exec_list/obj_exec_link/
    Updates to address
    commit 6d2b888569d366beb4be72cacfde41adee2c25e1
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Wed Aug 7 18:30:54 2013 +0100

        drm/i915: List objects allocated from stolen memory in debugfs

    v4: Use obj = vma->obj for neatness in some places (Chris)
    need_reloc_mappable() should return false if ppgtt (Chris)

    Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
    [danvet: Split out prep patches. Also remove a FIXME comment which is
    now taken care of.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 3 Chris Wilson 2013-08-20 07:57:01 UTC
Ben, here's the gen3 failure case we expected to see last night after Dan's for_each_safe email.
Comment 4 Daniel Vetter 2013-08-20 12:23:25 UTC
Please test this patch from Chris:

https://patchwork.kernel.org/patch/2847056/
Comment 5 lu hua 2013-08-21 07:55:47 UTC
(In reply to comment #4)
> Please test this patch from Chris:
> 
> https://patchwork.kernel.org/patch/2847056/

I tested it on commit 26b32d77b2469 (which includes this patch). The issue still exists.
Comment 6 lu hua 2013-08-21 07:56:16 UTC
Created attachment 84370 [details]
dmesg(26b32d77b2469)
Comment 7 Daniel Vetter 2013-08-21 08:39:38 UTC
Hm 0x00100104, so LIST_POISON1 + 0x4, i.e. we've chased a ->next pointer somewhere and then dereffed ->next->prev. Doesn't smell good ...
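
For readers following the arithmetic in comment 7, here is a minimal userspace sketch of where 0x00100104 comes from. It assumes the 32-bit layout of the kernel's struct list_head (a 4-byte next pointer followed by a 4-byte prev pointer) and the LIST_POISON1/LIST_POISON2 values from include/linux/poison.h with a zero POISON_POINTER_DELTA. The type list_head32 and both helper functions are hypothetical names made up for this illustration; they are not part of i915 or of the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Poison values written by list_del(); assumed here with a zero
 * POISON_POINTER_DELTA, as on 32-bit x86. */
#define LIST_POISON1 0x00100100u /* stored in entry->next */
#define LIST_POISON2 0x00200200u /* stored in entry->prev */

/* Hypothetical model of struct list_head on a 32-bit kernel:
 * each pointer is represented as a 4-byte integer. */
struct list_head32 {
    uint32_t next;
    uint32_t prev;
};

/* On 32-bit, prev sits right after the 4-byte next pointer. */
static_assert(offsetof(struct list_head32, prev) == 4,
              "prev must be at offset 4 in the 32-bit model");

/* Address that gets touched when code evaluates entry->next->prev
 * after entry has already been removed with list_del(). */
static uint32_t poisoned_next_prev_addr(void)
{
    return LIST_POISON1 + (uint32_t)offsetof(struct list_head32, prev);
}

/* Returns 1 if a faulting address falls inside a list_head placed at
 * LIST_POISON1, which hints at a walk over a deleted list entry. */
static int looks_like_list_poison1(uint32_t fault_addr)
{
    return fault_addr >= LIST_POISON1 &&
           fault_addr - LIST_POISON1 < sizeof(struct list_head32);
}
```

Under these assumptions, poisoned_next_prev_addr() evaluates to 0x00100104, the address quoted in comment 7. That is consistent with code following a ->next pointer of an entry that was already deleted and then touching ->prev of the poisoned target.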
Comment 8 Daniel Vetter 2013-08-22 07:18:00 UTC
Created attachment 84432 [details] [review]
Patch to rework the exec_list trick

Please test this patch on top of latest -nightly.
Comment 9 Daniel Vetter 2013-08-22 10:26:56 UTC
Created attachment 84445 [details] [review]
More vma fixups

Updated patch to address a now bogus WARN.
Comment 10 lu hua 2013-08-23 06:31:59 UTC
Created attachment 84499 [details]
dmesg

It still exists with this patch.
Comment 11 Daniel Vetter 2013-08-23 20:35:21 UTC
Can you please test this patch?

https://patchwork.kernel.org/patch/2848475/
Comment 12 lu hua 2013-08-26 03:12:19 UTC
(In reply to comment #11)
> Can you please test this patch?
> 
> https://patchwork.kernel.org/patch/2848475/

Fixed by this patch.
Comment 13 Daniel Vetter 2013-08-26 19:19:45 UTC
Fixed with

commit f833c65abf79c2456fe8e8c487e3d78b9c329daa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 11:23:47 2013 +0200

    drm/i915: More vma fixups around unbind/destroy
Comment 14 lu hua 2013-08-28 05:11:37 UTC
Verified. Fixed.
Comment 15 Elizabeth 2017-10-06 14:43:50 UTC
Closing old verified.
