Bug 76671

Summary: [PNV/ILK Regression]X will be no response when run all piglit cases
Product: DRI Reporter: lu hua <huax.lu>
Component: DRM/IntelAssignee: Ville Syrjala <ville.syrjala>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker    
Priority: highest CC: intel-gfx-bugs, jinxianx.guo
Version: unspecified   
Hardware: All   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg
none
dmesg(patch)
none
Make the fake context less fake
none
boot log
none
Make the fake context less fake
none
Always use kref tracking for all contexts none

Description lu hua 2014-03-27 03:09:54 UTC
Created attachment 96438 [details]
dmesg

System Environment:
--------------------------
Platform: Ironlake
Kernel:  drm-intel-nightly/07071a9d41aed59f4f2ba66afb82ec35557a80c1

Bug detailed description:
-----------------------------
Run all Piglit cases, X will be no response.
It happens on Pineview and Ironlake with -queued and -nightly kernel, It doesn't happen on -fixes kernel.

The latest known good commit:4726e0b045b80c514377da35ca01467ef6a4de53
The latest known bad commit: 57e00b2e1b348b3855f4a3b681d583f8c9cf39fb

dmesg:
[ 1058.784926] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 1058.784981] IP: [<ffffffffa0078b8f>] i915_gem_free_request+0x90/0xb2 [i915]
[ 1058.785042] PGD ca6de067 PUD ca776067 PMD 0
[ 1058.785071] Oops: 0000 [#1] SMP
[ 1058.785094] Modules linked in: bnep bluetooth 6lowpan_iphc rfkill ipv6 dm_mod snd_hda_codec_hdmi firewire_ohci snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support firewire_core lpc_ich serio_raw pcspkr i2c_i801 crc_itu_t mfd_core snd_hda_intel snd_hda_codec snd_hwdep snd_pcm r8169 snd_timer snd soundcore floppy acpi_cpufreq i915 video button drm_kms_helper drm
[ 1058.785334] CPU: 3 PID: 3822 Comm: X Not tainted 3.14.0-rc7_drm-intel-nightly_07071a_20140326+ #1038
[ 1058.785377] Hardware name: Gigabyte Technology Co., Ltd. H55M-UD2H/H55M-UD2H, BIOS F4 12/02/2009
[ 1058.785424] task: ffff880001c6f350 ti: ffff8800ca688000 task.ti: ffff8800ca688000
[ 1058.785463] RIP: 0010:[<ffffffffa0078b8f>]  [<ffffffffa0078b8f>] i915_gem_free_request+0x90/0xb2 [i915]
[ 1058.785518] RSP: 0018:ffff8800ca689d58  EFLAGS: 00010286
[ 1058.785545] RAX: 0000000000000000 RBX: ffff8800ca588a20 RCX: 00000001802a0024
[ 1058.785581] RDX: ffff8800b5f19870 RSI: ffffea0004395540 RDI: ffff8800ca588780
[ 1058.785617] RBP: 0000000000000000 R08: dead000000200200 R09: ffffea0003299b80
[ 1058.785653] R10: dead000000100100 R11: ffffffffa007a04d R12: ffff880002dc1850
[ 1058.785689] R13: ffff880002dc1840 R14: ffff880002c99a00 R15: 00000000fffffff2
[ 1058.785724] FS:  00007fe02a3de8c0(0000) GS:ffff880117cc0000(0000) knlGS:0000000000000000
[ 1058.785764] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1058.785793] CR2: 0000000000000030 CR3: 00000000ca69f000 CR4: 00000000000007e0
[ 1058.785830] Stack:
[ 1058.785842]  ffffea0004395540 ffff880002dc1750 0000000000032e6c ffffffffa007a04d
[ 1058.785886]  ffff8800d35ee000 ffff8800d3768d80 0000000000000000 ffff8800d35ee000
[ 1058.785936]  ffff8800ca689e18 ffffffffa007ace7 0000000000000008 0000000000000008
[ 1058.785983] Call Trace:
[ 1058.786012]  [<ffffffffa007a04d>] ? i915_gem_retire_requests_ring+0x64/0x93 [i915]
[ 1058.786060]  [<ffffffffa007ace7>] ? i915_gem_busy_ioctl+0x63/0xba [i915]
[ 1058.786102]  [<ffffffffa0002f78>] ? drm_ioctl+0x264/0x38d [drm]
[ 1058.786142]  [<ffffffffa007ac84>] ? i915_gem_pread_ioctl+0x419/0x419 [i915]
[ 1058.786186]  [<ffffffff810dfece>] ? do_readv_writev+0x19d/0x1b2
[ 1058.786223]  [<ffffffff81719ca7>] ? __schedule+0x645/0x78b
[ 1058.786256]  [<ffffffff812d8be9>] ? timerqueue_del+0x4c/0x53
[ 1058.786290]  [<ffffffff8104d219>] ? __remove_hrtimer+0x2a/0x7f
[ 1058.786325]  [<ffffffff810ed5db>] ? do_vfs_ioctl+0x3ec/0x435
[ 1058.786357]  [<ffffffff810378e6>] ? do_setitimer+0xbe/0x19e
[ 1058.786388]  [<ffffffff810ed66d>] ? SyS_ioctl+0x49/0x78
[ 1058.786418]  [<ffffffff81092a84>] ? __audit_syscall_exit+0x209/0x225
[ 1058.786464]  [<ffffffff817214a2>] ? system_call_fastpath+0x16/0x1b
[ 1058.786497] Code: de 48 89 53 48 48 89 4b 50 48 c7 43 40 00 00 00 00 80 45 10 01 48 8b 7b 18 48 85 ff 74 1d 48 8b 47 20 48 85 c0 74 14 48 8b 40 08 <48> 8b 40 30 80 78 18 05 76 06 f0 83 2f 01 74 0b 58 48 89 df 5b
[ 1058.786717] RIP  [<ffffffffa0078b8f>] i915_gem_free_request+0x90/0xb2 [i915]
[ 1058.786767]  RSP <ffff8800ca689d58>
[ 1058.786786] CR2: 0000000000000030
[ 1058.794613] ---[ end trace 15faf5334e1f418b ]---
Comment 1 lu hua 2014-03-27 06:02:51 UTC
> 
> The latest known good commit:4726e0b045b80c514377da35ca01467ef6a4de53
> The latest known bad commit: 57e00b2e1b348b3855f4a3b681d583f8c9cf39fb
> 

Retest commit 4726e0b045b80c514377da35ca01467ef6a4de53, it works well.
The latest known good commit: b2040f6fed736ccd2319768bc59833abe74148b8
Comment 2 Daniel Vetter 2014-03-27 07:23:11 UTC
Can you please bisect where this regression has been introduced?

I know that running the full piglit takes time, but its not a lot of commits in the good..bad range, at least the first one. Also: Why is the good..bad range much bigger in comment #1?
Comment 3 Chris Wilson 2014-03-27 07:58:04 UTC
I'd say it was the request->ctx dereference. But that would suggest that ring->last_context was invalid?
Comment 4 Chris Wilson 2014-03-27 09:52:41 UTC
I think:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 021e21223c9c..91000ff1b626 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -506,9 +506,6 @@ int i915_gem_context_enable(struct drm_i915_private *dev_priv)
 	struct intel_ring_buffer *ring;
 	int ret, i;
 
-	if (!HAS_HW_CONTEXTS(dev_priv->dev))
-		return 0;
-
 	/* This is the only place the aliasing PPGTT gets enabled, which means
 	 * it has to happen before we bail on reset */
 	if (dev_priv->mm.aliasing_ppgtt) {
@@ -833,7 +830,13 @@ int i915_switch_context(struct intel_ring_buffer *ring,
 
 	/* We have the fake context */
 	if (!HAS_HW_CONTEXTS(ring->dev)) {
-		ring->last_context = to;
+		if (to != ring->last_context) {
+			if (to)
+				i915_gem_context_reference(to);
+			if (ring->last_context)
+				i915_gem_context_unreference(ring->last_context);
+			ring->last_context = to;
+		}
 		return 0;
 	}
Comment 5 lu hua 2014-03-28 05:59:31 UTC
(In reply to comment #4)
> I think:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
> b/drivers/gpu/drm/i915/i915_gem_context.c
> index 021e21223c9c..91000ff1b626 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -506,9 +506,6 @@ int i915_gem_context_enable(struct drm_i915_private
> *dev_priv)
>  	struct intel_ring_buffer *ring;
>  	int ret, i;
>  
> -	if (!HAS_HW_CONTEXTS(dev_priv->dev))
> -		return 0;
> -
>  	/* This is the only place the aliasing PPGTT gets enabled, which means
>  	 * it has to happen before we bail on reset */
>  	if (dev_priv->mm.aliasing_ppgtt) {
> @@ -833,7 +830,13 @@ int i915_switch_context(struct intel_ring_buffer *ring,
>  
>  	/* We have the fake context */
>  	if (!HAS_HW_CONTEXTS(ring->dev)) {
> -		ring->last_context = to;
> +		if (to != ring->last_context) {
> +			if (to)
> +				i915_gem_context_reference(to);
> +			if (ring->last_context)
> +				i915_gem_context_unreference(ring->last_context);
> +			ring->last_context = to;
> +		}
>  		return 0;
>  	}


Test this patch, system boot fail.
Comment 6 lu hua 2014-03-28 05:59:54 UTC
Created attachment 96509 [details]
dmesg(patch)
Comment 7 Chris Wilson 2014-03-28 12:28:54 UTC
Created attachment 96526 [details] [review]
Make the fake context less fake

Updated patch.
Comment 8 lu hua 2014-03-31 02:40:10 UTC
Created attachment 96636 [details]
boot log

(In reply to comment #7)
> Created attachment 96526 [details] [review] [review]
> Make the fake context less fake
> 
> Updated patch.

Apply this patch, system boot fails.
Comment 9 Chris Wilson 2014-03-31 21:43:02 UTC
Created attachment 96685 [details] [review]
Make the fake context less fake

Ok, this time it actually booted here on a pnv. So let's see if it helps.
Comment 10 lu hua 2014-04-01 07:17:46 UTC
(In reply to comment #9)
> Created attachment 96685 [details] [review] [review]
> Make the fake context less fake
> 
> Ok, this time it actually booted here on a pnv. So let's see if it helps.

On latest -nightly kernel:
patching file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 2277 (offset -49 lines).
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 2790 (offset -193 lines).
patching file drivers/gpu/drm/i915/i915_gem_context.c
Hunk #2 FAILED at 222.
Hunk #3 FAILED at 233.
Hunk #4 succeeded at 291 (offset -20 lines).
Hunk #5 succeeded at 339 (offset -20 lines).
Hunk #6 succeeded at 349 (offset -20 lines).
Hunk #7 succeeded at 372 (offset -20 lines).
Hunk #8 succeeded at 389 (offset -20 lines).
Hunk #9 succeeded at 427 (offset -20 lines).
Hunk #10 succeeded at 440 (offset -20 lines).
Hunk #11 succeeded at 450 (offset -20 lines).
Hunk #12 succeeded at 463 (offset -20 lines).
Hunk #13 succeeded at 470 (offset -20 lines).
Hunk #14 succeeded at 484 (offset -20 lines).
Hunk #15 FAILED at 527.
Hunk #16 FAILED at 561.
Hunk #17 succeeded at 555 (offset -19 lines).
Hunk #18 FAILED at 665.
Hunk #19 succeeded at 736 (offset -55 lines).
Hunk #20 succeeded at 744 (offset -55 lines).
Hunk #21 succeeded at 772 (offset -55 lines).
5 out of 21 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem_context.c.rej
patching file drivers/gpu/drm/i915/i915_gem_execbuffer.c
Hunk #1 succeeded at 1221 with fuzz 2 (offset 13 lines).
Comment 11 Chris Wilson 2014-04-02 06:43:57 UTC
*** Bug 76929 has been marked as a duplicate of this bug. ***
Comment 12 Chris Wilson 2014-04-02 07:10:33 UTC
Created attachment 96762 [details] [review]
Always use kref tracking for all contexts

Now rebased to include an earlier bug fix as well.
Comment 13 lu hua 2014-04-03 06:55:37 UTC
(In reply to comment #12)
> Created attachment 96762 [details] [review] [review]
> Always use kref tracking for all contexts
> 
> Now rebased to include an earlier bug fix as well.


Apply this patch and run one cycle on ILK, it works well.
Comment 14 Chris Wilson 2014-04-11 10:49:21 UTC
commit 349a1b45a80e5f1dcd8a94516c71118350f439e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 9 09:07:36 2014 +0100

    drm/i915: Always use kref tracking for all contexts.
    
    If we always initialize kref for the context, even if we are using fake
    contexts for hangstats when there is no hw support, we can forgo the
    dance to dereference the ctx->obj and inspect whether we are permitted
    to use kref inside i915_gem_context_reference() and _unreference().
    
    My ulterior motive here is to improve the debugging of a use-after-free
    of ctx->obj. This patch avoids the dereference here and instead forces
    the assertion checks associated with kref.
    
    v2: Refactor the fake contexts to being even more like the real
    contexts, so that there is much less duplicated and special case code.
    
    v3: Tweaks.
    v4: Tweaks, minor.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=76671
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: lu hua <huax.lu@intel.com>
    Cc: Ben Widawsky <benjamin.widawsky@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
    [Jani: tiny change to backport to drm-intel-fixes.]
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Comment 15 lu hua 2014-04-16 06:28:21 UTC
Verified.Fixed.
Comment 16 Jari Tahvanainen 2016-10-12 09:06:41 UTC
Closing verified+fixed.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.