Bug 80984

Summary: [GM45] Eaglelake gen4.5 - GPU HANG: ecode -1:0x00000000 when using DRI_PRIME
Product: DRI
Reporter: Shawn Starr <shawn.starr>
Component: DRM/Intel
Assignee: Chris Wilson <chris>
Status: CLOSED WONTFIX
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: GM45
i915 features: GPU hang
Attachments:
Crash dump from /sys/class/drm/card0/error (flags: none)
GPU crash dump from sysfs (flags: none)
Avoid struct mutex recursion. (flags: none)

Description Shawn Starr 2014-07-06 20:44:52 UTC
Created attachment 102332 [details]
Crash dump from /sys/class/drm/card0/error

Kernel: kernel-3.16.0-0.rc3.git3.1.fc21.x86_64
MESA: mesa-dri-drivers-10.2.2-3.20140625.fc21.x86_64
Xorg: xorg-x11-server-Xorg-1.15.99.903-100.fc21.x86_64 (patched with [PATCH] dri2: Use the PrimeScreen when creating/reusing buffers) 


Playing Second Life with radeon GPU offload to the Intel GPU, running with export LIBGL_DRI3_DISABLE=1 because DRI3 in this Mesa does not yet support DRI_PRIME.

The kernel printed this error:

[  293.644011] [drm] GPU HANG: ecode -1:0x00000000, reason: Command parser error, iir 0x00008010, action: continue
[  293.649949] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  293.649949] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  293.649949] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  293.649949] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  293.649949] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  293.649949] i915: render error detected, EIR: 0x00000010
[  293.649949] i915:   IPEIR: 0x00000000
[  293.649949] i915:   IPEHR: 0x01000000
[  293.649949] i915:   INSTDONE_0: 0xfffffffe
[  293.649949] i915:   INSTDONE_1: 0xffffffff
[  293.649949] i915:   INSTDONE_2: 0x00000000
[  293.649949] i915:   INSTDONE_3: 0x00000000
[  293.649949] i915:   INSTPS: 0x0001e000
[  293.649949] i915:   ACTHD: 0x0181c148
[  293.649949] i915: page table error
[  293.649949] i915:   PGTBL_ER: 0x00000001
[  293.649949] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking

[  331.792605] =============================================
[  331.792782] [ INFO: possible recursive locking detected ]
[  331.792971] 3.16.0-0.rc3.git3.1.fc21.x86_64 #1 Tainted: G        W    
[  331.793012] ---------------------------------------------
[  331.793012] Xorg.bin/1015 is trying to acquire lock:
[  331.793012]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa010a919>] i915_gem_unmap_dma_buf+0x39/0x110 [i915]
[  331.793012] 
but task is already holding lock:
[  331.793012]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa0084902>] drm_gem_object_handle_unreference_unlocked+0x102/0x130 [drm]
[  331.793012] 
other info that might help us debug this:
[  331.793012]  Possible unsafe locking scenario:

[  331.793012]        CPU0
[  331.793012]        ----
[  331.793012]   lock(&dev->struct_mutex);
[  331.793012]   lock(&dev->struct_mutex);
[  331.793012] 
 *** DEADLOCK ***

[  331.793012]  May be due to missing lock nesting notation

[  331.793012] 1 lock held by Xorg.bin/1015:
[  331.793012]  #0:  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa0084902>] drm_gem_object_handle_unreference_unlocked+0x102/0x130 [drm]
[  331.793012] 
stack backtrace:
[  331.793012] CPU: 0 PID: 1015 Comm: Xorg.bin Tainted: G        W     3.16.0-0.rc3.git3.1.fc21.x86_64 #1
[  331.793012] Hardware name: LENOVO 4058CTO/4058CTO, BIOS 6FET93WW (3.23 ) 10/12/2012
[  331.793012]  0000000000000000 00000000ad742009 ffff88024ff4ba80 ffffffff81807cec
[  331.793012]  ffffffff82bc8240 ffff88024ff4bb60 ffffffff81100fd0 ffffffff81024369
[  331.793012]  ffff88024ff4bac0 ffffffff810e1b3d ffff880000000000 00000000007583ac
[  331.793012] Call Trace:
[  331.793012]  [<ffffffff81807cec>] dump_stack+0x4d/0x66
[  331.793012]  [<ffffffff81100fd0>] __lock_acquire+0x1450/0x1ca0
[  331.793012]  [<ffffffff81024369>] ? sched_clock+0x9/0x10
[  331.793012]  [<ffffffff810e1b3d>] ? sched_clock_local+0x1d/0x90
[  331.793012]  [<ffffffff81024369>] ? sched_clock+0x9/0x10
[  331.793012]  [<ffffffff810e1b3d>] ? sched_clock_local+0x1d/0x90
[  331.793012]  [<ffffffff81102104>] lock_acquire+0xa4/0x1d0
[  331.793012]  [<ffffffffa010a919>] ? i915_gem_unmap_dma_buf+0x39/0x110 [i915]
[  331.793012]  [<ffffffff8180ccd5>] mutex_lock_nested+0x85/0x440
[  331.793012]  [<ffffffffa010a919>] ? i915_gem_unmap_dma_buf+0x39/0x110 [i915]
[  331.793012]  [<ffffffffa010a919>] ? i915_gem_unmap_dma_buf+0x39/0x110 [i915]
[  331.793012]  [<ffffffffa010a919>] i915_gem_unmap_dma_buf+0x39/0x110 [i915]
[  331.793012]  [<ffffffff81532591>] dma_buf_unmap_attachment+0x51/0x80
[  331.793012]  [<ffffffffa009c6c2>] drm_prime_gem_destroy+0x22/0x40 [drm]
[  331.793012]  [<ffffffffa0468112>] radeon_gem_object_free+0x42/0x70 [radeon]
[  331.793012]  [<ffffffffa0084387>] drm_gem_object_free+0x27/0x40 [drm]
[  331.793012]  [<ffffffffa0084920>] drm_gem_object_handle_unreference_unlocked+0x120/0x130 [drm]
[  331.793012]  [<ffffffffa00849ff>] drm_gem_handle_delete+0xcf/0x1a0 [drm]
[  331.793012]  [<ffffffffa0085205>] drm_gem_close_ioctl+0x25/0x30 [drm]
[  331.793012]  [<ffffffffa0082cdf>] drm_ioctl+0x1df/0x6a0 [drm]
[  331.793012]  [<ffffffff81810af6>] ? _raw_spin_unlock_irqrestore+0x36/0x70
[  331.793012]  [<ffffffff810ff72d>] ? trace_hardirqs_on_caller+0x15d/0x200
[  331.793012]  [<ffffffff810ff7dd>] ? trace_hardirqs_on+0xd/0x10
[  331.793012]  [<ffffffffa043604c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[  331.793012]  [<ffffffff812628d0>] do_vfs_ioctl+0x2f0/0x520
[  331.793012]  [<ffffffff8126ef6a>] ? __fget+0x12a/0x2f0
[  331.793012]  [<ffffffff8126ee45>] ? __fget+0x5/0x2f0
[  331.793012]  [<ffffffff8126f1a0>] ? __fget_light+0x30/0x160
[  331.793012]  [<ffffffff81262b81>] SyS_ioctl+0x81/0xa0
[  331.793012]  [<ffffffff818118e9>] system_call_fastpath+0x16/0x1b
[  418.754354] DMA-API: debugging out of memory - disabling
[  858.336541] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[  858.336762] Dazed and confused, but trying to continue
[  858.336959] dmar: DRHD: handling fault status reg 3
[  858.337115] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5001000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.337515] dmar: DRHD: handling fault status reg 3
[  858.337660] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e50b9000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.338093] dmar: DRHD: handling fault status reg 3
[  858.338257] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e512b000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.338093] dmar: DRHD: handling fault status reg 3
[  858.338093] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e518e000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.339171] dmar: DRHD: handling fault status reg 3
[  858.339325] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e51ee000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.339718] dmar: DRHD: handling fault status reg 3
[  858.339863] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5254000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.340235] dmar: DRHD: handling fault status reg 3
[  858.340387] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e52ba000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.340802] dmar: DRHD: handling fault status reg 3
[  858.340947] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5322000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.341332] dmar: DRHD: handling fault status reg 3
[  858.341483] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e538f000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.341868] dmar: DRHD: handling fault status reg 3
[  858.342006] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e53e8000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.342390] dmar: DRHD: handling fault status reg 3
[  858.342534] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e543f000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.342907] dmar: DRHD: handling fault status reg 3
[  858.343054] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e54af000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.343443] dmar: DRHD: handling fault status reg 3
[  858.343587] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e550a000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.343970] dmar: DRHD: handling fault status reg 3
[  858.344120] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e557a000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.344510] dmar: DRHD: handling fault status reg 3
[  858.344655] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e55ce000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.345028] dmar: DRHD: handling fault status reg 3
[  858.345178] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e563b000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.345571] dmar: DRHD: handling fault status reg 3
[  858.345719] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e569d000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.346108] dmar: DRHD: handling fault status reg 3
[  858.346250] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5706000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.346650] dmar: DRHD: handling fault status reg 3
[  858.346797] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5766000 
DMAR:[fault reason 05] PTE Write access is not set
[  858.347168] dmar: DRHD: handling fault status reg 3
[  858.347305] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e57c3000 
DMAR:[fault reason 05] PTE Write access is not set

Attached is the crash dump from sysfs.
Comment 1 Shawn Starr 2014-07-06 20:48:56 UTC
Created attachment 102333 [details]
GPU crash dump from sysfs
Comment 2 Chris Wilson 2014-07-07 08:03:04 UTC
The GPU hang is immaterial: it is just one of those freak faults gen4 throws out for host access. The DMAR errors look more substantial and are also unrelated, though I am impressed that you have a working DMAR on a gen4 platform; they need to be directed towards -radeon, I guess.
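
To see which device the DMAR faults come from before redirecting them, the faults can be grouped by requesting device. The following is a minimal Python sketch that assumes the log line format shown in the description; the function name `summarize_dmar_faults` and the regular expression are illustrative, not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5001000
# The "DMAR:[fault reason ...]" continuation lines are intentionally not matched.
FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-fA-F:.]+)\] "
    r"fault addr ([0-9a-fA-F]+)"
)


def summarize_dmar_faults(log_text):
    """Count DMAR faults per (device, access type) in a dmesg excerpt."""
    counts = Counter()
    for access, device, _addr in FAULT_RE.findall(log_text):
        counts[(device, access)] += 1
    return counts


# Two fault records copied from the log in the description.
sample = """\
[  858.337115] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e5001000
DMAR:[fault reason 05] PTE Write access is not set
[  858.337660] dmar: DMAR:[DMA Write] Request device [01:00.0] fault addr e50b9000
DMAR:[fault reason 05] PTE Write access is not set
"""

for (device, access), n in summarize_dmar_faults(sample).items():
    print(f"{device}: {n} DMA {access} fault(s)")
```

Running the same function over the full dmesg output would show whether every fault originates from the same PCI device, which is the information the receiving driver team would need.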
Comment 3 Chris Wilson 2014-07-07 08:03:35 UTC
Created attachment 102352 [details] [review]
Avoid struct mutex recursion.
Comment 4 Chris Wilson 2014-07-07 08:42:42 UTC
(In reply to comment #3)
> Created attachment 102352 [details] [review]
> Avoid struct mutex recursion.

Dave Airlie pointed out that they are two different struct mutexes.
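
The lockdep report in the description names the same lock class, (&dev->struct_mutex), for both the held lock and the one being acquired, while the call trace shows the held one taken on the radeon side and the new one inside i915. Lockdep validates lock classes rather than individual lock instances, which is why the report says "May be due to missing lock nesting notation". The following is a toy Python model of that distinction, not kernel code; the `check` function and the tuple layout are hypothetical.

```python
def check(held, new):
    """Compare a class-based check with an instance-based one.

    `held` is a list of (lock_class, owner) tuples already taken;
    `new` is the (lock_class, owner) tuple about to be acquired.
    Returns (class_recursion_reported, same_instance_already_held).
    """
    # A class-based validator flags any second lock of the same class.
    class_recursion = any(h[0] == new[0] for h in held)
    # A real self-deadlock needs the very same lock instance.
    same_instance = new in held
    return class_recursion, same_instance


held = [("&dev->struct_mutex", "radeon")]

# Different device, same class: reported, but not the same mutex.
print(check(held, ("&dev->struct_mutex", "i915")))

# Same device, same class: reported, and a genuine recursion.
print(check(held, ("&dev->struct_mutex", "radeon")))
```

In this model the first case corresponds to the situation described in this comment: the warning fires on the class match even though the two mutexes belong to different devices.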
Comment 5 Jairo Miramontes 2015-08-11 14:14:46 UTC
Closed after more than one year of inactivity. Feel free to reopen if needed. Thanks.
