Bug 75502 - [SNB, IVB libva+rc6] Page flips hang indefinitely, apparently waiting for mmio writes that never happen
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
 
Reported: 2014-02-25 17:00 UTC by Simon Farnsworth
Modified: 2017-07-24 22:55 UTC

Attachments
dmesg after failure, compressed with gzip. (1.04 MB, application/x-gzip)
2014-02-25 17:06 UTC, Simon Farnsworth
check that the pageflip is truly stuck before crying (7.93 KB, patch)
2014-02-26 10:21 UTC, Chris Wilson
Reinstate the periodic pageflip stall detector (11.38 KB, patch)
2014-02-28 12:39 UTC, Chris Wilson
Error state that goes with the hang (691.50 KB, application/x-gzip)
2014-03-04 11:52 UTC, Simon Farnsworth
Register dump from broken box as requested in comment 23 (14.71 KB, text/plain)
2014-03-14 10:34 UTC, Simon Farnsworth
Error state collected as per comment 23 (303.75 KB, application/x-gzip)
2014-03-14 10:36 UTC, Simon Farnsworth
error state collected during hang. (691.50 KB, application/x-gzip)
2014-04-11 10:09 UTC, Simon Farnsworth
Reorder semaphore deadlock detection (1.62 KB, patch)
2014-04-11 12:49 UTC, Chris Wilson
Kernel messages from syslog (85.22 KB, text/plain)
2014-05-01 10:52 UTC, Simon Farnsworth
Error state just before a hang (663.78 KB, application/x-gzip)
2014-05-07 10:28 UTC, Simon Farnsworth

Description Simon Farnsworth 2014-02-25 17:00:09 UTC
I'm seeing pageflips give up on me on SNB hardware - Celeron 847, i5-2400S - 
where the CRTC becomes unexpectedly busy and never recovers.

Once it has failed, I can see interrupts still coming in (the "Interrupts received" 
count in /sys/kernel/debug/dri/0/i915_gem_interrupt increments), but the queued flip 
never happens:

 # cat /sys/kernel/debug/dri/0/i915_gem_pageflip 
Flip queued on pipe A (plane A)
Stall check enabled, 1 prepares
Old framebuffer gtt_offset 0x01354000
New framebuffer gtt_offset 0x0093c000
No flip due on pipe B (plane B)

 # cat /sys/kernel/debug/dri/0/i915_gem_interrupt 
North Display Interrupt enable:         8eb48585
North Display Interrupt identity:       00000000
North Display Interrupt mask:           734bfa7a
South Display Interrupt enable:         ffffffff
South Display Interrupt identity:       00000000
South Display Interrupt mask:           f114ffff
Graphics Interrupt enable:              00401001
Graphics Interrupt identity:            00000000
Graphics Interrupt mask:                ffffffff
Interrupts received: 1006958
Graphics Interrupt mask (render ring):  ffffffff
Current sequence (render ring): 1472838
Graphics Interrupt mask (bsd ring):     ffffffff
Current sequence (bsd ring): 1271460
Graphics Interrupt mask (blitter ring): ffffffff
Current sequence (blitter ring): 1472839

I'm running packages from Fedora 20:

# rpm -q mesa-libGL kernel xorg-x11-server-Xorg xorg-x11-drv-intel libva-intel-driver libdrm
mesa-libGL-9.2.5-1.20131220.fc20.x86_64
kernel-3.13.3-201.fc20.x86_64
xorg-x11-server-Xorg-1.14.4-90.fc20.x86_64
xorg-x11-drv-intel-2.21.15-5.fc20.x86_64
libva-intel-driver-1.2.1-1.fc20.x86_64
libdrm-2.4.50-1.fc20.x86_64

I will attach dmesg from a failed system - note that no error state has been collected.
Comment 1 Simon Farnsworth 2014-02-25 17:06:41 UTC
Created attachment 94724 [details]
dmesg after failure, compressed with gzip.
Comment 2 Chris Wilson 2014-02-26 09:11:30 UTC
Wtf: i915_pageflip_stall_check() is never called.
Comment 3 Chris Wilson 2014-02-26 09:19:34 UTC
Ok, we only ever used it for gen3/4, then presuming we had fixed the interrupt handling, let it drop.
Comment 4 Simon Farnsworth 2014-02-26 10:02:13 UTC
Next time it repros, I'll check to see if DSPSURFA has updated, or if we're really stuck.
Comment 5 Chris Wilson 2014-02-26 10:21:21 UTC
Created attachment 94763 [details] [review]
check that the pageflip is truly stuck before crying

Simon, it would be good to run with this patch - but first make the fixup a DRM_ERROR so that we can easily see if it fires.
Comment 6 Chris Wilson 2014-02-28 12:39:33 UTC
Created attachment 94883 [details] [review]
Reinstate the periodic pageflip stall detector

Applies on top of the first check.
Comment 7 Simon Farnsworth 2014-02-28 17:30:53 UTC
It's definitely a new problem, and I've just seen it on IVB (i3-3220), too:

 # intel_reg_dumper  | grep DSP
                      DSPACNTR: 0xd8004400 (enabled)
                      DSPABASE: 0x00000000
                    DSPASTRIDE: 0x00001e00 (120)
                      DSPASURF: 0x019c8000
                   DSPATILEOFF: 0x00000000 (0, 0)
                      DSPBCNTR: 0x58004400 (disabled)
                      DSPBBASE: 0x00000000
                    DSPBSTRIDE: 0x00000a00 (40)
                      DSPBSURF: 0x0085b000
                   DSPBTILEOFF: 0x00000000 (0, 0)
                      DSPCCNTR: 0x00004000 (disabled)
                      DSPCBASE: 0x00000000
                    DSPCSTRIDE: 0x00000000 (0)
                      DSPCSURF: 0x00000000
                   DSPCTILEOFF: 0x00000000 (0, 0)
             PCH_DSPCLK_GATE_D: 0x10000000
              PCH_DSP_CHICKEN1: 0xa0000000
              PCH_DSP_CHICKEN2: 0x00000000
              PCH_DSP_CHICKEN3: 0x00000024
 
 # grep . /sys/kernel/debug/dri/0/i915_gem_pageflip
Flip queued on pipe A (plane A)
Stall check enabled, 1 prepares
Old framebuffer gtt_offset 0x019c8000
New framebuffer gtt_offset 0x011b8000
No flip due on pipe B (plane B)
No flip due on pipe C (plane C)

This looks to me like the flip hasn't happened - DSPASURF is 0x19c8000, which matches old framebuffer gtt_offset.

Note that the kernel doesn't have either of the patches from this bug on it.
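
The manual comparison above (scanout register versus the old and new framebuffer offsets) can be scripted. What follows is only a minimal Python sketch, not part of any Intel tooling: it assumes the two text formats pasted in this comment, and the helper names `parse_pageflip`, `parse_register` and `flip_state` are hypothetical.

```python
import re

# Sample text copied from this comment (trimmed to the relevant lines).
PAGEFLIP = """Flip queued on pipe A (plane A)
Stall check enabled, 1 prepares
Old framebuffer gtt_offset 0x019c8000
New framebuffer gtt_offset 0x011b8000
No flip due on pipe B (plane B)
"""

REG_DUMP = """                      DSPACNTR: 0xd8004400 (enabled)
                      DSPASURF: 0x019c8000
"""


def parse_pageflip(text):
    """Return the old/new framebuffer addresses of the first queued flip."""
    addrs = {}
    for kind in ("Old", "New"):
        # Matches both "gtt_offset" and "address" wordings of the debugfs file.
        m = re.search(kind + r" framebuffer \w+ (0x[0-9a-fA-F]+)", text)
        if m:
            addrs[kind.lower()] = int(m.group(1), 16)
    return addrs


def parse_register(dump, name):
    """Return the value of one register from intel_reg_dumper-style output."""
    pattern = r"^\s*" + re.escape(name) + r":\s*(0x[0-9a-fA-F]+)"
    m = re.search(pattern, dump, re.MULTILINE)
    return int(m.group(1), 16) if m else None


def flip_state(scanout, addrs):
    """Classify the flip by comparing the live scanout address to both buffers."""
    if scanout is None:
        return "unknown"
    if scanout == addrs.get("new"):
        return "completed"
    if scanout == addrs.get("old"):
        return "pending"
    return "unknown"


print(flip_state(parse_register(REG_DUMP, "DSPASURF"), parse_pageflip(PAGEFLIP)))
# prints "pending": DSPASURF still points at the old framebuffer
```

A result of "unknown" would mean the scanout address matches neither buffer, which would point to some other write to the plane rather than a lost flip.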
Comment 8 Daniel Vetter 2014-03-03 08:03:28 UTC
How definite are we that this is a regression/recent issue? Just wondering whether we could pin this on some recent changes ...
Comment 9 Simon Farnsworth 2014-03-03 09:34:25 UTC
I can confirm that it didn't occur with kernel 3.5.0, libdrm-2.4.46, Mesa 9.1.3, Xorg server 1.11.4 and DDX 2.21.6. It does occur now that I've pulled forward from Fedora 16 with backports to Fedora 20.

If there are good starting points to try, I can take a machine backwards and see if that improves matters; I'd need to know which bits to change, though.
Comment 10 Daniel Vetter 2014-03-03 09:44:09 UTC
(In reply to comment #9)
> I can confirm that it didn't occur with kernel 3.5.0, libdrm-2.4.46, Mesa
> 9.1.3, Xorg server 1.11.4 and DDX 2.21.6. It does occur now that I've pulled
> forward from Fedora 16 with backports to Fedora 20.
> 
> If there are good starting points to try, I can take a machine backwards and
> see if that improves matters; I'd need to know which bits to change, though.

The only relevant bits are the kernel and the ddx (ddx is less likely). A lot of stuff has happened since 3.5, so I would only bother with triaging/bisecting if you can somewhat readily reproduce this. No guess as to the actual culprit from my side.
Comment 11 Simon Farnsworth 2014-03-03 09:55:42 UTC
It takes two to three hours to fail each time, and has sometimes lasted as long as 3 days before failing.

git bisect suggests that I've got 16 steps to try to get there on the kernel alone, at worst case time of 3 days per step.

Do you have any further clues I could push on before I dive into bisection?
Comment 12 Daniel Vetter 2014-03-03 10:04:50 UTC
(In reply to comment #11)
> It takes two to three hours to fail each time, and has sometimes lasted as
> long as 3 days before failing.
> 
> git bisect suggests that I've got 16 steps to try to get there on the kernel
> alone, at worst case time of 3 days per step.
> 
> Do you have any further clues I could push on before I dive into bisection?

If the fail/success is so imbalanced you can try to skew the bisect by picking commits which are more likely to be bad than the perfect middle one (since you can detect those faster). Unfortunately there's no cmdline option for this, so you need to do this by hand.

There was a write-up somewhere, but I can't find it any more. Probably not worth the bother though ...
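
The idea of skewing the bisect can be illustrated with a small calculation. This is a rough Python sketch under loudly stated assumptions, not the lost write-up: exactly one first-bad commit, equally likely to be anywhere in the range; a "bad" verdict costs about 3 hours (the failure shows up quickly); a "good" verdict costs the full 72 hours of waiting; and the range holds 1024 commits. The function name `plan_bisect` is hypothetical.

```python
def plan_bisect(n, cost_bad, cost_good):
    """Expected total test time, and the best first split, for n candidates.

    Testing the k-th oldest candidate comes back "bad" with probability k/n
    (leaving k candidates) and "good" otherwise (leaving n - k candidates).
    """
    expected = [0.0] * (n + 1)   # expected[1] == 0: a single candidate is the answer
    split = [0] * (n + 1)
    for size in range(2, n + 1):
        best_cost, best_k = float("inf"), 0
        for k in range(1, size):
            p_bad = k / size
            cost = (p_bad * (cost_bad + expected[k])
                    + (1.0 - p_bad) * (cost_good + expected[size - k]))
            if cost < best_cost:
                best_cost, best_k = cost, k
        expected[size] = best_cost
        split[size] = best_k
    return expected[n], split[n]


if __name__ == "__main__":
    hours, first_k = plan_bisect(1024, cost_bad=3.0, cost_good=72.0)
    print(f"expected hours with skewed splits: {hours:.0f}")
    print(f"first commit to test: {first_k} of 1024 (midpoint would be 512)")
```

With equal costs the same function reproduces plain bisection (10 steps for 1024 commits), so any saving it reports comes purely from testing commits that are more likely to be bad. The real failure is probabilistic, so a "good" verdict after 3 days is itself uncertain; the sketch ignores that.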
Comment 13 Simon Farnsworth 2014-03-03 12:00:02 UTC
Even restricting to drivers/gpu/drm/i915 leaves me 10 steps to try, at up to 3 days per step. If there's no other way to dig into this, I can start a bisect, but it will take about a month to get results.
Comment 14 Simon Farnsworth 2014-03-04 11:52:55 UTC
Created attachment 95088 [details]
Error state that goes with the hang

Had a hopeful event - I've got the failure recurring, with an error state this time. Ickle's patch gets me:

# cat /sys/kernel/debug/dri/0/i915_gem_pageflip 
Flip queued on pipe A (plane A)
Flip queued on frame 32131, now 3102655
Stall check enabled, 1 prepares
Old framebuffer gtt_offset 0x00372000
New framebuffer gtt_offset 0x08022000
MMIO update completed? 0
No flip due on pipe B (plane B)

showing that the flip hasn't happened, so it's not unblocking.

I'm not 100% confident that the failure and the error state coincide in time - I'm happy to be told that they're not linked, and that I should report the error state as a separate bug.
Comment 15 Chris Wilson 2014-03-04 12:40:40 UTC
Indeed that hang is a separate issue, the infamous bug 54226.

For the flip stall detection, I think I basically need a full request with tracking seqno so that we can check after a vblank for the completed flip command but pending mmio.
Comment 16 Chris Wilson 2014-03-04 12:43:27 UTC
One major question raised by the error state: where is the new framebuffer amongst the pinned objects? This looks like bug 73437.
Comment 17 Chris Wilson 2014-03-04 13:02:46 UTC
Also notice that the current scanout is 0x00684000 not 0x00372000
Comment 18 Chris Wilson 2014-03-04 13:07:32 UTC
Does

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 25c486d5fb6a..97827bf9396a 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -2359,6 +2359,11 @@ intel_pipe_set_base(struct drm_crtc *crtc, int x, int y,
        struct drm_framebuffer *old_fb;
        int ret;
 
+       if (intel_crtc_has_pending_flip(crtc)) {
+               DRM_ERROR("pipe is still busy with an old pageflip\n");
+               return -EBUSY;
+       }
+
        /* no fb bound */
        if (!fb) {
                DRM_ERROR("No FB bound\n");

register anything after the freeze?
Comment 19 Simon Farnsworth 2014-03-04 14:59:27 UTC
I'm still on 3.13.4 here, so backported it as:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index d8013f5..66ccf9b 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -2300,6 +2300,11 @@ intel_pipe_set_base(struct drm_crtc *crtc, int x, int y,
        struct drm_framebuffer *old_fb;
        int ret;
 
+        if (intel_crtc_has_pending_flip(crtc)) {
+               DRM_ERROR("pipe is still busy with an old pageflip\n");
+               return -EBUSY;
+        }
+
        /* no fb bound */
        if (!fb) {
                DRM_ERROR("No FB bound\n");

Will let you know the results.
Comment 20 Simon Farnsworth 2014-03-07 18:29:32 UTC
It doesn't print after the freeze.

I can see that dmesg covers me back to boot time on the system, but "dmesg | grep -i busy" shows nothing, even with your patch in place.

Xorg is logging lots of pairs of

[  6527.065] (WW) intel(0): Page flip failed: Device or resource busy
[  6527.068] (WW) intel(0): flip queue failed: Device or resource busy

 # cat /sys/kernel/debug/dri/0/i915_gem_pageflip
Flip queued on pipe A (plane A)
Flip queued on frame 364351, now 541203
Stall check enabled, 1 prepares
Old framebuffer gtt_offset 0x02f44000
New framebuffer gtt_offset 0x0093c000
MMIO update completed? 0
No flip due on pipe B (plane B)

is obviously stuck.
Comment 21 Chris Wilson 2014-03-07 20:19:12 UTC
To double check the theory that the scanout went awol, can you please try with

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 94aabfc63c3d..c6b81e59058d 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -611,16 +611,28 @@ static int i915_gem_pageflip_info(struct seq_file *m, void *data)
                                seq_puts(m, "Stall check waiting for page flip ioctl, ");
                        seq_printf(m, "%d prepares\n", atomic_read(&work->pending));
 
+                       {
+                               u32 addr;
+
+                               if (INTEL_INFO(dev)->gen >= 4)
+                                       addr = DSPSURF(crtc->plane);
+                               else
+                                       addr = DSPADDR(crtc->plane);
+
+                               seq_printf(m, "Current scanout address 0x%08lx\n", 
+                                          I915_READ(addr));
+                       }
+
                        if (work->old_fb_obj) {
                                struct drm_i915_gem_object *obj = work->old_fb_obj;
-                               seq_printf(m, "Old framebuffer gtt_offset 0x%08lx\n",
+                               seq_printf(m, "Old framebuffer address 0x%08lx\n",
                                           i915_gem_obj_ggtt_offset(obj));
                        }
                        if (work->pending_flip_obj) {
                                struct drm_i915_gem_object *obj = work->pending_flip_obj;
                                bool complete;
 
-                               seq_printf(m, "New framebuffer gtt_offset 0x%08lx\n",
+                               seq_printf(m, "New framebuffer address 0x%08lx\n",
                                           i915_gem_obj_ggtt_offset(obj));
 
                                if (INTEL_INFO(dev)->gen >= 4) {
Comment 22 Simon Farnsworth 2014-03-14 10:27:35 UTC
(In reply to comment #21)
> To double check the theory that the scanout went awol, can you please try
> with
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c
> b/drivers/gpu/drm/i915/i915_debugfs.c
> index 94aabfc63c3d..c6b81e59058d 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -611,16 +611,28 @@ static int i915_gem_pageflip_info(struct seq_file *m,
> void *data)
>                                 seq_puts(m, "Stall check waiting for page
> flip ioctl, ");
>                         seq_printf(m, "%d prepares\n",
> atomic_read(&work->pending));
>  
> +                       {
> +                               u32 addr;
> +
> +                               if (INTEL_INFO(dev)->gen >= 4)
> +                                       addr = DSPSURF(crtc->plane);
> +                               else
> +                                       addr = DSPADDR(crtc->plane);
> +
> +                               seq_printf(m, "Current scanout address
> 0x%08lx\n", 
> +                                          I915_READ(addr));
> +                       }
> +
>                         if (work->old_fb_obj) {
>                                 struct drm_i915_gem_object *obj =
> work->old_fb_obj;
> -                               seq_printf(m, "Old framebuffer gtt_offset
> 0x%08lx\n",
> +                               seq_printf(m, "Old framebuffer address
> 0x%08lx\n",
>                                            i915_gem_obj_ggtt_offset(obj));
>                         }
>                         if (work->pending_flip_obj) {
>                                 struct drm_i915_gem_object *obj =
> work->pending_flip_obj;
>                                 bool complete;
>  
> -                               seq_printf(m, "New framebuffer gtt_offset
> 0x%08lx\n",
> +                               seq_printf(m, "New framebuffer address
> 0x%08lx\n",
>                                            i915_gem_obj_ggtt_offset(obj));
>  
>                                 if (INTEL_INFO(dev)->gen >= 4) {


I've got a broken machine; with this patch, I see:

Flip queued on pipe A (plane A)
Flip queued on frame 15324, now 3420744
Stall check enabled, 1 prepares
Current scanout address 0x00372000
Old framebuffer address 0x00372000
New framebuffer address 0x00684000
MMIO update completed? 0
No flip due on pipe B (plane B)

I'm going to keep the machine in broken state until you tell me otherwise.
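
For a sense of scale, the frame counters in the output above say how long the flip has been waiting. The following Python sketch is illustrative only: the helper name `flip_wait` is hypothetical, the 60 Hz refresh rate is an assumption (the thread does not state the mode), and counter wraparound is ignored.

```python
import re

# Output copied from this comment.
STATUS = """Flip queued on pipe A (plane A)
Flip queued on frame 15324, now 3420744
Stall check enabled, 1 prepares
Current scanout address 0x00372000
Old framebuffer address 0x00372000
New framebuffer address 0x00684000
MMIO update completed? 0
No flip due on pipe B (plane B)
"""


def flip_wait(text, refresh_hz=60.0):
    """Return frames elapsed since the flip was queued and a rough duration.

    refresh_hz is an assumed display refresh rate, used only for the estimate.
    """
    m = re.search(r"Flip queued on frame (\d+), now (\d+)", text)
    if m is None:
        return None
    queued, now = int(m.group(1)), int(m.group(2))
    frames = now - queued
    done = re.search(r"MMIO update completed\? (\d)", text)
    return {
        "frames_waiting": frames,
        "approx_hours": frames / refresh_hz / 3600.0,
        "mmio_completed": bool(done and done.group(1) == "1"),
    }


info = flip_wait(STATUS)
print(info["frames_waiting"])            # 3405420
print(round(info["approx_hours"], 1))    # 15.8 at the assumed 60 Hz
print(info["mmio_completed"])            # False
```

A healthy flip completes within a frame or two, so a gap of millions of frames leaves no doubt that the flip is stuck rather than slow.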
Comment 23 Chris Wilson 2014-03-14 10:30:02 UTC
Hmm, first do an intel_reg_dump and then echo 1 > /sys/kernel/debug/dri/0/i915_wedged and attach the error state.
Comment 24 Simon Farnsworth 2014-03-14 10:34:24 UTC
Created attachment 95792 [details]
Register dump from broken box as requested in comment 23
Comment 25 Simon Farnsworth 2014-03-14 10:36:27 UTC
Created attachment 95793 [details]
Error state collected as per comment 23

Both attached. I'll leave the box broken in case the error state is missing some key detail.
Comment 26 Chris Wilson 2014-03-14 10:42:15 UTC
They confirm that the dspsurf is still at 0x00372000. The earlier theories that another write happened are debunked. So it has to be something in the simple MI_DISPLAY_FLIP command packets that causes the CS to simply ignore it.

Try,

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 0326ea8..1a680bf 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -8842,6 +8842,10 @@ static int intel_gen6_queue_flip(struct drm_device *dev,
        if (ret)
                goto err;
 
+       ret = intel_ring_cacheline_align(ring);
+       if (ret)
+               goto err_unpin;
+
        ret = intel_ring_begin(ring, 4);
        if (ret)
                goto err_unpin;

(current drm-intel-nightly already has that for ivb)
Comment 27 Simon Farnsworth 2014-03-17 10:56:58 UTC
(In reply to comment #26)
> They confirm that the dspsurf is still at 0x00372000. The earlier theories
> that another write happened are debunked. So it has to be something in the
> simple MI_DISPLAY_FLIP command packets that causes the CS to simply ignore
> it.
> 
> Try,
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c
> b/drivers/gpu/drm/i915/intel_display.c
> index 0326ea8..1a680bf 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -8842,6 +8842,10 @@ static int intel_gen6_queue_flip(struct drm_device
> *dev,
>         if (ret)
>                 goto err;
>  
> +       ret = intel_ring_cacheline_align(ring);
> +       if (ret)
> +               goto err_unpin;
> +
>         ret = intel_ring_begin(ring, 4);
>         if (ret)
>                 goto err_unpin;
> 
> (current drm-intel-nightly already has that for ivb)

Tested, and still fails.
Comment 28 Simon Farnsworth 2014-03-17 11:09:37 UTC
Our QA team has noted that on units where VA-API has never been used, we don't ever get an issue with stuck pageflips.

I have a unit under test which has played a movie using VA-API once, but is now doing a mix of OpenGL and X11 rendering. I'm going to see if it freezes, too, or if it's only possible to freeze it with VA-API actively in use.
Comment 29 Chris Wilson 2014-03-17 11:31:23 UTC
Ah ha, let's hope this leads to a reproducible test case. One thing you might like to try is disabling rc6 - if that helps, I think looking at the media rc6 settings would be in order.
Comment 30 Simon Farnsworth 2014-03-21 11:03:18 UTC
With cacheline align and rc6 disabled, it doesn't hang - at the points where it would hang, libva crashes instead.

With either rc6 enabled, or the flips not cacheline aligned, I can get hangs.

There's an interesting commit in the VA-API driver (libva intel-driver) that's included in the 1.3.0pre1 release:

commit 06702fb609b5fc9707f72a6e15e2117653ffd849
Author: Zhao Yakui <yakui.zhao@intel.com>
Date:   Mon Jan 20 09:58:06 2014 +0800

    Fix the wrong setting in MI_BATCH_BATCH_START command on Snb/Ivy/Haswell
    
    Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>

This changes the MI_BATCH_BUFFER_START command from MI_BATCH_BUFFER_START | (2 << 6) to MI_BATCH_BUFFER_START | (1 << 8) (PPGTT bit set instead of 128 DWORDS size).

Is it worth trying this commit against VA-API just in case?
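
To make the flag change in that commit concrete, here is a small Python sketch of the arithmetic. Only the two flag expressions, `2 << 6` and `1 << 8`, come from this comment. The opcode value `0x31 << 23` is an assumption, and reading the low 8 bits as a DWORD length follows the interpretation given above rather than a checked hardware reference. The helper name `describe` is hypothetical.

```python
# Assumed opcode encoding for MI_BATCH_BUFFER_START; not taken from this thread.
MI_BATCH_BUFFER_START = 0x31 << 23

OLD_FLAGS = 2 << 6   # flags OR-ed in before the libva commit
NEW_FLAGS = 1 << 8   # flags OR-ed in after the libva commit


def describe(flags):
    """Show a flag value in hex together with the bit positions that are set."""
    return {
        "hex": hex(flags),
        "bits_set": [bit for bit in range(32) if (flags >> bit) & 1],
    }


print(describe(OLD_FLAGS))   # {'hex': '0x80', 'bits_set': [7]}
print(describe(NEW_FLAGS))   # {'hex': '0x100', 'bits_set': [8]}

# Under the interpretation above, the old value lands in the low byte and reads as 128.
print(OLD_FLAGS & 0xFF)      # 128
print(NEW_FLAGS & 0xFF)      # 0

# The two command dwords differ in exactly bits 7 and 8.
print(hex((MI_BATCH_BUFFER_START | OLD_FLAGS) ^ (MI_BATCH_BUFFER_START | NEW_FLAGS)))  # 0x180
```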
Comment 31 Chris Wilson 2014-03-21 11:51:18 UTC
How does it crash in libva? GPU hang? Can you attach the error state in that case, or a backtrace otherwise.

At this moment in time, it would be wise to try the latest and greatest libva just in case. And then look for the right patch.
Comment 32 Simon Farnsworth 2014-04-11 10:07:24 UTC
I've updated to the current libva, and had QA test things for me.

With RC6 enabled, I still get the stalls. Without it, I don't get stalls.

I do have an error state from a machine that stalled - I'll attach it, then see if I can collect error states from machines that don't stall.
Comment 33 Simon Farnsworth 2014-04-11 10:09:21 UTC
Created attachment 97218 [details]
error state collected during hang.

Error state collected from a stalled box.

 # grep . /sys/kernel/debug/dri/0/i915_gem_pageflip 
Flip queued on pipe A (plane A)
Flip queued on frame 4876195, now 5100833
Stall check enabled, 1 prepares
Current scanout address 0x01061000
Old framebuffer address 0x01061000
New framebuffer address 0x0085b000
MMIO update completed? 0
No flip due on pipe B (plane B)

is the stall.

I'm running 3.14.0-0.rc7 with the patches from this bug applied.
Comment 34 Daniel Vetter 2014-04-11 12:44:25 UTC
IPEHR doesn't match anything nearby where the ring is supposed to be afaict. Wat?
Comment 35 Chris Wilson 2014-04-11 12:49:41 UTC
Created attachment 97221 [details] [review]
Reorder semaphore deadlock detection

It's #54226 again, but we failed to simply kick the stuck semaphore and declared the GPU hung instead. The attached patch should detect the situation better, but the root cause behind the hangs still remains unresolved.
Comment 36 Simon Farnsworth 2014-05-01 10:51:57 UTC
With all the patches from this bug, on top of http://koji.fedoraproject.org/koji/buildinfo?buildID=512172 (kernel-3.14.1-200.fc20), and RC6 disabled, my QA team get complete system hangs. We did not manage to extract the error state, so this may not be interesting, but syslog shows the following when it goes wrong:

2014-04-30T17:23:22.611582+01:00 kernel[-] info:[drm] stuck on bsd ring
2014-04-30T17:23:22.616207+01:00 kernel[-] info:[drm] stuck on blitter ring
2014-04-30T17:23:22.616232+01:00 kernel[-] info:[drm] GPU crash dump saved to /sys/class/drm/card0/error
2014-04-30T17:23:22.616253+01:00 kernel[-] info:[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
2014-04-30T17:23:22.616273+01:00 kernel[-] info:[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
2014-04-30T17:23:22.616303+01:00 kernel[-] info:[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
2014-04-30T17:23:22.616323+01:00 kernel[-] info:[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
2014-04-30T17:27:38.606633+01:00 kernel[-] info:[drm] stuck on bsd ring
2014-04-30T17:27:38.606733+01:00 kernel[-] info:[drm] stuck on blitter ring
2014-04-30T18:02:45.611611+01:00 kernel[-] info:[drm] stuck on bsd ring
2014-04-30T18:02:45.611713+01:00 kernel[-] info:[drm] stuck on blitter ring
2014-04-30T18:02:49.610836+01:00 kernel[-] err:[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... render ring idle
2014-04-30T18:05:26.612171+01:00 kernel[-] info:[drm] stuck on bsd ring
2014-04-30T18:05:26.612270+01:00 kernel[-] info:[drm] stuck on blitter ring
2014-04-30T18:05:26.612311+01:00 kernel[-] warning:------------[ cut here ]------------
2014-04-30T18:05:26.612341+01:00 kernel[-] warning:WARNING: CPU: 0 PID: 3571 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
2014-04-30T18:05:26.612366+01:00 kernel[-] warning:list_del corruption. prev->next should be ffff8800c5337cc0, but was ffff8800c52b5cc0
2014-04-30T18:05:26.613666+01:00 kernel[-] warning:Modules linked in: dummy ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter nf_conntrack_ipv4 nf_defrag_ipv4 ip6_tables xt_conntrack nf_conntrack cfg80211 rfkill w83627ehf hwmon_vid snd_dummy snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt gpio_ich iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm_intel snd_hda_intel kvm snd_hda_codec snd_hwdep snd_seq snd_seq_device crct10dif_pclmul snd_pcm crc32_pclmul crc32c_intel ghash_clmulni_intel lpc_ich microcode serio_raw mfd_core hid_multitouch snd_timer e1000e i2c_i801 snd mei_me soundcore shpchp nuvoton_cir mei rc_core ptp pps_core wmi i915 i2c_algo_bit drm_kms_helper drm i2c_core video
2014-04-30T18:05:26.613814+01:00 kernel[-] warning:CPU: 0 PID: 3571 Comm: queue70:src Tainted: G        W    3.14.1-200.90.fc20.x86_64 #1
2014-04-30T18:05:26.613837+01:00 kernel[-] warning:Hardware name: ONELAN NTB6500/DH61AG, BIOS AGH6110H.86A.0042.2012.0723.1130 07/23/2012
2014-04-30T18:05:26.613856+01:00 kernel[-] warning: 0000000000000000 00000000fda0ea1a ffff8800c5337b50 ffffffff816eeaf2
2014-04-30T18:05:26.613879+01:00 kernel[-] warning: ffff8800c5337b98 ffff8800c5337b88 ffffffff8108a1bd ffff8800c5337ca8
2014-04-30T18:05:26.613899+01:00 kernel[-] warning: ffff8800c5337cc0 ffff880037219990 0000000000000216 ffff880037218000
2014-04-30T18:05:26.613920+01:00 kernel[-] warning:Call Trace:
2014-04-30T18:05:26.613940+01:00 kernel[-] warning: [<ffffffff816eeaf2>] dump_stack+0x45/0x56
2014-04-30T18:05:26.613970+01:00 kernel[-] warning: [<ffffffff8108a1bd>] warn_slowpath_common+0x7d/0xa0
2014-04-30T18:05:26.613989+01:00 kernel[-] warning: [<ffffffff8108a23c>] warn_slowpath_fmt+0x5c/0x80
2014-04-30T18:05:26.614008+01:00 kernel[-] warning: [<ffffffff81095f8b>] ? lock_timer_base.isra.35+0x2b/0x50
2014-04-30T18:05:26.614029+01:00 kernel[-] warning: [<ffffffff8136d001>] __list_del_entry+0xa1/0xd0
2014-04-30T18:05:26.614793+01:00 kernel[-] warning: [<ffffffff810d2003>] finish_wait+0x43/0x70
2014-04-30T18:05:26.614815+01:00 kernel[-] warning: [<ffffffffa00982a9>] __wait_seqno+0x349/0x4f0 [i915]
2014-04-30T18:05:26.614835+01:00 kernel[-] warning: [<ffffffff810d20e0>] ? abort_exclusive_wait+0xb0/0xb0
2014-04-30T18:05:26.614856+01:00 kernel[-] warning: [<ffffffffa0097d00>] ? i915_gem_file_idle_work_handler+0x20/0x20 [i915]
2014-04-30T18:05:26.614876+01:00 kernel[-] warning: [<ffffffffa009a340>] ? i915_gem_object_set_to_cpu_domain+0x30/0x180 [i915]
2014-04-30T18:05:26.614909+01:00 kernel[-] warning: [<ffffffffa009fc1c>] i915_gem_set_domain_ioctl+0x16c/0x250 [i915]
2014-04-30T18:05:26.614930+01:00 kernel[-] warning: [<ffffffffa0020bf2>] drm_ioctl+0x4f2/0x620 [drm]
2014-04-30T18:05:26.614950+01:00 kernel[-] warning: [<ffffffff815c8f97>] ? SYSC_recvfrom+0x127/0x160
2014-04-30T18:05:26.614969+01:00 kernel[-] warning: [<ffffffff811fcba0>] do_vfs_ioctl+0x2e0/0x4a0
2014-04-30T18:05:26.614990+01:00 kernel[-] warning: [<ffffffff811fcde1>] SyS_ioctl+0x81/0xa0
2014-04-30T18:05:26.615009+01:00 kernel[-] warning: [<ffffffff816feea9>] system_call_fastpath+0x16/0x1b
2014-04-30T18:05:26.615030+01:00 kernel[-] warning:---[ end trace 80b0e927330f14a6 ]---

I'll attach the full kernel messages from that boot.
Comment 37 Simon Farnsworth 2014-05-01 10:52:15 UTC
Created attachment 98285 [details]
Kernel messages from syslog
Comment 38 Simon Farnsworth 2014-05-07 10:28:19 UTC
Created attachment 98614 [details]
Error state just before a hang

And we've managed to extract the error state. Shortly after this, the system hangs completely.

We have all the patches from this bug in place.
Comment 39 Chris Wilson 2014-05-07 11:36:10 UTC
Hmm, that error state has some obvious libva badness as well as the mystery QuickSync hang. Sadly, that could very well explain a system hang.
Comment 40 Simon Farnsworth 2014-05-07 13:33:39 UTC
Worth reporting a fresh libva bug for the error state?
Comment 41 Jani Nikula 2014-06-10 16:38:19 UTC
commit ca79d888eb63cdacf80653ae23ce8f7d9ac52c68
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 6 10:22:29 2014 +0100

    drm/i915: Reorder semaphore deadlock check
Comment 42 Jani Nikula 2014-09-05 12:37:15 UTC
Okay, there's now

commit 4be173813e57c7298103a83155c2391b5b167b4c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 6 10:22:29 2014 +0100

    drm/i915: Reorder semaphore deadlock check

and

commit a0d036b074b4a5a933e37fcb9bdd6b3cc80a0387
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 19 12:40:42 2014 +0100

    drm/i915: Reorder the semaphore deadlock check, again

in drm-intel-nightly. Presuming fixed. Thanks for the report, please reopen if the problem persists.

