Bug 28852 - [i965 page flipping] GPU hang on 2.6.34-45 32-bit PAE kernel with GL compositor
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Jesse Barnes
QA Contact: Xorg Project Team
Reported: 2010-06-30 08:48 UTC by Simon Farnsworth
Modified: 2010-07-13 04:54 UTC

Description Simon Farnsworth 2010-06-30 08:48:56 UTC
I'm trying to get page flipping working on all my Intel hardware, and I'm hitting a hang on my 965GME hardware (AVALUE EMX 965GME motherboard, Core 2 Duo T7250 CPU). Unlike the 945 hang (being chased in bug 28788), it's neither immediate nor easy to reproduce: it takes around 2 days of continuous uptime (60Hz display), and it doesn't tickle the hangcheck timer, so intel_error_decode doesn't find anything. With the aid of "echo t > /proc/sysrq-trigger", I've been able to determine that X is stalled in the kernel in i915_gem_wait_for_pending_flip.

I'm using:
 * Fedora kernel 2.6.34-45.fc14.i686.PAE (i686 architecture)
 * xf86-video-intel as of git 28c0ca676c47e7e38fabdd9ef24a70bd26701f33
 * xserver as of git 3b3c77b87070ddcdbb2acb114a81628485e7a129
 * mesa as of git 7a9246c5d72290ed8455a426801b85b54374e102
 * libdrm as of git 726210f87d558d558022f35bc8c839e798a19f0c

The trace from the kernel is:
Xorg          S 00006585     0  1403      1 0x00400000
 f405ddc0 00203086 04182167 00006585 c0a4fd40 c0a4fd40 c0a4fd40 c0a4fd40
 f401e8ac c0a4fd40 c0a4fd40 000280aa 00000000 f42b8c00 00006585 f401e600
 00000000 f401e600 f405de20 f60b10a4 f405de40 f8067133 0000002e 80000000
Call Trace:
 [<f8067133>] i915_gem_do_execbuffer+0x378/0xbf8 [i915]
 [<f8062c62>] ? list_move_tail+0x18/0x1b [i915]
 [<c04c8f56>] ? __kmalloc+0xfc/0x108
 [<c045212d>] ? autoremove_wake_function+0x0/0x2f
 [<f8067a4f>] i915_gem_execbuffer2+0x9c/0xe2 [i915]
 [<f7f4aa8c>] drm_ioctl+0x237/0x317 [drm]
 [<f80679b3>] ? i915_gem_execbuffer2+0x0/0xe2 [i915]
 [<c04d1976>] ? fsnotify_modify+0x4f/0x5a
 [<c04dc1c9>] vfs_ioctl+0x27/0x91
 [<f7f4a855>] ? drm_ioctl+0x0/0x317 [drm]
 [<c04dc76a>] do_vfs_ioctl+0x48e/0x4cc
 [<c040767f>] ? __switch_to+0x125/0x155
 [<c0437e3b>] ? finish_task_switch+0x34/0x92
 [<c0786093>] ? schedule+0x585/0x5d9
 [<c04d2662>] ? vfs_writev+0x36/0x44
 [<c04dc7e9>] sys_ioctl+0x41/0x61
 [<c040885f>] sysenter_do_call+0x12/0x28
 [<c0780000>] ? init_intel+0x140/0x355

I've confirmed that I'm still seeing interrupts from the device:
# grep i915 /proc/interrupts && sleep 1 && grep i915 /proc/interrupts
  26:   17245844          1   PCI-MSI-edge      i915
  26:   17245904          1   PCI-MSI-edge      i915

and (while hung):
# ~/vbltest 
trying to load module i915...success.
starting count: 27352347
freq: 60.08Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz
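
As a rough cross-check of the two samples above, the difference between the /proc/interrupts counters can be turned into a rate and compared with the refresh rate that vbltest reports. The Python sketch below is illustrative only: the helper name irq_count is made up, and it assumes that the "sleep 1" accounts for essentially all of the time between the two greps and that every i915 interrupt in that window was a vblank.

```python
def irq_count(line):
    """Sum the per-CPU counters on one /proc/interrupts line."""
    fields = line.split()
    total = 0
    # fields[0] is the IRQ number ("26:"); the numeric fields that follow
    # are per-CPU counts, and the first non-numeric field ends them.
    for field in fields[1:]:
        if not field.isdigit():
            break
        total += int(field)
    return total


# The two samples quoted in the report.
before = "  26:   17245844          1   PCI-MSI-edge      i915"
after = "  26:   17245904          1   PCI-MSI-edge      i915"

# Assumption: the elapsed time is the one-second sleep, nothing more.
interval_s = 1.0

rate = (irq_count(after) - irq_count(before)) / interval_s
print(f"{rate:.1f} interrupts/s")
```

On these numbers the result is 60 interrupts per second, which is in line with the roughly 60Hz that vbltest shows.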

Restarting X shows that the GPU isn't hung, but I get lots of:
[176995.150] (WW) intel(0): get vblank counter failed: Invalid argument
[176995.150] (WW) intel(0): first get vblank counter failed: Invalid argument
[176995.172] (WW) intel(0): get vblank counter failed: Invalid argument
[176995.172] (WW) intel(0): first get vblank counter failed: Invalid argument
in the X log.

# ~/modetest -s 12:1920x1200 -v # connector 12 is DVI-D
trying to load module i915...success.
setting mode 1920x1200 on connector 12, crtc 4
freq: 60.36Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz
freq: 59.80Hz

works too: I get rapid flicking between a colourful screen and a grey screen. If I then try to restart X11 (after running modetest), I get a complete system hang, with no response to a PS/2 keyboard or on the network.

I'm going to try leaving the system running modetest instead of X and the GL compositor, to see if that suffers a similar fate.
Comment 1 Simon Farnsworth 2010-06-30 08:52:44 UTC
Just tried running modetest, quitting it, and restarting it - that immediately jams:
# ./modetest -s 12:1920x1200 -v
trying to load module i915...success.
setting mode 1920x1200 on connector 12, crtc 4
select timed out or error (ret 0)
select timed out or error (ret 0)

and the display is stuck on the colourful screen.
Comment 2 Jesse Barnes 2010-06-30 09:17:45 UTC
Does the modetest trace look the same as the earlier X trace?  These messages:

[176995.150] (WW) intel(0): get vblank counter failed: Invalid argument
[176995.150] (WW) intel(0): first get vblank counter failed: Invalid argument

look like the kernel is rejecting vblank event requests for some reason.  A lack of space (i.e. unconsumed events) should result in an -ENOMEM return though; are there any messages in dmesg indicating why the call failed?
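
To make the errno distinction concrete: the text in the X log is the C library's message for whichever errno the call returned, so it can be mapped back to a symbolic name. A small Python sketch follows. The helper errno_names_for is hypothetical, and the message strings come from the local C library, so this assumes a Linux/glibc system like the reporter's.

```python
import errno
import os


def errno_names_for(message):
    """Return the symbolic errno names whose strerror() text equals message."""
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


# The message seen in the X log.
print(errno_names_for("Invalid argument"))

# The message that an -ENOMEM return would have produced instead.
print(os.strerror(errno.ENOMEM))
```

If this reports EINVAL for "Invalid argument", the failures in the log are not the -ENOMEM case described above.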
Comment 3 Simon Farnsworth 2010-06-30 09:34:09 UTC
modetest never gives me a helpful trace - it's always stuck in select when it dies.

There are no messages in dmesg when things go wrong to suggest why it's failing. It's just dead. I've also found a way to kill the system (no network, no local console): run "./modetest -s 12:1920x1200 -v", leave it for a few seconds, and press enter to have it shut down nicely. Then run it again, which gives output like:
# ./modetest -s 12:1920x1200 -v
trying to load module i915...success.
setting mode 1920x1200 on connector 12, crtc 4
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)
select timed out or error (ret 0)

With modetest running in one SSH session, run vbltest in another session. Watch the machine disappear out from under you - even magic SysRq is gone.

Some tidbits that might help:
 * To get identical frequency outputs from vbltest and modetest, I need to run vbltest -s.
 * If I run vbltest -s instead of vbltest in my "kill the world" setup, it doesn't die.
 * If I run vbltest -s while the first modetest is running, the second one succeeds.

I wonder if we're not requesting the right IRQs...
Comment 4 Simon Farnsworth 2010-07-13 02:09:18 UTC
With the fixes in bug #28788 applied (all three patches, with the order of finish and prepare reversed as in comment 34 on that bug), I no longer get hangs.

The patches are:
https://bugs.freedesktop.org/attachment.cgi?id=36463
https://bugs.freedesktop.org/attachment.cgi?id=36464
https://bugs.freedesktop.org/attachment.cgi?id=35551

And then, in https://bugs.freedesktop.org/attachment.cgi?id=36464 change the following hunk:
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 2479be0..a846cd8 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -940,22 +940,30 @@ irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
         if (HAS_BSD(dev) && (iir & I915_BSD_USER_INTERRUPT))
             DRM_WAKEUP(&dev_priv->bsd_ring.irq_queue);

-        if (iir & I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT)
+        if (iir & I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT) {
             intel_prepare_page_flip(dev, 0);
+            if (dev_priv->flip_pending_is_done)
+                intel_finish_page_flip_plane(dev, 0);
+        }

-        if (iir & I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT)
+        if (iir & I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT) {
+            if (dev_priv->flip_pending_is_done)
+                intel_finish_page_flip_plane(dev, 1);
             intel_prepare_page_flip(dev, 1);
+        }

         if (pipea_stats & vblank_status) {
             vblank++;
             drm_handle_vblank(dev, 0);
-            intel_finish_page_flip(dev, 0);
+            if (!dev_priv->flip_pending_is_done)
+                intel_finish_page_flip(dev, 0);
         }

         if (pipeb_stats & vblank_status) {
             vblank++;
             drm_handle_vblank(dev, 1);
-            intel_finish_page_flip(dev, 1);
+            if (!dev_priv->flip_pending_is_done)
+                intel_finish_page_flip(dev, 1);
         }

         if ((pipea_stats & I915_LEGACY_BLC_EVENT_STATUS) ||


to

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 2479be0..a846cd8 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -940,22 +940,30 @@ irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
         if (HAS_BSD(dev) && (iir & I915_BSD_USER_INTERRUPT))
             DRM_WAKEUP(&dev_priv->bsd_ring.irq_queue);

-        if (iir & I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT)
+        if (iir & I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT) {
             intel_prepare_page_flip(dev, 0);
+            if (dev_priv->flip_pending_is_done)
+                intel_finish_page_flip_plane(dev, 0);
+        }

-        if (iir & I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT)
+        if (iir & I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT) {
             intel_prepare_page_flip(dev, 1);
+            if (dev_priv->flip_pending_is_done)
+                intel_finish_page_flip_plane(dev, 1);
+        }

         if (pipea_stats & vblank_status) {
             vblank++;
             drm_handle_vblank(dev, 0);
-            intel_finish_page_flip(dev, 0);
+            if (!dev_priv->flip_pending_is_done)
+                intel_finish_page_flip(dev, 0);
         }

         if (pipeb_stats & vblank_status) {
             vblank++;
             drm_handle_vblank(dev, 1);
-            intel_finish_page_flip(dev, 1);
+            if (!dev_priv->flip_pending_is_done)
+                intel_finish_page_flip(dev, 1);
         }

         if ((pipea_stats & I915_LEGACY_BLC_EVENT_STATUS) ||
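
Why the order matters can be illustrated with a toy model. The Python sketch below is not the driver code: FlipWork and the three functions are hypothetical stand-ins, and it assumes that the finish step only completes a flip that the prepare step has already marked pending. The thread does not show that behaviour, so it should be checked against the driver source.

```python
class FlipWork:
    """Hypothetical stand-in for one queued page flip on a plane."""

    def __init__(self):
        self.pending = False  # set by the prepare step
        self.done = False     # set by the finish step


def prepare_page_flip(work):
    """Mark the queued flip as pending."""
    work.pending = True


def finish_page_flip(work):
    """Complete the flip, but only if it was already marked pending.

    Assumption: a flip that is not yet pending is silently ignored.
    """
    if work.pending:
        work.done = True


def handle_flip_pending_irq(work, finish_first):
    """Model one flip-pending interrupt with either call order."""
    if finish_first:
        # Order in the plane B branch of the original hunk.
        finish_page_flip(work)
        prepare_page_flip(work)
    else:
        # Order in the corrected hunk.
        prepare_page_flip(work)
        finish_page_flip(work)
    return work.done


print(handle_flip_pending_irq(FlipWork(), finish_first=True))
print(handle_flip_pending_irq(FlipWork(), finish_first=False))
```

Under that assumption, calling finish before prepare in the same interrupt leaves the flip uncompleted, while the corrected order completes it.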
Comment 5 Brian Rogers 2010-07-13 04:54:06 UTC
For reference, the first two patches and the correction to the second patch are included in 2.6.35-rc4 under the following commits:

83f7fd0 drm/i915: don't queue flips during a flip pending event
1afe3e9 drm/i915: gen3 page flipping fixes
70565d0 drm/i915: fix page flip finish vs. prepare on plane B

And the final patch is currently committed to drm-intel-next as the following:

f602afd drm/i915: Include instdone[1] in hangcheck

