Bug 27040

Summary: 3D games will make the system hang or gpu hang when it is in its native resolution
Product: Mesa Reporter: zhao jian <jian.j.zhao>
Component: Drivers/DRI/i915 Assignee: Jesse Barnes <jbarnes>
Status: VERIFIED FIXED QA Contact:
Severity: major    
Priority: high CC: chris, maximlevitsky
Version: unspecified   
Hardware: Other   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: xorg.0.log
dmesg of gpu hang when change to the native mode
avoid queuing flips while flip is pending
blurred screen on G45
Fix exchange validity check

Description zhao jian 2010-03-12 01:55:56 UTC
Created attachment 33985 [details]
xorg.0.log

System Environment:
--------------------------
Libdrm:         (master)04fd3872ee8bd8d5e2c27740508c67c2d51dbc11
Mesa:           (master)aa311ae61680f0fc300e33e8955c6c58cafd5fb4
Xserver:                (master)f2eacb4646beb25d055de22868f93e6b24f229b6
Xf86_video_intel:       (master)318aa9ed799197810e2039dbe3ec51559dcc888c
Kernel: (drm-intel-next)338d762fc2dc2c1493813123fc4cea998bb3e683

Bug detailed description:
-------------------------
3D games (tested with openarena and ut2004-demo) hang the GPU when run at the display's native resolution. The problem occurs on 945GM and 945GME; GM965 and G45 work well.


Reproduce steps:
--------------------
1. xinit
2. Start openarena in the laptop's native mode.
Comment 1 zhao jian 2010-03-12 01:57:14 UTC
Created attachment 33986 [details]
dmesg of gpu hang when change to the native mode
Comment 2 Eric Anholt 2010-03-12 16:06:32 UTC
Is this a regression?  Bisect if so.
Comment 3 zhao jian 2010-03-15 03:03:01 UTC
(In reply to comment #2)
> Is this a regression?  Bisect if so.

This may be hard to bisect. The games first failed in mid-January, when they could not run at all (see bug #26064); a recent commit fixed that but introduced this bug.
Comment 4 Jesse Barnes 2010-03-23 13:21:35 UTC
Works for me with current bits (master of everything, drm-intel-next of kernel) on 945GM at 1024x600 native resolution.

Could be related to the flipping bug that Li Peng recently fixed.
Comment 5 Jesse Barnes 2010-03-23 13:21:49 UTC
Marking as fixed; please confirm or re-open.
Comment 6 zhao jian 2010-03-24 03:46:30 UTC
(In reply to comment #5)
> Marking as fixed; please confirm or re-open.

With the newest code it no longer hangs the GPU, and the game starts. But after running for a while (about 2 minutes) it hangs the system. At a lower resolution I ran it many times without a system hang. So this may be a different issue, but it is still related to the native resolution.
Comment 7 zhao jian 2010-03-25 02:18:10 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Marking as fixed; please confirm or re-open.
> With the newest code it now doesn't hang the gpu and can enter into the game.
> But after it run for a while(about 2 minutes) it will hang the system. If I
> change the resolution smaller, I run it many times and not hang the system. So
> it may be another issue but have something to do with its native resolution. 

Now, on all platforms (G45, 945GM, 945GME), playing a 3D game at native resolution hangs the system. Occasionally it hangs only the GPU. So I am reopening this bug. When the GPU hangs on G45, i915_error_state reports "no error state collected". The intel_gpu_dump output is in the attachment.
Kernel: 2.6.33.1 or newest on drm-intel-next branch (38d922ba211c1efdbe00809c899f1ca4979c84c7)
Libdrm:         (master)c1c8bbf80b1f734e23996bf805dc78f32ebaf56f
Mesa:           (master)99386921e778271c9b3edf90123ab6319e23fc95 
Xserver:                (master)3083c5d0c4386cdd7083b7a83ac72fdad2f1e61e
Xf86_video_intel:       (master)9c037f61a490c96f9095f7ff3fecbf41f5efe9f7
Comment 8 Jesse Barnes 2010-03-25 19:05:06 UTC
I think I've reproduced this on 945, but at least part of the problem is an overaggressive hang check timer.  With that reduced, my tests run for quite a while without a failure, but I still eventually see a hang.  I'm still debugging, but I suspect this is a kernel issue.
Comment 9 Jesse Barnes 2010-03-29 09:20:34 UTC
Another one that needs testing with https://patchwork.kernel.org/patch/88541/
Comment 10 Jesse Barnes 2010-03-29 09:20:50 UTC
(well not just that patch, but the rest of the series too.)
Comment 11 fangxun 2010-03-30 00:34:00 UTC
Tested with the patch series; it still fails.
Comment 12 zhao jian 2010-03-30 05:23:24 UTC
With your patch set, it runs well on 945GME (netbook aspire1). On 945GM there is still a problem: neither the GPU nor the system hangs (the ringbuffer info keeps changing and i915_error_state has hundreds of lines of info), but the picture stops updating a few seconds after I enter openarena. So it seems not to be synced well on 945GM. After killing the game, part of the game image remains on the screen without being cleared.
Comment 13 Jesse Barnes 2010-04-01 17:23:50 UTC
What if you make the hangcheck timer less aggressive?  That seems to make things work much better for me at least, though I haven't root caused the hang check failure yet...

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4e41f0f..99bfdef 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -279,7 +279,7 @@ typedef struct drm_i915_private {
        int vblank_pipe;
 
        /* For hangcheck timer */
-#define DRM_I915_HANGCHECK_PERIOD 75 /* in jiffies */
+#define DRM_I915_HANGCHECK_PERIOD 750 /* in jiffies */
        struct timer_list hangcheck_timer;
        int hangcheck_count;
        uint32_t last_acthd;
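
For scale: a jiffy is one kernel timer tick, so how long 75 or 750 jiffies lasts depends on the kernel's CONFIG_HZ setting, which this report does not state. The following is a minimal userspace sketch of the conversion, not kernel code. The helper names `jiffies_to_ms` and `print_hangcheck_periods` are made up for illustration, and any HZ value passed in is an assumption about the test kernels.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a timer period in jiffies to milliseconds for a given tick rate.
 * hz stands in for the kernel's CONFIG_HZ (ticks per second); it is a
 * parameter because the report does not say which value was in use. */
unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
    assert(hz > 0);
    return jiffies * 1000UL / hz;
}

/* Print the original (75) and patched (750) hangcheck periods for one
 * assumed HZ value. */
void print_hangcheck_periods(unsigned long hz)
{
    printf("HZ=%lu: 75 jiffies = %lu ms, 750 jiffies = %lu ms\n",
           hz, jiffies_to_ms(75, hz), jiffies_to_ms(750, hz));
}
```

With an assumed HZ of 250, the patch would move the hangcheck period from 300 ms to 3 s; with HZ of 1000, from 75 ms to 750 ms. Either way the check fires ten times less often.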
Comment 14 Jesse Barnes 2010-04-05 14:08:38 UTC
Created attachment 34686 [details] [review]
avoid queuing flips while flip is pending

Can you confirm this fix works for you?
Comment 15 zhao jian 2010-04-06 08:23:26 UTC
(In reply to comment #14)
> Created an attachment (id=34686) [details]
> avoid queuing flips while flip is pending
> Can you confirm this fix works for you?

With this fix alone, openarena runs on G45 platforms in native mode but fails on Pinetrail and 945GM. With your 7-patch set, a patch that makes the hangcheck timer longer, and this patch together, openarena runs on G45, 945GME and Pinetrail in native mode. On 945GM it hung once after running for about 10 minutes; after a reboot it ran well for nearly an hour. So I think this patch works.
Comment 16 Jesse Barnes 2010-04-06 11:43:26 UTC
*** Bug 26939 has been marked as a duplicate of this bug. ***
Comment 17 zhao jian 2010-04-07 04:51:52 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Created an attachment (id=34686) [details] [details]
> > avoid queuing flips while flip is pending
> > Can you confirm this fix works for you?
> If with this fix only, openarena can run on G45 platforms in its native mode,
> but fail on pinetrail and 945GM. If use your 7 patch set and a patch that make
> its hang_checker longer and this patch together, openarena can run on G45,
> 945GME and Pinetrail in its native mode. But on 945GM it hangs one time after
> it runs about 10 minutes. After I reboot it, it runs well for nearly an hour.
> So I think this patch works.

Now I can reproduce the "hang issue" on 945GM (in fact it is neither a GPU hang nor a system hang; the screen freezes on one frame and stops updating). If I restart X after the "hang issue" and run 3D tests such as glxgears, they still report fps as normal but nothing appears on screen, and even the basic command "clear" cannot clear the screen. On one of our G45 machines there is no hang, but the picture is sometimes garbled. (A photo taken with my phone is attached; screenshots taken with capture software all show intact pictures.)
Comment 18 zhao jian 2010-04-07 04:53:27 UTC
Created attachment 34758 [details]
blurred screen on G45
Comment 19 Jesse Barnes 2010-04-12 14:45:27 UTC
Does the corruption (bad OA screen + blank glxgears) appear only after you've hung the display and restarted?

How many times did you have to run OA on your G45 to see the problem?

Can you check where X is hanging when this occurs?  I'd guess it's stuck waiting for a vblank or page flip event (so /proc/<pid>/wchan should report "poll_timeout" or similar).
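
The wchan check can also be done from a small program. The sketch below reads /proc/<pid>/wchan, the procfs file that names the kernel function a sleeping process is blocked in. The helper names `wchan_path` and `read_wchan` are made up for illustration, and the pid of the X server has to be supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/proc/<pid>/wchan" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
int wchan_path(long pid, char *buf, size_t len)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "/proc/%ld/wchan", pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the wait channel of a process into out (NUL-terminated).
 * Returns 0 on success, -1 if the file cannot be opened or out is empty. */
int read_wchan(long pid, char *out, size_t len)
{
    char path[64];
    FILE *f;
    size_t n;

    if (len == 0 || wchan_path(pid, path, sizeof path) != 0)
        return -1;

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    n = fread(out, 1, len - 1, f);
    fclose(f);
    out[n] = '\0';
    out[strcspn(out, "\n")] = '\0';   /* drop a trailing newline, if any */
    return 0;
}
```

Calling `read_wchan` with the X server's pid while the screen is frozen would show whether the server is sleeping in a poll-related function, as guessed above. Reading this file may need the same user or root privileges.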

After the hang and X restart, I'd expect you to see "vblank call failed, invalid arg" or something, if so can you instrument the kernel's drm_irq.c file to see where the -EINVAL is coming from?
Comment 20 zhao jian 2010-04-13 03:20:45 UTC
(In reply to comment #19)
> Does the corruption (bad OA screen + blank glxgears) appear only after you've
> hung the display and restarted?
Only on 945GM does openarena freeze the screen, while both the system and the GPU keep working; when I then restart X, glxgears is not displayed. On the same machine, booted from a system on a mobile hard disk, I played three times and did not see the hang. Maybe the hang on 945GM is caused by the OS installed on it. We will try to reproduce it on the new system.

> How many times did you have to run OA on your G45 to see the problem?
The problem on G45 is that the screen is sometimes garbled but refreshes immediately. This issue happens only on G45, and it happens every time I run it.

> Can you check where X is hanging when this occurs?  I'd guess it's stuck
> waiting for a vblank or page flip event (so /proc/<pid>/wchan should report
> "poll_timeout" or similar).
Do you mean the issue on 945GM? On 945GM, it reports poll_schedule_timeout.

> After the hang and X restart, I'd expect you to see "vblank call failed,
> invalid arg" or something, if so can you instrument the kernel's drm_irq.c file
> to see where the -EINVAL is coming from?
This is also on 945GM, right? But I could not find where to get the "vblank call failed, invalid arg" message.
Comment 21 Kristian Høgsberg 2010-06-01 06:37:14 UTC
I think Chris fixed this one with this commit to xf86-video-intel:

commit e2615cdeef078dbd2e834b68c437f098a92b941d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 29 16:37:12 2010 +0100

    dri: Only flip if the front and back pixmaps match.
    
    An unredirected window (thanks Michel for the reminder) is backed by the
    Screen pixmap, and so uses a reference of that as its front buffer. The
    back buffer is a pixmap appropriately sized for the drawable. When the
    application requests to swap its buffers, obviously we cannot simply
    exchange the front and back buffer as they do not match, but need to copy
    the appropriate region from the back to the front.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
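
The rule described in that commit message can be illustrated with a small model. This is only a sketch of the idea, not the xf86-video-intel code: `struct pixmap_info`, `can_exchange` and `choose_swap` are hypothetical names, and the set of fields compared is an assumption about what "match" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pixmap properties that matter for a swap. */
struct pixmap_info {
    int width;
    int height;
    int depth;           /* colour depth in bits */
    int bits_per_pixel;  /* storage size of one pixel */
};

/* Two buffers can trade places only if they describe the same surface shape. */
bool can_exchange(const struct pixmap_info *front,
                  const struct pixmap_info *back)
{
    assert(front != NULL && back != NULL);
    return front->width == back->width &&
           front->height == back->height &&
           front->depth == back->depth &&
           front->bits_per_pixel == back->bits_per_pixel;
}

enum swap_method { SWAP_EXCHANGE, SWAP_COPY };

/* Exchange (flip) when the buffers match; otherwise fall back to copying
 * the drawable's region from back to front. */
enum swap_method choose_swap(const struct pixmap_info *front,
                             const struct pixmap_info *back)
{
    return can_exchange(front, back) ? SWAP_EXCHANGE : SWAP_COPY;
}
```

In this model, an unredirected 640x480 window whose front buffer is a 1280x800 screen pixmap takes the copy path, while a window whose back buffer matches the screen pixmap can be exchanged.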
Comment 22 Jesse Barnes 2010-06-01 11:56:06 UTC
Created attachment 35996 [details] [review]
Fix exchange validity check

Looks like this is working on Cantiga with current X & Mesa bits with the attached patch.  Can you confirm?
Comment 23 Jesse Barnes 2010-06-01 13:51:04 UTC
Fix committed.

commit f2272402035574c206a0e3383c55373c440fd928
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Tue Jun 1 13:46:15 2010 -0700

    DRI2: fix new buffer exchange check
Comment 24 zhao jian 2010-06-02 02:45:22 UTC
(In reply to comment #23)
> Fix committed.
> commit f2272402035574c206a0e3383c55373c440fd928
> Author: Jesse Barnes <jbarnes@virtuousgeek.org>
> Date:   Tue Jun 1 13:46:15 2010 -0700
>     DRI2: fix new buffer exchange check

I tested with this commit in the 2D driver, but it still fails on our Pineviews. The system hangs after running the demo 2~5 times (in non-native mode it runs 10 times successfully). So I think it may not be fully fixed.
code: 
Libdrm:         (master)607e228c263d5d171bd0615d5d93202dda371e5f
Mesa:           (master)79e5bea3cb498e7a663e0f08db49fe2de764650c
Xserver:                (master)e4582d9e5c8649347742a13eae68cf27005296fc
Xf86_video_intel:  (master)f2272402035574c206a0e3383c55373c440fd928
Kernel: both v2.6.34 and 722154e4cacf015161efe60009ae9be23d492296 on for-linus branch.
Comment 25 Jesse Barnes 2010-06-02 11:32:54 UTC
I just ran it 23 times in a row without a hang:

kernel drm-intel-next, e3a815fcd38043b8f1bb526123d8ab6ae01deb77
libdrm master: 73a42a645201a85ce2fe4fc77754df67e5097fc9
mesa master: 5871b7ebc9f9629c076c9fe3c9c32aa9fd531eba
xserver master: e4582d9e5c8649347742a13eae68cf27005296fc + driver load fix
xf86-video-intel master: 2989f51caf3134460c2551de597e7e54fe74ee92

I was running on MeeGo w/o a window manager; I just had a gnome-terminal running and ran Eric's OA demo script in a loop with page flipping enabled.  Flipping was occurring and it seems solid.
Comment 26 zhao jian 2010-06-03 03:12:31 UTC
(In reply to comment #25)
> I just ran it 23 times in a row without a hang:
> kernel drm-intel-next, e3a815fcd38043b8f1bb526123d8ab6ae01deb77
> libdrm master: 73a42a645201a85ce2fe4fc77754df67e5097fc9
> mesa master: 5871b7ebc9f9629c076c9fe3c9c32aa9fd531eba
> xserver master: e4582d9e5c8649347742a13eae68cf27005296fc + driver load fix
> xf86-video-intel master: 2989f51caf3134460c2551de597e7e54fe74ee92
> I was running on MeeGo w/o a window manager; I just had a gnome-terminal
> running and ran Eric's OA demo script in a loop with page flipping enabled. 
> Flipping was occurring and it seems solid.

Yep, I tested with almost the same commits as you listed, except the xserver without the driver load fix (actually I don't know which fix you mean), and it ran 20 times consecutively without problems. So I think it is fixed. But when I change the kernel back to the for-linus branch (722154e4cacf015161efe60009ae9be23d492296), the system hangs after about 8 runs. So maybe it has something to do with our kernel.
Comment 27 Jesse Barnes 2010-06-03 08:59:01 UTC
Now it's been running all day yesterday plus all night, so yeah I think it's fixed.  I don't know all of the differences between for-linus and drm-intel-next though, so I'm not sure which commit fixed it...
Comment 28 zhao jian 2010-06-04 05:37:59 UTC
(In reply to comment #27)
> Now it's been running all day yesterday plus all night, so yeah I think it's
> fixed.  I don't know all of the differences between for-linus and
> drm-intel-next though, so I'm not sure which commit fixed it...

Yes, I pulled the for-linus tree to the newest commit (e3a815fcd38043b8f1bb526123d8ab6ae01deb77) and found that it really works well. I will find out which commit fixed it next week. So verified.
Comment 29 zhao jian 2010-06-06 20:41:24 UTC
I bisected it on the for-linus branch and found it was fixed by commit 9908ff736adf261e749b4887486a32ffa209304c. Can you get it into 2.6.34.x?

commit 9908ff736adf261e749b4887486a32ffa209304c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 15 09:57:03 2010 +0100

    drm/i915: Kill dangerous pending-flip debugging

    We can, by virtue of a vblank interrupt firing in the middle of setting
    up the unpin work (i.e. after we set the unpin_work field and before we
    write to the ringbuffer) enter intel_finish_page_flip() prior to
    receiving the pending flip notification. Therefore we can expect to hit
    intel_finish_page_flip() under normal circumstances without a pending flip
    and even without installing the pending_flip_obj. This is exacerbated by
    aperture thrashing whilst binding the framebuffer

    References:

      Bug 28079 - "glresize" causes kernel panic in intel_finish_page_flip.
      https://bugs.freedesktop.org/show_bug.cgi?id=28079

    Reported-by: Nick Bowler <nbowler@draconx.ca>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
    Cc: stable@kernel.org
    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Eric Anholt <eric@anholt.net>
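
The race in that commit message can be modelled in a few lines. This is a simplified userspace sketch, not the i915 code: `struct flip_work`, `struct crtc_state` and this `finish_page_flip` are hypothetical stand-ins. It only shows a completion handler that treats "called before the flip is pending" as a normal case rather than an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of queued flip work. */
struct flip_work {
    bool pending;   /* set once the pending-flip notification has arrived */
};

struct crtc_state {
    struct flip_work *unpin_work;  /* set before the flip is written to the ring */
    int completed;                 /* number of flips finished so far */
};

/* Vblank-side handler. Returns true if a flip was completed.
 * A vblank may arrive after unpin_work is set but before the flip is
 * pending, so both "no work" and "work not yet pending" return quietly. */
bool finish_page_flip(struct crtc_state *crtc)
{
    assert(crtc != NULL);
    struct flip_work *work = crtc->unpin_work;

    if (work == NULL || !work->pending)
        return false;   /* early vblank: nothing to finish yet */

    crtc->unpin_work = NULL;
    crtc->completed++;
    return true;
}
```

In this model an early vblank leaves the queued work untouched, and the flip completes on a later call once `pending` is set.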
Comment 30 Jesse Barnes 2010-06-07 10:32:00 UTC
Yes, it's been cc'd to stable, so should land in 2.6.34.x soon.
