Bug 33309

Summary: [855GM] GPU freeze due to overlay hang
Product: xorg
Reporter: nepo <dwistal>
Component: Driver/intel
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED NOTOURBUG
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: daniel, fdo.12.bendus
Version: unspecified
Keywords: patch
Hardware: x86 (IA32)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                     Flags
i915 error state                                none
dmesg                                           none
xorg                                            none
error state13                                   none
error state30                                   none
write NOPID reg after MI_WAIT for overlay       none
i915_error_state from 3.3.4 unpatched           none
i915_error_state from 3.3.4 patched             none
i915_error_state from 3.3.5 unpatched           none
i915_error_state from 3.3.5 unpatched, no. 2    none
i915_error_state from 3.3.5 patched             none
i915_error_state from 3.3.5 patched, no. 2      none

Description nepo 2011-01-20 15:40:26 UTC
The GPU hangs after 5-20 seconds when watching an AVI video in, e.g., smplayer or vlc. First the player window turns blue, then the desktop freezes. The hang is reproducible. I initially thought it was a coherency error, but Daniel Vetter pointed to an overlay hang. Thanks for your help!

Current Operating System: Linux frank 2.6.37-graphics2+12-generic (855 patched kernel by Bryan)
OS: Kubuntu Maverick on a Fujitsu Siemens M7400
chipset: Intel 855GM
system arch: i686
xserver-xorg-video-intel: 2:2.12.0-1ubuntu5.1
xserver core: 2:1.9.0-0ubuntu7.3
mesa: 7.9~git20100924-0ubuntu2 // 1.3 Mesa 7.9-devel
libdrm-intel1: 2.4.21-1ubuntu2.1
Comment 1 nepo 2011-01-20 15:41:18 UTC
Created attachment 42243 [details]
i915 error state
Comment 2 nepo 2011-01-20 15:42:17 UTC
Created attachment 42245 [details]
dmesg
Comment 3 nepo 2011-01-20 15:42:54 UTC
Created attachment 42246 [details]
xorg
Comment 4 nepo 2011-01-20 15:44:16 UTC
xorg.conf just points to the intel driver.
Comment 5 Chris Wilson 2011-01-24 14:25:02 UTC
0x00015808:      0x08800000: MI_OVERLAY_FLIP | CONTINUE
0x0001580c:      0x34591001:    dword 1
0x00015810:      0x01810000: MI_WAIT_FOR_EVENT <-- HANG
0x00015814:      0x08c00000: MI_OVERLAY_FLIP | OFF
0x00015818:      0x34591001:    dword 1
0x0001581c:      0x01810000: MI_WAIT_FOR_EVENT
0x00015820:      0x10800001: MI_STORE_DATA_INDEX
0x00015824:      0x00000080:    dword 1
0x00015828:      0x00027a76:    dword 2

Which also explains why the overlay registers were not recorded.

Can you keep gathering the error-states and maybe we will strike it lucky and spot some vital information?
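
For readers following the ring dump above, here is a minimal sketch of how the header dwords decode. It assumes the MI command layout used by the kernel's MI_INSTR() macro (opcode in bits 28:23) and the overlay and wait-event flag bits as they appear in i915_reg.h of that era. The helper names (mi_opcode, mi_overlay_mode, mi_waits_for_overlay_flip) are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding, following the kernel's MI_INSTR() convention:
 * bits 28:23 hold the opcode, the low bits hold per-command flags. */
#define MI_INSTR(opcode, flags)   (((uint32_t)(opcode) << 23) | (flags))
#define MI_WAIT_FOR_EVENT         MI_INSTR(0x03, 0)
#define MI_WAIT_FOR_OVERLAY_FLIP  (1u << 16)
#define MI_OVERLAY_FLIP           MI_INSTR(0x11, 0)
#define MI_OVERLAY_CONTINUE       (0x0u << 21)
#define MI_OVERLAY_ON             (0x1u << 21)
#define MI_OVERLAY_OFF            (0x2u << 21)

/* Hypothetical helper: extract the 6-bit MI opcode from a header dword. */
static unsigned mi_opcode(uint32_t dw)
{
    return (dw >> 23) & 0x3f;
}

/* Hypothetical helper: return the mode field (bits 22:21) of an
 * MI_OVERLAY_FLIP header, comparable against MI_OVERLAY_ON/OFF/CONTINUE. */
static uint32_t mi_overlay_mode(uint32_t dw)
{
    assert(mi_opcode(dw) == 0x11); /* only meaningful for MI_OVERLAY_FLIP */
    return dw & (0x3u << 21);
}

/* Hypothetical helper: true if an MI_WAIT_FOR_EVENT waits on an overlay flip. */
static int mi_waits_for_overlay_flip(uint32_t dw)
{
    return mi_opcode(dw) == 0x03 && (dw & MI_WAIT_FOR_OVERLAY_FLIP) != 0;
}
```

Under these assumptions, 0x08800000 decodes as opcode 0x11 with the CONTINUE mode, 0x08c00000 as the same opcode with the OFF mode, and 0x01810000 as opcode 0x03 with the overlay-flip wait bit set, which matches the annotations in the dump.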
Comment 6 nepo 2011-01-24 14:53:00 UTC
So I just need to let the laptop run a bit longer after the GPU hang? Thx, n.
Comment 7 Chris Wilson 2011-01-24 14:59:34 UTC
No, only the first error is captured after a hang. You will just have to induce hangs more often. ;-)
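
Since only the first error is kept, it can help to copy the error state to a regular file right after each hang. The following is a rough sketch, not a supported tool: it assumes debugfs is mounted at /sys/kernel/debug, that the GPU is DRM card 0, and that the program runs as root. The function name save_error_state is hypothetical.

```c
#include <stdio.h>

/* Assumed default location of the error state (debugfs, DRM minor 0). */
#define I915_ERROR_STATE "/sys/kernel/debug/dri/0/i915_error_state"

/* Copy src to dst. Returns the number of bytes copied, or -1 on failure. */
long save_error_state(const char *src, const char *dst)
{
    char buf[4096];
    size_t n;
    long total = 0;
    FILE *in, *out;

    in = fopen(src, "rb");
    if (!in)
        return -1;
    out = fopen(dst, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            total = -1; /* short write, e.g. disk full */
            break;
        }
        total += (long)n;
    }
    if (ferror(in))
        total = -1;
    fclose(in);
    if (fclose(out) != 0)
        total = -1;
    return total;
}
```

A caller would pass I915_ERROR_STATE as src and a distinct file name per hang as dst, so that the dumps can be compared later.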
Comment 8 nepo 2011-01-31 11:00:05 UTC
Created attachment 42765 [details]
error state13
Comment 9 nepo 2011-01-31 11:01:25 UTC
Created attachment 42766 [details]
error state30
Comment 10 nepo 2011-01-31 11:02:53 UTC
I saved two more error states, and errorstate13 (42765) looks a bit different. Hopefully these files help! N.
Comment 11 nepo 2011-02-10 01:06:54 UTC
Hmm, does nobody have an idea? It's a bit frustrating, since almost every video freezes my laptop after a few seconds :P
Let me know if other information is needed or if there's a way to capture this error in more detail!
Comment 12 Eugeni Dodonov 2011-09-08 15:55:52 UTC
This issue is affecting a hardware component which is not being actively worked on anymore.

Moving the assignee to the dri-devel list as contact, to give this issue better coverage.
Comment 13 Daniel Vetter 2011-09-08 23:50:15 UTC
Created attachment 50994 [details] [review]
write NOPID reg after MI_WAIT for overlay

If you're still around, I've just stumbled over this little hint in the docs. Maybe it actually helps. Test feedback highly appreciated.

Thanks, Daniel
Comment 14 stefan 2012-03-26 13:17:45 UTC
Hi Daniel,
I believe I'm also bitten by this bug and I'd like to test your patch.
Unfortunately, MI_WRITE_NOPID_REG is not defined in the newest kernels
and I couldn't find any reference. If you could update your patch with
its definition, I can give it a go.

Cheers,
Stefan.
Comment 15 Daniel Vetter 2012-03-26 13:40:43 UTC
> --- Comment #14 from stefan <fdo.12.bendus@xoxy.net> 2012-03-26 13:17:45 PDT ---
> Hi Daniel,
> I believe I'm also bitten by this bug and I'd like to test your patch.
> Unfortunately, MI_WRITE_NOPID_REG is not defined in the newest kernels
> and I couldn't find any reference. If you could update your patch with
> its definition, I can give it a go.

Just add
#define MI_WRITE_NOPID_REG (1 << 22)
somewhere.
Yours, Daniel
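
For context, a minimal sketch of how such a flag would typically be combined into a command dword. It assumes, based only on the patch title, that the flag is OR'ed into an MI_NOOP emitted after the MI_WAIT_FOR_EVENT, and that the low 22 bits of the MI_NOOP carry the value written to the NOPID register. MI_NOPID_MASK and mi_noop_write_nopid are hypothetical names; the attached patch itself may look different.

```c
#include <stdint.h>

/* Assumed encoding, following the kernel's MI_INSTR() convention. */
#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_NOOP                  MI_INSTR(0x00, 0)
#define MI_WRITE_NOPID_REG       (1u << 22)        /* definition from this comment */
#define MI_NOPID_MASK            ((1u << 22) - 1)  /* hypothetical: ID in bits 21:0 */

/* Hypothetical helper: build an MI_NOOP that also writes id to NOPID. */
uint32_t mi_noop_write_nopid(uint32_t id)
{
    return MI_NOOP | MI_WRITE_NOPID_REG | (id & MI_NOPID_MASK);
}
```

Masking the ID keeps a large value from spilling into the opcode bits and turning the no-op into a different command.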
Comment 16 stefan 2012-04-26 07:48:56 UTC
(In reply to comment #15)
> Just add
> #define MI_WRITE_NOPID_REG (1 << 22)
> somewhere.
> Yours, Daniel

Hi Daniel,

sorry for answering so late. I had a few hangs with 3.3.0 with this patch
until I realised that I had relaxed fencing enabled (which I had turned on
out of curiosity at some point). I have since gone back to a plain config and
had a GPU hang with vanilla 3.3.3. I can attach the error state output
if you think it can still be useful.

The good news is that 3.3.0 with this patch seems stable, and I am still
testing 3.3.3 with this patch and will report back if I manage to hang the
GPU (or not).

Cheers,
Stefan.
Comment 17 Daniel Vetter 2012-04-26 07:50:38 UTC
On Thu, Apr 26, 2012 at 16:48,  <bugzilla-daemon@freedesktop.org> wrote:
>
> The good news is that 3.3.0 with this patch seems stable, and I am still
> testing 3.3.3 with this patch and will report back if I manage to hang the
> GPU (or not).

Please also check whether 3.3.0 without the patch still crashes,
otherwise we don't really know whether it's the patch that fixes
things.
Comment 18 stefan 2012-05-14 13:52:54 UTC
(In reply to comment #17)
> On Thu, Apr 26, 2012 at 16:48,  <bugzilla-daemon@freedesktop.org> wrote:
> >
> > The good news is that 3.3.0 with this patch seems stable, and I am still
> > testing 3.3.3 with this patch and will report back if I manage to hang the
> > GPU (or not).
> 
> Please also check whether 3.3.0 without the patch still crashes,
> otherwise we don't really know whether it's the patch that fixes
> things.

Hi Daniel,
unfortunately, the patch doesn't seem to help.
I tested 3.3.3, 3.3.4, and 3.3.5, each patched and unpatched.
Unlike in the original report, it takes a while for the GPU
to hang, sometimes hours, but I had hangs with both patched and unpatched kernels.

I captured some i915_error_states for the hangs, if you are
interested.

Cheers,
Stefan.
Comment 19 Chris Wilson 2012-05-14 13:58:13 UTC
Always useful to check to see if there is any variation in the error states. Can you also try: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=fastboot&id=04c8b699bdc9d707233399adf04900507c55bf3b
Comment 20 stefan 2012-05-15 12:58:42 UTC
Created attachment 61685 [details]
i915_error_state from 3.3.4 unpatched
Comment 21 stefan 2012-05-15 12:59:34 UTC
Created attachment 61686 [details]
i915_error_state from 3.3.4 patched
Comment 22 stefan 2012-05-15 13:00:21 UTC
Created attachment 61687 [details]
i915_error_state from 3.3.5 unpatched
Comment 23 stefan 2012-05-15 13:01:58 UTC
Created attachment 61688 [details]
i915_error_state from 3.3.5 unpatched, no. 2
Comment 24 stefan 2012-05-15 13:04:55 UTC
Created attachment 61689 [details]
i915_error_state from 3.3.5 patched
Comment 25 stefan 2012-05-15 13:08:32 UTC
Hi Chris,

(In reply to comment #19)
> Always useful to check to see if there is any variation in the error states.
> Can you also try:
> http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=fastboot&id=04c8b699bdc9d707233399adf04900507c55bf3b

I added some more error state files in the hope they will be useful.
"Patched" means with the patch from comment #13.
I'm still testing your patch.

Cheers,
Stefan.
Comment 26 stefan 2012-05-15 13:16:23 UTC
(In reply to comment #25)
> Hi Chris,
> 
> (In reply to comment #19)
> > Always useful to check to see if there is any variation in the error states.
> > Can you also try:
> > http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=fastboot&id=04c8b699bdc9d707233399adf04900507c55bf3b
> 
> I added some more error state files in the hope they will be useful.
> "Patched" means with the patch from comment #13.
> I'm still testing your patch.
> 
> Cheers,
> Stefan.

As I was writing this, the mplayer window turned blue. :/

So it seems there is no luck with your patch either. At least everything
besides xvideo still works and I can reboot gracefully.
I will attach the error state file; I hope it helps.

Hth,
Stefan.
Comment 27 stefan 2012-05-15 13:18:21 UTC
Created attachment 61693 [details]
i915_error_state from 3.3.5 patched, no. 2

error state from 3.3.5 with Chris' patch.
Comment 28 stefan 2012-07-02 13:13:59 UTC
Hi,
is there any news on this issue?
The 3.4 and 3.5-rc series seem stable wrt this issue,
but unfortunately something broke resume from s2ram badly,
the backlight stays off and the machine does not respond
even to SysRq and I need to do a hard power-off.

Cheers,
Stefan.
Comment 29 Daniel Vetter 2012-07-02 13:29:18 UTC
(In reply to comment #28)
> is there any news on this issue?
> The 3.4 and 3.5-rc series seem stable wrt this issue,
> but unfortunately something broke resume from s2ram badly,
> the backlight stays off and the machine does not respond
> even to SysRq and I need to do a hard power-off.

That's good and bad news. Can you try to bisect the backlight regression that was introduced in 3.4 and open a new bug report? That usually helps in fixing it ...
Comment 30 stefan 2012-07-09 11:18:52 UTC
Hi Daniel,
(In reply to comment #29)
> (In reply to comment #28)
> > are there any news on this issue?
> > The 3.4 and 3.5-rc series seem stable wrt this issue,
> > but unfortunately something broke resume from s2ram badly,
> > the backlight stays off and the machine does not respond
> > even to SysRq and I need to do a hard power-off.
> 
> That's good and bad news. Can you try to bisect the backlight regression that
> was introduced in 3.4 and open a new bug report? That usually helps in fixing
> it ...

It turns out to be an ACPICA regression not related to graphics at all.

As for this issue, I have not observed any more hangs with v3.4 or 3.5-rc, *yet*.
It is hard to tell for sure since it usually takes a while (sometimes hours)
for a hang to occur. That also makes it almost impossible to bisect and find
the commit that might have fixed it.

Cheers,
Stefan.
Comment 31 Chris Wilson 2012-07-09 11:37:34 UTC
Ok, I'm just as surprised; be wary of a surprise attack. In the meantime, have fun!
