30637 – [965GM] drm:i915_hangcheck_elapsed on 2.13.0-1 before crashing hard

Status:	CLOSED INVALID

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Chris Wilson
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-10-05 13:19 UTC by Martin Sillence
Modified:	2017-07-24 23:06 UTC
CC List:	3 users

See Also:	https://bugzilla.novell.com/show_bug.cgi?id=658802
i915 platform:
i915 features:

Attachments
kernel log (91.49 KB, text/plain) 2010-10-05 13:19 UTC, Martin Sillence
error state (761.31 KB, text/plain) 2010-10-05 13:21 UTC, Martin Sillence
xorg log (27.99 KB, text/plain) 2010-10-05 13:22 UTC, Martin Sillence
3 crashes with the .37rc4 kernel incuding i915_error_state (689.98 KB, application/x-compressed-tar) 2010-12-02 23:14 UTC, Martin Sillence
another 3 crashes with the .37rc4 kernel incuding i915_error_state (380.15 KB, application/x-compressed-tar) 2010-12-03 12:10 UTC, Martin Sillence
2.6.37-rc5 + xorg reorder fix (151.29 KB, application/x-compressed-tar) 2010-12-16 14:58 UTC, Martin Sillence
i915_error_state (781.42 KB, text/plain) 2010-12-17 07:48 UTC, Michal Marek
dmesg (57.78 KB, text/plain) 2010-12-17 07:49 UTC, Michal Marek
Xorg.0.log (24.08 KB, text/plain) 2010-12-17 07:50 UTC, Michal Marek
Crash with reorder patch including error state (138.71 KB, application/x-compressed-tar) 2011-01-01 04:31 UTC, Martin Sillence

Description Martin Sillence 2010-10-05 13:19:01 UTC

Created attachment 39194 [details]
kernel log

After a few minutes of using X the server slows down
Then things start to fail
Eventually the kernel panics and locks hard

Comment 1 Martin Sillence 2010-10-05 13:20:06 UTC

adding versions

Comment 2 Martin Sillence 2010-10-05 13:21:25 UTC

Created attachment 39195 [details]
error state

Comment 3 Martin Sillence 2010-10-05 13:22:12 UTC

Created attachment 39196 [details]
xorg log

Comment 4 Chris Wilson 2010-10-06 02:37:29 UTC

Could be this race:

commit 39b4d07aa3583ceefe73622841303a0a3e942ca1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 30 09:10:26 2010 +0100

    drm: Hold the mutex when dropping the last GEM reference (v2)
    
    In order to be fully threadsafe we need to check that the drm_gem_object
    refcount is still 0 after acquiring the mutex in order to call the free
    function. Otherwise, we may encounter scenarios like:
    
    Thread A:                                        Thread B:
    drm_gem_close
    unreference_unlocked
    kref_put                                         mutex_lock
    ...                                              i915_gem_evict
    ...                                              kref_get -> BUG
    ...                                              i915_gem_unbind
    ...                                              kref_put
    ...                                              i915_gem_object_free
    ...                                              mutex_unlock
    mutex_lock
    i915_gem_object_free -> BUG
    i915_gem_object_unbind
    kfree
    mutex_unlock
    
    Note that no driver is currently using the free_unlocked vfunc and it is
    scheduled for removal, hasten that process.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30454
    Reported-and-Tested-by: Magnus Kessler <Magnus.Kessler@gmx.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>


Not convinced that it is at the moment, but could you try *just* upgrading the kernel to 2.6.36-latest?
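The interleaving in the commit message above can be modelled outside the kernel. The following is only a userspace Python sketch, not driver code: `ToyObject`, `evict`, `close_racy` and `close_locked` are hypothetical names, and the `between` hook stands in for "another thread runs here". It shows why dropping the reference before taking the mutex lets the object be freed twice, and how doing the drop while holding the mutex (the approach the commit title describes) closes that window.

```python
import threading

class ToyObject:
    """Toy stand-in for a refcounted object (hypothetical, not drm_gem_object)."""
    def __init__(self):
        self.refcount = 1
        self.free_calls = 0   # counts how often the "free" function ran

    def get(self):
        self.refcount += 1

    def put(self):
        """Drop one reference; return True if the count reached zero."""
        self.refcount -= 1
        return self.refcount == 0

def free_object(obj):
    obj.free_calls += 1

def evict(obj, mutex):
    """Thread B's path: under the mutex, take a temporary reference and drop it."""
    with mutex:
        obj.get()              # resurrects an object whose count already hit 0
        if obj.put():
            free_object(obj)

def close_racy(obj, mutex, between=None):
    """Racy ordering: drop the last reference first, lock afterwards."""
    last = obj.put()
    if between:
        between()              # window in which thread B can run
    if last:
        with mutex:
            free_object(obj)   # second free if B already freed the object

def close_locked(obj, mutex, between=None):
    """Safer ordering: the reference is dropped while the mutex is held."""
    with mutex:
        last = obj.put()
        if between:
            between()          # B cannot enter evict() here: the mutex is taken
        if last:
            free_object(obj)

if __name__ == "__main__":
    mutex = threading.Lock()

    racy = ToyObject()
    close_racy(racy, mutex, between=lambda: evict(racy, mutex))
    print("racy free calls:", racy.free_calls)

    safe = ToyObject()
    blocked = []
    close_locked(safe, mutex,
                 between=lambda: blocked.append(not mutex.acquire(blocking=False)))
    print("locked free calls:", safe.free_calls, "other thread blocked:", blocked)
```

The model is deliberately single-threaded so the bad interleaving is reproduced every time rather than by chance; a real kernel fix also has to consider lock ordering and callers that already hold the mutex, which this sketch ignores.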

Comment 5 Martin Sillence 2010-10-06 15:16:04 UTC

I tried 2.6.36-rc6; not sure if that is recent enough?
It seems to die pretty quickly too. It crashed hard while I was trying to capture the failure.

Comment 6 Martin Sillence 2010-10-06 15:18:45 UTC

I note other people are having success with this driver. Is it something specific about the 965GM chipset?

I've not been able to run a modern driver for some time now; the legacy backport was the only driver that almost worked.

Comment 7 Chris Wilson 2010-12-01 04:02:03 UTC

Martin, I can say that I don't see this on my 965GM ;-)

Can you please check again with the latest linus, airlied or drm-intel-next/-fixes and grab a fresh kernel log? Then I know that the known crashes and races have been fixed.

Comment 8 Martin Sillence 2010-12-01 12:45:10 UTC

Just checking, you want me to test just upgrading the kernel to 2.6.37-rc4?
I've currently got 2:2.13.0-2. I note there's a newer one in experimental, but I seem to have trouble installing it:
 xserver-xorg-core: Breaks: xserver-xorg-input-7

Comment 9 Chris Wilson 2010-12-01 12:55:13 UTC

Yes, please do try and grab a dmesg + i915_error_state from a recent kernel if you have the opportunity. The existing i915_error_state doesn't contain an obvious error, so I just want to rule out the reported memory corruption first.

Comment 10 Martin Sillence 2010-12-01 23:25:25 UTC

uname -a
Linux griffin 2.6.37-rc4 #1 SMP Wed Dec 1 21:05:51 GMT 2010 x86_64 GNU/Linux

kern log
Dec  2 07:13:17 griffin kernel: [  116.688049] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Dec  2 07:13:17 griffin kernel: [  116.744544] [drm:intel_panel_get_max_backlight] *ERROR* fixme: max PWM is zero.
Dec  2 07:13:18 griffin kernel: [  117.772036] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Dec  2 07:13:18 griffin kernel: [  117.772485] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Dec  2 07:13:18 griffin kernel: [  117.772492] [drm:i915_reset] *ERROR* Failed to reset chip.


xorg log
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.
(WW) intel(0): intel_uxa_prepare_access: bo map failed: Input/output error
(WW) intel(0): intel_uxa_prepare_access: bo map failed: Input/output error
(WW) intel(0): intel_uxa_prepare_access: bo map failed: Input/output error
(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

The screen goes blank (the backlight went off); I'm unable to switch to text mode and have to ssh in.

BTW good to hear this is working on other machines, thanks for that info.
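The kernel log lines quoted in this comment follow a regular shape, so a small script can pull out the hangcheck events when comparing several crashes. This is a rough sketch only: the regular expression is derived from the five lines shown here and may not match other kernel versions, and `parse_drm_errors` and `summarize` are hypothetical helper names.

```python
import re

# Matches lines such as:
#   "... kernel: [  116.688049] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung"
# Assumption: timestamp in brackets, then "[drm:<function>] *ERROR* <message>".
DRM_ERROR = re.compile(r"\[\s*(\d+\.\d+)\]\s+\[drm:(\w+)\]\s+\*ERROR\*\s+(.*)")

# The kernel log lines from this comment, copied verbatim.
SAMPLE = [
    "Dec  2 07:13:17 griffin kernel: [  116.688049] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung",
    "Dec  2 07:13:17 griffin kernel: [  116.744544] [drm:intel_panel_get_max_backlight] *ERROR* fixme: max PWM is zero.",
    "Dec  2 07:13:18 griffin kernel: [  117.772036] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung",
    "Dec  2 07:13:18 griffin kernel: [  117.772485] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!",
    "Dec  2 07:13:18 griffin kernel: [  117.772492] [drm:i915_reset] *ERROR* Failed to reset chip.",
]

def parse_drm_errors(lines):
    """Return (uptime_seconds, function, message) for every drm *ERROR* line."""
    events = []
    for line in lines:
        match = DRM_ERROR.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2), match.group(3).strip()))
    return events

def summarize(events):
    """Count hangcheck events and note whether the GPU was declared wedged."""
    hangs = [t for t, func, _ in events if func == "i915_hangcheck_elapsed"]
    wedged = any("declaring wedged" in msg for _, _, msg in events)
    return {
        "hangs": len(hangs),
        "first_hang": hangs[0] if hangs else None,
        "wedged": wedged,
    }

if __name__ == "__main__":
    print(summarize(parse_drm_errors(SAMPLE)))
```

Run against a full kern.log, the same two functions give the uptime of the first hang per boot, which makes it easier to compare how long each crash took to appear.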

Comment 11 Chris Wilson 2010-12-02 00:35:24 UTC

Right so a GPU hang without any sign of corruption. Good. Well not good...

Care to upload a few more i915_error_state to see if I can spot the pattern without having to resort to augmenting the hangcheck?

Comment 12 Martin Sillence 2010-12-02 23:14:39 UTC

Created attachment 40756 [details]
3 crashes with the .37rc4 kernel incuding i915_error_state

This contains 3 crashes with the i915_error_state, kernel log and xorg log

This only takes a few minutes, let me know if you need more.

Comment 13 Martin Sillence 2010-12-03 12:10:38 UTC

Created attachment 40787 [details]
another 3 crashes with the .37rc4 kernel incuding i915_error_state

Comment 14 Chris Wilson 2010-12-05 04:39:49 UTC

Odd, they all die in the middle or towards the end of a sequence of flushed operations. This implies that the kernels, surfaces and vertex buffers were valid (since we know that they had been referenced and the instructions executed without error). Not sure what's going on yet. Could be the inexplicable crash bug 28204... Wait, that's you as well. ;-)

Comment 15 Martin Sillence 2010-12-06 02:51:31 UTC

I raised bug 28204 separately as it was against a different X driver/kernel hope that's OK.

Is there anything else you want me to test that could help pin it down?

I could try other ways to trigger it but multiple spinning loading icons in firefox/chrome seems to kill it quite quickly.

Comment 16 Chris Wilson 2010-12-06 03:05:37 UTC

(In reply to comment #15)
> I raised bug 28204 separately as it was against a different X driver/kernel
> hope that's OK.

Of course it's ok, I just remembered seeing a similar pattern; the more information the better.
 
> Is there anything else you want me to test that could help pin it down?
> 
> I could try other ways to trigger it but multiple spinning loading icons in
> firefox/chrome seems to kill it quite quickly.

Might be related to the ops, or just the frequency of redraws. At the moment, there is nothing unusual in the error states - but I do know that the shader programs are flawed on gen4, but not how [otherwise I could have made the quick fix already], and I haven't completed my master plan of rewriting them yet.

On current tip of xf86-video-intel.git, I've reordered the commands slightly, this may have any one of 3 possible impacts...

Comment 17 Martin Sillence 2010-12-08 11:52:02 UTC

Wow it's been up for 5 minutes with this change
So I'd say of the 3 possible outcomes, it's a good one so far. I'll test more tomorrow.

Many, many thanks for this!

M

Comment 18 Martin Sillence 2010-12-09 14:18:42 UTC

Well, it does last a lot longer; however, it is still crashing.

Xorg said:
(EE) intel(0): Detected a hung GPU, disabling acceleration.
(WW) intel(0): intel_uxa_prepare_access: bo map failed: Input/output error
(WW) intel(0): intel_uxa_prepare_access: bo map failed: Input/output error
(EE) intel(0): failed to set cursor: Input/output error

and kernel log said:
Dec  9 22:10:07 griffin kernel: [ 1581.580039] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Dec  9 22:10:07 griffin kernel: [ 1581.640523] [drm:intel_panel_get_max_backlight] *ERROR* fixme: max PWM is zero.
Dec  9 22:10:08 griffin kernel: [ 1582.744016] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Dec  9 22:10:08 griffin kernel: [ 1582.744140] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Dec  9 22:10:08 griffin kernel: [ 1582.744143] [drm:i915_reset] *ERROR* Failed to reset chip.

Unfortunately the machine locked up hard just after I tried to grab the i915 error state, and the file was not there on reboot.

Comment 19 Chris Wilson 2010-12-16 03:26:49 UTC

Dropping priority as this seems to be the same inexplicable bug as before and not a regression.

Comment 20 Martin Sillence 2010-12-16 08:28:13 UTC

I've been using this a while now and it is a _lot_ more stable.

There are other issues with the 2.6.37-rc5 kernel causing me problems; I'll try the next RC and see if I can get another error state if it stays up a bit longer.

Comment 21 Martin Sillence 2010-12-16 14:58:36 UTC

Created attachment 41191 [details]
 2.6.37-rc5 + xorg reorder fix

I guess my machine heard my enthusiasm for this latest fix

It died again: the backlight went out and I was unable to see "text mode" or X, but I did ssh in and get the error state and logs.

Comment 22 Michal Marek 2010-12-17 07:47:16 UTC

I'm running the 2.13.902 driver and the display is still corrupt. But in my case the machine does not hang; it just displays a corrupt xdm screen that stays there even if I switch to tty1.

Comment 23 Michal Marek 2010-12-17 07:48:45 UTC

Created attachment 41214 [details]
i915_error_state

Comment 24 Michal Marek 2010-12-17 07:49:51 UTC

Created attachment 41215 [details]
dmesg

Comment 25 Michal Marek 2010-12-17 07:50:45 UTC

Created attachment 41216 [details]
Xorg.0.log

Please let me know if I should report this as a different bug or if you want to continue here.

Comment 26 Chris Wilson 2010-12-17 08:35:45 UTC

(In reply to comment #25)
> Please let me know if I should report this as a different bug or if you want to
> continue here.

It's a different bug. Open a new bug and I can give you a patch to test. ;-)

Comment 27 Martin Sillence 2011-01-01 04:31:13 UTC

Created attachment 41553 [details]
Crash with reorder patch including error state

Hi, I managed to catch this one. It really is behaving much better, but I still have the occasional crash.

Comment 28 Chris Wilson 2011-01-02 05:27:20 UTC

That execbuffer is from one of my nightmares. What combination of kernel and drivers were you running at the time? I suspect it is a separate issue to the original bug.

Comment 29 Martin Sillence 2011-01-02 07:34:01 UTC

I was running 2.6.37-rc7 and a git grab taken shortly after you added the reorder patch. I guess that's in the latest experimental driver from Debian by now, so I can switch to that.
Shall we close this bug then (and the related bug 28204)?

Should I grab the latest debian driver and submit a new bug report next time it happens?

Comment 30 Chris Wilson 2011-01-02 08:32:14 UTC

If you can induce a repetition of this style of hang (where the execbuffer is filled with the contents of a render target) then yes. However, if you see a mixture of failures, then I think it is simpler to assume a single bug afflicting 965GM with a mixture of symptoms than beset by a multitude of bugs. Especially one bearing the hallmarks of a coherency issue, which are subtle but quick to anger; I would have expected such a bug to have had much wider impact if it were not chipset specific. Hence, on reflection, it is probably another symptom of the same bug. (Otherwise you have uncovered a true nightmare.)

Comment 31 Chris Wilson 2011-01-02 08:34:40 UTC

In short: keep grabbing those error states and hope for a pattern!

Comment 32 Martin Sillence 2011-02-12 14:46:00 UTC

Hi,

It looks like my issues have all gone away... Seems that despite long testing with memtest86 I had memory trouble.
I filed another bug report about ext3 and they said this corruption is typical of memory faults. So after more testing (and passing) I tried some different memory anyway. I've now been running with the hand built release for a few days with no issues :)))

I'm really sorry for the report but I have done several long (days) memory tests and nothing turned up. I feel I carried out due diligence before filing.

Many thanks for your help and for the speed bump over the legacy driver,
M

Comment 33 Chris Wilson 2011-02-12 15:14:10 UTC

Martin, thanks for letting us know the cause. Inexplicable crashes are just nasty!
