73696 – [915gm TearFree] slow, screen corruption due to EDEADLK (fence starvation)

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Paulo Zanoni
QA Contact:	Intel GFX Bugs mailing list

Reported:	2014-01-16 11:14 UTC by dimon
Modified:	2017-07-24 22:56 UTC
CC List:	1 user

Attachments
Xorg.0.log (19.19 KB, text/plain) 2014-01-16 11:49 UTC, dimon
dmesg (57.33 KB, text/plain) 2014-01-16 11:53 UTC, dimon
Xorg.0.log (2.11 MB, text/plain) 2014-01-16 15:39 UTC, dimon
Xorg.0.log --enable-debug (23.24 KB, text/plain) 2014-01-17 01:49 UTC, dimon
Stall waiting for flips to complete to release a fence (1.61 KB, patch) 2014-01-17 22:37 UTC, Chris Wilson
Xorg.0.log.old (22.81 KB, text/plain) 2014-01-18 22:46 UTC, dimon
Xorg.0.log (24.66 KB, text/plain) 2014-01-18 22:46 UTC, dimon
Stall for pending unpin, including debug messages (1.24 KB, patch) 2014-01-19 14:33 UTC, Chris Wilson
Stall for pending unpin, including debug messages, v3 (1.24 KB, patch) 2014-01-19 14:37 UTC, Chris Wilson
dmesg (117.53 KB, text/plain) 2014-01-19 16:53 UTC, dimon

Description dimon 2014-01-16 11:14:57 UTC

This happens only sometimes. When it does, I have to restart X several times until the driver works properly. Once the driver is working correctly, it stays that way for the whole X session.

On affected X sessions I get:
- screen corruption, e.g. scrolling in Firefox is choppy.
- the tests GtkDrawingArea - Lines, GtkDrawingArea - Circles, GtkDrawingArea - Text in gtkperf run slower; I see only 3-4 different frames in each test.

2:2.99.907+git20140115.40beee99-0ubuntu0sarvatt~saucy
i3wm, no compositing manager

Comment 1 Chris Wilson 2014-01-16 11:20:06 UTC

Please attach your Xorg.0.log, dmesg and possibly /sys/class/drm/card0/error from after the corruption starts appearing.

Comment 2 dimon 2014-01-16 11:49:29 UTC

Created attachment 92219
Xorg.0.log

Comment 3 dimon 2014-01-16 11:53:38 UTC

Created attachment 92220
dmesg

Comment 4 dimon 2014-01-16 11:53:57 UTC

(In reply to comment #1)
> Please attach your Xorg.0.log, dmesg and possibly /sys/class/drm/card0/error
> from after the corruption starts appearing.

The error never appears partway through an X session. After starting X, either everything is fine for the whole session or I'm affected by this bug.

This is what I get on an affected session:

cat /sys/class/drm/card0/error
no error state collected

Comment 5 Chris Wilson 2014-01-16 12:02:13 UTC

Ah, that's interesting. I had interpreted it that it became slow during the session for which I suspected a GPU hang and subsequent use of reads/writes through the GTT. Nothing looks unusual in the log.

Can you please have a look with top to see if there is abnormal CPU loading and if X looks more busy than usual, see if 'sudo perf top' will pinpoint the culprit? What will also be useful, if it doesn't mask the issue, would be building xf86-video-intel with --enable-debug=full and sending me the Xorg.0.log file for the affected session.

Comment 6 dimon 2014-01-16 15:33:31 UTC

(In reply to comment #5)
> Ah, that's interesting. I had interpreted it that it became slow during the
> session for which I suspected a GPU hang and subsequent use of reads/writes
> through the GTT. Nothing looks unusual in the log.
> 
> Can you please have a look with top to see if there is abnormal CPU loading
> and if X looks more busy than usual, see if 'sudo perf top' will pinpoint
> the culprit? What will also be useful, if it doesn't mask the issue, would
> be building xf86-video-intel with --enable-debug=full and sending me the
> Xorg.0.log file for the affected session.

CPU load does not seem to be affected.
I can't reproduce the screen corruption when using xf86-video-intel built with --enable-debug=full. I can't say anything about the slowness because the debug version generally runs much slower.

I've attached a new log file with debugging on.

Comment 7 dimon 2014-01-16 15:39:20 UTC

Created attachment 92232
Xorg.0.log

Comment 8 Chris Wilson 2014-01-16 16:08:33 UTC

Ok, that's a large haystack that looks fairly typical. Do you ever see similar corruption without TearFree enabled?

Comment 9 Chris Wilson 2014-01-16 16:12:09 UTC

The last thing to test is --enable-debug (without the =full), and see if the sanity checks catch anything. If they do, they will trigger an assert and cause X to crash (just a word of warning).

Comment 10 dimon 2014-01-17 01:47:41 UTC

Here is a logfile with --enable-debug.
X didn't crash, but the log contains some messages that only appear on affected sessions.

Comment 11 dimon 2014-01-17 01:49:00 UTC

Created attachment 92252
Xorg.0.log --enable-debug

Comment 12 Chris Wilson 2014-01-17 09:53:59 UTC

Aha! It ran out of fences! Or rather, due to the outstanding flips(?), when we asked for 6 fences (out of a possible 8) the kernel was not able to supply them.
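
To make the arithmetic concrete, here is a toy model in C. It is only a sketch and is not taken from the driver or the kernel: `struct fence_reg`, `fences_available` and `reserve_fences` are hypothetical names, and the only figures borrowed from the comment above are the 8 available registers and the request for 6.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_FENCES 8 /* "out of a possible 8", per the comment above */

/* Hypothetical model of a single fence register. */
struct fence_reg {
    bool pinned; /* held by a scanout buffer, current or not yet unpinned */
    bool in_use; /* already assigned to a batch */
};

/* Count the registers that could be handed out right now. */
static int fences_available(const struct fence_reg *regs, size_t n)
{
    int free_count = 0;

    for (size_t i = 0; i < n; i++)
        if (!regs[i].pinned && !regs[i].in_use)
            free_count++;
    return free_count;
}

/*
 * Reserve `wanted` registers for one batch, all or nothing.
 * Returns 0 on success, or -EDEADLK when too many registers are
 * pinned or busy for the request to be satisfied.
 */
static int reserve_fences(struct fence_reg *regs, size_t n, int wanted)
{
    assert(wanted >= 0);

    if (fences_available(regs, n) < wanted)
        return -EDEADLK;

    for (size_t i = 0; i < n && wanted > 0; i++) {
        if (!regs[i].pinned && !regs[i].in_use) {
            regs[i].in_use = true;
            wanted--;
        }
    }
    return 0;
}
```

In this model, three registers still pinned by scanouts leave only five free, so a request for six fails with -EDEADLK even though nothing else is using the registers.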

Comment 13 Chris Wilson 2014-01-17 22:37:39 UTC

Created attachment 92311
Stall waiting for flips to complete to release a fence

I think this should do the trick - introduce a stall waiting for the required fence. (Though the stall here will be overkill.)

Comment 14 dimon 2014-01-18 14:54:50 UTC

(In reply to comment #13)
> Created attachment 92311
> Stall waiting for flips to complete to release a fence
> 
> I think this should do the trick - introduce a stall waiting for the
> required fence. (Though the stall here will be overkill.)

Unfortunately not, still the same problem.

Comment 15 Chris Wilson 2014-01-18 21:40:20 UTC

Hmm, I guess I should add some debug code to dump the fence registers upon EDEADLK.

Done.

commit 9342bc3dfd64b338c0802793f311574323067652
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jan 18 21:38:59 2014 +0000

    sna: Dump fence registers upon starvation
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=73696
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Please can you attach a fresh Xorg.0.log with --enable-debug and I'll have a second think about where those fences are disappearing.

Comment 16 dimon 2014-01-18 22:46:15 UTC

Created attachment 92362
Xorg.0.log.old

X crashed upon starting

Comment 17 dimon 2014-01-18 22:46:56 UTC

Created attachment 92363
Xorg.0.log

Comment 18 Chris Wilson 2014-01-19 10:54:12 UTC

Great, at least that confirms that it is indeed pinning multiple fences for scanout. Just need to think how best to either reserve more fences for display (vs execution) or how to recover them.

Comment 19 Chris Wilson 2014-01-19 14:33:17 UTC

Created attachment 92396
Stall for pending unpin, including debug messages

I think the earlier patch should have worked. :| Can you please try this version and include the dmesg from the inevitable failure, just to confirm I am looking in the right place?

Comment 20 Chris Wilson 2014-01-19 14:37:27 UTC

Created attachment 92397
Stall for pending unpin, including debug messages, v3

Ah, I think I missed the importance of the earlier return...

Comment 21 dimon 2014-01-19 16:53:39 UTC

Created attachment 92407
dmesg

Working now, thx

Comment 22 Chris Wilson 2014-01-20 12:45:08 UTC

https://patchwork.kernel.org/patch/3511381/

Comment 23 Chris Wilson 2014-01-21 21:36:34 UTC

commit 4983005fd5eaa7594a830f35f91d7d4d983548ca
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 20 10:17:36 2014 +0000

    drm/i915: Wait for completion of pending flips when starved of fences
    
    On older generations (gen2, gen3) the GPU requires fences for many
    operations, such as blits. The display hardware also requires fences for
    scanouts and this leads to a situation where an arbitrary number of
    fences may be pinned by old scanouts following a pageflip but before we
    have executed the unpin workqueue. This is unpredictable by userspace
    and leads to random EDEADLK when submitting an otherwise benign
    execbuffer. However, we can detect when we have an outstanding flip and
    so cause userspace to wait upon their completion before finally
    declaring that the system is starved of fences. This is really no worse
    than forcing the GPU to stall waiting for older execbuffer to retire and
    release their fences before we can reallocate them for the next
    execbuffer.
    
    v2: move the test for a pending fb unpin to a common routine for
    later reuse during eviction
    
    Reported-and-tested-by: dimon@gmx.net
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73696
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
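
The idea in the commit message can be sketched as a small self-contained C model. This is an illustration under stated assumptions, not the kernel patch itself: `struct fence_pool`, `pool_free`, `wait_for_pending_unpin` and `get_fences` are hypothetical names, the "wait" is simulated by clearing a counter, and the error code returned by the real code path may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for the fence registers of one device. */
struct fence_pool {
    int total;        /* fence registers on the device */
    int in_use;       /* held by in-flight batches */
    int pinned_live;  /* pinned by the current scanout */
    int pinned_stale; /* pinned by old scanouts whose unpin has not run yet */
};

static int pool_free(const struct fence_pool *p)
{
    int free_count = p->total - p->in_use - p->pinned_live - p->pinned_stale;

    assert(free_count >= 0);
    return free_count;
}

/*
 * Stand-in for "wait for outstanding flips to complete and their
 * deferred unpin work to run". Returns false if nothing was pending,
 * in which case waiting cannot free any further registers.
 */
static bool wait_for_pending_unpin(struct fence_pool *p)
{
    if (p->pinned_stale == 0)
        return false;
    p->pinned_stale = 0; /* simulated: the unpin work has now run */
    return true;
}

/* Take `wanted` fences, stalling on pending unpins before giving up. */
static int get_fences(struct fence_pool *p, int wanted)
{
    for (;;) {
        if (pool_free(p) >= wanted) {
            p->in_use += wanted;
            return 0;
        }
        /* Only report starvation once no flip is left to wait for. */
        if (!wait_for_pending_unpin(p))
            return -EDEADLK;
    }
}
```

With 8 registers, 1 pinned by the live scanout and 3 by stale ones, a request for 6 succeeds in this model after one simulated wait; a further request for 2 then fails with -EDEADLK because only one register is left and no unpin is pending.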