Summary: | [915gm TearFree] slow, screen corruption due to EDEADLK (fence starvation) |
---|---|---|---|
Product: | DRI | Reporter: | dimon |
Component: | DRM/Intel | Assignee: | Paulo Zanoni <przanoni> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal |
Priority: | medium | CC: | intel-gfx-bugs |
Version: | unspecified |
Hardware: | x86 (IA32) |
OS: | Linux (All) |
Whiteboard: | |
i915 platform: | i915 features: |
Attachments: |
Description
dimon
2014-01-16 11:14:57 UTC
Please attach your Xorg.0.log, dmesg and possibly /sys/class/drm/card0/error from after the corruption starts appearing.

Created attachment 92219 [details]
Xorg.0.log
Created attachment 92220 [details]
dmesg
(In reply to comment #1)
> Please attach your Xorg.0.log, dmesg and possibly /sys/class/drm/card0/error
> from after the corruption starts appearing.

The error never appears in the middle of an X session. After starting X, either everything is fine for the whole session or I'm affected by this bug. I'm getting these logs on an affected session:

cat /sys/class/drm/card0/error
no error state collected

Ah, that's interesting. I had interpreted it that it became slow during the session for which I suspected a GPU hang and subsequent use of reads/writes through the GTT. Nothing looks unusual in the log.

Can you please have a look with top to see if there is abnormal CPU loading and if X looks more busy than usual, see if 'sudo perf top' will pinpoint the culprit? What will also be useful, if it doesn't mask the issue, would be building xf86-video-intel with --enable-debug=full and sending me the Xorg.0.log file for the affected session.

(In reply to comment #5)
> Ah, that's interesting. I had interpreted it that it became slow during the
> session for which I suspected a GPU hang and subsequent use of reads/writes
> through the GTT. Nothing looks unusual in the log.
>
> Can you please have a look with top to see if there is abnormal CPU loading
> and if X looks more busy than usual, see if 'sudo perf top' will pinpoint
> the culprit? What will also be useful, if it doesn't mask the issue, would
> be building xf86-video-intel with --enable-debug=full and sending me the
> Xorg.0.log file for the affected session.

CPU load seems not to be affected. I can't reproduce the screen corruption when using xf86-video-intel with --enable-debug=full. I can't say anything about the slowness because the debug version generally runs much slower. I've attached a new log file with debugging on.

Created attachment 92232 [details]
Xorg.0.log
Ok, that's a large haystack that looks fairly typical. Do you ever see similar corruption without TearFree enabled?

The last thing to test is --enable-debug (without the =full), and see if the sanity checks catch anything. If they do, they will trigger an assert and cause X to crash (just a word of warning).

Here is a logfile with --enable-debug. X didn't crash, but the log contains some messages which only appear on affected sessions.

Created attachment 92252 [details]
Xorg.0.log --enable-debug
Aha! It ran out of fences! Or rather, due to the outstanding flips(?), when we asked for 6 fences (out of a possible 8) the kernel was not able to supply them.

Created attachment 92311 [details] [review]
Stall waiting for flips to complete to release a fence

I think this should do the trick - introduce a stall waiting for the required fence. (Though the stall here will be overkill.)

(In reply to comment #13)
> Created attachment 92311 [details] [review]
> Stall waiting for flips to complete to release a fence
>
> I think this should do the trick - introduce a stall waiting for the
> required fence. (Though the stall here will be overkill.)

Unfortunately not, still the same problem.

Hmm, I guess I should add some debug code to dump the fence registers upon EDEADLK.

Done.

commit 9342bc3dfd64b338c0802793f311574323067652
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sat Jan 18 21:38:59 2014 +0000

    sna: Dump fence registers upon starvation

    References: https://bugs.freedesktop.org/show_bug.cgi?id=73696
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Please can you attach a fresh Xorg.0.log with --enable-debug and I'll have a second think about where those fences are disappearing.

Created attachment 92362 [details]
Xorg.0.log.old
X crashed upon starting
Created attachment 92363 [details]
Xorg.0.log
Great, at least that confirms that it is indeed pinning multiple fences for scanout. Just need to think how best to either reserve more fences for display (vs execution) or how to recover them.

Created attachment 92396 [details] [review]
Stall for pending unpin, including debug messages

I think the earlier patch should have worked. :| Can you please try this version and include the dmesg from the inevitable failure, just to confirm I am looking in the right place?

Created attachment 92397 [details] [review]
Stall for pending unpin, including debug messages, v3

Ah, I think I missed the importance of the earlier return...

Created attachment 92407 [details]
dmesg
Working now, thx
commit 4983005fd5eaa7594a830f35f91d7d4d983548ca
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jan 20 10:17:36 2014 +0000

    drm/i915: Wait for completion of pending flips when starved of fences

    On older generations (gen2, gen3) the GPU requires fences for many operations, such as blits. The display hardware also requires fences for scanouts and this leads to a situation where an arbitrary number of fences may be pinned by old scanouts following a pageflip but before we have executed the unpin workqueue. This is unpredictable by userspace and leads to random EDEADLK when submitting an otherwise benign execbuffer. However, we can detect when we have an outstanding flip and so cause userspace to wait upon their completion before finally declaring that the system is starved of fences. This is really no worse than forcing the GPU to stall waiting for older execbuffer to retire and release their fences before we can reallocate them for the next execbuffer.

    v2: move the test for a pending fb unpin to a common routine for later reuse during eviction

    Reported-and-tested-by: dimon@gmx.net
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73696
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
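
For readers who want to see the idea in the commit message in miniature, here is a small user-space model in C. It is only a sketch and not the kernel patch itself: `fence_pool`, `fence_find`, `fence_run_unpin_work` and `fence_alloc_wait` are hypothetical names invented for illustration, and using `-EAGAIN` as the "wait and retry" signal is an assumption of this model. The only details taken from the thread are the pool of 8 fences and the `EDEADLK` error on starvation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_FENCES 8 /* the thread mentions 8 possible fences */

/* Hypothetical model of one fence register. */
struct fence_reg {
    bool pinned;        /* currently held and cannot be reassigned */
    bool stale_scanout; /* pinned by an old scanout whose unpin work is still pending */
};

struct fence_pool {
    struct fence_reg reg[NUM_FENCES];
};

/* True if any fence is still held by a scanout that a flip has already replaced. */
bool fence_has_pending_unpin(const struct fence_pool *pool)
{
    for (int i = 0; i < NUM_FENCES; i++)
        if (pool->reg[i].stale_scanout)
            return true;
    return false;
}

/*
 * Return the index of a free fence, or a negative errno:
 *   -EAGAIN  (assumed retry signal) if waiting for pending unpins could free one,
 *   -EDEADLK if the pool is genuinely starved.
 */
int fence_find(const struct fence_pool *pool)
{
    assert(pool != NULL);
    for (int i = 0; i < NUM_FENCES; i++)
        if (!pool->reg[i].pinned)
            return i;
    return fence_has_pending_unpin(pool) ? -EAGAIN : -EDEADLK;
}

/* Stand-in for the deferred unpin work: release fences held by old scanouts. */
void fence_run_unpin_work(struct fence_pool *pool)
{
    for (int i = 0; i < NUM_FENCES; i++) {
        if (pool->reg[i].stale_scanout) {
            pool->reg[i].stale_scanout = false;
            pool->reg[i].pinned = false;
        }
    }
}

/* Allocate a fence, waiting for pending unpins before declaring starvation. */
int fence_alloc_wait(struct fence_pool *pool)
{
    int ret;

    while ((ret = fence_find(pool)) == -EAGAIN)
        fence_run_unpin_work(pool);
    if (ret >= 0)
        pool->reg[ret].pinned = true;
    return ret;
}
```

In this model, a pool where every fence is pinned but one is held by a stale scanout makes `fence_find` report `-EAGAIN` rather than `-EDEADLK`, and `fence_alloc_wait` then recovers that fence. `-EDEADLK` is returned only once no pending unpin remains, which mirrors the behaviour the commit message describes.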