Bug 55112 - [uxa gm45 3.6] Severe 2D corruption after running 3D application
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Xorg Project Team
 
Reported: 2012-09-19 17:03 UTC by Sandro Mani
Modified: 2012-10-18 13:06 UTC

Attachments
Xorg.0.log (46.45 KB, text/plain)
2012-09-19 17:03 UTC, Sandro Mani
Screenshot 1 (125.08 KB, image/png)
2012-09-19 17:04 UTC, Sandro Mani
Screenshot 2 (46.69 KB, image/png)
2012-09-19 17:04 UTC, Sandro Mani
Screenshot 3 (117.54 KB, image/png)
2012-09-19 17:05 UTC, Sandro Mani
disable cpu relocs (475 bytes, patch)
2012-10-01 07:44 UTC, Daniel Vetter
Kernel backtrace (unrelated) (1.98 KB, text/plain)
2012-10-02 23:18 UTC, Sandro Mani
Invalidate all state caches before a patch (1.04 KB, patch)
2012-10-08 22:37 UTC, Chris Wilson
Invalidate all state caches before a batch (1.06 KB, patch)
2012-10-08 22:41 UTC, Chris Wilson
Patch for 3.6.1 (816 bytes, text/plain)
2012-10-09 21:42 UTC, Sandro Mani

Description Sandro Mani 2012-09-19 17:03:54 UTC
Created attachment 67409
Xorg.0.log

Versions:
xorg-x11-drv-intel-2.20.7-1.fc19.x86_64
xorg-x11-server-Xorg-1.12.99.904-2.20120808.fc19.x86_64

Hardware:
Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
(Lenovo T400)

OS:
Fedora rawhide, KDE

Problem:
After running OpenGL applications, I often notice severe 2D corruption (see attachments). The corruptions are not "static", but change every time a redraw is triggered by some action.
Despite many attempts, I have not managed to find a way to reliably and immediately reproduce the issue.

Observations:
- Definitely seems to be a consequence of running 3D applications: when running only 2D applications, I saw no graphics glitches for two weeks in a row (without rebooting)
- Cannot reproduce when switching to the discrete graphics (ATI)
- Disabling compositing does not help (actually, makes it even worse, since compositing triggers more redraws)
- First noticed with the xorg-x11-drv-intel-2.20.x series
- Affects Gtk applications in particular. Qt ones are way less affected.
- Usually, some time after the 3D application has quit, the system is able to recover (no more glitches)

Attachments:
- Xorg.0.log
- screenshots
- sample OpenGL application which triggers the issue
Comment 1 Sandro Mani 2012-09-19 17:04:22 UTC
Created attachment 67410
Screenshot 1
Comment 2 Sandro Mani 2012-09-19 17:04:40 UTC
Created attachment 67411
Screenshot 2
Comment 3 Sandro Mani 2012-09-19 17:05:41 UTC
Created attachment 67412
Screenshot 3

Note: the diagonal pattern is a glitch, not the actual image!
Comment 4 Sandro Mani 2012-09-19 17:10:37 UTC
Sample application: http://n.ethz.ch/~smani/download/SampleApp.tar.gz
Comment 5 Chris Wilson 2012-09-19 17:17:17 UTC
Can you please confirm that the Xorg.log is from a session after you start seeing the corruption?
Comment 6 Sandro Mani 2012-09-19 17:18:23 UTC
Yes, I confirm that.
Comment 7 Chris Wilson 2012-09-28 21:48:09 UTC
Dave Airlie has been chasing a similar-ish bug involving glyph corruption on gm45/ilk, with a potential bisect in 3.5-rc1. His bisections suggest that pwrite is involved, and I've been trying to reproduce this by stressing those paths, in particular the unmappable region of the GTT.

If you get the chance, can you please grab /sys/kernel/debug/dri/0/i915_gem_objects at the time you see corruption? If you can also run 'trace-cmd record -e i915' at that time and attach the output of 'trace-cmd report' that would be very informative.
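A minimal sketch of how the two requests above could be scripted so that both are captured while the corruption is still visible. The debugfs path and the `trace-cmd record -e i915` / `trace-cmd report` invocations are the ones quoted above; the helper names, the output file names, and the 30-second capture window (implemented by having `trace-cmd record` run `sleep`) are assumptions made for illustration. Reading debugfs and tracing both require root.

```python
import subprocess
import time
from pathlib import Path

# Path quoted in the comment above; needs root and a mounted debugfs.
GEM_OBJECTS = Path("/sys/kernel/debug/dri/0/i915_gem_objects")


def snapshot_gem_objects(dest_dir, source=GEM_OBJECTS):
    """Copy the GEM object list to a timestamped file and return its path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"i915_gem_objects-{stamp}.txt"
    # Read the contents explicitly: debugfs files report a size of 0.
    dest.write_text(Path(source).read_text())
    return dest


def trace_commands(seconds=30):
    """Build the trace-cmd invocations; the capture length is an assumption."""
    # trace-cmd record traces for as long as the given command runs.
    record = ["trace-cmd", "record", "-e", "i915", "sleep", str(seconds)]
    # trace-cmd report reads trace.dat from the current directory by default.
    report = ["trace-cmd", "report"]
    return record, report


def collect(dest_dir, seconds=30):
    """Snapshot the GEM objects, then record and dump an i915 trace."""
    dest_dir = Path(dest_dir)
    snapshot = snapshot_gem_objects(dest_dir)
    record, report = trace_commands(seconds)
    subprocess.run(record, cwd=dest_dir, check=True)
    with open(dest_dir / "report.txt", "w") as out:
        subprocess.run(report, cwd=dest_dir, stdout=out, check=True)
    return snapshot
```

Calling `collect("/tmp/bug55112")` as root, while glitches are on screen, would leave the object snapshot, `trace.dat`, and `report.txt` in that directory, ready to attach.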
Comment 8 Sandro Mani 2012-09-29 14:43:28 UTC
3.5-rc1 may well be when issues started here too!

I've managed to reproduce the issue as follows:
- start alienarena
- put graphics settings to highest
- start a game, but press Esc to return to the menu and unlock the mouse from the window (the game still runs "behind" the menu though)
- click around in normal 2D apps (e.g. just browse the web in Firefox)

(It took me about 10 minutes to reproduce)

Collected information:
- report_3d (trace-cmd report with alienarena running, with glitches observable)
- report_2d (trace-cmd report after running alienarena, with glitches observable)
- i915_gem_objects_3d
- i915_gem_objects_2d

Find here: http://n.ethz.ch/~smani/download/files.tar.xz
Comment 9 Daniel Vetter 2012-10-01 07:44:38 UTC
Created attachment 67912
disable cpu relocs

Can you please test this quick debug hack?
Comment 10 Sandro Mani 2012-10-01 22:04:42 UTC
So far surviving my stress tests... (i.e. no glitches)
Comment 11 Chris Wilson 2012-10-01 22:12:46 UTC
Can you also please test drm-intel-next-queued from http://cgit.freedesktop.org/~danvet/drm-intel as the use of cpu relocations and flushing is further modified in -next?
Comment 12 Sandro Mani 2012-10-02 23:18:03 UTC
Created attachment 68011
Kernel backtrace (unrelated)

Looking good so far, though will do some OpenGL development tomorrow to further test.

Unrelated: I got the attached kernel backtrace (Fedora's automatic bug reporting tool notified me just after login).
Comment 13 Daniel Vetter 2012-10-03 08:56:01 UTC
(In reply to comment #12)
> Looking good so far, though will do some OpenGL development tomorrow to
> further test.

Just to clarify: Does it look good so far with my patch applied, or when running drm-intel-next?
Comment 14 Sandro Mani 2012-10-03 09:05:53 UTC
I tested one day with your patch applied, and am now testing with drm-next. In both cases, I didn't encounter the issue (yet?).
Comment 15 Sandro Mani 2012-10-06 13:56:15 UTC
Just a status update: so far I haven't encountered the issue again (kernel 3.6 + drm-next). I'll keep testing for another week, if things keep working, then I guess this issue can be marked as solved.
Comment 16 Chris Wilson 2012-10-06 15:28:50 UTC
(In reply to comment #15)
> Just a status update: so far I haven't encountered the issue again (kernel
> 3.6 + drm-next). I'll keep testing for another week, if things keep working,
> then I guess this issue can be marked as solved.

The tricky part is then working out the minimal fix for 3.5/3.6. I think something like http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-airlied&id=9c1188292c9da53bbf29799ed5f682029e8c9583 should work around the issue and be backportable. I just have no idea what the actual underlying bug along that path is.
Comment 17 Chris Wilson 2012-10-08 22:37:59 UTC
Created attachment 68296
Invalidate all state caches before a patch

To test a theory I have that we miss a GPU flush after a CPU reloc, please could you test this patch on top of 3.5/3.6?
Comment 18 Chris Wilson 2012-10-08 22:41:46 UTC
Created attachment 68297
Invalidate all state caches before a batch

2nd try
Comment 19 Sandro Mani 2012-10-09 21:42:51 UTC
Created attachment 68364
Patch for 3.6.1

The patch does not apply to 3.6.1; I'd change it as attached. Is this correct?
Comment 20 Chris Wilson 2012-10-18 13:06:32 UTC
In the end,

commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 23 13:12:52 2012 +0100

    drm/i915: Use cpu relocations if the object is in the GTT but not mappable
    
    This prevents the case of unbinding the object in order to process the
    relocations through the GTT and then rebinding it only to then proceed
    to use cpu relocations as the object is now in the CPU write domain. By
    choosing to use cpu relocations up front, we can therefore avoid the
    rebind penalty.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

was chosen as the patch to be sent forthwith to stable@.
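
The decision described in that commit message can be illustrated with a toy model. This is only a sketch of the stated logic, not the kernel code: the `GemObject` class, its field names, and the shape of `use_cpu_reloc` are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GemObject:
    """Hypothetical stand-in for a GEM buffer object (not the kernel struct)."""
    in_cpu_write_domain: bool
    bound_in_gtt: bool
    mappable: bool


def use_cpu_reloc(obj: GemObject) -> bool:
    """Decide up front whether relocations should be written through the CPU."""
    # Already in the CPU write domain: CPU relocations are used anyway.
    if obj.in_cpu_write_domain:
        return True
    # Case named in the commit subject: bound in the GTT but not mappable.
    # Relocating through the GTT would mean unbinding and rebinding the
    # object, after which CPU relocations would be used regardless.
    if obj.bound_in_gtt and not obj.mappable:
        return True
    return False
```

In this model an object that is bound but sits outside the mappable region takes the CPU path immediately, which is the rebind penalty the commit message says it avoids.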

