24979 – After a suspend cycle overlay is garbaged

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Daniel Vetter

Reported:	2009-11-07 11:33 UTC by maximlevitsky
Modified:	2017-07-24 23:09 UTC
CC List:	1 user

Attachments
picture of the corruption (46.34 KB, image/jpeg) 2009-11-29 17:46 UTC, maximlevitsky
patch against xf86-video-intel (1.03 KB, patch) 2009-12-02 01:20 UTC, Daniel Vetter
xrandr output (3.10 KB, text/plain) 2009-12-26 14:23 UTC, maximlevitsky
xorg log (71.31 KB, text/x-log) 2009-12-26 14:24 UTC, maximlevitsky
kernel config (61.78 KB, application/octet-stream) 2009-12-26 14:24 UTC, maximlevitsky

Description maximlevitsky 2009-11-07 11:33:29 UTC

Using drm-intel-next and DRM_MODE_OVERLAY_LANDED in the intel driver, with master versions of everything.
This is a G965 system.

After an s2ram cycle, the overlay YUV offsets are wrong.
I see a black-and-white picture with a series of blue/red rectangles and occasional flashes of green, like on a broken TV.

It doesn't matter whether the overlay was used before suspend.

Note that if the overlay was used, it has to be running during the suspend, otherwise the system will hang on resume. That is a separate bug.

Comment 1 Daniel Vetter 2009-11-10 06:47:13 UTC

On Sat, Nov 07, 2009 at 11:33:30AM -0800, bugzilla-daemon@freedesktop.org wrote:
> Using drm-intel-next and DRM_MODE_OVERLAY_LANDED in intel driver with all
> master versions of everything.
> This is G965 system.
> 
> After doing a s2ram cycle, the overlay YUV offsets will be wrong.
> I will see a b/w picture with series of blue/red rectanges, and occasional
> flashes of green, like on broken TV.
> 
> It doesn't matter if overlay was or wasn't used before suspend.
> 
> Note that if overlay was used, it has to be running while doing the suspend,
> otherwise system will hang on resume, this is separate bug.

Does this disappear when you resize the window, as it does when using the
overlay for the first time? I suspect this is the same problem as the
overlay-is-green one in disguise.

-Daniel

Comment 2 maximlevitsky 2009-11-11 01:32:24 UTC

No, resizing the window doesn't help at all.
I also notice that bursts of green (several lines momentarily turn green) occur when rendering happens in other areas of the screen (like text output in a console).

Comment 3 maximlevitsky 2009-11-13 02:18:20 UTC

I did some research on this one.

First of all, the registers are exactly the same before and after the suspend cycle.
I wrote a program that dumps all registers from the MMIO range and from the GART-mapped overlay page.
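
As an illustration only (this is not the program mentioned above), such a before/after comparison could be sketched as follows, assuming each dump has already been parsed into a mapping from register offset to value. The offsets and values are made-up placeholders, not real i915 registers.

```python
def diff_register_dumps(before, after):
    """Return {offset: (old, new)} for every register whose value differs.

    A register present in only one dump is reported with None on the
    missing side.
    """
    changed = {}
    for offset in sorted(set(before) | set(after)):
        old = before.get(offset)
        new = after.get(offset)
        if old != new:
            changed[offset] = (old, new)
    return changed


# Hypothetical dumps taken before suspend and after resume.
dump_before = {0x00: 0x00000001, 0x04: 0x12345678, 0x08: 0x0000FFFF}
dump_after = {0x00: 0x00000001, 0x04: 0x12345678, 0x08: 0x0000FFFF}

differences = diff_register_dumps(dump_before, dump_after)
if not differences:
    print("registers identical before and after suspend")
for offset, (old, new) in differences.items():
    print(f"offset {offset:#06x}: {old} -> {new}")
```

An empty result only shows that the dumped state matches; state that is not visible through these registers can still differ.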

Second, I now understand the garbled output much better.
The U/V planes aren't shifted like I thought. What happens is that, of the three planes (Y, U, V), some are missing in rectangular areas that are scattered over the overlay window. Places where both U and V are missing are gray, places that miss one of U and V are red/blue, etc.

I also noticed that if I pause the video, the pattern still doesn't halt, but changes dynamically.

Also, if I move the window fast enough, I can see the correct picture for a split second.

It looks like the overlay hardware is starved of memory access, doesn't it?

Looking through the source, I now understand that all register access happens through a GART-mapped page, except for its address, which is sent through MI_OVERLAY_FLIP,
and the gamma correction registers, which are written directly.

Comment 4 Daniel Vetter 2009-11-13 02:53:40 UTC

On Fri, Nov 13, 2009 at 02:18:23AM -0800, bugzilla-daemon@freedesktop.org wrote:
> I did some research on this one.
> 
> First of all, registers are exactly same before and after suspend cycle.
> I had written a program that dumps all registers from mmio range and from gart
> mapped overlay page.
> 
> Secondary, I understand now the garbaged output much better. 
> This U/V layers aren't shifted like I thought. What happens is that of three
> layers (YUV) some are missing in rectagular areas that are scattered over the
> overlay window. Places where both U and V are missing are gray, places that
> miss one of U,V are red/blue, etc...
> 
> also I noticed that if I pause the video, still the pattern dosn't halt, but
> changes dynamically.

Are you always seeing uniform colours (in one rectangular area) or is it
sometimes somewhat noisy?

> Also if I move the window fast enough, I could see the corect picture for a
> split second.
> 
> It looks like overlay hardware is starved on memory access, isn't it?

Maybe, but if this happens, you should see cacheline-aligned pieces of
_lines_ with the wrong color, and not rectangular blocks somewhere in the
overlaid image. I suspect there's a memory barrier missing for the
gart-mapped overlay regs. 965 works differently there than all previous
chips. Unfortunately I haven't had time to cook up a debug patch to check
this theory, but I'll do so rsn.

> Looking thorough the source, I now understand that all register access happens
> through gart-mapped page, except its address that is send through
> MI_OVERLAY_FLIP.
> and gamma correction registers that are written directly.

Yep, that's correct.

Comment 5 maximlevitsky 2009-11-18 15:20:11 UTC

> Are you always seeing uniform colours (in one rectangular area) or is it
> sometimes somewhat noisy?

I usually see one component of the three.
Each area is either gray, red or blue, and has the correct brightness levels of the original picture. So yes, the colors aren't uniform.


>> Also if I move the window fast enough, I could see the correct picture for a
>> split second.
>>
>> It looks like overlay hardware is starved on memory access, isn't it?
>
> Maybe, but if this happens, you should see cacheline-aligned pieces of
> _lines_ with the wrong color, and not rectangular blocks somewhere in the
> overlaid image. I suspect there's a memory-barrier missing for the
> gart-mapped overlay regs. 965 works different there than all previous
> chips. Unfortunately I haven't had time to cook up a debug patch to check
> this theory, but I'll do so rsn.

I find this hard to follow, and it is probably just as hard for me to explain the output I see.
It looks a bit like a checkerboard pattern, very irregular and changing over time.


>> Looking through the source, I now understand that all register access happens
>> through gart-mapped page, except its address that is sent through
>> MI_OVERLAY_FLIP.
>> and gamma correction registers that are written directly.
>
> Yep, that's correct.

Comment 6 maximlevitsky 2009-11-29 17:46:55 UTC

Created attachment 31567
picture of the corruption

I don't have a real camera at hand right now, so this picture was taken with a webcam.

Comment 7 maximlevitsky 2009-11-29 17:51:04 UTC

I must also note the following:

Every area has the correct brightness, but one (or several) of the components is missing.

When graphical output occurs, the pattern changes.
If I start a 3D application, the pattern begins to change rapidly, and some areas briefly show the correct color. Also, by moving the window, it's possible to see correct colors for a split second.

Comment 8 Daniel Vetter 2009-11-30 01:50:42 UTC

The picture is very interesting and definitely looks like cache-line sized
blocks (again an example of a picture's worth more than a thousand words
...). Can you please count how many pixels _wide_ these blocks are?

[Just watch a video at 1:1 resolution, count how many blocks you have and
then divide the horizontal size of the video by this. It should yield a
nice power-of-two]

Also, please attach /proc/cpuinfo (so that I know what's the size of your
cpu-cachelines).

Comment 9 Daniel Vetter 2009-11-30 02:06:08 UTC

Another quick question, just to check: When you stop the video (and have a
3d app running alongside), do the colors keep changing forever or do they
settle to something specific after a while (half a minute should do)? If
they settle to something specific, please take a picture of that, too
(save when everything is fine, of course).

-Daniel

Comment 10 maximlevitsky 2009-11-30 11:34:47 UTC

I did the tests.

First of all, it doesn't matter whether the video is playing or paused.

Also, any graphical output affects the pattern.
Rapid output, as I said, makes it look like noise, but horizontally it is still perfectly aligned.
In fact, the visibility of the windows doesn't matter either: if I minimize the 3D game, the effect is the same.

The alignment is just as you suspected:

The first rectangle is 64 pixels wide; all following ones are 32 pixels wide.
I measured this with GIMP, by placing its window side by side with the video.
I measured the length of 16 blocks and got 512 pixels, with a maximum error of ±4 pixels.
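
For reference, the arithmetic behind this measurement can be sketched as below. The function names are made up for this illustration; the numbers are the ones measured above.

```python
def estimate_block_width(span_px, block_count, error_px):
    """Return (low, nominal, high) block width in pixels for a measured span."""
    nominal = span_px / block_count
    low = (span_px - error_px) / block_count
    high = (span_px + error_px) / block_count
    return low, nominal, high


def is_power_of_two(n):
    """True if n is a positive integer power of two."""
    return n > 0 and (n & (n - 1)) == 0


# 16 blocks measured as 512 pixels, with at most 4 pixels of error.
low, nominal, high = estimate_block_width(512, 16, 4)
width = round(nominal)
print(f"block width between {low} and {high} px, nominal {nominal} px")
print(f"{width} is a power of two: {is_power_of_two(width)}")
```

Even with the full measurement error, the width stays within a quarter pixel of 32, so the power-of-two reading does not depend on the exact count.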

I have a Core 2 Duo processor:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
stepping : 6
cpu MHz : 1596.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 4262.88
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

Also, zooming in (I use Totem) does affect the block size.
When zooming in to near the maximum zoom, wide green rows appear in between the lines.
The same happens if I put the overlay partially offscreen.

Comment 11 Daniel Vetter 2009-12-01 01:29:14 UTC

I've thought about this and it doesn't look like the problem is cache
flushing related. Reasons:

- When you stop the video, the image doesn't stabilize. If there is some
  unflushed (on the cpu) or stale (on the gpu) data in caches, these bad
  blocks would slowly disappear (faster under load).

- The blocks are 32 pixels wide, i.e. 16 bytes in the U or V plane (UV are
  subsampled). Your cpu's cacheline size is 64. So it doesn't look like
  it's the cpu cache messing with the image.

- It might still be the gpu/gtt/agp cache. IIRC intel uses 32/16 byte
  cachelines there (I'd have to look that up). But the UV planes are in
  new bo's which have not yet been used by the gpu. So it's unlikely that
  the gpu caches contain so much stale data.
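
The byte arithmetic in the second point can be written out as a small sketch. It assumes a planar YUV format whose U and V planes are subsampled by 2 horizontally with one byte per sample; the exact overlay format is not stated in this report, so treat those parameters as assumptions.

```python
def chroma_bytes_for_luma_pixels(luma_px, horizontal_subsampling=2, bytes_per_sample=1):
    """Bytes one chroma (U or V) row occupies for a run of luma pixels.

    Assumes a planar format: each chroma sample covers
    `horizontal_subsampling` luma pixels in a row.
    """
    return (luma_px // horizontal_subsampling) * bytes_per_sample


CPU_CACHELINE_BYTES = 64  # "clflush size" from the /proc/cpuinfo output above

block_bytes = chroma_bytes_for_luma_pixels(32)
print(f"a 32-pixel block covers {block_bytes} bytes per chroma row")
print(f"matches the CPU cacheline size: {block_bytes == CPU_CACHELINE_BYTES}")
```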

So I think someone is writing crap over the video image. This is supported
by the fact that when you move around the window like crazy, you can see
a correct frame. Moving around windows like crazy usually creates quite
some load, i.e. this may slow down whatever is writing into the video
image. At least slow it down enough so that you're able to see a correct
frame for a split second.

I have a new idea for a debug patch, hope to code and test it today.

-Daniel

PS: If you think any of the observations and conclusions in this small
summary are wrong, please point it out.

Comment 12 Daniel Vetter 2009-12-02 01:20:24 UTC

Created attachment 31650
patch against xf86-video-intel

Can you quickly check whether this patch changes anything?

Comment 13 maximlevitsky 2009-12-02 14:25:06 UTC

No change at all

Comment 14 Daniel Vetter 2009-12-07 13:00:16 UTC

As you might have guessed from my silence, I'm running out of ideas ...

Just to gather some more information, can you please post your output
config (just send the output from xrandr) and your Xorg.log? It doesn't
really matter which driver version.

Meanwhile I'll try to cook up new ideas to test.

-Daniel

Comment 15 Daniel Vetter 2009-12-08 00:45:20 UTC

Can you also send your .config from the kernel?

Comment 16 maximlevitsky 2009-12-26 14:23:48 UTC

Created attachment 32304
xrandr output

Comment 17 maximlevitsky 2009-12-26 14:24:23 UTC

Created attachment 32306
xorg log

Comment 18 maximlevitsky 2009-12-26 14:24:55 UTC

Created attachment 32307
kernel config

Comment 19 maximlevitsky 2009-12-26 14:28:05 UTC

Hi,

Sorry for delays.
I attached all the information you asked for, although I don't think there is anything very useful in it.

Note that I recently found out that a suspend-to-disk cycle brings the GPU into the same state as on boot: a green window on the first run, then normal video.
A suspend-to-RAM cycle shows the same problem again.

Comment 20 Daniel Vetter 2010-01-13 06:31:28 UTC

> --- Comment #19 from maximlevitsky@gmail.com  2009-12-26 14:28:05 PST ---
> Hi,
> 
> Sorry for delays.
> I attached all information you asked for, although I don't think there is
> anything much useful.

Thanks. I've looked through it but found nothing suspicious (or that could
be related to other bug reports).

> Note that I recently found out that suspend to disk cycle, brings the GPU to
> same state as on boot, that is green window of first run, then normal video.
> suspend to ram cycle shows same problem again.

Maybe initializing the gpu by the bios changes something. Dunno.

atm I'm hunting down cache flushing bugs, which might be related to your
problem. I'll postpone your report here until I've tracked down all the
issues I'm seeing (still not done, but hopefully getting there).

-Daniel

Comment 21 maximlevitsky 2010-01-15 16:05:55 UTC

I have some good and bad news.

I updated both the kernel and the GFX stack to the latest versions.

The bad news is that 3D is now completely hosed: all 3D applications either don't start (they complain about a failed DRI2 request) or display a black window.

On the other hand, both the green overlay and the garbage after resume are gone. The overlay just works (and I verified with xvinfo that it is enabled and is the preferred one).

I then booted the old kernel, and the overlay issues came back. Thus I suspect that this was accidentally fixed in the kernel.

I will compile the old GFX stack again with the new kernel to see whether that is true.

Best regards,
      Maxim Levitsky

Comment 22 Daniel Vetter 2010-01-16 00:32:47 UTC

> --- Comment #21 from maximlevitsky@gmail.com  2010-01-15 16:05:55 PST ---
> I have some good and bad news.
> 
> I updated both kernel and GFX stack to latest versions.
> 
> The bad news are that now 3D is completely hosed, all 3d applications ether
> don't start (show complain about failed DRI2 request) or display black window.
> 
> On the other hand both green overlay and garbage after resume is gone. Overlay
> just works (and I did see that it is enabled by doing xvinfo, and it is
> preferred one)
> 
> I then booted old kernel, and overlay issues come back. Thus I suspect that
> this was accidentally fixed in kernel.

That's really interesting.

> Will compile again old GFX stack + new kernel to see if that is true.

If this is true, can you please post the exact git revisions of the first
good and the last bad kernel. Perhaps I get a clue as to what's the
problem.

Thanks, Daniel

btw: I'll be mostly offline for 2 weeks now, so expect some latencies.

Comment 23 maximlevitsky 2010-01-16 04:22:55 UTC

Yep, I downgraded Mesa and the X server, and 3D is fine.

The overlay is displayed correctly after suspend to RAM.
On the other hand, I do see the green window on the first run; I just didn't notice it before.

I will certainly bisect to find what fixed that bug.
This is very important, because otherwise it can surface again.

It's a bit weird, though, to mark good commits as bad and vice versa...
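
The inverted search can be illustrated with a small sketch. This is not how git bisect is implemented; it is a plain binary search over a hypothetical linear history, assuming the oldest commit is broken, the newest is fixed, and the fix stays in once it lands. The commit names and the `is_fixed` predicate are placeholders.

```python
def find_first_fixed(commits, is_fixed):
    """Binary search for the first commit where is_fixed(commit) is True.

    `commits` is ordered oldest to newest. Assumes commits[0] is broken,
    commits[-1] is fixed, and every commit after the fix is also fixed.
    """
    lo = 0                  # newest commit known to be broken
    hi = len(commits) - 1   # oldest commit known to be fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid        # the fix landed at mid or earlier
        else:
            lo = mid        # the fix landed after mid
    return commits[hi]


# Hypothetical history of eight commits where the fix lands in "c5".
history = [f"c{i}" for i in range(8)]


def overlay_ok_after_resume(commit):
    """Stand-in for building, booting and testing one kernel revision."""
    return history.index(commit) >= 5


print(find_first_fixed(history, overlay_ok_after_resume))
```

When hunting a fix with a tool that asks for "good" and "bad", the fixed commits play the role of "bad" and the broken ones the role of "good", which is the swap described above.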

Comment 24 Daniel Vetter 2010-01-31 15:01:47 UTC

Any news on your bisection? I'd really like to know what fixed your
problem.

-Daniel

Comment 25 maximlevitsky 2010-02-01 14:33:57 UTC

Really sorry.
I will do the bisection really soon.

Comment 26 Jesse Barnes 2010-02-05 15:18:32 UTC

Assuming the bisection will help Daniel fix this issue quickly.

Comment 27 maximlevitsky 2010-02-06 12:05:24 UTC

Just two funny things about git:

Bisecting: 0 revisions left to test after this (roughly 0 steps)

e8b60faea972604c315634cff62d44803731ea9 is first bad commit
commit 7e8b60faea972604c315634cff62d44803731ea9
Author: Andrew Lutomirski <luto@mit.edu>
Date:   Sun Nov 8 13:49:51 2009 -0500

    drm/i915: restore render clock gating on resume
    
    Rather than restoring just a few clock gating registers on resume,
    just reinitialize the whole thing.
    
    Signed-off-by: Andy Lutomirski <luto@mit.edu>
    [anholt: Fixed up for RC6 support landed since the patch was written]
    Signed-off-by: Eric Anholt <eric@anholt.net>


OK, now seriously. I bisected the fix for this bug, and
e8b60faea972604c315634cff62d44803731ea9 is the fix.

Comment 28 maximlevitsky 2010-02-06 12:07:00 UTC

7e8b60faea972604c315634cff62d44803731ea9 of course

Comment 29 Daniel Vetter 2010-02-07 02:50:40 UTC

> --- Comment #27 from maximlevitsky@gmail.com  2010-02-06 12:05:24 PST ---
> Just two funny things about git:
> 
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> 
> e8b60faea972604c315634cff62d44803731ea9 is first bad commit
> commit 7e8b60faea972604c315634cff62d44803731ea9
> Author: Andrew Lutomirski <luto@mit.edu>
> Date:   Sun Nov 8 13:49:51 2009 -0500
> 
>     drm/i915: restore render clock gating on resume
> 
>     Rather than restoring just a few clock gating registers on resume,
>     just reinitialize the whole thing.
> 
>     Signed-off-by: Andy Lutomirski <luto@mit.edu>
>     [anholt: Fixed up for RC6 support landed since the patch was written]
>     Signed-off-by: Eric Anholt <eric@anholt.net>

Thanks a lot for bisecting this; this makes some sense as a fix. I would
never have come up with such an idea, so there was definitely a decent
amount of luck involved ;)
