Bug 105720

Summary: screen corruption using SNA and TearFree on Intel GeminiLake
Product: xorg Reporter: Clinton Taylor <clinton.a.taylor>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: blocker    
Priority: medium CC: eero.t.tamminen, martin.peres
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
xorg config file as used on GLK RVP
none
Xorg.1.log from GLK RVP
none
moving window around screen showing corruption
none
Xorg log on a GLK NUC that cannot reproduce the issue
none
Ratelimit per-CRTC updates without TearFree
none
Diff needed to fix the issue, to be applied on top of 6c1e70ca4234a5ab23fcdb3735fc55ac7939c52c (current HEAD)
none
Reorder vblank/flip event handler none

Description Clinton Taylor 2018-03-23 17:43:42 UTC
Created attachment 138318 [details]
xorg config file as used on GLK RVP

Issue:
   Intermittent screen corruption occurs while moving/resizing windows on display.
 
Reproduction environment:
   Pure X windows using Openbox with DRI2/SNA/TEARFREE with second monitor rotated. Running GFX applications like glx_gears appears to make issue appear quicker.
Comment 1 Clinton Taylor 2018-03-23 17:50:55 UTC
Created attachment 138319 [details]
Xorg.1.log from GLK RVP
Comment 2 Clinton Taylor 2018-03-23 17:53:28 UTC
Created attachment 138320 [details]
moving window around screen showing corruption
Comment 3 Clinton Taylor 2018-03-23 21:10:15 UTC
Issue does not appear to occur on Kabylake platforms.
Comment 4 Clinton Taylor 2018-03-23 21:17:50 UTC
Disabling XDAMAGE extension did not change the issue.

Removing INT10 module load from xorg.conf did not change the issue.

Problem changes a little with PerCrtcPixmaps set to true, but causes many other issues.

All intel_options[] have been tried with no change to behavior.

Xrefresh called when screen is corrupted refreshes the displays correctly.
Comment 5 Clinton Taylor 2018-03-23 21:22:54 UTC
Invalidated full pixmap shadow_region.extents in wait_for_shadow() in sna_display.c and sna_mode_redisplay() 

No change in behavior
Comment 6 Kimmo Nikkanen 2018-03-25 08:51:59 UTC
Have we tried to disable all the persistence features such as FBC/PSR and see whether that helps?
Comment 7 Martin Peres 2018-03-26 13:30:10 UTC
Created attachment 138358 [details]
Xorg log on a GLK NUC that cannot reproduce the issue

I tried to reproduce the issue, to no avail:

HW:
 - GLK NUC (J4005) with one or two sticks of RAM
 - 2 HDMI screens, one 4K (rotated) and a full HD one
 
SW:
 - Linux 4.16-rc7
 - X-server 1.19.6
 - Intel DDX / SNA: 2:2.99.917+git20171229-1
 - DRI2 and DRI3 got tested

First of all, are you sure you are using DRI2? I added code to mesa a long time ago to be able to distinguish between the two: LIBGL_DEBUG=verbose glxgears.

Have you tried running a compositor instead of running on a bare-bone X? You can try compton which is a standalone X11 compositor.

Hope this helps.
Comment 8 Chris Wilson 2018-03-26 15:20:13 UTC
Towards the end of the log, we see a sequence like

[    55.623] sna_poly_fill_rect: fallback - fbPolyFillRect
[    55.623] sna_gc_move_to_gpu(0x55e0f751c330)
[    55.623] sna_change_window_attributes
[    55.623] sna_change_window_attributes: flushing background pixmap
[    55.623] sna_validate_pixmap: target bpp=32, source bpp=32
[    55.623] sna_validate_gc(0x55e0f6f690c0) changes=7fffff, previous serial=80000000, drawable=247
[    55.623] sna_validate_gc: recomputing clip
[    55.623] sna_validate_gc: composite clip=1x[(3090, 1211), (3106, 1227)] [0x55e0f73a4f30]
[    55.623] sna_poly_fill_rect(n=1, PlaneMask: ffffffff (solid 1), solid fill: 0 [style=1, tileIsPixel=0], alu=3)
[    55.623] sna_poly_fill_rect_extents: [0] = (0, 0)x(16, 16)
[    55.623] sna_poly_fill_rect: extents(3090, 1211), (3106, 1227), flags=1
[    55.623] sna_poly_fill_rect: dropping last-cpu hint
[    55.623] sna_drawable_use_bo pixmap=8, box=((3090, 1211), (3106, 1227)), flags=9...
[    55.623] sna_drawable_use_bo: pinned, never REPLACES
[    55.623] sna_drawable_use_bo: flush=0, shm=0, cpu=0 => flags=9
[    55.623] sna_drawable_use_bo: use GPU fast path (all-damaged)
[    55.623] sna_drawable_use_bo: applying move-to-gpu override
[    55.623] sna_pixmap_discard_shadow_damage: discarding region 1x[(3090, 1211), (3106, 1227)] from damage 1x[(2989, 1209], (3106, 1229)]
[    55.623] wait_for_shadow: enabled? 1 waiting? 0, flags=3, flips=2, pixmap=8 [front?=1], handle=10, shadow=39
[    55.626] wait_for_shadow: 2 flips still pending, shadow flip_active=2
[    55.626] wait_for_shadow: after waiting 2 flips outstanding, flip_active=2
[    55.626] kgem_create_2d(5760x2160, bpp=32, tiling=1, exact=1, inactive=0, cpu-mapping=0, gtt-mapping=0, scanout?=1, prime?=0, temp?=0)
[    55.626] kgem_surface_size: tile_width=512, tile_height=8 => aligned pitch=23040, height=2160
[    55.626]   1:from scanout: pitch=23040, tiling=1, handle=11, id=447
[    55.626] wait_for_shadow: replacing still-attached GPU bo handle=1, flips=2
[    55.626] wait_for_shadow: copying existing GPU damage: 4x(0, 0), (5760, 2160)
[    55.626] gen9_render_copy_boxes (0, 0)->(0, 0) x 4, alu=3, flags=0, self-copy=0, overlaps? 0

The tricky part here is sna_pixmap_discard_shadow_damage. So the idea behind the earlier test was to remove that discard and force the full repaint. It's still the best idea I have as to where it goes wrong...
Comment 9 Chris Wilson 2018-03-26 16:32:47 UTC
Variant of earlier test, force TearFree to use a new(ish) scanout on every frame:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index e7bf6cab..409bb9bf 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -1785,7 +1785,7 @@ static bool wait_for_shadow(struct sna *sna,
        }
 
        bo = sna->mode.shadow;
-       if (flip_active) {
+       if (flip_active || 1) {
                bo = kgem_create_2d(&sna->kgem,
                                    pixmap->drawable.width,
                                    pixmap->drawable.height,

It is still pulling the scanout from a cache (or else we need to clflush everytime) but we now have a full frame delay before reuse, and we are forced to do a fullscreen copy everytime.
Comment 10 Clinton Taylor 2018-03-26 16:47:30 UTC
(In reply to Martin Peres from comment #7)

> 
> First of all, are you sure you are using DRI2? I added code to mesa a long
> time ago to be able to distinguish between the two: LIBGL_DEBUG=verbose
> glxgears.
> 
> Have you tried running a compositor instead of running on a bare-bone X? You
> can try compton which is a standalone X11 compositor.
> 
> Hope this helps.

Confirmed DRI2 is being used with LIBGL_DEBUG=verbose

Running a compositor increases the time it takes to reproduce, but the issue is still seen. Compositors used have been compiz and xcompmgr.
Comment 11 Clinton Taylor 2018-03-26 17:14:44 UTC
(In reply to Chris Wilson from comment #9)
> Variant of earlier test, force TearFree to use a new(ish) scanout on every
> frame:
> 
> diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
> index e7bf6cab..409bb9bf 100644
> --- a/src/sna/sna_display.c
> +++ b/src/sna/sna_display.c
> @@ -1785,7 +1785,7 @@ static bool wait_for_shadow(struct sna *sna,
>         }
>  
>         bo = sna->mode.shadow;
> -       if (flip_active) {
> +       if (flip_active || 1) {
>                 bo = kgem_create_2d(&sna->kgem,
>                                     pixmap->drawable.width,
>                                     pixmap->drawable.height,
> 
> It is still pulling the scanout from a cache (or else we need to clflush
> everytime) but we now have a full frame delay before reuse, and we are
> forced to do a fullscreen copy everytime.

Still reproducible with this patch.
Comment 12 Chris Wilson 2018-03-26 22:05:54 UTC
Created attachment 138363 [details] [review]
Ratelimit per-CRTC updates without TearFree

So looking at the code, the easiest way to buffer output onto the CRTC is via Option "PerCRTCPixmap". We can replace Option "TearFree" (or force Option "NoTearFree") with the attached as a rough approximation to scheduling the screen update via vblank (v.rough, it definitely does not prevent tearing!), and it definitely feels heavier than TearFree (presumably due to the synchronisation on the extra forced copy).

Testing on bdw/skl showed no sign of the "blank screen", but there's a fishy !rr_active() in the middle of initialising the crtc pixmap that needs double checking (I suspect that actually should be if !(flags & TEAR_FREE)) and may be the cause of the delayed init.

Option "PerCRTCPixmap" + Option "TearFree" should work together, that's akin to triple buffering, as the CRTC flips are double buffered with the ScreenPixmap being the third copy as rendered to by the clients. Without the feedback loop, it should be impervious to the accumulation of dirt, but it may still exhibit the wrong buffer syndrome (if the event accounting is completely skewiff), e.g. glxgears going backwards.

The drawback is that at best it should instill an extra frame of input-output latency, and at worst that latency may jitter.
Comment 13 Chris Wilson 2018-03-27 15:25:55 UTC
So far:

- observed it in TearFree

- not observed it in NoTearFree

- we've nullified damage tracking for TearFree, forcing it to copy every frame,
still reproduces (probably sensible to check through the debug logs for this one to verify the copy to a fresh bo)

- used PerCrtcPixmaps + TearFree, reproduced

- used PerCrtcPixmaps + NoTearFree + attached patch (c12), reproduced

(Please double check the above, and make sure the log file reports using PerCrtcPixmap/TearFree etc.)

So everything above that involves a pageflip (the ratelimit patch includes a pageflip to itself for timing purposes) has reproduced the _persistent_ dirt.

Left to verify,

- PerCrtcPixmaps + NoTearFree (not using attached patch)

- Forcing a full CRTC copy with PerCrtcPixmaps + NoTearFree (requires another hack)

Other things to test (if it is pageflip related), fullscreen DRI2/DRI3 applications (needs to watch HW or logs to confirm pageflips).
Comment 14 Chris Wilson 2018-03-27 17:00:40 UTC
On the assumption it is not the damage tracking between buffers, one of the few other steps is the rendercopy. Another set of tests will then be with

  Option "AccelMethod" "blt"

preferably testing with both TearFree and NoTearFree. (You can selectively disable individual gen9_render.c entry points to narrow down the routine if it appears to help.)
Comment 15 Clinton Taylor 2018-03-28 00:27:58 UTC
(In reply to Chris Wilson from comment #13)
> So far:
> 
> - observed it in TearFree
> 
> - not observed it in NoTearFree
> 
> - we've nullified damage tracking for TearFree, forcing it to copy every
> frame,
> still reproduces (probably sensible to check through the debug logs for this
> one to verify the copy to a fresh bo)
> 
> - used PerCrtcPixmaps + TearFree, reproduced
> 
> - used PerCrtcPixmaps + NoTearFree + attached patch (c12), reproduced

Haven't been able to reproduce with this configuration again. However the last reproduction took over 20 minutes. Will continue testing.

> 
> (Please double check the above, and make sure the log file reports using
> PerCrtcPixmap/TearFree etc.)

Confirmed in logs that the Options are being used correctly.

> 
> So everything above that involves a pageflip (the ratelimit patch includes a
> pageflip to itself for timing purposes) has reproduced the _persistent_ dirt.
> 
> Left to verify,
> 
> - PerCrtcPixmaps + NoTearFree (not using attached patch)

Issue not seen. Rendering appears to be in order and double buffered with the extra crtc pixmap. Need to run this config more to validate.

> 
> - Forcing a full CRTC copy with PerCrtcPixmaps + NoTearFree (requires
> another hack)
> 
> Other things to test (if it is pageflip related), fullscreen DRI2/DRI3
> applications (needs to watch HW or logs to confirm pageflips).
Comment 16 Clinton Taylor 2018-03-28 01:20:01 UTC
(In reply to Chris Wilson from comment #14)
> On the assumption it is not the damage tracking between buffers, one of the
> few other steps is the rendercopy. Another set of tests will then be with
> 
>   Option "AccelMethod" "blt"
> 
> preferably testing with both TearFree and NoTearFree. (You can selectively
> disable individual gen9_render.c entry points to narrow down the routine if
> it appears to help.)

AccelMethod "blt" cause artifacts as well. usually limited to 1 or two lines inside existing windows, but even right click context menus seems to be corrupted. This occurs with both TearFree and NoTearFree.

This does not help
Comment 17 Martin Peres 2018-03-28 13:21:06 UTC
(In reply to Clinton Taylor from comment #10)
> (In reply to Martin Peres from comment #7)
> 
> > 
> > First of all, are you sure you are using DRI2? I added code to mesa a long
> > time ago to be able to distinguish between the two: LIBGL_DEBUG=verbose
> > glxgears.
> > 
> > Have you tried running a compositor instead of running on a bare-bone X? You
> > can try compton which is a standalone X11 compositor.
> > 
> > Hope this helps.
> 
> Confirmed DRI2 is being used with LIBGL_DEBUG=verbose
> 
> Running a compositor increases the time it takes to reproduce, but the issue
> is still seen. Compositors used have been compiz and xcompmgr.

Thanks for the information. I could reproduce the problem with your stack on my GLK.
Comment 18 Clinton Taylor 2018-03-28 17:52:43 UTC
Issue occurs on a BXT based laptop using PCI DEVID 0x5A85
Comment 19 Martin Peres 2018-03-28 18:37:12 UTC
Created attachment 138397 [details] [review]
Diff needed to fix the issue, to be applied on top of 6c1e70ca4234a5ab23fcdb3735fc55ac7939c52c (current HEAD)

So, the root window is getting corrupted even when using tripple buffering, which means the root window would never re-use buffers created by SNA.

To rule out anything related to the HW, we tried using as much fallback code as possible using Option "AccelMethod" "none" (Chris' suggestion).

Instead of fixing the problem, it made it extremely easy to reproduce. Chris asked for a full debug log (--enable-debug=full), which I could not get because SNA would hit an assert. Once commented out, I could produce a trace for Chris to read.

He found an issue using the trace and wrote a one-liner that seem to fix the problem. The only problem is that Chris does not understand why this fixes the issue at hand...

I am still posting the information now, for other parties to try to reproduce.
Comment 20 Chris Wilson 2018-03-28 18:44:14 UTC
Based on Martin's AccelMethod=none logs, we found this:

commit 7418d53c589719f71154c5ec58a2b7c486020587 (HEAD, upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 28 19:38:51 2018 +0100

    sna/dri2: Clip application of damage to windowed swapbuffers
    
    Since forever we have been passing region=NULL when doing a windowed
    swapbuffer on vblank, and using that to mark the entire pixmap as being
    damaged. For an uncomposited window, this is incorrect as it points to a
    clipped region of the ScreenPixmap and so we were marking the entire
    ScreenPixmap as being flushed, but only having operated on the windowed
    region. Instead pass along the clip extents if region is unset.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105720
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

It definitely impacts AccelMethod=none and seems to improve AccelMethod=sna as well, but I haven't been able to explain why.
Comment 21 Clinton Taylor 2018-03-28 23:13:16 UTC
(In reply to Chris Wilson from comment #20)
> Based on Martin's AccelMethod=none logs, we found this:
> 
> commit 7418d53c589719f71154c5ec58a2b7c486020587 (HEAD, upstream/master)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Mar 28 19:38:51 2018 +0100
> 
>     sna/dri2: Clip application of damage to windowed swapbuffers
>     
>     Since forever we have been passing region=NULL when doing a windowed
>     swapbuffer on vblank, and using that to mark the entire pixmap as being
>     damaged. For an uncomposited window, this is incorrect as it points to a
>     clipped region of the ScreenPixmap and so we were marking the entire
>     ScreenPixmap as being flushed, but only having operated on the windowed
>     region. Instead pass along the clip extents if region is unset.
>     
>     References: https://bugs.freedesktop.org/show_bug.cgi?id=105720
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> It definitely impacts AccelMethod=none and seems to improve AccelMethod=sna
> as well, but I haven't been able to explain why.

This patch does not fix the issue. Issue is still seen on GLK and BXT platforms.

There are times that the original issue is not seen for over an hour.
Comment 22 Clinton Taylor 2018-03-28 23:15:53 UTC
(In reply to Martin Peres from comment #19)
> Created attachment 138397 [details] [review] [review]
> Diff needed to fix the issue, to be applied on top of
> 6c1e70ca4234a5ab23fcdb3735fc55ac7939c52c (current HEAD)
> 
> So, the root window is getting corrupted even when using tripple buffering,
> which means the root window would never re-use buffers created by SNA.
> 
> To rule out anything related to the HW, we tried using as much fallback code
> as possible using Option "AccelMethod" "none" (Chris' suggestion).
> 
> Instead of fixing the problem, it made it extremely easy to reproduce. Chris
> asked for a full debug log (--enable-debug=full), which I could not get
> because SNA would hit an assert. Once commented out, I could produce a trace
> for Chris to read.
> 
> He found an issue using the trace and wrote a one-liner that seem to fix the
> problem. The only problem is that Chris does not understand why this fixes
> the issue at hand...
> 
> I am still posting the information now, for other parties to try to
> reproduce.

Patch does not fix the issue. Sometimes the issue can not be reproduced over several hours. 

What was your xorg.conf Options during your testing?
Comment 23 Chris Wilson 2018-03-29 20:15:31 UTC
Try,

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 7fccb508..cb0b573c 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -1934,6 +1934,8 @@ bool sna_pixmap_discard_shadow_damage(struct sna_pixmap *priv,
        if (priv->move_to_gpu != wait_for_shadow)
                return false;
 
+       return true;
+
        sna = priv->move_to_gpu_data;
        if (region) {
                DBG(("%s: discarding region %dx[(%d, %d), (%d, %d)] from damage %dx[(%d, %d], (%d, %d)]\n",
Comment 24 Clinton Taylor 2018-03-29 20:26:17 UTC
(In reply to Chris Wilson from comment #23)
> Try,
> 
> diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
> index 7fccb508..cb0b573c 100644
> --- a/src/sna/sna_display.c
> +++ b/src/sna/sna_display.c
> @@ -1934,6 +1934,8 @@ bool sna_pixmap_discard_shadow_damage(struct
> sna_pixmap *priv,
>         if (priv->move_to_gpu != wait_for_shadow)
>                 return false;
>  
> +       return true;
> +
>         sna = priv->move_to_gpu_data;
>         if (region) {
>                 DBG(("%s: discarding region %dx[(%d, %d), (%d, %d)] from
> damage %dx[(%d, %d], (%d, %d)]\n",

trying now
Comment 25 Clinton Taylor 2018-03-29 20:51:08 UTC
Still reproduces with patch from comment #23
Comment 26 Chris Wilson 2018-03-29 21:55:01 UTC
I doubt it'll be too useful, but you might see something in the pattern:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 7fccb508..b5f569ed 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -1891,6 +1891,20 @@ static bool wait_for_shadow(struct sna *sna,
                                            region_num_rects(&sna->mode.shadow_region),
                                            0))
                        ERR(("%s: copy failed\n", __FUNCTION__));
+
+               sna_blt_fill_boxes(sna, GXcopy,
+                                  bo, pixmap->drawable.bitsPerPixel,
+                                  0x0000ff00,
+                                  region_rects(&sna->mode.shadow_region),
+                                  region_num_rects(&sna->mode.shadow_region));
+       }
+
+       if (RegionNotEmpty(&sna->mode.shadow_cancel)) {
+               sna_blt_fill_boxes(sna, GXcopy,
+                                  bo, pixmap->drawable.bitsPerPixel,
+                                  0x00ff00ff,
+                                  region_rects(&sna->mode.shadow_cancel),
+                                  region_num_rects(&sna->mode.shadow_cancel));
        }
 
        if (priv->cow)
Comment 27 Chris Wilson 2018-03-29 23:00:36 UTC
The lack of green onto the rotated output gives an idea:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 7fccb508..0ddb2080 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -1837,6 +1837,7 @@ static bool wait_for_shadow(struct sna *sna,
        RegionSubtract(&sna->mode.shadow_region,
                       &sna->mode.shadow_region,
                       &sna->mode.shadow_cancel);
+       DamageRegionAppend(sna->front, &sna->mode.shadow_region);
 
        while (!list_is_empty(&sna->mode.shadow_crtc)) {
                struct sna_crtc *crtc =
Comment 28 Chris Wilson 2018-03-29 23:01:51 UTC
Make that DamageRegionAppend(&sna->front->drawable, &sna->mode.shadow_region); for a quieter compiler.
Comment 29 Chris Wilson 2018-03-29 23:05:58 UTC
It shouldn't make a difference. The damaged area on the new ScreenPixmap are what needs to be copied to the rotated output, the undamaged area (that we have to copy back before flipping to the full ScreenPixmap) should match what is already on the rotated per-crtc outputs.

Hmm, going to think move about what the green pattern on the rotated output tells us.
Comment 30 Clinton Taylor 2018-03-30 00:19:08 UTC
(In reply to Chris Wilson from comment #28)
> Make that DamageRegionAppend(&sna->front->drawable,
> &sna->mode.shadow_region); for a quieter compiler.

This makes a big difference to the green display (patch from comment #26) on the rotated display and a little difference on the normal display.

I have also not replicated the issue yet on either display with DamageRegionAppend() added.
Comment 31 Chris Wilson 2018-03-30 10:12:42 UTC
(In reply to Clinton Taylor from comment #30)
> (In reply to Chris Wilson from comment #28)
> > Make that DamageRegionAppend(&sna->front->drawable,
> > &sna->mode.shadow_region); for a quieter compiler.
> 
> This makes a big difference to the green display (patch from comment #26) on
> the rotated display and a little difference on the normal display.
> 
> I have also not replicated the issue yet on either display with
> DamageRegionAppend() added.

It's the "either display" that keeps me worried. If it was just the rotated one, the next step would be

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 7fccb508..1c090946 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -9191,6 +9191,9 @@ void sna_mode_redisplay(struct sna *sna)
                                        RegionUnion(&damage, &damage, &sna_crtc->crtc_damage);
                                sna_crtc->crtc_damage = new_damage;
 
+                               damage.extents = crtc->bounds;
+                               damage.data = NULL;
+
                                sna_crtc_redisplay(crtc, &damage, bo);
                                kgem_bo_submit(&sna->kgem, bo);
                                __kgem_bo_clear_dirty(bo);

to force the entire per-crtc bo to be regenerate on each update.

Oh well if DamageRegionAppend survives, at least we have a w/a. TBD if it's a happy w/a.
Comment 32 Clinton Taylor 2018-03-30 15:42:51 UTC
(In reply to Chris Wilson from comment #31)
> (In reply to Clinton Taylor from comment #30)
> > (In reply to Chris Wilson from comment #28)
> > > Make that DamageRegionAppend(&sna->front->drawable,
> > > &sna->mode.shadow_region); for a quieter compiler.
> > 
> > This makes a big difference to the green display (patch from comment #26) on
> > the rotated display and a little difference on the normal display.
> > 
> > I have also not replicated the issue yet on either display with
> > DamageRegionAppend() added.
> 
> It's the "either display" that keeps me worried. If it was just the rotated
> one, the next step would be
> 
> diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
> index 7fccb508..1c090946 100644
> --- a/src/sna/sna_display.c
> +++ b/src/sna/sna_display.c
> @@ -9191,6 +9191,9 @@ void sna_mode_redisplay(struct sna *sna)
>                                         RegionUnion(&damage, &damage,
> &sna_crtc->crtc_damage);
>                                 sna_crtc->crtc_damage = new_damage;
>  
> +                               damage.extents = crtc->bounds;
> +                               damage.data = NULL;
> +
>                                 sna_crtc_redisplay(crtc, &damage, bo);
>                                 kgem_bo_submit(&sna->kgem, bo);
>                                 __kgem_bo_clear_dirty(bo);
> 
> to force the entire per-crtc bo to be regenerate on each update.
> 
> Oh well if DamageRegionAppend survives, at least we have a w/a. TBD if it's
> a happy w/a.

It's not a happy w/a. Issue has changed and now leaves full copy of moved window. Trying code above.
Comment 33 Clinton Taylor 2018-03-30 17:19:11 UTC
> 
> It's not a happy w/a. Issue has changed and now leaves full copy of moved
> window. Trying code above.

Still reproduces
Comment 34 Chris Wilson 2018-04-03 18:32:17 UTC
Created attachment 138556 [details] [review]
Reorder vblank/flip event handler

I do wonder if "sna: Defer submission of the next shadow frame until halfway through" wasn't that far off the truth. Please try the attached diff (3 patches rolled into one), surmised by

    sna: Reorder vblank/flip event handling to avoid TearFree recursion
    
    TearFree wants to grab the most recently used scanout for rendering the
    next frame into. If the flip event was still pending, we would then
    query the drm event buffer for any pending completions, but this would
    proceed to execute all the other events before the flip events as well.
    Since we they were out of sequence, we pushed them into a buffer to
    execute afterwards, however we forgot the side effects of the flip
    handlers, for example see commit af36a4ab78cc ("sna: Defer submission
    of the next shadow frame until halfway through") and that there may have
    been events read from drm into a local buffer inside sna_mode_wakeup()
    that haven't been processed yet.
    
    Eliminate the need for calling sna_mode_wakeup() by ensuring that all
    flip events have been completed first before handing the vblank
    callbacks and potential drawing, ensuring the correct ordering.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105720
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 35 Clinton Taylor 2018-04-03 19:46:42 UTC
(In reply to Chris Wilson from comment #34)
> Created attachment 138556 [details] [review] [review]
> Reorder vblank/flip event handler
> 
> I do wonder if "sna: Defer submission of the next shadow frame until halfway
> through" wasn't that far off the truth. Please try the attached diff (3
> patches rolled into one), surmised by
> 
>     sna: Reorder vblank/flip event handling to avoid TearFree recursion
>     
>     TearFree wants to grab the most recently used scanout for rendering the
>     next frame into. If the flip event was still pending, we would then
>     query the drm event buffer for any pending completions, but this would
>     proceed to execute all the other events before the flip events as well.
>     Since we they were out of sequence, we pushed them into a buffer to
>     execute afterwards, however we forgot the side effects of the flip
>     handlers, for example see commit af36a4ab78cc ("sna: Defer submission
>     of the next shadow frame until halfway through") and that there may have
>     been events read from drm into a local buffer inside sna_mode_wakeup()
>     that haven't been processed yet.
>     
>     Eliminate the need for calling sna_mode_wakeup() by ensuring that all
>     flip events have been completed first before handing the vblank
>     callbacks and potential drawing, ensuring the correct ordering.
>     
>     References: https://bugs.freedesktop.org/show_bug.cgi?id=105720
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Issue still occurs with this patch
Comment 36 Chris Wilson 2018-04-05 14:49:15 UTC
So, we simplified the wait_for_shadow() to always do a fullscreen copy onto a new buffer and by first filling the buffer with a checkerboard we confirmed that the copy was not occurring.

By using #define DEBUG_NO_BLT 1 and comparing with Option "AccelMethod" "blt", we were able to conclude this affects the render copy; i.e. gen9_render.c seems completely suspect.

That the copy fails entirely suggests some sort of pixel kill, or that the sampler read hits 0. The latter has happened before, when the sample offset was larger the supported table size.

But there is another possibility, that we fail to submit the sampler table, which is tucked in at the end of the batch.

Currently we are investigating this angle with

diff --git a/src/sna/sna.h b/src/sna/sna.h
index 6fe1f0d2..af61e4b7 100644
--- a/src/sna/sna.h
+++ b/src/sna/sna.h
@@ -80,7 +80,7 @@ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 #include "compiler.h"
 #include "debug.h"
 
-#define DEBUG_NO_BLT 0
+#define DEBUG_NO_BLT 1
 
 #define DEBUG_FLUSH_BATCH 0
 
and so far, all seems ok.
Comment 37 Chris Wilson 2018-04-05 14:50:27 UTC
Ah, wrong copy'n'paste:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 038ecc87..866796ff 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -79,7 +79,7 @@ search_snoop_cache(struct kgem *kgem, unsigned int num_pages, unsigned flags);
 #define DBG_NO_RELAXED_FENCING 0
 #define DBG_NO_SECURE_BATCHES 0
 #define DBG_NO_PINNED_BATCHES 0
-#define DBG_NO_SHRINK_BATCHES 0
+#define DBG_NO_SHRINK_BATCHES 1
 #define DBG_NO_FAST_RELOC 0
 #define DBG_NO_HANDLE_LUT 0
 #define DBG_NO_WT 0
Comment 38 Martin Peres 2018-04-06 08:34:24 UTC
(In reply to Chris Wilson from comment #37)
> Ah, wrong copy'n'paste:
> 
> diff --git a/src/sna/kgem.c b/src/sna/kgem.c
> index 038ecc87..866796ff 100644
> --- a/src/sna/kgem.c
> +++ b/src/sna/kgem.c
> @@ -79,7 +79,7 @@ search_snoop_cache(struct kgem *kgem, unsigned int
> num_pages, unsigned flags);
>  #define DBG_NO_RELAXED_FENCING 0
>  #define DBG_NO_SECURE_BATCHES 0
>  #define DBG_NO_PINNED_BATCHES 0
> -#define DBG_NO_SHRINK_BATCHES 0
> +#define DBG_NO_SHRINK_BATCHES 1
>  #define DBG_NO_FAST_RELOC 0
>  #define DBG_NO_HANDLE_LUT 0
>  #define DBG_NO_WT 0

No idea how Chris jumped to this idea after days of painful debugging, but boy was I glad to finally not be able to reproduce (despite fearing a fluke). Congrats Chris!

We are however still working on a proper fix that we could land, so please do not close the bug just yet!
Comment 39 Chris Wilson 2018-06-11 10:07:21 UTC
I suspect related to commit 746c8f143afad7aaa66c484485fc39888d437a3f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jun 10 20:43:09 2018 +0100

    drm/i915: Apply batch location restrictions before pinning
    
    We special case the position of the batch within the GTT to prevent
    negative self-relocation deltas from underflowing. However, that
    restriction is being applied after a trial pin of the batch in its
    current position. Thus we are not rejecting an invalid location if the
    batch has been used before, leading to an assertion if we happen to need
    to rearrange the entire payload. In the worst case, this may cause a GPU
    hang on gen7 or perhaps missing state.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105720
    Fixes: 2889caa92321 ("drm/i915: Eliminate lots of iterations over the execob
jects array")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Martin Peres <martin.peres@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180610194325.13467-2-c
hris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Comment 40 Chris Wilson 2019-02-26 18:14:38 UTC
Pretty sure this was the missing bias in the kernel after all.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.