Bug 111541 - Cursor sprite sometimes not shown since Linux 5.2
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: highest critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Whiteboard: Triaged, ReadyForDev
Keywords: regression
 
Reported: 2019-09-02 13:07 UTC by Jonas Ådahl
Modified: 2019-10-10 08:46 UTC
i915 platform: BDW, CFL, HSW, IVB, KBL
i915 features: display/Other


Attachments
i915_display_info: cursor invisible (4.22 KB, text/plain)
2019-09-03 07:26 UTC, Jonas Ådahl
i915_display_info: cursor visible (4.22 KB, text/plain)
2019-09-03 07:26 UTC, Jonas Ådahl
intel_reg dump --all: invisible (19.97 KB, text/plain)
2019-09-03 07:27 UTC, Jonas Ådahl
intel_reg dump --all: visible (19.97 KB, text/plain)
2019-09-03 07:27 UTC, Jonas Ådahl
intel_reg read 12: invisible (2.14 KB, text/plain)
2019-09-03 07:28 UTC, Jonas Ådahl
intel_reg read 12: visible (2.14 KB, text/plain)
2019-09-03 07:28 UTC, Jonas Ådahl
dump-gbm-bo.c (917 bytes, text/x-csrc)
2019-09-11 09:27 UTC, Jonas Ådahl

Description Jonas Ådahl 2019-09-02 13:07:02 UTC
Since upgrading from 5.1 to 5.2, the cursor sprite set via (non-atomic) KMS is sometimes not shown on screen when running mutter/GNOME Shell on top of KMS. It happens somewhat randomly and fairly seldom (I have personally seen it only once), with no clear way to reproduce it, but often enough that we get regular bug reports. It feels like a race condition somewhere.

E.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1738614 (contains drm.debug log)
https://gitlab.gnome.org/GNOME/gnome-shell/issues/1165

In all cases, downgrading to 5.1 makes the issue go away. If it's not a kernel bug/regression, any hints on what could cause it?
Comment 1 Lakshmi 2019-09-02 14:12:24 UTC
(In reply to Jonas Ådahl from comment #0)
> Since upgrading from 5.1 to 5.2, the cursor sprite set via (non-atomic) KMS
> is not properly shown on screen sometimes when running mutter/GNOME Shell on
> top of KMS. It happens somewhat randomly and fairly seldom (personally only
> once), with no clear way of how to reproduce, but often enough to get
> regular bug reports. It somewhat feels like a race condition somewhere.
> 
> E.g.
> https://bugzilla.redhat.com/show_bug.cgi?id=1738614 (contains drm.debug log)
> https://gitlab.gnome.org/GNOME/gnome-shell/issues/1165
> 
> In all cases, downgrading to 5.1 makes the issue go away. If it's not a
> kernel bug/regression, any hints on what could cause it?

Can you please verify the issue with drm-tip (https://cgit.freedesktop.org/drm-tip)? Full logs from boot (from 0 sec) with drm-tip would help the investigation.
By the way, the attached logs are not from boot. Which platform is this?
Comment 2 Ville Syrjala 2019-09-02 14:15:20 UTC
cat /sys/kernel/debug/dri/0/i915_display_info
intel_reg read --count 12 0x70080 0x71080 0x72080

Running these when the cursor has vanished should at least tell us whether the kernel thinks the cursor should be enabled, and whether it is actually enabled in hardware.
Comment 3 Jonas Ådahl 2019-09-03 07:24:02 UTC
(In reply to Ville Syrjala from comment #2)
> cat /sys/kernel/debug/dri/0/i915_display_info
> intel_reg read --count 12 0x70080 0x71080 0x72080
> 
> when the cursor has vanished should at least tell us whether the kernel
> thinks the cursor should be enabled, and whether it's actually enabled in
> hardware.

I just hit the issue again, and in i915_display_info, the cursor is reported as visible, but it's not showing on screen:

CRTC 47: pipe: A, active=yes, (size=1920x1080), dither=no, bpp=24
        fb: 118, pos: 0x0, size: 1920x1080
        encoder 106: type: DP-MST A, connectors:
                connector 117: type: DP-3, status: connected, mode:
                "1920x1080": 60 148500 1920 2008 2052 2200 1080 1084 1089 1125 0x48 0x5
        cursor visible? yes, position (334, 6), size 256x256, addr 0x00880000
        num_scalers=2, scaler_users=0 scaler_id=-1, scalers[0]: use=no, mode=10000000, scalers[1]: use=no, mode=0
        --Plane id 30: type=PRI, crtc_pos=   0x   0, crtc_size=1920x1080, src_pos=0.0000x0.0000, src_size=1920.0000x1080.0000, format=XR24 little-endian (0x34325258), rotation=0 (0x00000001)
        --Plane id 37: type=OVL, crtc_pos=   0x   0, crtc_size=   0x   0, src_pos=0.0000x0.0000, src_size=0.0000x0.0000, format=N/A, rotation=0 (0x00000001)
        --Plane id 44: type=CUR, crtc_pos= 334x   6, crtc_size= 256x 256, src_pos=0.0000x0.0000, src_size=256.0000x256.0000, format=AR24 little-endian (0x34325241), rotation=0 (0x00000001)
        underrun reporting: cpu=yes pch=yes 

When it's showing, the cursor part of the above text is identical, apart from the 'addr' (this includes the plane with id 44).

The intel_reg command just printed an error:
Error: /usr/share/igt-gpu-tools/registers/gen8_interrupt.txt:1: ('GEN8_MASTER_IRQ', '0x00044200', '')
Error: /usr/share/igt-gpu-tools/registers/skylake:1: gen8_interrupt.txt
Error: /usr/share/igt-gpu-tools/registers/kabylake:2: skylake
Warning: reading '/usr/share/igt-gpu-tools/registers/kabylake' failed. Using builtin register spec.

but then printed some registers. I'm attaching the output for both the visible and the invisible case, plus dump --all for good measure.
Comment 4 Jonas Ådahl 2019-09-03 07:26:37 UTC
Created attachment 145243 [details]
i915_display_info: cursor invisible
Comment 5 Jonas Ådahl 2019-09-03 07:26:51 UTC
Created attachment 145244 [details]
i915_display_info: cursor visible
Comment 6 Jonas Ådahl 2019-09-03 07:27:19 UTC
Created attachment 145245 [details]
intel_reg dump --all: invisible
Comment 7 Jonas Ådahl 2019-09-03 07:27:32 UTC
Created attachment 145246 [details]
intel_reg dump --all: visible
Comment 8 Jonas Ådahl 2019-09-03 07:28:29 UTC
Created attachment 145247 [details]
intel_reg read 12: invisible
Comment 9 Jonas Ådahl 2019-09-03 07:28:41 UTC
Created attachment 145248 [details]
intel_reg read 12: visible
Comment 10 Ville Syrjala 2019-09-03 14:45:39 UTC
Hmm. Yeah, looks like both the kernel and hw think the cursor should be enabled. 

One theory might be that the alpha channel is all zeroes. Would need to dump the relevant chunk of memory to confirm.

I have a tool to do just that, and I just extended it to handle cursors.

git://github.com/vsyrjala/intel-gpu-tools.git gtt_dump_2
intel_gtt_dump -f cursor.png -C a # dumps cursor image for pipe A

The downside of my tool is that it requires a kernel patch on PAT machines because the kernel is silly and won't allow you to mmap RAM via /dev/mem:
git://github.com/vsyrjala/linux.git pat_vs_dev_mem
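
The all-zero-alpha theory can be checked on any raw dump of the cursor image. Below is a minimal sketch in C, assuming the dump is tightly packed ARGB8888 (the AR24 format that i915_display_info reports for the cursor plane), with alpha in the top byte of each 32-bit pixel. The helper name cursor_alpha_all_zero is hypothetical and is not part of intel_gtt_dump.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if every pixel is fully transparent
 * (alpha byte == 0), else 0. Assumes tightly packed ARGB8888 pixels. */
int cursor_alpha_all_zero(const uint32_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* Alpha occupies bits 31:24 of each ARGB8888 word. */
        if ((pixels[i] >> 24) != 0)
            return 0;
    }
    return 1;
}
```

A result of 1 for a 256x256 cursor (count = 256 * 256) would be consistent with the hardware being enabled while nothing is visible on screen.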
Comment 12 Olivier Crête 2019-09-04 22:14:36 UTC
IVB as well (ThinkPad x230 w/ i7-3520M)
Comment 13 Chris Wilson 2019-09-07 17:08:38 UTC
If v5.1 is good and v5.2 is bad, a bisect may give a clearer hint.
Comment 14 Jonas Ådahl 2019-09-11 09:27:08 UTC
Created attachment 145329 [details]
dump-gbm-bo.c

Using the attached function (compile, break the compositor process using gdb, then run "print dlopen("/path/to/compiled/dump-gbm-bo.so", 2)") I attempted to look at the contents of the cursor buffer before it was passed to drmModeSetCursor2().

When the issue reproduces, the buffer seems to contain only zeros, but everything else (e.g. its size) is correct. When it does not reproduce, the dump taken at the equivalent point shows correct content. I haven't verified that a dump of the same gbm_bo is correct immediately after writing pixels.

In the cursor renderer code in mutter, we only ever write to a gbm_bo immediately after its construction, and we never map the memory after that. Since this only reproduces with some kernel versions, I doubt that we are uploading empty pixels, but I will add some code that dumps right after construction too, to be sure. What could cause the memory of a gbm_bo to be cleared after its construction, but before its destruction?
Comment 15 Olivier Fourdan 2019-09-13 07:39:28 UTC
I wonder if this might be related to a VT switch somehow.

It seems to me every time I had that issue with the cursor disappearing, I had switched to another (console) VT shortly before...

It's not a 100% reproducer, but I suspect this might be a factor to trigger the issue.
Comment 16 Mikko Rapeli 2019-09-13 07:50:09 UTC
Not sure about VT switch, but I'm seeing this often in gnome when switching between virtual desktops. Pointer is drawn on first but not on second. There is also something with active windows, since pointer may be drawn on active console app window but not outside of it. And sometimes no pointer visible on lock/login screen... bisecting kernel now.
Comment 17 Jani Saarinen 2019-09-13 08:15:45 UTC
Thanks Mikko.
Comment 18 Andrew Green 2019-09-13 17:17:31 UTC
I finally found a way to reproduce this issue easily. It appears that increasing CPU/memory usage rapidly causes this issue to be triggered.

Steps to reproduce:
- Start a GNOME session
- Open Firefox
- Hold Ctrl-t to open tabs rapidly
- Once system memory usage reaches about 80%, use the hot corner to switch to the Activities view in GNOME Shell
- Move cursor in Activities overview
- The cursor should disappear

My current kernel version is:

```
$ uname -a
Linux galago 5.2.13-200.fc30.x86_64 #1 SMP Fri Sep 6 14:30:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```

I haven't done any debugging to figure out why this is happening, but hopefully this information was useful!
Comment 19 Jonas Ådahl 2019-09-13 17:23:49 UTC
Thanks to being able to reproduce at will, I did some more digging.

  1. Take note of which gbm_bo a successful drmModeSetCursor2() call used when showing a cursor on the GNOME Shell panel.
  2. Move the cursor to a maximized Firefox window just below the top panel (this caused the drmModeSetCursor2() calls to change to the cursor from a wl_buffer).
  3. Hold down ^T for a while to open a bunch of tabs.
  4. Move the cursor back up to the top panel. This triggered a call to drmModeSetCursor2().

What I observed is that the same gbm_bo was used in (4) as in (1), but it had become empty. There were no gbm_bo cursor allocations in between, nor any drmModeSetCursor2() calls.
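
One way to confirm that a reused buffer changed between two drmModeSetCursor2() calls is to fingerprint its pixels at each call and compare the values. The sketch below uses a 64-bit FNV-1a hash over the visible bytes of each row. The function bo_fingerprint and its parameters are hypothetical. Obtaining the CPU mapping (for example with gbm_bo_map()) is assumed and not shown, and 4 bytes per pixel is assumed as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit FNV-1a hash over the visible bytes of each row.
 * 'stride' is the row pitch in bytes and may exceed width * 4, so padding
 * bytes at the end of each row are deliberately skipped. */
uint64_t bo_fingerprint(const uint8_t *map, uint32_t width,
                        uint32_t height, uint32_t stride)
{
    uint64_t hash = 0xcbf29ce484222325ULL;       /* FNV-1a offset basis */

    for (uint32_t y = 0; y < height; y++) {
        const uint8_t *row = map + (size_t)y * stride;

        for (size_t x = 0; x < (size_t)width * 4; x++) {
            hash ^= row[x];
            hash *= 0x100000001b3ULL;            /* FNV-1a prime */
        }
    }
    return hash;
}
```

Logging the fingerprint at steps (1) and (4) would show whether the same gbm_bo still holds the pixels it had when it was first used.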
Comment 20 Chris Wilson 2019-09-20 12:45:31 UTC
Can you please try https://patchwork.freedesktop.org/series/67000/
Comment 21 Chris Wilson 2019-09-24 09:02:39 UTC
I am reasonably confident this should be resolved by

commit 5028851cdfdf78dc22eacbc44a0ab0b3f599ee4a (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 20 13:18:21 2019 +0100

    drm/i915: Mark contents as dirty on a write fault
    
    Since dropping the set-to-gtt-domain in commit a679f58d0510 ("drm/i915:
    Flush pages on acquisition"), we no longer mark the contents as dirty on
    a write fault. This has the issue of us then not marking the pages as
    dirty on releasing the buffer, which means the contents are not written
    out to the swap device (should we ever pick that buffer as a victim).
    Notably, this is visible in the dumb buffer interface used for cursors.
    Having updated the cursor contents via mmap, and swapped away, if the
    shrinker should evict the old cursor, upon next reuse, the cursor would
    be invisible.
    
    E.g. echo 80 > /proc/sys/kernel/sysrq ; echo f > /proc/sysrq-trigger
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111541
    Fixes: a679f58d0510 ("drm/i915: Flush pages on acquisition")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: <stable@vger.kernel.org> # v5.2+
    Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190920121821.7223-1-chris@chris-wilson.co.uk
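
The failure mode described in the commit message can be illustrated with a toy model. This is not i915 code: toy_obj, toy_write, toy_evict, and toy_restore are hypothetical names, and the single dirty flag only stands in for the driver's obj->mm.dirty. The assumption modelled here is that, on eviction, only contents marked dirty are written to swap, while clean contents are simply dropped.

```c
#include <stdint.h>
#include <string.h>

#define TOY_SIZE 16

/* Toy stand-in for a buffer object; not the real i915 structure. */
struct toy_obj {
    uint8_t pages[TOY_SIZE];   /* resident contents */
    uint8_t swap[TOY_SIZE];    /* backing store, initially zero */
    int dirty;                 /* stands in for obj->mm.dirty */
};

/* Write through a mapping. 'mark_dirty' models whether the write-fault
 * handler sets the dirty flag (1 = fixed behaviour, 0 = regression). */
void toy_write(struct toy_obj *obj, uint8_t value, int mark_dirty)
{
    memset(obj->pages, value, TOY_SIZE);
    if (mark_dirty)
        obj->dirty = 1;
}

/* The shrinker picks the object: contents reach swap only if dirty. */
void toy_evict(struct toy_obj *obj)
{
    if (obj->dirty)
        memcpy(obj->swap, obj->pages, TOY_SIZE);
    memset(obj->pages, 0, TOY_SIZE);
    obj->dirty = 0;
}

/* Next use of the object: pages are repopulated from swap. */
void toy_restore(struct toy_obj *obj)
{
    memcpy(obj->pages, obj->swap, TOY_SIZE);
}
```

In this model, a write with mark_dirty set to 0 followed by toy_evict() and toy_restore() leaves all-zero contents, which matches the empty cursor buffer reported in the earlier comments. With mark_dirty set to 1, the contents survive the same cycle.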
Comment 22 Jani Saarinen 2019-09-25 14:44:57 UTC
Let's resolve this. Reporter and commenters, please verify on your side and re-open if the issue is not fixed.
Comment 23 Jonas Ådahl 2019-09-25 15:09:37 UTC
I've tried 5.2.17 with the mentioned patch applied, and cannot reproduce anymore.
Comment 24 Mikko Rapeli 2019-09-25 15:41:47 UTC
I got stuck bisecting when the bug no longer reproduced on the laptop without the docking station. Now, with 10 days of uptime and back on the docking station, the bug reproduces again. I will try the patch.
Comment 25 Mikko Rapeli 2019-09-26 07:30:47 UTC
The patch doesn't seem to apply to the 5.2 stable tree, so I tried to backport it like this. I hope this is correct.

--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1908,7 +1908,11 @@ vm_fault_t i915_gem_fault(struct vm_fault *vmf)
                list_add(&obj->userfault_link, &dev_priv->mm.userfault_list);
        GEM_BUG_ON(!obj->userfault_count);
 
-       i915_vma_set_ggtt_write(vma);
+       if (write) {
+               GEM_BUG_ON(!i915_gem_object_has_pinned_pages(obj));
+               i915_vma_set_ggtt_write(vma);
+               obj->mm.dirty = true;
+       }
 
 err_fence:
        i915_vma_unpin_fence(vma);
Comment 26 Mikko Rapeli 2019-10-08 11:40:44 UTC
With the patch above applied on top of the v5.2.17 kernel, the issue does not seem to reproduce anymore.

The patch is not yet in the 5.2.20 or 5.3.5 stable trees, but it would be nice if you could submit it there once the fix lands in Linus's tree.
Comment 27 Lakshmi 2019-10-10 08:46:11 UTC
(In reply to Mikko Rapeli from comment #26)
> With the patch above applied on top of v5.2.17 kernel the issue does not
> seem to reproduce anymore.
> 
> The patch is not yet in 5.2.20 or 5.3.5 stable trees but would be nice if
> you could submit there once the fix lands in Linus's tree.

At the moment the fix is in drm-tip; I cannot say when it will land in Linus's tree. You will have to check regularly.

