Bug 104439 - intel_do_flush_locked failed: Invalid argument
Summary: intel_do_flush_locked failed: Invalid argument
Status: RESOLVED WORKSFORME
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i915
Version: unspecified
Hardware: Other Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Reported: 2018-01-01 17:33 UTC by Francesco Turco
Modified: 2018-05-17 18:09 UTC

Attachments
git bisect log (2.83 KB, text/x-log)
2018-01-01 17:33 UTC, Francesco Turco
Bisect stopping at a commit with lockup but not distortion (2.45 KB, text/plain)
2018-02-14 20:27 UTC, Urs Fleisch
Graphical corruption as a consequence of the bug (135.93 KB, image/png)
2018-02-21 12:19 UTC, magiblot

Description Francesco Turco 2018-01-01 17:33:01 UTC
Created attachment 136471
git bisect log

The Qutebrowser web browser crashes after a while with the following error message:

> intel_do_flush_locked failed: Invalid argument

If I restart Qutebrowser afterwards, I get graphical corruption.

This happens only with 4.14.x kernels; 4.13.x kernels are not affected.

I'm running Gentoo Linux (~amd64) on a desktop machine with an Intel GPU:

> # lspci | grep VGA
> 00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated Graphics Controller (rev 02)

Packages:
- www-client/qutebrowser-1.0.4
- x11-apps/mesa-progs-8.3.0
- x11-drivers/xf86-video-intel-2.99.917_p20171018
- x11-libs/libdrm-2.4.88

This is the result of my first attempt at bisecting a kernel:

> fe91f28138e730790db014812623cfaadd318fa6 is the first bad commit

Please also see: https://bugzilla.kernel.org/show_bug.cgi?id=198115
Comment 1 Francesco Turco 2018-01-04 13:34:22 UTC
It seems this problem happens only when using SNA. I can't reproduce this bug with UXA.
Comment 2 Chris Wilson 2018-01-04 13:45:46 UTC
Try (or confirm you have):

commit 1d033beb20d6d5885587a02a393b6598d766a382
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 31 10:36:07 2017 +0000

    drm/i915: Check incoming alignment for unfenced buffers (on i915gm)
    
    In case the object has changed tiling between calls to execbuf, we need
    to check if the existing offset inside the GTT matches the new tiling
    constraint. We even need to do this for "unfenced" tiled objects, where
    the 3D commands use an implied fence and so the object still needs to
    match the physical fence restrictions on alignment (only required for
    gen2 and early gen3).
    
    In commit 2889caa92321 ("drm/i915: Eliminate lots of iterations over
    the execobjects array"), the idea was to remove the second guessing and
    only set the NEEDS_MAP flag when required. However, the entire check
    for an unusable offset for fencing was removed and not just the
    secondary check. I.e.
    
            /* avoid costly ping-pong once a batch bo ended up non-mappable */
            if (entry->flags & __EXEC_OBJECT_NEEDS_MAP &&
                !i915_vma_is_map_and_fenceable(vma))
                    return !only_mappable_for_reloc(entry->flags);
    
    was entirely removed as the ping-pong between execbuf passes was fixed,
    but its primary purpose in forcing unaligned unfenced access to be
    rebound was forgotten.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103502
    Fixes: 2889caa92321 ("drm/i915: Eliminate lots of iterations over the execobjects array")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171031103607.17836-1-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Comment 3 Francesco Turco 2018-01-04 17:50:33 UTC
I can reproduce this bug by checking out tag v4.14 from linux-stable. It includes the commit you mentioned.
Comment 4 Urs Fleisch 2018-01-07 17:06:50 UTC
I probably have the same problem on an HP Compaq dc7800p Small Form Factor with an Intel Corporation 82Q35 Express Integrated Graphics Controller (Intel GMA 3100 graphics, Mesa DRI Intel(R) Q35, OpenGL 2.1, DRI driver: i915).

The problem appeared when Arch Linux switched to kernel 4.14 (it does not occur with the 4.9 LTS kernel). I have also reproduced it on Ubuntu using mainline kernels (http://kernel.ubuntu.com/~kernel-ppa/mainline). The problem does not appear with 4.13.16-041316, but it does appear with 4.14.0-041400rc1, 4.14.0-041400-generic, 4.14.12-041412-generic, and the latest kernel available there, 4.15.0-041500rc6-generic.

The easiest way to reproduce the problem is to start Chromium and go to maps.google.com. On Arch Linux (both with KDE and LxQt) the address bar disappears and the map tiles are totally distorted. On Ubuntu 17.10 (Gnome), the address bar disappears as soon as I start Chromium, but the contents of Google Maps are not distorted. On both distributions, I can see

(EE) intel(0): Failed to submit rendering commands (Invalid argument), disabling acceleration.

in /var/log/Xorg.0.log (Arch) or in the journalctl output (Ubuntu). The journalctl output also contains the following log entry:

intel_do_flush_locked failed: Invalid argument
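
For anyone comparing logs across machines, the two messages above can be picked out of a log dump mechanically. This is only a minimal sketch: the regular expressions are derived from the two lines quoted in this report, and the names `SIGNATURES` and `find_failures` are made up for illustration, not part of any driver or tool.

```python
import re

# Failure signatures copied from the two log lines quoted in this report.
# The keys "ddx" and "mesa" are illustrative labels only.
SIGNATURES = {
    "ddx": re.compile(r"Failed to submit rendering commands \((?P<reason>[^)]+)\)"),
    "mesa": re.compile(r"intel_do_flush_locked failed: (?P<reason>.+)$"),
}


def find_failures(lines):
    """Return (source, reason) pairs for every line matching a known signature."""
    hits = []
    for line in lines:
        for source, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                hits.append((source, match.group("reason").strip()))
    return hits


sample = [
    "(EE) intel(0): Failed to submit rendering commands (Invalid argument), disabling acceleration.",
    "intel_do_flush_locked failed: Invalid argument",
    "some unrelated log line",
]
print(find_failures(sample))
```

Feeding it the lines of /var/log/Xorg.0.log or saved journalctl output should list each failure together with the reported reason.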

I have looked into the source of the current Arch Linux kernel (4.14.12-1-ARCH), and the commit 1d033b mentioned above is included.

chrome://gpu displays the following messages:

GpuProcessHostUIShim: The GPU process exited with code 256.
[970:970:0107/073806.982205:ERROR:gles2_cmd_decoder.cc(17977)] [.DisplayCompositor-0x10da87b96000]GL ERROR :GL_INVALID_OPERATION : glCreateAndConsumeTextureCHROMIUM: invalid mailbox name
[970:970:0107/073806.982610:ERROR:gles2_cmd_decoder.cc(9881)] [.DisplayCompositor-0x10da87b96000]RENDER WARNING: texture bound to texture unit 0 is not renderable. It maybe non-power-of-2 and have incompatible texture filtering.
[970:970:0107/073807.029877:ERROR:gles2_cmd_decoder.cc(9881)] [.DisplayCompositor-0x10da87b96000]RENDER WARNING: texture bound to texture unit 0 is not renderable. It maybe non-power-of-2 and have incompatible texture filtering.
...a lot more of these

Please let me know if I can provide more information.
Comment 5 Francesco Turco 2018-02-14 13:52:04 UTC
I just finished trying kernel 4.15.3 with the QupZilla web browser (instead of Qutebrowser). Unfortunately it still crashes.
Comment 6 Urs Fleisch 2018-02-14 20:26:11 UTC
Since Arch Linux switched to kernel 4.14 for linux-lts (and the standard linux package is at 4.15), I now have the choice between two kernels that do not work with my Intel Q35 graphics, which gives me quite some motivation to help fix this issue. Note that I do not know of a real workaround: UXA is not an option because it crashes regularly, the modesetting driver does not work with my hardware (it needs at least OpenGL 2.1), and switching off the GPU in Chromium mitigates the problem but does not eliminate it.

I ran a bisect using the linux-git kernel from AUR, starting where Francesco Turco's bisect stopped (its first bad commit contained only hwmon changes, but I really expect some i915 code in the first bad commit). I will attach my bisect log (named git_bisect_log_2.txt). My first bad commit is

[170fa29b14fadf2deb361589cefe6a78b21b1b22] drm/i915: Simplify eb_lookup_vmas()

Some remarks about the bisect: to decide whether a commit was good or bad, I booted the kernel and started Chromium with maps.google.com. If the graphics were distorted, I marked the commit bad. Two commits were not so easy to judge. The first, "[79364227e6b4923478e99d8480d62482b588ef84] IB/core: Add might_sleep() annotation to ib_init_ah_from_wc()", did not distort the graphics, but I could no longer move the mouse pointer (the computer was still controllable via keyboard), so I marked that commit as good. The second, near the end of the bisect, "[170fa29b14fadf2deb361589cefe6a78b21b1b22] drm/i915: Simplify eb_lookup_vmas()", did not show the distortion either, but after browsing maps.google.com the computer locked up completely and I had to push the power button to restart it, so I marked this commit as bad. This is the commit now reported as the first bad one. The distortion was probably introduced later, but the commit before it, "[c7c6e46f913bb3a6ff19e64940ebb54652033677] drm/i915: Convert execbuf to use struct-of-array packing for critical fields", was good. I will probably look at some of the later commits to gain more information.
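
To illustrate why the classification of the "lockup but no distortion" commit matters, here is a minimal sketch of the binary search a bisect performs. The commit names `A` to `F` and their symptoms are hypothetical, as are the helpers `first_bad`, `distortion_only`, and `any_failure`; this is not the real history and not how git itself is implemented.

```python
def first_bad(commits, is_bad):
    """Binary search for the first commit for which is_bad() is true.

    Assumes commits[0] is good, commits[-1] is bad, and every commit after
    the first bad one is also bad (the same assumption a bisect relies on).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical linear history: (commit name, symptom observed when testing it).
history = [
    ("A", "ok"), ("B", "ok"), ("C", "ok"),
    ("D", "lockup"), ("E", "distortion"), ("F", "distortion"),
]


def distortion_only(commit):
    """Strict rule: only visible distortion counts as bad."""
    return commit[1] == "distortion"


def any_failure(commit):
    """Broad rule: any misbehaviour (lockup or distortion) counts as bad."""
    return commit[1] != "ok"


print(first_bad(history, distortion_only))  # lands on the first distorted commit
print(first_bad(history, any_failure))      # lands one commit earlier, on the lockup
```

With the strict rule the search ends at the first distorted commit, while the broad rule ends one commit earlier, so the reported "first bad commit" depends on which symptom is treated as the bug.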
Comment 7 Urs Fleisch 2018-02-14 20:27:34 UTC
Created attachment 137362
Bisect stopping at a commit with lockup but not distortion
Comment 8 Urs Fleisch 2018-02-17 11:04:24 UTC
I have checked whether git bisect found the real bad commit and can confirm that

    c7c6e46f913b drm/i915: Convert execbuf to use struct-of-array packing for critical fields

is the last good commit. Over the next two commits,

commit 170fa29b14fadf2deb361589cefe6a78b21b1b22
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 16 09:52:07 2017 +0100

    drm/i915: Simplify eb_lookup_vmas()

    Since the introduction of being able to perform a lockless lookup of an
    object (i915_gem_object_get_rcu() in fbbd37b36fa5 ("drm/i915: Move object
    release to a freelist + worker") we no longer need to split the
    object/vma lookup into 3 phases and so combine them into a much simpler
    single loop.

and

commit d1b48c1e7184d9bc4ae6d7f9fe2eed9efed11ffc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 16 09:52:08 2017 +0100

    drm/i915: Replace execbuf vma ht with an idr

    This was the competing idea long ago, but it was only with the rewrite
    of the idr as an radixtree and using the radixtree directly ourselves,
    along with the realisation that we can store the vma directly in the
    radixtree and only need a list for the reverse mapping, that made the
    patch performant enough to displace using a hashtable. Though the vma ht
    is fast and doesn't require any extra allocation (as we can embed the node
    inside the vma), it does require a thread for resizing and serialization
    and will have the occasional slow lookup. That is hairy enough to
    investigate alternatives and favour them if equivalent in peak performance.
    One advantage of allocating an indirection entry is that we can support a
    single shared bo between many clients, something that was done on a
    first-come first-serve basis for shared GGTT vma previously. To offset
    the extra allocations, we create yet another kmem_cache for them.

the bug was introduced. Graphics are distorted, and the Linux kernel is not really usable with my Intel Q35 graphics.
The bug is easy to reproduce: start Chromium and browse maps.google.com.

Chromium reports

    intel_do_flush_locked failed: Invalid argument

Tracing the ioctl calls with strace, I can see that the invalid argument comes from

    ioctl(129, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, 0x7ffd8b5b6540) = -1 EINVAL (Invalid argument)

Some printk debugging shows that i915_gem_execbuffer2() first returns -512 (-ERESTARTSYS) and then returns -22 (-EINVAL) four times.
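
For reference, the two return values can be decoded as in the minimal sketch below. ERESTARTSYS is a kernel-internal code (defined in include/linux/errno.h) that is normally turned into a syscall restart or EINTR before user space sees it, so it is absent from Python's `errno` module and is listed by hand here. The `decode` helper and the `KERNEL_INTERNAL` table are illustrative names, not an existing API.

```python
import errno

# Kernel-internal restart codes from include/linux/errno.h. They are not
# meant to reach user space, so the errno module does not know them.
KERNEL_INTERNAL = {
    512: "ERESTARTSYS",
    513: "ERESTARTNOINTR",
    514: "ERESTARTNOHAND",
    516: "ERESTART_RESTARTBLOCK",
}


def decode(ret):
    """Map a negative kernel return value such as -22 to its symbolic name."""
    code = -ret
    if code in KERNEL_INTERNAL:
        return KERNEL_INTERNAL[code]
    return errno.errorcode.get(code, "unknown")


for value in (-512, -22):
    print(value, decode(value))
```

Under that reading, the initial -ERESTARTSYS is probably just an interrupted call being retried, and the repeated -EINVAL is the failure that strace reports.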

@Chris Wilson: Could you please have a look at the problem? Do you have old Intel Q35 graphics hardware available to reproduce the issue? I would fix the problem myself if I knew enough about this driver. The best I could probably do would be to introduce a module parameter that restores the old behavior, using the hashtable instead of the radix tree, but there must be a better solution that uses the new radix tree and keeps it working on old Intel graphics hardware. I will provide as much information as needed and will test patches once you have some available.
Comment 9 magiblot 2018-02-21 12:19:16 UTC
Created attachment 137499
Graphical corruption as a consequence of the bug

I also have this problem. My computer has an Intel Atom N455 with integrated Intel GMA 3150 graphics, OpenGL 2.1 enabled. I'm using KDE Plasma 5, and plasmashell sometimes crashes when I move the mouse over the task bar or the applications menu. Strangely enough, I have only experienced this issue while using a web browser at the same time. There is then some graphical corruption that affects only the content of the web display (rendered by QtWebEngine in my case, since I have tried QupZilla and Konqueror). Using UXA instead of SNA does not solve the problem for me.

I'm on Arch Linux x86-64 with linux 4.14.20-1-lts installed.

Cheers.
Comment 10 Francesco Turco 2018-04-20 12:13:36 UTC
I can't reproduce this bug anymore with the Falkon 3.0.0 web browser and the Linux 4.16.2 kernel.
Comment 11 magiblot 2018-04-20 12:41:49 UTC
It has been working fine for me for quite some time with the 4.15 kernel as well.
Comment 12 Jani Nikula 2018-05-17 18:09:53 UTC
Thanks for the updates, closing.