Created attachment 136471 [details]
git bisect log

The Qutebrowser web browser crashes after a while with the following error message:

> intel_do_flush_locked failed: Invalid argument

If I restart Qutebrowser afterwards I obtain some sort of graphical corruption.

This only happens with 4.14.x kernels. No problem with 4.13.x kernels.

I'm running Gentoo Linux (~amd64) on a desktop machine with an Intel GPU:

> # lspci | grep VGA
> 00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated
> Graphics Controller (rev 02)

Packages:
- www-client/qutebrowser-1.0.4
- x11-apps/mesa-progs-8.3.0
- x11-drivers/xf86-video-intel-2.99.917_p20171018
- x11-libs/libdrm-2.4.88

This is the result of my first attempt at bisecting a kernel:

> fe91f28138e730790db014812623cfaadd318fa6 is the first bad commit

Please also see:
https://bugzilla.kernel.org/show_bug.cgi?id=198115
It seems this problem happens only when using SNA. I can't reproduce this bug with UXA.
Try (or confirm you have):

commit 1d033beb20d6d5885587a02a393b6598d766a382
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 31 10:36:07 2017 +0000

    drm/i915: Check incoming alignment for unfenced buffers (on i915gm)

    In case the object has changed tiling between calls to execbuf, we need to check if the existing offset inside the GTT matches the new tiling constraint. We even need to do this for "unfenced" tiled objects, where the 3D commands use an implied fence and so the object still needs to match the physical fence restrictions on alignment (only required for gen2 and early gen3).

    In commit 2889caa92321 ("drm/i915: Eliminate lots of iterations over the execobjects array"), the idea was to remove the second guessing and only set the NEEDS_MAP flag when required. However, the entire check for an unusable offset for fencing was removed and not just the secondary check. I.e.

        /* avoid costly ping-pong once a batch bo ended up non-mappable */
        if (entry->flags & __EXEC_OBJECT_NEEDS_MAP &&
            !i915_vma_is_map_and_fenceable(vma))
                return !only_mappable_for_reloc(entry->flags);

    was entirely removed as the ping-pong between execbuf passes was fixed, but its primary purpose in forcing unaligned unfenced access to be rebound was forgotten.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103502
    Fixes: 2889caa92321 ("drm/i915: Eliminate lots of iterations over the execobjects array")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171031103607.17836-1-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
I can reproduce this bug by checking out tag v4.14 from linux-stable. It includes the commit you mentioned.
I probably have the same problem on an HP Compaq dc7800p Small Form Factor with an Intel Corporation 82Q35 Express Integrated Graphics Controller (Intel GMA 3100 graphics, Mesa DRI Intel(R) Q35, OpenGL 2.1, DRI driver: i915).

The problem appeared when Arch Linux switched to kernel 4.14 (it does not occur with the 4.9 LTS kernel). I have also reproduced it on Ubuntu using mainline kernels (http://kernel.ubuntu.com/~kernel-ppa/mainline). The problem does not appear with 4.13.16-041316, but it is present in 4.14.0-041400rc1, 4.14.0-041400-generic, 4.14.12-041412-generic, and also in the latest kernel available there, 4.15.0-041500rc6-generic.

The easiest way to reproduce the problem is to start Chromium and go to maps.google.com. On Arch Linux (both with KDE and LXQt) the address bar disappears and the map tiles are totally distorted. On Ubuntu 17.10 (GNOME), the address bar disappears as soon as I start Chromium, but the contents of Google Maps are not distorted.

On both distributions, I can see

(EE) intel(0): Failed to submit rendering commands (Invalid argument), disabling acceleration.

in /var/log/Xorg.0.log (Arch) or in journalctl (Ubuntu). In journalctl, there is also the following log entry:

intel_do_flush_locked failed: Invalid argument

I have looked into the kernel source of the current Arch Linux kernel (4.14.12-1-ARCH); the commit 1d033b mentioned above is included.

chrome://gpu displays some messages:

GpuProcessHostUIShim: The GPU process exited with code 256.
[970:970:0107/073806.982205:ERROR:gles2_cmd_decoder.cc(17977)] [.DisplayCompositor-0x10da87b96000]GL ERROR :GL_INVALID_OPERATION : glCreateAndConsumeTextureCHROMIUM: invalid mailbox name
[970:970:0107/073806.982610:ERROR:gles2_cmd_decoder.cc(9881)] [.DisplayCompositor-0x10da87b96000]RENDER WARNING: texture bound to texture unit 0 is not renderable. It maybe non-power-of-2 and have incompatible texture filtering.
[970:970:0107/073807.029877:ERROR:gles2_cmd_decoder.cc(9881)] [.DisplayCompositor-0x10da87b96000]RENDER WARNING: texture bound to texture unit 0 is not renderable. It maybe non-power-of-2 and have incompatible texture filtering.
...a lot more of these

Please let me know if I can provide more information.
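For anyone checking whether they are hitting the same failure, the two messages quoted above can be searched for in a saved log. The following is only a minimal sketch: the function name `find_signatures` is hypothetical, and it assumes the log has already been read into a list of lines (for example from /var/log/Xorg.0.log or from saved journalctl output).

```python
# Message fragments quoted in the reports above.
SIGNATURES = (
    "intel_do_flush_locked failed: Invalid argument",
    "Failed to submit rendering commands (Invalid argument)",
)


def find_signatures(lines):
    """Return (line_number, signature) pairs for each line containing a known message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SIGNATURES:
            if signature in line:
                hits.append((number, signature))
    return hits
```

An empty result only means that these exact strings were not logged; it does not rule out the bug.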
I just finished trying kernel 4.15.3 with the QupZilla web browser (instead of Qutebrowser). Unfortunately it still crashes.
Since Arch Linux switched to kernel 4.14 for linux-lts (and the standard linux package is at 4.15), I now have the choice between two kernels that do not work with my Intel graphics Q35, which gives me quite some motivation to help fix this issue. Note that I do not know of a real workaround: UXA is not an option because it crashes regularly, the modesetting driver does not work with my hardware (it needs at least OpenGL 2.1), and switching off the GPU in Chromium mitigates the problem but does not eliminate it.

I ran a bisect using the linux-git kernel from AUR and started where the bisect from Francesco Turco stopped (his first bad commit contained only hwmon stuff, but I really expect some i915 code in the first bad commit). I will attach my bisect log (named git_bisect_log_2.txt). My first bad commit is

[170fa29b14fadf2deb361589cefe6a78b21b1b22] drm/i915: Simplify eb_lookup_vmas()

Some remarks about the bisect: to decide whether a commit was good or bad, I booted the kernel and started Chromium with maps.google.com. If the graphics were distorted, I marked the commit bad. Two commits were not so easy to judge:

"[79364227e6b4923478e99d8480d62482b588ef84] IB/core: Add might_sleep() annotation to ib_init_ah_from_wc()" did not distort the graphics, but I could no longer move the mouse pointer (the computer was still controllable via keyboard), so I marked that commit as good.

The second problem commit, near the end of the bisect procedure, was "[170fa29b14fadf2deb361589cefe6a78b21b1b22] drm/i915: Simplify eb_lookup_vmas()". It did not show the distortion, but after browsing maps.google.com the computer was totally locked up and I had to push the power button to restart it, so I marked this commit as bad. This is the commit that is now marked as the first bad one. The distortion was probably introduced later, but the commit before it, "[c7c6e46f913bb3a6ff19e64940ebb54652033677] drm/i915: Convert execbuf to use struct-of-array packing for critical fields", was good.

I will probably have a look at some of the later commits to gain more information.
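The effect of the good/bad criterion on the bisect result can be illustrated with a small binary search. This is only a sketch of the idea behind git bisect on a linear history, not of git itself. The function `first_bad` is hypothetical, and the `symptoms` table is an illustrative assumption built from the three commits named in this report; in particular, placing the first distortion at d1b48c1e7184 is an assumption.

```python
def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    commits is ordered oldest to newest; commits[0] must be good and
    commits[-1] must be bad, and every commit after the first bad one
    is assumed to be bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]


# Illustrative history (assumed symptoms, see the paragraph above).
symptoms = {
    "v4.13": "ok",
    "c7c6e46f913b": "ok",
    "170fa29b14fa": "lockup",
    "d1b48c1e7184": "distortion",
    "v4.14": "distortion",
}
history = list(symptoms)

# Counting any misbehaviour as bad stops at the lockup commit.
any_symptom = first_bad(history, lambda c: symptoms[c] != "ok")

# Counting only distortion as bad stops one commit later.
distortion_only = first_bad(history, lambda c: symptoms[c] == "distortion")
```

Under these assumptions the first search ends at 170fa29b14fa and the second at d1b48c1e7184, which is why the choice of criterion matters when two different symptoms appear in adjacent commits.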
Created attachment 137362 [details]
Bisect stopping at a commit with lockup but not distortion
I have checked whether the git bisect found the real bad commit and can confirm that

c7c6e46f913b drm/i915: Convert execbuf to use struct-of-array packing for critical fields

is the last good commit. The bug was introduced in the two commits that follow:

commit 170fa29b14fadf2deb361589cefe6a78b21b1b22
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 16 09:52:07 2017 +0100

    drm/i915: Simplify eb_lookup_vmas()

    Since the introduction of being able to perform a lockless lookup of an object (i915_gem_object_get_rcu() in fbbd37b36fa5 ("drm/i915: Move object release to a freelist + worker") we no longer need to split the object/vma lookup into 3 phases and so combine them into a much simpler single loop.

and

commit d1b48c1e7184d9bc4ae6d7f9fe2eed9efed11ffc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 16 09:52:08 2017 +0100

    drm/i915: Replace execbuf vma ht with an idr

    This was the competing idea long ago, but it was only with the rewrite of the idr as an radixtree and using the radixtree directly ourselves, along with the realisation that we can store the vma directly in the radixtree and only need a list for the reverse mapping, that made the patch performant enough to displace using a hashtable. Though the vma ht is fast and doesn't require any extra allocation (as we can embed the node inside the vma), it does require a thread for resizing and serialization and will have the occasional slow lookup. That is hairy enough to investigate alternatives and favour them if equivalent in peak performance. One advantage of allocating an indirection entry is that we can support a single shared bo between many clients, something that was done on a first-come first-serve basis for shared GGTT vma previously. To offset the extra allocations, we create yet another kmem_cache for them.

With these commits, graphics are distorted and the Linux kernel is not really usable with my Intel graphics Q35. The bug is easy to reproduce: start Chromium and browse maps.google.com.

Chromium reports

intel_do_flush_locked failed: Invalid argument

Using strace on ioctl, I can see that the invalid argument is caused by

ioctl(129, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, 0x7ffd8b5b6540) = -1 EINVAL (Invalid argument)

Using some printk debugging, it can be seen that i915_gem_execbuffer2() returns first -512=-ERESTARTSYS and then four times -22=-EINVAL.

@Chris Wilson: Could you please have a look at the problem? Do you have old Intel graphics Q35 hardware available to reproduce the issue?

I would fix the problem myself if I had enough knowledge about this driver. The best I could probably do would be to introduce a module parameter to restore the old behavior using the hashtable instead of the radix tree, but there must be a better solution that uses the new radix tree and keeps it working on old Intel graphics hardware.

I will provide you with as much information as needed and will test patches once you have some available.
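As a reading aid for the printk values above, the sketch below maps a negative kernel return value to its symbolic name. The helper name `decode` is hypothetical. ERESTARTSYS (512) is a kernel-internal code defined in include/linux/errno.h and is not part of Python's errno module, so it is listed by hand. It is never returned to user space as such: the kernel either restarts the system call or reports EINTR, which fits strace showing only the EINVAL failure.

```python
import errno

# Kernel-internal code from include/linux/errno.h; user space never sees it.
KERNEL_INTERNAL = {512: "ERESTARTSYS"}


def decode(ret):
    """Translate a kernel-style return value (0 or -errno) into a symbolic name."""
    if ret >= 0:
        return "success"
    code = -ret
    if code in KERNEL_INTERNAL:
        return KERNEL_INTERNAL[code]
    return errno.errorcode.get(code, "unknown ({})".format(code))


# The sequence reported above: one restart, then four invalid-argument errors.
observed = [-512, -22, -22, -22, -22]
names = [decode(value) for value in observed]
```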
Created attachment 137499 [details]
Graphical corruption as a consequence of the bug

I also have this problem. My computer is an Intel Atom N455 with integrated Intel GMA 3150 graphics, OpenGL 2.1 enabled. I'm using KDE Plasma 5, and plasmashell sometimes crashes when I move the mouse over the taskbar or the applications menu. Strangely enough, I have only experienced this issue when using a web browser at the same time. There is then some graphical corruption that affects only the content of the web display (rendered by QtWebEngine in my case, since I have tried QupZilla and Konqueror). Using UXA instead of SNA does not solve the problem for me.

I'm on Arch Linux x86-64 with linux 4.14.20-1-lts installed.

Cheers.
I can't reproduce this bug anymore with the Falkon 3.0.0 web browser and the Linux 4.16.2 kernel.
It has been working fine for me for quite some time with the 4.15 kernel as well.
Thanks for the updates, closing.