Bug 103025 - [GM45] corrupt renders, GPU hangs, and eventual loss of driver acceleration occurs when performing certain 2D operations
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords: bisected, regression
Depends on:
Blocks: 105980
 
Reported: 2017-09-28 14:46 UTC by Adric Blake
Modified: 2019-07-20 18:32 UTC

See Also:
i915 platform: GM45
i915 features: GEM/PPGTT


Attachments
Xorg logs with the error (12.56 KB, application/x-compressed-tar)
2017-09-28 14:46 UTC, Adric Blake
Xorg log tail with xf86-video-intel asserts+debug enabled (13.63 KB, text/plain)
2017-10-06 14:04 UTC, Adric Blake
Xorg log tail with xf86-video-intel debug enabled; longer history (52.97 KB, application/x-bzip)
2017-10-06 14:13 UTC, Adric Blake
sna: Skip the exact match if we can't change tiling (1.80 KB, patch)
2017-10-06 14:15 UTC, Chris Wilson
Xorg log tail from second crash (75.66 KB, text/plain)
2017-10-07 10:45 UTC, Adric Blake
Xorg log tail with failure of new assertion (41.16 KB, application/x-bzip)
2017-10-07 13:30 UTC, Adric Blake
Flush current vblank queue (3.36 KB, patch)
2017-10-07 17:13 UTC, Chris Wilson
Xorg log tail with capture of the latest vblank error (39.16 KB, application/x-bzip)
2017-10-07 19:18 UTC, Adric Blake
Flush current vblank queue (4.51 KB, patch)
2017-10-07 19:44 UTC, Chris Wilson
Queue vblank keepalive (8.04 KB, patch)
2017-10-08 11:15 UTC, Chris Wilson
Flush current vblank queue (8.04 KB, patch)
2017-10-08 11:32 UTC, Chris Wilson
Queue vblank keepalive (8.12 KB, patch)
2017-10-08 12:31 UTC, Chris Wilson
Xorg log tail for crash on Dec 10 (85.29 KB, application/gzip)
2017-12-11 05:04 UTC, Adric Blake
systemd journal log for time around crash on Dec 10 (11.17 MB, application/gzip)
2017-12-11 05:10 UTC, Adric Blake
Xorg log tail for recent crashes (259.76 KB, application/zip)
2017-12-13 01:09 UTC, Adric Blake
Xorg log tail for crash on similar assert (93.63 KB, application/gzip)
2017-12-13 05:25 UTC, Adric Blake
Xorg log tail for most recent asserts (270.00 KB, application/zip)
2017-12-13 15:38 UTC, Adric Blake
Xorg log snip with accel loss and dump (154.77 KB, text/plain)
2017-12-13 17:42 UTC, Adric Blake
Xorg log tail for crash after accel loss (52.18 KB, application/gzip)
2017-12-13 17:45 UTC, Adric Blake
Xorg log snip with accel loss and dump, again (233.20 KB, text/plain)
2017-12-14 16:07 UTC, Adric Blake
Xorg log with assertion failure Dec 16 (6.47 KB, text/plain)
2017-12-16 23:04 UTC, Adric Blake
GPU hang error state (26.21 KB, text/plain)
2017-12-17 00:54 UTC, Adric Blake
prebuilt intel driver installation files (x86-64) (3.40 MB, application/zip)
2018-03-04 19:10 UTC, Adric Blake
gen2 bug replication: hang dumps, assert-enabled Xorg logs (72.02 KB, application/zip)
2018-07-14 14:57 UTC, Adric Blake
Captured drm.debug=0x1f output and Xorg logs after bug detected in Xorg driver (1.10 MB, application/zip)
2018-07-30 16:35 UTC, Adric Blake
Linux 5.3-dev [drm-tip] - collected card error states (93.41 KB, application/zip)
2019-07-20 15:04 UTC, Adric Blake
Linux 5.3-dev [drm-tip] - captured Xorg log (no debug) (5.62 KB, application/gzip)
2019-07-20 15:06 UTC, Adric Blake
Linux 5.3-dev [drm-tip] - all kernel messages (drm.debug=0x1e) (1.90 MB, application/gzip)
2019-07-20 15:08 UTC, Adric Blake

Description Adric Blake 2017-09-28 14:46:58 UTC
Created attachment 134543 [details]
Xorg logs with the error

After each boot, the intel driver reports an error after working fine for several hours:
[ 44974.354] (EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.
with no other errors logged.

The effects are immediately noticeable in any GPU-powered application: glxgears, for example, stops spinning properly (it appears to shake between the same two frames and only spins while being resized), and video is blank in the vlc and mpv players.

I've had this happen for at least the past three boots, with cause seemingly unknown (I've tested moderate memory pressure, suspend + lid open/close, concurrent GPU-using programs, and hibernation). The issue appears to have been present at least intermittently for weeks, way back when I was testing drm-tip with the 4.13.1 kernel.

System Information:
Distribution: Arch Linux x86_64
DRM-tip commit 13d4fadfbe07 (uses kernel 4.14rc2)
xorg-server 1.19.3-3
xf86-video-intel 1:2.99.917+781+gc8990575-1
libdrm 2.4.83-1
mesa 17.2.1-3

Hardware:
Dell Inspiron 1545 (laptop)

Additionally, the kernel logs three different drm-related warnings and drm also prints an *ERROR*:
[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
particularly when the system is resumed.

Nothing happens during the time that the Xorg error occurred. It is difficult to capture the bug with drm debugging due to its randomness and the huge amount of disk space the logs would consume, though I will try and gather some data on it.
Comment 1 Chris Wilson 2017-09-28 14:55:20 UTC
(In reply to Adric Blake from comment #0)
> I've had this happen for at least the past three boots, with cause seemingly
> unknown (I've tested moderate memory pressure, suspend + lid open/close,
> concurrent GPU-using programs, and hibernation). The issue appears to have
> been present at least intermittently for weeks, way back when I was testing
> drm-tip with the 4.13.1 kernel.

If you can identify that v4.12 or v4.13.0 is ok (a day or so of uptime), that would be a big help (in tying it to a kernel update). Similarly checking current (v4.15) drm-tip would also be valuable information.

ENOENT means the handle lookup failed, either in the initial object lookup or the relocation lookup.
Comment 2 Stefan Jensen 2017-09-30 18:12:46 UTC
Just a quick "me too", also on Arch. I have to reboot when this happens. Can't really put my finger on anything yet.

Also: https://bugs.archlinux.org/task/55732
Comment 3 Adric Blake 2017-10-06 05:23:23 UTC
uptime: over 5 days
xorg-server 1.19.3-3         
xf86-video-intel 1:2.99.917+781+gc8990575-1
mesa 17.2.1-3
libdrm 2.4.83-1
linux 4.13.4-1 (Arch Linux stock kernel -- vanilla, with no new drm-tip patches)

No sign of the issue on the 4.13.4 kernel.

There is no released 4.14 or 4.15 kernel right now, so I'm not quite sure what you meant by your last suggestion. I was using the latest drm-tip kernel at the time.

I will check to see if the issue emerges in the 4.14-rc3 kernel *without* the latest drm-tip changes to try and identify the responsible party further.
Comment 4 Chris Wilson 2017-10-06 10:48:23 UTC
(In reply to Adric Blake from comment #3)
> uptime: over 5 days
> xorg-server 1.19.3-3         
> xf86-video-intel 1:2.99.917+781+gc8990575-1
> mesa 17.2.1-3
> libdrm 2.4.83-1
> linux 4.13.4-1 (Arch Linux stock kernel -- vanilla, with no new drm-tip
> patches)
> 
> No sign of the issue on the 4.13.4 kernel.
> 
> There is no released 4.14 or 4.15 kernel right now, so I'm not quite sure
> what you meant by your last suggestion. I was using the latest drm-tip
> kernel at the time.

That was me misreading (the "way back when I was testing drm-tip" stuck in my head).

> I will check to see if the issue emerges in the 4.14-rc3 kernel *without*
> the latest drm-tip changes to try and identify the responsible party further.

Ta. That's the next one, it narrows it down to the previous cycle or this one. Though bisection is viable at this point. I can guess it will be

commit d1b48c1e7184d9bc4ae6d7f9fe2eed9efed11ffc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 16 09:52:08 2017 +0100

    drm/i915: Replace execbuf vma ht with an idr

though quite unexpected. If you have the patience, compiling xf86-video-intel with --enable-debug=full should result in the erroneous submit being dumped.
Comment 5 Adric Blake 2017-10-06 14:00:55 UTC
uptime: about 45 minutes at the time
linux-git 4.14rc3.r323.g7a92616c0bac-1
xorg-server 1.19.4-1
mesa 17.2.2-1
libdrm 2.4.83-1
xf86-video-intel 1:2.99.917+781+gc8990575-1 (with full debug enabled for compile)

Xorg caught an assert shortly after boot:

[  2548.437] (EE) kgem_create_2d:5720 assertion 'bo->pitch >= pitch' failed

It's a bit hard to capture the error if it crashes beforehand!
Comment 6 Adric Blake 2017-10-06 14:04:23 UTC
Created attachment 134705 [details]
Xorg log tail with xf86-video-intel asserts+debug enabled
Comment 7 Adric Blake 2017-10-06 14:13:51 UTC
Created attachment 134706 [details]
Xorg log tail with xf86-video-intel debug enabled; longer history
Comment 8 Chris Wilson 2017-10-06 14:15:29 UTC
Created attachment 134707 [details] [review]
sna: Skip the exact match if we can't change tiling
Comment 9 Adric Blake 2017-10-07 02:41:07 UTC
linux-git 4.14rc3.r394.gbf2db0b9f580-1  (not linux-drm-tip-git)
<rest of config is the same>

Hit the assert again with the patch applied:
[ 18341.569] (EE) kgem_create_2d:5734 assertion 'bo->pitch >= pitch' failed
Comment 10 Chris Wilson 2017-10-07 09:28:35 UTC
[  2548.122] kgem_create_2d(1366x22, bpp=32, tiling=1, exact=1, inactive=0, cpu-mapping=0, gtt-mapping=0, scanout?=0, prime?=0, temp?=0)
[  2548.122] kgem_surface_size: tile_width=512, tile_height=8 => aligned pitch=5632, height=24
[  2548.125] kgem_set_tiling: handle=42, tiling=1 [1], pitch=4096 [5632]: 1

Wait a sec. That's not the error path, that's the success path. The kernel returned a pitch less than we requested. Oh no.
Comment 11 Chris Wilson 2017-10-07 10:35:45 UTC
For the moment, can you apply

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 122dda71..29e09e33 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -488,7 +488,7 @@ restart:
                     bo->tiling, tiling,
                     bo->pitch, stride,
                     set_tiling.tiling_mode == tiling));
-               return set_tiling.tiling_mode == tiling;
+               return set_tiling.tiling_mode == tiling && bo->pitch >= stride;
        }
 
        err = errno;

and move on to the next error. I need to work out exactly what's going on here, and we still haven't even hit the original bug.
Comment 12 Adric Blake 2017-10-07 10:45:04 UTC
Created attachment 134724 [details]
Xorg log tail from second crash

If I understand you correctly, then the second crash is caused by the same issue:

[ 18336.593] kgem_create_2d(1233x24, bpp=32, tiling=1, exact=1, inactive=0, cpu-mapping=0, gtt-mapping=0, scanout?=0, prime?=0, temp?=0)
[ 18336.593] kgem_surface_size: tile_width=512, tile_height=8 => aligned pitch=5120, height=24
[ 18336.594] kgem_set_tiling: handle=43, tiling=1 [1], pitch=1024 [5120]: 1
Comment 13 Adric Blake 2017-10-07 13:30:47 UTC
Created attachment 134725 [details]
Xorg log tail with failure of new assertion

Hit a new, different assert with both patches applied:
[  6175.184] (EE) sna_crtc_set_vblank:660 assertion '(*sna_crtc_flags(crtc) & CRTC_VBLANK) < 3' failed
Comment 14 Chris Wilson 2017-10-07 15:35:08 UTC
(In reply to Adric Blake from comment #13)
> Created attachment 134725 [details]
> Xorg log tail with failure of new assertion
> 
> Hit a new, different assert with both patches applied:
> [  6175.184] (EE) sna_crtc_set_vblank:660 assertion '(*sna_crtc_flags(crtc)
> & CRTC_VBLANK) < 3' failed

Oh, now that's odd. The design is that we only ever have 2 real vblanks in flight: this one and the next. Anything beyond that is punted to a timer until we are within a vblank. Thanks for the log, I'm sure it's going to be just as entertaining as the first :(
Comment 15 Chris Wilson 2017-10-07 17:13:29 UTC
Created attachment 134731 [details] [review]
Flush current vblank queue

Maybe just a timing issue with the extra debug logging...
Comment 16 Adric Blake 2017-10-07 19:18:51 UTC
Created attachment 134732 [details]
Xorg log tail with capture of the latest vblank error

The most recent assertion failure recurred, much sooner (and, it seems, more frequently) than before:
[   790.049] (EE) sna_crtc_set_vblank:660 assertion '(*sna_crtc_flags(crtc) & CRTC_VBLANK) < 3' failed
This comes several lines after a rather disconcerting message:
[   789.326] sna_present_vblank_handler: arrived unexpectedly early (not queued)

The glxgears program I had running also shows yet another bug. Shortly before the assertion failed, the window suddenly stopped rendering (it logged no FPS). After restarting glxgears, the window was still black and the frame rate ran as if it were not synced (several hundred FPS). This happened again shortly after Xorg was restarted; by chance I caught it stopping, and it only went black once the window became obscured.
Comment 17 Chris Wilson 2017-10-07 19:44:31 UTC
Created attachment 134733 [details] [review]
Flush current vblank queue

Better version, with bugfix (hopefully).
Comment 18 Chris Wilson 2017-10-08 11:15:09 UTC
Created attachment 134743 [details] [review]
Queue vblank keepalive

Third attempt, hopefully on the right track now (present/vblank).
Comment 19 Chris Wilson 2017-10-08 11:32:18 UTC
Created attachment 134744 [details] [review]
Flush current vblank queue

Fourth is always better than the third.
Comment 20 Adric Blake 2017-10-08 12:28:24 UTC
The last two attachments are the same. Are you sure you uploaded the right one?
Comment 21 Chris Wilson 2017-10-08 12:31:57 UTC
Created attachment 134745 [details] [review]
Queue vblank keepalive

Fourth attempt, take 2.
Comment 22 Adric Blake 2017-10-17 16:02:08 UTC
uptime: 5 days
errors detected: none
packages:
linux-git 4.14rc3.r394.gbf2db0b9f580-1
xorg-server 1.19.4-1
mesa 17.2.2-1
libdrm 2.4.83-1
xf86-video-intel 1:2.99.917+783+g5332a0ef-1 (w/ debug)
-----
uptime: 2 days
errors detected: none
packages:
linux-drm-tip-git 4.14+rc4+990+g005c15a27958+708068-1 (this should be correct)
xorg-server 1.19.5-1
mesa 17.2.2-1
libdrm 2.4.84-1
xf86-video-intel 1:2.99.917+796+g04b4f3b7-1 (w/ debug)

It seems I cannot replicate this bug anymore. Whatever the cause was seems to have disappeared a few weeks ago.
Comment 23 Elizabeth 2017-10-18 14:45:23 UTC
(In reply to Adric Blake from comment #22)
> ...
> It seems I cannot replicate this bug anymore. Whatever the cause was seems
> to have disappeared a few weeks ago.
Is this without Chris's patch? Thank you.
Comment 24 Adric Blake 2017-10-18 19:07:15 UTC
I built the xf86-video-intel driver with the full debug option and with the patches applied. I used it for both the linux-git kernel and the drm-tip linux kernel (the kernels which I listed in my last post).
Comment 25 Jim Turner 2017-10-26 21:55:04 UTC
I can confirm this problem still exists in xserver-xorg-video-intel v2:2.99.917+git20161206-1. The problem started with Linux kernel 4.13 and still exists in 4.13.9 (4.13.0-9.1-liquorix-686-pae #1 ZEN SMP PREEMPT liquorix 4.13-3 (2017-10-24) i686 GNU/Linux). I'm using Intel i915 "Ironlake":

Graphics:  Card: Intel Core Processor Integrated Graphics Controller
           Display Server: X.Org 1.19.5 driver: intel
           Resolution: 1600x900@60.00hz, 1920x1080@60.00hz
           OpenGL: renderer: Mesa DRI Intel Ironlake Mobile x86/MMX/SSE2
           version: 2.1 Mesa 17.2.3

Boot up works fine, post-suspend works fine, but after a random period of time, graphics become very slow (confirmed by glxgears) and the Xorg process starts consuming 100% CPU when I try to drag a window around while an SDL video is playing.

Xorg logs the single message:

(EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.

Killing and Restarting X "fixes" the symptoms (for a while).
Comment 26 Luka Paunovic 2017-10-31 15:26:35 UTC
I have the same issue!
Here is my bug: https://bugs.freedesktop.org/show_bug.cgi?id=103509
I just noticed that I have this error too:

(EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.
Comment 27 Jim Turner 2017-11-01 23:49:15 UTC
Further update: it's connected to kernels 4.13+, NOT xf86-video-intel or xserver-xorg. I tested downgrading these packages to versions from before I had ever seen this problem, but the problem remained. I then upgraded them back to their current latest versions in TESTING and reverted to the last 4.12 kernel (4.12.0-14.3-liquorix-686-pae #1 ZEN SMP PREEMPT liquorix 4.12-11 (2017-10-14) i686 GNU/Linux), and the problem is gone. So, for now, I guess I'll stick with the 4.12 kernel. :(
Comment 28 Adric Blake 2017-12-03 23:06:44 UTC
Got the problem, yet again, after smooth sailing with the 4.13 kernel series. This is the first time the bug has occurred since I upgraded to the 4.14 series a few days ago. It might be aggravated by the GPU acceleration used by Firefox Quantum.

[ 93351.317] (EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.

uptime: 26 hours

Arch Linux x86_64
linux-ck-core2 4.14.3-1 (repo-ck.com)
xorg-server 1.19.5-1
mesa 17.2.6-1
libdrm 2.4.88-1
xf86-video-intel 1:2.99.917+800+g37a682aa-1

Aside from the kernel, these are all the latest Arch packages.

(I am the reporter.)
Comment 29 Adric Blake 2017-12-04 01:27:34 UTC
So it seems I spoke too soon. Hours later, after just over 4 days of uptime, the twin sister laptop got the bug on the 4.13 kernel. It had been fine since I updated the kernel a few weeks ago, until now.

Arch Linux x86_64
linux-ck-core2 4.13.12-2
xorg-server 1.19.5-1
mesa 17.2.5-1
libdrm 2.4.88-1
xf86-video-intel 1:2.99.917+800+g37a682aa-1

[195685.080] (EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.
Comment 30 Adric Blake 2017-12-11 04:58:56 UTC
Hope ya didn't forget about this, because I've got another assertion failure! This one is relatively close to the time it takes for the ENOENT bug to appear, so I have hope...

Arch Linux x86_64
linux-ck-core2 4.14.4-1 (not on 4.14.5 yet)
xorg-server 1.19.5-1
mesa 17.3.0-1
libdrm 2.4.88-1
xf86-video-intel 1:2.99.917+800+g37a682aa-1

Error:
[ 98991.493] (EE) assert_tiling:296 assertion 'tiling.tiling_mode == bo->tiling' failed

Backtrace and journal logs incoming.
Comment 31 Adric Blake 2017-12-11 05:04:04 UTC
Created attachment 136074 [details]
Xorg log tail for crash on Dec 10

End of Xorg log with the assert failure
Comment 32 Adric Blake 2017-12-11 05:10:44 UTC
Created attachment 136075 [details]
systemd journal log for time around crash on Dec 10

Journal log section with the driver-forced debug info
Comment 33 Adric Blake 2017-12-11 06:39:37 UTC
Line number information for the backtrace, since I compiled the driver on a different system:

0x432f7 is in kgem_create_2d (kgem.c:5809).
5804					bo->delta = 0;
5805					DBG(("  1:from active: pitch=%d, tiling=%d, handle=%d, id=%d\n",
5806					     bo->pitch, bo->tiling, bo->handle, bo->unique_id));
5807					assert(bo->pitch*kgem_aligned_height(kgem, height, bo->tiling) <= kgem_bo_size(bo));
5808					assert_tiling(kgem, bo);

(gdb) list *sna_pixmap_alloc_gpu+0xe7
0x5565f is in sna_pixmap_alloc_gpu (sna_accel.c:1693).
1688		} else
1689			tiling = sna_pixmap_default_tiling(sna, pixmap);
1690	
1691		DBG(("%s: pixmap=%ld\n", __FUNCTION__, pixmap->drawable.serialNumber));
1692	
1693		priv->gpu_bo = kgem_create_2d(&sna->kgem,
1694					      pixmap->drawable.width,
1695					      pixmap->drawable.height,
1696					      pixmap->drawable.bitsPerPixel,
1697					      tiling, flags);

0x6047f is in sna_pixmap_move_to_gpu (sna_accel.c:4334).
4329	
4330					sna_pixmap_alloc_gpu(sna, pixmap, priv, create);

0x5e8a1 is in sna_drawable_use_bo (sna_accel.c:3810).
3805	
3806	create_gpu_bo:
3807			move = MOVE_WRITE | MOVE_READ | MOVE_ASYNC_HINT;
3808			if (flags & FORCE_GPU)
3809				move |= __MOVE_FORCE;
3810			if (!sna_pixmap_move_to_gpu(pixmap, move))
3811				goto use_cpu_bo;

0x8caae is in sna_poly_fill_rect (sna_accel.c:15129).
15124		    sna_pixmap_is_gpu(gc->tile.pixmap)) {
15125			DBG(("%s: source is already on the gpu\n", __FUNCTION__));
15126			hint |= FORCE_GPU;
15127		}
15128	
15129		bo = sna_drawable_use_bo(draw, hint, &region.extents, &damage);
Comment 34 Adric Blake 2017-12-13 01:07:41 UTC
Arch Linux x86_64
linux-drm-tip-git 4.15.0-rc3-15+rc3+939+g8874c0f95698+723702 (self-built, self-named...)
xorg-server 1.19.5-1
mesa 17.3.0-2
libdrm 2.4.88-1
xf86-video-intel 1:2.99.917+800+g37a682aa-1

Just tried again with both linux 4.14.5 and linux from current drm-tip. Both crash quickly (on that same assert) once it becomes GPU-heavy. I can replicate it with dwarf fortress from LNP. It crashes within minutes upon playing.

I am unable to test for the original bug (which is now noticeably more frequent again) until this problem gets fixed.

Three crashes follow.
Comment 35 Adric Blake 2017-12-13 01:09:03 UTC
Created attachment 136126 [details]
Xorg log tail for recent crashes
Comment 36 Adric Blake 2017-12-13 05:22:23 UTC
As I "investigate", I've found that the assert I've been hitting will continue to fail if I modify the source to emit a warning instead, and that eventually the source line before assert_tiling() will fail too:

[  1499.168] (EE) kgem_create_2d:5745 assertion 'bo->pitch*kgem_aligned_height(kgem, height, bo->tiling) <= kgem_bo_size(bo)' failed

Will attach debug info.
Comment 37 Adric Blake 2017-12-13 05:25:24 UTC
Created attachment 136129 [details]
Xorg log tail for crash on similar assert

Crash log for the new assert (which is suspiciously close to the previous one).
Comment 38 Adric Blake 2017-12-13 15:38:47 UTC
Created attachment 136142 [details]
Xorg log tail for most recent asserts

[  1306.763] (EE) sna_blt_fill_boxes:3622 assertion 'box->y2 * bo->pitch <= kgem_bo_size(bo)' failed
after ignoring that, we have this:
[ 28738.434] (EE) kgem_bo_free:2544 assertion 'bo->exec == NULL' failed
Comment 39 Adric Blake 2017-12-13 17:42:46 UTC
Created attachment 136146 [details]
Xorg log snip with accel loss and dump

Finally, finally got the dump! As to how useful it is, I don't know, since it seems closely tied to the bo->exec == NULL assert failure I had got earlier. But it was requested, so here it is.

I also have the whole 1G log, if you really want to see it or want more context for the error.
Comment 40 Adric Blake 2017-12-13 17:45:20 UTC
Created attachment 136147 [details]
Xorg log tail for crash after accel loss

Bonus crash after VT switch after the failure had occurred.
Comment 41 Adric Blake 2017-12-14 16:07:56 UTC
Created attachment 136174 [details]
Xorg log snip with accel loss and dump, again

Replicated bug on latest drm-tip kernel, 4.15.0-rc3-15+rc3+960+g91d06d0bbd1a+723723 .

Again, I have a 500M log that is available on request.
Comment 42 Adric Blake 2017-12-15 04:29:06 UTC
The bug is significantly older than anticipated.

So far, I have been able to experimentally deduce that kernels all the way back to Linux 4.11.9 have the bug. Kernels that I've tested so far that work bug-free are linux-lts 4.9.67 and linux 4.10.9. I have yet to bisect further as the compiler is being a pain and apparently needs to be downgraded. Will update, eventually....

For reference, my method of testing is to open the previously mentioned Dwarf Fortress game (other games not tested) and switch workspaces, both casually and aggressively, to and from the game window using marco, my window manager. It's guaranteed to trip the bug in under 10 minutes (usually no more than 5).
Comment 43 Adric Blake 2017-12-16 06:15:13 UTC
I have bisected (!) the commit responsible for the bug, using the drm-tip kernel tree.

5b30694b474d00f8588fa367f9562d8f2e4c7075 is the first bad commit
commit 5b30694b474d00f8588fa367f9562d8f2e4c7075
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jan 9 16:16:09 2017 +0000

    drm/i915: Align GGTT sizes to a fence tile row
    
    Ensure the view occupies the full tile row so that reads/writes into the
    VMA do not escape (via fenced detiling) into neighbouring objects - we
    will pad the object with scratch pages to satisfy the fence. This
    applies the lazy-tiling we employed on gen2/3 to gen4+.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170109161613.11881-2-chris@chris-wilson.co.uk

Bisect log for reference points:

git bisect start
# bad: [c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201] Linux 4.11-rc1
git bisect bad c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201
# good: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
git bisect good c470abd4fde40ea6a0846a2beab642a578c0b8cd
# good: [caa59428971d5ad81d19512365c9ba580d83268c] Merge tag 'staging-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good caa59428971d5ad81d19512365c9ba580d83268c
# good: [ca2dea434d10e3a676482fdf6df17d20cdb3e907] Merge tag 'juno-fixes-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into next/late
git bisect good ca2dea434d10e3a676482fdf6df17d20cdb3e907
# bad: [5d8a00eee2ed2e548a5d21b0edf495f3f7bf8bb4] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect bad 5d8a00eee2ed2e548a5d21b0edf495f3f7bf8bb4
# bad: [29a73d906bd386839015602c4bd35ef2e3531abc] Merge branch 'drm-next-4.11' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect bad 29a73d906bd386839015602c4bd35ef2e3531abc
# bad: [957870f9341201b176e41eb5fa8a750b13e501aa] drm/i915: Split out i915_gem_object_set_tiling()
git bisect bad 957870f9341201b176e41eb5fa8a750b13e501aa
# good: [108109444ff64fb1a2976174ec23e9e2117b5709] drm/i915: Check num_pipes before initializing audio component
git bisect good 108109444ff64fb1a2976174ec23e9e2117b5709
# good: [05ab3c2eec74507712cd5f2b027c7c8b9a66f190] drm: kselftest for drm_mm and top-down allocation
git bisect good 05ab3c2eec74507712cd5f2b027c7c8b9a66f190
# good: [a402eae64d0ad12b1c4a411f250d6c161e67f623] Merge tag 'v4.10-rc2' into drm-intel-next-queued
git bisect good a402eae64d0ad12b1c4a411f250d6c161e67f623
# good: [56f6e0a7e7b09adb553339f9075696e918b96587] drm/i915: Assert that we do create the deferred context
git bisect good 56f6e0a7e7b09adb553339f9075696e918b96587
# good: [5c693b2b8ae4ec51f0890b7a1368425f8898f0bb] drm/i915: s/gen8_setup_page_directory/gen8_setup_pdpe/
git bisect good 5c693b2b8ae4ec51f0890b7a1368425f8898f0bb
# bad: [91d4e0aa923e13ef832e9d793b6d080b6318f2d9] drm/i915: Move ggtt fence/alignment to i915_gem_tiling.c
git bisect bad 91d4e0aa923e13ef832e9d793b6d080b6318f2d9
# bad: [5b30694b474d00f8588fa367f9562d8f2e4c7075] drm/i915: Align GGTT sizes to a fence tile row
git bisect bad 5b30694b474d00f8588fa367f9562d8f2e4c7075
# good: [9e65a37872174bd3615b16fa556377ebf5a3f0cd] drm/i915: don't open code the pdpe/pml4e clearing
git bisect good 9e65a37872174bd3615b16fa556377ebf5a3f0cd
# good: [6649a0b6501d78042fd0fffaaefab1aeee27e75d] drm/i915: Extract tile_row_size for fencing
git bisect good 6649a0b6501d78042fd0fffaaefab1aeee27e75d
# first bad commit: [5b30694b474d00f8588fa367f9562d8f2e4c7075] drm/i915: Align GGTT sizes to a fence tile row
Comment 44 Chris Wilson 2017-12-16 10:41:21 UTC
Ok, so this bit

4996330-[   481.828] _kgem_bo_destroy: handle=66, proxy? 0
4996331-[   481.828] __kgem_bo_destroy: handle=66, size=98304
4996332-[   481.828] __kgem_bo_destroy: handle=66 -> active
4996333-[   481.828] __sna_free_pixmap(pixmap=68861)
4996334-[   481.828] sna_create_pixmap(798, 24, 24, usage=0)
4996335-[   481.828] kgem_can_create_2d: 798x24 @ 24
4996336-[   481.828] kgem_surface_size: tile_width=8, tile_height=1 => aligned pitch=3192, height=24
4996337-[   481.828] kgem_can_create_2d: untiled size=77824
4996338-[   481.828] kgem_choose_tiling: TLB near-miss between lines 798x24 (pitch=3192), forcing tiling 1
4996339-[   481.828] kgem_surface_size: tile_width=8, tile_height=1 => aligned pitch=3192, height=24
4996340-[   481.828] kgem_can_create_2d: tiled[-1] size=77824
4996341-[   481.828] sna_create_pixmap: usage=0, flags=1b
4996342-[   481.828] sna_create_pixmap: creating GPU pixmap 798x24, stride=3192, flags=1b
4996343-[   481.828] __pop_freed_pixmap: reusing freed pixmap=68861 header
4996344-[   481.828] create_pixmap_hdr: pixmap=68875, width=798, height=24, usage=0
4996345-[   481.828] sna_create_pixmap: serial=68875, 798x24, usage=0
4996346-[   481.828] sna_composite_rectangles(pixmap=68876, op=0, 0 x 1 [(0, 0)x(798, 24) ...])
4996347-[   481.828] sna_composite_rectangles: converted to op 0
4996348-[   481.828] sna_composite_rectangles[0] (0, 0)x(798, 24) -> (0, 0), (798, 24)
4996349-[   481.828] sna_composite_rectangles: nrects=1, region=(0, 0), (798, 24) x 1
4996350-[   481.828] sna_composite_rectangles: clipped extents (0, 0),(798, 24) x 1
4996351-[   481.828] sna_composite_rectangles: pixmap +(0, 0) extents (0, 0),(798, 24)
4996352-[   481.828] sna_composite_rectangles: dropping last-cpu hint
4996353-[   481.828] sna_drawable_use_bo pixmap=68876, box=((0, 0), (798, 24)), flags=19...
4996354-[   481.828] sna_drawable_use_bo: flush=0, shm=0, cpu=0 => flags=19
4996355-[   481.828] sna_drawable_use_bo: gpu? 0, damaged? 0; cpu? 0, damaged? 0
4996356-[   481.828] sna_pixmap_move_to_gpu(pixmap=68876, usage=0), flags=b
4996357-[   481.828] sna_pixmap_move_to_gpu: CPU damage? 0
4996358-[   481.828] sna_pixmap_move_to_gpu: creating GPU bo (798x24@32), create=b
4996359-[   481.828] kgem_choose_tiling: TLB near-miss between lines 798x24 (pitch=3192), forcing tiling 1
4996360-[   481.828] kgem_choose_tiling: TLB near-miss between lines 798x24 (pitch=3192), forcing tiling 1
4996361-[   481.828] sna_pixmap_alloc_gpu: pixmap=68876
4996362-[   481.828] kgem_create_2d(798x24, bpp=32, tiling=1, exact=1, inactive=0, cpu-mapping=0, gtt-mapping=0, scanout?=0, prime?=0, temp?=0)
4996363-[   481.828] kgem_surface_size: tile_width=512, tile_height=8 => aligned pitch=3584, height=24
4996364-[   481.833] kgem_set_tiling: handle=66, tiling=1 [1], pitch=1024 [3584]: 1
4996365-[   481.833] tiled and pitch not exact: tiling=1, (want 1), pitch=1024, need 3584
4996366-[   481.838] kgem_set_tiling: handle=66, tiling=1 [1], pitch=1024 [3584]: 1
4996367-[   481.838] kgem_bo_free: handle=66, size=98304
4996368-[   481.838] (WW) intel(0): assertion failed: `bo->exec == NULL'; ignoring and trudging onward.
4996369-[   481.838] kgem_bo_free: releasing 0x7fd74a5cb000:0x0 vma for handle=66, count=0

explains the ENOENT. We are calling gem_close() on a handle that is still in the pending batchbuffer. I'm still scratching my head over the kernel rejecting the surface, though; the principle is that the kernel pads with scratch pages if the fence region doesn't fit.
Comment 45 Chris Wilson 2017-12-16 11:04:05 UTC
commit af6d8e9e8f546e5cba60e3a62765c2dbd0328e83 (upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Dec 16 10:55:54 2017 +0000

    sna: Avoid calling kgem_bo_free() on a still active bo
    
    If we fail to manipulate a bo from the active cache for reuse, then we
    have to be careful not to immediately close it as it is still referenced
    from the current batch.
    
    Reported-by: Adric Blake <promarbler14@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=103025#c44
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should end the ENOENT failure. The underlying problem, that the kernel is rejecting tiling changes that (presumably) used to work, remains.
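As a rough illustration of the rule stated in that commit message, here is a toy Python model. It is not the driver's code: `Bo`, `Cache`, `in_batch`, `release`, and `batch_submitted` are hypothetical names, and deferring the close until submission is only one possible way to honor the rule. The actual patch may handle it differently.

```python
class Bo:
    """Toy buffer object; in_batch stands in for a non-NULL bo->exec."""

    def __init__(self, handle, in_batch=False):
        self.handle = handle
        self.in_batch = in_batch


class Cache:
    def __init__(self):
        self.closed = []    # handles passed to the simulated gem_close()
        self.deferred = []  # bos kept alive until the batch is submitted

    def release(self, bo):
        """Close bo now only if the pending batch does not reference it."""
        if bo.in_batch:
            self.deferred.append(bo)
        else:
            self.closed.append(bo.handle)

    def batch_submitted(self):
        """Once the batch is submitted, deferred bos can be closed."""
        for bo in self.deferred:
            bo.in_batch = False
            self.closed.append(bo.handle)
        self.deferred.clear()


cache = Cache()
cache.release(Bo(66, in_batch=True))  # reuse failed, batch still uses it
print(cache.closed)                   # nothing has been closed yet
cache.batch_submitted()
print(cache.closed)                   # handle 66 is closed only now
```

Closing the handle immediately in the first `release` call is the situation that, per the comment above, produced the ENOENT at submit time.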
Comment 46 Adric Blake 2017-12-16 23:04:03 UTC
Created attachment 136232 [details]
Xorg log with assertion failure Dec 16

I compiled the driver (with debug, to replicate my test conditions), but got this:
[    58.487] (EE) to_sna:499 assertion 'sna->scrn == scrn' failed

Probably unrelated to this bug, but still...
Comment 47 Adric Blake 2017-12-17 00:54:48 UTC
Created attachment 136235 [details]
GPU hang error state

Tested the driver with the commit that caused the scrn assert error reverted. Pitch/tiling bugs indeed remain, and I haven't managed to trigger the original bug, but I got a (long) GPU hang. :(

[  468.733033] arch_pc kernel: [drm] GPU HANG: ecode 4:0:0xf2a7fff8, in Xorg [430], reason: Hang on rcs0, action: reset
[  468.733037] arch_pc kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  468.733039] arch_pc kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  468.733040] arch_pc kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  468.733041] arch_pc kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  468.733043] arch_pc kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  468.733112] arch_pc kernel: i915 0000:00:02.0: Resetting chip after gpu hang
[  476.767139] arch_pc kernel: i915 0000:00:02.0: Resetting chip after gpu hang
[  484.703080] arch_pc kernel: i915 0000:00:02.0: Resetting chip after gpu hang
[  492.703084] arch_pc kernel: i915 0000:00:02.0: Resetting chip after gpu hang
[  500.767134] arch_pc kernel: i915 0000:00:02.0: Resetting chip after gpu hang
Comment 48 Luka Paunovic 2017-12-17 13:48:42 UTC
Is this bug related to https://bugs.freedesktop.org/show_bug.cgi?id=103509 ?
Because it looks 100% like it. 
Graphics worked fine for years on many kernel versions, and then suddenly something changed in the kernel that causes this and makes computers unusable.
Comment 49 Luka Paunovic 2017-12-17 13:56:59 UTC
@Adric Blake
do you think downgrading to 4.10.9 or 4.9.67 will solve the problem for Ubuntu 17.10? Until the fixes are finally released into the WOOOOOOOOOOORLD.....
Comment 50 Adric Blake 2017-12-17 18:06:58 UTC
The git version of the driver (xf86-video-intel) has the "disabling acceleration" fix, which honestly is the most annoying issue. I don't know how to install a git version on Ubuntu, as I never got used to their package system. Perhaps install a PPA? The relevant intel driver package should be xserver-xorg-video-intel, unless you use the modesetting driver.

Downgrading to the linux LTS kernel would probably work, assuming the bug that indirectly caused the accel loss is also responsible for the hangs. Be warned that the LTS kernel might not have all the features needed for an updated system, particularly if you use an up-to-date version of systemd. I wouldn't use the 4.10 kernel because it doesn't receive any further patches and could be a risk to your security or data.

If you or anyone else can produce a reliable test procedure that trips the bugs (pitch/tiling or the GPU hang), that would be helpful.

Have patience. :)
Comment 51 Adric Blake 2017-12-17 19:54:13 UTC
Example of one of the types of corruption visible on rare occasion. Seems to occur after hitting one of the asserts (as visible behind the corrupted window). These images persisted and did not quickly disappear after taking the picture. They only disappeared if the source image was redrawn.

Bugzilla doesn't seem to be accepting uploads due to some bug. So, have some hosted links instead.

Corruption (Type 1)
https://i.imgur.com/MK7LO97.jpg
https://i.imgur.com/i8bsuM0.jpg

Corruption (Type 2)
https://i.imgur.com/i9g2ddV.jpg
https://i.imgur.com/n7enyeB.jpg
https://i.imgur.com/gbTIqNA.jpg
https://i.imgur.com/kQqvVyx.jpg
Comment 52 Adric Blake 2018-02-10 15:11:49 UTC
The latest drm-tip kernel still presents the underlying bug.

linux-drm-tip-git 4.15+2329+g897018779f8b+726780-1
xorg-server 1.19.6+13+gd0d1a694f-1
mesa 17.3.3-2
libdrm 2.4.89-1
xf86-video-intel 1:2.99.917+811+g5c7e4e0e-1

I built a debug version of the xf86-video-intel driver to test for the bug again.

The kernel commit:
commit 897018779f8b9aeac758688f0fd21169dfe66fdf (HEAD -> drm-tip, origin/drm-tip, origin/HEAD)
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Fri Feb 9 18:19:20 2018 +0200

    drm-tip: 2018y-02m-09d-16h-18m-21s UTC integration manifest
Comment 53 Benson Bear 2018-03-04 01:24:45 UTC
I am afraid I have the same problem.  It is very discouraging
and disheartening.  But could it be a multitude of different
sorts of problems that produce the same message about
acceleration being disabled?  I assume so, so it might
not be the same problem. 

I just upgraded to Fedora 27 from an old
Fedora 20 where everything worked fine.

So now anywhere from 10 minutes to 3 days,
I get a big slowdown all of a sudden, and
in particular typing is basically impossible,
and I see nothing in journals but in Xorg log
file I see either

 (EE) intel(0): Failed to submit rendering commands (No such file or directory), disabling acceleration.

 or
 
 (EE) intel(0): Failed to submit rendering commands (Invalid argument), disabling acceleration.

The relevant things I have installed are basically what others
have reported:

kernel 4.15.6-300.fc27.x86_64
xorg-x11-drv-intel.x86_64  2.99.917-31.20171025.fc27
libdrm.x86_64  2.4.89-1.fc27 
xorg-x11-server-Xorg.x86_64 1.19.6-2.fc27  
mesa-* 17.3.5-1.fc27

I originally had slightly older versions but upgraded to the
latest available with fc27.

I tried also using uxa acceleration instead of sna but that
made no difference.

Should I try Adric's suggestion of building the git version of the intel driver?
Comment 54 Adric Blake 2018-03-04 02:32:05 UTC
Commit af6d8e9e8f546e5cba60e3a62765c2dbd0328e83 contains the fix to the accel loss issue.

Your intel driver appears to be dated 2017-10-25, while the fix was committed on Dec 16. If it's the same issue, building the latest git version of the driver will fix it.
Comment 55 Benson Bear 2018-03-04 03:07:17 UTC
"Your intel driver appears to be dated 2017-10-25, while the fix was committed on Dec 16"

Thanks, but I see now that the problem that was fixed is only the one signaled by the specific message "No such file or directory", whereas I sometimes had that message and sometimes "Invalid argument". So I assume that there are other problems as well, and that any number of things could result in the acceleration being disabled (although merely disabling acceleration should not cause such an egregious slowdown, so I guess something else happens at the same time as well).

I guess I just have to go and buy some other video card?

I will try the git version, but I doubt I can get it built correctly.
Comment 56 Adric Blake 2018-03-04 18:54:17 UTC
I wouldn't be too sure that you have a different bug. Looking back at my testing data, the nonexistent file error was most common, but I also recorded at least two instances of the invalid args error which followed the same sequence of executed code as the others.

Looking at the drm-gem manpage, it states: "Invalid object handles return EINVAL and invalid object names return ENOENT [file not found]." Which error is returned, I presume, is dependent on how the kernel/drm system reuses that memory.
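The failure strings quoted in this report map directly onto errno values. The Python sketch below pulls the errno name out of a "Failed to submit rendering commands" log line. The function name `submit_errno` is made up for illustration, and the lookup assumes the C library's English messages as they appear in the logs here.

```python
import errno
import os
import re

# Errno values that appear in the submit-failure messages in this report.
CANDIDATES = (errno.ENOENT, errno.EINVAL, errno.EIO)
BY_MESSAGE = {os.strerror(code): errno.errorcode[code] for code in CANDIDATES}

PATTERN = re.compile(r"Failed to submit rendering commands \((.+?)\)")


def submit_errno(line):
    """Return the errno name for a submit-failure log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return BY_MESSAGE.get(match.group(1))


line = ("(EE) intel(0): Failed to submit rendering commands "
        "(No such file or directory), disabling acceleration.")
print(submit_errno(line))
```

Grouping log lines this way makes it easier to see whether ENOENT and EINVAL failures follow the same preceding sequence of driver messages.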

I don't have Fedora to try and create a package on, but I can attach the installed files and you could copy/install the intel_drv.so file yourself, bypassing the package manager.
Comment 57 Adric Blake 2018-03-04 19:10:53 UTC
Created attachment 137781 [details]
prebuilt intel driver installation files (x86-64)

Copy (at the very least) usr/lib/xorg/modules/drivers/intel_drv.so to /usr/lib/xorg/modules/drivers/intel_drv.so
Make a backup of the file(s) first.
Use at your own risk. It should work, though.
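For anyone who prefers to script the backup-then-copy step, a minimal Python sketch follows. The function name `install_with_backup` and the `.bak` suffix are arbitrary choices for illustration; writing to the system path requires root.

```python
import shutil
from pathlib import Path


def install_with_backup(src, dst, suffix=".bak"):
    """Copy src over dst, first saving any existing dst as dst + suffix.

    Returns the backup path, or None if there was nothing to back up.
    """
    src, dst = Path(src), Path(dst)
    backup = None
    if dst.exists():
        backup = dst.with_name(dst.name + suffix)
        shutil.copy2(dst, backup)
    shutil.copy2(src, dst)
    return backup


# Example call (run as root, from the directory holding the unpacked files):
# install_with_backup("usr/lib/xorg/modules/drivers/intel_drv.so",
#                     "/usr/lib/xorg/modules/drivers/intel_drv.so")
```

Restoring the old driver is then a matter of copying the `.bak` file back over `intel_drv.so`.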
Comment 58 Benson Bear 2018-03-05 01:17:10 UTC
Thanks a lot Adric, I will be trying that.

Right away last night I downloaded the git version, which seemed to build okay, and I did a make install. It put the drivers in /usr/local and changed the ldconfig, and I restarted the display manager, which came up, but when I logged in I got just a blank screen. I was a little spooked and just did a make uninstall and went back to the old. But I will try it again and try also your version.

Your quote from the manpage does suggest it still is possible it is the same error.

I wonder why there are so few reports. Also why Fedora has a driver from so long ago still in its repo.

I might wait a while to try this until the problem happens again, since it appears I did not actually test the older UXA mode (I stupidly added SNA and UXA lines but then commented out the wrong one) and would like to see if UXA works. But I will report back asap.
Comment 59 Adric Blake 2018-03-05 15:08:49 UTC
Remember that you can view the Xorg.0.log/Xorg.0.log.old files and the system log if something blackscreens.

The essence of the build script I use if you want to replicate:
  NOCONFIGURE=1 ./autogen.sh
  export CFLAGS=${CFLAGS/-fno-plt}
  export CXXFLAGS=${CXXFLAGS/-fno-plt}
  export LDFLAGS=${LDFLAGS/,-z,now}
  ./configure --prefix=/usr \
    --libexecdir=/usr/lib \
    --with-default-dri=3
  make
  make install  # this will OVERWRITE the previous files in /usr, /usr/lib
Comment 60 Benson Bear 2018-03-06 00:04:59 UTC
Thanks Adric, I am going to try various things eventually but the first thing now I want to do is use the old UXA method to see if it runs okay for a long time without the loss of acceleration.  Then I will try various versions of the driver with SNA and I will report back.

I also just discovered when googling around a great deal that many recommend not to use the intel driver at all anymore.   And in fact when I checked the drivers on a new computer I just installed fc27 on, I see it is not using this driver although it is installed (using the onboard graphics with an i3-8100)!  Wow, I had no idea about this.  It is using the generic "modesetting" driver (seems to be a misleading name) and that seems to work just fine, even with things like google earth and the like. 

I guess most of the important hardware specific stuff lies really in the kernel side of the drivers, at least for the newer hardware. I should try this also on the older (onboard graphics with i3-540) hardware. I bet it won't work but first I want to run the UXA longer to see if it gets the "acceleration dropped" problem.

Prima facie it seems like using SNA with the intel driver would be the most desirable configuration however.
Comment 61 Adric Blake 2018-03-06 00:09:54 UTC
UXA and modesetting are noticeably slower, though. For my older hardware, it's preferable to have SNA.
Comment 62 Benson Bear 2018-03-10 07:56:41 UTC
They are slower and also perhaps the generic DDX driver doesn't work with older hardware at all.  That turns out to be why I got the blank screen earlier.   
 
Anyway I have now run a recent git version of the intel driver for quite a while in SNA mode without a recurrence of the big slowdown after acceleration being disabled.  I will assume it is probably going to be okay, and am now turning attention to another machine, a much newer one. 

I have found the intel driver is not usable on this right now because of some strange issues with webcams.  It cannot display webcam output on the screen.   The generic DDX "modesetting" driver works just fine in this and other regards (although like all drivers, it simply cannot do smooth scrolling but that's apparently how linux is...).   
 
I will make a detailed new bug report on this issue after I have done some more looking into it when I get the time.
Comment 63 Jani Saarinen 2018-03-29 07:11:38 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.
Comment 64 Benson Bear 2018-03-30 11:22:52 UTC
Quoting my earlier message:

"Anyway I have now run a recent git version of the intel driver for quite a while in SNA mode without a recurrence of the big slowdown"


Having now run a recent version for about 3 weeks without stopping it, it seems to go okay. It has not suffered the slowdown in all that time.

"I have found the intel driver is not usable on this right now because of some strange issues with webcams.  It cannot display webcam output on the screen."

Also, the more recent driver no longer has this problem on the newer hardware (i3-8100).
Comment 65 Jani Saarinen 2018-03-30 12:10:07 UTC
Are you saying this can be resolved / closed?
Comment 66 Adric Blake 2018-03-30 17:32:05 UTC
I don't think so. The underlying kernel bug still exists, although the xf86-video-intel driver bug has been fixed. (See comments 45 and 10-11.) If you think that the kernel bug deserves its own report, rather than a report for the symptom, I can file one, but I think it should stay with the data here.
Comment 67 Benson Bear 2018-03-31 03:37:26 UTC
Yes, sorry, I did not mean to imply the problem had been fully resolved, only that the driver apparently had been altered so the particular symptom of losing acceleration and subsequent slowdown had gone away in my case.  I should have made that more clear.
Comment 68 Adric Blake 2018-05-15 04:08:29 UTC
I have found yet another test case towards triggering this bug, and this time it's much faster. Using this method allows me to fairly easily trigger GPU hangs and eventually cause loss of acceleration one way or another.

Arch Linux x86_64, as always.
Active packages:
linux 4.16.8-1    (having trouble running drm-tip...)
xorg-server 1.20.0-1
mesa 18.0.3-3
libdrm 2.4.92-1
xf86-video-intel 1:2.99.917+832+g35947721-1

In this case, I am running a freshly-installed cinnamon (1.8) and using gnome-terminal. To trigger the bug, I open a new virtual terminal window and resize the window in the horizontal direction (up-and-down alone doesn't work). When you rapidly reduce the terminal window size in the horizontal direction, it is very likely that the bug will occur. A larger window seems to help; the contents or zoom level of the window might have an effect as well. Rapidly performing this process for extended periods of time has very interesting effects (see below).

Nearly every time the bug is tripped by this method, visual flickering and/or corruption occurs in the window, and occasionally other random parts of the screen are corrupted as well. If I manage to stop resizing the window when the corruption occurs, the corruption tends to persist. The windows themselves contain the corruption: it can be captured by screenshots and the alt-tab previews, and the windows can be minimized and unminimized without losing the corrupted contents. I have several screenshots of the corruption as it builds. If you patch the xf86-video-intel driver to change would-be asserts into driver warnings (non-debug version), you'll see that the warnings (those reachable in the non-debug build) are emitted whenever the graphical corruption occurs.

When my test method is done repeatedly with varying intensity for extended periods of time, the GPU will hang after several minutes, sometimes repeatedly. If you're unlucky, the reset can fail (I haven't reproduced that on my exact software setup, though). Alternatively, the 2D driver can break and lose acceleration almost like in the original bug report, but I haven't yet replicated that either.

I have as many GPU error states as I could capture. However, except for a few relative timestamps and maybe one or two other minor things, they all appear *exactly* the same. Does it only capture the first error?

The accel loss bug I managed to trigger printed this:
[ 10991.936] (EE) intel(0): Failed to submit rendering commands (Input/output error), disabling acceleration.
This occurred at about the same time as (~0.5 seconds after) one of multiple GPU hangs, coupled with a few fence timeouts, which makes me think that it might be unlucky timing, as shown here:
...
[10868.532336] i915 0000:00:02.0: Resetting chip after gpu hang
[10877.492358] i915 0000:00:02.0: Resetting chip after gpu hang
[10937.439048] i915 0000:00:02.0: Resetting chip after gpu hang
[10946.399014] i915 0000:00:02.0: Resetting chip after gpu hang
[10948.532260] asynchronous wait on fence i915:[global]:25b0f4 timed out
[10955.359040] i915 0000:00:02.0: Resetting chip after gpu hang
[10964.532326] i915 0000:00:02.0: Resetting chip after gpu hang
[10973.492343] i915 0000:00:02.0: Resetting chip after gpu hang
[10982.452359] i915 0000:00:02.0: Resetting chip after gpu hang
[10991.412319] i915 0000:00:02.0: Resetting chip after gpu hang
[11000.376705] i915 0000:00:02.0: Resetting chip after gpu hang
[11002.505602] asynchronous wait on fence i915:[global]:25b0fc timed out
[11009.549091] i915 0000:00:02.0: Resetting chip after gpu hang
[11013.385588] asynchronous wait on fence i915:[global]:25b0fe timed out
[11018.505675] i915 0000:00:02.0: Resetting chip after gpu hang

The Xorg.0.log around that time (piped through uniq -c for compactness):
...
      1 [ 10931.396] (WW) intel(0): assertion failed: `bo->pitch*kgem_aligned_height(kgem, height, bo->tiling) <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10931.397] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      2 [ 10931.398] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      6 [ 10931.399] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      4 [ 10931.400] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      6 [ 10931.401] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10931.402] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10931.439] (WW) intel(0): assertion failed: `bo->pitch*kgem_aligned_height(kgem, height, bo->tiling) <= kgem_bo_size(bo)'; ignoring and trudging onward.
      3 [ 10931.440] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
     13 [ 10931.441] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      8 [ 10931.442] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10991.936] (EE) intel(0): Failed to submit rendering commands (Input/output error), disabling acceleration.
      1 [ 10991.937] (EE) intel(0): When reporting this, please include /sys/class/drm/card0/error and the full dmesg.

I might be able to test using a debug driver if need be.
Comment 69 Adric Blake 2018-05-15 04:13:00 UTC
I fat-fingered the submit button, but I meant to add that I was running glxgears and intel_gpu_top at the time, in case it's relevant. I'll try to test the drm-tip as soon as I can get my version to boot.
Comment 70 Adric Blake 2018-07-06 04:03:35 UTC
Underlying bug still present as of kernel ~4.18rc3.

linux-drm-tip-git 4.18rc3+717+g2fa1923491b6+767886-1
xorg-server 1.20.0-9
mesa 18.1.3-1
libdrm 2.4.92-1
xf86-video-intel 1:2.99.917+840+g42c24d32-1

I've experimented with my newest trigger case, and I can determine that the trigger for the bug depends on the window manager in use. So far, I've only been able to replicate the issue with cinnamon and gnome-/mate-terminal, but I haven't thoroughly tested every possibility.
Comment 71 Adric Blake 2018-07-14 14:55:50 UTC
I decided to test on a few other different machines that I had around but couldn't test until now.

System 1:
Almador (gen2) - An 845G system

Testing with the cinnamon desktop, I can load the desktop, but the moment the gnome-terminal window opens the window flickers/glitches like it would do on my gm45 and the GPU hangs. No resizing required.

I collected three dumps for each boot I tested this.
First dump was on the stock xf86-video-intel, and the cinnamon desktop
Second dump was an assert-enabled build with a patch to ignore some problematic asserts. Oddly enough, it hung without hitting any of the kgem/pitch asserts that I was expecting, despite the same flicker I saw. Running the cinnamon desktop.
Third dump was an assert-enabled build like before, but I started with the mate desktop and attempted to replace it with cinnamon from within. It hung without me opening gnome-terminal, though I already had a terminal window open.

System 2:
Haswell (gen7.5, gt3) - Intel Iris Pro? No msg in kernel log so I'm not sure. EFI.

I tested this system like I would with my gm45 (cinnamon, resizing gnome-terminal as stated before), but I could not observe any flickering using the stock xf86-video-intel driver. I maxed the cpu and spun up as many instances of glxgears as possible, but resizing the window still failed to trigger the bug.
Comment 72 Adric Blake 2018-07-14 14:57:58 UTC
Created attachment 140636 [details]
gen2 bug replication: hang dumps, assert-enabled Xorg logs
Comment 73 Adric Blake 2018-07-30 06:15:18 UTC
Getting back on track, I made a decent attempt at debugging the (underlying) bug that is responsible for these issues.

Tested on kernel 4.18rc6+218. I reproduced the bug with full debug logging and drm.debug=0x1f. When the bug occurred, it hit an assert that I have ignored in the xorg driver. Using the time of the assert, it is easy to line it up with the ioctl result listed in the kernel logs. I determined the execution path of the ioctl by setting a logged global to certain values depending on the part of the code that gets run.
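Lining up the two logs by time can also be scripted. The Python sketch below finds the kernel log line closest in time to a given Xorg timestamp. It assumes both logs use seconds-since-boot stamps in square brackets and that the two clocks agree closely enough, as they appear to in the logs quoted earlier in this report. The names `timestamp` and `nearest` are illustrative.

```python
import re

STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamp(line):
    """Return the leading [seconds] stamp of a log line, or None."""
    match = STAMP.match(line)
    return float(match.group(1)) if match else None


def nearest(target, lines):
    """Return the stamped line whose time is closest to target seconds.

    Raises ValueError if none of the lines carries a timestamp.
    """
    stamped = [(timestamp(line), line) for line in lines]
    stamped = [(t, line) for t, line in stamped if t is not None]
    return min(stamped, key=lambda pair: abs(pair[0] - target))[1]


kernel_log = [
    "[10982.452359] i915 0000:00:02.0: Resetting chip after gpu hang",
    "[10991.412319] i915 0000:00:02.0: Resetting chip after gpu hang",
    "[11000.376705] i915 0000:00:02.0: Resetting chip after gpu hang",
]
xorg_line = "[ 10991.936] (EE) intel(0): Failed to submit rendering commands"
print(nearest(timestamp(xorg_line), kernel_log))
```

With the sample lines taken from comment 68, the closest kernel message is the reset at 10991.412319, about half a second before the Xorg error.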

Here is the trace of I915_GEM_SET_TILING ioctl:
deepest/latest call last
ioctl(I915_GEM_SET_TILING...) -> ret = -512
drm_ioctl_kernel(...)
i915_gem_set_tiling_ioctl(...)
i915_gem_object_set_tiling(...)
i915_gem_object_fence_prepare(...)
i915_vma_unbind(...)
i915_gem_active_retire(...)
i915_request_wait(...) -> ret = -ERESTARTSYS (512?)
signal_pending_state(...) -> ret != 0
<uncertain core kernel code follows...>

This is probably not the only kernel code path which causes the bug in the xorg driver. My bug replication method tends to only hit the bug on one path in the xorg driver, while a few other known paths get triggered by other uncertain uses (see the asserts that I ignore).
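A note on the -512 in the trace: ERESTARTSYS is defined as 512 in the kernel's include/linux/errno.h and is a kernel-internal code that is not meant to reach user programs. User space normally sees either a transparently restarted call or EINTR, and wrappers such as libdrm's drmIoctl() retry on EINTR and EAGAIN. The Python sketch below shows that retry pattern in a generic form. `retry_ioctl` and `max_attempts` are hypothetical and not part of any library, and the trace above does not establish whether a missing retry plays any role in this bug.

```python
import errno

RETRYABLE = (errno.EINTR, errno.EAGAIN)


def retry_ioctl(call, max_attempts=100):
    """Invoke call() until it stops failing with EINTR or EAGAIN.

    call is any zero-argument function that raises OSError on failure.
    max_attempts is an arbitrary guard against retrying forever.
    """
    for _ in range(max_attempts):
        try:
            return call()
        except OSError as exc:
            if exc.errno not in RETRYABLE:
                raise
    raise TimeoutError("still interrupted after %d attempts" % max_attempts)
```

Any other error, such as EINVAL or ENOENT, is passed straight to the caller, which is where the driver's own error handling would take over.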
Comment 74 Adric Blake 2018-07-30 16:35:18 UTC
Created attachment 140899 [details]
Captured drm.debug=0x1f output and Xorg logs after bug detected in Xorg driver

With this attachment you should also be able to see the other ioctl errors that were reported in the bug trace. Presumably, some of these will also be the result of some unhandled error in the kernel module.
Comment 75 Anonymous Helper 2018-08-19 09:59:58 UTC
What is the plan about this bug?
Do you guys maintain i915 kernel driver or should it be reported upstream?
If the bug was bisected in kernel to specific commit can it be made optional as a workaround if it is impossible to fix?
Comment 76 Lakshmi 2018-10-02 10:38:30 UTC
(In reply to Adric Blake from comment #74)
> Created attachment 140899 [details]
> Captured drm.debug=0x1f output and Xorg logs after bug detected in Xorg
> driver
> 
> With this attachment you should also be able to see the other ioctl errors
> that were reported in the bug trace. Presumably, some of these will also be
> the result of some unhandled error in the kernel module.

Adric, the original issue is related to "Failed to submit rendering commands", as a result of which the system became slow. In the attached log, I don't see any similar errors. If this issue is different from the original bug, I would consider closing this bug and recommend creating a new one.
Comment 77 Francesco Balestrieri 2018-11-07 09:20:16 UTC
It would be good to at least update the summary to describe what the remaining "underlying kernel bug" is. I read through the comments but it wasn't very clear to me at least.
Comment 78 Francesco Balestrieri 2019-02-11 11:28:53 UTC
Adric Blake, as some time has passed since the last update, I hope you don't mind me asking: are you still experiencing the issue as originally reported?
Comment 79 Adric Blake 2019-02-13 13:25:21 UTC
On an up-to-date GM45 system, the corruption+hang bug still exists; it hasn't changed. This was tested using kernel 5.0rc6.

The userspace driver has been patched to prevent itself from closing a handle too soon because of unusual kernel allocation behavior (which caused the original bug), but either a problem in the driver or in the DRM/GEM system causes corruption to form in drawn buffers and possibly in other memory with certain driver operations. I believe this to be caused by the same state which triggered the original bug.
Comment 80 Lakshmi 2019-03-08 14:30:34 UTC
(In reply to Adric Blake from comment #79)
> On an up-to-date GM45 system, the corruption+hang bug still exists; it
> hasn't changed. This was tested using kernel 5.0rc6.

Can you please add the error, dmesg and xorg logs from this kernel?
Comment 81 Lakshmi 2019-07-19 10:15:04 UTC
(In reply to Lakshmi from comment #80)
> (In reply to Adric Blake from comment #79)
> > On an up-to-date GM45 system, the corruption+hang bug still exists; it
> > hasn't changed. This was tested using kernel 5.0rc6.
> 
> Can you please add the error, dmesg and xorg logs from this kernel?

Adric, to proceed further with this bug, we need above information.
Comment 82 Adric Blake 2019-07-20 15:04:43 UTC
Created attachment 144828 [details]
Linux 5.3-dev [drm-tip] - collected card error states
Comment 83 Adric Blake 2019-07-20 15:06:21 UTC
Created attachment 144829 [details]
Linux 5.3-dev [drm-tip] - captured Xorg log (no debug)
Comment 84 Adric Blake 2019-07-20 15:08:17 UTC
Created attachment 144830 [details]
Linux 5.3-dev [drm-tip] - all kernel messages (drm.debug=0x1e)
Comment 85 Adric Blake 2019-07-20 15:19:54 UTC
Arch Linux 64-bit, still.

linux-drm-tip-git 5.2+3225+ga9fbb0055257 (built less than a day ago)
xorg-server 1.20.5-2
mesa 19.1.2-1
libdrm 2.4.99-1
xf86-video-intel 1:2.99.917+865+g60022507-1 (release build)

The "error", aside from the visual artifacts/corruption, is a loss of acceleration due to an I/O error. That is, the GPU becomes hung.

I have attached all of the error states from the hangs leading up to the failure. I also have attached the Xorg log and kernel log associated with the failure.

Sorry for the delay.

