Created attachment 102302
xorg.log

I updated to 2.99.912-211-g57d0cc82d851 recently and, from time to time, I see many horizontal black lines in the Firefox window, usually after switching tabs or similar. Each line is one pixel high and about 20 pixels wide. I have not managed to take a screenshot of it so far.
Any chance at reproducing this? Do you think it would be bisectable?
(In reply to comment #1)
> Any chance at reproducing this?

It has happened three times over the past week. So barely, unless I find the trigger.

> Do you think it would be bisectable?

With a trigger, easily :)
My suspicion is that the corruption is dirty cache lines, i.e. a missing cache flush on the GPU, and the likely commit is:

commit 2bf36d54ebdfa2a59bf7ef71134b953628ae5d50
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jul 2 11:31:54 2014 +0100

    sna/gen6+: Tweak consideration of compositing on BLT

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I think it is not the commit that is at fault here, but an underlying issue. I suspect I saw the corruption on first booting into master on this ivb, but I haven't seen it again, nor reproduced it on another machine (snb/ivb). It is likely to depend critically on render path, GPU state and memory state.
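For anyone following along, the suspected failure mode can be pictured with a toy model. This is only a sketch: none of these names exist in the ddx or the kernel, and the 16-pixel line size is an assumption (a 64-byte cache line holding 32-bit pixels, which happens to be in the same ballpark as the roughly 20 pixel wide lines reported). Rendering lands in a write-back cache, scanout reads memory directly, and any line that was not flushed in between still shows the old (black) contents.

```c
#include <stdint.h>
#include <string.h>

/* Toy model only: these types and functions are hypothetical. */
enum { LINE_PIXELS = 16, NUM_LINES = 4 };

struct toy_cache_line {
	uint32_t pixels[LINE_PIXELS];
	int dirty;	/* written by the renderer, not yet in memory */
};

struct toy_system {
	uint32_t memory[NUM_LINES][LINE_PIXELS];	/* what scanout reads */
	struct toy_cache_line cache[NUM_LINES];		/* what rendering writes */
};

/* Rendering lands in the cache and only marks the line dirty. */
static void toy_render_line(struct toy_system *s, int line, uint32_t colour)
{
	for (int i = 0; i < LINE_PIXELS; i++)
		s->cache[line].pixels[i] = colour;
	s->cache[line].dirty = 1;
}

/* The flush writes every dirty line back to memory. */
static void toy_flush(struct toy_system *s)
{
	for (int l = 0; l < NUM_LINES; l++) {
		if (!s->cache[l].dirty)
			continue;
		memcpy(s->memory[l], s->cache[l].pixels, sizeof(s->memory[l]));
		s->cache[l].dirty = 0;
	}
}

/* Scanout bypasses the cache, so stale lines still read back as black. */
static int toy_count_black_lines(const struct toy_system *s)
{
	int black = 0;

	for (int l = 0; l < NUM_LINES; l++)
		if (s->memory[l][0] == 0)
			black++;
	return black;
}
```

In this model, flipping to the buffer before toy_flush() shows every rendered line as black, and flipping after it shows none. The real hardware would only leave a handful of lines stale, which would match short black dashes rather than a fully black window.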
Nothing yet on the reproduction side (trying different wm on ivb/byt) - if this is hard to reproduce, I will drop the priority and continue with the release.
(In reply to comment #4)
> Nothing yet on the reproduction side (trying different wm on ivb/byt) - if
> this is hard to reproduce, I will drop the priority and continue with the
> release.

I see it on an hourly basis, usually when browsing maps (Google or OpenStreetMap). But it is visible only for an instant while scrolling or moving the map. I don't know how to reproduce it reliably.
One experiment you can try is:

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index b1faac4..6a5a993 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -121,7 +121,7 @@ static const struct gt_info ivb_gt2_info = {
 	.max_wm_threads = (172-1) << IVB_PS_MAX_THREADS_SHIFT,
 	.urb = { 256, 704, 320, 8 },
 	.gt = 2,
-	.mocs = 3,
+	//.mocs = 3,
 };
 
 static const struct gt_info byt_gt_info = {
(In reply to comment #6)
> One experiment you can try is:
>
> diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
> index b1faac4..6a5a993 100644
> --- a/src/sna/gen7_render.c
> +++ b/src/sna/gen7_render.c
> @@ -121,7 +121,7 @@ static const struct gt_info ivb_gt2_info = {
>  	.max_wm_threads = (172-1) << IVB_PS_MAX_THREADS_SHIFT,
>  	.urb = { 256, 704, 320, 8 },
>  	.gt = 2,
> -	.mocs = 3,
> +	//.mocs = 3,
>  };
>
>  static const struct gt_info byt_gt_info = {

This did not help. I still saw the artefacts with this applied.
Hmm, on the one hand I am relieved. Next experiment:

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index b1faac4..87a2712 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -2194,6 +2194,8 @@ try_blt(struct sna *sna,
 		return true;
 	}
 
+	return false;
+
 	bo = __sna_drawable_peek_bo(dst->pDrawable);
 	if (bo == NULL)
 		return true;
Created attachment 102472
pattern

(In reply to comment #8)
> Hmm, on the one hand I am relieved. Next experiment:

I haven't tried this yet. Attaching, just FYI, what it looks like (taken with a phone camera).
The pattern is more regular than I expected. Looks more like a tiling mismatch, though it could well still be a missed flush - or even just a bad cache superline? Also, there hasn't been the flood of "me too" reports, so maybe this is not so severe? Or maybe it just requires an unfortunate series of timing mishaps.
(In reply to comment #8)
> Hmm, on the one hand I am relieved. Next experiment:
>
> diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
> index b1faac4..87a2712 100644
> --- a/src/sna/gen7_render.c
> +++ b/src/sna/gen7_render.c
> @@ -2194,6 +2194,8 @@ try_blt(struct sna *sna,
>  		return true;
>  	}
>
> +	return false;
> +

This does not help either. (I applied only this change and reverted the "//" change suggested in comment 6.)
I think I am barking up the wrong tree. Have you tried an older kernel recently? I would be very grateful if you could do a bisect to see which commit triggers the corruption (either ddx or kernel :).
If this is a kernel regression, it must be in 3.15..3.15.4. But there is no i915 patch in that range AFAICS. I can bisect the ddx, but I have lost track of which commit was "good" :/.
(In reply to comment #9)
> Created attachment 102472
> pattern

FWIW this blinks too (as in the rest in bug 81385).
I've started seeing this irregularly. AFAICT, it happens here inside the compositor or at the GL client level. I reached this conclusion when I saw the black lines corrupting only the lower-right half of the window (i.e. filling a triangle), which is indicative of the mesa render path and not the ddx. The other occurrence I have seen is within GL games.
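To spell out the reasoning about the triangle: a GL compositor typically draws each window as a rectangle made of two triangles, so damage confined to one triangle points at the GL path rather than at rectangle-based 2D operations. A minimal sketch of the geometry, assuming the rectangle is split along the top-right to bottom-left diagonal (an assumption, nothing here says which diagonal is used) and that the origin is the top-left corner; the function name is made up for illustration:

```c
/*
 * Hypothetical helper: returns 1 if pixel (x, y) lies in the lower-right
 * triangle of a w x h window split along the diagonal running from the
 * top-right corner (w, 0) to the bottom-left corner (0, h).
 * The origin is the top-left corner and y grows downwards.
 */
static int in_lower_right_triangle(int x, int y, int w, int h)
{
	/* On the diagonal x/w + y/h == 1; below and right of it the sum is larger. */
	return (long long)x * h + (long long)y * w >= (long long)w * h;
}
```

If the corrupted pixels all satisfy this test for the window's size and none fall outside it, the damage follows a single triangle of the window quad.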
Had a thought: maybe http://patchwork.freedesktop.org/patch/35002/ ? Too early to tell here as I find the corruption unpredictable.
I'm feeling confident. Any chance you could test that patch as well?
Scratch that, finally saw the black lines covering the lower-right triangle of a window.
Created attachment 112435
Don't override PTE cache settings on scanout buffers

I think you want something like the attached mesa patch.
(In reply to Chris Wilson from comment #19)
> Created attachment 112435
> Don't override PTE cache settings on scanout buffers
>
> I think you want something like the mesa patch attached.

Rebooted to that today. I will keep you posted. Thanks.
(In reply to Jiri Slaby from comment #20)
> (In reply to Chris Wilson from comment #19)
> > Created attachment 112435
> > Don't override PTE cache settings on scanout buffers
> >
> > I think you want something like the mesa patch attached.
>
> Rebooted to that today. I will keep you posted. Thanks.

I hope that I am really running with that patch, because I have just hit the bug again :(.
glxinfo should have the git describe string, which would be useful to check against the patched source tree.
(In reply to Chris Wilson from comment #22)
> glxinfo should have the git describe string, which would be useful to check
> against the patched source tree.

So I changed VERSION in your patch too, and with

$ glxinfo |grep version.*Mesa
OpenGL core profile version string: 3.3 (Core Profile) Mesa 10.4.3-chris

I still see the issue.
Created attachment 113879
Don't override PTE cache settings on scanout buffers

Missed a gen7 renderbuffers mocs. Please can you test this?
Created attachment 113937
backported patch

No luck :(. But note that I had to backport the patch: I am using Mesa 10.4.4 and the first hunk in gen8_surface_state.c didn't apply. So the question is whether this is a correct backport in the first place.
Backport looks fine. :|
Created attachment 114162
pattern

Just a recap of what it looks like. It paints the lines over a window like this. The initial pattern image was only an excerpt.
Can you please test this kernel patch? http://www.spinics.net/lists/intel-gfx/msg64074.html
(In reply to Daniel Vetter from comment #28)
> Can you please test this kernel patch?
>
> http://www.spinics.net/lists/intel-gfx/msg64074.html

I have just applied the patch. The occurrence rate has dropped over time and it almost never happens now. So let's see whether the patch makes it disappear completely... Thanks.
(In reply to Jiri Slaby from comment #29)
> (In reply to Daniel Vetter from comment #28)
> > Can you please test this kernel patch?
> >
> > http://www.spinics.net/lists/intel-gfx/msg64074.html
>
> I have just applied the patch. The occurrence rate reduced over time and it
> almost does not happen now. So let's see if it dismissed completely now...

I can tentatively conclude it's been fixed. (These bugs usually happen again when I report success, so let's trigger it :P.)
(In reply to Jiri Slaby from comment #30)
> I can tentatively conclude it's been fixed. (These bugs usually happen again
> when I report success, so let's trigger it :P.)

Ok, didn't happen.
(In reply to Jiri Slaby from comment #31)
> (In reply to Jiri Slaby from comment #30)
> > I can tentatively conclude it's been fixed. (These bugs usually happen again
> > when I report success, so let's trigger it :P.)
>
> Ok, didn't happen.

And after updating the kernel to a clean 4.0.4 w/o the patch, I see it again.
I still see this, despite "drm/i915: Ensure cache flushes prior to doing CS flips" applied. It happens very rarely now. I am now at 4.0.5 + that patch + intel_drv 2.99.917-373-g6fc7b16b9319 (sna).
(In reply to Jiri Slaby from comment #33)
> I am now at 4.0.5 + that patch

No, it is actually 4.1 + that patch.
Boo! Hiss!
Ah yes, the mmio_flip path has the same bug. \o/
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index d0f3cbc..b4c5507 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -10022,6 +10022,11 @@ static int intel_queue_mmio_flip(struct drm_device *dev
 				 uint32_t flags)
 {
 	struct intel_crtc *intel_crtc = to_intel_crtc(crtc);
+	int ret;
+
+	ret = i915_gem_check_olr(obj->last_write_req);
+	if (ret)
+		return ret;
 
 	i915_gem_request_assign(&intel_crtc->mmio_flip.req,
 				obj->last_write_req);
The two patches could be merged ofc.
(In reply to Chris Wilson from comment #38)
> The two patches could be merged ofc.

Ok, the first is upstream now, so the latter has to be a separate one... It looks promising so far.
Hmm, 4.1.1 with both of the hunks now freezes whenever X is started. The last xorg.log entry is:

[ 136.329] X.Org Server Extension : 9.0

That is, intel_drv is not loaded, or the kernel crashes, or whatever. 4.1 plus both of them was running here with no problems (or with luck). Two subsequent 4.1.1 boots froze. Now I am at 4.1.1 with the first hunk only and it is OK. But there are no gpu patches in 4.1..4.1.1, so I don't know.
strace? dmesg? cat /proc/`pidof Xorg`/stack?
Oh, I see now. This is a separate bug, I think?

[ 25.013109] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 25.013126] IP: [<ffffffffa05710e9>] i915_gem_check_olr+0x9/0x60 [i915]
[ 25.013156] PGD 2ff388067 PUD 2fe143067 PMD 0
[ 25.013167] Oops: 0000 [#1] PREEMPT SMP
[ 25.013178] Modules linked in: fuse rfcomm cmac ecb ctr ccm nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_recent af_packet ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_ftp nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables bnep nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_hdmi snd_hda_codec_realtek arc4 snd_hda_codec_generic iwldvm snd_hda_intel mac80211 x86_pkg_temp_thermal btusb snd_hda_controller uvcvideo intel_powerclamp btbcm snd_hda_codec btintel bluetooth coretemp videobuf2_vmalloc snd_hda_core videobuf2_memops snd_hwdep videobuf2_core snd_pcm_oss thinkpad_acpi kvm_intel v4l2_common dm_mod snd_pcm videodev snd_seq kvm snd_seq_device crct10dif_pclmul crc32_pclmul i915 crc32c_intel iwlwifi ghash_clmulni_intel snd_timer aesni_intel snd_mixer_oss iTCO_wdt aes_x86_64 cfg80211 snd mei_me joydev iTCO_vendor_support tpm_tis rfkill lrw mei serio_raw soundcore gf128mul lpc_ich i2c_i801 tpm shpchp glue_helper ablk_helper mfd_core wmi ac cryptd thermal pcspkr video battery button processor sdhci_pci sdhci mmc_core e1000e ptp xhci_pci pps_core xhci_hcd sg radeon i2c_algo_bit drm_kms_helper ttm drm efivarfs
[ 25.013329] CPU: 2 PID: 1413 Comm: X Not tainted 4.1.1-2.g538dd8d-desktop #1
[ 25.013344] Hardware name: LENOVO 23252SG/23252SG, BIOS G2ET33WW (1.13 ) 07/24/2012
[ 25.013356] task: ffff8802ff3b2250 ti: ffff8802ff38c000 task.ti: ffff8802ff38c000
[ 25.013368] RIP: 0010:[<ffffffffa05710e9>] [<ffffffffa05710e9>] i915_gem_check_olr+0x9/0x60 [i915]
[ 25.013392] RSP: 0018:ffff8802ff38fcd8 EFLAGS: 00010246
[ 25.013402] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88031cbbac00
[ 25.013412] RDX: 0000000000000000 RSI: ffffffffa0648c60 RDI: 0000000000000000
[ 25.013421] RBP: ffff88030b683400 R08: 0000000000000005 R09: ffff8802ff38fc80
[ 25.013431] R10: ffff8802ff38fc80 R11: ffff880303267000 R12: ffff88030b683400
[ 25.013441] R13: ffff8800cfd97740 R14: ffff880303df0000 R15: ffff8800d3f43000
[ 25.013451] FS: 00007f85c2f4f9c0(0000) GS:ffff88031c280000(0000) knlGS:0000000000000000
[ 25.013461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.013470] CR2: 0000000000000008 CR3: 00000002ff3fd000 CR4: 00000000001407e0
[ 25.013480] Stack:
[ 25.013489] ffff88030baed940 ffffffffa05aeb69 ffff880303267000 ffff8800d3115ec0
[ 25.013500] ffff88031cbbac00 0000000000000001 ffff8803032671a8 ffff880303df3e40
[ 25.013511] ffff880303267000 ffff8802ff38fde0 ffff880303267000 ffff8803032671a8
[ 25.013522] Call Trace:
[ 25.013558] [<ffffffffa05aeb69>] intel_crtc_page_flip+0x309/0x910 [i915]
[ 25.013584] [<ffffffffa001aebb>] drm_mode_page_flip_ioctl+0x1ab/0x370 [drm]
[ 25.013603] [<ffffffffa000a74a>] drm_ioctl+0x11a/0x5d0 [drm]
[ 25.013619] [<ffffffff811f8ccf>] do_vfs_ioctl+0x2bf/0x4d0
[ 25.013634] [<ffffffff811f8f61>] SyS_ioctl+0x81/0xa0
[ 25.013649] [<ffffffff816af0f2>] system_call_fastpath+0x16/0x75
[ 25.013662] [<00007f85c0b7bfd7>] 0x7f85c0b7bfd7
[ 25.013672] Code: 85 22 fe ff ff be 2f 00 00 00 48 c7 c7 62 a6 61 a0 e8 ec 75 af e0 c6 05 8d 57 0d 00 01 e9 05 fe ff ff 0f 1f 44 00 00 53 48 89 fb <48> 8b 7f 08 48 8b 47 10 8b 40 60 83 f8 01 74 27 48 3b 9f 50 01
[ 25.013719] RIP [<ffffffffa05710e9>] i915_gem_check_olr+0x9/0x60 [i915]
[ 25.013740] RSP <ffff8802ff38fcd8>
[ 25.013748] CR2: 0000000000000008
[ 25.018611] ---[ end trace f68319f4de3c6164 ]---
(In reply to Jiri Slaby from comment #42)
> [ 25.013522] Call Trace:
> [ 25.013558] [<ffffffffa05aeb69>] intel_crtc_page_flip+0x309/0x910 [i915]

Oh, and this is the first hunk's path. So calling it from both locations can trigger this BUG_ON?

BTW every boot shows:

2015-06-22T11:02:32.086138+02:00 anemoi kernel: [ 5.222347] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
2015-06-22T11:02:32.086138+02:00 anemoi kernel: [ 5.222385] [drm:cpt_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
2015-06-25T17:48:29.159519+02:00 anemoi kernel: [ 5.156922] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
2015-06-25T17:48:29.159519+02:00 anemoi kernel: [ 5.156946] [drm:cpt_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
2015-07-02T21:03:33.947421+02:00 anemoi kernel: [183015.139612] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link training
2015-07-02T21:03:33.947449+02:00 anemoi kernel: [183015.142321] [drm:intel_dp_complete_link_train [i915]] *ERROR* failed to start channel equalization
2015-07-06T10:41:47.285360+02:00 anemoi kernel: [ 5.006666] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
2015-07-06T10:41:47.285361+02:00 anemoi kernel: [ 5.006713] [drm:cpt_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
2015-07-06T12:56:41.992582+02:00 anemoi kernel: [ 5.219848] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
2015-07-06T12:56:41.992582+02:00 anemoi kernel: [ 5.219867] [drm:cpt_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
2015-07-06T12:59:18.957624+02:00 anemoi kernel: [ 5.102770] [drm:intel_set_pch_fifo_underrun_reporting [i915]] *ERROR* uncleared pch fifo underrun on pch transcoder A
2015-07-06T12:59:18.957625+02:00 anemoi kernel: [ 5.102812] [drm:cpt_irq_handler [i915]] *ERROR* PCH transcoder A FIFO underrun
The patch just isn't defensive enough. It needs:

	if (obj->last_write_req) {
		ret = i915_gem_check_olr(obj->last_write_req);
		if (ret)
			return ret;
	}
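The oops earlier in this bug fits a NULL request being passed in: RDI is 0, CR2 is 0x8, and the faulting instruction bytes (48 8b 7f 08) load from offset 8 of RDI. Here is a self-contained sketch of the guard pattern, using hypothetical stand-in types and function names rather than the real i915 structures:

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real i915 types. */
struct toy_request {
	int pending_error;	/* 0 if the request is fine, negative errno otherwise */
};

struct toy_object {
	/* NULL if the GPU has never written to this object. */
	struct toy_request *last_write_req;
};

/* Like the callee in the oops, this dereferences req unconditionally. */
static int toy_check_request(struct toy_request *req)
{
	return req->pending_error;
}

/* Only check the last write request if there actually is one. */
static int toy_queue_flip(struct toy_object *obj)
{
	int ret;

	if (obj->last_write_req) {
		ret = toy_check_request(obj->last_write_req);
		if (ret)
			return ret;
	}

	/* ... the flip itself would be queued here ... */
	return 0;
}
```

With the guard, a framebuffer that has no outstanding GPU write simply skips the check instead of crashing, and an error from a real request is still propagated to the caller.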
It still happens with both patches, but very rarely.
(In reply to Jiri Slaby from comment #45)
> It still happens with both patches, but very rarely.

Anything else to try? It usually happens when mapping using josm, or when browsing a map in firefox. So maybe a lot of tiles?
I updated to 4.2 and dropped both patches. The former is in and the latter no longer applies due to rewrite.
Still happens with 4.5 and 2.99.917-578-g68913715a298.
Still happens with 4.6 and *modesetting_drv* (I did not try intel_drv). It never happened with XRender compositing, happens only with OpenGL compositing.
Related:

commit 7aa6ca61ee5546d74b76610894924cdb0d4a1af0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 7 16:52:04 2016 +0000

    drm/i915: Mark CPU cache as dirty when used for rendering
(In reply to Chris Wilson from comment #50)
> Related:
>
> commit 7aa6ca61ee5546d74b76610894924cdb0d4a1af0
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Nov 7 16:52:04 2016 +0000
>
>     drm/i915: Mark CPU cache as dirty when used for rendering

Jiri Slaby, can you please re-test and confirm the current status of this defect?
(In reply to yann from comment #51)
> Jiri Slaby, can you please re-test and confirm current defect status then?

I have just applied that and booted. Let's see.
Any updates now?
(In reply to Jani Saarinen from comment #53)
> Any updates now?

No lines seen yet.
Just let us know how long we should keep this open before closing it.
(In reply to Jani Saarinen from comment #55)
> Just let us know how long we wait to keep open / close?

Feel free to close this, I will reopen if it recurs (unlikely).
(In reply to Jiri Slaby from comment #56)
> (In reply to Jani Saarinen from comment #55)
> > Just let us know how long we wait to keep open / close?
>
> Feel free to close this, I will reopen if it recurs (unlikely).

Thanks for your feedback, Jiri. Closing this as fixed then.