Bug 77436

Summary:

KDE Konsole very slow scrolling with TearFree and QT graphics system native

Product:

xorg

Reporter:

Ildar Nurislamov <absorbb>

Component:

Driver/intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

medium

Version:

unspecified

Hardware:

All

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
High iowait	none
SysRq-t	none
perf record	none
perf.stdio	none
SysRq-t	none
SysRq-t with last patch	none
/sys/class/drm/card0/error	none
intel-gpu-overlay output + intel_gpu_top	none
intel-gpu-overlay 1.7 output + intel_gpu_top + Diablo 3	none
Screen garbage with blt	none
Screen garbage with latest master	none
trace-cmd report	none
Xorg.0.log	none
X output	none
lowlevel-blt-bench results	none
lowlevel-blt-bench results 1024x768	none
i915_ring_freq_table	none
cpuinfo	none

Description Ildar Nurislamov 2014-04-14 14:50:40 UTC

I'm on lastest 2.99.9xx versions. i noticed this some time ago - maybe about a month or so. Scrolling text back in KDE Konsole application became very slow.
How to reproduce.
Start Konsole (KDE 4.12.4) with QT graphics system set to native:
konsole -graphicssystem native
print long output e.g. using dmesg command.
grab the scrollbar and try to scroll back and forth.

Changing font type to bitmap or disabling AA doesn't affect performance anyhow.

Everything is fine in UXA mode. With QT graphics system raster scrolling also is pretty fast.
My hardware is:  Intel(R) HD Graphics 2500.
Kernel: 3.13.7
TearFree is enabled.

Konsole is only case where i've noticed this regression for now.

Comment 1 Ildar Nurislamov 2014-04-14 15:00:51 UTC

Disabling TearFree affects problem in funny way. Rendering of content become laggy. If you move scrollbar up and down few times very fast - content of windows spend few seconds after that trying to catch up and draw all the states.
Problem is not affected by method i use to achive Vsync (driver or Kwin's one)

Comment 2 Chris Wilson 2014-04-14 15:05:15 UTC

There is no apparent lag here with TearFree enabled, cpu/gpu is well below maximum. This is ivb hd graphics (i.e. slightly less than HD2500), but with an uptodate kernel.

Comment 3 Chris Wilson 2014-04-14 15:07:30 UTC

I guess the first to do is to confirm this problem with xf86-video-intel.git and see if you can capture an --enable-debug=full of the slowdown.

Comment 4 Ildar Nurislamov 2014-04-15 08:23:38 UTC

Xorg.0.log with driver from git and --enable-debug=full
https://dl.dropboxusercontent.com/u/42927427/Xorg.tar.bz2
     
My actions:
started kde
started Konsole
ran dmesg
spent about 10 sec moving scroll bar up and down
swiched VT and killed Xorg

weird thing is that scrolling Konsole with --enable-debug=full was little bit faster. and it was only thing that was faster - other things are slow as hell.


P.S.: commit http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=cb35b397395cff8ff7f5155759f58c1840bfeeb5 breaks compiling with --enable-debug=full
with error: sna_display.c:3176:43: error: 'struct <anonymous>' has no member named 'rotation'
     assert(cursor->rotation != sna->cursor.rotation);
so i had to rollback this one

Comment 5 Chris Wilson 2014-04-15 10:12:16 UTC

So far, I have only seen a pattern of good behaviour in the glyph rendering. About the only missed improvement would be to remove overdrawn copies back from the TearFree buffers.

Back to normal (still using xf86-video-intel.git), is there any abnormalities in CPU loading? Where does sudo perf top say all the time is being spent?

Comment 6 Ildar Nurislamov 2014-04-15 10:59:29 UTC

Created attachment 97394 [details]
High iowait

So i grab scroll bar and start to move it like ∞ (to exclude disk io from reading scrollback buffer) very fast. after some time it gets very laggy till display stops to redraw completely (even mouse cursor hangs) after that i can witness spike of iowait for 1 core for a moment. sometimes to 100%. 

it depends on window size. If Konsole size is small e.g. 869,653 - everything is fast - and i cannot reproduce problem no matter how hard i try.

Comment 7 Chris Wilson 2014-04-15 11:19:18 UTC

Do you have sysrq enabled? It would be great to get a kernel stacktrace to see where that iowait is coming from (sysrq-T). If the cursor stops moving, we have serious problems, it suggests that X is being trapped in a throttle() waiting for the GPU.

One thing to try for poor GPU performance is my branch http://cgit.freedesktop.org/~ickle/linux-2.6/ #bug77436

Comment 8 Ildar Nurislamov 2014-04-15 12:43:53 UTC

Created attachment 97406 [details]
SysRq-t

the last one on 246 sec - i think i've caught exact hang moment.

Comment 9 Chris Wilson 2014-04-15 12:56:21 UTC

This is the stalled process:


[  246.122116] Xorg            D ffff88021fa13800     0  1343   1298 0x00400004
[  246.122117]  ffff8802115bbac8 0000000000000086 ffff880212418210 ffff8802115bbfd8
[  246.122119]  0000000000013800 0000000000013800 ffff880212418210 ffff88021fa140b8
 [  246.122120]  ffff8802116d0000 0000000000000001 0000000000000000 ffff880212418210
[  246.122121] Call Trace:
[  246.122124]  [<ffffffff815cdfb8>] io_schedule+0x98/0x130
[  246.122135]  [<ffffffffa0117915>] __wait_seqno+0x315/0x580 [i915]
[  246.122156]  [<ffffffffa0118a6a>] i915_gem_object_wait_rendering+0x3a/0x70 [i915]
[  246.122174]  [<ffffffffa0118f3b>] i915_gem_object_finish_gpu.part.50+0xb/0x20 [i915]
[  246.122192]  [<ffffffffa011d6b8>] i915_gem_object_set_cache_level+0x298/0x350 [i915]
[  246.122211]  [<ffffffffa011d896>] i915_gem_object_pin_to_display_plane+0x46/0x190 [i915]
[  246.122230]  [<ffffffffa013608f>] intel_pin_and_fence_fb_obj+0x9f/0x140 [i915]
[  246.122262]  [<ffffffffa013617f>] intel_gen7_queue_flip+0x4f/0x300 [i915]
[  246.122295]  [<ffffffffa013e819>] intel_crtc_page_flip+0x1d9/0x3c0 [i915]
[  246.122328]  [<ffffffffa0096bde>] drm_mode_page_flip_ioctl+0x15e/0x2d0 [drm]
[  246.122346]  [<ffffffffa008696d>] drm_ioctl+0x46d/0x570 [drm]
[  246.122350]  [<ffffffff811a23f7>] do_vfs_ioctl+0x2d7/0x4b0
[  246.122356]  [<ffffffff811a2650>] SyS_ioctl+0x80/0xa0
[  246.122359]  [<ffffffff815d807d>] system_call_fastpath+0x1a/0x1f
[  246.122361]  [<00007fd174c481e7>] 0x7fd174c481e6

Comment 10 Ildar Nurislamov 2014-04-15 13:33:13 UTC

As i see you've made the commit to your kernel branch to fix this.
I will wait for a fix in mainline kernel. Not sure about compiling 2.6 kernel for OpenSuse 13.1.
Weird though if this caused by page flipping issues why only Qt application affected (kwrite, okular appeared to be affected too, when dolphin is fine)
Mozilla Firefox never hangs and scrolls very fast even in fullscreen mode on text and images heavy pages. even Java draws Netbeans fast. Maybe that is the reason why QT uses raster backend by default.

Comment 11 Chris Wilson 2014-04-15 13:47:08 UTC

Right, I don't have an explanation as to how it becomes backlogged, but that seems to be the culprit behind the bad behaviour here.

This may help improve things in the meantime:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 0b8af54..30adeb1 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -4295,6 +4295,11 @@ struct kgem_bo *kgem_create_2d(struct kgem *kgem,
                bo = __kgem_bo_create_as_display(kgem, size, tiling, pitch);
                if (bo)
                        return bo;
+
+               if (bucket >= NUM_CACHE_BUCKETS)
+                       goto large_inactive;
+               else
+                       goto search_inactive;
        }
 
        if (bucket >= NUM_CACHE_BUCKETS) {

Comment 12 Ildar Nurislamov 2014-04-15 14:02:57 UTC

tried. caused segfault at X start:
[  4940.341] (EE) Backtrace:
[  4940.341] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x58d6c8]
[  4940.341] (EE) 1: /usr/bin/Xorg (0x400000+0x1916b9) [0x5916b9]
[  4940.341] (EE) 2: /lib64/libc.so.6 (0x7fa9e2e3a000+0x358d0) [0x7fa9e2e6f8d0]
[  4940.341] (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fa9dfed4000+0x27840) [0x7fa9dfefb840]
[  4940.341] (EE) 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fa9dfed4000+0x3ea8c) [0x7fa9dff12a8c]
[  4940.341] (EE) 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fa9dfed4000+0x6a755) [0x7fa9dff3e755]
[  4940.341] (EE) 6: /usr/bin/Xorg (0x400000+0xafabe) [0x4afabe]
[  4940.341] (EE) 7: /usr/bin/Xorg (0x400000+0x41436) [0x441436]
[  4940.341] (EE) 8: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x7fa9e2e5bbe5]
[  4940.342] (EE) 9: /usr/bin/Xorg (0x400000+0x2ca91) [0x42ca91]
[  4940.342] (EE) 
[  4940.342] (EE) Segmentation fault at address 0x6c

Comment 13 Chris Wilson 2014-04-15 16:03:43 UTC

Oops, the patch was silly and overlooked gcc warning about a valid issue. Anyway I've applied the couple of ideas discussed today to xf86-video-intel.git, and would very much appreciate some feedback.

Comment 14 Chris Wilson 2014-04-15 17:27:56 UTC

There is one other thing to be wary of, though I would expect it to affect every setup, Xorg-1.15.0 + mesa-10.1 have a bug in that kwin tries to use a 8x msaa framebuffer. This causes everything to be much slower than normal, trigger GPU hangs and cause corruption.

Comment 15 Ildar Nurislamov 2014-04-15 19:54:35 UTC

Thanks. I hope tomorrow i will have time to test your last fixes.

I prefer to use Xrender for compositing in kwin. Don't know why they call it "software". Because with your driver it seems to be 100% hardware. It is faster(in gtkperf) and less CPU hungry than opengl. And how is it "software" if i can cover window with 3d game with another transparend window and still have 60fps?

Comment 16 Ildar Nurislamov 2014-04-16 06:41:09 UTC

Tested latest from git. Didn't noticed any changes.

Comment 17 Chris Wilson 2014-04-16 06:45:59 UTC

Can you please look out for more iowaits? 'sudo perf record -g -a -e i915:i915_gem_request_wait_begin' should help

Comment 18 Chris Wilson 2014-04-16 06:48:01 UTC

And :( for no change. I thought there would have at least been some small improvement.

Comment 19 Ildar Nurislamov 2014-04-16 06:59:50 UTC

Created attachment 97446 [details]
perf record

Maybe it takes longer to cause a freeze now. This perf record contain a lot of scrolling and a spike of 100% iowait in the end.

And one more thing that i didnt tell before - After this freeze it draws a lot of frames very fast - like all the frames that it should have drawn.

Comment 20 Chris Wilson 2014-04-16 07:17:24 UTC

Apologies, should have mentioned the output of perf record has to be decoded on the same machine for symbol resolution (it is much simpler that way). Please run it through "perf report" and hope that it managed to unwind through the ioctl.

Comment 21 Ildar Nurislamov 2014-04-16 07:42:12 UTC

Created attachment 97454 [details]
perf.stdio

hm. not much of symbols info it could find in my system.

Comment 22 Chris Wilson 2014-04-17 06:59:02 UTC

Annoying, it appears that perf has dropped all the relevant frames to figure out if I need to be worried, or to suggest what is going on. Do you still see the high iowaits? Do you think you could hit one with a sysrq-T again?

Comment 23 Ildar Nurislamov 2014-04-17 07:24:27 UTC

Created attachment 97492 [details]
SysRq-t

freeze was long enough to make 2 SysRq-t prints

Comment 24 Chris Wilson 2014-04-17 07:58:08 UTC

Hmm.

[  253.551569] Xorg            D ffff88021fa13800     0  1348   1302 0x00400004
[  253.551571]  ffff880036459a68 0000000000000082 ffff8802143d2050 ffff880036459fd8
[  253.551572]  0000000000013800 0000000000013800 ffff8802143d2050 ffff88021fa140b8
[  253.551573]  ffff88021165c000 0000000000000001 0000000000000000 ffff8802143d2050
[  253.551575] Call Trace:
[  253.551577]  [<ffffffff815cdfb8>] io_schedule+0x98/0x130
[  253.551589]  [<ffffffffa0117915>] __wait_seqno+0x315/0x580 [i915]
[  253.551606]  [<ffffffffa0118a6a>] i915_gem_object_wait_rendering+0x3a/0x70 [i915]
[  253.551622]  [<ffffffffa0118f3b>] i915_gem_object_finish_gpu.part.50+0xb/0x20 [i915]
[  253.551638]  [<ffffffffa011bef8>] i915_vma_unbind+0x1d8/0x2b0 [i915]
[  253.551657]  [<ffffffffa011ca5f>] i915_gem_object_pin+0x17f/0x750 [i915]
[  253.551676]  [<ffffffffa011d917>] i915_gem_object_pin_to_display_plane+0xc7/0x190 [i915]
[  253.551696]  [<ffffffffa013608f>] intel_pin_and_fence_fb_obj+0x9f/0x140 [i915]
[  253.551726]  [<ffffffffa013617f>] intel_gen7_queue_flip+0x4f/0x300 [i915]
[  253.551756]  [<ffffffffa013e819>] intel_crtc_page_flip+0x1d9/0x3c0 [i915]
[  253.551789]  [<ffffffffa0096bde>] drm_mode_page_flip_ioctl+0x15e/0x2d0 [drm]
[  253.551807]  [<ffffffffa008696d>] drm_ioctl+0x46d/0x570 [drm]
[  253.551811]  [<ffffffff811a23f7>] do_vfs_ioctl+0x2d7/0x4b0
[  253.551817]  [<ffffffff811a2650>] SyS_ioctl+0x80/0xa0
[  253.551820]  [<ffffffff815d807d>] system_call_fastpath+0x1a/0x1f
[  253.551822]  [<00007fe5cdca91e7>] 0x7fe5cdca91e6


[  257.990590] Xorg            D ffff88021fa13800     0  1348   1302 0x00400004
[  257.990591]  ffff880036459a68 0000000000000082 ffff8802143d2050 ffff880036459fd8
[  257.990593]  0000000000013800 0000000000013800 ffff8802143d2050 ffff88021fa140b8
[  257.990594]  ffff88021165c000 0000000000000001 0000000000000000 ffff8802143d2050
[  257.990596] Call Trace:
[  257.990598]  [<ffffffff815cdfb8>] io_schedule+0x98/0x130
[  257.990614]  [<ffffffffa0117915>] __wait_seqno+0x315/0x580 [i915]
[  257.990634]  [<ffffffffa0118a6a>] i915_gem_object_wait_rendering+0x3a/0x70 [i915]
[  257.990652]  [<ffffffffa0118f3b>] i915_gem_object_finish_gpu.part.50+0xb/0x20 [i915]
[  257.990671]  [<ffffffffa011bef8>] i915_vma_unbind+0x1d8/0x2b0 [i915]
[  257.990693]  [<ffffffffa011ca5f>] i915_gem_object_pin+0x17f/0x750 [i915]
[  257.990712]  [<ffffffffa011d917>] i915_gem_object_pin_to_display_plane+0xc7/0x190 [i915]
[  257.990735]  [<ffffffffa013608f>] intel_pin_and_fence_fb_obj+0x9f/0x140 [i915]
[  257.990767]  [<ffffffffa013617f>] intel_gen7_queue_flip+0x4f/0x300 [i915]
[  257.990802]  [<ffffffffa013e819>] intel_crtc_page_flip+0x1d9/0x3c0 [i915]
[  257.990836]  [<ffffffffa0096bde>] drm_mode_page_flip_ioctl+0x15e/0x2d0 [drm]
[  257.990859]  [<ffffffffa008696d>] drm_ioctl+0x46d/0x570 [drm]
[  257.990864]  [<ffffffff811a23f7>] do_vfs_ioctl+0x2d7/0x4b0
[  257.990870]  [<ffffffff811a2650>] SyS_ioctl+0x80/0xa0
[  257.990872]  [<ffffffff815d807d>] system_call_fastpath+0x1a/0x1f
[  257.990875]  [<00007fe5cdca91e7>] 0x7fe5cdca91e6

Comment 25 Chris Wilson 2014-04-17 08:00:58 UTC

Do you have VT'd (iommu) enabled? Otherwise I expect it had to move the fb into the mappable portion.

Comment 26 Chris Wilson 2014-04-17 08:05:32 UTC

I still haven't explained how we end up with the backlog, but perhaps this will help again with alleviating the symptoms:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index a840ed6..7ba3ac9 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -4194,6 +4194,12 @@ static void __kgem_bo_make_scanout(struct kgem *kgem,
            do_ioctl(kgem->fd, DRM_IOCTL_MODE_ADDFB, &arg) == 0) {
                bo->scanout = true;
                bo->delta = arg.fb_id;
+
+               /* Pre-emptively move the object into the mappable
+                * portion to avoid rebinding later. */
+               bo->map__gtt = __kgem_bo_map__gtt(kgem, bo);
+               if (bo->map__gtt)
+                       *bo->map__gtt = 0;
        }
 }

Comment 27 Chris Wilson 2014-04-17 08:06:39 UTC

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index a840ed6..24841dd 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -4194,6 +4194,12 @@ static void __kgem_bo_make_scanout(struct kgem *kgem,
            do_ioctl(kgem->fd, DRM_IOCTL_MODE_ADDFB, &arg) == 0) {
                bo->scanout = true;
                bo->delta = arg.fb_id;
+
+               /* Pre-emptively move the object into the mappable
+                * portion to avoid rebinding later. */
+               bo->map__gtt = __kgem_bo_map__gtt(kgem, bo);
+               if (bo->map__gtt)
+                       *(uint32_t *)bo->map__gtt = 0;
        }
 }

Comment 28 Ildar Nurislamov 2014-04-17 08:52:29 UTC

Created attachment 97499 [details]
SysRq-t with last patch

I cant get mouse cursor freeze after the last patch. and my hand is tired. And not sure about iowaits anymore. Sometimes they happen and i tried to catch them with SysRt-t but something changed in those trace or i failed to catch the right moment.
Screen still stops refreshing. Mouse cursor moves fine and CPU load looks ok most of the time. But everything is still as laggy as before.

My CPU Core™ i5-3450 doesnt support VT-d.

Comment 29 Chris Wilson 2014-04-17 09:06:52 UTC

(In reply to comment #28)
> Screen still stops refreshing. Mouse cursor moves fine and CPU load looks ok
> most of the time. But everything is still as laggy as before.

Good. That is actually what I expect. We haven't done anything to prevent the backlog, so eventually we have to service all that rendering debt and stop processing new commands. However, whereas previously that happened at an unexpected point and blocked the Xserver, it now happens more or less under our control and we allow X to continue processing requests and update the cursor.

Thanks for all your help so far, I need to think how we can get a clear picture of what is causing the backlog.

Comment 30 Chris Wilson 2014-04-17 09:09:21 UTC

Here's something you can try. Roll back a few commits in xf86-video-intel.git so that you can easily trigger the massive iowaits again. Then in the middle of a freeze, "echo 1 | sudo tee /sys/kernel/debug/dri/0/i915_wedged" and please attach /sys/class/drm/card0/error

Comment 31 Ildar Nurislamov 2014-04-17 10:01:23 UTC

Created attachment 97501 [details]
/sys/class/drm/card0/error

first 3 i did with VT switch and execute command.
last 3 - just pressed enter in the same Konsole that i scrolled.
not sure about catching exact moment. it seemed like this command unfreeze the GPU.

Comment 32 Ildar Nurislamov 2014-04-17 11:43:30 UTC

Hm. I mentioned at the beginning that "Problem is not affected by method i use to achieve Vsync (driver or Kwin's one)"
This is not the case anymore. Or rather i was wrong that time. if i disable TearFree and enable OpenGL 3.1 in kwin with vsync no problem happens. But cursor flickers annoyingly and black boxes appears randomly on screen - mesa bug that you mentioned? 
Also i made few roll backs to previos revisions till 15 Oct to check whether it was a regression. Nope. TearFree always worked this way. It's me that started using it few month ago.

Comment 33 Ildar Nurislamov 2014-04-17 11:56:49 UTC

OMG. My previous comment was not correct. i forgot to come back to HEAD version.
So that comment applies to 15 Oct version.

Comment 34 Chris Wilson 2014-04-18 13:26:10 UTC

I believe this should be fixed by

commit fdc28205d590d560a6f3886189ef974d1d00f2cd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 18 11:34:07 2014 +0100

    sna: Always flush the shadow scanout
    
    When the shadow scanout is active, make sure we call the BlockHandler
    every time, not just when we about to sleep. This is in case we are
    swamped by clients trying to render and forgo the important step of
    ensuring that their output reaches the screen.
    
    Reported-by: Ildar Nurislamov <absorbb@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 35 Ildar Nurislamov 2014-04-18 14:03:17 UTC

It is sad. But nothing changed for me.
I've made some additional tests with different modes. Maybe this can shed some light:
1. TearFree true, kwin Xrender - lag that grows in time. it takes seconds to draw 1 frame after some time of intense scrolling
2. TearFree false, kwin Xrender - smooth. kinda. there is a delay in drawing actual screen content that grows in time. so after i stop doing anything my screens keeps draw all the frames until it catches up.
3. TearFree false, kwin OpenGL 3.1, Vsync Full screen repaint - same as 1 but lag between redraws grows with slower speed.
4. TearFree false, kwin OpenGL 3.1, Vsync None - same as 1 but lag between redraws grows with slower speed.

Something wrong with the way kde applications draws themselves. Even kwin can't handle it. Or maybe this is kwin's bug. Im going to test with different WM...

Comment 36 Ildar Nurislamov 2014-04-18 14:25:08 UTC

the same with Icewm.
i dont know if this problem deserve so much attention. QT doesn't use xrender by default anyway. Of course it is surprising how application produces frames without caring about whether they was drawn on screen and no proper frame skipping takes place. I believe that app always must try to show its actual state. The most surprising thing for me is - where all this queue of frames to be drawn is stored. 
in 2. TearFree false, kwin Xrender mode it takes seconds or even dozens of second to catch up to actual state. And all the frames are drawn. Where was they stored .. so much data.

Offtopic: i've noticed sna performance regression with 3.14 kernel. from 1.18s in gtkperf to 1.88s. should i report bug or this is smth in OpenSuSE?

Comment 37 Chris Wilson 2014-04-18 19:53:30 UTC

I'll take another pass at seeing if I can reproduce the slow rendering. You did lead me to find a real issue with TearFree, but it appears to have been a separate issue entirely. :(

(In reply to comment #36)
> Offtopic: i've noticed sna performance regression with 3.14 kernel. from
> 1.18s in gtkperf to 1.88s. should i report bug or this is smth in OpenSuSE?

Please do file a bug report. (When in doubt, always file a new bug report.) Try to give the simplest (but as detailed as possible) steps to reproduce and include an Xorg.0.log so that I can check hw and sw versions later.

Comment 38 Ildar Nurislamov 2014-04-18 20:02:38 UTC

Try to test on bigger screen resolution. i have 1920x1080. As i sad before problem happens only with big window size. Maybe your GPU need 2560x1600 or 4k to reach the limits.
I will be able to assist you and file a bug report at Monday. Because i have Linux and Core I5 at my PC at work.

Comment 39 Chris Wilson 2014-04-21 07:41:09 UTC

I haven't yet reproduced the symptoms of the render backlog spiralling out of control (went up to 1366x768 + 2560+1440 with a lowly IVB celeron). But then I realised I am testing on my tweaked kernels and should try a vanilla 3.13/3.14...

Comment 40 Ildar Nurislamov 2014-04-21 08:02:47 UTC

Can you reproduce problem now?
Tested 2.99.911-91-gb259cbf - no difference.

Comment 41 Ildar Nurislamov 2014-04-22 07:19:18 UTC

Compiled kernel from your bug77436 branch - 3.14.0-6.ga68bc7c-desktop+ #1 SMP PREEMPT ( i hope i did it right way ) - no difference for me. 
and performance regression described https://bugs.freedesktop.org/show_bug.cgi?id=77712 also applies to it

Comment 42 Chris Wilson 2014-04-22 10:52:39 UTC

I've been playing with the stock Ubuntu 3.13.0-24 kernel to see if anything unusual happens, no luck so far.

Comment 43 Chris Wilson 2014-05-16 18:12:32 UTC

I'm still puzzling how we end up a rendering backlog here. As always if you can check the latest git updates and see if the damage reduction for TearFree helps, that would be useful.

Comment 44 Ildar Nurislamov 2014-05-16 18:54:50 UTC

i build updated git version every working day since then. especially after tearfree related commits. no changes was noticed since. btw is this possible rightnow to use DRI3?

Comment 45 Chris Wilson 2014-05-16 18:59:58 UTC

It's possible -- just needs some not upstreamed code. I am not yet impressed by the Present implementation, it is not a replacement for DRI2 yet. But DRI3, modulo a few protocol bugs!, is ready.

Comment 46 Chris Wilson 2014-06-05 14:00:17 UTC

Ok, DRI3/Present support is available upstream now. As are more refinements to TearFree. If you do have the opportunity, I would like to now how it all fares (DRI2 and/or DRI3, with/without TearFree).

Comment 47 Ildar Nurislamov 2014-06-05 14:47:22 UTC

I saw all this commits and built driver after every bunch of them. No changes was noticed.
[281306.077] (II) intel(0): [DRI2] Setup complete
[281306.077] (II) intel(0): [DRI2]   DRI driver: i965
[281306.077] (II) intel(0): [DRI2]   VDPAU driver: i965
[281306.077] (II) intel(0): direct rendering: DRI2 DRI3 enabled
[281306.090] (II) GLX: Initialized DRI2 GL provider for screen 0
I believe this means that DRI3 is enabled. I did more testing today:
AccelMethod "uxa" and AccelMethod "none" still doing great here.
TearFree "false" with sna is still produce huge delay in drawing actual frames. That is why i believe problem is not TearFree related. It just appears differently and without TearFree it is much more weird. Low framerate and hangs are common things but delay in drawing actual screen content while doing it with a quite smooth framerate is something i experience the first time in my entire PC usage practice.

Some new observations:
1. Lag depends of amount of text drawn in Konsole. Maybe
2. Running Diablo 3 in parallel improves the situation. Amount of improvements depends on scene that opened in Diablo. If it is character menu - scrolling is almost smooth, if it is actual game results are mixed.

Comment 48 Chris Wilson 2014-06-05 15:06:44 UTC

Hmm, that is starting to sound like cpufreq - glyph rendering should be limited to how fast we can construct the GPU commands. I don't think we played around with cpufreq or pstate in this bug yet...

Comment 49 Ildar Nurislamov 2014-06-05 15:31:14 UTC

Maybe. but not entirely. I use performance governor since then. And i don't think that this is glyph rendering performance problem even if it depends on amount of symbols (which i'm not entirely sure of). Other applications like gtkperf draws and scroll text very fast even in maximized window.
What Diablo may be doing is maxing out GPU frequency (Do your GPU use higher freq in 3D mode?). So in character menu we have max frequency but very low GPU usage - and scrolling become faster. In actual game GPU usage is very high so even GPU freq is still at max value - this doesn't help much.
Im going to check out TearFree off with Diablo..

Comment 50 Chris Wilson 2014-06-05 15:40:11 UTC

If you do suspect that the GPU is not reclocking as required, you can get my monitor from http://cgit.freedesktop.org/~ickle/linux-2.6/ (#master is current) and the intel-gpu-overlay from http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/ which provide an overlay showing what the GPU is doing.

Comment 51 Ildar Nurislamov 2014-06-05 15:54:36 UTC

Same story - with diablo it is hardly possible to achieve delay.
I came out with different theory. what if Diablo actually makes everything slower (as it should). And Konsole manages to draw much less fps so frame buffers or queues (i don't know) doesn't overrun. Yes i have no idea how driver works. But behavior with TearFree disabled make me believe that there must be some queue somewhere and that this queue can be insanely long like few seconds of drawing at ~20fps (what i think is weird. before i though that there is tripple buffer and nothing more)
Or maybe this is not frames queue in the driver but rather X protocol commands queue in the X server. And QT is the one who stupidly overwhelm server with commands without much care of produced result. QT doesn't use xrender backend by default. It may be abandoned and broken.

Comment 52 Ildar Nurislamov 2014-06-05 16:01:28 UTC

damn. this theory doesn't explain why uxa is doing well.

Comment 53 Ildar Nurislamov 2014-06-10 12:35:52 UTC

Created attachment 100808 [details]
intel-gpu-overlay output + intel_gpu_top

tried your kernel. But intel-gpu-overlay 1.6 says that GPU freq not found in debugfs.
Anyway i caught some unusual output from dmesg:
[   89.523354] ------------[ cut here ]------------
[   89.523362] WARNING: CPU: 2 PID: 2623 at drivers/gpu/drm/i915/intel_display.c:9494 intel_check_page_flip+0x117/0x130()
[   89.523363] Kicking stuck page flip: queued at 5336, now 5341
[   89.523364] Modules linked in: fuse af_packet nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel intel_powerclamp snd_hda_controller coretemp snd_hda_codec snd_hwdep snd_pcm kvm snd_seq snd_timer snd_seq_device keucr(C) crct10dif_pclmul crc32_pclmul crc32c_intel snd ghash_clmulni_intel usb_storage aesni_intel ablk_helper iTCO_wdt cryptd r8169 iTCO_vendor_support lrw gf128mul mei_me glue_helper mii aes_x86_64 pcspkr soundcore mei sr_mod serio_raw cdrom lpc_ich shpchp mfd_core i2c_i801 tpm_tis wmi tpm battery sg dm_mod efivarfs fan thermal processor scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh
[   89.523394] CPU: 2 PID: 2623 Comm: konsole Tainted: G         C    3.15.0-rc7-2.g0de0f93-desktop+ #1
[   89.523395] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.13 01/11/2013
[   89.523396]  0000000000000009 ffff88021fb03d80 ffffffff816e80cc ffff88021fb03dc8
[   89.523398]  ffff88021fb03db8 ffffffff8105b02d ffff8802124f9000 ffff88021247c000
[   89.523399]  ffff88021247c198 0000000000000086 0000000000000009 ffff88021fb03e18
[   89.523401] Call Trace:
[   89.523402]  <IRQ>  [<ffffffff816e80cc>] dump_stack+0x4d/0x6f
[   89.523409]  [<ffffffff8105b02d>] warn_slowpath_common+0x7d/0xa0
[   89.523410]  [<ffffffff8105b09c>] warn_slowpath_fmt+0x4c/0x50
[   89.523414]  [<ffffffff81493309>] ? __intel_pageflip_stall_check+0xc9/0x140
[   89.523416]  [<ffffffff814a1e17>] intel_check_page_flip+0x117/0x130
[   89.523418]  [<ffffffff8147a06b>] ironlake_irq_handler+0x15b/0xdf0
[   89.523420]  [<ffffffff8153e76c>] ? __usb_hcd_giveback_urb+0x9c/0x100
[   89.523423]  [<ffffffff810b793e>] handle_irq_event_percpu+0x3e/0x1f0
[   89.523424]  [<ffffffff810b7b33>] handle_irq_event+0x43/0x60
[   89.523426]  [<ffffffff810ba811>] handle_edge_irq+0x91/0x160
[   89.523428]  [<ffffffff81004d61>] handle_irq+0x21/0x30
[   89.523431]  [<ffffffff816fb93f>] do_IRQ+0x4f/0xf0
[   89.523433]  [<ffffffff816f1b6d>] common_interrupt+0x6d/0x6d
[   89.523434]  <EOI>  [<ffffffff816f972d>] ? system_call_fastpath+0x1a/0x1f
[   89.523436] ---[ end trace ff48f125529813ef ]---

And made a picture of gpu_top and gpu-overlay. Xorg waits showed some extreme numbers for a moments but was 300-500ms most of time.

Comment 54 Chris Wilson 2014-06-10 12:44:37 UTC

Ugh. That's really, really bad. It says the the render engine on the GPU is stalling for ages upon ages waiting for memory access. I suspect the reason why UXA is not so bad in this case is that it simply doesn't use the GPU that much...


If you repeat this experiment with Option "AccelMethod" "blt", how does it fare?

intel-gpu-overlay from i-g-t-1.7 should show the frequencies with the recent kernels.

Comment 55 Chris Wilson 2014-06-10 13:05:52 UTC

It reminds me of severe TLB miss:

Can you please try

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index ce137bd..b08dc5c 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -1295,6 +1295,11 @@ gen7_bind_bo(struct sna *sna,
                return offset * sizeof(uint32_t);
        }
 
+       if (bo->tiling == I915_TILING_NONE && bo->pitch >= 4096) {
+               ErrorF("slow %s access to linear bo with pitch %d (handle=%d, size=%d), scanout? %d, io? %d\n",
+                      is_dst ? "write" : "read", bo->pitch, bo->handle, kgem_bo_size(bo), bo->scanout, bo->io);
+       }
+
        offset = sna->kgem.surface -=
                sizeof(struct gen7_surface_state) / sizeof(uint32_t);
        ss = sna->kgem.batch + offset;


and see if that warns about anything.

Comment 56 Ildar Nurislamov 2014-06-10 13:30:03 UTC

Created attachment 100819 [details]
intel-gpu-overlay 1.7 output + intel_gpu_top + Diablo 3

I tests with my kernel now 3.14.4. Yours one hangs PC completely when Diablo 3 is running.
Frequency is maxed out when i scroll. with Diablo it is maxed out always. So 1100Mhz in both cases. Everything else seems differently. 

I didnt see any warnings after your patch in dmesg or Xorg.log. Should i compile debug version?

Comment 57 Chris Wilson 2014-06-10 13:42:36 UTC

(In reply to comment #56)
> Created attachment 100819 [details]
> intel-gpu-overlay 1.7 output + intel_gpu_top + Diablo 3
> 
> I tests with my kernel now 3.14.4. Yours one hangs PC completely when Diablo
> 3 is running.

That is scary. Would be nice to know if it oopsed.

> Frequency is maxed out when i scroll. with Diablo it is maxed out always. So
> 1100Mhz in both cases. Everything else seems differently. 
> 
> I didnt see any warnings after your patch in dmesg or Xorg.log. Should i
> compile debug version?

No. It will print out into the logfile (and to stderr) if it spots anything egregious. So back to the drawing board.

Comment 58 Ildar Nurislamov 2014-06-10 14:02:06 UTC

(In reply to comment #57)
> That is scary. Would be nice to know if it oopsed.
Didn't find anything about that in /var/log/messages
> No. It will print out into the logfile (and to stderr) if it spots anything
> egregious. So back to the drawing board.
And where is that logfile. Cant find that messages anywhere. i assume condition never matched.

Continue to play with your kernel. last time when i tried to run it - it couldn't start X because i915 failed to open VT7 o_O

Comment 59 Chris Wilson 2014-06-10 14:05:13 UTC

It used to be in /var/log/Xorg.0.log. If you are using systemd there is an arcane spell to retrieve it. The VT bug is not new, the only thing that is new is that kernel detects it and X dies instead of it hanging indefinitely.

Comment 60 Ildar Nurislamov 2014-06-10 14:13:02 UTC

Smth new int Xorg.0.log:
(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x58dd48]
(EE) 1: /usr/bin/Xorg (mieqEnqueue+0x22b) [0x56fc5b]
(EE) 2: /usr/bin/Xorg (QueuePointerEvents+0x52) [0x454632]
(EE) 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fc777f9b000+0x54c0) [0x7fc777fa04c0]
(EE) 4: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fc777f9b000+0x593a) [0x7fc777fa093a]
(EE) 5: /usr/bin/Xorg (0x400000+0x7aa58) [0x47aa58]
(EE) 6: /usr/bin/Xorg (0x400000+0xa3800) [0x4a3800]
(EE) 7: /lib64/libc.so.6 (0x7fc77f049000+0x358d0) [0x7fc77f07e8d0]
(EE) 8: /lib64/libc.so.6 (ioctl+0x7) [0x7fc77f1291e7]
(EE) 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc77baa7000+0x232dc) [0x7fc77baca2dc]
(EE) 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc77baa7000+0x2ba3c) [0x7fc77bad2a3c]
(EE) 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc77baa7000+0x3b7bd) [0x7fc77bae27bd]
(EE) 12: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fc77baa7000+0x44d4b) [0x7fc77baebd4b]
(EE) 13: /usr/bin/Xorg (0x400000+0x17ea4b) [0x57ea4b]
(EE) 14: /usr/bin/Xorg (0x400000+0xc8a35) [0x4c8a35]
(EE) 15: /usr/bin/Xorg (0x400000+0x3abd6) [0x43abd6]
(EE) 16: /usr/bin/Xorg (0x400000+0x3d6ce) [0x43d6ce]
(EE) 17: /usr/bin/Xorg (0x400000+0x414ea) [0x4414ea]
(EE) 18: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x7fc77f06abe5]
(EE) 19: /usr/bin/Xorg (0x400000+0x2cb41) [0x42cb41]
(EE) 
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
(EE) [mi] EQ overflow continuing.  100 events have been dropped.

Comment 61 Ildar Nurislamov 2014-06-10 14:23:44 UTC

Here we are:
[   852.533] slow read access to linear bo with pitch 5464 (handle=697, size=1642496), scanout? 0, io? 0
[   852.551] slow read access to linear bo with pitch 5464 (handle=697, size=1642496), scanout? 0, io? 0
[   852.567] slow read access to linear bo with pitch 5464 (handle=697, size=1642496), scanout? 0, io? 0

The thing is these messages never appear when i scroll in Konsole - that is why i didnt see them. Running OpenSuSE YaST triggers them, but it works fine.

Comment 62 Ildar Nurislamov 2014-06-10 14:33:53 UTC

(In reply to comment #54)
> If you repeat this experiment with Option "AccelMethod" "blt", how does it
> fare?
No bug with Option "AccelMethod" "blt". But drawing is broking at some places. Gtkpeft cant even draw results properly.

Comment 63 Chris Wilson 2014-06-11 07:14:13 UTC

Hmm, did not see gtkperf fail, but I did see some unhappiness with PutImage, fixed by

commit 53c1faa5093f8a0c969e3ebe9f27ee1ce5137e14
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 11 07:59:41 2014 +0100

    sna: Mark the GPU as all damaged when discarding CPU bo during uploads

Do you have any more failures with BLT?

Comment 64 Ildar Nurislamov 2014-06-11 09:15:10 UTC

Created attachment 100870 [details]
Screen garbage with blt

It seems that all GTK applications affected except Firefox (which is barely GTK this days): gtkperf, mysql-workbench, GIMP
If i use Xrender for composting - area where left panel drops shadow sometimes forget to update.

Also http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/ results dropped from 24-25fps to 14-15 in Firefox 33.0a1

Comment 65 Chris Wilson 2014-06-11 11:17:28 UTC

(In reply to comment #64)
> Also http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/
> results dropped from 24-25fps to 14-15 in Firefox 33.0a1

As a result of BLT, or with normal acceleration? BLT is a fallback acceleration in case where there is no render acceleration for that GPU. (But it also provides acceleration for legacy X drawing using the dedicated hw.)

Comment 66 Ildar Nurislamov 2014-06-11 11:39:30 UTC

(In reply to comment #65)
> As a result of BLT, or with normal acceleration?
As a result of BLT

Btw. Lastest commits made Xorg crash with BLT:
[ 74651.198] (EE) Backtrace:
[ 74651.198] (EE) 0: /usr/bin/Xorg (xorg_backtrace+0x48) [0x58dd48]
[ 74651.198] (EE) 1: /usr/bin/Xorg (0x400000+0x191d89) [0x591d89]
[ 74651.198] (EE) 2: /lib64/libc.so.6 (0x7f700c256000+0x358d0) [0x7f700c28b8d0]
[ 74651.198] (EE) 3: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f7008cb4000+0x5ce6a) [0x7f7008d10e6a]
[ 74651.198] (EE) 4: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f7008cb4000+0x452c2) [0x7f7008cf92c2]
[ 74651.198] (EE) 5: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f7008cb4000+0x45b59) [0x7f7008cf9b59]
[ 74651.198] (EE) 6: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f7008cb4000+0x5abb3) [0x7f7008d0ebb3]
[ 74651.198] (EE) 7: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f7008cb4000+0x5ef37) [0x7f7008d12f37]
[ 74651.198] (EE) 8: /usr/bin/Xorg (0x400000+0x117fd2) [0x517fd2]
[ 74651.198] (EE) 9: /usr/bin/Xorg (0x400000+0x10e1ea) [0x50e1ea]
[ 74651.198] (EE) 10: /usr/bin/Xorg (0x400000+0x3d6ce) [0x43d6ce]
[ 74651.198] (EE) 11: /usr/bin/Xorg (0x400000+0x414ea) [0x4414ea]
[ 74651.198] (EE) 12: /lib64/libc.so.6 (__libc_start_main+0xf5) [0x7f700c277be5]
[ 74651.198] (EE) 13: /usr/bin/Xorg (0x400000+0x2cb41) [0x42cb41]
[ 74651.198] (EE) 
[ 74651.198] (EE) Segmentation fault at address 0x82c000
[ 74651.198] (EE) 
Fatal server error:
[ 74651.198] (EE) Caught signal 11 (Segmentation fault). Server aborting

Comment 67 Chris Wilson 2014-06-11 11:58:53 UTC

Exciting. Please run addr2line -e /usr/lib64/xorg/modules/drivers/intel_drv.so  -i 0x5ce6a 0x452c2

Comment 68 Ildar Nurislamov 2014-06-11 12:01:22 UTC

/home/absorbb/Documents/xf86-video-intel/src/sna/sna_blt.c:3796 (discriminator 1)
/home/absorbb/Documents/xf86-video-intel/src/sna/sna_accel.c:3295

Comment 69 Chris Wilson 2014-06-11 12:53:57 UTC

If I understand correctly, the only way that could happen was if the damage was empty. Hmm. I couldn't see a way we would make that mistake, it would cause various issues (but shouldn't actually affect rendering) as the assumption is that no damage is tracked by the NULL pointer.

Comment 70 Chris Wilson 2014-06-11 14:43:49 UTC

This is hopefully the cause of your crash:


commit a90160dcecf0a3df21a04b4f467e660f69ddae54
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 11 15:40:25 2014 +0100

    sna: Update damage pointer after the implicit reduction
    
    sna_damage_contains_box() routine implicitly reduces the damage before
    performing its check. This may alter and even destroy the damage entry,
    so pass in the handle so that it can be updated correctly.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 71 Ildar Nurislamov 2014-06-11 14:49:20 UTC

(In reply to comment #70)
> This is hopefully the cause of your crash:
Yep. No crashes now.

Comment 72 Ildar Nurislamov 2014-06-11 14:56:46 UTC

This KDE Konsole bug was also fixed. Until this commit:
	sna: Remove debugging hunk
Now i have it again :(

Comment 73 Chris Wilson 2014-06-11 15:04:15 UTC

Haha! Ok, I have no idea yet, but at least we have narrowed down the exact code path that is troublesome!

Comment 74 Ildar Nurislamov 2014-06-11 15:10:21 UTC

I've rolled back to a90160dcecf0a3df21a04b4f467e660f69ddae54. And can confirm that it works fine. No significant slowdowns noticed. Maybe gtkperf is slower by 1-2% - hard to say.

We are going to have 4-day holidays here. Next time i be able to test smth at Monday.

Comment 75 Chris Wilson 2014-06-11 15:33:33 UTC

I think we will want to test next:

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index 3fcca3e..ad6ea32 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -2965,7 +2965,8 @@ fallback_blt:
                if (too_large(extents.x2-extents.x1, extents.y2-extents.y1))
                        goto fallback_blt;
 
-               if (can_switch_to_blt(sna, dst_bo, flags) &&
+               if ((sna->render_state.gen7.gt == 0 ||
+                    can_switch_to_blt(sna, dst_bo, flags)) &&
                    sna_blt_copy_boxes(sna, alu,
                                       src_bo, src_dx, src_dy,
                                       dst_bo, dst_dx, dst_dy,

Comment 76 Chris Wilson 2014-06-11 15:34:48 UTC

Sorry, that should be

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index 3fcca3e..d1115b9 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -2965,7 +2965,8 @@ fallback_blt:
                if (too_large(extents.x2-extents.x1, extents.y2-extents.y1))
                        goto fallback_blt;
 
-               if (can_switch_to_blt(sna, dst_bo, flags) &&
+               if ((sna->render_state.gen7.gt < 2 ||
+                    can_switch_to_blt(sna, dst_bo, flags)) &&
                    sna_blt_copy_boxes(sna, alu,
                                       src_bo, src_dx, src_dy,
                                       dst_bo, dst_dx, dst_dy,

Comment 77 Chris Wilson 2014-06-11 18:21:10 UTC

I pushed a variant as

commit 1beaa980ea6a9617f7dd4dc87b881c37cc7a277b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 11 16:41:22 2014 +0100

    sna/gen6+: On small GT avoid the intermediate copy for overlaps
    
    If the memory bw is constrained on the GPU avoid doing the 2-pass copy
    for overlaps on the render ring, and do the single pass slower BLT copy
    instead - as since it has to transfer less data it will be faster.
    
    Reported-by: Ildar Nurislamov <absorbb@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


If that makes scrolling fluid, great. I'd like to dig more into why it slowed down so much though. My guess is due to cache thrashing on the GPU and memory bandwidth, but I don't feel that adequately explains how far behind the GPU got in the command stream, nor how it interacted with a loaded system.

Comment 78 Ildar Nurislamov 2014-06-13 11:09:50 UTC

Current git version has the bug.

Comment 79 Chris Wilson 2014-06-13 11:21:47 UTC

Drat. I thought I had understood the failure path. :|

Comment 80 Chris Wilson 2014-06-13 11:28:47 UTC

If you try:

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index 3fcca3e..f0e68a2 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -2858,6 +2858,8 @@ prefer_blt_copy(struct sna *sna,
                struct kgem_bo *dst_bo,
                unsigned flags)
 {
+       return true;
+
        if (sna->kgem.mode == KGEM_BLT)
                return true;
 
does that prevent the lag?

And in particular,

diff --git a/src/sna/gen7_render.c b/src/sna/gen7_render.c
index 3fcca3e..41950ed 100644
--- a/src/sna/gen7_render.c
+++ b/src/sna/gen7_render.c
@@ -2965,8 +2965,7 @@ fallback_blt:
                if (too_large(extents.x2-extents.x1, extents.y2-extents.y1))
                        goto fallback_blt;
 
-               if (can_switch_to_blt(sna, dst_bo, flags) &&
-                   sna_blt_copy_boxes(sna, alu,
+               if (sna_blt_copy_boxes(sna, alu,
                                       src_bo, src_dx, src_dy,
                                       dst_bo, dst_dx, dst_dy,
                                       dst->drawable.bitsPerPixel,

Comment 81 Ildar Nurislamov 2014-06-13 11:40:54 UTC

Nope. 1nd alone and 1nd and 2nd together don't help.

Comment 82 Chris Wilson 2014-06-13 13:04:46 UTC

Actually, it is not quite as cut and dried as I earlier thought. If we force the fallback there, we move the entire pixmap to the CPU which encourages the next set of operations to be done on the CPU as well. So rather than just the copy operation being suspect, we prevent loading the GPU at all. Sigh.

Comment 83 Chris Wilson 2014-06-13 13:06:06 UTC

What happens with?

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index 5d8dd78..df9e24c 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -67,7 +67,7 @@
 #include "sna_render_inline.h"
 #include "fb/fbpict.h"
 
-#define FALLBACK 0
+#define FALLBACK 1
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
 #define NO_GLYPHS_VIA_MASK 0

Comment 84 Chris Wilson 2014-06-13 13:38:10 UTC

What happens with?

diff --git a/src/sna/gen6_common.h b/src/sna/gen6_common.h
index 6668620..1277804 100644
--- a/src/sna/gen6_common.h
+++ b/src/sna/gen6_common.h
@@ -106,6 +106,8 @@ static int prefer_blt_bo(struct sna *sna, struct kgem_bo *bo)
 
 inline static bool force_blt_ring(struct sna *sna)
 {
+       return true;
+
        if (sna->flags & SNA_POWERSAVE)
                return true;

Comment 85 Chris Wilson 2014-06-13 13:43:49 UTC

The other approach would be to bisect to find the slowdown.

Comment 86 Ildar Nurislamov 2014-06-16 07:03:49 UTC

1st patch helps.
2nd patch don't.
I tested them separately. Should i apply them together?

Comment 87 Chris Wilson 2014-06-16 07:18:13 UTC

(In reply to comment #86)
> 1st patch helps.
> 2nd patch don't.
> I tested them separately. Should i apply them together?

Nope. They were different tress I was barking at. This is slightly more conclusive that is the glyph rendering that is producing too much load.

First can you quickly test master as I tweaked placement of glyph operations.

Then check

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index e8d92b8..5350eb0 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -70,7 +70,7 @@
 #define FALLBACK 0
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
-#define NO_GLYPHS_VIA_MASK 0
+#define NO_GLYPHS_VIA_MASK 1
 #define NO_SMALL_MASK 0
 #define NO_GLYPHS_SLOW 0
 #define NO_DISCARD_MASK 0

and then

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index e8d92b8..f7b8607 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -71,7 +71,7 @@
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
 #define NO_GLYPHS_VIA_MASK 0
-#define NO_SMALL_MASK 0
+#define NO_SMALL_MASK 1
 #define NO_GLYPHS_SLOW 0
 #define NO_DISCARD_MASK 0

Comment 88 Ildar Nurislamov 2014-06-16 07:38:19 UTC

Master: nope
NO_GLYPHS_VIA_MASK 1: helps
NO_SMALL_MASK 1: nope

FALLBACK and NO_GLYPHS_VIA_MASK regresses gtkperf ~5%. But i suppose no surprise here.

Comment 89 Chris Wilson 2014-06-16 09:14:09 UTC

On the off-chance that it helps:

commit 62102f505cd13840e4c910adbe762b3fb46dfaec
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 16 09:46:54 2014 +0100

    sna: Promote better active buffer reuse
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 90 Chris Wilson 2014-06-16 09:35:29 UTC

And another glyph test...

Update to

commit 10cb36e1ddc6f4bf41941b24d6557343aa595a13
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 16 10:33:07 2014 +0100

    sna/glyphs: Add DBG option to force use of the glyph image mask
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Then apply

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index 1ce7170..f433573 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -71,7 +71,7 @@
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
 #define NO_GLYPHS_VIA_MASK 0
-#define FORCE_SMALL_MASK 0 /* -1 = never, 1 = always */
+#define FORCE_SMALL_MASK 1 /* -1 = never, 1 = always */
 #define NO_GLYPHS_SLOW 0
 #define NO_DISCARD_MASK 0

Comment 91 Ildar Nurislamov 2014-06-16 09:38:29 UTC

Compiled 6b82962e58c0959c94f2df1f0ebd9a478b15cb6b.
Still has the bug.
But i noticed with latest master versions that lag duration is lower across the board but not significantly.

Going to try FORCE_SMALL_MASK...

Comment 92 Ildar Nurislamov 2014-06-16 09:42:41 UTC

It didn't help

Comment 93 Ildar Nurislamov 2014-06-16 09:57:14 UTC

Played with font size.
Droid Sans Mono - 5pt still has the bug.
4pt - no bug. i made a script that fills entire console area with some text - still no bug

Comment 94 Ildar Nurislamov 2014-06-16 10:04:36 UTC

ouch. with this script output no bug even with a larger font size.
it seems like it has something to do with glyph cash.
dmesg has a large variety of symbols. but text that i use in script doesn't - i just repeat some text paragraph in a loop.

Comment 95 Ildar Nurislamov 2014-06-16 10:08:45 UTC

nope. i made a script that print entire dmesg output but with no line breaks. it works fast event with 7pt. When original dmesg lags as usual.

Comment 96 Ildar Nurislamov 2014-06-16 10:12:54 UTC

Sorry for such an intensive flooding. But it is getting weirder.
Now i have almost fluent scrolling even with 8pt font size with both dmesg and dmesg with no breaks output. All i did was reseting the console.

Comment 97 Chris Wilson 2014-06-16 10:17:20 UTC

No worries, the information is helpful. If you find a good script, I'll give it a go as well. At the moment, I am struggling to make konsole step out of the bottom GPU frequency whilst scrolling through several dmesgs.

Comment 98 Ildar Nurislamov 2014-06-16 10:27:58 UTC

Didn't you forget about "-graphicssystem native"?
At 8pt font size results are not stable. Sometimes in can be fast like there is no bug at all. And sometimes it lags. I caught a moment  when even 9pt was fluent for a moment. But nevertheless 8pt seems like a threshold. 
It seems switching to Mozilla and back improves a situation for a some time.

Looks like current master has some improvements. At the very beginning i haven't noticed any dependencies on font size.

Comment 99 Ildar Nurislamov 2014-06-16 10:33:54 UTC

And i use freetype2 with infinality patches if that matters. Going to test with original.

Comment 100 Ildar Nurislamov 2014-06-16 10:54:09 UTC

The same with original freetype2

Comment 101 Chris Wilson 2014-06-16 10:55:38 UTC

It did indeed take me a moment to remember to add -graphicssystem native, and that did make it cleaner but still the GPU could render at 60fps whilst at lowest frequency. (An ivb gt1 celeron.)

Do the characteristics of the lag remain the same -- very longs waits whilst stalling for memory (GAM)?

Comment 102 Ildar Nurislamov 2014-06-16 11:31:17 UTC

Im not sure. Situating is changing continuously. I mean i forgot which font i used before. But i see now that it depends on font too: DejaVu Sans Mono is much faster then Droid Sans Mono or Liberation Mono. And depends on composting engine also:
Now im on DejaVu Sans Mono 9pt and Xrender lags when OpenGL is fast.

>Do the characteristics of the lag remain the same -- very longs waits whilst stalling for memory (GAM)?
Looks like. But harder to acheive. And waits ms is lower.

Hm. I caught total hang maybe because of using OpenGL or gpu-overlay so i had to restart. After restart everything little bit slower again. It seems i need to do all the window switch, font change magic again.

Comment 103 Chris Wilson 2014-06-16 12:48:36 UTC

(In reply to comment #102) 
> Hm. I caught total hang maybe because of using OpenGL or gpu-overlay so i
> had to restart.

intel-gpu-overlay or intel-gpu-top will cause a total system hang if used against stock kernels (and intel-gpu-top requires another patch I have to be safe).

I'll wait patiently until you have a nasty reproducible case again. Still have to solve why memory spontaneously? becomes the bottleneck.

Comment 104 Ildar Nurislamov 2014-06-16 13:03:32 UTC

I still can reproduce it even it little bit less nasty than before. There are conditions that lower severity of lags (like font and composition engine) and many changes in master version showed little improvements. What seemed like a big relief in my last comments is actually shaky and depends on magic or solar activity. But bug is still there and i'm ready to continue testing things.

Comment 105 Chris Wilson 2014-06-16 13:32:21 UTC

Let's attack the other angle for a bit. Can you trigger the bad behaviour at all without TearFree enabled?

Comment 106 Ildar Nurislamov 2014-06-16 13:52:20 UTC

Same as usual. Smooth rendering but growing delay. Only speed of delay growth is lower.

Comment 107 Chris Wilson 2014-06-16 15:05:10 UTC

Oh, ok. TearFree is not a critical component here, it just generates a wait that triggers unhappiness. I was about to ask you to try

commit a4d845bc8f77e4a36064f132720ce462d0bdd5b4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 16 14:25:16 2014 +0100

    sna: Regularly check the cache level on bo
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

with --enable-debug, but I don't that is going to be insightful.

Comment 108 Ildar Nurislamov 2014-06-17 06:50:02 UTC

I can't compile it anyway:
sna_glyphs.c: In function 'clipped_glyphs':
sna_glyphs.c:561:2: error: invalid use of void expression
  DBG(("%s? %d glyph in %d lists extents (%d, %d), (%d, %d), region (%d, %d), (%d, %d): %s\n",
  ^
sna_glyphs.c: In function 'glyphs_via_mask':
sna_glyphs.c:1130:2: error: invalid use of void expression
  DBG(("%s: nlist=%d, count=%d, bounds=((%d, %d), (%d, %d))\n", __FUNCTION__,
  ^
sna_glyphs.c: In function 'glyphs_fallback':
sna_glyphs.c:1628:2: error: invalid use of void expression
  DBG(("%s: nlist=%d, count=%d, extents (%d, %d), (%d, %d)\n", __FUNCTION__,
  ^
sna_glyphs.c: In function 'glyphs_via_image':
sna_glyphs.c:2054:2: error: invalid use of void expression
  DBG(("%s: nlist=%d, count=%d, bounds=((%d, %d), (%d, %d))\n", __FUNCTION__,

Comment 109 Chris Wilson 2014-06-17 07:14:07 UTC

(In reply to comment #108)
> I can't compile it anyway:

Oops fixed. First try --enable-debug not --enable-debug=full, the later will slow your machine down considerably and probably mask the bug in the process.

Comment 110 Ildar Nurislamov 2014-06-17 07:46:38 UTC

--enable-debug doesn't produce any extra output at all. It must be in Xorg.0.log correct?

Comment 111 Chris Wilson 2014-06-17 07:56:14 UTC

It should just generate one extra line:

[    42.772] (II) intel(0): SNA compiled with assertions enabled

and then abort if it detects an error.

Comment 112 Ildar Nurislamov 2014-06-17 08:49:45 UTC

This line exists. Apparently there are no errors

Comment 113 Ildar Nurislamov 2014-06-18 07:46:48 UTC

https://dl.dropboxusercontent.com/u/42927427/Xorg.0.log.tar.bz2
enable-debug=full
as it was before - with debug scrolling is faster. 
tried to find operations that took at least 1ms
is 
[157786.128] glyphs_format: overlapping glyph inside line, current bbox (54, 290), (72, 299), glyph (70, 292), (79, 299) 
a bad thing?

Comment 114 Chris Wilson 2014-06-18 07:55:09 UTC

(In reply to comment #113)
> https://dl.dropboxusercontent.com/u/42927427/Xorg.0.log.tar.bz2
> enable-debug=full
> as it was before - with debug scrolling is faster. 

Which is unfortunate as it means we don't get to see the true hog. Thanks, I'll look at that later.

> tried to find operations that took at least 1ms
> is 
> [157786.128] glyphs_format: overlapping glyph inside line, current bbox (54,
> 290), (72, 299), glyph (70, 292), (79, 299) 
> a bad thing?

It's fairly normal. It means that the glyphs bounding box overlaps (e.g. kerning, but it is usually due to subpixel filtering blurring the glyphs larger than designed) and prevents us from taking a fast path (due to Render semantics). It's only about 5x slower...

Comment 115 Ildar Nurislamov 2014-06-18 08:00:51 UTC

Created attachment 101284 [details]
Screen garbage with latest master

New thing. It not always like this. Few times per minute maybe

Comment 116 Chris Wilson 2014-06-18 19:13:37 UTC

(In reply to comment #115)
> Created attachment 101284 [details]
> Screen garbage with latest master
> 
> New thing. It not always like this. Few times per minute maybe

Which commit are you running?

Comment 117 Ildar Nurislamov 2014-06-18 19:47:08 UTC

> Which commit are you running?
That was a HEAD at the time i made this comment. Tomorrow i be more specific.

Comment 118 Chris Wilson 2014-06-18 20:00:01 UTC

Just wondering if was before

commit 62aaf2ff4f8597067cf387865707baa00ed9a123
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 18 18:41:15 2014 +0100

    sna: Skip redundant copies when already cloned
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

which obviously I didn't think through very well, and had to fixup. The dangers of trying to work on multiple problems at once.


commit 51a0559b120dd0a57a938b300bccefbf1142e495
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 18 20:55:02 2014 +0100

    sna: Do not treat both src/dst having no clones as being the same
    
    Fixes regression from
    commit 62aaf2ff4f8597067cf387865707baa00ed9a123
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Wed Jun 18 18:41:15 2014 +0100
    
        sna: Skip redundant copies when already cloned
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk

Comment 119 Ildar Nurislamov 2014-06-19 07:45:13 UTC

It was before both of them and it still present in 	c257c936b42b92827b784cee0b7caa54e6040364. So i've made a separate bug:
https://bugs.freedesktop.org/show_bug.cgi?id=80215

Comment 120 Chris Wilson 2014-06-25 07:14:24 UTC

Maybe something like:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 9c24ea6..6835bda 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -6495,6 +6495,7 @@ fixup_flip:
                                                        mode = KGEM_BLT;
                                                DBG(("%s: marking flip bo as busy [%x -> mode=%d]\n", __FUNCTION__, busy.busy, mode));
                                                kgem_bo_mark_busy(&sna->kgem, flip_bo, mode);
+                                               sna->kgem.need_throttle = true;
                                        } else
                                                __kgem_bo_clear_busy(flip_bo);
                                }
@@ -6511,6 +6512,11 @@ fixup_flip:
                        assert(old->refcnt >= 1);
                        set_shadow(sna, region);
                }
+
+               if (sna->kgem.need_throttle) {
+                       drmCommandNone(sna->kgem.fd, DRM_IOCTL_I915_GEM_THROTTLE);
+                       sna->kgem.need_throttle = false;
+               }
        } else
                kgem_submit(&sna->kgem);

Comment 121 Chris Wilson 2014-06-25 08:27:20 UTC

Something that's been bothering me is that I think we end up not calling the BlockHandler often enough. That's called after a client uses up his time slice or has no more commands to execute.


diff --git a/src/sna/sna_driver.c b/src/sna/sna_driver.c
index e0995ce..4eaf1d5 100644
--- a/src/sna/sna_driver.c
+++ b/src/sna/sna_driver.c
@@ -719,6 +719,9 @@ sna_block_handler(BLOCKHANDLER_ARGS_DECL)
 
        sna->BlockHandler(BLOCKHANDLER_ARGS);
 
+       ErrorF("%s (tv=%ld.%06ld), shadow?=%d\n", __FUNCTION__,
+            *tv ? (*tv)->tv_sec : -1, *tv ? (*tv)->tv_usec : 0, has_shadow(sna));
+
        if (*tv == NULL || ((*tv)->tv_usec | (*tv)->tv_sec) || has_shadow(sna))
                sna_accel_block_handler(sna, tv);
 }


Would give a rough indication of how often we are called, and how often we would proceed. The complication here is that whilst TearFree is active, continuation is done immediately after the flip.

On that front, "trace-cmd record -e drm -e i915" during the lag would be useful. Attach the output of "trace-cmd report". (See http://git.kernel.org/cgit/linux/kernel/git/rostedt/trace-cmd.git)

Comment 122 Ildar Nurislamov 2014-06-25 08:46:48 UTC

Last commits already fixed the bug. 
Maybe its not always 60fps but i spent some more time testing to insure that there are no more lags.

Should i apply the patch?

Comment 123 Ildar Nurislamov 2014-06-25 08:51:22 UTC

Fixed only for TearFree. Without it delay still increases.

Comment 124 Ildar Nurislamov 2014-06-25 08:53:03 UTC

And as result "no tearfree" + "kwin opengl vsync" still lags

Comment 125 Chris Wilson 2014-06-25 08:53:24 UTC

The commit supersedes comment 120 - I thought a little more and decided that catching it when we were rendering faster than flipping was a more useful throttle point. Let me know how this feels, and once you have a better handle on the behaviour we can think about how to improve it further. (I.e. try to tackle the root cause rather than fashion better bandaids.)

Comment 126 Chris Wilson 2014-06-25 08:54:48 UTC

Indeed, the throttling was only added to TearFree. Without TearFree is still an issue, so working out where the missing throttle is will be important. It should be hitting the BlockHandler and from there throttling at least every 40ms.

Comment 127 Ildar Nurislamov 2014-06-25 09:30:41 UTC

(In reply to comment #126)
> It should be hitting the BlockHandler and from there throttling at least every
> 40ms.
I think that is why i wrote "Maybe its not always 60fps". I can feel it.
I used Kwin Show Fps plugin. fps if quite high ~50. But there are red spikes on the graph that appears only when i scroll konsole. It represents frame draw time. It become red in 20ms-50ms range - if i read kwin sources correctly.

Is it possible to throttle for 15ms and not more?

Comment 128 Chris Wilson 2014-06-25 09:37:40 UTC

It's a balance between render lag and input lag. If we forgo throttling, we can build long streams of render commands on the GPU, and the display lags behind. If we wait for the GPU for too long, we stop sending clients updates.

I have two problems here:

1. The throttling should already exist to prevent the huge +2s stalls. Unless it is just a single massive batch constructed within one throttle period that generates the massive stall. trace-cmd should help identify if that is the case.

2. There is no way that we should be able to queue more than the GPU can handle in the first place, certainly not +2s! So something is not quite right with the command stream. This is why I have been trying to spot likely culprits such as using large untiled buffers.

Comment 129 Chris Wilson 2014-06-25 09:45:05 UTC

Time for something different:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 7a7e2f2..52e78f2 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -3369,6 +3369,9 @@ void _kgem_submit(struct kgem *kgem)
 
        kgem_reset(kgem);
 
+       if (kgem->scanout_busy)
+               isItTimeToYield = TRUE;
+
        assert(kgem->next_request != NULL);
 }

Comment 130 Ildar Nurislamov 2014-06-25 10:53:13 UTC

Created attachment 101742 [details]
trace-cmd report

It seems nothing has changed

Comment 131 Chris Wilson 2014-06-25 11:21:10 UTC

That's mostly 60fps, with a few periods where it dropped to 32ms and even 48ms between the flip-request and the actual flip.

14338.924991: render=0.0165519999991375 fps=60.092542514884
14338.941653: render=0.0166360000002896 fps=60.0168047053296
14338.958318: render=0.0166340000014316 fps=60.0060005964152
14338.974979: render=0.0166399999998248 fps=60.0204069395908
14339.008323: render=0.0333019999998214 fps=29.9904030714778
14339.058291: render=0.0499429999999847 fps=20.0128081975053
14339.074952: render=0.0166440000011789 fps=60.0204069330379
14339.091615: render=0.0166379999991477 fps=60.0132029099832
14339.108294: render=0.016658999998981 fps=59.955632830645
14339.124943: render=0.016631000000416 fps=60.0636674837393
14339.141607: render=0.016646000000037 fps=60.0096015403702
14339.158289: render=0.0166600000011385 fps=59.944850732613
14339.174930: render=0.0166149999986374 fps=60.0925425214526
14339.191595: render=0.0166370000006282 fps=60.0060005964152
14339.224940: render=0.0330740000008518 fps=29.9895036738702
14339.274915: render=0.0470879999993485 fps=20.0100050025397
14339.291575: render=0.0166399999998248 fps=60.0240096062931
14339.324906: render=0.03326399999969 fps=30.0021001465206
14339.341564: render=0.0166119999994407 fps=60.0312162307778
14339.358221: render=0.016631000000416 fps=60.0348201952704
14339.374882: render=0.0166349999999511 fps=60.0204069395908
14339.391549: render=0.0166420000005019 fps=59.9988000244664
14339.408213: render=0.0165100000012899 fps=60.0096015338197
14339.441539: render=0.0333039999986795 fps=30.0066014533538
14339.491547: render=0.0498279999992519 fps=19.9968005118534
14339.508191: render=0.00871700000061537 fps=60.0817111228772
14339.524856: render=0.016646000000037 fps=60.0060006029649
14339.558187: render=0.0332960000014282 fps=30.0021001465206
14339.574842: render=0.0102200000001176 fps=60.0420294225921


Which says that the slowdown is due to the GPU load and not latency in preparing the next flip.

Comment 132 Ildar Nurislamov 2014-06-25 11:34:35 UTC

I scroll manually. so i could pause for a moment. And i was changing a font in konsole in the middle of the trace. second half of it made with liberation - and kwin's show fps showed spikes of >50ms screen draws.

Any idea why you couldn't reproduce it on your machine?

Comment 133 Chris Wilson 2014-06-25 12:24:07 UTC

The big difference is that on this machine I haven't been able to saturate the GPU at 100% busy let alone 100% busy at max frequency. It's only an Ivybridge Celeron with GT1 graphics, so not a powerhouse, but on the other hand only 1366x768.

Looking through the trace-cmd, I still haven't had that eureka moment of spotting the beginning of the snowball, let alone understand the cause. There are some long throttling periods which corresponds to the low framerate, but the commands running up to those do not look absurd. There is the tendency for the batch buffer size to grow when loaded, as expected though, and I don't think scary.

What I do see though is that I perhaps neutered the regular throttling too much.
So, 

commit 5ad1661b870b5b189e52383e45a9b8f570b40501
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 25 13:21:58 2014 +0100

    sna: Only mark throttle as done after success
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should be a step in the right direction.

Comment 134 Ildar Nurislamov 2014-06-25 12:36:40 UTC

didn't notice any changes. i rolled back the last path, right?

Comment 135 Chris Wilson 2014-06-25 12:57:43 UTC

Yup. :(

As a final stab in this direction, please try:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 50a7da2..3058086 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -967,7 +967,7 @@ static bool __kgem_throttle(struct kgem *kgem)
         * too much and completely starve X. We will sleep again shortly,
         * and so catch up or detect the hang.
         */
-       if (ioctl(kgem->fd, DRM_IOCTL_I915_GEM_THROTTLE) == 0) {
+       if (drmIoctl(kgem->fd, DRM_IOCTL_I915_GEM_THROTTLE) == 0) {
                kgem->need_throttle = 0;
                return false;
        }

Comment 136 Ildar Nurislamov 2014-06-25 13:06:20 UTC

kgem.c:970:2: error: too few arguments to function 'drmIoctl'
  if (drmIoctl(kgem->fd, DRM_IOCTL_I915_GEM_THROTTLE) == 0) {
  ^

Comment 137 Chris Wilson 2014-06-25 13:23:14 UTC

Picky. Just give it an extra 0:

if (drmIoctl(kgem->fd, DRM_IOCTL_I915_GEM_THROTTLE, 0) == 0) {

Comment 138 Ildar Nurislamov 2014-06-25 13:30:02 UTC

Hard to say. Maybe red spikes appears less often and not so high or maybe not )

Comment 139 Chris Wilson 2014-06-25 13:35:45 UTC

Try without TearFree to see if the long lag periods are gone.

Comment 140 Ildar Nurislamov 2014-06-25 13:46:29 UTC

Yep. its gone. I maybe feels more snappier without TearFree. im not sure.

No lag with kwin openGL vsync either. but performance drop quite huge - to 20-25fps

Comment 141 Chris Wilson 2014-06-25 14:49:29 UTC

I've restored the throttling and tried to compromise to not over-throttle during allocations:

commit ca0d06add926eb17fcec6c031adabb656498d744
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 25 14:19:33 2014 +0100

    sna: Improve throttling during bo allocation
    
    By controlling how long we may block during buffer allocation, we can
    relax the throttle elsewhere to prevent render lag buildup.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


Can you please check that prevents the lag buildup on both TearFree and non-TearFree setups?

Due to the well flips are scheduled in TearFree there can easily be a 20ms delay between render and seeing the results, at best 16ms. That should be enough to make non-TearFree feel snappier. To improve, I would need to submit the next frame just before the vblank to minimise the render output latency. The current solution is nice for its simplicity. :)

Comment 142 Ildar Nurislamov 2014-06-25 15:14:20 UTC

Yep no lags in both mode. Cant say if there is difference between this commit and the patch though.

About snappiness. Actually i doubt that i would ever noticed smth if it was like that in a first time. It simply me looking at the performance side very captiously now.
Btw. Will these changes be helpful in any other cases? Or that is just workaround for buggy qt xrender backend and only i am going to be happy about it?

Comment 143 Chris Wilson 2014-06-25 15:42:59 UTC

It's not the only pathological renderer! I still think something is subpar and anomalous about the behavior on your machine, but this bug has gone on long enough...

For reference, can you please run

x11perf -aa10text -aa24text -rgb10text -rgb24text

under the various compositor guises (TearFree, vsync, vsync+TearFree, etc).

For reference on my baby ivb, with tear-free enabled and bare X I get:

Sync time adjustment is 0.0324 msecs.
200000000 trep @   0.0001 msec (7920000.0/sec): Char in 80-char aa line (Charter 10)
112000000 trep @   0.0003 msec (3850000.0/sec): Char in 30-char aa line (Charter 24)
120000000 trep @   0.0003 msec (3280000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 trep @   0.0011 msec (871000.0/sec): Char in 30-char rgb line (Charter 24)

and without:

Sync time adjustment is 0.0328 msecs.
280000000 trep @   0.0001 msec (10200000.0/sec): Char in 80-char aa line (Charter 10)
112000000 trep @   0.0002 msec (4130000.0/sec): Char in 30-char aa line (Charter 24)
120000000 trep @   0.0003 msec (3400000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 trep @   0.0011 msec (890000.0/sec): Char in 30-char rgb line (Charter 24)

Comment 144 Ildar Nurislamov 2014-06-26 07:57:33 UTC

Bare X + NoTearFree:
120000000 trep @   0.0002 msec (4080000.0/sec): Char in 80-char aa line (Charter 10)
32000000 trep @   0.0009 msec (1100000.0/sec): Char in 30-char aa line (Charter 24)
120000000 trep @   0.0002 msec (4080000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 trep @   0.0009 msec (1100000.0/sec): Char in 30-char rgb line (Charter 24)

Bare X + TearFree:
120000000 trep @   0.0002 msec (4040000.0/sec): Char in 80-char aa line (Charter 10)
32000000 trep @   0.0009 msec (1100000.0/sec): Char in 30-char aa line (Charter 24)
120000000 trep @   0.0002 msec (4040000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 trep @   0.0009 msec (1100000.0/sec): Char in 30-char rgb line (Charter 24)

KDE + TearFree + Kwin OpenGL Vsync:
120000000 trep @   0.0002 msec (4130000.0/sec): Char in 80-char aa line (Charter 10)
32000000 trep @   0.0009 msec (1110000.0/sec): Char in 30-char aa line (Charter 24)
120000000 trep @   0.0002 msec (4130000.0/sec): Char in 80-char rgb line (Charter 10)
32000000 trep @   0.0009 msec (1110000.0/sec): Char in 30-char rgb line (Charter 24)

My CPU is Core™ i5-3450. ivb too. Intel® HD Graphics 2500.
And our results are nothing alike.

Comment 145 Ildar Nurislamov 2014-06-26 08:03:29 UTC

Forgot to mention that 70-90% of cpuload was IOwait.

Comment 146 Chris Wilson 2014-06-26 08:05:00 UTC

That's really, really odd. aa10 should be much faster than rgb10. Can you attach Xorg.0.log? I just want to check if you have enabled any kernel features, fbc being the important one.

Comment 147 Ildar Nurislamov 2014-06-26 08:08:33 UTC

Created attachment 101786 [details]
Xorg.0.log

Btw. When i ran bare X it printed something in console output. GLYPHS was also mentioned there.

Comment 148 Ildar Nurislamov 2014-06-26 08:32:27 UTC

Created attachment 101787 [details]
X output

Comment 149 Chris Wilson 2014-06-26 08:40:38 UTC

Puzzling, didn't spot anything anomalous. (Other than a ugly use of "enabled".) Just to put my concerns to rest, what does /sys/kernel/debug/dri/0/i915_fbc_status report?

(In reply to comment #148)
> Created attachment 101787 [details]
> X output

Ah, that's ok. I was worried there for a moment, but that is just some leak debugging output that X does on shutdown/regeneration.

Comment 150 Ildar Nurislamov 2014-06-26 08:43:21 UTC

# cat /sys/kernel/debug/dri/0/i915_fbc_status
FBC disabled: disabled per chip default

Comment 151 Chris Wilson 2014-06-26 08:48:54 UTC

Let's run through some other highlights:

x11perf -shmput500 -shmget500 -copywinwin500 -copywinpix500 -copypixwin500 -comppixwin500

That should more or less tell us how fast memory access is in different modes (reading/writing by the CPU/GPU/display).

Comment 152 Ildar Nurislamov 2014-06-26 09:00:58 UTC

KDE + TearFree
 200000 trep @   0.1356 msec (  7380.0/sec): Copy 500x500 from window to window
 200000 trep @   0.1356 msec (  7370.0/sec): Copy 500x500 from pixmap to window
 200000 trep @   0.1357 msec (  7370.0/sec): Copy 500x500 from window to pixmap
 400000 trep @   0.0878 msec ( 11400.0/sec): ShmPutImage 500x500 square
 400000 trep @   0.0991 msec ( 10100.0/sec): ShmGetImage 500x500 square                                                                                                    
Some tests where 100% pure iowait.

Now im going to disable intel pstate driver and see what happens..

Comment 153 Ildar Nurislamov 2014-06-26 09:24:40 UTC

disabling intel p-state didnt help.

This is my memory info btw:
# dmidecode --type 17
# dmidecode 2.12
# SMBIOS entry point at 0x000f04c0
SMBIOS 2.7 present.

Handle 0x0029, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0028
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM1
        Bank Locator: BANK 0
        Type: DDR3
        Type Detail: Synchronous
        Speed: 1333 MHz
        Manufacturer: Kingston
        Serial Number: 9D3AA568
        Asset Tag: 9876543210
        Part Number: 99U5471-012.A00LF 
        Rank: 2
        Configured Clock Speed: 1333 MHz

Handle 0x002B, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0028
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM3
        Bank Locator: BANK 2
        Type: DDR3
        Type Detail: Synchronous                                                                                                                                           
        Speed: 1333 MHz                                                                                                                                                    
        Manufacturer: Elpida                                                                                                                                               
        Serial Number: 2A5629D7                                                                                                                                            
        Asset Tag: 9876543210                                                                                                                                              
        Part Number: EBJ41UF8BDW0-GN-F                                                                                                                                     
        Rank: 2
        Configured Clock Speed: 1333 MHz

Comment 154 Chris Wilson 2014-06-27 07:12:17 UTC

(In reply to comment #152)
> KDE + TearFree
>  200000 trep @   0.1356 msec (  7380.0/sec): Copy 500x500 from window to
> window
>  200000 trep @   0.1356 msec (  7370.0/sec): Copy 500x500 from pixmap to
> window
>  200000 trep @   0.1357 msec (  7370.0/sec): Copy 500x500 from window to
> pixmap
>  400000 trep @   0.0878 msec ( 11400.0/sec): ShmPutImage 500x500 square
>  400000 trep @   0.0991 msec ( 10100.0/sec): ShmGetImage 500x500 square      

Your memory is much higher performance than mine! Yours is running at dual channel compared to the single channel here, so is roughly 2 faster.

So your CPU is faster, your memory is faster, but your EU are slower... Odd.

Can you run xf86-video-intel/tests/lowlevel-blt-bench? Despite its name that should exercise the EU.

Comment 155 Ildar Nurislamov 2014-06-27 08:09:42 UTC

Created attachment 101843 [details]
lowlevel-blt-bench results

in bare X

Comment 156 Chris Wilson 2014-06-27 08:54:07 UTC

Comparing our machines:

Yours:
Testing a8r8g8b8 with Src: ref=1.348951, out=1.340763
Testing x8r8g8b8 with Src: ref=1.342294, out=1.334717
Testing a8r8g8b8 with Over: ref=2.078094, out=1.460116
Testing x8r8g8b8 with Over: ref=1.338495, out=1.325701

Mine:
Testing a8r8g8b8 with Src: ref=0.895362, out=0.570604
Testing x8r8g8b8 with Src: ref=0.883392, out=0.579583
Testing a8r8g8b8 with Over: ref=1.833890, out=0.755831
Testing x8r8g8b8 with Over: ref=0.882829, out=0.578426

Not quite the results I was expecting. Oh darn, now I remember it runs at fullscreen. Who wrote this test!

Do you mind repeating that test but first doing a 'xrandr --output HDMI1 --mode 1024x768'

Comment 157 Ildar Nurislamov 2014-06-27 09:13:40 UTC

Created attachment 101850 [details]
lowlevel-blt-bench results 1024x768

Comment 158 Chris Wilson 2014-06-27 09:25:40 UTC

Yours @ 1024x768:
Testing a8r8g8b8 with Src: ref=0.273359, out=0.420982
Testing x8r8g8b8 with Src: ref=0.273589, out=0.421422
Testing a8r8g8b8 with Over: ref=0.775864, out=0.441080
Testing x8r8g8b8 with Over: ref=0.274387, out=0.416628

Mine @ 1024x768:
Testing a8r8g8b8 with Src: ref=0.870605, out=0.593283
Testing x8r8g8b8 with Src: ref=0.858216, out=0.589730
Testing a8r8g8b8 with Over: ref=1.947748, out=0.767362
Testing x8r8g8b8 with Over: ref=0.857667, out=0.586849

Looking at the reference, which uses pure software rasterisation, your cpu/memory is a little over 2x faster (indeed that is most likely due to the dual channel advantage you have). But it seems that the GPU is completely memory bandwidth limited. Comparing Over with argb, which is the only operation there that does perform a read from dst and blend with src before writing, we do expect to see a significant impact from the extra read.

Can you please 'cat /sys/kernel/debug/dri/0/i915_ring_freq_table' and 'cat /proc/cpuinfo'

Comment 159 Ildar Nurislamov 2014-06-27 09:41:10 UTC

Created attachment 101852 [details]
i915_ring_freq_table

Comment 160 Ildar Nurislamov 2014-06-27 09:41:50 UTC

Created attachment 101853 [details]
cpuinfo

Comment 161 Chris Wilson 2014-06-27 09:50:36 UTC

Drat, no debugfs hack, so do you mind compiling a kernel with

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index d23d114832fa..db61a1bd9b06 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -3707,6 +3707,7 @@ static void __gen6_update_ring_freq(struct drm_device *dev)
                        else
                                ia_freq = max_ia_freq - ((diff * scaling_factor) / 2);
                        ia_freq = DIV_ROUND_CLOSEST(ia_freq, 100);
+                       ia_freq = max_ia_freq;
                }
 
                sandybridge_pcode_write(dev_priv,

applied?

Comment 162 Chris Wilson 2014-06-27 10:40:10 UTC

On a GT2 ivb with dual channel I get @1024x768:

Testing a8r8g8b8 with Src: ref=0.336485, out=0.397242
Testing x8r8g8b8 with Src: ref=0.339070, out=0.389645
Testing a8r8g8b8 with Over: ref=1.031118, out=0.350234
Testing x8r8g8b8 with Over: ref=0.339027, out=0.386291

which more closely match your results. Maybe not quite the lead I hoped. Time to ponder again.

Comment 163 Chris Wilson 2014-06-27 10:45:12 UTC

That's a i7-3720QM CPU @ 2.60GHz, which is exactly in line with your CPU results (when scaling your results at 3.5GHz to this 2.6GHz CPU). Similary, the max GPU clocks here are 1250Mhz which accounts for the difference in GPU results.

So, it looks like your results are nearly perfectly scaling for clocks with mine. I.e. no discrepancy here. :|

Comment 164 Chris Wilson 2014-06-27 11:16:18 UTC

Ah, I have an idea that would account for the glyph discrepancy - it is as if your aa glyphs miss the fast path entirely. That explains why both aa and rgb glyphs are the same speed as well, and why the impact of the extra damage tracking for a compositor doesn't affect performance much.

So I guess the characteristic of the slowdown on your machine is going to be best modeled by rgb24text.

And that as far as I can see, everything is in order with your hardware - it just seems to be a particularly brutal rendering path for your GPU.

Comment 165 Ildar Nurislamov 2014-06-27 11:21:56 UTC

(In reply to comment #164)
> Ah, I have an idea that would account for the glyph discrepancy - it is as
> if your aa glyphs miss the fast path entirely.
Is it fixable?

And what are x11perf -aa10text -aa24text -rgb10text -rgb24text results on GT2 ?

Btw couldn't compile your kernel:
  CC      drivers/gpu/drm/i915/i915_gem.o
drivers/gpu/drm/i915/i915_gem.c: In function ‘i915_gem_phys_pwrite’:
drivers/gpu/drm/i915/i915_gem.c:396:2: error: ‘ret’ undeclared (first use in this function)
  ret = i915_gem_object_wait_rendering(obj, false);
  ^
drivers/gpu/drm/i915/i915_gem.c:396:2: note: each undeclared identifier is reported only once for each function it appears in

DRM and i915 was set to built into kernel.

Comment 166 Chris Wilson 2014-06-27 11:30:31 UTC

This system has the opposite issue (the glyph mask is also discarded for rgb glyphs!):

Sync time adjustment is 0.0164 msecs.
800000000 trep @   0.0001 msec (18100000.0/sec): Char in 80-char aa line (Charter 10)
320000000 trep @   0.0001 msec (6680000.0/sec): Char in 30-char aa line (Charter 24)
800000000 trep @   0.0001 msec (18000000.0/sec): Char in 80-char rgb line (Charter 10)
320000000 trep @   0.0001 msec (6680000.0/sec): Char in 30-char rgb line (Charter 24)

Turning off GLYPHS_TO_DST:

200000000 trep @   0.0001 msec (6670000.0/sec): Char in 80-char aa line (Charter 10)
64000000 trep @   0.0005 msec (1930000.0/sec): Char in 30-char aa line (Charter 24)
200000000 trep @   0.0001 msec (6670000.0/sec): Char in 80-char rgb line (Charter 10)
64000000 trep @   0.0005 msec (1930000.0/sec): Char in 30-char rgb line (Charter 24)

which starts to look very similar to your results.

Comment 167 Chris Wilson 2014-06-27 12:46:09 UTC

I can't see anything else to squeeze out of masked glyphs. It basically requires 3 passes: (1) clear temporary mask, (2) render glyphs to mask, (3) composite mask onto dst.

That of course doesn't help your situation where konsole is able to queue up rendering faster than can be processed by the GPU. So I tweaked the cheat:


commit a6ba93283b20a50ff36758624d2967562b7bdae9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 27 13:42:37 2014 +0100

    sna/glyphs: Add DBG option for glyph tolerance
    
    And bump it to 3 so that glyph filtering doesn't force us to use the
    mask too often.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=77436
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 168 Ildar Nurislamov 2014-06-27 12:57:32 UTC

KDE + Tearfree + Kwin Xrender
240000000 trep @   0.0001 msec (8680000.0/sec): Char in 80-char aa line (Charter 10)
80000000 trep @   0.0004 msec (2680000.0/sec): Char in 30-char aa line (Charter 24)
240000000 trep @   0.0001 msec (8690000.0/sec): Char in 80-char rgb line (Charter 10)
80000000 trep @   0.0004 msec (2680000.0/sec): Char in 30-char rgb line (Charter 24)

Oh thanks :). twice as fast. but still aa==rgb

Comment 169 Chris Wilson 2014-06-27 13:03:18 UTC

Yeah. On my GT2 + dual-channel aa==rgb, whereas on my GT1 + single-channel aa==1.5*rgb.

That means I am memory bound with single-channel, but compute bound with dual-channel. I think. Will have to ponder a little bit more, because despite that, my baby ivb is still 10% faster than yours. But I think it does mean that there is room in the shaders to squeeze more performance out of the GPU. :|

But back to the bigger question... How does the system behave now when scrolling?

Comment 170 Ildar Nurislamov 2014-06-27 13:18:02 UTC

Considering the fact that scrolling is already 95% perfect - no difference was noticed. Red spikes on its places. 
(In reply to comment #169)
> Yeah. On my GT2 + dual-channel aa==rgb, whereas on my GT1 + single-channel
> aa==1.5*rgb.
> 
> That means I am memory bound with single-channel, but compute bound with
> dual-channel. I think. Will have to ponder a little bit more, because
> despite that, my baby ivb is still 10% faster than yours. 
And your GT2 is 2.5faster than mine 

> But back to the bigger question... How does the system behave now when
> scrolling?
Considering the fact that scrolling is already 95% perfect - no difference was noticed. Red spikes are still on its places.

Comment 171 Ildar Nurislamov 2014-10-24 11:42:33 UTC

Hi.
It seems something has changed in last few weeks or days. I don't see that throttle takes place during scrolling anymore. Scrolling is absolutely perfect and shows constant 60fps in both Xrender and OpenGL compositing.
Great job!

Comment 172 Ildar Nurislamov 2014-10-24 11:50:14 UTC

Bummer. I know what has changed. I replaced my A4tech XL-750F 1600dpi mouse with a cheap Logitech one. Plugged A4tech - and throttling came back.
And now we have a reason - A4tech mouse is very fast and produce too many of mouse move events.

Comment 173 Chris Wilson 2014-10-24 11:58:50 UTC

I'm not sure just what improved it for you, but any improvement is welcome!

If you do feel bored a reverse bisect to find the fix would be interesting to see where the problem (or at least the low hanging fruit) lay.

Comment 174 Ildar Nurislamov 2014-10-24 12:06:32 UTC

(In reply to Ildar Nurislamov from comment #172)
> Bummer. I know what has changed. I replaced my A4tech XL-750F 1600dpi mouse
> with a cheap Logitech one. Plugged A4tech - and throttling came back.
> And now we have a reason - A4tech mouse is very fast and produce too many of
> mouse move events.

Yep. I tested it - A4Tech refresh rate is 500Hz. it's insane :)
ordinary mouse has 125Hz.  No surprise that throttle takes place when i'm using this mouse.
So Konsole just tried to draw all frames per second.

Don't bother. Everything is fine. And i use ordinary mouse now, which ironically gives smother result )

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.