Bug 76892 - [hsw] full debugging leads to long delays between TearFree updates due to an overloaded GPU!
Summary: [hsw] full debugging leads to long delays between TearFree updates due to an ...
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium critical
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-01 09:06 UTC by Andreas Reis
Modified: 2014-04-04 14:58 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Xorg.0.log (no crash) (39.46 KB, text/plain)
2014-04-01 09:13 UTC, Andreas Reis
no flags Details
dmesg (no crash) (79.09 KB, text/plain)
2014-04-01 09:14 UTC, Andreas Reis
no flags Details
dmesg with echo 7 (518.73 KB, text/plain)
2014-04-02 13:28 UTC, Andreas Reis
no flags Details
dmesg given by journalctl -ab (40.57 KB, text/plain)
2014-04-02 13:45 UTC, Andreas Reis
no flags Details
output of xserver to the console (23.62 KB, text/plain)
2014-04-02 16:32 UTC, Andreas Reis
no flags Details
Xorg.0.log of the same run (2.15 MB, text/plain)
2014-04-02 16:51 UTC, Andreas Reis
no flags Details
Xorg.0.log of the same run (2.15 MB, application/x-xz)
2014-04-02 17:11 UTC, Andreas Reis
no flags Details
Xorg.0.log with i915.enable_fbc=0 (2.59 MB, application/x-xz)
2014-04-03 14:24 UTC, Andreas Reis
no flags Details
Xorg.0.log with f0565852 (1.99 MB, text/plain)
2014-04-03 19:32 UTC, Andreas Reis
no flags Details
Cursor comparison (27.51 KB, text/plain)
2014-04-04 14:18 UTC, Andreas Reis
no flags Details
Cursor comparison (27.51 KB, image/jpeg)
2014-04-04 14:20 UTC, Andreas Reis
no flags Details
Xorg.0.log at 564e4a98 (2.25 MB, application/x-xz)
2014-04-04 14:25 UTC, Andreas Reis
no flags Details

Description Andreas Reis 2014-04-01 09:06:02 UTC
Since about two days the xserver keeps freezing at "random" intervals, also refusing to accept any input. The rest of the system (like audio) keeps working fine, eg. pressing the power button still shuts down the xserver / computer properly.

xf86-video-intel is the only X component that I've updated in that time (from the git version of two days ago to the current one), so I'm rather sure it's the cause.

Arch Linux x64
xorg-server 1.15.0-5
xf86-video-intel-git
mesa-git
libdrm 2.4.52-1
kernel 3.14.0
Haswell 4770 / HD 4600 (GT2) on a Z87
two displays via HDMI, one QHD, one FHD

(Sorry for the lame summary, but with no logs apparently containing any error that's the best I could come up with…)
Comment 1 Chris Wilson 2014-04-01 09:07:12 UTC
Can you please attach dmesg and Xorg.0.log from after the frozen Xserver if possible, or before if not?
Comment 2 Andreas Reis 2014-04-01 09:13:54 UTC
Created attachment 96702 [details]
Xorg.0.log (no crash)
Comment 3 Andreas Reis 2014-04-01 09:14:33 UTC
Created attachment 96703 [details]
dmesg (no crash)
Comment 4 Andreas Reis 2014-04-01 10:07:28 UTC
Just tried with xf86-video-intel reverted to 03-28's "sna: Suppress a compiler warning", still got the freeze. I'll try with "sna: Cursors only need to be cleared when they are shrunk" next, but maybe it's not related to the driver after all.
Comment 5 Chris Wilson 2014-04-01 10:12:05 UTC
So after the freeze, you are not able to access the system at all? Either via a VT or remotely? Your description seems like it should just be the xserver (or compositor?) blocking. Does the cursor still move?
Comment 6 Chris Wilson 2014-04-01 10:14:30 UTC
Hmm, TearFree. I may have introduced a new (or old) recursion issue with event handling. You can try disabling that, but I'd rather attach gdb to Xorg during the freeze and see what's going on?

Alternatively, if you could compile with ./configure --enable-debug=full, we may be able to log what is happening.
Comment 7 Andreas Reis 2014-04-01 10:23:36 UTC
I can still press the power button, which will close (or kill?) the xserver and shutdown the system properly.

Switching to VT doesn't work, as neither keyboard nor mouse input gets processed. Haven't setup any remote access, so I can't attach gdb.

I've recompiled the xserver with that debug option, I'll see if that lists anything relevant.
Comment 8 Chris Wilson 2014-04-01 10:28:11 UTC
Apologies, it is not the xserver that wants --enable-debug=full, but xf86-video-intel.
Comment 9 Andreas Reis 2014-04-01 14:44:06 UTC
Finally another freeze.

After setting up sshd, logging into the computer with the frozen xserver, "gdb /usr/bin/Xorg $(pidof X)" and bt:

#0  0x00007f85722ea28b in __lll_lock_wait_private () from /usr/lib/libc.so.6
#1  0x00007f857227650a in _L_lock_6045 () from /usr/lib/libc.so.6
#2  0x00007f8572274025 in malloc () from /usr/lib/libc.so.6
#3  0x00007f856cf84675 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#4  0x00007f856cf84e31 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#5  0x00000000005bc027 in ?? ()
#6  0x00000000005bcd8e in miPointerSetPosition ()
#7  0x000000000045354e in ?? ()
#8  0x0000000000455c30 in GetPointerEvents ()
#9  0x0000000000456350 in QueuePointerEvents ()
#10 0x00007f8569e6681d in ?? () from /usr/lib/xorg/modules/input/evdev_drv.so
#11 0x00000000004888a8 in ?? ()
#12 0x00000000004b83b0 in ?? ()
#13 <signal handler called>
#14 0x00007f8572270bbe in malloc_consolidate () from /usr/lib/libc.so.6
#15 0x00007f8572271eb8 in _int_malloc () from /usr/lib/libc.so.6
#16 0x00007f8572274030 in malloc () from /usr/lib/libc.so.6
#17 0x00007f856cf84675 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#18 0x00007f856cf84a07 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#19 0x00007f856cf84b70 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#20 0x00000000004de665 in ?? ()
#21 0x00000000004db9e2 in ?? ()
#22 0x00000000005bc6aa in miPointerUpdateSprite ()
#23 0x00000000005bc90b in ?? ()
#24 0x00000000004eec3e in ?? ()
#25 0x000000000053b2ce in ?? ()
#26 0x0000000000445d07 in WindowHasNewCursor ()
#27 0x0000000000476d60 in ChangeWindowAttributes ()
#28 0x0000000000433001 in ?? ()
#29 0x0000000000438456 in ?? ()
#30 0x000000000043cdd5 in ?? ()
#31 0x00007f8572219b05 in __libc_start_main () from /usr/lib/libc.so.6
#32 0x000000000042510e in _start ()

(with xf86-video-intel again at git head)

Guess I'll try running with actual debug symbols compiled in next…

Interestingly though: After killing X via ssh I did get back to my VT login getty (I don't use a display manager), but it still wouldn't accept keyboard input. Its prompt didn't blink, either.

Rebooting with Ctrl-Alt-Del worked, though.
Comment 10 Chris Wilson 2014-04-01 14:48:36 UTC
Ok, that looks like memory corruption. Probably the new cursor code is at fault, so you could try rolling back to e3810cff42bca1badc5844002694a6f582c0f423
Comment 11 Chris Wilson 2014-04-01 15:16:38 UTC
I've had a quick run under valgrind which didn't turn up any nasty surprises.
Comment 12 Chris Wilson 2014-04-02 07:21:28 UTC
Oh hum, I guess that it is just the signal driven cursor updating happening at an inopportune time, not actual memory corruption.
Comment 13 Chris Wilson 2014-04-02 08:51:23 UTC
commit f98b2e164637292c2425f6e6d2c22bd9a2800f8e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 2 08:36:14 2014 +0100

    sna: Prevent signal re-entrancy into cursor update routines
    
    As we may need to allocate from within the cursor update, we are prone
    to re-entrancy issues within malloc()/free(). To avoid these we need to
    block SigIO (for pointer updates) whilst in the critical section.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 14 Andreas Reis 2014-04-02 10:09:25 UTC
Still got a freeze.

#0  0x00007f00714d0907 in ioctl () from /usr/lib/libc.so.6
#1  0x00007f006c102b24 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#2  0x00007f006c10ab63 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#3  0x00007f006c159680 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#4  0x00007f006c15ab35 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#5  0x00007f006c188823 in ?? () from /usr/lib/xorg/modules/drivers/intel_drv.so
#6  0x000000000043d804 in BlockHandler ()
#7  0x00000000005cf6c3 in WaitForSomething ()
#8  0x00000000004381a1 in ?? ()
#9  0x000000000043cdd5 in ?? ()
#10 0x00007f0071413b05 in __libc_start_main () from /usr/lib/libc.so.6
#11 0x000000000042510e in _start ()
Comment 15 Andreas Reis 2014-04-02 10:16:52 UTC
Followed up by a trace with symbols since I can actually reproduce the issue:

#0  0x00007fb61c394907 in ioctl () from /usr/lib/libc.so.6
#1  0x00007fb616fc6b24 in __kgem_throttle (kgem=0x7fb61dfca000) at kgem.c:944
#2  0x00007fb616fceb63 in kgem_throttle (kgem=0x7fb61dfca000) at kgem.c:3234
#3  0x00007fb61701d680 in sna_accel_throttle (sna=0x7fb61dfca000) at sna_accel.c:16771
#4  0x00007fb61701eb35 in sna_accel_block_handler (sna=0x7fb61dfca000, tv=0x7fff525d32d0) at sna_accel.c:17238
#5  0x00007fb61704c823 in sna_block_handler (arg=0x2596680, timeout=0x7fff525d32d0, read_mask=0x871120 <LastSelectMask>) at sna_driver.c:646
#6  0x000000000043d804 in BlockHandler ()
#7  0x00000000005cf6c3 in WaitForSomething ()
#8  0x00000000004381a1 in Dispatch ()
#9  0x000000000043cdd5 in dix_main ()
#10 0x00007fb61c2d7b05 in __libc_start_main () from /usr/lib/libc.so.6
#11 0x000000000042510e in _start ()

file-roller, select some files from an archive, try to drag that selection.
(Cursor changes, which is still movable for a while, then it too freezes.)
Comment 16 Chris Wilson 2014-04-02 10:31:41 UTC
That looks to be something entirely different. Typically that is a GPU hang, and should clear itself after a few seconds or else the kernel is broken.
Comment 17 Chris Wilson 2014-04-02 10:34:07 UTC
Is there anything in dmesg? How about capturing the freeze with drm.debug=7 dmesg?
Comment 18 Chris Wilson 2014-04-02 10:34:34 UTC
echo 7 > /sys/modules/drm/parameters/debug ; # reproduce freeze ; dmesg
Comment 19 Andreas Reis 2014-04-02 11:01:41 UTC
Seems like it indeed does clear itself after a while (more like minutes thuogh), but only if I manage to move the cursor to another region/window that has it change its appearance (which doesn't always work as it should).

I could send you the full dmesg, but its entire content consists of nothing but

[  468.314302] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_BUSY

and

[  468.328904] [drm:drm_calc_vbltimestamp_from_scanoutpos], crtc 1 : v 7 p(0,-45)@ 468.661563 -> 468.662230 [e 0 us, 0 rep]

the latter if I move the cursor and during freezes, with some other messages like below immediately before the move:

[  468.314550] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_CREATE
[  468.314571] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_PWRITE
[  468.314613] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[  468.314739] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_BUSY
[  468.314778] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_PWRITE
[  468.314784] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_PWRITE
[  468.314785] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[  468.314819] [drm:drm_ioctl], pid=1328, dev=0xe200, auth=1, I915_GEM_THROTTLE
Comment 20 Chris Wilson 2014-04-02 11:04:56 UTC
Please compress the log and attach, maybe the pattern will ring a bell.
Comment 21 Andreas Reis 2014-04-02 13:19:28 UTC
Sorry for the delay, had to leave for a bit.

I've tried to get a full dmesg, but the debug setting spams it so quickly that the beginning gets instantly cut off.
Comment 22 Andreas Reis 2014-04-02 13:28:53 UTC
Created attachment 96781 [details]
dmesg with echo 7
Comment 23 Chris Wilson 2014-04-02 13:44:38 UTC
Odd. What I would say that is the blockhandler, which implies that some client is sending lots of non-rendering requests. It doesn't suggest that the DDX/Xserver is trapped in a loop though, just doing busy-work.

You may find that "sudo perf top" gives a clue, but really you want a log of the requests made by each client. I guess one thing to check is whether this is a recent regression in the ddx (i.e. try 2.99.910 etc). The other approach as you have a highly reproducible scenario is that grabbing a full-debug log is promising (compile the ddx with ./configure --enable-debug=full)
Comment 24 Andreas Reis 2014-04-02 13:45:34 UTC
Created attachment 96784 [details]
dmesg given by journalctl -ab

Fresh boot, evoked bug, cursor froze within seconds, killed X almost immediately afterward.
Comment 25 Chris Wilson 2014-04-02 14:05:43 UTC
Ok, that really does look like it gets stuck trying to render something. Then gets into a loop of batching up more requests whilst it waits for the GPU, eventually it ends up waiting for noticeable periods (2-3s). Ugh. I am not sure why the cursor stops moving though during this time, as the cursor is signal driven.

Please can you look into getting a full-debug log from the ddx. :|
Comment 26 Andreas Reis 2014-04-02 14:07:07 UTC
Haha, I just noticed these hangs are actually *caused* by --enable-debug=full, which I had set in my build script the whole time.

Removing it makes the dragging animation as fluent as ever, without any freezing. (I've verified it twice.)
Comment 27 Chris Wilson 2014-04-02 15:21:30 UTC
(In reply to comment #26)
> Haha, I just noticed these hangs are actually *caused* by
> --enable-debug=full, which I had set in my build script the whole time.
> 
> Removing it makes the dragging animation as fluent as ever, without any
> freezing. (I've verified it twice.)

Intriguing. Can you compress and send me the Xorg.0.log from the freeze?
Comment 28 Andreas Reis 2014-04-02 16:32:30 UTC
Created attachment 96793 [details]
output of xserver to the console

Complete freeze (ie. cursor also stopped moving, didn't wait to check if it would unfreeze though), then "pkill X".
Comment 29 Chris Wilson 2014-04-02 16:35:29 UTC
Yes, that's what we expect given the multi-second delays we build up. Can you please attach the compressed /var/log/Xorg.0.log or send it to me? Pretty please?
Comment 30 Andreas Reis 2014-04-02 16:41:33 UTC
I'm trying to, but for whatever reason I get a whooping 3 kB/s upload to this site.
Comment 31 Andreas Reis 2014-04-02 16:51:45 UTC
Created attachment 96796 [details]
Xorg.0.log of the same run
Comment 32 Andreas Reis 2014-04-02 17:11:25 UTC
Created attachment 96798 [details]
Xorg.0.log of the same run
Comment 33 Chris Wilson 2014-04-02 17:50:45 UTC
The Xorg.0.log shows the same pattern. The batches gradually get larger and larger as each takes longer to render, eventually reaching the point where we are stuck for long periods of time without and update to the scanout. Weird. Weirder that it is the extra debugging that triggers this behaviour, usually it has the opposite effect of making batches shorter (as it takes longer for the ddx to construct a batch giving the GPU more time to execute each and so we send shorter batches to try and keep the GPU busy). I'll see if I can piece together the cause and see if that suggests an improvement I can ask you test.
Comment 34 Chris Wilson 2014-04-02 18:15:05 UTC
One thing to test in the meantime is i915.enable_fbc=0
Comment 35 Andreas Reis 2014-04-03 14:24:10 UTC
Created attachment 96852 [details]
Xorg.0.log with i915.enable_fbc=0

Added i915.enable_fbc=0 to the kernel line, no change, freezing occurs as before.

Here's the Xorg.0.log for good measure.
Comment 36 Chris Wilson 2014-04-03 16:05:58 UTC
Drat, it was a nice theory. Thanks for testing it. How about
cat /sys/class/drm/card0/gt_RP0_freq_mhz > /sys/class/drm/card0/gt_min_freq_mhz
?
Comment 37 Chris Wilson 2014-04-03 16:12:05 UTC
Based on your debug logs so far I have spotted one missed optimisation opportunity:

commit f0565852b668832d4988e875bc12e23abb6e1829
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 3 16:27:47 2014 +0100

    sna: Short-circuit repeated clearing to the same color
    
    Spotted in amongst a KDE debug log was a series of clear; copy; repeat
    using Composite().
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 38 Andreas Reis 2014-04-03 19:32:58 UTC
Created attachment 96871 [details]
Xorg.0.log with f0565852

Tried with
cat /sys/class/drm/card0/gt_RP0_freq_mhz > /sys/class/drm/card0/gt_min_freq_mhz (ie. 1200)
but without f0565852, no change. (Btw, I don't use KDE, or any DE.)

Manually setting this seems pretty redundant though, as gt_cur_freq_mhz goes to 1200 even without it.

The freezing still persists with f0565852, unfortunately. Although it does seem to have made it rarer for the cursor to freeze as well, or at least now it often takes a little longer for it to do so.

Attached is a Xorg.0.log of the bug with f0565852 applied.
Comment 39 Chris Wilson 2014-04-03 20:17:10 UTC
Apologies about assuming KDE, it was just that the rendering sequence was a little strange to me and I was not sure who was responsible!
Comment 40 Chris Wilson 2014-04-03 20:44:09 UTC
Right, that cut down on a lot of the initial rendering, delaying the build up of the large render queue.

It still is rendering very slowly though, and I don't see yet why that GPU behaviour would differ between builds.
Comment 41 Andreas Reis 2014-04-04 08:55:44 UTC
It doesn't even always render that slowly, a few times when I tried to get that f0565852 log it would actually do so fluently, ie. without any freezing.

No need to apologize for assuming KDE :) It's xmonad with compton-git as compositor (set to provide transparency and do fade in/out with a 5ms stepping, if that's of any relevance).
Comment 42 Chris Wilson 2014-04-04 10:28:43 UTC
(In reply to comment #41)
> No need to apologize for assuming KDE :) It's xmonad with compton-git as
> compositor (set to provide transparency and do fade in/out with a 5ms
> stepping, if that's of any relevance).

How do you setup compton?
Comment 43 Andreas Reis 2014-04-04 10:44:59 UTC
Here's the repo: https://github.com/chjj/compton

I just call "compton -f -D 5 &" from my .xinitrc.

(The only other truly display related command there is calling xrandr twice to set "Broadcast RGB" "Full" for both HDMI monitors.)
Comment 44 Chris Wilson 2014-04-04 13:34:36 UTC
I am not sure what else to do about the heisenbug. Given that the original bug is solved, and that in normal usage you do not see the performance issue, I am going to sleep on it. If you have any clues, let me know.
Comment 45 Andreas Reis 2014-04-04 14:18:28 UTC
Created attachment 96903 [details]
Cursor comparison

Guess the only thing I might still add is that it always seems to happen with a particular cursor (though I have not checked if it can be reproduced with others).

That particular cursor has an effect where the "type icons" (eg. image icon, folder icon) of the selected files are overlaid with it — if the program calling it supports that. file-roller doesn't seem to, so if I select one or more files to drag (thus evoking the bug with debug=full), its space for the type icons stays empty. But not so with debug=full: Then a black + is shown instead.

I've added an image showing the cursor as it should be (right) and during the debug=full bug (left). Screenshot wouldn't include the cursor, so I had to take the pictures with my cellphone.

I'll also add one more Xorg.0.log of the bug with the most recent commit ("Move constants MOCS…").
Comment 46 Andreas Reis 2014-04-04 14:20:33 UTC
Created attachment 96904 [details]
Cursor comparison

Why on earth does this site default to "select from list: plain text" instead of "auto-detect"?
Comment 47 Andreas Reis 2014-04-04 14:25:23 UTC
Created attachment 96905 [details]
Xorg.0.log at 564e4a98

Anyway, the final Xorg.0.log.

But: Commit "sna/gen2+: Replace composite sources with solids where possible" causes the xserver to instantly crash when starting certain programs (noticed it with Chromium, pinta, and gimp):

Program received signal SIGSEGV, Segmentation fault.
0x00007f5dd78696e3 in gen7_composite_picture () from /usr/lib/xorg/modules/drivers/intel_drv.so
(gdb) bt
#0  0x00007f5dd78696e3 in gen7_composite_picture () from /usr/lib/xorg/modules/drivers/intel_drv.so
#1  0x00007f5dd786f78d in gen7_render_composite () from /usr/lib/xorg/modules/drivers/intel_drv.so
#2  0x00007f5dd77cf9ed in sna_composite () from /usr/lib/xorg/modules/drivers/intel_drv.so
#3  0x0000000000543f9d in damageComposite ()
#4  0x0000000000537f6a in ProcRenderComposite ()
#5  0x0000000000438456 in Dispatch ()
#6  0x000000000043cdd5 in dix_main ()
#7  0x00007f5ddc9ddb05 in __libc_start_main () from /usr/lib/libc.so.6
#8  0x000000000042510e in _start ()

BT was taken at git-head, ie. "Move constants MOCS", then I "bisected" to find which of the three new commits was the cause.
Comment 48 Chris Wilson 2014-04-04 14:58:12 UTC
(In reply to comment #47) 
> But: Commit "sna/gen2+: Replace composite sources with solids where
> possible" causes the xserver to instantly crash when starting certain
> programs (noticed it with Chromium, pinta, and gimp):

Oh, oops.

commit 938eea6dee0be153fcf007549a50213f6c957305
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 4 15:56:38 2014 +0100

    sna/gen2+: Beware the unattached ShmPixmap
    
    When dereferences priv, make sure it exists first. ShmPixmaps for
    example, may not have one, nor do very small buffers.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Let me take a look at the cursor issue.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.