Bug 90492 - Xorg unresponsive under heavy system-load
Summary: Xorg unresponsive under heavy system-load
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-05-17 14:07 UTC by alexander.maznev
Modified: 2015-05-17 22:23 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Xorg logs, vmstat, and dmesg output. (56.91 KB, text/plain)
2015-05-17 14:07 UTC, alexander.maznev
no flags Details
vmstat, dmesg, xorg logs, with oom_kill_allocating_task set to 0 (27.36 KB, application/x-gzip)
2015-05-17 16:52 UTC, alexander.maznev
no flags Details
sudo perf top, xorg log, vmstat (390.00 KB, text/plain)
2015-05-17 21:01 UTC, alexander.maznev
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description alexander.maznev 2015-05-17 14:07:58 UTC
Created attachment 115855 [details]
Xorg logs, vmstat, and dmesg output.

I have had this problem on rare occasions over the last-year, when CPU or memory resources are in heavy use, but been unable to reproduce it. Recently I can reproduce it consistently by triggering firefox bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1165624 

On triggering the Firefox bug, Xserver becomes nearly or completely unresponsive to user input, after which either Xserver fails, or hard-reboot is required. 

The following is consistently triggered: 

(WW) intel(0): Failed to submit rendering commands, trying again with outputs disabled.e
(EE) intel(0): Failed to submit rendering commands, disabling acceleration.

Followed either by a segfault or a bus error. 

System info: 

VGA compatible controller [0300]: Intel Corporation 3rd Gen Core processor Graphics Controller [8086:0166] (rev 09)

Linux version 4.0.1-1-ARCH (builduser@tobias) (gcc version 5.1.0 (GCC) ) #1 SMP PREEMPT Wed Apr 29 12:00:26 CEST 2015

xorg-server 1.17.1-5

xf86-video-intel 2.99.917-5

Xorg logs, vmstat, and dmesg output are attached.
Comment 1 Chris Wilson 2015-05-17 15:15:55 UTC
You ran out of memory (including swap). The kernel cannot execute the GPU commands due to memory exhaustion and that eventually lands in an untrapped SIGBUS. The segfault is a bug elsewhere in the xserver. It is unresponsive due to the amount of swap thrashing...

For example, the first Xorg.0.log, the crash is due to not being able to page in a library from disk. The second didn't have enough memory even to do symbol lookups in the backtrace. The third is a segfault during client shutdown, but it is not clear where. The last one is the only one that looks like the ddx should have prevented but didn't (the ddx is aware that accessing a mmap() may cause a SIGBUS due to oom and tries to trap them) -- however, symbol lookup failed. So for what it is worth, I added a few missing traps around pixman_fill based on the Xorg.0.log you reported on irc:

commit 335ac8742e9eadcdf8f087b24f84c7df0fda31c0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun May 17 08:53:45 2015 +0100

    sna: Wrap CPU access for composite operations with sigtrap
Comment 2 alexander.maznev 2015-05-17 15:36:27 UTC
(In reply to Chris Wilson from comment #1)
> You ran out of memory (including swap). The kernel cannot execute the GPU
> commands due to memory exhaustion and that eventually lands in an untrapped
> SIGBUS. The segfault is a bug elsewhere in the xserver. It is unresponsive
> due to the amount of swap thrashing...
> 
> For example, the first Xorg.0.log, the crash is due to not being able to
> page in a library from disk. The second didn't have enough memory even to do
> symbol lookups in the backtrace. The third is a segfault during client
> shutdown, but it is not clear where. The last one is the only one that looks
> like the ddx should have prevented but didn't (the ddx is aware that
> accessing a mmap() may cause a SIGBUS due to oom and tries to trap them) --
> however, symbol lookup failed. So for what it is worth, I added a few
> missing traps around pixman_fill based on the Xorg.0.log you reported on irc:
> 
> commit 335ac8742e9eadcdf8f087b24f84c7df0fda31c0
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sun May 17 08:53:45 2015 +0100
> 
>     sna: Wrap CPU access for composite operations with sigtrap

Hello Chris, 
Thank you for looking at this. I suspect in some-cases it's not running out of Memory at least directly (when I run with 1GB swap, opening this page with the firefox fills about 400mgb of the swap [so it's possible to load 3-4 before experiencing problems], if I turn swapoff, with 3gb RAM free the system becomes completely unresponsive just after one page load). vmstat shows a lot of inactive memory which is not being freed when the system runs low on resources? Further why isn't oom_killer handling this gracefully? Firefox has a +100 oomscore over any other running process when this is triggered and terminating it should return the system to a normal state.
Comment 3 Chris Wilson 2015-05-17 15:45:32 UTC
(In reply to alexander.maznev from comment #2)
> Hello Chris, 
> Thank you for looking at this. I suspect in some-cases it's not running out
> of Memory at least directly (when I run with 1GB swap, opening this page
> with the firefox fills about 400mgb of the swap [so it's possible to load
> 3-4 before experiencing problems], if I turn swapoff, with 3gb RAM free the
> system becomes completely unresponsive just after one page load). vmstat
> shows a lot of inactive memory which is not being freed when the system runs
> low on resources?

Your dmesg indicated the first failure with completely full swap. Though it is a known issue that the kernel will start killing processes with swap still available. One of the problems here is during the early oom, the GPU is still active and itself waiting upon allocations (which is preventing it from yielding memory back to the system).

> Further why isn't oom_killer handling this gracefully?
> Firefox has a +100 oomscore over any other running process when this is
> triggered and terminating it should return the system to a normal state.

You have oom_kill_allocating_task set in sysctl.
Comment 4 alexander.maznev 2015-05-17 15:50:21 UTC
low on resources?

> Your dmesg indicated the first failure with completely full swap. Though it
> is a known issue that the kernel will start killing processes with swap
> still available. One of the problems here is during the early oom, the GPU
> is still active and itself waiting upon allocations (which is preventing it
> from yielding memory back to the system).

Yes, I turned swap off afterwards to get more consistent behaviour (the thing with having swap on is if there is some very small amount of space, it will freeze while aggressively trying to swap stuff around to get memory, so the system hangs instead of erroring out of xserver). 

> You have oom_kill_allocating_task set in sysctl.

cat /proc/sys/vm/oom_kill_allocating_task 
0
Comment 5 Chris Wilson 2015-05-17 15:58:11 UTC
(In reply to alexander.maznev from comment #4)
> low on resources?
> 
> > Your dmesg indicated the first failure with completely full swap. Though it
> > is a known issue that the kernel will start killing processes with swap
> > still available. One of the problems here is during the early oom, the GPU
> > is still active and itself waiting upon allocations (which is preventing it
> > from yielding memory back to the system).
> 
> Yes, I turned swap off afterwards to get more consistent behaviour (the
> thing with having swap on is if there is some very small amount of space, it
> will freeze while aggressively trying to swap stuff around to get memory, so
> the system hangs instead of erroring out of xserver). 

But X still can't do very much if the kernel is preventing the driver from doing anything whilst it is processing oom.

> > You have oom_kill_allocating_task set in sysctl.
> 
> cat /proc/sys/vm/oom_kill_allocating_task 
> 0

$ grep oom_kill_allocating_task dmesg.txt 
[ 1481.886894] Out of memory (oom_kill_allocating_task): Kill process 540 (tilda) score 0 or sacrifice child
[ 1486.885295] Out of memory (oom_kill_allocating_task): Kill process 540 (tilda) score 0 or sacrifice child
[ 1491.883715] Out of memory (oom_kill_allocating_task): Kill process 540 (tilda) score 0 or sacrifice child
[ 1495.103841] Out of memory (oom_kill_allocating_task): Kill process 538 (tint2) score 0 or sacrifice child
[ 1495.104251] Out of memory (oom_kill_allocating_task): Kill process 1138 (Timer) score 0 or sacrifice child
[ 3405.538346] Out of memory (oom_kill_allocating_task): Kill process 5293 (systemd-coredum) score 0 or sacrifice child
Comment 6 alexander.maznev 2015-05-17 16:52:54 UTC
Created attachment 115857 [details]
vmstat, dmesg, xorg logs, with oom_kill_allocating_task set to 0

> Kill process 540 (tilda) score 0 or sacrifice child

Yes, sorry that was the dmesg from testing with oom_kill_allocating_task set to 1, here is what I can get with it set to 0 (systemd.journal crashes and fails to record, so this is running with dmesg -w > ).
Comment 7 Chris Wilson 2015-05-17 18:09:06 UTC
Agreed, that looks more suspicious. Can you get a "sudo perf top" during that? I'd guess it was stuck in a pagefault handler.
Comment 8 Chris Wilson 2015-05-17 18:11:05 UTC
Also it would be great if you could build http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/ (sudo apt-get build-dep xserver-xorg-video-intel ; git clone ssh://git.freedesktop.org/git/xorg/driver/xf86-video-intel ; cd xf86-video-intel ; ./autogen.sh --enable-debug --prefix=/usr ; make && sudo make install) as I expect we will be testing some patches eventually...
Comment 9 Chris Wilson 2015-05-17 19:39:59 UTC
Looking at the page, firefox allocates over 3.6GiB of Pixmaps on my 32bit system. (Quite an impressive feat!) Since that is beyond the address space limits for Xorg, we will be trying to hide allocation and access failures. Trying again without swap, resulted in oom and firefox dying (as expected) - but managed to trigger a few i915.ko warnings.
Comment 10 alexander.maznev 2015-05-17 21:01:17 UTC
Created attachment 115863 [details]
sudo perf top, xorg log, vmstat

> Looking at the page, firefox allocates over 3.6GiB of Pixmaps on my 32bit system. (Quite an impressive feat!)

Yeah there's always some subtle bug browsers screw the user with, but I finally got one good for diagnosing other stuff! 

> Since that is beyond the address space limits for Xorg, we will be trying to hide allocation and access failures. Trying again without swap, resulted in oom and firefox dying (as expected) - but managed to trigger a few i915.ko warnings.

So with oom_kill_allocating_task 0, xserver freezes, and whether firefox is actually ever attempted to be killed I still don't know. This is with a 1 minute scheduled shutdown (x-server froze, than system powered off correctly). Running with renice priorities for vmstat seems to help, sudo perf top output included. oomscore is from watch cat /proc/<firefoxpid>/oom_score > oomscore.txt, with renice -5.
Comment 11 Chris Wilson 2015-05-17 21:18:32 UTC
Just looks like the kernel gets trapped in a direct reclaim loop. Hmm, before you do the shutdown, can you do a sysrq-t?
Comment 12 alexander.maznev 2015-05-17 22:02:28 UTC
Well it's a lenovo x1 so the only way to get sysrq behaviour is to use a custom keybaord map or echo <letter> > /pros/sysrq-trigger -- but even if a cron or otherwise scripted job for the later would go through, I believe the output is written to journalctl rather than console, and even with a negative renice for  watch dmesg or journalctl piping to a log file I'm unable to get any of the important output. I'll update when I figure out a way to get debug output for what's going on.
Comment 13 Chris Wilson 2015-05-17 22:23:58 UTC
It was only get the stacktrace and see where the direct reclaim was happening. You can use "perf record -g -a", but you need to coordinate stopping it and syncing the output before shutdown. (And then perf report for attaching.)


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.