Summary: | [BDW Bisected]igt/gem_close_race/process-exit causes system hang with OOM Killer, when true PPGTT enabled | |
---|---|---|---
Product: | DRI | Reporter: | Guo Jinxian <jinxianx.guo>
Component: | DRM/Intel | Assignee: | Michel Thierry <michel.thierry>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | critical | |
Priority: | highest | CC: | eero.t.tamminen, hengx.ding, huax.lu, intel-gfx-bugs, valtteri.rantala
Version: | DRI git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
i915 platform: | | i915 features: |
Description
Guo Jinxian
2014-12-19 05:11:25 UTC
No dmesg, not even over netconsole?
Created attachment 111208 [details]
screen shot
I tried to use netconsole and a serial port.
Our USB network card does not support netconsole.
With the serial port attached, the system fails to boot and stops at the lines below:
[ 28.646597] dracut: Switching root
[ 28.723991] random: init urandom read with 78 bits of entropy available
[ 28.821374] init: plymouth-upstart-bridge main process (2710) terminated with status 1
[ 28.917644] init: plymouth-upstart-bridge main process ended, respawning
[ 29.003485] init: plymouth-upstart-bridge main process (2720) terminated with status 1
[ 29.099711] init: plymouth-upstart-bridge main process ended, respawning
* Startin
The attachment is the screenshot; it looks like the OOM killer is involved.
Created attachment 111348 [details]
dmesg(BSW)
I reproduced the OOM killer on BSW, but not the system hang. BDW shows both the OOM killer and the system hang.
output:
IGT-Version: 1.9-geb799b2 (x86_64) (Linux: 3.18.0_drm-intel-nightly_4fa231_20141225+ x86_64)
Killed
dmesg:
[ 88.732628] Call Trace:
[ 88.732643] [<ffffffff8178d5e2>] ? dump_stack+0x41/0x51
[ 88.732651] [<ffffffff8178ac83>] ? dump_header.isra.10+0x69/0x191
[ 88.732660] [<ffffffff8107f537>] ? ktime_get+0x44/0x80
[ 88.732668] [<ffffffff8133894a>] ? ___ratelimit+0xae/0xc8
[ 88.732676] [<ffffffff810d1bc4>] ? oom_kill_process+0x76/0x330
[ 88.732681] [<ffffffff810d1981>] ? find_lock_task_mm+0x22/0x6e
[ 88.732690] [<ffffffff810406de>] ? has_ns_capability_noaudit+0xe/0x15
[ 88.732696] [<ffffffff810d23fb>] ? out_of_memory+0x41f/0x452
[ 88.732703] [<ffffffff810d638a>] ? __alloc_pages_nodemask+0x65e/0x7aa
[ 88.732711] [<ffffffff81338608>] ? radix_tree_lookup_slot+0x10/0x23
[ 88.732718] [<ffffffff81104d10>] ? alloc_pages_current+0xaf/0xcc
[ 88.732724] [<ffffffff810d0f36>] ? filemap_fault+0x289/0x3ae
[ 88.732732] [<ffffffff810ec105>] ? __do_fault+0x35/0x73
[ 88.732738] [<ffffffff810ee092>] ? do_read_fault.isra.81+0x1ae/0x26d
[ 88.732746] [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[ 88.732753] [<ffffffff810efa65>] ? handle_mm_fault+0x1eb/0x840
[ 88.732759] [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[ 88.732768] [<ffffffff81031906>] ? __do_page_fault+0x42e/0x47b
[ 88.732774] [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[ 88.732780] [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[ 88.732787] [<ffffffff8107f3f1>] ? ktime_get_ts64+0x4b/0xb6
[ 88.732794] [<ffffffff8111f14f>] ? poll_select_set_timeout+0x4e/0x6f
[ 88.732801] [<ffffffff81794602>] ? page_fault+0x22/0x30
[ 88.732805] Mem-Info:
[ 88.732809] Node 0 DMA per-cpu:
[ 88.732814] CPU 0: hi: 0, btch: 1 usd: 0
[ 88.732817] CPU 1: hi: 0, btch: 1 usd: 0
[ 88.732821] CPU 2: hi: 0, btch: 1 usd: 0
[ 88.732825] CPU 3: hi: 0, btch: 1 usd: 0
[ 88.732828] Node 0 DMA32 per-cpu:
[ 88.732833] CPU 0: hi: 186, btch: 31 usd: 0
[ 88.732836] CPU 1: hi: 186, btch: 31 usd: 30
[ 88.732840] CPU 2: hi: 186, btch: 31 usd: 0
[ 88.732844] CPU 3: hi: 186, btch: 31 usd: 0
[ 88.732847] Node 0 Normal per-cpu:
[ 88.732851] CPU 0: hi: 186, btch: 31 usd: 0
[ 88.732855] CPU 1: hi: 186, btch: 31 usd: 0
[ 88.732859] CPU 2: hi: 186, btch: 31 usd: 0
[ 88.732863] CPU 3: hi: 186, btch: 31 usd: 0
[ 88.732871] active_anon:8072 inactive_anon:22204 isolated_anon:0
[ 88.733042] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 88.733053] [ 2462] 0 2462 1113 21 7 0 0 sh
[ 88.733059] [ 2489] 0 2489 4437 605 14 1 0 initctl
[ 88.733064] [ 2589] 0 2589 4936 131 13 0 0 upstart-udev-br
[ 88.733072] [ 2593] 0 2593 12449 238 27 0 -1000 systemd-udevd
[ 88.733077] [ 3394] 0 3394 5857 67 17 0 0 rpcbind
[ 88.733083] [ 3437] 0 3437 7444 62 19 0 0 rpc.idmapd
[ 88.733089] [ 3466] 102 3466 9893 176 23 0 0 dbus-daemon
[ 88.733094] [ 3560] 0 3560 82589 300 66 1 0 ModemManager
[ 88.733100] [ 3570] 0 3570 10864 88 27 0 0 systemd-logind
[ 88.733105] [ 3594] 0 3594 89162 405 71 0 0 NetworkManager
[ 88.733111] [ 3606] 101 3606 65535 179 30 0 0 rsyslogd
[ 88.733116] [ 3616] 117 3616 5388 114 16 0 0 rpc.statd
[ 88.733122] [ 3622] 0 3622 73632 196 46 0 0 polkitd
[ 88.733127] [ 3664] 0 3664 2560 574 11 0 0 dhclient
[ 88.733133] [ 3672] 0 3672 5006 39 13 0 0 getty
[ 88.733139] [ 3677] 0 3677 5006 40 13 0 0 getty
[ 88.733144] [ 3684] 0 3684 5006 40 13 0 0 getty
[ 88.733149] [ 3685] 0 3685 5006 39 13 0 0 getty
[ 88.733155] [ 3688] 0 3688 5006 41 13 0 0 getty
[ 88.733161] [ 3715] 0 3715 15343 171 34 0 -1000 sshd
[ 88.733166] [ 3723] 0 3723 5916 63 17 0 0 cron
[ 88.733172] [ 3724] 0 3724 4799 58 13 0 0 irqbalance
[ 88.733177] [ 3738] 0 3738 1094 45 7 0 0 acpid
[ 88.733183] [ 3742] 106 3742 9288 82 20 0 0 kerneloops
[ 88.733188] [ 3744] 109 3744 109308 360 78 0 0 whoopsie
[ 88.733194] [ 3837] 111 3837 8090 78 21 0 0 avahi-daemon
[ 88.733201] [ 3841] 0 3841 5006 41 13 0 0 getty
[ 88.733206] [ 3846] 0 3846 19215 278 41 0 0 cupsd
[ 88.733212] [ 3849] 111 3849 8058 63 20 0 0 avahi-daemon
[ 88.733217] [ 3891] 0 3891 18840 224 41 0 0 cups-browsed
[ 88.733223] [ 3901] 7 3901 15791 127 34 0 0 dbus
[ 88.733228] [ 3971] 65534 3971 8808 63 21 0 0 dnsmasq
[ 88.733234] [ 4003] 0 4003 3959 201 13 0 0 upstart-file-br
[ 88.733239] [ 4022] 0 4022 4058 289 12 0 0 upstart-socket-
[ 88.733245] [ 4271] 0 4271 27447 253 57 2 0 sshd
[ 88.733251] [ 4346] 0 4346 6814 623 18 0 0 bash
[ 88.733256] [ 4360] 0 4360 27483 255 56 0 0 sshd
[ 88.733262] [ 4396] 0 4396 6787 594 18 0 0 bash
[ 88.733267] [ 4411] 0 4411 22606 122 41 0 1000 gem_close_race
[ 88.733272] Out of memory: Kill process 4411 (gem_close_race) score 999 or sacrifice child
[ 88.733433] Killed process 4411 (gem_close_race) total-vm:90424kB, anon-rss:484kB, file-rss:4kB
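The task dump above can be read mechanically. The following is a minimal sketch, not part of the original report, that parses rows of the OOM killer's process table and picks out the task with the highest `oom_score_adj`. The 4 KiB page size and the helper name `parse_task_dump` are assumptions made for illustration; the three sample rows are copied from the dmesg above.

```python
import re

# One row of the OOM killer's task dump, in the layout shown in the dmesg above:
# [timestamp] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*[\d.]+\]\s+\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<nr_ptes>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)

PAGE_KB = 4  # assumption: total_vm and rss are counted in 4 KiB pages


def parse_task_dump(dmesg_text):
    """Return one dict per task row; lines that are not task rows are skipped."""
    tasks = []
    for line in dmesg_text.splitlines():
        match = ROW.search(line)
        if match:
            tasks.append(
                {k: (v if k == "name" else int(v)) for k, v in match.groupdict().items()}
            )
    return tasks


# Three rows copied from the log above.
SAMPLE = """\
[ 88.733251] [ 4346] 0 4346 6814 623 18 0 0 bash
[ 88.733161] [ 3715] 0 3715 15343 171 34 0 -1000 sshd
[ 88.733267] [ 4411] 0 4411 22606 122 41 0 1000 gem_close_race
"""

tasks = parse_task_dump(SAMPLE)
victim = max(tasks, key=lambda t: t["oom_score_adj"])
print(victim["name"], victim["total_vm"] * PAGE_KB, "kB total_vm")
print("rss of sampled tasks:", sum(t["rss"] for t in tasks) * PAGE_KB, "kB")
```

Under the 4 KiB assumption, the `total_vm` of 22606 pages for gem_close_race corresponds to the `total-vm:90424kB` reported in the "Killed process" line. The small per-task rss values suggest, though they do not prove, that the memory pressure comes from kernel-side allocations rather than from userspace.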
2f82bbdf3d4f1361c3d713c516d8aa390102374d is the first bad commit
commit 2f82bbdf3d4f1361c3d713c516d8aa390102374d
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Dec 15 14:58:00 2014 +0000
drm/i915: Use true PPGTT in Gen8+ when execlists are enabled
In Gen8+, full ppgtt needs execlist, otherwise the ctx switch can hang.
Also remove the current restriction, a user should be able to explicitly
set ppgtt=2.
Note, this patch considers that execlist support has been enabled by default on Gen8.
v2: Remove non-default restriction and clarify commit message (Daniel)
Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
[danvet: s/comment/commit message/ in the commit message since that's what Michel meant as per our irc discussion.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
:040000 040000 c14ea5c72e697dbb673dfc767f593762b547a31c 8d36deb61be8a1026babd6596a872e43041f18cf M drivers
This issue could not be reproduced on BSW; I confirmed this with Lu Hua. The bisect result covers both the system hang and the OOM killer issue on BDW.
The OOM problem won't go away until "deferred allocation / dynamic page allocation" is added; gem_close_race creates (and keeps alive) just too many contexts (and therefore ppgtts). I can confirm it passes in my local branch with deferred allocation enabled.
I couldn't reproduce on my BDW either. Could you please retest on latest -nightly?
This issue still exists on BSW with the latest nightly branch.
igt commit:3d65ff780d6d7a1b354bd530942a194a97f73dca
nightly commit:d6bc7a6a0a7573350e8be8ec54002c20d1dbe1e0
(In reply to Rodrigo Vivi from comment #7)
> I couldn't reproduce on my BDW either. Could you please retest on latest
> -nightly?
(In reply to Rodrigo Vivi from comment #7)
> I couldn't reproduce on my BDW either.
We've seen this in our testing, 100% reproducible. Seeing the OOM requires:
1. A Java process mapping GBs of memory in total in large anonymous mappings (of which only a few tens of MB are actually used)
2. Ubuntu 14.04 or 14.10, as those have a version of the Unity "compiz" compositor which leaks Window references on every window close
3. Running the SynMark GL context recreation test about a dozen times after boot
Without 1), seeing the OOM kill took hundred(s) of repeats of step 3).
slabtop shows vm struct slabs growing slowly. At first I thought that was the kernel issue, but I think the compiz leak explains it. There's something in the graphics driver which dislikes a lot of memory being mapped while there's a small leak of handles.
*** Bug 89646 has been marked as a duplicate of this bug. ***
Dynamic page allocation finally landed in nightly:
http://cgit.freedesktop.org/drm-intel/commit/?id=90ae20039e11a91e7144ab4e1800616d03403df5
Test should not cause OOM.
Tested on the latest nightly kernel and latest igt; this issue no longer exists. Verified.
output:
--------------------
root@x-bsw08:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# ./gem_close_race --run-subtest process-exit
IGT-Version: 1.10-g1f6a64e (x86_64) (Linux: 4.0.0-rc7_drm-intel-nightly_044307_20150410+ x86_64)
Subtest process-exit: SUCCESS (9.913s)
Closing old verified+fixed.
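The remark in this thread that gem_close_race keeps too many contexts (and therefore ppgtts) alive until deferred allocation lands can be illustrated with a back-of-the-envelope sketch. Everything numeric here is an assumption made for illustration and is not taken from the driver source or from this report: a 4 GiB per-context address space built from 4 page directories, 512 entries per directory and per page table, 4 KiB pages, and a hypothetical count of 1000 live contexts. The function names are likewise hypothetical.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
ENTRIES_PER_TABLE = 512   # assumption: 512 entries per page directory / page table
PAGE_DIRECTORIES = 4      # assumption: 4 x 512 x 512 x 4 KiB = 4 GiB address space


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def eager_bytes():
    """Page-table memory if every directory and table is allocated up front."""
    page_tables = PAGE_DIRECTORIES * ENTRIES_PER_TABLE
    return (PAGE_DIRECTORIES + page_tables) * PAGE_SIZE


def deferred_bytes(mapped_bytes):
    """Page-table memory if tables are allocated only for the mapped range.

    Assumes one contiguous mapping starting at address 0.
    """
    if mapped_bytes == 0:
        return 0
    pages = ceil_div(mapped_bytes, PAGE_SIZE)
    page_tables = ceil_div(pages, ENTRIES_PER_TABLE)
    page_dirs = ceil_div(page_tables, ENTRIES_PER_TABLE)
    return (page_dirs + page_tables) * PAGE_SIZE


contexts = 1000  # hypothetical number of live contexts
print("eager, per context:   ", eager_bytes(), "bytes")
print("deferred, 1 MiB mapped:", deferred_bytes(1 << 20), "bytes")
print("eager, all contexts:  ", contexts * eager_bytes() / 2**30, "GiB")
```

Under these assumptions, up-front allocation costs roughly 8 MiB of page tables per context regardless of how little is mapped, so a test that keeps many contexts alive can exhaust memory even though its userspace footprint stays small. Allocating tables on demand makes the cost proportional to what is actually mapped.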