Bug 87477 - [BDW Bisected]igt/gem_close_race/process-exit causes system hang with OOM Killer, when true PPGTT enabled
Summary: [BDW Bisected]igt/gem_close_race/process-exit causes system hang with OOM Kil...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: highest critical
Assignee: Michel Thierry
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 89646
Depends on:
Blocks:
 
Reported: 2014-12-19 05:11 UTC by Guo Jinxian
Modified: 2017-07-03 14:00 UTC (History)
CC List: 5 users

See Also:
i915 platform:
i915 features:


Attachments
screen shot (1.85 MB, image/jpeg)
2014-12-23 07:04 UTC, lu hua
dmesg(BSW) (124.54 KB, text/plain)
2014-12-26 02:45 UTC, lu hua

Description Guo Jinxian 2014-12-19 05:11:25 UTC
==System Environment==
--------------------------
Regression: Yes.
Good commit on -next-queued: 372ee59699d9704086dadb084209542d10e28851(2014_12_01)

Non-working platforms: BDW

==kernel==
--------------------------
origin/drm-intel-nightly: 2014_12_19(fails)
origin/drm-intel-next-queued:140fd38dc4962ae3694f81900b51c567df1b6d33(fails)
    drm/i915: Hold runtime PM during plane commit
origin/drm-intel-fixes: b0616c5306b342ceca07044dbc4f917d95c4f825(works)
    drm/i915: Unlock panel even when LVDS is disabled

==Bug detailed description==
-----------------------------
igt/gem_close_race/process-exit causes a system hang. Because the system hangs, we are unable to capture dmesg.

==Reproduce steps==
---------------------------- 
1. ./gem_close_race --run-subtest process-exit


Comment 1 Chris Wilson 2014-12-19 07:35:52 UTC
No dmesg, not even over netconsole?
Comment 2 lu hua 2014-12-23 07:04:59 UTC
Created attachment 111208
screen shot

I tried to use netconsole and the serial port.
Our USB network card does not support netconsole.
With the serial port attached, the system fails to boot and stops at the lines below:
[   28.646597] dracut: Switching root
[   28.723991] random: init urandom read with 78 bits of entropy available
[   28.821374] init: plymouth-upstart-bridge main process (2710) terminated with status 1
[   28.917644] init: plymouth-upstart-bridge main process ended, respawning
[   29.003485] init: plymouth-upstart-bridge main process (2720) terminated with status 1
[   29.099711] init: plymouth-upstart-bridge main process ended, respawning
 * Startin

The attachment is a screen shot; it looks like the hang is related to the OOM killer.
Comment 3 lu hua 2014-12-26 02:45:47 UTC
Created attachment 111348
dmesg(BSW)

I reproduced the OOM killer on BSW, but not the system hang. BDW shows both the OOM killer and the system hang.
output:
IGT-Version: 1.9-geb799b2 (x86_64) (Linux: 3.18.0_drm-intel-nightly_4fa231_20141225+ x86_64)
Killed

dmesg:
[   88.732628] Call Trace:
[   88.732643]  [<ffffffff8178d5e2>] ? dump_stack+0x41/0x51
[   88.732651]  [<ffffffff8178ac83>] ? dump_header.isra.10+0x69/0x191
[   88.732660]  [<ffffffff8107f537>] ? ktime_get+0x44/0x80
[   88.732668]  [<ffffffff8133894a>] ? ___ratelimit+0xae/0xc8
[   88.732676]  [<ffffffff810d1bc4>] ? oom_kill_process+0x76/0x330
[   88.732681]  [<ffffffff810d1981>] ? find_lock_task_mm+0x22/0x6e
[   88.732690]  [<ffffffff810406de>] ? has_ns_capability_noaudit+0xe/0x15
[   88.732696]  [<ffffffff810d23fb>] ? out_of_memory+0x41f/0x452
[   88.732703]  [<ffffffff810d638a>] ? __alloc_pages_nodemask+0x65e/0x7aa
[   88.732711]  [<ffffffff81338608>] ? radix_tree_lookup_slot+0x10/0x23
[   88.732718]  [<ffffffff81104d10>] ? alloc_pages_current+0xaf/0xcc
[   88.732724]  [<ffffffff810d0f36>] ? filemap_fault+0x289/0x3ae
[   88.732732]  [<ffffffff810ec105>] ? __do_fault+0x35/0x73
[   88.732738]  [<ffffffff810ee092>] ? do_read_fault.isra.81+0x1ae/0x26d
[   88.732746]  [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[   88.732753]  [<ffffffff810efa65>] ? handle_mm_fault+0x1eb/0x840
[   88.732759]  [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[   88.732768]  [<ffffffff81031906>] ? __do_page_fault+0x42e/0x47b
[   88.732774]  [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[   88.732780]  [<ffffffff8111ef5b>] ? __pollwait+0xcb/0xcb
[   88.732787]  [<ffffffff8107f3f1>] ? ktime_get_ts64+0x4b/0xb6
[   88.732794]  [<ffffffff8111f14f>] ? poll_select_set_timeout+0x4e/0x6f
[   88.732801]  [<ffffffff81794602>] ? page_fault+0x22/0x30
[   88.732805] Mem-Info:
[   88.732809] Node 0 DMA per-cpu:
[   88.732814] CPU    0: hi:    0, btch:   1 usd:   0
[   88.732817] CPU    1: hi:    0, btch:   1 usd:   0
[   88.732821] CPU    2: hi:    0, btch:   1 usd:   0
[   88.732825] CPU    3: hi:    0, btch:   1 usd:   0
[   88.732828] Node 0 DMA32 per-cpu:
[   88.732833] CPU    0: hi:  186, btch:  31 usd:   0
[   88.732836] CPU    1: hi:  186, btch:  31 usd:  30
[   88.732840] CPU    2: hi:  186, btch:  31 usd:   0
[   88.732844] CPU    3: hi:  186, btch:  31 usd:   0
[   88.732847] Node 0 Normal per-cpu:
[   88.732851] CPU    0: hi:  186, btch:  31 usd:   0
[   88.732855] CPU    1: hi:  186, btch:  31 usd:   0
[   88.732859] CPU    2: hi:  186, btch:  31 usd:   0
[   88.732863] CPU    3: hi:  186, btch:  31 usd:   0
[   88.732871] active_anon:8072 inactive_anon:22204 isolated_anon:0

[   88.733042] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[   88.733053] [ 2462]     0  2462     1113       21       7        0             0 sh
[   88.733059] [ 2489]     0  2489     4437      605      14        1             0 initctl
[   88.733064] [ 2589]     0  2589     4936      131      13        0             0 upstart-udev-br
[   88.733072] [ 2593]     0  2593    12449      238      27        0         -1000 systemd-udevd
[   88.733077] [ 3394]     0  3394     5857       67      17        0             0 rpcbind
[   88.733083] [ 3437]     0  3437     7444       62      19        0             0 rpc.idmapd
[   88.733089] [ 3466]   102  3466     9893      176      23        0             0 dbus-daemon
[   88.733094] [ 3560]     0  3560    82589      300      66        1             0 ModemManager
[   88.733100] [ 3570]     0  3570    10864       88      27        0             0 systemd-logind
[   88.733105] [ 3594]     0  3594    89162      405      71        0             0 NetworkManager
[   88.733111] [ 3606]   101  3606    65535      179      30        0             0 rsyslogd
[   88.733116] [ 3616]   117  3616     5388      114      16        0             0 rpc.statd
[   88.733122] [ 3622]     0  3622    73632      196      46        0             0 polkitd
[   88.733127] [ 3664]     0  3664     2560      574      11        0             0 dhclient
[   88.733133] [ 3672]     0  3672     5006       39      13        0             0 getty
[   88.733139] [ 3677]     0  3677     5006       40      13        0             0 getty
[   88.733144] [ 3684]     0  3684     5006       40      13        0             0 getty
[   88.733149] [ 3685]     0  3685     5006       39      13        0             0 getty
[   88.733155] [ 3688]     0  3688     5006       41      13        0             0 getty
[   88.733161] [ 3715]     0  3715    15343      171      34        0         -1000 sshd
[   88.733166] [ 3723]     0  3723     5916       63      17        0             0 cron
[   88.733172] [ 3724]     0  3724     4799       58      13        0             0 irqbalance
[   88.733177] [ 3738]     0  3738     1094       45       7        0             0 acpid
[   88.733183] [ 3742]   106  3742     9288       82      20        0             0 kerneloops
[   88.733188] [ 3744]   109  3744   109308      360      78        0             0 whoopsie
[   88.733194] [ 3837]   111  3837     8090       78      21        0             0 avahi-daemon
[   88.733201] [ 3841]     0  3841     5006       41      13        0             0 getty
[   88.733206] [ 3846]     0  3846    19215      278      41        0             0 cupsd
[   88.733212] [ 3849]   111  3849     8058       63      20        0             0 avahi-daemon
[   88.733217] [ 3891]     0  3891    18840      224      41        0             0 cups-browsed
[   88.733223] [ 3901]     7  3901    15791      127      34        0             0 dbus
[   88.733228] [ 3971] 65534  3971     8808       63      21        0             0 dnsmasq
[   88.733234] [ 4003]     0  4003     3959      201      13        0             0 upstart-file-br
[   88.733239] [ 4022]     0  4022     4058      289      12        0             0 upstart-socket-
[   88.733245] [ 4271]     0  4271    27447      253      57        2             0 sshd
[   88.733251] [ 4346]     0  4346     6814      623      18        0             0 bash
[   88.733256] [ 4360]     0  4360    27483      255      56        0             0 sshd
[   88.733262] [ 4396]     0  4396     6787      594      18        0             0 bash
[   88.733267] [ 4411]     0  4411    22606      122      41        0          1000 gem_close_race
[   88.733272] Out of memory: Kill process 4411 (gem_close_race) score 999 or sacrifice child
[   88.733433] Killed process 4411 (gem_close_race) total-vm:90424kB, anon-rss:484kB, file-rss:4kB
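
For anyone triaging a similar log, the task table above can be parsed mechanically. Below is a minimal sketch, assuming the dmesg has been saved as text in the column layout shown in this comment. The function names `parse_oom_tasks` and `likely_victim` are hypothetical, and the ranking is a deliberate simplification, not the kernel's actual badness calculation. In this log it highlights that gem_close_race runs with oom_score_adj 1000, which is consistent with it being killed despite its small RSS.

```python
import re

# Matches task-table rows such as:
# [   88.733267] [ 4411]     0  4411    22606      122      41        0          1000 gem_close_race
# Assumption: process names contain no spaces (true for the rows in this log).
TASK_ROW = re.compile(
    r"^\[\s*\d+\.\d+\]\s+\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<nr_ptes>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)\s*$"
)


def parse_oom_tasks(dmesg_text):
    """Return one dict per task-table row found in the OOM report."""
    tasks = []
    for line in dmesg_text.splitlines():
        match = TASK_ROW.match(line.strip())
        if match:
            row = {
                key: (value if key == "name" else int(value))
                for key, value in match.groupdict().items()
            }
            tasks.append(row)
    return tasks


def likely_victim(tasks):
    """Rank by oom_score_adj, then rss (a simplification of the kernel heuristic).

    Tasks with oom_score_adj == -1000 are exempt from OOM killing and are skipped.
    Returns None when no candidate is left.
    """
    candidates = [t for t in tasks if t["oom_score_adj"] > -1000]
    return max(candidates, key=lambda t: (t["oom_score_adj"], t["rss"]), default=None)
```

Feeding the rows from this comment into `parse_oom_tasks` and then `likely_victim` should point at pid 4411, matching the "Kill process 4411" line.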
Comment 4 Ding Heng 2015-01-04 07:01:25 UTC
2f82bbdf3d4f1361c3d713c516d8aa390102374d is the first bad commit
commit 2f82bbdf3d4f1361c3d713c516d8aa390102374d
Author: Michel Thierry <michel.thierry@intel.com>
Date:   Mon Dec 15 14:58:00 2014 +0000

    drm/i915: Use true PPGTT in Gen8+ when execlists are enabled

    In Gen8+, full ppgtt needs execlist, otherwise the ctx switch can hang.

    Also remove the current restriction, a user should be able to explicitly set
    ppgtt=2.

    Note, this patch considers that execlist support has been enabled by
    default on Gen8.

    v2: Remove non-default restriction and clarify commit message (Daniel)

    Cc: Daniel Vetter <daniel@ffwll.ch>
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    [danvet: s/comment/commit message/ in the commit message since that's
    what Michel meant as per our irc discussion.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 c14ea5c72e697dbb673dfc767f593762b547a31c 8d36deb61be8a1026babd6596a872e43041f18cf M      drivers
Comment 5 Ding Heng 2015-01-05 05:52:02 UTC
This issue could not be reproduced on BSW; I confirmed this with Lu Hua. The bisect result covers both the system hang and the OOM killer issue on BDW.
Comment 6 Michel Thierry 2015-01-05 10:41:16 UTC
The OOM problem won't go away until "deferred allocation / dynamic page allocation" is added; gem_close_race creates (and keeps alive) too many contexts (and therefore PPGTTs).

I can confirm it passes in my local branch with deferred allocation enabled.
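
To illustrate why up-front allocation scales badly with the number of contexts, here is a rough back-of-the-envelope sketch. It assumes each PPGTT preallocates page tables for a full 4 GiB address space with 4 KiB pages and 8-byte PTEs; these parameters are illustrative assumptions, not figures taken from this report, and page directories are ignored. The function names are hypothetical.

```python
def preallocated_pt_bytes(address_space_bytes, page_size=4096, pte_size=8):
    """Bytes of page tables needed to cover an address space up front.

    Assumes one page-sized table holds page_size // pte_size entries and
    each entry maps one page. Page directories are not counted.
    """
    ptes_per_table = page_size // pte_size
    bytes_mapped_per_table = ptes_per_table * page_size
    # Ceiling division: a partially used table still costs a whole page.
    tables = -(-address_space_bytes // bytes_mapped_per_table)
    return tables * page_size


def contexts_until_exhausted(free_bytes, per_context_bytes):
    """How many live contexts fit before page tables alone use free_bytes."""
    return free_bytes // per_context_bytes


per_ctx = preallocated_pt_bytes(4 * 2**30)
print(per_ctx // 2**20, "MiB of page tables per PPGTT (under the stated assumptions)")
print(contexts_until_exhausted(4 * 2**30, per_ctx), "contexts to consume 4 GiB")
```

Under these assumptions every live context pins a fixed amount of memory whether or not it maps anything, which is the cost that deferred allocation avoids.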
Comment 7 Rodrigo Vivi 2015-01-21 23:55:46 UTC
I couldn't reproduce on my BDW either.

Could you please retest on latest -nightly?
Comment 8 Ding Heng 2015-01-22 02:10:38 UTC
This issue still exists on BSW with the latest nightly branch.

igt commit:3d65ff780d6d7a1b354bd530942a194a97f73dca
nightly commit:d6bc7a6a0a7573350e8be8ec54002c20d1dbe1e0

(In reply to Rodrigo Vivi from comment #7)
> I couldn't reproduce on my BDW either.
>
> Could you please retest on latest -nightly?
Comment 9 Eero Tamminen 2015-02-20 16:17:55 UTC
(In reply to Rodrigo Vivi from comment #7)
> I couldn't reproduce on my BDW either.

We've seen this in our testing, 100% reproducible.  Seeing the OOM requires:
1. A Java process mapping GBs of memory in large anonymous mappings (of which only a few tens of MB in total are actually used)
2. Ubuntu 14.04 or 14.10, as those ship a version of the Unity "compiz" compositor that leaks window references on every window close
3. Running the SynMark GL context recreation test about a dozen times after boot

Without 1), seeing the OOM kill took hundred(s) of repeats of step 3).

slabtop shows the vm struct slabs growing slowly. At first I thought that was the kernel issue, but I think the compiz leak explains it.  Something in the graphics driver copes badly with a lot of memory being mapped while there is a small leak of handles.
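
The slow slab growth described above can also be checked without watching slabtop by diffing two snapshots of /proc/slabinfo. The following is a minimal sketch, assuming the standard slabinfo 2.x text layout (name, active_objs, num_objs, objsize, ...). The helper names `parse_slabinfo` and `slab_growth` are hypothetical, and reading /proc/slabinfo usually requires root.

```python
def parse_slabinfo(text):
    """Return {cache_name: active_objs * objsize} from /proc/slabinfo-style text."""
    usage = {}
    for line in text.splitlines():
        # Skip the version line, the column header, and blank lines.
        if not line.strip() or line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        active_objs = int(fields[1])
        objsize = int(fields[3])
        usage[name] = active_objs * objsize
    return usage


def slab_growth(before, after):
    """List (cache_name, bytes_grown) for caches that grew, largest first."""
    grown = [
        (name, after[name] - before.get(name, 0))
        for name in after
        if after[name] - before.get(name, 0) > 0
    ]
    return sorted(grown, key=lambda item: item[1], reverse=True)
```

One way to use it is to save /proc/slabinfo before and after a batch of test runs, pass both texts through `parse_slabinfo`, and look at the top entries of `slab_growth`.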
Comment 10 Michel Thierry 2015-03-23 13:09:59 UTC
*** Bug 89646 has been marked as a duplicate of this bug. ***
Comment 11 Michel Thierry 2015-04-09 16:37:02 UTC
Dynamic page allocation finally landed in nightly:

http://cgit.freedesktop.org/drm-intel/commit/?id=90ae20039e11a91e7144ab4e1800616d03403df5

Test should not cause OOM.
Comment 12 ye.tian 2015-04-10 03:09:01 UTC
Tested on the latest nightly kernel and latest igt; this issue no longer occurs.
Verified it.

output:
--------------------
root@x-bsw08:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# ./gem_close_race --run-subtest process-exit
IGT-Version: 1.10-g1f6a64e (x86_64) (Linux: 4.0.0-rc7_drm-intel-nightly_044307_20150410+ x86_64)
Subtest process-exit: SUCCESS (9.913s)
Comment 13 Jari Tahvanainen 2017-07-03 14:00:19 UTC
Closing old verified+fixed.

