Created attachment 90818 [details]
dmesg

System Environment:
--------------------------
Platform: Haswell
Kernel: (drm-intel-queued) df4547d82589714f1b0cecd93569130a452cbf46

Bug detailed description:
-------------------------
gem_tiled_swapping randomly triggers the OOM killer; it happens in 3 out of 5 runs. It happens on Haswell with the -queued, -fixes and -nightly kernels.

output:
IGT-Version: 1.5-g62e1cbc (x86_64) (Linux: 3.13.0-rc3_drm-intel-next-queued_df4547_20131215+ x86_64)
Killed

Call Trace:
[ 42.384756] [<ffffffff81711921>] ? dump_stack+0x41/0x51
[ 42.384783] [<ffffffff8170df47>] ? dump_header.isra.8+0x69/0x191
[ 42.384814] [<ffffffff8106ded1>] ? ktime_get_ts+0x49/0xab
[ 42.384846] [<ffffffff812cf3be>] ? ___ratelimit+0xae/0xc8
[ 42.384878] [<ffffffff810a3208>] ? oom_kill_process+0x76/0x2f8
[ 42.384913] [<ffffffff810a399e>] ? out_of_memory+0x3b2/0x3e5
[ 42.384948] [<ffffffff810a70e8>] ? __alloc_pages_nodemask+0x664/0x771
[ 42.384988] [<ffffffff810d0607>] ? alloc_pages_current+0xbf/0xdc
[ 42.385024] [<ffffffff810a21bc>] ? filemap_fault+0x25c/0x381
[ 42.385069] [<ffffffffa0075aeb>] ? i915_gem_fault+0x1b2/0x1c3 [i915]
[ 42.385102] [<ffffffff810b7f82>] ? __do_fault+0xac/0x3bf
[ 42.385134] [<ffffffff810bb6de>] ? handle_mm_fault+0x1e7/0x7e2
[ 42.385170] [<ffffffff81719c6c>] ? __do_page_fault+0x41c/0x469
[ 42.385205] [<ffffffff810b29ce>] ? vm_mmap_pgoff+0x82/0xab
[ 42.385238] [<ffffffff810e90cc>] ? do_vfs_ioctl+0x3f1/0x43a
[ 42.385272] [<ffffffff81717232>] ? page_fault+0x22/0x30
[ 42.406280] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 42.407130] [ 2572] 0 2572 77062 90 154 0 0 systemd-journal
[ 42.408027] [ 2904] 0 2904 10305 194 23 0 -1000 systemd-udevd
[ 42.408888] [ 3374] 0 3374 24466 45 22 0 0 lvmetad
[ 42.409748] [ 3400] 0 3400 12231 92 24 0 -1000 auditd
[ 42.410638] [ 3406] 0 3406 20053 33 9 0 0 audispd
[ 42.411494] [ 3410] 0 3410 5993 41 27 0 0 sedispatch
[ 42.412351] [ 3427] 0 3427 4778 53 22 0 0 irqbalance
[ 42.413290] [ 3428] 0 3428 6069 151 16 0 0 smartd
[ 42.414196] [ 3430] 0 3430 35147 67 39 0 0 abrtd
[ 42.415073] [ 3432] 0 3432 34618 63 38 0 0 abrt-watch-log
[ 42.415961] [ 3435] 0 3435 34618 62 38 0 0 abrt-watch-log
[ 42.416831] [ 3438] 0 3438 1075 20 9 0 0 rngd
[ 42.417700] [ 3442] 0 3442 8249 76 19 0 0 systemd-logind
[ 42.418607] [ 3444] 0 3444 86500 284 56 0 0 NetworkManager
[ 42.419488] [ 3446] 0 3446 65772 101 29 0 0 rsyslogd
[ 42.420400] [ 3452] 70 3452 6985 54 30 0 0 avahi-daemon
[ 42.421285] [ 3456] 81 3456 6118 101 16 0 -900 dbus-daemon
[ 42.422172] [ 3461] 0 3461 1749 30 10 0 0 mcelog
[ 42.423119] [ 3472] 993 3472 5647 57 14 0 0 chronyd
[ 42.424019] [ 3482] 70 3482 6985 50 22 0 0 avahi-daemon
[ 42.424925] [ 3495] 999 3495 127896 800 46 0 0 polkitd
[ 42.425873] [ 3503] 0 3503 40408 186 67 0 -900 modem-manager
[ 42.426791] [ 3532] 0 3532 25512 3113 51 0 0 dhclient
[ 42.427718] [ 3539] 0 3539 132685 1177 142 0 0 libvirtd
[ 42.428680] [ 3555] 32 3555 9422 94 21 0 0 rpcbind
[ 42.429609] [ 3564] 0 3564 20104 201 41 0 -1000 sshd
[ 42.430538] [ 3591] 0 3591 25190 448 47 0 0 sendmail
[ 42.431475] [ 3613] 51 3613 21453 375 39 0 0 sendmail
[ 42.432453] [ 3632] 0 3632 31020 148 19 0 0 crond
[ 42.433399] [ 3633] 0 3633 5930 47 16 0 0 atd
[ 42.434349] [ 3641] 0 3641 27498 29 11 0 0 agetty
[ 42.435346] [ 3694] 0 3694 32766 286 64 0 0 sshd
[ 42.436366] [ 3698] 0 3698 29262 512 19 0 0 bash
[ 42.437331] [ 3796] 0 3796 15969 109 48 0 1000 gem_tiled_swapp
[ 42.438323] Out of memory: Kill process 3796 (gem_tiled_swapp) score 969 or sacrifice child
[ 42.439360] Killed process 3796 (gem_tiled_swapp) total-vm:63876kB, anon-rss:416kB, file-rss:20kB

Reproduce steps:
----------------------------
1. ./gem_tiled_swapping
It also happens on Ironlake.
Created attachment 92039 [details] [review] shrink lock When testing, please make sure full-ppgtt is disabled - that adds too many complications.
The patch failed to apply at
@@ -5060,6 +5060,26 @@ static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)

I added the patch by hand as follows; the test still fails with the OOM killer.

--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5060,6 +5060,26 @@ static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
 #endif
 }
 
+static bool
+i915_gem_shrinker_lock(struct drm_device *dev, bool *unlock)
+{
+	*unlock = true;
+	if (mutex_trylock(&dev->struct_mutex))
+		return true;
+
+	if (mutex_is_locked_by(&dev->struct_mutex, current)) {
+		if (to_i915(dev)->mm.shrinker_no_lock_stealing)
+			return false;
+
+		*unlock = false;
+	} else {
+		if (i915_mutex_lock_interruptible(dev))
+			return false;
+	}
+
+	return true;
+}
+
 static int num_vma_bound(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;
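For readers following the locking logic, the branches of the shrink-lock patch quoted above can be modelled in userspace. This is only a sketch: the function and parameter names below are hypothetical stand-ins for the kernel calls (mutex_trylock, mutex_is_locked_by, the shrinker_no_lock_stealing flag, and i915_mutex_lock_interruptible returning 0 on success), and it models the decision table, not real locking.

```python
def shrinker_lock(trylock_ok, held_by_current, no_lock_stealing,
                  interruptible_lock_ok):
    """Model of the patch's control flow; returns (acquired, unlock).

    All four arguments are hypothetical booleans standing in for the
    results of the kernel calls made by i915_gem_shrinker_lock().
    """
    # Uncontended case: we took the mutex, so we must drop it afterwards.
    if trylock_ok:
        return True, True

    if held_by_current:
        # We already hold struct_mutex further up the call chain.
        if no_lock_stealing:
            return False, True  # 'unlock' is irrelevant when not acquired
        # Reuse the caller's lock; the shrinker must NOT unlock it.
        return True, False

    # Someone else holds the mutex: wait for it, but allow interruption.
    if not interruptible_lock_ok:
        return False, True
    return True, True
```

The case worth noticing is the second one: when the current task already holds the mutex and lock stealing is allowed, the shrinker proceeds but reports unlock as false so that it does not release a lock it did not take.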
Created attachment 92133 [details] [review] 1: include vma in scan
Created attachment 92134 [details] [review]
2: better shrink lock

Using i915_mutex_lock_interruptible() may be overkill here, but makes for a good story.
Try the above two patches, they should apply cleanly to -nightly.
Tested the above two patches; the OOM killer is still triggered.
Created attachment 94529 [details] [review]
3. Writeback our pages under memory pressure

And now for step 3. This can be tried independently, but really each step fixes a related issue.
(In reply to comment #8)
> Created attachment 94529 [details] [review]
> 3. Writeback our pages under memory pressure
>
> And now for step 3. This can be tried independently, but really each step fixes
> a related issue.

Tested the step 3 patch; it still fails with the OOM killer.
Can you please attach the latest dmesg after applying the patches so far?
And another thing you can test is whether the OOM still occurs with my complete tree at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742
Created attachment 94705 [details]
dmesg

(In reply to comment #11)
> And another thing you can test is whether the oom still occur with my
> complete tree at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742

Tested this tree; the OOM killer still occurs, in 2 out of 3 runs.

output:
IGT-Version: 1.5-g8ebc02a (x86_64) (Linux: 3.13.0_prts_6e5e25_20140225 x86_64)
Killed
Created attachment 94717 [details] [review] Include bound and active pages in the shrinker count This should be the right fix...
Created attachment 94718 [details] [review] Refactor common lock handling Mostly cleanup, but also a minor fix.
Created attachment 94719 [details] [review]
Object writeback

As far as I can tell, this should be redundant, but it may be a small improvement anyway.
Tested the above 3 patches; the OOM still occurs.
How are you running gem_tiled_swapping? Your system is completely out of swap which is unexpected.
When testing, please update i-g-t to include

commit ea332b64b6e9f6935da4b43f05fefcdcea32cc64
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Feb 26 11:56:16 2014 +0000

    lib: Test against available swap

    Even if we ignore the double-accounting bug in Linux, we need to be sure
    that the remaining swapspace is adequate for running our test as the
    system may be under load before we even start.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742 has been updated with more ideas. Please test and attach the dmesg.
(In reply to comment #18)
> When testing, please update i-g-t to include
>
> commit ea332b64b6e9f6935da4b43f05fefcdcea32cc64
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Wed Feb 26 11:56:16 2014 +0000
>
> lib: Test against available swap
>
> Even if we ignore the double-accounting bug in Linux, we need to be sure
> that the remaining swapspace is adequate for running our test as the
> system may be under load before we even start.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Tested on the latest igt and -nightly kernel; it still occurs.

output:
IGT-Version: 1.5-g072d358 (x86_64) (Linux: 3.13.0-rc8_drm-intel-fixes_e20ec5_20140303+ x86_64)
Killed

(In reply to comment #19)
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742 has been
> updated with more ideas. Please test and attach the dmesg.

The patch fails to apply:

patching file include/uapi/drm/i915_drm.h
Hunk #1 FAILED at 495.
Hunk #2 FAILED at 543.
2 out of 2 hunks FAILED -- saving rejects to file include/uapi/drm/i915_drm.h.rej
Created attachment 95000 [details] dmesg(072d358)
It's not a patch, but a replacement branch to run.
(In reply to comment #22)
> It's not a patch, but a replacement branch to run.

Sorry, I will give it a try.
(In reply to comment #19)
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742 has been
> updated with more ideas. Please test and attach the dmesg.

Building this branch (bug72742) fails:

# make
Setup is 17020 bytes (padded to 17408 bytes).
System is 6095 kB
CRC b97a29d9
Kernel: arch/x86/boot/bzImage is ready (#3)
Building modules, stage 2.
MODPOST 1225 modules
ERROR: "__hrtimer_start_range_ns" [drivers/gpu/drm/i915/i915.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2

commit:
commit 4bb7ef89c0f6188f9cc6ea696a24dfe2ca5ed3c2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Feb 28 18:04:35 2014 +0000

    create2-2
Sorry, that branch only compiles with i915.ko built in at the moment. Is it possible for you to try that, or do I need to fix compilation as a module?
The build still fails.

error:
make[4]: *** [drivers/gpu/drm/i915/i915_drv.o] Error 1
make[3]: *** [drivers/gpu/drm/i915] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make: *** [drivers] Error 2
Created attachment 95580 [details] kernel config
That .config has been mangled. So only you know what the build error was.
Updated http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742 please try again.
The build still fails.

# make
cc1: some warnings being treated as errors
make[4]: *** [drivers/gpu/drm/i915/i915_gem.o] Error 1
make[3]: *** [drivers/gpu/drm/i915] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make: *** [drivers] Error 2
And you still haven't told me the error!
The branch has been updated.
Created attachment 96818 [details] make log
Grrr I pushed the branch before fixing up the rebase fallout, apparently. Now that branch should compile!
It also happens on branch bug72742 (commit: d09f11fc7f7f6784).

output:
IGT-Version: 1.6-gb8afe98 (x86_64) (Linux: 3.14.0_prts_d09f11_20140403 x86_64)
Killed
Created attachment 96825 [details] dmesg
Ok, this seems to be a genuine failure of having too much gunk running on your test systems. Please update i-g-t to

commit e8869c4bc439de941be399d156323620a2d6ecda
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Apr 3 09:43:58 2014 +0100

    gem_tiled_swapping: Limit to available memory

    If there is not enough free RAM+swap for us to execute our test, we will
    hit OOM, so check first.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

and report any changes in output and behaviour.
It still exists.

output:
IGT-Version: 1.6-gd6362ce (x86_64) (Linux: 3.14.0_prts_d09f11_20140403 x86_64)
Using 8169 1MiB objects (available RAM: 7535/7670, swap: 1999)
Killed
Created attachment 96883 [details] dmesg
8G mem vs. 2G of swap. I guess we should start to memlock everything but e.g. 1G for those tiled tests ...
Yeah, the oddity here (and why it probably wasn't the right blocker to use for all the oom bugs) is that swap is full. The indication is that this is a true system OOM - though you may well ask why it isn't causing a SIGBUS instead of invoking the oomkiller? However, there should be plenty of space to execute our test; supposedly we are only using 8200MB of 9500MB. We surely can't have over a GiB of metadata in the test handler? Could we...
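The "8200MB of 9500MB" estimate can be reproduced from the sizing line that gem_tiled_swapping prints. The helper below is a rough sketch, not part of i-g-t: the function name is hypothetical, and it assumes the fields in that line are MiB and read as "available/total" for RAM.

```python
import re

# Matches the sizing line printed by the test, as quoted in this bug.
SIZING_RE = re.compile(
    r"Using (\d+) 1MiB objects \(available RAM: (\d+)/(\d+), swap: (\d+)\)"
)

def headroom_mib(line):
    """Return (RAM available + swap) minus the test's working set, in MiB.

    Assumption: every number in the line is in MiB and the RAM pair is
    available/total.
    """
    match = SIZING_RE.search(line)
    if match is None:
        raise ValueError("not a gem_tiled_swapping sizing line")
    objects, ram_available, _ram_total, swap = map(int, match.groups())
    return ram_available + swap - objects
```

For the Haswell output "Using 8169 1MiB objects (available RAM: 7535/7670, swap: 1999)" this gives 7535 + 1999 - 8169 = 1365 MiB of nominal headroom, which is the "over a GiB" that would have to be consumed elsewhere for a true OOM.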
Fwiw, I've tweaked the test slightly - I don't expect it to miraculously fix things, but I would like to double check the output now. My current thinking is that we have a backlog of writeback pages occupying both swap and RAM preventing further allocations.
Chris, can QA do any work due to the NEEDINFO status?
I want to see if the new gem_tiled_swapping test behaves any differently.
Tested on the latest igt and -nightly kernel.

output:
IGT-Version: 1.6-g9eec5b0 (x86_64) (Linux: 3.14.0_drm-intel-nightly_7cd8b8_20140408+ x86_64)
Using 8169 1MiB objects (available RAM: 7531/7670, swap: 1999)
Killed
Created attachment 97060 [details] dmesg
Something to try then:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..3702d3aa6898 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -900,8 +900,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 
 			/* Case 2 above */
-			} else if (global_reclaim(sc) ||
-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+			} else if ((global_reclaim(sc) && !PageReclaim(page)) ||
+				   !(sc->gfp_mask & __GFP_IO)) {
 				/*
 				 * This is slightly racy - end_page_writeback()
 				 * might have just cleared PageReclaim, then
Alternate idea,

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..8c2cb1150d17 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -135,6 +135,10 @@ unsigned long vm_total_pages; /* The total number of pages which the VM controls
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+static bool throttle_direct_reclaim(gfp_t gfp_mask,
+				    struct zonelist *zonelist,
+				    nodemask_t *nodemask);
+
 #ifdef CONFIG_MEMCG
 static bool global_reclaim(struct scan_control *sc)
 {
@@ -1521,7 +1525,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * of pages under pages flagged for immediate reclaim and stall if any
 	 * are encountered in the nr_immediate check below.
 	 */
-	if (nr_writeback && nr_writeback == nr_taken)
+	if (nr_writeback > nr_taken / 2)
 		zone_set_flag(zone, ZONE_WRITEBACK);
 
 	/*
@@ -2465,6 +2469,12 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
+
+		if (global_reclaim(sc) &&
+		    throttle_direct_reclaim(sc->gfp_mask,
+					    zonelist,
+					    sc->nodemask))
+			aborted_reclaim = true;
 	} while (--sc->priority >= 0 && !aborted_reclaim);
 
 out:
Created attachment 97161 [details]
dmesg(patch)

Applied the above 2 patches on the -nightly branch. It still exists.

output:
IGT-Version: 1.6-g9eec5b0 (x86_64) (Linux: 3.14.0_prts_de579f_20140410 x86_64)
Using 4196 1MiB objects (available RAM: 3611/3698, swap: 1995)
Killed
Actually those patches had a huge impact: writeback is now not hogging memory. However, we are still completely out of memory.
I even rebuilt a kernel with your config,

ickle@crystalwell:/usr/src/intel-gpu-tools/tests$ sudo ./gem_tiled_swapping
IGT-Version: 1.6-gf168b37 (x86_64) (Linux: 3.14.0+ x86_64)
Using 4223 1MiB objects (available RAM: 3636/3712, swap: 2047)
Subtest threaded: SUCCESS

:|
Can you please run the test manually, with vmstat 1 running in the background, and attach the output of vmstat?
Have you tried disabling kmemleak and friends?
Please also 'cat /proc/sys/vm/laptop_mode'
'while :; do cat /proc/meminfo ; sleep 1; done' would also be interesting.
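To compare successive /proc/meminfo snapshots from a loop like the one above, it can help to turn each snapshot into a dictionary first. The following is a minimal sketch; the function name is hypothetical, and it assumes the usual "Name: value [kB]" layout of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Values are returned as printed (kB for most fields, plain counts
    for entries such as HugePages_Total).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields
```

Reading the file with open("/proc/meminfo").read() once per second and logging fields such as Writeback, SwapFree and Shmem would give the same information in a form that is easier to plot or diff.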
(In reply to comment #51)
> I even rebuilt a kernel with your config,
>
> ickle@crystalwell:/usr/src/intel-gpu-tools/tests$ sudo ./gem_tiled_swapping
> IGT-Version: 1.6-gf168b37 (x86_64) (Linux: 3.14.0+ x86_64)
> Using 4223 1MiB objects (available RAM: 3636/3712, swap: 2047)
> Subtest threaded: SUCCESS
>
> :|

It fails on the 3rd cycle.

# cat /proc/sys/vm/laptop_mode
0

vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3555160 30120 107052 0 0 442 53 86 161 0 1 99 0
0 0 0 3554904 30132 107372 0 0 0 56 126 54 0 0 100 0
0 0 0 3554844 30132 107384 0 0 0 0 95 30 0 0 100 0
0 0 0 3554844 30132 107384 0 0 0 0 158 67 0 0 100 0
0 0 0 3553840 30668 107324 0 0 652 0 304 168 0 0 100 0
0 0 0 3553572 30672 107496 0 0 4 0 204 116 0 0 100 0
0 0 0 3553480 30676 107500 0 0 4 36 221 113 0 0 100 0
0 0 0 3552944 31092 107452 0 0 408 20 241 146 0 0 100 0
0 0 0 3552944 31092 107452 0 0 0 0 122 53 0 0 100 0
0 0 0 3552852 31092 107500 0 0 0 0 126 50 0 0 100 0
0 0 0 3552884 31092 107500 0 0 0 0 97 43 0 0 100 0
1 0 0 3328652 348 362840 0 0 5144 12 1000 316 2 17 81 0
1 0 0 2853348 348 833048 0 0 0 140 1092 29 3 22 75 0
1 0 0 2377484 348 1303000 0 0 0 0 1063 36 3 22 75 0
1 0 0 1900880 348 1773816 0 0 0 0 1076 33 3 22 75 0
1 0 0 1423856 348 2245300 0 0 0 0 1059 35 3 22 75 0
1 0 0 946552 356 2717108 0 0 0 32 1073 42 3 22 75 0
1 0 0 469768 356 3188040 0 0 0 0 1063 37 3 22 75 0
2 0 19472 33952 296 3610324 0 19472 0 19472 1474 151 3 27 71 0
4 0 484964 66468 296 3175864 0 465492 0 465516 2597 1395 0 32 66 2
0 2 751940 23692 980 3022884 0 266976 688 266976 3681 636 0 27 66 6
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 1244580 295336 988 2606984 0 492896 0 492908 1237 311 1 9 50 41
0 2 2047996 28760 988 2037144 0 803160 0 803160 1221 1291 1 15 69 14
0 2 2047996 39628 988 2037144 0 0 0 0 530 1813 0 0 74 26
0 2 2047996 50604 988 2037144 0 0 0 0 557 1835 0 0 74 26
0 2 2047996 29024 184 2072884 0 0 1548 4 1202 915 0 8 63 28
1 2 2047996 25188 180 2110588 0 0 7552 16 4935 3009 0 12 49 39
1 0 916 3684576 628 10568 0 0 7916 0 1038 464 0 5 94 1
0 0 916 3684720 628 10548 0 0 0 0 88 35 0 0 100 0
0 0 916 3684624 628 10548 0 0 0 0 89 16 0 0 100 0
0 0 916 3684624 636 10548 0 0 0 36 96 46 0 0 100 0
0 1 916 3684720 636 10548 0 0 0 888 102 20 0 0 100 0
0 0 916 3684856 636 10548 0 0 0 100 279 74 0 0 100 0
0 0 916 3684732 636 10592 0 0 44 0 75 21 0 0 100 0
0 0 916 3684552 636 10772 0 0 180 0 130 53 0 0 100 0
0 0 916 3684552 636 10772 0 0 0 0 91 16 0 0 100 0
0 0 916 3684396 644 10768 0 0 0 32 102 45 0 0 100 0
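The vmstat capture above is easier to reason about once it is split into rows. The sketch below is one possible way to do that; the function name is hypothetical, and the column list is taken from the header printed in this capture (other vmstat versions may add columns such as "st").

```python
# Column names as they appear in the header of the capture above.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

def parse_vmstat(text):
    """Return one dict per data row, skipping banner and header lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue  # repeated "r b swpd ..." header
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows
```

With the rows parsed, max(row["swpd"] for row in rows) shows how much swap was in use at the peak, and the rows where "so" drops to 0 while "swpd" stays at its maximum mark the point where swap-out stopped making progress.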
Have you tried increasing swap and seeing just how much we need?
crw + attached kernel config + 4GiB ram, 2GiB swap: running for over 5 hours in a loop with no sign of distress. In particular, I do not see the symptom of inactive_anon != shmem that appears on your machines.
Can you please collect the output of top -b -d 60 -o RES into a file (it's going to be big so compress with lzma) while the test is running until it dies with OOM.
Created attachment 97332 [details] output(top -b -d 60 -o RES) Test on latest -nightly. top -b -d 60 -o RES
Tree at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742 updated. Please attach the failing dmesg and gem_tiled_swapping output.
Tip of bug72742 is now:

commit 9b8a95053e73d3f1b7d86a22fca797c9e3be84d5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Apr 8 16:48:50 2014 +0100

    mm: Throttle shrinkers harder

    During testing of i915.ko with working texture sets larger than RAM, we
    encounter OOM with plenty of memory still trapped within writeback, e.g
It still exists on Haswell. The system fails to boot on Ironlake.

output(HSW):
IGT-Version: 1.6-ga595a40 (x86_64) (Linux: 3.15.0-rc2_prts_9b8a95_20140425 x86_64)
Using 8169 1MiB objects (available RAM: 7518/7670, swap: 1999)
Killed
Created attachment 98103 [details] dmesg(HSW)
Created attachment 98104 [details] boot log(ILK)
A request for more debug info from Dave Hansen:

"We have tracepoints for the shrinkers in here (it says slab, but it's all the shrinkers, I checked):

/sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_*/enable

and another for OOMs:

/sys/kernel/debug/tracing/events/oom/enable

Could you collect a trace during one of these OOM events and see what the i915 shrinker is doing? Just enable those two and then collect a copy of:

/sys/kernel/debug/tracing/trace

That'll give us some insight about how well the shrinker is working. If the VM gave up on calling in to it, it might reveal why we didn't get all the way down in to i915_gem_shrink_all()."

I think the output (trace.dat) of

$ sudo trace-cmd record -e vmscan:mm_shrink_slab_start -e vmscan:mm_shrink_slab_end -e oom ./gem_tiled_swapping

should suffice.
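If trace-cmd is not available, the enable files named in the request above can be switched on directly. The snippet below is a sketch under assumptions: the function name and the tracing_root parameter are hypothetical, the default path is the one quoted in the request, and writing to it needs root with debugfs mounted.

```python
import glob
import os

def enable_oom_shrinker_tracing(tracing_root="/sys/kernel/debug/tracing"):
    """Write "1" to the shrinker and OOM tracepoint enable files.

    Returns the list of files that were written. The relative patterns
    are the two paths quoted in the request above.
    """
    patterns = [
        "events/vmscan/mm_shrink_slab_*/enable",
        "events/oom/enable",
    ]
    enabled = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(tracing_root, pattern))):
            with open(path, "w") as enable_file:
                enable_file.write("1")
            enabled.append(path)
    return enabled
```

After the test has been killed, the collected events can be copied out of the trace file under the same root and attached alongside the dmesg from the same run.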
Created attachment 98212 [details] trace output
Test on latest -nightly kernel and attached trace output.
Please repeat that with the tree from http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742
(In reply to comment #69) > Please repeat that with the tree from > http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742 I will give you file location via mail.
(In reply to comment #70)
> (In reply to comment #69)
> > Please repeat that with the tree from
> > http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742
>
> I will give you file location via mail.

Why? As a general rule, please do not send information by email or use pastebins etc. unless specifically asked to do so. If the information is not on the bug, it will get lost or be accessible to only a select few people. Thanks.
(In reply to comment #67) > Created attachment 98212 [details] > trace output Did an OOM happen during the time when that trace was gathered? I looked through it briefly but didn't see one. If one was triggered, please include the full dmesg.
(In reply to comment #71)
> (In reply to comment #70)
> > (In reply to comment #69)
> > > Please repeat that with the tree from
> > > http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742
> >
> > I will give you file location via mail.
>
> Why? As a general rule, please do not send information by email or use
> pastebins etc. unless specifically asked to do so. If the information is not
> on the bug, it will get lost or be accessible to only a select few people.
> Thanks.

The file is too large.
Created attachment 98398 [details]
trace output(branch72742)

Retested and collected new output (trace.dat).
(In reply to comment #74)
> Created attachment 98398 [details]
> trace output(branch72742)
>
> Retest and get new output(trace.date).

This still does not contain the dmesg. I need that in order to figure out where in the trace the OOM is occurring. Chris, it's possible that your trace-cmd suggestion is actually causing a problem here because trace-cmd itself is getting OOM'd. I'd suggest using the trace buffers directly.

I'd also really like to see how many pages the i915 code believes it has pinned each time the shrinker is called during all this.
(In reply to comment #75)
> (In reply to comment #74)
> > Created attachment 98398 [details]
> > trace output(branch72742)
> >
> > Retest and get new output(trace.date).
>
> This still does not contain the dmesg. I need that in order to figure out
> where in the trace that the OOM is occurring. Chris, it's possible that
> your trace-cmd suggestion is actually causing a problem here because
> trace-cmd itself is getting OOM'd. I'd suggest using the trace buffers
> directly.
>
> I'd also really like to see how many pages the i915 code believes it has
> pinned each time the shrinker is called during all this.

Dmesg was attached in comment 64.
(In reply to comment #76) > Dmesg was attached in comment 64. I need both the trace output and the dmesg which were collected at the same time. I need to see what the shrinker was doing *during* the OOM and I need to correlate them with the timestamps. You can collect in the trace with the 'printk:console' event if you like, but I do need new copies of both the trace and dmesg, not an old dmesg.
Created attachment 98670 [details] trace.dat Retest it and got dmesg and trace.dat at the same time.
Created attachment 98671 [details] dmesg
But that's the wrong kernel! Anyway it has a lead, disable kmemleak and kmemcheck (compile it out) and see if the test uses the expected amount of memory.
$ grep -i 'free swap' attachment-98671
[ 101.803201] Free swap = 686680kB
[ 102.026876] Free swap = 559320kB
[ 102.131871] Free swap = 507464kB
[ 102.452230] Free swap = 305256kB
[ 102.558154] Free swap = 238828kB
[ 102.686385] Free swap = 168488kB
[ 102.890507] Free swap = 44592kB
[ 102.997281] Free swap = 0kB
[ 103.128730] Free swap = 0kB

Note that we're still exhausting swap space.
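The grep above can be taken one step further by extracting the timestamps, which makes it possible to line the swap exhaustion up against a trace from the same run. This is a small sketch with hypothetical function names; it assumes dmesg lines of the form shown above.

```python
import re

# "[ 102.997281] Free swap = 0kB", tolerating variable spacing.
FREE_SWAP_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+Free swap\s*=\s*(\d+)kB")

def free_swap_series(dmesg_text):
    """Return [(timestamp in seconds, free swap in kB), ...] in log order."""
    return [(float(ts), int(kb)) for ts, kb in FREE_SWAP_RE.findall(dmesg_text)]

def first_exhausted(series):
    """Return the first timestamp at which free swap reached 0, or None."""
    for timestamp, free_kb in series:
        if free_kb == 0:
            return timestamp
    return None
```

Applied to the lines above, the first report with no free swap is the one stamped 102.997281, roughly 1.2 seconds after the first OOM report at 101.803201 still showed 686680kB free.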
(In reply to comment #80)
> But that's the wrong kernel! Anyway it has a lead, disable kmemleak and
> kmemcheck (compile it out) and see if the test uses the expected amount of
> memory.

Poke. And NEEDINFO, otherwise you'll get ignored by our QA.
commit ceabbba524fb43989875f66a6c06d7ce0410fe5c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Mar 25 13:23:04 2014 +0000

    drm/i915: Include bound and active pages in the count of shrinkable objects

    When the machine is under a lot of memory pressure and being stressed by
    multiple GPU threads, we quite often report fewer than shrinker->batch
    (i.e. SHRINK_BATCH) pages to be freed. This causes the shrink_control to
    skip calling into i915.ko to release pages, despite the GPU holding onto
    most of the physical pages in its active lists.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=72742
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Robert Beckett <robert.beckett@intel.com>
    Reviewed-by: Rafael Barbalho <rafael.barbalho@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Created attachment 99462 [details]
dmesg(ceabbb)

(In reply to comment #83)
> commit ceabbba524fb43989875f66a6c06d7ce0410fe5c
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue Mar 25 13:23:04 2014 +0000
>
> drm/i915: Include bound and active pages in the count of shrinkable
> objects
>
> When the machine is under a lot of memory pressure and being stressed by
> multiple GPU threads, we quite often report fewer than shrinker->batch
> (i.e. SHRINK_BATCH) pages to be freed. This causes the shrink_control to
> skip calling into i915.ko to release pages, despite the GPU holding onto
> most of the physical pages in its active lists.
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=72742
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Robert Beckett <robert.beckett@intel.com>
> Reviewed-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Tested this commit; it still fails with the OOM killer.

output:
IGT-Version: 1.6-g737d248 (x86_64) (Linux: 3.14.0_kcloud_ceabbb_20140521+ x86_64)
Using 8169 1MiB objects (available RAM: 7510/7670, swap: 1999)
Killed
It wasn't the right kernel anyway, but here I want to know if it is the kmemleak that is causing the overallocation.
(In reply to comment #85)
> It wasn't the right kernel anyway, but here I want to know if it is the
> kmemleak that is causing the overallocation.

Do you mean an incorrect config or incorrect source code? Tested on the latest -nightly kernel; this issue still exists.
For kmemleak checking, config. For the lack of the vital tell in the dmesg, source.
Created attachment 99567 [details]
kernel config

(In reply to comment #87)
> For kmemleak checking, config.
>
> For the lack of the vital tell in the dmesg, source.

I used this config to build the latest drm-intel-nightly commit.
CONFIG_HAVE_DEBUG_KMEMLEAK=y is still set so I doubt this helps answer the question what happens when kmemleak is turned off.
Created attachment 99919 [details]
dmesg(disable CONFIG_HAVE_DEBUG_KMEMLEAK)

(In reply to comment #89)
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
>
> is still set so I doubt this helps answer the question what happens when
> kmemleak is turned off.

With CONFIG_HAVE_DEBUG_KMEMLEAK disabled, output:
IGT-Version: 1.6-gff3c122 (x86_64) (Linux: 3.14.0_kcloud_3dabfd_20140527+ x86_64)
Using 8169 1MiB objects (available RAM: 7530/7670, swap: 1999)
Killed
Are you sure that is the right kernel with the right config? It is not based on -nightly.
Created attachment 100010 [details]
dmesg(9f53d4f)

Retested on the -nightly kernel.

output:
IGT-Version: 1.6-gff3c122 (x86_64) (Linux: 3.15.0-rc7_nightly_20140528+ x86_64)
Using 8168 1MiB objects (available RAM: 7502/7669, swap: 1999)
Killed

kernel: (drm-intel-nightly)
commit 9f53d4f6f55aa0c037f299dbe2986eec9151be9b
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Wed May 28 00:01:05 2014 +0200

    drm-intel-nightly: 2014y-05m-28d-00h-00m-47s integration manifest
Created attachment 100011 [details] kernel config
We died with over 1.3GiB of pending writeback which exceeds the amount of memory we left spare during the test.
(In reply to comment #94)
> We died with over 1.3GiB of pending writeback which exceeds the amount of
> memory we left spare during the test.

Do we need to resurrect the writeback harder hacks?
Whilst there is such low hanging fruit in the core vm, I am not inclined to look closer to home...
(In reply to comment #95)
> (In reply to comment #94)
> > We died with over 1.3GiB of pending writeback which exceeds the amount of
> > memory we left spare during the test.
>
> Do we need to resurrect the writeback harder hacks?

The problem isn't the number of writeback pages, the problem is that the pages are not on a path to being freed. Even with a large amount of writeback, the VM will notice that _some_ of it is getting written out, and will resist OOMing until progress stops being made. The reason for the OOM here is that lack of progress.

Waiting on writeback will, of course, reduce the number of pages under writeback, but it just delays the inevitable when overall reclaim progress isn't being made.
What is the state on latest -nightly?
It still happens on the latest -nightly kernel.

Test on Haswell:
output:
IGT-Version: 1.8-g4b81e9c (x86_64) (Linux: 3.17.0-rc6_drm-intel-nightly_0f7cc1_20140925+ x86_64)
Using 8168 1MiB objects (available RAM: 7502/7669, swap: 1999)
Killed
Created attachment 106829 [details] dmesg(HSW)
Since this test is about swapping and tiling/de-swizzling, maybe it makes sense to exclude it from the OOM killer or disable the OOM killer entirely while it's being run? Not sure if it would still work well with vm.overcommit_memory=2. For a stress test that wouldn't make sense, but I don't think that's what we're trying to verify here with this artificial test. Alternately, we could add a debugfs hook to force swapping and readback testing rather than trying to push the kernel into it from userspace.
Does this still happen with current igt? Thomas has pushed a bunch of fixes...
This is a kernel bug! /me grumbles If we no longer hit the bug due to "igt fixes" to limit the amount of swapping we do, we need to revert those.
Tested 10 rounds on the latest drm-intel-nightly kernel. I don't see the OOM killer issue. The test passes on HSW and skips on ILK.
It skips everywhere because I've screwed up the L-shaped memory test. Please retest with

commit 1765838e34d96c7eb2288cf899ab19f819fa5cb0
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Mon Mar 23 11:00:20 2015 +0100

    tests/gem_tiled_swapping: Fix up L-shaped testing

Also why was there no regression report for gem_tiled_swapping suddenly skipping?
(In reply to Daniel Vetter from comment #105)
> It skips everywhere because I've screwed up the L-shaped memory test. Please
> retest with
>
> commit 1765838e34d96c7eb2288cf899ab19f819fa5cb0
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date: Mon Mar 23 11:00:20 2015 +0100
>
> tests/gem_tiled_swapping: Fix up L-shaped testing
>
> Also why was there no regression report for gem_tiled_swapping suddenly
> skipping?

Tested on the latest -nightly kernel and igt; it still happens on ILK.

output:
IGT-Version: 1.10-g392e8ee (x86_64) (Linux: 4.0.0-rc5_drm-intel-nightly_877605_20150325+ x86_64)
Using 640 1MiB objects (available RAM: 334/7911, swap: 1999)
Killed
Created attachment 114604 [details] dmesg(ILK0325)
(In reply to Daniel Vetter from comment #105)
> Also why was there no regression report for gem_tiled_swapping suddenly
> skipping?

Filed bug 89752.
gem_userptr_blits@forked-unsync-swapping-multifd-interruptible hangs with the following configuration:

Kernel: 4.3.0-rc8-drm-intel-testing-2015-08-28
Mesa: mesa-10.6.7 from http://cgit.freedesktop.org/mesa/mesa/
Xf86_video_intel: 2.99.917 from http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/
Libdrm: libdrm-2.4.64 from http://cgit.freedesktop.org/mesa/drm/
Cairo: 1.14.2 from http://cgit.freedesktop.org/cairo
libva: libva-1.6.0 from http://cgit.freedesktop.org/libva/
intel-driver: 1.6.1 from http://cgit.freedesktop.org/vaapi/intel-driver
xorg: 1.17.99 installed with script git_xorg.sh
Xserver: xorg-server-1.17.2 from http://cgit.freedesktop.org/xorg/xserver
Intel-gpu-tools: 1.12 from http://cgit.freedesktop.org/xorg/app/intel-gpu
Let's close this old bug and track: bug 97130
Closing almost year old resolved+moved.