Bug 69247

Summary:

[GM45/ILK/SNB/BYT/BDW]igt/gem_evict_everything/forked-swapping-multifd-mempressure-normal causes OOM killer

Product:

DRI

Reporter:

lu hua <huax.lu>

Component:

DRM/Intel

Assignee:

Daniel Vetter <daniel>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

major

Priority:

high

CC:

intel-gfx-bugs, wendy.wang, xunx.fang

Version:

unspecified

Hardware:

All

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Bug Depends on:

72742

Bug Blocks:

Attachments:

Description	Flags
dmesg	none
patch 1	none
patch 2	none
dmesg with patch1 patch2	none
patch 1, fixed	none
shrinker tuning	none
Updated patch for 3.12-rc2 kernels	none
ILK dmesg	none
Include active objects in the shrinker count	none
Tune the shrinker	none
dmesg(BYT)	none
dmesg(byt)	none
dmesg(BDW)	none
kernel config	none

Description lu hua 2013-09-12 03:13:27 UTC

Created attachment 85686 [details]
dmesg

System Environment:
--------------------------
Platform: Ironlake/Sandybridge
kernel   (drm-intel-fixes)3cea210f2c7c50e67287207a6548314491f49f31

Bug detailed description:
-----------------------------
It causes the OOM killer to fire on Ironlake/Sandybridge with the -fixes, -nightly, and -queued kernels. It's a new case.
The following cases also have this issue:
igt/gem_evict_everything/forked-swapping-mempressure-interruptible
igt/gem_evict_everything/forked-swapping-mempressure-normal
igt/gem_evict_everything/forked-swapping-multifd-mempressure-interruptible

 Call Trace:
[  107.576559]  [<c0870f9d>] ? dump_stack+0x3e/0x4e
[  107.577836]  [<c086e300>] ? dump_header.isra.9+0x53/0x15e
[  107.579103]  [<c028ced4>] ? oom_kill_process+0x6b/0x2a3
[  107.580367]  [<c028f64a>] ? get_page_from_freelist+0x382/0x3b6
[  107.581633]  [<c02960a0>] ? try_to_free_pages+0x20b/0x25b
[  107.582972]  [<c028cd00>] ? find_lock_task_mm+0x12/0x40
[  107.584171]  [<c028d40d>] ? out_of_memory+0x1c3/0x1f0
[  107.585335]  [<c028fb67>] ? __alloc_pages_nodemask+0x4e9/0x5e9
[  107.586392]  [<c028c712>] ? filemap_fault+0x23f/0x336
[  107.587525]  [<c029dea6>] ? __do_fault+0x89/0x33e
[  107.588668]  [<c02ba9ae>] ? pipe_read+0x323/0x331
[  107.589791]  [<c02a0386>] ? handle_pte_fault+0x274/0x5e3
[  107.590983]  [<c02a07ac>] ? handle_mm_fault+0xb7/0xd5
[  107.592346]  [<c0878161>] ? __do_page_fault+0x400/0x43b
[  107.593833]  [<c02c18d4>] ? poll_select_set_timeout+0x44/0x64
[  107.595065]  [<c02c2614>] ? SyS_poll+0x3d/0x85
[  107.596278]  [<c087819c>] ? __do_page_fault+0x43b/0x43b
[  107.597508]  [<c0875e1e>] ? error_code+0x5a/0x60
[  107.599158]  [<c087819c>] ? __do_page_fault+0x43b/0x43b


[  107.719514] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  107.722936] [ 2489]     0  2489     1142      279       8        0             0 systemd-journal
[  107.725752] [ 2494]     0  2494      902      348       4        0         -1000 udevd
[  107.728702] [ 3445]     0  3445      901      258       4        0         -1000 udevd
[  107.731462] [ 3509]     0  3509      560      558       5        0         -1000 watchdog
[  107.735411] [ 3512]     0  3512     1016      261       4        0             0 smartd
[  107.738669] [ 3514]     0  3514      606       65       6        0             0 gpm
[  107.741852] [ 3515]     0  3515     2749      176       3        0         -1000 auditd
[  107.745762] [ 3517]     0  3517     7914      356      12        0             0 NetworkManager
[  107.748613] [ 3523]     0  3523     1454      201       7        0             0 abrtd
[  107.752326] [ 3525]     0  3525     2879      171       4        0             0 audispd
[  107.755672] [ 3526]     0  3526      961      104       6        0             0 irqbalance
[  107.758716] [ 3527]     0  3527      746      150       6        0             0 sedispatch
[  107.761549] [ 3531]     0  3531     1439      162       7        0             0 abrt-watch-log
[  107.764693] [ 3536]     0  3536      550      124       5        0             0 acpid
[  107.768113] [ 3541]     0  3541      854      214       8        0             0 systemd-logind
[  107.770899] [ 3542]    70  3542      829      191       9        0             0 avahi-daemon
[  107.775436] [ 3547]    70  3547      829       41       9        0             0 avahi-daemon
[  107.779583] [ 3549]     0  3549      643      176       5        0             0 mcelog
[  107.783916] [ 3554]     0  3554     1389      301       6        0             0 crond
[  107.788580] [ 3558]    81  3558      826      261       7        0          -900 dbus-daemon
[  107.793588] [ 3562]     0  3562     7805      261       7        0             0 rsyslogd
[  107.798416] [ 3567]     0  3567     1333      112       7        0             0 ksmtuned
[  107.802911] [ 3588]     0  3588     6188      241      12        0          -900 polkitd
[  107.807768] [ 3592]     0  3592     1392      287       7        0          -900 modem-manager
[  107.811827] [ 3621]     0  3621     3716      883       6        0             0 dhclient
[  107.816322] [ 3639]     0  3639     2465      314       5        0         -1000 sshd
[  107.820565] [ 3646]     0  3646      683      169       3        0             0 rpcbind
[  107.823938] [ 3661]    29  3661      751      260       3        0             0 rpc.statd
[  107.826943] [ 3676]     0  3676     3314      364       6        0             0 sendmail
[  107.829945] [ 3699]    51  3699     3184      304       6        0             0 sendmail
[  107.832821] [ 3731]     0  3731      681      152       3        0             0 atd
[  107.835430] [ 3732]     0  3732     1073      166       5        0             0 agetty
[  107.837801] [ 3799]     0  3799     1057      100       5        0             0 sleep
[  107.840774] [ 3803]     0  3803     3348      424       6        0             0 sshd
[  107.843105] [ 3811]     0  3811      901      215       4        0         -1000 udevd
[  107.845449] [ 3816]     0  3816     1630      544       6        0             0 bash
[  107.847652] [ 3980]     0  3980     2177      185      13        0             0 gem_evict_every
[  107.849804] [ 3981]     0  3981     2177       86      13        0             0 gem_evict_every
[  107.851971] [ 3982]     0  3982     2177       86      13        0             0 gem_evict_every
[  107.853962] [ 3983]     0  3983     2177      142      13        0             0 gem_evict_every
[  107.856008] [ 3984]     0  3984     2177       88      13        0             0 gem_evict_every
[  107.857975] [ 3985]     0  3985     2177       86      13        0             0 gem_evict_every
[  107.859952] [ 3986]     0  3986     2177      144      13        0             0 gem_evict_every
[  107.861901] [ 3987]     0  3987     2177       86      13        0             0 gem_evict_every
[  107.863679] [ 3988]     0  3988     2177       86      13        0             0 gem_evict_every
[  107.865651] [ 3989]     0  3989     2177       86      13        0             0 gem_evict_every
[  107.867548] [ 3990]     0  3990     2177       86      13        0             0 gem_evict_every
[  107.869382] [ 3991]     0  3991     2177      144      13        0             0 gem_evict_every
[  107.871157] [ 3992]     0  3992     2177       86      13        0             0 gem_evict_every
[  107.872898] [ 3993]     0  3993     2177       88      13        0             0 gem_evict_every
[  107.874425] [ 3994]     0  3994     2177       86      13        0             0 gem_evict_every
[  107.876093] [ 3995]     0  3995     2177       86      13        0             0 gem_evict_every
[  107.877817] [ 3996]     0  3996     2177       86      13        0             0 gem_evict_every
[  107.879956] [ 4009]     0  4009      560       78       5        1         -1000 watchdog
[  107.881805] Out of memory: Kill process 3699 (sendmail) score 0 or sacrifice child
[  107.883481] Killed process 3699 (sendmail) total-vm:12736kB, anon-rss:1024kB, file-rss:192kB
[  109.808431] gem_evict_every invoked oom-killer: gfp_mask=0xa00d2, order=0, oom_score_adj=0
[  109.809912] gem_evict_every cpuset=/ mems_allowed=0
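
As a quick sanity check on the table above: the total_vm and rss columns are counted in pages, not kB. Assuming 4 kB pages (the usual x86 page size; the log itself does not state it), the sendmail row for pid 3699 (total_vm 3184, rss 304) lines up with the kB figures in the "Killed process" line. The helper name below is made up for illustration.

```c
/* Assumption: 4 kB pages, the usual x86 default; the log does not say. */
#define PAGE_SIZE_KB 4UL

/* Hypothetical helper: convert a page count from the OOM table to kB. */
unsigned long pages_to_kb(unsigned long pages)
{
	return pages * PAGE_SIZE_KB;
}
```

Under that assumption, pages_to_kb(3184) gives 12736, matching total-vm:12736kB, and pages_to_kb(304) gives 1216, which is the sum of anon-rss:1024kB and file-rss:192kB.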

Reproduce steps:
----------------------------
1. ./gem_evict_everything --run-subtest forked-swapping-multifd-mempressure-normal

Comment 1 Daniel Vetter 2013-09-12 08:40:10 UTC

Roughly 750MB of free swap ;-)

In other words, our shrinker failed to clear out a sufficient number of objects. Known issue, and will take a long time to fix.

Comment 2 Chris Wilson 2013-09-12 08:55:56 UTC

To be fair, our shrinker probably did exactly what it was asked to do...

Comment 3 Daniel Vetter 2013-09-12 09:18:12 UTC

Hm, I have to admit I don't really have much clue how the shrinker interacts with the page swapout code. But it looks like all_unreclaimable might have misfired a bit ...

Comment 4 Daniel Vetter 2013-09-12 11:23:18 UTC

Created attachment 85711 [details] [review]
patch 1

Please test whether this patch helps to avoid the OOM.

Comment 5 Daniel Vetter 2013-09-12 11:24:05 UTC

Created attachment 85712 [details] [review]
patch 2

If OOMs still happen please also apply this patch on top of patch 1 (so both patches) for testing.

Comment 6 lu hua 2013-09-13 06:37:46 UTC

Ran with patch 1 and patch 2; the test times out.

Comment 7 lu hua 2013-09-13 06:39:03 UTC

Created attachment 85750 [details]
dmesg with patch1 patch2

Comment 8 Daniel Vetter 2013-09-13 09:31:35 UTC

(In reply to comment #6)
> Ran with patch 1 and patch 2; the test times out.

Hm, just tried to run the test and it says SUCCESS after a bit of time, but then seems to get stuck before exit. Do you see the same?

Also please test with just patch 1 to see whether that also prevents the OOM.

Comment 9 Daniel Vetter 2013-09-13 16:08:11 UTC

Ok, there was a small bug in igt which resulted in testcases getting stuck. Fixed with

commit a031a1bf93b828585e7147f06145fc5030814547
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Sep 13 16:43:22 2013 +0200

    lib/drmtest: ducttape over fork race

The gem_evict_everything subtest here now completes for me in roughly 45 s on my snb, so hopefully you're now unblocked to test my patches.

Please test first patch 1 alone (repeat the test a few times to make sure there's really no OOM any more). Then test patch 1+2 (again please repeat).

Also: Does the OOM always happen on your testbox when running this test or only occasionally?

Comment 10 lu hua 2013-09-16 08:52:39 UTC

(In reply to comment #9)
> Ok, there was a small bug in igt which resulted in testcases getting stuck.
> Fixed with
> 
> commit a031a1bf93b828585e7147f06145fc5030814547
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Fri Sep 13 16:43:22 2013 +0200
> 
>     lib/drmtest: ducttape over fork race
> 
> The gem_evict_everything subtest here now completes for me in roughly 45 s
> on my snb, so hopefully you're now unblocked to test my patches.
> 
> Please test first patch 1 alone (repeat the test a few times to make sure
> there's really no OOM any more). Then test patch 1+2 (again please repeat).
> 
> Also: Does the OOM always happen on your testbox when running this test or
> only occasionally?

Tested 3 cycles on the latest igt and the latest -nightly kernel; it works well.
With patch 1 alone or with patch 1+2, the OOM still happens.

Comment 11 Daniel Vetter 2013-09-16 08:53:29 UTC

Oops, patch 1 alone was actually broken. I'll attach a new one.

Comment 12 Daniel Vetter 2013-09-16 08:54:37 UTC

Created attachment 85896 [details] [review]
patch 1, fixed

Please retest with just this patch applied, thanks.

Comment 13 Eero Tamminen 2013-09-17 08:06:19 UTC

If one needs to analyze what actually happens with system resources, this is a pretty good tool for collecting the information, post-processing and visualizing it:
  https://maemo.gitorious.org/maemo-tools/sp-endurance

It's used by taking snapshots at suitable intervals, and processing those snapshots offline.

[1] A snapshot contains a huge amount of data from the system, so getting one takes quite a lot of time.  The interval between snapshots should be at least tens of seconds, preferably minutes; otherwise collecting the data can affect the test too much.

Comment 14 lu hua 2013-09-17 08:51:26 UTC

(In reply to comment #12)
> Created attachment 85896 [details] [review]
> patch 1, fixed
> 
> Please retest with just this patch applied, thanks.

Tested this patch; the OOM happens in 1 of 3 runs.

Comment 15 Daniel Vetter 2013-09-17 10:56:23 UTC

Created attachment 85966 [details] [review]
shrinker tuning

Ok, slight variation of the previous patches, please test again a few times to see how this version fares.

Comment 16 Chris Wilson 2013-09-17 12:33:40 UTC

Heh, I actually had a patch with similar intent to push the batch_size loop into the shrinkers.

Comment 17 Daniel Vetter 2013-09-17 12:43:48 UTC

(In reply to comment #16)
> Heh, I actually had a patch with similar intent to push the batch_size loop
> into the shrinkers.

I think the logic makes more sense outside of the actual shrinker so that we can take memory pressure (i.e. how many loops through the entire reclaim dance we've done so far) into account. We probably shouldn't use the minimal reclaim size if memory is still easy to get, but ramp it up aggressively if memory is getting really tight.
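
A rough sketch of the ramp-up idea described above, modelled in user-space C and not written against the actual mm code. The function name, the `passes` parameter (how many times reclaim has already looped), the doubling policy, and the value of the minimal size are all assumptions for illustration; only the notion of starting from a minimal reclaim size and growing it under pressure comes from this discussion.

```c
/* Hypothetical minimal reclaim size, standing in for the shrinker batch. */
#define MIN_RECLAIM 128UL

/*
 * Illustrative policy only: use the minimal size while memory is still
 * easy to get (passes == 0), then double the target for every further
 * pass through reclaim, capped at what the cache can actually free.
 */
unsigned long reclaim_target(unsigned int passes, unsigned long freeable)
{
	unsigned long target = MIN_RECLAIM;

	for (; passes > 0; passes--) {
		if (target > freeable / 2)
			return freeable;	/* doubling would reach the cap */
		target *= 2;
	}
	return target < freeable ? target : freeable;
}
```

With 10000 freeable objects, the first pass asks for 128 and the fourth for 1024; once doubling would overshoot, the whole freeable count is requested.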

As soon as we have testing results on this patch (I've asked QA to also test this a bit on their OOM-prone byt platform) I'll send an RFC to mm.

btw for testing the patch: Please test on all affected platforms so that we really know it's robust.

Comment 18 lu hua 2013-09-18 08:29:44 UTC

(In reply to comment #15)
> Created attachment 85966 [details] [review]
> shrinker tuning
> 
> Ok, slight variation of the previous patches, please test again a few times
> to see how this version fares.

Tested 3 cycles with this patch; it works well.

Comment 19 lu hua 2013-09-25 07:16:06 UTC

It also happens on Baytrail.

Comment 20 Daniel Vetter 2013-09-25 10:25:35 UTC

Ok, I've rebased my kernel trees to be based on 3.12-rc2. There have been some shrinker changes in upstream, so we need to retest everything.

I'll work on an updated patch; meanwhile, can you please check on all the affected platforms (please list them) whether the bug is still there or whether anything changed?

Comment 21 Daniel Vetter 2013-09-25 12:21:58 UTC

Created attachment 86556 [details] [review]
Updated patch for 3.12-rc2 kernels

Under the assumption that the bug is still there please test all affected platforms with this updated patch.

Comment 22 lu hua 2013-09-27 05:56:28 UTC

(In reply to comment #21)
> Created attachment 86556 [details] [review]
> Updated patch for 3.12-rc2 kernels
> 
> Under the assumption that the bug is still there please test all affected
> platforms with this updated patch.

Run 5 cycles with this patch, it works well.

Comment 23 Daniel Vetter 2013-09-27 07:28:59 UTC

(In reply to comment #20)
> Ok, I've rebased my kernel trees to be based on 3.12-rc2. There have been
> some shrinker changes in upstream, so we need to retest everything.
> 
> I'll work on an updated patch, meanwhile can you please check on all the
> affected platforms (please list them) that the bug is still there or whether
> anything changed?

And what's with plain -nightly based on -rc2?

Comment 24 lu hua 2013-09-29 05:31:17 UTC

(In reply to comment #22)
> (In reply to comment #21)
> > Created attachment 86556 [details] [review]
> > Updated patch for 3.12-rc2 kernels
> > 
> > Under the assumption that the bug is still there please test all affected
> > platforms with this updated patch.
> 
> Run 5 cycles with this patch, it works well.

Run on kernel-3.12.0rc2(commit 8153de8b327e89bad0e36f82b098e37a6e9ef5bb) with this patch.

1. Run 5 cycles on sandybridge with this patch, it works well.

2. It still causes OOM killer on ILK.

Comment 25 lu hua 2013-09-29 05:31:48 UTC

Created attachment 86782 [details]
ILK dmesg

Comment 26 Daniel Vetter 2013-09-29 12:44:48 UTC

(In reply to comment #24)
> Run on kernel-3.12.0rc2(commit 8153de8b327e89bad0e36f82b098e37a6e9ef5bb)
> with this patch.
> 
> 1. Run 5 cycles on sandybridge with this patch, it works well.
> 
> 2. It still causes OOM killer on ILK.

Is the OOM killer with the patch?

Also I've asked you to retest _without_ the patch applied, on a -rc2 based -nightly. This is to check whether the patch is still effective or whether something else changed. There have been many core mm changes which are relevant.

Comment 27 lu hua 2013-09-30 07:13:17 UTC

Test on ILK with the patch, It still happens.
Test on ILK with the latest -nightly kernel(commit a411305bdabef2) 3.12.0-rc2 without any patch, It also happens.

Comment 28 Daniel Vetter 2013-09-30 07:35:05 UTC

(In reply to comment #27)
> Test on ILK with the patch, It still happens.
> Test on ILK with the latest -nightly kernel(commit a411305bdabef2)
> 3.12.0-rc2 without any patch, It also happens.

And what about snb/byt? Again I'm interested in how well it works both with the patch and without.

Comment 29 Gordon Jin 2013-10-11 05:12:16 UTC

Xun, please follow up on behalf of Hua during his vacation.

Comment 30 Guo Jinxian 2013-10-14 02:44:37 UTC

Test with the latest -nightly kernel(commit ae5be842311c9108c6dbbbe0e2abc1c306016f12) 3.12.0-rc4 on both snb and byt, here is the result below:
snb without patch: works well
snb with patch: works well
byt without patch: It still causes OOM killer
byt with patch: It still causes OOM killer

Comment 31 Guang Yang 2013-10-17 07:36:05 UTC

Daniel, any update? Do you need more info?

Comment 32 Daniel Vetter 2013-10-17 08:43:34 UTC

(In reply to comment #31)
> Daniel, any update? Do you need more info?

It looks like we're back to square one since the patch doesn't seem to actually work.

Can you please update the summary with the affected platforms? SNB seems to work now according to comment #30

Comment 33 lu hua 2013-10-18 05:33:17 UTC

(In reply to comment #32)
> (In reply to comment #31)
> > Daniel, any update? Do you need more info?
> 
> It looks like we're back to square one since the patch doesn't seem to
> actually work.
> 
> Can you please update the summary with the affected platforms? SNB seems to
> work now according to comment #30

It still happens randomly on SNB. Tested on the latest -nightly kernel (commit db86e5); it happens in 1 of 3 runs.

Comment 34 lu hua 2013-11-28 08:59:45 UTC

It also happens on gm45.

Comment 35 Guang Yang 2013-11-28 11:19:05 UTC

Updated the status: it also happens on BDW. Raising it, since this hard hang has been blocking us for a long time.

Comment 36 Chris Wilson 2013-11-29 22:31:52 UTC

Created attachment 90006 [details] [review]
Include active objects in the shrinker count

Try this with your fingers crossed.

Comment 37 lu hua 2013-12-02 06:01:16 UTC

(In reply to comment #36)
> Created attachment 90006 [details] [review]
> Include active objects in the shrinker count
> 
> Try this with your fingers crossed.

Tested this patch; the issue still exists.

Comment 38 Chris Wilson 2013-12-02 23:04:35 UTC

Created attachment 90124 [details] [review]
Tune the shrinker

Try this in conjunction with the previous patch (https://bugs.freedesktop.org/attachment.cgi?id=90006). (Similar to Daniel's suggestion)

Comment 39 lu hua 2013-12-03 06:37:39 UTC

Many gem_concurrent_blit subcases also cause OOM killer.

Comment 40 lu hua 2013-12-03 07:09:14 UTC

(In reply to comment #38)
> Created attachment 90124 [details] [review]
> Tune the shrinker
> 
> Try this in conjunction with the previous patch
> (https://bugs.freedesktop.org/attachment.cgi?id=90006). (Similar to Daniel's
> suggestion)


Tested these 2 patches; it still triggers the OOM killer.

Comment 41 Daniel Vetter 2013-12-03 15:38:35 UTC

(In reply to comment #39)
> Many gem_concurrent_blit subcases also cause OOM killer.

This is likely a different issue, now tracked in bug #72255

Can you please double-check that Chris' patches don't help with the issue at hand here, namely gem_evict_everything subtests going nuts?

Comment 42 lu hua 2013-12-04 07:44:14 UTC

(In reply to comment #41)
> (In reply to comment #39)
> > Many gem_concurrent_blit subcases also cause OOM killer.
> 
> This is likely a different issue, now tracked in bug #72255
> 
> Can you please double-check that Chris' patches don't help with the issue at
> hand here, namely gem_evict_everything subtests going nuts?


Run ./gem_evict_everything --run-subtest forked-swapping-multifd-mempressure-normal with these 2 patches.
It works well on BDW.
It still triggers the OOM killer on BYT.

Comment 43 Gordon Jin 2014-01-25 00:04:59 UTC

Can we move on?
Many tests have to be disabled in nightly due to this bug, impacting the execution rate of BYT/BDW.

Comment 44 Chris Wilson 2014-01-27 12:49:22 UTC

(In reply to comment #43)
> Can we move on?
> Many tests have to be disabled in nightly due to this bug, impacting the
> execution rate of BYT/BDW.

Which? Do you mean other than the mempressure and swapping tests? Does failure in these impact upon other tests?

Comment 45 lu hua 2014-01-29 02:50:06 UTC

(In reply to comment #44)
> (In reply to comment #43)
> > Can we move on?
> > Many tests have to be disabled in nightly due to this bug, impacting the
> > execution rate of BYT/BDW.
> 
> Which? Do you mean other than the mempressure and swapping tests? Does
> failure in these impact upon other tests?

gem_evict_everything fails with the OOM killer and the system becomes unresponsive, so we disabled the gem_evict_everything subcases. That impacts the execution rate.

Comment 46 Jani Nikula 2014-02-25 07:52:53 UTC

Please be sure to test the patches posted in the related bug 72742 for this one too. Thanks.

Comment 47 lu hua 2014-02-26 08:34:32 UTC

(In reply to comment #46)
> Please be sure to test the patches posted in the related bug 72742 for this
> one too. Thanks.

Tested this patch; it still occurs.

Comment 48 Chris Wilson 2014-02-26 09:01:43 UTC

Please do update the dmesg after testing the patches.

Comment 49 lu hua 2014-03-04 06:57:20 UTC

Retested the patches on BYT with the latest igt. The subtest forked-swapping-multifd-interruptible ran for more than 30 minutes without exiting. The OOM killer does not fire.
output:
# ./gem_evict_everything
IGT-Version: 1.5-g072d358 (x86_64) (Linux: 3.14.0-rc4_prts_aa1fe3_20140304 x86_6
Subtest forked-normal: SUCCESS
Subtest forked-interruptible: SUCCESS
Subtest forked-swapping-normal: SUCCESS
Subtest forked-swapping-interruptible: SUCCESS
Subtest forked-multifd-normal: SUCCESS
Subtest forked-multifd-interruptible: SUCCESS
Subtest forked-swapping-multifd-normal: SUCCESS
Subtest forked-swapping-multifd-interruptible: SUCCESS
Subtest forked-mempressure-normal: SUCCESS
Subtest forked-mempressure-interruptible: SUCCESS
Subtest forked-swapping-mempressure-normal: SUCCESS

Tested on Ironlake with the latest -nightly kernel and igt; the OOM killer does not fire.
output:
IGT-Version: 1.5-g072d358 (x86_64) (Linux: 3.14.0-rc5_drm-intel-nightly_2bbdb4_20140304+ x86_64)
Subtest forked-normal: SUCCESS
Subtest forked-interruptible: SUCCESS
Subtest forked-swapping-normal: SUCCESS
Subtest forked-swapping-interruptible: SUCCESS
Subtest forked-multifd-normal: SUCCESS
Subtest forked-multifd-interruptible: SUCCESS
Subtest forked-swapping-multifd-normal: SUCCESS
Subtest forked-swapping-multifd-interruptible: SUCCESS
Subtest forked-mempressure-normal: SUCCESS
Subtest forked-mempressure-interruptible: SUCCESS
Subtest forked-swapping-mempressure-normal: SUCCESS
Subtest forked-swapping-mempressure-interruptible: SUCCESS
Subtest forked-multifd-mempressure-normal: SUCCESS
Subtest forked-multifd-mempressure-interruptible: SUCCESS
Subtest forked-swapping-multifd-mempressure-normal: SUCCESS
Subtest forked-swapping-multifd-mempressure-interruptible: SUCCESS
Subtest swapping-normal: SUCCESS
Subtest minor-normal: SUCCESS
Test requirement not met in function major_evictions, file eviction_common.c:109:
Last errno: 28, No space left on device
Test requirement: (!((uint64_t)nr_surfaces * surface_size / (1024 * 1024) < intel_get_total_ram_mb() * 9 / 10))
Subtest major-normal: SKIP
Subtest swapping-interruptible: SUCCESS
Subtest minor-interruptible: SUCCESS
Test requirement not met in function major_evictions, file eviction_common.c:109:
Last errno: 28, No space left on device
Test requirement: (!((uint64_t)nr_surfaces * surface_size / (1024 * 1024) < intel_get_total_ram_mb() * 9 / 10))
Subtest major-interruptible: SKIP

Comment 50 lu hua 2014-03-04 06:57:47 UTC

Created attachment 95070 [details]
dmesg(BYT)

Comment 51 Daniel Vetter 2014-03-04 18:43:34 UTC

Can you pls do an overnight run on the byt to see whether the oom killer is really gone now on restricted memory platforms like it?

It would be good to log the output of vmstat 10 or something to make sure the kernel keeps on thrashing the swap. If the columns si and so under the --swap-- heading are zero for a long time the test is stuck.
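
One way to act on the vmstat suggestion above is to scan the logged si/so values for a long run of zeros. The sketch below assumes the two columns have already been extracted into arrays (parsing vmstat's text output is left out); the function name is hypothetical, and so is whatever threshold a caller picks for "a long time".

```c
#include <stddef.h>

/*
 * Return the longest run of consecutive samples in which both swap-in (si)
 * and swap-out (so) were zero. With `vmstat 10`, each sample covers 10 s.
 */
size_t longest_idle_swap_run(const unsigned long *si,
			     const unsigned long *so, size_t n)
{
	size_t longest = 0, current = 0;

	for (size_t i = 0; i < n; i++) {
		if (si[i] == 0 && so[i] == 0) {
			if (++current > longest)
				longest = current;
		} else {
			current = 0;	/* swap traffic seen, restart the run */
		}
	}
	return longest;
}
```

For example, with vmstat 10 a result of 30 corresponds to five minutes without any swap traffic, which would be a reasonable hint that the test is stuck and no longer thrashing.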

I know that we don't really care about tests which take positively forever, but it sounds like we're finally getting somewhere with Chris' patches ...

Comment 52 lu hua 2014-03-05 06:14:48 UTC

(In reply to comment #51)
> Can you pls do an overnight run on the byt to see whether the oom killer is
> really gone now on restricted memory platforms like it?
> 
> It would be good to log the output of vmstat 10 or something to make sure
> the kernel keeps on thrashing the swap. If the columns si and so under the
> --swap-- heading are zero for a long time the test is stuck.
> 
> I know that we don't really care about tests which take positively forever,
> but it sounds like we're finally getting somewhere with Chris' patches ...

Run on Baytrail with latest -nightly kernel, some subcases still fail with OOM killer.
I will try Chris' patches.

Comment 53 lu hua 2014-03-07 07:48:16 UTC

Created attachment 95294 [details]
dmesg(byt)

Tested the patches on BYT for 5 cycles; the OOM killer does not occur.
output:
IGT-Version: 1.5-g072d358 (x86_64) (Linux: 3.14.0-rc4_prts_aa1fe3_20140304 x86_64)
Subtest forked-normal: SUCCESS
Subtest forked-interruptible: SUCCESS
Subtest forked-swapping-normal: SUCCESS
Subtest forked-swapping-interruptible: SUCCESS
Subtest forked-multifd-normal: SUCCESS
Subtest forked-multifd-interruptible: SUCCESS
Subtest forked-swapping-multifd-normal: SUCCESS
Subtest forked-swapping-multifd-interruptible: SUCCESS
Subtest forked-mempressure-normal: SUCCESS
Subtest forked-mempressure-interruptible: SUCCESS
Subtest forked-swapping-mempressure-normal: SUCCESS
Subtest forked-swapping-mempressure-interruptible: SUCCESS
Subtest forked-multifd-mempressure-normal: SUCCESS
Subtest forked-multifd-mempressure-interruptible: SUCCESS
Subtest forked-swapping-multifd-mempressure-normal: SUCCESS
Subtest forked-swapping-multifd-mempressure-interruptible: SUCCESS
Subtest swapping-normal: SUCCESS
Test assertion failure function copy, file gem_evict_everything.c:124:
Last errno: 2, No such file or directory
Failed assertion: ret == error
Subtest minor-normal: FAIL
Test requirement not met in function major_evictions, file eviction_common.c:109:
Last errno: 2, No such file or directory
Test requirement: (!((uint64_t)nr_surfaces * surface_size / (1024 * 1024) < intel_get_total_ram_mb() * 9 / 10))
Subtest major-normal: SKIP
Subtest swapping-interruptible: SUCCESS
Test assertion failure function copy, file gem_evict_everything.c:124:
Last errno: 2, No such file or directory
Failed assertion: ret == error
Subtest minor-interruptible: FAIL
Test requirement not met in function major_evictions, file eviction_common.c:109:
Last errno: 2, No such file or directory
Test requirement: (!((uint64_t)nr_surfaces * surface_size / (1024 * 1024) < intel_get_total_ram_mb() * 9 / 10))
Subtest major-interruptible: SKIP

Comment 54 Daniel Vetter 2014-03-07 09:02:37 UTC

Hm, that smells more like a bug in the testcase where we supply an invalid bo reference. At least we're making good progress on the OOM issue!

Comment 55 Chris Wilson 2014-03-07 14:16:02 UTC

The current patch series under consideration is:

http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742&id=294c593fd65b6de37006da9eceb6860f3b9d6f26

http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742&id=224f66e5cce9575fb5433dd7ec287e3b84d2ecbd

http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742&id=8d7e8626fb7132ac029b8174a6962e4aa82a950f

http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug72742&id=ab3095d304159ec5312bf79e68ea662ff4f1767e

Comment 56 lu hua 2014-03-10 06:11:35 UTC

(In reply to comment #55)
> The current patch series under consideration is:

> http://cgit.freedesktop.org/~ickle/linux-2.6/commit/
> ?h=bug72742&id=8d7e8626fb7132ac029b8174a6962e4aa82a950f
> 


This patch fail.

Comment 57 Daniel Vetter 2014-03-10 09:23:04 UTC

On Mon, Mar 10, 2014 at 7:11 AM,  <bugzilla-daemon@freedesktop.org> wrote:
> This patch fail.


Please clarify: Does the testcase fail, or do you see the OOM killer in action?

This bug here is _only_ about the OOM killer firing when it shouldn't,
we need to track the testcase failure itself in a new bug once the oom
issue is resolved.

Comment 58 lu hua 2014-03-11 05:46:37 UTC

patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 FAILED at 4920.
Hunk #2 succeeded at 4911 with fuzz 1 (offset -28 lines).
Hunk #3 succeeded at 5007 with fuzz 1 (offset -24 lines).
1 out of 3 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej

drivers/gpu/drm/i915/i915_gem.c:
static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
{
        if (!mutex_is_locked(mutex))
                return false;

#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_MUTEXES)
        return mutex->owner == task;
#else
        /* Since UP may be pre-empted, we cannot assume that we own the lock */
        return false;
#endif
}

static unsigned long
i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
{


patch: 
@@ -4920,6 +4920,22 @@ static bool mutex_is_locked_by(struct mutex *mutex, struct task_struct *task)
 #endif
 }
 
+static bool i915_gem_shrinker_lock(struct drm_device *dev, bool *unlock)
+{
+	if (!mutex_trylock(&dev->struct_mutex)) {
+		if (!mutex_is_locked_by(&dev->struct_mutex, current))
+			return false;
+
+		if (to_i915(dev)->mm.shrinker_no_lock_stealing)
+			return false;
+
+		*unlock = false;
+	} else
+		*unlock = true;
+
+	return true;
+}
+
 static int num_vma_bound(struct drm_i915_gem_object *obj)
 {
 	struct i915_vma *vma;

Comment 59 Gordon Jin 2014-04-01 00:17:44 UTC

Chris, could you help Hua resolve the patching issue? I really want this bug to move on.

Comment 60 Chris Wilson 2014-04-01 06:57:49 UTC

The branch is at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742

Comment 61 lu hua 2014-04-11 08:20:42 UTC

(In reply to comment #60)
> The branch is at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug72742

Ran 3 cycles with this branch; the OOM killer goes away.

Comment 62 Gordon Jin 2014-04-22 00:43:33 UTC

Chris, would you upstream the fix?

Comment 63 Guang Yang 2014-05-17 01:20:16 UTC

Ping again, Chris & Daniel: when will the fix land upstream?

Comment 64 Chris Wilson 2014-05-19 06:47:31 UTC

The patches had r-b tags, just waiting upon Daniel.

Comment 65 Daniel Vetter 2014-05-19 08:40:17 UTC

I wanted a 2nd review but apparently that one's slow to come about. Poked relevant people+managers ...

Comment 66 Daniel Vetter 2014-05-19 09:06:07 UTC

Sounds like a dupe of the filp leak ... Please retest and reopen if it still
happens.

Comment 67 Daniel Vetter 2014-05-19 09:08:11 UTC

Nevermind, got lost.

Comment 68 Chris Wilson 2014-05-20 08:57:58 UTC

commit ceabbba524fb43989875f66a6c06d7ce0410fe5c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 25 13:23:04 2014 +0000

    drm/i915: Include bound and active pages in the count of shrinkable objects
    
    When the machine is under a lot of memory pressure and being stressed by
    multiple GPU threads, we quite often report fewer than shrinker->batch
    (i.e. SHRINK_BATCH) pages to be freed. This causes the shrink_control to
    skip calling into i915.ko to release pages, despite the GPU holding onto
    most of the physical pages in its active lists.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=72742
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Robert Beckett <robert.beckett@intel.com>
    Reviewed-by: Rafael Barbalho <rafael.barbalho@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
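
To make the failure mode in this commit message concrete, here is a small user-space model; it is not the actual i915 or mm code. It assumes the simplified rule the message describes, where a shrinker reporting fewer pages than its batch size is not called at all. The struct, its field names, the function names, and the batch value are illustrative assumptions.

```c
#include <stdbool.h>

/* Illustrative batch size; stands in for shrinker->batch (SHRINK_BATCH). */
#define MODEL_BATCH 128UL

/* Hypothetical page accounting for a GPU driver. */
struct gpu_pages {
	unsigned long idle;	/* pages the GPU no longer uses */
	unsigned long busy;	/* pages bound into the GPU or on its active lists */
};

/* A count that reports only idle pages (assumed pre-fix behaviour). */
unsigned long count_idle_only(const struct gpu_pages *p)
{
	return p->idle;
}

/* A count in the spirit of the commit: bound and active pages included. */
unsigned long count_including_busy(const struct gpu_pages *p)
{
	return p->idle + p->busy;
}

/* Simplified model of the skip: below the batch size, nothing is called. */
bool shrinker_is_called(unsigned long reported)
{
	return reported >= MODEL_BATCH;
}
```

In this model, a driver holding 100000 busy pages and 16 idle ones reports 16 with the first count and is never asked to shrink, while the second count reports 100016 and gets called.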

Comment 69 lu hua 2014-05-21 03:35:46 UTC

Created attachment 99464 [details]
dmesg(BDW)

Tested on commit ceabbb; it still fails with the OOM killer.
output:
IGT-Version: 1.6-gd71add5 (x86_64) (Linux: 3.14.0_kcloud_ceabbb_20140521+ x86_64)

Comment 70 Chris Wilson 2014-05-21 05:43:08 UTC

Try again with the right kernel.

Comment 71 lu hua 2014-05-23 01:31:45 UTC

Created attachment 99601 [details]
kernel config

I used this config to build the latest drm-intel-nightly commit.

Comment 72 lu hua 2014-05-23 01:40:26 UTC

(In reply to comment #70)
> Try again with the right kernel.

I attached the kernel config. Is it incorrect?

Comment 73 Chris Wilson 2014-05-23 06:07:06 UTC

It was that your dmesg did not have the warning that was added to -nightly in relation to oom. Run and reattach the dmesg.

Comment 74 lu hua 2014-05-26 08:54:42 UTC

Ran it for 5 cycles on the latest -nightly; it works well. I will double-check, and if it is fixed, I will close the bug.

Comment 75 lu hua 2014-05-30 02:12:34 UTC

Fixed on latest -nightly kernel.

Comment 76 lu hua 2014-05-30 02:12:46 UTC

Verified. Fixed.

Comment 77 Jari Tahvanainen 2017-07-03 13:55:04 UTC

closing old verified+fixed.
