Bug 104327

Summary: [IGT] gem_shrink subtest mmap-gtt oom
Product: DRI Reporter: Hector Velazquez <hector.franciscox.velazquez.suriano>
Component: DRM/Intel Assignee: Francesco Balestrieri <francesco.balestrieri>
Status: CLOSED INVALID QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high CC: intel-gfx-bugs
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard: ReadyForDev
i915 platform: CFL i915 features: GEM/Other
Bug Depends on: 101857    
Bug Blocks:    
Attachments:
Description Flags
otuput none
dmesg -w none
dmesg_shrink none

Description Hector Velazquez 2017-12-18 20:37:21 UTC
Created attachment 136258 [details]
otuput
Comment 1 Hector Velazquez 2017-12-18 20:37:28 UTC
This test is failing on CFL QA:

igt@gem_shrink@mmap-gtt

====================================================
output 
====================================================
IGT-Version: 1.20-gc0be331 (x86_64) (Linux: 4.15.0-rc4-drm-intel-qa-ww51-commit-bf5cdf9+ x86_64)
(gem_shrink:2116) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_shrink:2116) intel-chipset-DEBUG: Test requirement passed: pci_dev
Using 125 processes and 128MiB per process
(gem_shrink:2116) intel-os-DEBUG: Checking 125 surfaces of size 134217728 bytes (total 16777281536) against RAM + swap
(gem_shrink:2116) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_shrink:2116) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_shrink:2116) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_shrink:2116) intel-os-DEBUG: Test requirement passed: __intel_check_memory(count, size, mode, &required, &total)
(gem_shrink:2116) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_shrink:2116) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_shrink:2116) drmtest-DEBUG: Test requirement passed: is_i915_device(fd) && has_known_intel_chipset(fd)
(gem_shrink:2116) ioctl-wrappers-DEBUG: Test requirement passed: err == 0
(gem_shrink:2116) DEBUG: Test requirement passed: nengine
(gem_shrink:2116) igt-core-DEBUG: Starting subtest: mmap-gtt
Subtest mmap-gtt failed.
No log.
child 70 died with signal 9, Killed
Subtest mmap-gtt: FAIL (142.110s)
(gem_shrink:2116) igt-core-DEBUG: Exiting with status code 137
(gem_shrink:2116) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
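
As a sanity check on the numbers in the log above, here is a minimal sketch that recomputes the requested allocation and decodes the exit status. It is not part of IGT, and the helper names are made up for illustration. It assumes the common shell convention that an exit status above 128 means the process was terminated by signal number (status - 128).

```python
import signal

MIB = 1024 * 1024


def requested_bytes(processes, mib_per_process):
    """Total size requested when each process maps one surface."""
    return processes * mib_per_process * MIB


def signal_from_exit_status(status):
    """Return the signal name for a 128+N exit status, else None.

    Assumption: the 128+N convention applies to this status.
    """
    if status <= 128:
        return None
    return signal.Signals(status - 128).name


if __name__ == "__main__":
    total = requested_bytes(125, 128)
    print(total, "bytes =", total / 1024**3, "GiB")
    print("exit status 137 ->", signal_from_exit_status(137))
```

The product is 16777216000 bytes (15.625 GiB), which is 65536 bytes less than the total printed in the log and roughly the size of the RAM listed in the configuration below. Status 137 decodes to SIGKILL, matching "child 70 died with signal 9".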

This is my configuration:
======================================
	Graphic stack
======================================
Component: drm
    tag: libdrm-2.4.88-42-g831036a
    commit: 831036a6f62005da9fb4a75fe043bd96ce672d27

Component: cairo
    tag: 1.15.8-73-g903b0de
    commit: 903b0de539844c144c63ea57c30e84a23360c290

Component: intel-gpu-tools
    tag: intel-gpu-tools-1.20-232-gc0be331
    commit: c0be3310715e2f744b892c51f09e62273bcc8e57

Component: piglit
    tag: piglit-v1
    commit: 64775cc0f59820c4d733e480a66f8c31f5b78d1b

======================================
	     Software
======================================
kernel version              : 4.15.0-rc4-drm-intel-qa-ww51-commit-bf5cdf9+
hostname                    : CFL-1
architecture                : x86_64
os version                  : Ubuntu 16.10
os codename                 : yakkety
kernel driver               : i915
bios revision               : 104.3
bios release date           : 09/14/2017
ksc                         : 1.5
hardware acceleration       : disabled
swap partition              : enabled on (/dev/nvme0n1p3)

======================================
	Graphic drivers
======================================
libdrm                      : 2.4.89
cairo                       : 1.15.11
intel-gpu-tools (tag)       : intel-gpu-tools-1.20-232-gc0be331
intel-gpu-tools (commit)    : c0be331

======================================
	     Hardware
======================================
motherboard model          : CoffeeLakeClientPlatform
motherboard id             : CoffeeLakeSUDIMMRVP
form factor                : Desktop
manufacturer               : IntelCorporation
cpu family                 : Other
cpu family id              : 6
cpu information            : Genuine Intel(R) CPU 0000 @ 3.60GHz
gpu card                   : Intel Corporation Device 3e92 (prog-if 00 [VGA controller])
memory ram                 : 15.58 GB
max memory ram             : 32 GB
cpu thread                 : 12
cpu core                   : 6
cpu model                  : 158
cpu stepping               : 10
socket                     : Other
current cd clock frequency : 337500 kHz
maximum cd clock frequency : 675000 kHz
displays connected         : eDP-1 DP-1

======================================
	     Firmware
======================================
dmc fw loaded             : yes
dmc version               : 1.4
guc fw loaded             : fetch SUCCESS, load SUCCESS
guc version wanted        : wanted 9.39, found 9.39
guc version found         : wanted 9.39, found 9.39

======================================
	     kernel parameters
======================================
quiet drm.debug=0x1e i915.enable_guc=-1 i915.alpha_support=1 auto panic=1 nmi_watchdog=panic intel_iommu=igfx_off resume=/dev/nvme0n1p3 fastboot
Comment 2 Hector Velazquez 2017-12-18 20:37:58 UTC
Created attachment 136259 [details]
dmesg -w
Comment 3 Chris Wilson 2017-12-18 20:42:34 UTC
[ 3565.670199] [IGT] gem_shrink: starting subtest mmap-gtt
[ 3654.707839] Purging GPU memory, 0 pages freed, 3770498 pages still pinned.
[ 3654.707841] 31 and 0 pages still available in the bound and unbound GPU page lists.
[ 3
Comment 4 Chris Wilson 2017-12-18 21:49:41 UTC
Quite patently there was a log, and the SIGKILL is due to the oom. Please keep the summary a summary of the bug and not nonsense.
Comment 5 Elizabeth 2017-12-19 16:25:39 UTC
My bad. Got it.
Comment 6 Chris Wilson 2017-12-19 21:36:07 UTC
The problem here is that we end up with all 128 threads hitting the reclaim logic, each thread pinning the object it has faulted. Only one thread can make any progress through the oom-logic, but it can't make progress unless it is the one holding struct_mutex. Ergo it reports failure and the oom-killer proceeds without mercy.

There is certainly no quick fix for this.
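
The contention described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not i915 code: `struct_mutex` here is an ordinary Python lock standing in for the driver lock, and `shrink()` is a hypothetical reclaim path that gives up when it cannot take the lock without blocking.

```python
import threading

# Stand-in for the driver-wide lock (assumption: one global lock).
struct_mutex = threading.Lock()


def shrink(pinned_pages):
    """Hypothetical reclaim: frees pages only if it gets the lock."""
    if not struct_mutex.acquire(blocking=False):
        return 0  # lock held elsewhere: report no progress
    try:
        return pinned_pages  # pretend every page was released
    finally:
        struct_mutex.release()


def demo():
    """Run reclaim once while a faulting thread holds the lock, then once after."""
    holding = threading.Event()
    release = threading.Event()

    def faulter():
        # Models a thread stuck in its fault handler with the lock held.
        with struct_mutex:
            holding.set()
            release.wait()

    t = threading.Thread(target=faulter)
    t.start()
    holding.wait()
    contended = shrink(1024)    # lock is held by the faulter
    release.set()
    t.join()
    uncontended = shrink(1024)  # lock is free again
    return contended, uncontended


if __name__ == "__main__":
    print(demo())
```

In this model the reclaim attempt made while another thread holds the lock frees nothing, which corresponds to the "0 pages freed" lines in the dmesg excerpts. Once nothing is freed, the only remaining option is to kill a process.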
Comment 7 Hector Velazquez 2018-02-27 16:16:54 UTC
This test continues to fail on GLK QA.

Tests List:

igt@gem_shrink@mmap-gtt

IGT-Version: 1.21-ga2664f8 (x86_64) (Linux: 4.16.0-rc2-drm-tip-ww9-commit-3a86cab+ x86_64)
Comment 8 Jani Saarinen 2018-03-29 07:11:58 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), please do that as well to see whether the issue is still present there, and if you cannot see the issue there, please comment on the bug.
Comment 9 Elizabeth 2018-04-16 14:27:59 UTC
Created attachment 138866 [details]
dmesg_shrink

(In reply to Jani Saarinen from comment #8)
> ...
> still valid or not...

(In reply to Chris Wilson from comment #6)
> ...
> There is certainly no quick fix for this.

This test still takes an eternity to stop and the "killed process" messages keep appearing:

[  +0.000000] Out of memory: Kill process 1288 (gem_shrink) score 1000 or sacrifice child
[  +0.000007] Killed process 1288 (gem_shrink) total-vm:185620kB, anon-rss:396kB, file-rss:4kB, shmem-rss:0kB
[ +18.996521] systemd-journald[319]: /dev/kmsg buffer overrun, some messages lost.
[Apr13 18:18] Purging GPU memory, 0 pages freed, 1114112 pages still pinned.
Comment 10 Lakshmi 2018-09-13 14:46:54 UTC
This test is not valid anymore. Closing this bug as INVALID.
Closing now. Feel free to reopen if you still have the issue.