Bug 102936 - [IGT] gem_ctx_thrash single test assertion failure on function gem_set_domain
Summary: [IGT] gem_ctx_thrash single test assertion failure on function gem_set_domain
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 103805
Depends on:
Blocks:
 
Reported: 2017-09-21 21:45 UTC by Luis Botello
Modified: 2018-02-27 22:41 UTC
CC List: 2 users

See Also:
i915 platform: GLK
i915 features: GEM/Other


Attachments
dmesg (4.83 KB, text/plain), 2017-09-21 21:45 UTC, Luis Botello
IGT_output (2.92 KB, text/plain), 2017-09-21 21:46 UTC, Luis Botello
kernl_log_gem_ctx_thrash-single (338.21 KB, text/plain), 2017-11-07 20:45 UTC, Elizabeth
output_gem_ctx_thash-family (20.72 KB, text/plain), 2017-11-08 18:15 UTC, Elizabeth
dmesg_gem_ctx_trash-family (144.72 KB, text/plain), 2017-11-08 18:17 UTC, Elizabeth
dmesg_gem_ctx_trash-single_5hrs (231.97 KB, text/plain), 2017-11-30 16:24 UTC, Elizabeth

Description Luis Botello 2017-09-21 21:45:46 UTC
Created attachment 134419 [details]
dmesg

Configuration:
--------------
Component: drm
    tag: libdrm-2.4.81-55-g76418c2
    commit: 76418c244d4c52a8dd20809e3e8b4e70501fc76f

Component: cairo
    tag: 1.15.6-38-g1220e3c
    commit: 1220e3c6b8f94a00ac7afee15f21e6782655d97c

Component: intel-gpu-tools
    tag: intel-gpu-tools-1.19-312-gda197b5
    commit: da197b5f3cb516aaaea72d0d60b0f5c1c81081dd

Component: piglit
    tag: piglit-v1
    commit: 2753955998d7deb90f681cf4cb1253c4519dfd1d

commit 2afdfe9be8345f9499f3d00ba13c05f1f23344d1
Author:     Jani Nikula <jani.nikula@intel.com>
AuthorDate: Tue Sep 19 18:41:27 2017 +0300
Commit:     Jani Nikula <jani.nikula@intel.com>
CommitDate: Tue Sep 19 18:41:27 2017 +0300

    drm-tip: 2017y-09m-19d-15h-40m-56s UTC integration manifest


Steps:
------
1. Execute IGT tests:
# ./gem_ctx_thrash --r single

Actual results:
---------------
Test assertion failure in function gem_set_domain, Cannot allocate memory (ENOMEM)

Hardware configuration
----------------------
platform                   : Geminilake
motherboard model          : Geminilake
cpu information            : Genuine Intel(R) CPU @ 1.10GHz
gpu card                   : Intel Corporation Device 3185 (rev 03) (prog-if 00 [VGA controller])
memory ram                 : 7.64 GB
Comment 1 Luis Botello 2017-09-21 21:46:09 UTC
Created attachment 134420 [details]
IGT_output
Comment 2 Elizabeth 2017-11-07 20:45:13 UTC
Created attachment 135290 [details]
kernl_log_gem_ctx_thrash-single

On GLK this test is now killed by the OOM killer instead:

$ sudo -E ./gem_ctx_thrash --r single
IGT-Version: 1.20-g9fe5a9a (x86_64) (Linux: 4.14.0-rc8-drm-intel-qa-ww45-commit-b911f67+ x86_64)
Creating 60228 contexts (assuming of size 106496 with execlists)
Killed

[   63.764376] Out of memory: Kill process 1397 (gem_ctx_thrash) score 1000 or sacrifice child
[   63.764383] Killed process 1397 (gem_ctx_thrash) total-vm:63656kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB
[   63.780214] oom_reaper: reaped process 1397 (gem_ctx_thrash), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Comment 3 Chris Wilson 2017-11-08 16:02:12 UTC
commit 1ab22356b37ab08a391d6f007fda4c822bef9fb5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 7 22:06:56 2017 +0000

    drm/i915: Prune the reservation shared fence array
    
    The shared fence array is not autopruning and may continue to grow as an
    object is shared between new timelines. Take the opportunity when we
    think the object is idle (we have to confirm that any external fence is
    also signaled) to decouple all the fences.
    
    We apply a similar trick after waiting on an object, see commit
    e54ca9774777 ("drm/i915: Remove completed fences after a wait")
    
    v2: No longer need to handle the batch pool as a special case.
    v3: Need to trylock from within i915_vma_retire as this may be called
    form the shrinker - and we may later try to allocate underneath the
    reservation lock, so a deadlock is possible.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=102936
    Fixes: d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common struct reservation_object")
    Fixes: 80b204bce8f2 ("drm/i915: Enable multiple timelines")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171107220656.5020-1-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

commit 2f6a3783833dde63f1c08982943a8b2229b97afb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Nov 8 09:44:00 2017 +0000

    drm/i915: Idle the GPU before shinking everything
    
    The handling of contexts are peculiar. Instead of tieing their vma to
    activity, we pin the context. This means that we cannot simply unbind
    the context object itself at will (which would normally cause us to wait
    for the vma to be idle), but must manually idle the GPU and retire
    requests first.
    
    A consequence of this peculiarity is when doing a last desperate attempt
    to recover memory. If the memory is tied up inside active context
    objects, we will fail to recover any memory simply by trying to unbind
    the objects without first doing a wait-for-idle.
    
    A side-effect of removing the call to shrinker_lock_uninterruptible()
    from i915_gem_shrinker_oom() was that we removed an unlocked
    wait-for-idle, and so lost the "natural" shrinkage of context objects.
    By replacing that with a locked wait from inside i915_gem_shrink(), we
    not only replace it with the ability to recover all context objects, but
    do so for all i915_gem_shrink_all() callers.
    
    v2: Switching requires request allocation, which is not permitted from
    inside the shrinker as it only uses ordinary allocations.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=102936
    Fixes: f2123818ffad ("drm/i915: Move dev_priv->mm.[un]bound_list to its own lock")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171108094400.1386-1-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

These two commits should help a lot. I still expect that this test (but not this subtest) will fail if you let it run long enough.
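
To illustrate why the first commit matters for a context-thrashing test, here is a minimal toy model in Python. It is a sketch of the idea in the commit message only, not the kernel code: `Fence`, `ReservationModel`, and `simulate` are hypothetical names, and the model assumes each context (timeline) adds one shared fence to the same object and completes before the next one starts.

```python
from dataclasses import dataclass, field


@dataclass
class Fence:
    """Hypothetical stand-in for a fence: one per timeline (context)."""
    timeline: int
    signaled: bool = False


@dataclass
class ReservationModel:
    """Toy model of an object's shared-fence list; not the kernel structure."""
    shared: list = field(default_factory=list)

    def add_shared(self, fence):
        # Without autopruning, every new timeline appends one more slot.
        self.shared.append(fence)

    def prune_if_idle(self):
        # Idea from the commit message: only when every fence has signaled
        # (the object is idle) are all fences decoupled at once.
        if all(f.signaled for f in self.shared):
            self.shared.clear()
            return True
        return False


def simulate(contexts, prune):
    """Use one object from `contexts` timelines, retiring each use in turn.

    Returns (peak list length, final list length).
    """
    resv = ReservationModel()
    peak = 0
    for timeline in range(contexts):
        fence = Fence(timeline)
        resv.add_shared(fence)
        peak = max(peak, len(resv.shared))
        fence.signaled = True  # the request completes and is retired
        if prune:
            resv.prune_if_idle()
    return peak, len(resv.shared)


if __name__ == "__main__":
    print("no pruning:  ", simulate(1000, prune=False))
    print("with pruning:", simulate(1000, prune=True))
```

In this model the unpruned list grows linearly with the number of contexts, while the pruned list stays at a single entry. The real driver has to take locks and check external fences, which the sketch ignores.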
Comment 4 Elizabeth 2017-11-08 18:15:25 UTC
Created attachment 135315 [details]
output_gem_ctx_thash-family

The subtests alternated between being killed and failing an assertion; only the "processes" subtest passed. Commit information:

commit 087c404bd6d56a52e0656ac7c79faa376c25b796
Author:     Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Wed Nov 8 15:44:46 2017 +0000
Commit:     Chris Wilson <chris@chris-wilson.co.uk>
CommitDate: Wed Nov 8 15:44:46 2017 +0000

    drm-tip: 2017y-11m-08d-15h-44m-06s UTC integration manifest
Comment 5 Elizabeth 2017-11-08 18:17:17 UTC
Created attachment 135316 [details]
dmesg_gem_ctx_trash-family
Comment 6 Chris Wilson 2017-11-08 18:30:42 UTC
Can you confirm you applied the patches? And please turn off guc. You still have very large reservation_objects...
Comment 7 Chris Wilson 2017-11-14 17:11:42 UTC
Another contributing factor should be improved by:

commit ca25fe5efe4ab43cc5b4f3117a205c281805a5ca
Author: Christian König <ckoenig.leichtzumerken@gmail.com>
Date:   Tue Nov 14 15:24:36 2017 +0100

    dma-buf: try to replace a signaled fence in reservation_object_add_shared_inplace
    
    The amdgpu issue to also need signaled fences in the reservation objects should
    be fixed by now.
    
    Optimize the handling by replacing a signaled fence when adding a new
    shared one.
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171114142436.1360-2-christian.koenig@amd.com

commit 4d9c62e8ce69d0b0a834282a34bff5ce8eeacb1d
Author: Christian König <ckoenig.leichtzumerken@gmail.com>
Date:   Tue Nov 14 15:24:35 2017 +0100

    dma-buf: keep only not signaled fence in reservation_object_add_shared_replace v3
    
    The amdgpu issue to also need signaled fences in the reservation objects
    should be fixed by now.
    
    Optimize the list by keeping only the not signaled yet fences around.
    
    v2: temporary put the signaled fences at the end of the new container
    v3: put the old fence at the end of the new container as well.
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171114142436.1360-1-christian.koenig@amd.com
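
The two dma-buf commits above can be pictured with another small Python sketch. This is a toy model under stated assumptions, not the kernel implementation: fences are plain integers, `is_signaled` is a caller-supplied predicate, and the function names only echo the kernel ones for readability.

```python
def add_shared_inplace(slots, capacity, new_fence, is_signaled):
    """Toy in-place add: reuse a signaled slot before consuming a free one.

    Returns True if the fence fit without growing the array.
    """
    for i, old in enumerate(slots):
        if is_signaled(old):
            slots[i] = new_fence  # replace the completed fence
            return True
    if len(slots) < capacity:
        slots.append(new_fence)
        return True
    return False  # the caller has to build a larger array


def add_shared_replace(slots, new_fence, is_signaled):
    """Toy grow path: carry over only fences that have not signaled yet."""
    kept = [f for f in slots if not is_signaled(f)]
    kept.append(new_fence)
    return kept


def demo(count=8, capacity=4):
    """Add `count` fences one after another, each completing before the next."""
    signaled = set()
    slots = []
    for fence in range(count):
        if not add_shared_inplace(slots, capacity, fence, signaled.__contains__):
            slots = add_shared_replace(slots, fence, signaled.__contains__)
        signaled.add(fence)
    return slots


if __name__ == "__main__":
    print(demo())
```

When every fence completes before the next one is added, the model never needs more than one slot, which is the kind of growth the commits aim to avoid.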
Comment 8 Elizabeth 2017-11-17 22:52:30 UTC
(In reply to Chris Wilson from comment #6)
> Can you confirm you applied the patches? And please turn off guc. You still
> have very large reservation_objects...
Hello, sorry for the delay. I tested again, and the test does not seem to advance even after several minutes.

IGT-Version: 1.20-g936b971 (x86_64) (Linux: 4.14.0-drm-intel-qa-ww46-commit-ed17259+ x86_64) 
(gem_ctx_thrash:1555) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:1555) igt-core-DEBUG: Starting subtest: single
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:1555) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: drmSetMaster(fd) == 0
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: is_i915_device(fd) && has_known_intel_chipset(fd)
(gem_ctx_thrash:1555) ioctl-wrappers-DEBUG: Test requirement passed: err == 0
(gem_ctx_thrash:1555) DEBUG: Test requirement passed: gem_can_store_dword(fd, 0)
Creating 60228 contexts (assuming of size 106496 with execlists)
(gem_ctx_thrash:1555) intel-os-DEBUG: Checking 60,228 surfaces of size 106,496 bytes (total 6,444,879,872) against RAM + swap
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:1555) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:1555) intel-os-DEBUG: Test requirement passed: __intel_check_memory(count, size, mode, &required, &total)
(gem_ctx_thrash:1555) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:1555) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
^C(gem_ctx_thrash:1555) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0' 

dmesg:
[  100.172674] [IGT] gem_ctx_thrash: executing
[  100.174066] [IGT] gem_ctx_thrash: starting subtest single
[  100.228316] gem_ctx_thrash (1555): drop_caches: 4

Verified that patches from comment 3 were applied, but the ones from comment 7 don't seem to be included in commit ed17259.
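
For reference, the total in the "Checking 60,228 surfaces of size 106,496 bytes" line above is larger than a plain count times size. The Python sketch below reproduces it under two assumptions that are not stated anywhere in this report: a 512-byte bookkeeping overhead per object and rounding up to a 4096-byte page. Both constant names are illustrative, chosen only because they match the logged total.

```python
KERNEL_BO_OVERHEAD = 512  # assumed per-object bookkeeping, in bytes
PAGE_SIZE = 4096          # assumed rounding granularity, in bytes


def required_bytes(count, size):
    """Estimate the memory needed for `count` objects of `size` bytes each."""
    total = count * (size + KERNEL_BO_OVERHEAD)
    # Round up to a whole number of pages using integer arithmetic.
    return -(-total // PAGE_SIZE) * PAGE_SIZE


if __name__ == "__main__":
    total = required_bytes(60228, 106496)
    print(total)  # 6444879872, the total logged for this run
    print(f"{total / 2**30:.2f} GiB")
```

The result is a large fraction of the 7.64 GB of RAM listed in the hardware configuration, which fits the out-of-memory behaviour seen earlier.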
Comment 9 Chris Wilson 2017-11-17 22:55:01 UTC
(In reply to Elizabeth from comment #8)
> (In reply to Chris Wilson from comment #6)
> > Can you confirm you applied the patches? And please turn off guc. You still
> > have very large reservation_objects...
> Hello, sorry for the delay, tested again and the test seems to never advance
> after various minutes. 

The test takes several hours to run.
Comment 10 Chris Wilson 2017-11-17 22:56:43 UTC
(In reply to Elizabeth from comment #8)
> Verified that patches from comment 3 were applied, but the ones from comment
> 7 don't seem to be included in commit ed17259.

Check again or else you are not testing the same upstream as drm-tip.
Comment 11 Chris Wilson 2017-11-17 22:59:49 UTC
*** Bug 103805 has been marked as a duplicate of this bug. ***
Comment 12 Elizabeth 2017-11-30 16:24:28 UTC
Created attachment 135831 [details]
dmesg_gem_ctx_trash-single_5hrs

I ran igt@gem_ctx_thrash@single for more than 5 hours and the test didn't finish. dmesg shows many hung-task reports like this one:

[ 6405.088379] INFO: task kswapd0:42 blocked for more than 120 seconds.
[ 6405.088398]       Tainted: G     U           4.15.0-rc1-drm-intel-qa-ww48-commit-0645c6d+ #1
[ 6405.088406] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6405.088415] kswapd0         D    0    42      2 0x80000000
[ 6405.088422] Call Trace:
[ 6405.088443]  ? __schedule+0x3c1/0x890
[ 6405.088449]  schedule+0x32/0x80
[ 6405.088456]  io_schedule+0x12/0x40
[ 6405.088463]  __lock_page+0x102/0x140
[ 6405.088468]  ? page_cache_tree_insert+0xd0/0xd0
[ 6405.088475]  deferred_split_scan+0x252/0x2b0
[ 6405.088482]  shrink_slab.part.50+0x1f5/0x410
[ 6405.088489]  shrink_node+0x314/0x320
[ 6405.088495]  kswapd+0x32a/0x730
[ 6405.088505]  kthread+0xf5/0x130
[ 6405.088510]  ? mem_cgroup_shrink_node+0x180/0x180
[ 6405.088515]  ? kthread_associate_blkcg+0x90/0x90
[ 6405.088520]  ? kthread_associate_blkcg+0x90/0x90
[ 6405.088525]  ret_from_fork+0x1f/0x30

I hope this is the information you were looking for.
Comment 13 Hector Velazquez 2018-02-27 22:41:41 UTC
This test passed successfully after ~55 minutes on GLK QA.

Test List

igt@gem_ctx_thrash@single

IGT-Version: 1.21-ga2664f8 (x86_64) (Linux: 4.16.0-rc2-drm-tip-ww9-commit-3a86cab+ x86_64)
======================================
        output
======================================
(gem_ctx_thrash:639) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:639) igt-core-DEBUG: Starting subtest: single
(gem_ctx_thrash:639) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:639) drmtest-DEBUG: Test requirement passed: is_i915_device(fd) && has_known_intel_chipset(fd)
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:639) ioctl-wrappers-DEBUG: Test requirement passed: dir >= 0
(gem_ctx_thrash:639) ioctl-wrappers-DEBUG: Test requirement passed: err == 0
(gem_ctx_thrash:639) DEBUG: Test requirement passed: gem_can_store_dword(fd, 0)
(gem_ctx_thrash:639) i915/gem-context-DEBUG: Test requirement passed: gem_has_contexts(fd)
Creating 60493 contexts (assuming of size 106496 with execlists)
(gem_ctx_thrash:639) intel-os-DEBUG: Checking 60493 surfaces of size 106496 bytes (total 6473236480) against RAM + swap
(gem_ctx_thrash:639) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:639) intel-os-DEBUG: Test requirement passed: __intel_check_memory(count, size, mode, &required, &total)
(gem_ctx_thrash:639) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:639) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
Subtest single: SUCCESS (3308.101s)
(gem_ctx_thrash:639) igt-core-DEBUG: Exiting with status code 0
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'

(Closing this case: verified as fixed / not a bug on GLK.)

