Summary: [IGT] gem_ctx_thrash single test assertion failure on function gem_set_domain
Product: DRI
Reporter: Luis Botello <luis.botello.ortega>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: hector.franciscox.velazquez.suriano, intel-gfx-bugs
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: GLK
i915 features: GEM/Other
Description
Luis Botello
2017-09-21 21:45:46 UTC
Created attachment 134420 [details]
IGT_output
Created attachment 135290 [details]
kernl_log_gem_ctx_thrash-single
On GLK now this test is being killed instead:
$ : sudo -E ./gem_ctx_thrash --r single
IGT-Version: 1.20-g9fe5a9a (x86_64) (Linux: 4.14.0-rc8-drm-intel-qa-ww45-commit-b911f67+ x86_64)
Creating 60228 contexts (assuming of size 106496 with execlists)
Killed
[ 63.764376] Out of memory: Kill process 1397 (gem_ctx_thrash) score 1000 or sacrifice child
[ 63.764383] Killed process 1397 (gem_ctx_thrash) total-vm:63656kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB
[ 63.780214] oom_reaper: reaped process 1397 (gem_ctx_thrash), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
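For scale, the test output above reports 60228 contexts at an assumed 106496 bytes each. The sketch below works out that footprint. The debug log quoted later in this report gives a slightly larger total (6,444,879,872 bytes); a per-object overhead of 512 bytes plus rounding up to a 4096-byte page reproduces that figure, but both values are assumptions made here to match the log and are not stated anywhere in this report.

```python
# Rough footprint estimate for the run above. The context count and size are
# taken from the test output; the overhead and page size are ASSUMPTIONS that
# happen to reproduce the total printed in the debug log further down.
PAGE_SIZE = 4096               # assumed page size
ASSUMED_OBJECT_OVERHEAD = 512  # assumed per-object bookkeeping, in bytes


def raw_bytes(count: int, size: int) -> int:
    """Memory for the context images alone."""
    return count * size


def estimated_requirement(count: int, size: int,
                          overhead: int = ASSUMED_OBJECT_OVERHEAD,
                          page: int = PAGE_SIZE) -> int:
    """Raw size plus the assumed overhead, rounded up to a whole page."""
    total = count * (size + overhead)
    return -(-total // page) * page  # ceiling division, then back to bytes


contexts = 60228
context_size = 106496

print(raw_bytes(contexts, context_size))              # 6414041088
print(estimated_requirement(contexts, context_size))  # 6444879872
```

Either way the request is close to 6 GiB of context state, which is why the outcome depends so heavily on how well the shrinker can reclaim context objects.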
commit 1ab22356b37ab08a391d6f007fda4c822bef9fb5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Nov 7 22:06:56 2017 +0000

drm/i915: Prune the reservation shared fence array

The shared fence array is not autopruning and may continue to grow as an object is shared between new timelines. Take the opportunity when we think the object is idle (we have to confirm that any external fence is also signaled) to decouple all the fences.

We apply a similar trick after waiting on an object, see commit e54ca9774777 ("drm/i915: Remove completed fences after a wait")

v2: No longer need to handle the batch pool as a special case.
v3: Need to trylock from within i915_vma_retire as this may be called form the shrinker - and we may later try to allocate underneath the reservation lock, so a deadlock is possible.

References: https://bugs.freedesktop.org/show_bug.cgi?id=102936
Fixes: d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common struct reservation_object")
Fixes: 80b204bce8f2 ("drm/i915: Enable multiple timelines")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171107220656.5020-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

commit 2f6a3783833dde63f1c08982943a8b2229b97afb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Nov 8 09:44:00 2017 +0000

drm/i915: Idle the GPU before shinking everything

The handling of contexts are peculiar. Instead of tieing their vma to activity, we pin the context. This means that we cannot simply unbind the context object itself at will (which would normally cause us to wait for the vma to be idle), but must manually idle the GPU and retire requests first.

A consequence of this peculiarity is when doing a last desperate attempt to recover memory. If the memory is tied up inside active context objects, we will fail to recover any memory simply by trying to unbind the objects without first doing a wait-for-idle.

A side-effect of removing the call to shrinker_lock_uninterruptible() from i915_gem_shrinker_oom() was that we removed an unlocked wait-for-idle, and so lost the "natural" shrinkage of context objects. By replacing that with a locked wait from inside i915_gem_shrink(), we not only replace it with the ability to recover all context objects, but do so for all i915_gem_shrink_all() callers.

v2: Switching requires request allocation, which is not permitted from inside the shrinker as it only uses ordinary allocations.

References: https://bugs.freedesktop.org/show_bug.cgi?id=102936
Fixes: f2123818ffad ("drm/i915: Move dev_priv->mm.[un]bound_list to its own lock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171108094400.1386-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

should help a lot. Still expect that this test (but not this subtest) will fail if you let it run long enough.

Created attachment 135315 [details]
output_gem_ctx_thash-family

Tests switched between killed and failed assertion. Only subtest processes pass as success.

Commit information:
commit 087c404bd6d56a52e0656ac7c79faa376c25b796
Author: Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Wed Nov 8 15:44:46 2017 +0000
Commit: Chris Wilson <chris@chris-wilson.co.uk>
CommitDate: Wed Nov 8 15:44:46 2017 +0000

drm-tip: 2017y-11m-08d-15h-44m-06s UTC integration manifest

Created attachment 135316 [details]
dmesg_gem_ctx_trash-family
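The growth that the "Prune the reservation shared fence array" commit message describes can be pictured with a toy model. This is only an illustration under simplifying assumptions: `Fence`, `SharedFences` and both `add_*` methods are hypothetical names invented for this sketch, and none of it is the kernel's reservation_object code. It compares a slot array that only reuses a slot for the same timeline against one that also reuses any slot whose fence has already signaled.

```python
from dataclasses import dataclass, field


@dataclass
class Fence:
    """Hypothetical stand-in for a fence: one per timeline, may be signaled."""
    timeline: int
    signaled: bool = False


@dataclass
class SharedFences:
    """Hypothetical shared-fence slot array (not the kernel structure)."""
    slots: list = field(default_factory=list)

    def add_without_pruning(self, fence: Fence) -> None:
        # Only a fence from the same timeline is replaced; everything else stays.
        for i, old in enumerate(self.slots):
            if old.timeline == fence.timeline:
                self.slots[i] = fence
                return
        self.slots.append(fence)

    def add_replacing_signaled(self, fence: Fence) -> None:
        # Additionally reuse the slot of any fence that has already signaled.
        for i, old in enumerate(self.slots):
            if old.timeline == fence.timeline or old.signaled:
                self.slots[i] = fence
                return
        self.slots.append(fence)


def simulate(add_name: str, timelines: int) -> int:
    """Share one object across many timelines, one at a time; return slot count."""
    fences = SharedFences()
    previous = None
    for t in range(timelines):
        if previous is not None:
            previous.signaled = True  # earlier work has completed
        previous = Fence(timeline=t)
        getattr(fences, add_name)(previous)
    return len(fences.slots)


print(simulate("add_without_pruning", 1000))     # 1000
print(simulate("add_replacing_signaled", 1000))  # 1
```

In the toy model the unpruned array ends up with one slot per timeline even though every earlier fence has signaled, which matches the "very large reservation_objects" symptom discussed in this thread.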
Can you confirm you applied the patches? And please turn off guc. You still have very large reservation_objects...

Another contributing factor should be improved by:

commit ca25fe5efe4ab43cc5b4f3117a205c281805a5ca
Author: Christian König <ckoenig.leichtzumerken@gmail.com>
Date: Tue Nov 14 15:24:36 2017 +0100

dma-buf: try to replace a signaled fence in reservation_object_add_shared_inplace

The amdgpu issue to also need signaled fences in the reservation objects should be fixed by now.

Optimize the handling by replacing a signaled fence when adding a new shared one.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171114142436.1360-2-christian.koenig@amd.com

commit 4d9c62e8ce69d0b0a834282a34bff5ce8eeacb1d
Author: Christian König <ckoenig.leichtzumerken@gmail.com>
Date: Tue Nov 14 15:24:35 2017 +0100

dma-buf: keep only not signaled fence in reservation_object_add_shared_replace v3

The amdgpu issue to also need signaled fences in the reservation objects should be fixed by now.

Optimize the list by keeping only the not signaled yet fences around.

v2: temporary put the signaled fences at the end of the new container
v3: put the old fence at the end of the new container as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20171114142436.1360-1-christian.koenig@amd.com

(In reply to Chris Wilson from comment #6)
> Can you confirm you applied the patches? And please turn off guc. You still
> have very large reservation_objects...

Hello, sorry for the delay, tested again and the test seems to never advance after various minutes.

IGT-Version: 1.20-g936b971 (x86_64) (Linux: 4.14.0-drm-intel-qa-ww46-commit-ed17259+ x86_64)
(gem_ctx_thrash:1555) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:1555) igt-core-DEBUG: Starting subtest: single
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:1555) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: drmSetMaster(fd) == 0
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: is_i915_device(fd) && has_known_intel_chipset(fd)
(gem_ctx_thrash:1555) ioctl-wrappers-DEBUG: Test requirement passed: err == 0
(gem_ctx_thrash:1555) DEBUG: Test requirement passed: gem_can_store_dword(fd, 0)
Creating 60228 contexts (assuming of size 106496 with execlists)
(gem_ctx_thrash:1555) intel-os-DEBUG: Checking 60,228 surfaces of size 106,496 bytes (total 6,444,879,872) against RAM + swap
(gem_ctx_thrash:1555) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:1555) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:1555) intel-os-DEBUG: Test requirement passed: __intel_check_memory(count, size, mode, &required, &total)
(gem_ctx_thrash:1555) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:1555) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
^C(gem_ctx_thrash:1555) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'

dmesg:
2333 [ 100.172674] [IGT] gem_ctx_thrash: executing
2334 [ 100.174066] [IGT] gem_ctx_thrash: starting subtest single
2335 [ 100.228316] gem_ctx_thrash (1555): drop_caches: 4

Verified that patches from comment 3 were applied, but the ones from comment 7 don't seem to be included in commit ed17259.

(In reply to Elizabeth from comment #8)
> (In reply to Chris Wilson from comment #6)
> > Can you confirm you applied the patches? And please turn off guc. You still
> > have very large reservation_objects...
> Hello, sorry for the delay, tested again and the test seems to never advance
> after various minutes.

The test takes several hours to run.

(In reply to Elizabeth from comment #8)
> Verified that patches from comment 3 were applied, but the ones from comment
> 7 don't seem to be included in commit ed17259.

Check again or else you are not testing the same upstream as drm-tip.

*** Bug 103805 has been marked as a duplicate of this bug. ***

Created attachment 135831 [details]
dmesg_gem_ctx_trash-single_5hrs
I ran igt@gem_ctx_thrash@single for more than 5 hours and the test didn't finish. I got a lot of this in the dmesg:
[ 6405.088379] INFO: task kswapd0:42 blocked for more than 120 seconds.
[ 6405.088398] Tainted: G U 4.15.0-rc1-drm-intel-qa-ww48-commit-0645c6d+ #1
[ 6405.088406] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6405.088415] kswapd0 D 0 42 2 0x80000000
[ 6405.088422] Call Trace:
[ 6405.088443] ? __schedule+0x3c1/0x890
[ 6405.088449] schedule+0x32/0x80
[ 6405.088456] io_schedule+0x12/0x40
[ 6405.088463] __lock_page+0x102/0x140
[ 6405.088468] ? page_cache_tree_insert+0xd0/0xd0
[ 6405.088475] deferred_split_scan+0x252/0x2b0
[ 6405.088482] shrink_slab.part.50+0x1f5/0x410
[ 6405.088489] shrink_node+0x314/0x320
[ 6405.088495] kswapd+0x32a/0x730
[ 6405.088505] kthread+0xf5/0x130
[ 6405.088510] ? mem_cgroup_shrink_node+0x180/0x180
[ 6405.088515] ? kthread_associate_blkcg+0x90/0x90
[ 6405.088520] ? kthread_associate_blkcg+0x90/0x90
[ 6405.088525] ret_from_fork+0x1f/0x30
Hope this is the information that was being looked for.
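When a dmesg capture contains many of these reports, it can help to pull out just the blocked tasks before reading the call traces. The sketch below is one way to do that in Python; the regular expression is written against the message format quoted above, and the helper name `blocked_tasks` is hypothetical.

```python
import re

# Matches the hung-task header line as it appears in the dmesg excerpt above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def blocked_tasks(lines):
    """Return (task name, pid, timeout in seconds) for each hung-task report."""
    found = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            found.append(
                (match["comm"], int(match["pid"]), int(match["secs"]))
            )
    return found


sample = [
    "[ 6405.088379] INFO: task kswapd0:42 blocked for more than 120 seconds.",
    "[ 6405.088422] Call Trace:",
    "[ 6405.088449] schedule+0x32/0x80",
]
print(blocked_tasks(sample))  # [('kswapd0', 42, 120)]
```

Counting how often each task name appears gives a quick picture of whether only kswapd0 is stuck in the shrinker or other tasks are blocked as well.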
This test has passed successfully after ~55 minutes on GLK

QA Test List
igt@gem_ctx_thrash@single
IGT-Version: 1.21-ga2664f8 (x86_64) (Linux: 4.16.0-rc2-drm-tip-ww9-commit-3a86cab+ x86_64)

====================================== output ======================================
(gem_ctx_thrash:639) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:639) igt-core-DEBUG: Starting subtest: single
(gem_ctx_thrash:639) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:639) drmtest-DEBUG: Test requirement passed: is_i915_device(fd) && has_known_intel_chipset(fd)
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:639) ioctl-wrappers-DEBUG: Test requirement passed: dir >= 0
(gem_ctx_thrash:639) ioctl-wrappers-DEBUG: Test requirement passed: err == 0
(gem_ctx_thrash:639) DEBUG: Test requirement passed: gem_can_store_dword(fd, 0)
(gem_ctx_thrash:639) i915/gem-context-DEBUG: Test requirement passed: gem_has_contexts(fd)
Creating 60493 contexts (assuming of size 106496 with execlists)
(gem_ctx_thrash:639) intel-os-DEBUG: Checking 60493 surfaces of size 106496 bytes (total 6473236480) against RAM + swap
(gem_ctx_thrash:639) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_ctx_thrash:639) intel-os-DEBUG: Test requirement passed: __intel_check_memory(count, size, mode, &required, &total)
(gem_ctx_thrash:639) igt-core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_ctx_thrash:639) ioctl-wrappers-DEBUG: Test requirement passed: __gem_set_caching(fd, handle, caching) == 0
Subtest single: SUCCESS (3308.101s)
(gem_ctx_thrash:639) igt-core-DEBUG: Exiting with status code 0
(gem_ctx_thrash:639) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'

(closing this case..., verified as fixed/not a bug on GLK)