Bug 103718 - [CI] igt@drv_selftest@live_gtt - incomplete
Summary: [CI] igt@drv_selftest@live_gtt - incomplete
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-13 12:11 UTC by Marta Löfstedt
Modified: 2017-12-05 08:44 UTC

See Also:
i915 platform: BXT, GLK
i915 features:


Description Marta Löfstedt 2017-11-13 12:11:05 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3332/shard-apl6/igt@drv_selftest@live_gtt.html

oom killer starting:

<7>[ 1607.338719] [drm:drm_setup_crtcs] desired mode 1024x768 set on crtc 39 (0,0)
<7>[ 1607.339273] [drm:intelfb_create [i915]] no BIOS fb, allocating a new one
<6>[ 1636.138752] Purging GPU memory, 0 pages freed, 845 pages still pinned.
<3>[ 1636.138774] 1 and 0 pages still available in the bound and unbound GPU page lists.
<4>[ 1636.138886] drv_selftest invoked oom-killer: gfp_mask=0x16042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK), nodemask=(null),  order=2, oom_score_adj=1000
...
<3>[ 1636.928657] 1 and 0 pages still available in the bound and unbound GPU page lists.
<6>[ 1636.974782] oom_reaper: reaped process 6772 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
<5>[ 1637.006382] owatch: /dev/watchdog0 closed

So it looks like the OOM killer can kill the python process, and with it the whole test execution.

From run.log this just looks like your typical system hang:
Completed CI_IGT_test CI_DRM_3332@shard-apl6 : FAILURE
CI_IGT_test runtime 160 seconds
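
For triage, a minimal sketch of how a dmesg capture could be scanned for the OOM pattern described above. The function name `find_oom_events` and the regular expressions are hypothetical; they only assume the `<level>[ timestamp] message` line shape and the "invoked oom-killer" / "oom_reaper: reaped process" wording seen in this excerpt.

```python
import re

# Matches kernel log lines such as "<4>[ 1636.138886] message".
LINE_RE = re.compile(r"^<(\d)>\[\s*(\d+\.\d+)\]\s(.*)$")
# Message shapes assumed from the excerpt in this report.
INVOKED_RE = re.compile(r"(\S+) invoked oom-killer")
REAPED_RE = re.compile(r"oom_reaper: reaped process \d+ \(([^)]+)\)")


def find_oom_events(dmesg_text):
    """Return (timestamp, kind, process) tuples for OOM-related lines."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." and anything that is not a log line
        ts, msg = float(m.group(2)), m.group(3)
        invoked = INVOKED_RE.match(msg)
        reaped = REAPED_RE.match(msg)
        if invoked:
            events.append((ts, "invoked", invoked.group(1)))
        elif reaped:
            events.append((ts, "reaped", reaped.group(1)))
    return events


if __name__ == "__main__":
    sample = "\n".join([
        "<6>[ 1636.138752] Purging GPU memory, 0 pages freed, 845 pages still pinned.",
        "<4>[ 1636.138886] drv_selftest invoked oom-killer: gfp_mask=0x16042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK), nodemask=(null),  order=2, oom_score_adj=1000",
        "...",
        "<6>[ 1636.974782] oom_reaper: reaped process 6772 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB",
    ])
    for ts, kind, proc in find_oom_events(sample):
        print(f"{ts:.6f} {kind:8s} {proc}")
```

On the lines quoted in this description, the sketch reports drv_selftest invoking the OOM killer at 1636.138886 and python3 being reaped at 1636.974782.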
Comment 1 Chris Wilson 2017-11-13 12:34:29 UTC
Only seen once so far (I think, at least); it looks to be a kernel leak. At the moment, the obvious thing to do is a run with kmemleak, but my initial guess is that it's a result of an early failure not cleaning up properly. The module's allocations (such as drm_mm, kmem_cache, etc.) are checked upon module unload (and kselftest), but no warning was seen, hence the search for something a little more unusual.
Comment 2 Marta Löfstedt 2017-11-30 09:17:43 UTC
(In reply to Chris Wilson from comment #1)
> Only seen once so far (I think, at least); it looks to be a kernel leak. At
> the moment, the obvious thing to do is a run with kmemleak, but my initial
> guess is that it's a result of an early failure not cleaning up properly. The
> module's allocations (such as drm_mm, kmem_cache, etc.) are checked upon
> module unload (and kselftest), but no warning was seen, hence the search for
> something a little more unusual.

This incomplete is pretty frequent on APL, but due to ftrace messing up pstore and the recent 4.15.0-rc1 fire, we'll have to wait and see if we can get any reasonable data on this.
Comment 3 Marta Löfstedt 2017-11-30 11:33:39 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3415/shard-apl6/igt@drv_selftest@live_gtt.html

doesn't have any of the OOM stuff, so I am changing the title and filing all igt@drv_selftest@live_gtt incompletes on this bug.

run.log doesn't hint at a timeout or softdog, so a system hang is assumed.

This is the last dmesg:
<7>[ 2890.553261] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 01
<5>[ 2891.380887] __shrink_hole timed out at ofset 1ffffff000 [0 - 1000000000000]
<5>[ 2892.632006] lowlevel_hole timed out before 192296/260705
<5>[ 2893.636015] drunk_hole timed out after 114947/521410
<5>[ 2894.637006] walk_hole timed out at 1c93a000
<5>[ 2895.784092] pot_hole timed out after 16/31
<5>[ 2896.837487] fill_hole timed out (npages=279841, prime=23)
<6>[ 2896.842475] Console: switching to colour dummy device 80x25
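
A rough sketch for listing which subtests reported a timeout in a dmesg excerpt like the one above. The helper name `timed_out_subtests` is hypothetical, and the pattern assumes only the `<level>[ timestamp] name timed out` shape visible in these lines.

```python
import re

# "<5>[ 2892.632006] lowlevel_hole timed out before 192296/260705"
TIMEOUT_RE = re.compile(r"^<\d>\[\s*(\d+\.\d+)\]\s(\w+) timed out")


def timed_out_subtests(dmesg_text):
    """Return (subtest_name, timestamp) pairs in log order."""
    hits = []
    for line in dmesg_text.splitlines():
        m = TIMEOUT_RE.match(line)
        if m:
            hits.append((m.group(2), float(m.group(1))))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        "<5>[ 2892.632006] lowlevel_hole timed out before 192296/260705",
        "<5>[ 2893.636015] drunk_hole timed out after 114947/521410",
        "<6>[ 2896.842475] Console: switching to colour dummy device 80x25",
    ])
    for name, ts in timed_out_subtests(sample):
        print(f"{name} at {ts:.6f}")
```

These "timed out" notices are printed at notice level (`<5>`), so on their own they do not indicate a failure; the sketch only helps to see how far the selftest got before the log ends.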
Comment 4 Marta Löfstedt 2017-11-30 11:37:04 UTC
(In reply to Marta Löfstedt from comment #3)
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3415/shard-apl6/
> igt@drv_selftest@live_gtt.html
> 
> doesn't have any of the OOM stuff, so I am changing the title and filing all
> igt@drv_selftest@live_gtt incompletes on this bug.
> 
> run.log doesn't hint at a timeout or softdog, so a system hang is assumed.
> 
> This is the last dmesg:
> <7>[ 2890.553261] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to
> 01
> <5>[ 2891.380887] __shrink_hole timed out at ofset 1ffffff000 [0 -
> 1000000000000]
> <5>[ 2892.632006] lowlevel_hole timed out before 192296/260705
> <5>[ 2893.636015] drunk_hole timed out after 114947/521410
> <5>[ 2894.637006] walk_hole timed out at 1c93a000
> <5>[ 2895.784092] pot_hole timed out after 16/31
> <5>[ 2896.837487] fill_hole timed out (npages=279841, prime=23)
> <6>[ 2896.842475] Console: switching to colour dummy device 80x25

The dmesg snippet is wrong; it is from this GLK shards run:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3415/shard-glkb4/igt@drv_selftest@live_gtt.html

The last APL dmesgs are:
<7>[ 1653.449383] [drm:intel_fb_initial_config [i915]] Not using firmware configuration
<7>[ 1653.449404] [drm:drm_setup_crtcs] looking for cmdline mode on connector 72
<7>[ 1653.449427] [drm:drm_setup_crtcs] looking for preferred mode on connector 72 0
<7>[ 1653.449434] [drm:drm_setup_crtcs] found mode 1024x768
<7>[ 1653.449439] [drm:drm_setup_crtcs] picking CRTCs for 8192x8192 config
<7>[ 1653.449467] [drm:drm_setup_crtcs] desired mode 1024x768 set on crtc 40 (0,0)
<7>[ 1653.449577] [drm:intelfb_create [i915]] no BIOS fb, allocating a new one
<7>[ 1653.483768] [drm:asle_work [i915]] bclp = 0x800000ff
<7>[ 1653.483842] [drm:asle_work [i915]] updating opregion backlight 255/255
<6>[ 1664.028427] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Comment 5 Marta Löfstedt 2017-11-30 11:38:25 UTC
To clarify my previous mess:
Here are 2 new occurrences of this issue; both look like system hangs.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3415/shard-apl6/igt@drv_selftest@live_gtt.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3415/shard-glkb4/igt@drv_selftest@live_gtt.html
Comment 6 Chris Wilson 2017-12-04 14:37:00 UTC
I think this explains this failure, and it should also prevent the sanitycheck incompletes.

commit c325dd948b4e4e9fe0cc7d612f2101fb3804de5c (HEAD, upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 30 21:41:10 2017 +0000

    igt/drv_selftests: Disable initialising the display
    
    Many of the selftests try to completely fill global resources; resources
    that are presumed available for bringing up the display. Avoid the
    contention by simply not bringing up the display!
    
    This does limit the effectiveness of selftesting to GEM for the
    time being. To exercise KMS from selftests we would essentially have to
    always mock the displays.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=103718
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Comment 7 Marta Löfstedt 2017-12-05 08:44:00 UTC
Fix included in CI_DRM_3449; I will close.

