Bug 105985

Summary: [CI][CNL only] incomplete - system hang due to temperature
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: high CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: CNL i915 features:

Description Marta Löfstedt 2018-04-11 07:56:17 UTC
On fi-cnl-y3 we are seeing a large number of incompletes, with the messages below in dmesg. The theory is that the machine throttles down for too short a time (or has a bad cooling design), after which the temperature spikes through the roof and the machine has to be halted.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4041/fi-cnl-y3/igt@gem_exec_flush@basic-uc-pro-default.html

<2>[  162.179870] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179872] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179988] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179989] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179990] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179992] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
<6>[  162.180866] CPU1: Core temperature/speed normal
<6>[  162.180867] CPU3: Core temperature/speed normal
<6>[  162.180870] CPU2: Package temperature/speed normal
<6>[  162.180871] CPU0: Package temperature/speed normal
<6>[  162.180871] CPU3: Package temperature/speed normal
<6>[  162.180873] CPU1: Package temperature/speed normal
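In the excerpt above, the whole throttle episode lasts about a millisecond. The first throttle message is at 162.179870 and the last "normal" message is at 162.180873. The Python below is a minimal sketch for pulling such episodes out of other CI dmesg logs. It parses only the two message forms quoted in this bug. The regex, `ThermalEvent`, `parse_thermal_events` and `throttle_window` are hypothetical helpers, not part of IGT or the CI tooling, and other kernel versions may word these messages differently.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern covering only the two thermal messages quoted in this bug:
#   <2>[  162.179870] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
#   <6>[  162.180866] CPU1: Core temperature/speed normal
THERMAL_RE = re.compile(
    r"^<\d>\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): "
    r"(?P<scope>Core|Package) temperature"
    r"(?: above threshold, cpu clock throttled \(total events = (?P<events>\d+)\)"
    r"|/speed (?P<normal>normal))$"
)


@dataclass
class ThermalEvent:
    timestamp: float             # seconds since boot, as printed by the kernel
    cpu: int
    scope: str                   # "Core" or "Package"
    throttled: bool              # True for "above threshold", False for "speed normal"
    total_events: Optional[int]  # only present on "above threshold" messages


def parse_thermal_events(lines) -> List[ThermalEvent]:
    """Return the throttle/normal events found in an iterable of dmesg lines."""
    events = []
    for line in lines:
        m = THERMAL_RE.match(line.strip())
        if m is None:
            continue  # not one of the two thermal messages
        events.append(ThermalEvent(
            timestamp=float(m.group("ts")),
            cpu=int(m.group("cpu")),
            scope=m.group("scope"),
            throttled=m.group("normal") is None,
            total_events=int(m.group("events")) if m.group("events") else None,
        ))
    return events


def throttle_window(events: List[ThermalEvent]) -> Optional[float]:
    """Seconds from the first throttle message to the last 'speed normal' message."""
    starts = [e.timestamp for e in events if e.throttled]
    ends = [e.timestamp for e in events if not e.throttled]
    if not starts or not ends:
        return None
    return max(ends) - min(starts)


if __name__ == "__main__":
    sample = [
        "<2>[  162.179870] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)",
        "<6>[  162.180873] CPU1: Package temperature/speed normal",
    ]
    window = throttle_window(parse_thermal_events(sample))
    print(f"throttled for {window * 1000:.3f} ms")
```

A very short window, as in this log, would fit the theory that throttling kicks in only briefly before the temperature rises again.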
Comment 1 Marta Löfstedt 2018-04-11 08:14:25 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@syncobj_wait@wait-all-for-submit-snapshot.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_workarounds@suspend-resume.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_exec_await@wide-all.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@sw_sync@sync_busy_fork.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_linear_blits@normal.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@kms_cursor_legacy@pipe-a-single-bo.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_mocs_settings@mocs-settings-ctx-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@kms_vblank@pipe-b-wait-busy-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@kms_chv_cursor_fail@pipe-b-128x128-top-edge.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_frontbuffer_tracking@psr-1p-primscrn-cur-indfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions-varying-size.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@prime_busy@wait-after-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@gem_ctx_isolation@bcs0-dirty-switch.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-shrfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_vblank@pipe-a-query-forked-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@gem_exec_suspend@basic-s3.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@kms_vblank@pipe-c-query-forked-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@kms_addfb_basic@unused-handle.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-indfb-msflip-blt.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@perf_pmu@rc6-runtime-pm-long.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_chv_cursor_fail@pipe-b-256x256-right-edge.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_universal_plane@universal-plane-gen9-features-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_psr_sink_crc@primary_mmap_gtt.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_chv_cursor_fail@pipe-c-64x64-left-edge.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@tools_test@tools_test.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@pm_rpm@sysfs-read.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@pm_lpsp@edp-native.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_ccs@pipe-b-bad-pixel-format.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4034/fi-cnl-y3/igt@gem_ctx_param@basic-default.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@pm_rpm@drm-resources-equal.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-cur-indfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@pm_rpm@debugfs-read.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@kms_frontbuffer_tracking@drrs-1p-offscren-pri-shrfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_11/fi-cnl-y3/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-cur-indfb-draw-mmap-cpu.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_3/fi-cnl-y3/igt@kms_atomic_transition@plane-all-modeset-transition-fencing.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_2/fi-cnl-y3/igt@pm_rpm@gem-execbuf-stress-extra-wait.html
Comment 2 Marta Löfstedt 2018-04-11 09:14:50 UTC
It might help to lower the throttle temperature in the BIOS, if that is possible.
Comment 3 Marta Löfstedt 2018-04-18 06:34:21 UTC
In the drmtip runs from last week I saw only 2 new incompletes with temperature-related messages in dmesg close in time to the incomplete. There is still a suspicion that a lab environment issue could be affecting this machine.
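One rough way to check whether an incomplete lines up with thermal messages is to treat the last timestamped dmesg line as the time of the hang. You can then look for temperature-related lines shortly before it. The sketch below assumes that heuristic and an arbitrary default window of 5 seconds. `temperature_lines_near_end` is a hypothetical helper, not an existing CI tool.

```python
import re
from typing import List

# Kernel timestamp prefix as it appears in the CI dmesg logs, e.g. "<2>[  162.179870] ".
TS_RE = re.compile(r"^<\d>\[\s*(\d+\.\d+)\]")


def temperature_lines_near_end(lines, window_s: float = 5.0) -> List[str]:
    """Return temperature-related dmesg lines printed within `window_s` seconds
    of the latest timestamp in the log.

    Assumption (hypothetical heuristic): the latest timestamp approximates the
    moment the machine hung, since the log stops there on an incomplete.
    """
    stamped = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            stamped.append((float(m.group(1)), line.rstrip("\n")))
    if not stamped:
        return []
    end = max(ts for ts, _ in stamped)
    return [
        text for ts, text in stamped
        if end - ts <= window_s and "temperature" in text
    ]
```

The window size is a tuning knob. Too large a window will also pick up throttle episodes that the machine recovered from long before it hung.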

I am lowering the priority to high, since no one is currently looking into all the incompletes on this machine.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.