Bug 105985 - [CI][CNL only] incomplete - system hang due to temperature
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-11 07:56 UTC by Marta Löfstedt
Modified: 2018-04-26 08:09 UTC

See Also:
i915 platform: CNL
i915 features:


Description Marta Löfstedt 2018-04-11 07:56:17 UTC
On fi-cnl-y3 we are seeing a large number of incompletes with the messages below in dmesg. The theory is that the machine throttles for too short a time, or has a bad cooling design, so the temperature then spikes through the roof and the machine needs to be halted.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4041/fi-cnl-y3/igt@gem_exec_flush@basic-uc-pro-default.html

<2>[  162.179870] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179872] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179988] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179989] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179990] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179992] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
<6>[  162.180866] CPU1: Core temperature/speed normal
<6>[  162.180867] CPU3: Core temperature/speed normal
<6>[  162.180870] CPU2: Package temperature/speed normal
<6>[  162.180871] CPU0: Package temperature/speed normal
<6>[  162.180871] CPU3: Package temperature/speed normal
<6>[  162.180873] CPU1: Package temperature/speed normal
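To check how often these messages show up in a saved dmesg, the throttle lines can be counted per CPU. The following is only a minimal sketch, not part of the CI tooling: the function name count_throttle_events and the SAMPLE string are hypothetical, and the regular expression assumes the exact message wording quoted above.

```python
import re
from collections import Counter

# Assumption: throttle lines look exactly like the ones quoted in this report, e.g.
# "<2>[  162.179870] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)"
THROTTLE_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+CPU(?P<cpu>\d+): "
    r"(?P<scope>Core|Package) temperature above threshold"
)


def count_throttle_events(dmesg_text):
    """Count 'temperature above threshold' lines per (cpu, scope) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("scope"))] += 1
    return counts


# Three lines copied from the dmesg excerpt above; the last one is a
# "normal" message and is deliberately not counted.
SAMPLE = """\
<2>[  162.179870] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
<2>[  162.179988] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
<6>[  162.180866] CPU1: Core temperature/speed normal
"""

if __name__ == "__main__":
    for (cpu, scope), n in sorted(count_throttle_events(SAMPLE).items()):
        print(f"CPU{cpu} {scope}: {n} throttle line(s)")
```

Running the same function over the dmesg of each incomplete run would show whether throttle lines appear shortly before the machine stops responding.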
Comment 1 Marta Löfstedt 2018-04-11 08:14:25 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@syncobj_wait@wait-all-for-submit-snapshot.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_workarounds@suspend-resume.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_exec_await@wide-all.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@sw_sync@sync_busy_fork.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_linear_blits@normal.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@kms_cursor_legacy@pipe-a-single-bo.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@gem_mocs_settings@mocs-settings-ctx-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@kms_vblank@pipe-b-wait-busy-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_16/fi-cnl-y3/igt@kms_chv_cursor_fail@pipe-b-128x128-top-edge.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_frontbuffer_tracking@psr-1p-primscrn-cur-indfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions-varying-size.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@prime_busy@wait-after-render.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@gem_ctx_isolation@bcs0-dirty-switch.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-shrfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_15/fi-cnl-y3/igt@kms_vblank@pipe-a-query-forked-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@gem_exec_suspend@basic-s3.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@kms_vblank@pipe-c-query-forked-hang.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@kms_addfb_basic@unused-handle.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-indfb-msflip-blt.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_14/fi-cnl-y3/igt@perf_pmu@rc6-runtime-pm-long.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_chv_cursor_fail@pipe-b-256x256-right-edge.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_universal_plane@universal-plane-gen9-features-pipe-c.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_psr_sink_crc@primary_mmap_gtt.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_chv_cursor_fail@pipe-c-64x64-left-edge.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@tools_test@tools_test.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@pm_rpm@sysfs-read.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@pm_lpsp@edp-native.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_13/fi-cnl-y3/igt@kms_ccs@pipe-b-bad-pixel-format.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4034/fi-cnl-y3/igt@gem_ctx_param@basic-default.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@pm_rpm@drm-resources-equal.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-cur-indfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@pm_rpm@debugfs-read.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-cnl-y3/igt@kms_frontbuffer_tracking@drrs-1p-offscren-pri-shrfb-draw-mmap-wc.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_11/fi-cnl-y3/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-cur-indfb-draw-mmap-cpu.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_3/fi-cnl-y3/igt@kms_atomic_transition@plane-all-modeset-transition-fencing.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_2/fi-cnl-y3/igt@pm_rpm@gem-execbuf-stress-extra-wait.html
Comment 2 Marta Löfstedt 2018-04-11 09:14:50 UTC
Maybe it could help to decrease the throttle temperature in BIOS if that is possible.
Comment 3 Marta Löfstedt 2018-04-18 06:34:21 UTC
In the drmtip runs from last week I only saw 2 new incompletes that had temperature-related logs in dmesg close in time to the incomplete. There is still a suspicion that a lab environment issue could be affecting this machine.

I am lowering the priority to high, since no one is currently looking into all the incompletes on this machine.

