Bug 101403 - [IGT][SKL][XEN Environment] GPU couldn't recover after running drv_hangman in dom0
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-13 06:10 UTC by XiongZhang
Modified: 2017-07-21 16:47 UTC

See Also:
i915 platform: SKL
i915 features:


Attachments
Dmesg while running drv_hangman (46.90 KB, text/plain)
2017-06-13 06:14 UTC, XiongZhang
dmesg while send reboot command through ssh (4.52 KB, text/plain)
2017-06-13 06:17 UTC, XiongZhang
the output and drv_hangman (2.04 KB, text/plain)
2017-06-13 06:22 UTC, XiongZhang
dmesg from bootup and running drv_hangman (103.10 KB, text/plain)
2017-06-14 05:02 UTC, XiongZhang

Description XiongZhang 2017-06-13 06:10:26 UTC
Environment:
Xen: upstream 4.9
kernel: drm-intel-nightly
4f89fbd drm-tip: 2017y-06m-12d-19h-18m-52s UTC integration manifest
intel-gpu-tools: master branch
d1ea0c0 gem_wsim: More interesting workloads

Issues:
After running igt@drv_hangman, the GPU does not recover in dom0. When a reboot command is then sent over SSH, dom0 cannot reboot because Xorg cannot be terminated and blocks the shutdown. I have to press the power button to shut the machine down.

How to reproduce:
1) Compile upstream Xen 4.9 and the drm-intel-nightly kernel.
2) Use the drm-intel-nightly kernel as dom0's kernel and boot it under Xen 4.9.
3) After dom0 boots, run igt@tests@drv_hangman over SSH.
4) After drv_hangman runs, the GPU does not recover and dom0's desktop is frozen.
5) Send a reboot command to dom0 over SSH. dom0 cannot reboot because Xorg blocks, so I have to press the power button to shut dom0 down.
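
Since the issue is specific to running as Xen dom0, it can help to confirm the environment before step 3. The following is only a minimal sketch: it assumes the kernel exposes /sys/hypervisor/type and that xenfs is mounted so /proc/xen/capabilities is readable. The function name xen_role and its path parameters are hypothetical helpers for illustration, not part of IGT or Xen.

```python
from pathlib import Path


def xen_role(hypervisor_type="/sys/hypervisor/type",
             xen_caps="/proc/xen/capabilities"):
    """Return 'native', 'dom0', or 'domU'.

    Assumption: under Xen, the first file contains 'xen', and in dom0
    the second file contains 'control_d'. Paths are parameters so the
    check can be exercised against ordinary files.
    """
    type_path = Path(hypervisor_type)
    if not type_path.exists() or type_path.read_text().strip() != "xen":
        return "native"
    caps_path = Path(xen_caps)
    caps = caps_path.read_text() if caps_path.exists() else ""
    return "dom0" if "control_d" in caps else "domU"

# Usage on the test machine: print(xen_role())
```

A result of "dom0" would match the setup described in this report, and "native" would match the configuration where the issue does not occur.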

Experiments:
1) This only happens in the Xen environment; the native environment does not have this issue.
2) This is a kernel regression. The 4.10 kernel does not have this issue; the 4.11 kernel and drm-intel-nightly both do.
git bisect reports that the first bad commit is:
commit 4c9655436522eaf4ba35572851150ccb71f3866e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 17 17:59:01 2017 +0200

    drm/i915: Move engine reset preparation to i915_gem_reset_prepare()

    Now that we have prepare/finish routines for the GEM reset, move the
    disabling of the engine->irq_tasklet into them to reduce repetition. The
    device irq enable/disable is split out to ensure it is run first and
    last always (even if the GPU reset fails).

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1484668747-9120-1-git-send-email-mika.kuoppala@intel.com
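
For readers unfamiliar with how a bisect arrives at a "first bad commit", the search can be sketched as a binary search between a known-good and a known-bad revision. This is an illustration of the idea only, not git's implementation: the commit labels and the is_bad predicate below are hypothetical, and in this report the predicate would be "the GPU fails to recover after drv_hangman under Xen", checked by hand on each candidate kernel.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad. is_bad(commit) runs the manual check.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history of 10 commits where the regression lands at "c6".
history = [f"c{i}" for i in range(10)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 6)
```

Each iteration halves the candidate range, so only a handful of kernel builds and test runs are needed even across a full release cycle.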
Comment 1 XiongZhang 2017-06-13 06:14:14 UTC
Created attachment 131914 [details]
Dmesg while running drv_hangman
Comment 2 XiongZhang 2017-06-13 06:17:26 UTC
Created attachment 131915 [details]
dmesg while send reboot command through ssh
Comment 3 XiongZhang 2017-06-13 06:22:52 UTC
Created attachment 131916 [details]
the output and drv_hangman
Comment 4 Chris Wilson 2017-06-13 10:52:59 UTC
What did you bisect? The failure of drv_hangman is quite normal for !full-ppgtt, which I presume is the result of the vgpu emulation. The hang of X is another issue unconnected to reset. You should not be running igt while anything else is accessing the gpu as most of the tests assume exclusive ownership of the device.
Comment 5 Elizabeth 2017-06-13 17:19:26 UTC
Good day, could you specify the platform you used? It works well with the following configuration:

Attached Displays: DP and HDMI

Driver Graphics Specifications ============================================================== 
Component: drm
    tag: libdrm-2.4.80-20-g48aac8c
    commit: 48aac8c6ef301be5ed4cf824779baa3c98981a90

Component: cairo
    tag: 1.15.4-22-g0fd0fd0
    commit: 0fd0fd0ae9ad8cfb177bb844091de98c0235917e

Component: intel-gpu-tools
    tag: intel-gpu-tools-1.18-214-ga0433ca
    commit: a0433ca1dddb83968a0f91753509526bb0240b5a

Component: piglit
    tag: piglit-v1
    commit: 943b4f9dff77874c1998ca68f78f16db1d175fdf

Hardware Graphics Specifications ============================================================== 
Processor Graphics 			Intel® Iris™ Graphics 540
Graphics Base Frequency			300.00 MHz
Graphics Max Dynamic Frequency		950.00 MHz
Graphics Video Max Memory		32 GB
eDRAM					64 MB
Graphics Output				eDP/DP/HDMI/DVI
4K Support				Yes, at 60Hz
Max Resolution (Intel® WiDi)		1080p
Max Resolution (HDMI 1.4)		4096x2304@24Hz
Max Resolution (DP)			4096x2304@60Hz
Max Resolution (eDP)			4096x2304@60Hz
Max Resolution (VGA)			N/A
DirectX* Support			12
OpenGL* Support				4.4
Intel® Quick Sync Video 		Yes
Intel® InTru™ 3D Technology		Yes
Intel® Clear Video HD Technology	Yes
Intel® Clear Video Technology		Yes
Intel® Wireless Display 		Yes
# of Displays Supported 		3
Device ID				0x1926

Could you take a look at that and share the info? Thanks.
Comment 6 XiongZhang 2017-06-14 02:37:31 UTC
I just installed upstream Xen 4.9 and an upstream 4.11 kernel, then used Xen to boot the 4.11 kernel. Neither contains any xen-gvt related code and I don't boot a guest, so there is no vgpu emulation. In this case, the i915 driver accesses the hardware directly and doesn't trap into vgpu. I bisected the upstream 4.11 kernel by checking GPU recovery and found the above commit.

Then I tried drm-intel-nightly and found it has the same issue. Just now, I booted the drm-intel-nightly kernel to text mode under Xen and ran drv_hangman, but the GPU still couldn't recover.
According to dmesg, it is in full-ppgtt mode, as dmesg outputs "[i915_driver_load[i915]] ppgtt mode: 3".
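
The ppgtt mode check above can be done mechanically on a saved dmesg. The sketch below only parses the log line quoted in this comment. The mode-to-name table is an assumption based on the i915 enable_ppgtt parameter description of that kernel era (0 disabled, 1 aliasing, 2 full, 3 full with extended address space), and the helper names are hypothetical.

```python
import re

# Assumed meaning of each mode value; not taken from this bug report.
PPGTT_MODES = {
    0: "disabled",
    1: "aliasing",
    2: "full",
    3: "full, extended address space",
}


def ppgtt_mode(dmesg_text):
    """Return the integer after 'ppgtt mode:' in dmesg text, or None."""
    match = re.search(r"ppgtt mode:\s*(-?\d+)", dmesg_text)
    return int(match.group(1)) if match else None


def is_full_ppgtt(mode):
    """Treat mode 2 or higher as full ppgtt (an assumption, see above)."""
    return mode is not None and mode >= 2


line = "[i915_driver_load[i915]] ppgtt mode: 3"
mode = ppgtt_mode(line)
```

With the line quoted above, mode is 3, which under the assumed table counts as full ppgtt.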


(In reply to Chris Wilson from comment #4)
> What did you bisect? The failure of drv_hangman is quite normal for
> !full-ppgtt, which I presume is the result of the vgpu emulation. The hang
> of X is another issue unconnected to reset. You should not be running igt
> while anything else is accessing the gpu as most of the tests assume
> exclusive ownership of the device.
Comment 7 XiongZhang 2017-06-14 02:49:39 UTC
My machine is Skylake.
This issue doesn't happen on a native machine. It only happens in the Xen environment.
If you are using Ubuntu, you can run "apt-get install xen-hypervisor-4.(6/7/8)-amd64". You will then see an entry "Advanced options for Ubuntu GNU/Linux (with Xen hypervisor)" in GRUB; enter it and select your target kernel to boot.
(In reply to elizabethx.de.la.torre.mena from comment #5)
Comment 8 XiongZhang 2017-06-14 04:58:11 UTC
drv_hangman couldn't finish in text mode and the GPU couldn't recover:
tests# ./drv_hangman
IGT-Version: 1.18-gd1ea0c0 (x86_64) (Linux: 4.12.0-rc4nightly+ x86_64)
Subtest error-state-sysfs-entry: SUCCESS (0.000s)
Subtest error-state-basic: SUCCESS (0.013s)
Subtest error-state-capture-render: SUCCESS (9.891s)
Subtest error-state-capture-bsd: SUCCESS (6.016s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:186:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd1: SKIP (0.000s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:186:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd2: SKIP (0.000s)
Subtest error-state-capture-blt: SUCCESS (5.983s)
Subtest error-state-capture-vebox: SUCCESS (6.016s)
Subtest hangcheck-unterminated: SUCCESS (15.999s)

The cursor flickers forever and the test never exits to the normal console where I could enter commands.
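
To see at a glance where a stalled run stopped, the IGT output above can be summarised with a small parser. This is a sketch that assumes the "Subtest NAME: RESULT (TIMEs)" line format shown in this comment; parse_subtests and last_completed are hypothetical helper names, not IGT tools.

```python
import re

# Matches lines such as "Subtest error-state-basic: SUCCESS (0.013s)".
SUBTEST_RE = re.compile(r"^Subtest (\S+): (\w+) \(([\d.]+)s\)$")


def parse_subtests(output):
    """Return a list of (name, result, seconds) for each subtest line."""
    results = []
    for line in output.splitlines():
        match = SUBTEST_RE.match(line.strip())
        if match:
            name, result, seconds = match.groups()
            results.append((name, result, float(seconds)))
    return results


def last_completed(output):
    """Return the name of the last subtest that reported a result."""
    results = parse_subtests(output)
    return results[-1][0] if results else None
```

Applied to the output above, the last reported subtest is hangcheck-unterminated, so the stall happened after it finished and before any later subtest printed a result.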
Comment 9 XiongZhang 2017-06-14 05:02:17 UTC
Created attachment 131944 [details]
dmesg from bootup and running drv_hangman
Comment 10 XiongZhang 2017-06-14 05:04:52 UTC
This dmesg was captured in text mode.
(In reply to XiongZhang from comment #9)
> Created attachment 131944 [details]
> dmesg from bootup and running drv_hangman
Comment 11 XiongZhang 2017-07-19 09:25:42 UTC
The latest drm-intel-nightly branch no longer reproduces this issue.

It is fixed by the per-engine reset feature from:
https://lists.freedesktop.org/archives/intel-gfx/2017-June/130921.html
Comment 12 Ricardo 2017-07-21 16:47:22 UTC
closing

