Bug 100858 - [SNB][IGT] igt@drv_module_reload@basic-reload-final fails and hard hangs the system
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high critical
Assignee: krisman
QA Contact: Intel GFX Bugs mailing list
Whiteboard: ReadyForDev
 
Reported: 2017-04-27 18:54 UTC by krisman
Modified: 2017-10-12 14:11 UTC

i915 platform: SNB
i915 features: GEM/Other


Description krisman 2017-04-27 18:54:26 UTC
When running igt@drv_module_reload@basic-reload-final, the first iteration hits an assertion failure, and any further invocation hard hangs the system: the network connection is lost, there is no console output, etc.  I'm trying to collect a crash dump next.

These might be two separate issues, but until that is confirmed, I think they should be considered related and handled together.  The output of two consecutive invocations:

root@collab-x220:~/work/igt-gpu-tools# tests/drv_module_reload --run-subtest basic-reload-final --debug
IGT-Version: 1.18-g8039c0ef6e51 (x86_64) (Linux: 4.11.0-rc8.intel-boxes+ x86_64)
(drv_module_reload:1181) igt-core-DEBUG: Starting subtest: basic-reload-final
(drv_module_reload:1181) igt-kmod-DEBUG: Could not remove module drm_kms_helper (No such file or directory)
(drv_module_reload:1181) igt-kmod-DEBUG: Could not remove module drm (No such file or directory)
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) CRITICAL: Test assertion failure function store_dword, file drv_module_reload.c:111:
(drv_module_reload:1181) CRITICAL: Failed assertion: *batch == 0xc0ffee
(drv_module_reload:1181) CRITICAL: error: 0 != 12648430
Stack trace:
  #0 [__igt_fail_assert+0x16e]
  #1 [store_dword+0x35d]
  #2 [gem_exec_store+0x48]
  #3 [__real_main308+0x1e0]
  #4 [main+0x49]
  #5 [__libc_start_main+0xf1]
  #6 [_start+0x2a]
  #7 [<unknown>+0x2a]
Subtest basic-reload-final failed.
**** DEBUG ****
(drv_module_reload:1181) igt-kmod-DEBUG: Could not remove module drm_kms_helper (No such file or directory)
(drv_module_reload:1181) igt-kmod-DEBUG: Could not remove module drm (No such file or directory)
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) igt-debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(drv_module_reload:1181) CRITICAL: Test assertion failure function store_dword, file drv_module_reload.c:111:
(drv_module_reload:1181) CRITICAL: Failed assertion: *batch == 0xc0ffee
(drv_module_reload:1181) CRITICAL: error: 0 != 12648430
****  END  ****
Subtest basic-reload-final: FAIL (1.584s)
(drv_module_reload:1181) igt-core-DEBUG: Exiting with status code 99
root@collab-x220:~/work/igt-gpu-tools# tests/drv_module_reload --run-subtest basic-reload-final --debug
IGT-Version: 1.18-g8039c0ef6e51 (x86_64) (Linux: 4.11.0-rc8.intel-boxes+ x86_64)
(drv_module_reload:1200) igt-core-DEBUG: Starting subtest: basic-reload-final

^C
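
For readers decoding the failure above: the value 12648430 in "error: 0 != 12648430" is simply 0xc0ffee printed in decimal, so the readback returned 0 where the magic dword was expected. A minimal sketch of that comparison follows; the helper name and macro are hypothetical and are not taken from drv_module_reload.c.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Magic value taken from the failed assertion "*batch == 0xc0ffee". */
#define STORE_MAGIC 0xc0ffeeu

/*
 * Hypothetical helper, not part of IGT: compares a value read back from the
 * target buffer against the magic dword and, on mismatch, prints it in the
 * same decimal "readback != expected" form seen in the log.
 */
static bool readback_matches(uint32_t readback)
{
	if (readback == STORE_MAGIC)
		return true;

	printf("error: %u != %u\n", (unsigned)readback, (unsigned)STORE_MAGIC);
	return false;
}
```

Calling readback_matches(0) reproduces the mismatch line from the log, which confirms that the target memory still held zero rather than some other stale value.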
Comment 1 Chris Wilson 2017-04-27 19:52:12 UTC
The key question is: if the GPU didn't write the dword to where we wanted, where did it write it? The same worrying question applies to lots of different ops. The HWS is intact, otherwise it would have complained about a GPU hang... Actually no, that test isn't checking, so a more reasonable explanation is that the HWS write also went astray, i.e. it is likely that nothing is right in the way the GPU is addressing memory.

In load/unload we do a GPU reset, so everything should be sane...

Remove the store dword check and see if it still hard hangs. If so, we can start removing chunks from the load/unload sequence and see at what point it works.
Comment 2 krisman 2017-05-02 16:39:22 UTC
(In reply to Chris Wilson from comment #1)
> The answer lies in if the GPU didn't write the dword to where we wanted,
> where did it write it? And the same worrying question will be for lots of
> different ops. The HWS is intact otherwise it would have complained about a
> gpu hang... Actually no, that test isn't checking so that's a more
> reasonable explanation that the HWS write also went astray, i.e. it is
> likely that nothing is right in the way the GPU is addressing memory.
> 
> In load/unload we do a GPU reset, so everything should be sane...
> 
> Remove the store dword check and see if it still hard hangs. If so, we can
> start removing chunks from the load/unload sequence and see at what point it
> works.

It doesn't hard hang without the store_dword sequence.  The hang happens on gem_write when executing the blt engine.  One interesting aspect is that the issue is only reproducible with intel_iommu=on.  Without that option, the test succeeds and the system never hangs.
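
Since the reproduction depends on the kernel being booted with intel_iommu=on, it can help to record that alongside test results. The following is a rough sketch, assuming a Linux system where the boot parameters are readable from /proc/cmdline; both function names are hypothetical, and the matcher only recognises the exact token, not combined option values.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the given kernel command line contains
 * the exact, space-delimited token "intel_iommu=on".
 */
static bool cmdline_has_intel_iommu_on(const char *cmdline)
{
	static const char token[] = "intel_iommu=on";
	const size_t len = sizeof(token) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, token)) != NULL) {
		/* The match must start and end on a token boundary. */
		bool starts = (p == cmdline) || p[-1] == ' ';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}

/* Reads /proc/cmdline and reports whether the token is present. */
static bool running_with_intel_iommu_on(void)
{
	char buf[4096];
	FILE *f = fopen("/proc/cmdline", "r");
	bool on = false;

	if (!f)
		return false; /* treated as "not set" in this sketch */
	if (fgets(buf, sizeof(buf), f))
		on = cmdline_has_intel_iommu_on(buf);
	fclose(f);
	return on;
}
```

Printing the result of running_with_intel_iommu_on() before each run makes it easy to tell which logs came from the failing configuration.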
Comment 3 Ricardo 2017-05-03 15:39:23 UTC
Krisman, can you add information regarding the platform?
Comment 4 Ricardo 2017-05-03 15:52:52 UTC
Ignore last comment
Comment 5 Elizabeth 2017-06-21 15:54:50 UTC
Good afternoon,
Has the status of this bug changed recently? Is there any new information? Thanks.
Comment 6 Elizabeth 2017-07-20 21:05:09 UTC
Changing priority since this is an IGT basic failure. Thanks.
Comment 7 Elizabeth 2017-10-12 14:11:42 UTC
Closing since:

    commit 2fea8d26e589a9d256eca9f3d561750ecb3fb681
    Author: Marius Vlad <marius.c.vlad@intel.com>
    Date:   Thu Dec 1 14:23:57 2016 +0200

        tests/drv_module_reload: Convert sh script to C version.

    Cc: tomi.p.sarvela@intel.com
    Tested-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
    Reviewed-by: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

    We always need to make sure there's a working driver, hence need to
    move the -final test into the igt_fixture.

