Bug 65495

Summary: [GM45] bsd ring reset fails
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: xunx.fang, yangweix.shui
Version: unspecified
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                              Flags
Ignore EIO during set-to-domain          none
i915_error_state                         none
fix media reset on gm45                  none
run full gem hw init after gpu resets    none

Description lu hua 2013-06-07 07:43:14 UTC
System Environment:
--------------------------
Arch:           x86_64
Platform:       GM45
Kernel: drm-intel-next-queued cb8b2a30b32cde5ac9053d399d084c487598976a

Bug detailed description:
-------------------------
This happens on GM45 with the drm-intel-next-queued kernel; it works well on the drm-intel-fixes kernel. Many igt cases fail after running ZZ_hangman. It is caused by an igt commit.
Bisect shows: 1cb4f90946289457c3b92773f2ce96b0b03e4a22 is the first bad commit
commit 1cb4f90946289457c3b92773f2ce96b0b03e4a22
Author:     Imre Deak <imre.deak@intel.com>
AuthorDate: Tue May 28 17:35:32 2013 +0300
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Tue May 28 18:32:32 2013 +0200

    tests/lib: make sure the GPU is idle at test start and exit

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=64270

    v2:
    - Make sure also that the GPU is idle at start and error exit of any
      test using drm_open_any(). (Daniel)
    v3:
    - actually call gem_quiescent_gpu() at exit

    Signed-off-by: Imre Deak <imre.deak@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

output:
rings stopped
gem_set_domain:467 failed, ret=-1, errno=5
./ZZ_hangman: line 30:  4247 Aborted                 (core dumped) $SOURCE_DIR/gem_exec_big
gpu hang correctly dectected
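
For reference, errno=5 in the output above is EIO on Linux. A minimal Python sketch to decode such a value (the helper name describe_errno is made up for illustration and is not part of igt):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# errno=5 is what gem_set_domain reported in the output above.
print(describe_errno(5))
```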

dmesg:
[  120.374100] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[  120.376368] [drm:i915_driver_open],
[  120.376383] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  120.376389] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  120.376392] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  120.376394] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  120.376400] [drm:i915_driver_open],
[  126.708148] [drm:i915_hangcheck_elapsed] *ERROR* render ring: stuck on addr 0xbac8
[  126.708224] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[  126.711675] [drm:i915_error_work_func], resetting chip
[  126.711720] [drm] Simulated gpu hang, resetting stop_rings
[  126.711765] [drm:i915_gem_context_init], Disabling HW Contexts; old hardware
[  126.711768] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  126.711825] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  132.704157] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x28
[  132.704310] [drm:i915_error_work_func], resetting chip
[  132.704370] [drm:i915_gem_context_init], Disabling HW Contexts; old hardware
[  132.704373] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  132.704417] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  133.198449] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  133.198455] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  133.198458] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  133.198460] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  133.208290] [drm:i915_driver_open],
[  133.208298] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  133.208302] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  133.208304] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  133.208306] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  133.208311] [drm:i915_driver_open],
[  135.704156] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x28
[  135.704837] [drm:i915_error_work_func], resetting chip
[  135.704878] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  135.704921] [drm:i915_reset] *ERROR* Failed to reset chip.
[  135.704958] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  145.704169] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for the gpu reset to complete
[  146.090217] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  146.090226] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  146.090229] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  146.090231] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  146.486347] [drm:i915_error_state_write], Resetting error state
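
The "GPU hanging too fast" message follows a reset that comes shortly after the previous one. As a rough aid for reading such logs, the sketch below measures the gaps between "resetting chip" lines. The three sample lines are copied from the dmesg above; the function name is hypothetical, and the threshold the driver actually applies is not stated in this report, so the sketch only reports the gaps.

```python
import re
from typing import List

# Matches lines such as:
# [  126.711675] [drm:i915_error_work_func], resetting chip
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\[drm:i915_error_work_func\], resetting chip"
)


def reset_intervals(dmesg_text: str) -> List[float]:
    """Return the seconds elapsed between consecutive 'resetting chip' lines."""
    times = [float(m.group(1)) for m in RESET_RE.finditer(dmesg_text)]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


SAMPLE = """\
[  126.711675] [drm:i915_error_work_func], resetting chip
[  132.704310] [drm:i915_error_work_func], resetting chip
[  135.704837] [drm:i915_error_work_func], resetting chip
"""

# Roughly 6 s between the first two resets, roughly 3 s before the third.
print(reset_intervals(SAMPLE))
```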

Reproduce steps:
----------------
1. ./ZZ_hangman
Comment 1 Chris Wilson 2013-06-07 07:47:27 UTC
The fix is in the drm-intel-fixes queue:

commit 7abb690a0e095717420ba78dcab4309abbbec78a
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri May 24 21:29:32 2013 +0200

    drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus
Comment 2 Daniel Vetter 2013-06-07 07:49:24 UTC
We also need

commit 2e7c8ee7a6bf3440478120f14cbf597d416f88b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 28 10:38:44 2013 +0100

    drm/i915: Avoid promoting a simulated hang to 'wedged'


from dinq for this case here.
Comment 3 lu hua 2013-06-09 07:12:49 UTC
It still happens on the latest drm-intel-next-queued kernel (commit: 22e407d749a418b4bb4cc93ef76e0429a9f83c82).
Comment 4 Daniel Vetter 2013-06-09 08:05:09 UTC
Can you please attach a new dmesg from latest -nightly?
Comment 5 lu hua 2013-06-09 08:19:35 UTC
Tested the latest -nightly branch (commit 4f9e7cfb09aa3e2fc3b3bba635c6d0c558ce1b70
Merge: 284e9e5 91f8f10)

Run the 1st cycle:
output:
rings stopped
gpu hang correctly dectected

Run the 2nd cycle:
output:
rings stopped
gem_quiescent_gpu:146 failed, ret=-1, errno=5
./ZZ_hangman: line 30:  4491 Aborted                 (core dumped) $SOURCE_DIR/gem_exec_big
gpu hang not dectected


dmesg:
[   51.656092] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[   51.682678] [drm:i915_driver_open],
[   51.682695] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   51.682702] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   51.682705] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   51.682707] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   51.682713] [drm:i915_driver_open],
[   57.708078] [drm:i915_hangcheck_elapsed] *ERROR* render ring: stuck on addr 0x0
[   57.708153] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[   57.711553] [drm:i915_error_work_func], resetting chip
[   57.711602] [drm] Simulated gpu hang, resetting stop_rings
[   57.711640] [drm:i915_gem_context_init], Disabling HW Contexts; old hardware
[   57.711644] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[   57.711710] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[   63.704115] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x28
[   63.704288] [drm:i915_error_work_func], resetting chip
[   63.704677] [drm:i915_gem_context_init], Disabling HW Contexts; old hardware
[   63.704680] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[   63.704728] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[   64.235617] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   64.235625] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   64.235627] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   64.235629] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   64.245361] [drm:i915_driver_open],
[   64.245369] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   64.245372] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   64.245375] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   64.245377] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   64.245382] [drm:i915_driver_open],
[   66.712078] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x28
[   66.712231] [drm:i915_error_work_func], resetting chip
[   66.712299] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[   66.712343] [drm:i915_reset] *ERROR* Failed to reset chip.
[   66.712377] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[   66.712839] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   66.712844] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   66.712847] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   66.712849] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   67.088554] [drm:i915_error_state_write], Resetting error state
Comment 6 Chris Wilson 2013-06-09 09:02:57 UTC
So we do need the other half of my patch then. :-p
Comment 7 Chris Wilson 2013-06-12 10:04:39 UTC
Created attachment 80719
Ignore EIO during set-to-domain
Comment 8 Chris Wilson 2013-06-12 10:05:00 UTC
Note that this bug should now be impossible to reproduce on dinq.
Comment 9 lu hua 2013-06-13 06:33:13 UTC
Created attachment 80754
i915_error_state

ZZ_hangman works well on the latest -dinq kernel. However, running ZZ_hangman and then the following cases causes a GPU hang:
igt/debugfs_emon_crash
igt/drm_vma_limiter
igt/gem_cpu_concurrent_blit/overwrite-source
igt/gem_gtt_concurrent_blit/early-read
igt/gem_mmap

dmesg:
[   60.419186] [drm:i915_driver_open],
[   60.419202] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   60.419209] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   60.419212] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   60.419214] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   60.419220] [drm:i915_driver_open],
[   60.419260] [drm:i915_getparam], Unknown parameter 22
[   60.419288] [drm:i915_getparam], Unknown parameter 22
[   61.969754] [drm:i915_driver_open],
[   61.969766] [drm:i915_driver_open],
[   61.969792] [drm:i915_getparam], Unknown parameter 22
[   62.225992] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   62.226021] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   62.226025] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   62.226027] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   83.838268] [drm:i915_driver_open],
[   83.838283] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   83.838288] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   83.838291] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   83.838293] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   83.838299] [drm:i915_driver_open],
[   83.838336] [drm:i915_getparam], Unknown parameter 22
[   83.838362] [drm:i915_getparam], Unknown parameter 22
[   85.384475] [drm:i915_driver_open],
[   85.384488] [drm:i915_driver_open],
[   85.384514] [drm:i915_getparam], Unknown parameter 22
[   85.638461] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   85.638472] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   85.638475] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   85.638477] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   89.911226] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[   89.914325] [drm:i915_driver_open],
[   89.914341] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   89.914347] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   89.914350] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   89.914352] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   89.914358] [drm:i915_driver_open],
[   97.708011] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[   97.708058] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[   97.711483] [drm:i915_error_work_func], resetting chip
[   97.711532] [drm] Simulated gpu hang, resetting stop_rings
[   97.711574] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[   97.711637] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[   97.712079] [drm:i915_getparam], Unknown parameter 22
[  105.704130] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  105.704373] [drm:i915_error_work_func], resetting chip
[  105.704440] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  105.704514] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  106.231569] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  106.231576] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  106.231579] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  106.231581] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  106.241424] [drm:i915_driver_open],
[  106.241432] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  106.241436] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  106.241438] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  106.241440] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  106.241446] [drm:i915_driver_open],
[  106.241471] [drm:i915_getparam], Unknown parameter 22
[  114.704133] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  114.704391] [drm:i915_error_work_func], resetting chip
[  114.704467] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  114.704549] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  114.704878] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  114.704882] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  114.704885] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  114.704887] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  114.752250] [drm:i915_error_state_write], Resetting error state
[  128.187576] [drm:i915_driver_open],
[  128.187591] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  128.187597] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  128.187600] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  128.187602] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  128.187608] [drm:i915_driver_open],
[  128.187651] [drm:i915_getparam], Unknown parameter 22
[  135.704042] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  135.704090] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[  135.705435] [drm:i915_error_work_func], resetting chip
[  135.705519] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  135.705604] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  135.705940] [drm:i915_getparam], Unknown parameter 22
[  137.282160] [drm:i915_driver_open],
[  137.282174] [drm:i915_driver_open],
[  137.282203] [drm:i915_getparam], Unknown parameter 22
[  139.712142] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  139.712354] [drm:i915_error_work_func], resetting chip
[  139.717439] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  139.717484] [drm:i915_reset] *ERROR* Failed to reset chip.
[  139.717518] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  139.971331] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  139.971343] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  139.971346] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  139.971348] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
Comment 10 Chris Wilson 2013-06-13 07:48:41 UTC
The bsd ring fails after being reset; the breadcrumb write fails to materialise.
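
To see which ring keeps getting stuck, the hangcheck lines can be tallied per ring. This is only a log-reading sketch: the sample lines are taken from the dmesg in comment 5, the regex assumes the "<ring> ring: stuck on addr" wording used by that kernel, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "*ERROR* bsd ring: stuck on addr 0x28"
STUCK_RE = re.compile(r"\*ERROR\* (\w+) ring: stuck on addr (0x[0-9a-fA-F]+)")


def stuck_rings(dmesg_text: str) -> Counter:
    """Count hangcheck 'stuck' reports per ring name."""
    counts = Counter()
    for match in STUCK_RE.finditer(dmesg_text):
        counts[match.group(1)] += 1
    return counts


SAMPLE = """\
[   57.708078] [drm:i915_hangcheck_elapsed] *ERROR* render ring: stuck on addr 0x0
[   63.704115] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x28
[   66.712078] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x28
"""

for ring, count in stuck_rings(SAMPLE).items():
    print(ring, count)
```

In this sample the render ring is reported once (the simulated hang), while the bsd ring is reported twice afterwards.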
Comment 11 Daniel Vetter 2013-06-30 12:21:28 UTC
This is wreaking havoc with running igt on my gm45 ... I guess I should take a look at fixing gpu reset on it.
Comment 12 Daniel Vetter 2013-07-01 21:17:13 UTC
Created attachment 81830
fix media reset on gm45

Can you please test the attached patch? Seems to work better on some light testing here at least ...
Comment 13 lu hua 2013-07-02 03:00:42 UTC
(In reply to comment #12)
> Created attachment 81830
> fix media reset on gm45
> 
> Can you please test the attached patch? Seems to work better on some light
> testing here at least ...

Tested with this patch.
output:
gem_quiescent_gpu:155 failed, ret=-1, errno=5
./ZZ_hangman: line 30:  3291 Aborted                 (core dumped) $SOURCE_DIR/gem_exec_big
gpu hang correctly dectected

dmesg:
[   79.660881] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[   79.688312] [drm:i915_driver_open],
[   79.688328] [drm:intel_crtc_cursor_set], cursor off
[   79.688331] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   79.688337] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   79.688340] [drm:intel_crtc_cursor_set], cursor off
[   79.688342] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   79.688344] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   79.688350] [drm:i915_driver_open],
[   83.703210] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[   83.703276] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[   83.704560] [drm:i915_error_work_func], resetting chip
[   83.704603] [drm] Simulated gpu hang, resetting stop_rings
[   83.704640] [drm:i915_reset] *ERROR* Failed to reset chip.
[   83.704680] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[   83.777870] [drm:intel_crtc_cursor_set], cursor off
[   83.777875] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[   83.777882] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   83.777885] [drm:intel_crtc_cursor_set], cursor off
[   83.777886] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[   83.777888] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[   84.143386] [drm:i915_error_state_write], Resetting error state
Comment 14 Daniel Vetter 2013-07-02 07:17:33 UTC
Yeah, that patch is broken, I've failed to properly test it. Back to the drawing board.
Comment 15 Daniel Vetter 2013-07-02 09:40:33 UTC
Created attachment 81860
run full gem hw init after gpu resets

Hopefully I haven't botched the testing on my side again, but this seems to actually work. Please test, thanks.
Comment 16 lu hua 2013-07-03 02:35:21 UTC
(In reply to comment #15)
> Created attachment 81860
> run full gem hw init after gpu resets
> 
> Hopefully I haven't botched the testing on my side again, but this seems to
> actually work. Please test, thanks.

Fixed by this patch.
output:
rings stopped
gpu hang correctly dectected

dmesg:
[  195.724135] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[  195.726451] [drm:i915_driver_open],
[  195.726467] [drm:intel_crtc_cursor_set], cursor off
[  195.726470] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  195.726476] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  195.726479] [drm:intel_crtc_cursor_set], cursor off
[  195.726480] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  195.726483] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  195.726489] [drm:i915_driver_open],
[  199.707175] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[  199.707240] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[  199.710689] [drm:i915_error_work_func], resetting chip
[  199.710735] [drm] Simulated gpu hang, resetting stop_rings
[  199.710780] [drm:init_status_page], render ring hws offset: 0x00477000
[  199.710960] [drm:init_status_page], bsd ring hws offset: 0x0049a000
[  199.711131] [drm:i915_gem_context_init], Disabling HW Contexts; old hardware
[  199.711135] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  199.711195] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  200.290435] [drm:intel_crtc_cursor_set], cursor off
[  200.290440] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  200.290447] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  200.290450] [drm:intel_crtc_cursor_set], cursor off
[  200.290451] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  200.290454] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  200.300637] [drm:i915_driver_open],
[  200.300646] [drm:intel_crtc_cursor_set], cursor off
[  200.300647] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  200.300651] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  200.300653] [drm:intel_crtc_cursor_set], cursor off
[  200.300655] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  200.300657] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  200.300663] [drm:i915_driver_open],
[  200.300700] [drm:intel_crtc_cursor_set], cursor off
[  200.300702] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  200.300705] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  200.300708] [drm:intel_crtc_cursor_set], cursor off
[  200.300709] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  200.300711] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  200.695181] [drm:i915_error_state_write], Resetting error state
Comment 17 Daniel Vetter 2013-07-03 11:00:34 UTC
Ok, the previous patch had some pretty massive issues, so new patch to test:

https://patchwork.kernel.org/patch/2816111/

It seems to work here, but please confirm that this one is still good.
Comment 18 lu hua 2013-07-04 05:26:29 UTC
(In reply to comment #17)
> Ok, the previous patch had some pretty massive issues, so new patch to test:
> 
> https://patchwork.kernel.org/patch/2816111/
> 
> It seems to work here, but please confirm that this one is still good.

Works well with this patch.
output:
rings stopped
gpu hang correctly dectected

dmesg:
[  199.855498] [drm:i915_ring_stop_set], Stopping rings 0x0000000f
[  199.857770] [drm:i915_driver_open],
[  199.857786] [drm:intel_crtc_cursor_set], cursor off
[  199.857789] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  199.857794] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  199.857797] [drm:intel_crtc_cursor_set], cursor off
[  199.857799] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  199.857801] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  199.857807] [drm:i915_driver_open],
[  203.707172] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[  203.707240] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[  203.710729] [drm:i915_error_work_func], resetting chip
[  203.710774] [drm] Simulated gpu hang, resetting stop_rings
[  203.710819] [drm:i915_gem_context_init], Disabling HW Contexts; old hardware
[  203.710822] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  203.710879] [drm:i9xx_update_plane], Writing base 00046000 00000000 0 0 5120
[  204.245423] [drm:intel_crtc_cursor_set], cursor off
[  204.245428] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  204.245434] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  204.245437] [drm:intel_crtc_cursor_set], cursor off
[  204.245438] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  204.245441] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  204.255272] [drm:i915_driver_open],
[  204.255280] [drm:intel_crtc_cursor_set], cursor off
[  204.255282] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  204.255285] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  204.255287] [drm:intel_crtc_cursor_set], cursor off
[  204.255289] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  204.255291] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  204.255297] [drm:i915_driver_open],
[  204.255332] [drm:intel_crtc_cursor_set], cursor off
[  204.255334] [drm:intel_crtc_set_config], [CRTC:3] [FB:37] #connectors=1 (x y) (0 0)
[  204.255337] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  204.255339] [drm:intel_crtc_cursor_set], cursor off
[  204.255341] [drm:intel_crtc_set_config], [CRTC:4] [NOFB]
[  204.255343] [drm:intel_modeset_stage_output_state], [CONNECTOR:5:LVDS-1] to [CRTC:3]
[  204.646094] [drm:i915_error_state_write], Resetting error state
Comment 19 Daniel Vetter 2013-07-04 09:37:15 UTC
Patch merged to -fixes:

commit 035dc1e0f9008b48630e02bf0eaa7cc547416d1d
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jul 3 12:56:54 2013 +0200

    drm/i915: reinit status page registers after gpu reset
Comment 20 lu hua 2013-07-05 03:54:38 UTC
Tested on the latest -nightly branch.
output:
 ./ZZ_hangman
checking /sys/kernel/debug/dri/0/i915_error_state
rings stopped
gpu hang correctly detected
checking /sys/class/drm/card0/error
rings stopped
gpu hang correctly detected

dmesg has <6>[ 1438.703273] [drm] capturing error event; look for more information in /sys/class/drm/card0/error

# cat /sys/class/drm/card0/error
no error state collected

Is this expected?
Comment 21 Daniel Vetter 2013-07-05 06:41:33 UTC
(In reply to comment #20)
> Test on latest -nightly branch.
> output:
>  ./ZZ_hangman
> checking /sys/kernel/debug/dri/0/i915_error_state
> rings stopped
> gpu hang correctly detected
> checking /sys/class/drm/card0/error
> rings stopped
> gpu hang correctly detected
> 
> dmesg has <6>[ 1438.703273] [drm] capturing error event; look for more
> information in /sys/class/drm/card0/error
> 
> # cat /sys/class/drm/card0/error
> no error state collected
> 
> Is it expected?

Yes, ZZ_hangman automatically clears the error_state at the end of the test so that if any other test causes a real gpu hang it gets captured.
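
A quick way to check whether an error state is currently recorded is to compare the file's first line with the "no error state collected" text quoted above. The sketch below is not how ZZ_hangman does it; the helper name is hypothetical, and reading the sysfs file may require root.

```python
import os

# Text shown by the kernel when nothing has been captured (quoted in comment 20).
NO_ERROR_MARKER = "no error state collected"


def has_error_state(path: str = "/sys/class/drm/card0/error") -> bool:
    """Return True if the error file holds a captured error state."""
    with open(path, "r", errors="replace") as f:
        first_line = f.readline().strip()
    return first_line != NO_ERROR_MARKER


if __name__ == "__main__":
    error_path = "/sys/class/drm/card0/error"
    if os.path.exists(error_path):
        try:
            print("error state present:", has_error_state(error_path))
        except OSError as exc:
            # Typically a permission problem when not running as root.
            print("could not read", error_path, "-", exc)
```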
Comment 22 lu hua 2013-07-05 08:08:21 UTC
Verified Fixed.
Comment 23 Elizabeth 2017-10-06 14:46:02 UTC
Closing old verified.
