Bug 78685

Summary: [ILK]igt/gem_reset_stats/ban-render causes GPU HANG: ecode 0:0x169955aa and *ERROR* render ring :timed out trying to stop ring
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WONTFIX
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: high
CC: christophe.prigent, intel-gfx-bugs, jinxianx.guo, yi.sun
Version: unspecified
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform: ILK
i915 features: GPU hang
Attachments:
dmesg (flags: none)
/sys/class/drm/card0/error (flags: none)

Description lu hua 2014-05-14 07:40:53 UTC
Created attachment 99014 [details]
dmesg

System Environment:
--------------------------
Platform:         Ironlake
Kernel:           (drm-intel-nightly)2be456541ea41728002ccca2de5235f48d14326e

Bug detailed description:
-------------------------
The test causes a GPU hang on Ironlake with the -queued, -fixes and -nightly kernels.
The issue also occurs on earlier kernels.
Output:
IGT-Version: 1.6-g351e7d3 (x86_64) (Linux: 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ x86_64)
Subtest ban-render: SUCCESS
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:813:
Last errno: 0, Success
Test requirement: (!((((intel_get_drm_devid(fd)) == 0x0102 || (intel_get_drm_devid(fd)) == 0x0112 || (intel_get_drm_devid(fd)) == 0x0122 || (intel_get_drm_devid(fd)) == 0x0106 || (intel_get_drm_devid(fd)) == 0x0116 || (intel_get_drm_devid(fd)) == 0x0126 || (intel_get_drm_devid(fd)) == 0x010A) || (((intel_get_drm_devid(fd)) == 0x0152 || (intel_get_drm_devid(fd)) == 0x0162 || (intel_get_drm_devid(fd)) == 0x0156 || (intel_get_drm_devid(fd)) == 0x0166 || (intel_get_drm_devid(fd)) == 0x015a || (intel_get_drm_devid(fd)) == 0x016a) || (((intel_get_drm_devid(fd)) == 0x0402 || (intel_get_drm_devid(fd)) == 0x0406 || (intel_get_drm_devid(fd)) == 0x040A || (intel_get_drm_devid(fd)) == 0x040B || (intel_get_drm_devid(fd)) == 0x040E || (intel_get_drm_devid(fd)) == 0x0C02 || (intel_get_drm_devid(fd)) == 0x0C06 || (intel_get_drm_devid(fd)) == 0x0C0A || (intel_get_drm_devid(fd)) == 0x0C0B || (intel_get_drm_devid(fd)) == 0x0C0E || (intel_get_drm_devid(fd)) == 0x0A02 || (intel_get_drm_devid(fd)) == 0x0A06 || (intel_get_drm_devid(fd)) == 0x0A0A || (intel_get_drm_devid(fd)) == 0x0A0B || (intel_get_drm_devid(fd)) == 0x0A0E || (intel_get_drm_devid(fd)) == 0x0D02 || (intel_get_drm_devid(fd)) == 0x0D06 || (intel_get_drm_devid(fd)) == 0x0D0A || (intel_get_drm_devid(fd)) == 0x0D0B || (intel_get_drm_devid(fd)) == 0x0D0E) || ((intel_get_drm_devid(fd)) == 0x0412 || (intel_get_drm_devid(fd)) == 0x0416 || (intel_get_drm_devid(fd)) == 0x041A || (intel_get_drm_devid(fd)) == 0x041B || (intel_get_drm_devid(fd)) == 0x041E || (intel_get_drm_devid(fd)) == 0x0C12 || (intel_get_drm_devid(fd)) == 0x0C16 || (intel_get_drm_devid(fd)) == 0x0C1A || (intel_get_drm_devid(fd)) == 0x0C1B || (intel_get_drm_devid(fd)) == 0x0C1E || (intel_get_drm_devid(fd)) == 0x0A12 || (intel_get_drm_devid(fd)) == 0x0A16 || (intel_get_drm_devid(fd)) == 0x0A1A || (intel_get_drm_devid(fd)) == 0x0A1B || (intel_get_drm_devid(fd)) == 0x0A1E || (intel_get_drm_devid(fd)) == 0x0D12 || (intel_get_drm_devid(fd)) == 0x0D16 || 
(intel_get_drm_devid(fd)) == 0x0D1A || (intel_get_drm_devid(fd)) == 0x0D1B || (intel_get_drm_devid(fd)) == 0x0D1E) || ((intel_get_drm_devid(fd)) == 0x0422 || (intel_get_drm_devid(fd)) == 0x0426 || (intel_get_drm_devid(fd)) == 0x042A || (intel_get_drm_devid(fd)) == 0x042B || (intel_get_drm_devid(fd)) == 0x042E || (intel_get_drm_devid(fd)) == 0x0C22 || (intel_get_drm_devid(fd)) == 0x0C26 || (intel_get_drm_devid(fd)) == 0x0C2A || (intel_get_drm_devid(fd)) == 0x0C2B || (intel_get_drm_devid(fd)) == 0x0C2E || (intel_get_drm_devid(fd)) == 0x0A22 || (intel_get_drm_devid(fd)) == 0x0A26 || (intel_get_drm_devid(fd)) == 0x0A2A || (intel_get_drm_devid(fd)) == 0x0A2B || (intel_get_drm_devid(fd)) == 0x0A2E || (intel_get_drm_devid(fd)) == 0x0D22 || (intel_get_drm_devid(fd)) == 0x0D26 || (intel_get_drm_devid(fd)) == 0x0D2A || (intel_get_drm_devid(fd)) == 0x0D2B || (intel_get_drm_devid(fd)) == 0x0D2E)) || ((intel_get_drm_devid(fd)) == 0x0f30 || (intel_get_drm_devid(fd)) == 0x0f31 || (intel_get_drm_devid(fd)) == 0x0f32 || (intel_get_drm_devid(fd)) == 0x0f33)) || (((((intel_get_drm_devid(fd)) & 0xff00) != 0x1600) ? 0 : ((((intel_get_drm_devid(fd)) & 0x00f0) >> 4) > 3) ? 0 : (((intel_get_drm_devid(fd)) & 0x000f) == 0x2) ? 1 : (((intel_get_drm_devid(fd)) & 0x000f) == 0x6) ? 1 : (((intel_get_drm_devid(fd)) & 0x000f) == 0xb) ? 1 : (((intel_get_drm_devid(fd)) & 0x000f) == 0xa) ? 1 : (((intel_get_drm_devid(fd)) & 0x000f) == 0xd) ? 1 : (((intel_get_drm_devid(fd)) & 0x000f) == 0xe) ? 1 : 0) || ((intel_get_drm_devid(fd)) == 0x22b0 || (intel_get_drm_devid(fd)) == 0x22b1 || (intel_get_drm_devid(fd)) == 0x22b2 || (intel_get_drm_devid(fd)) == 0x22b3)))))

# echo $?
0

# dmesg -r | egrep "<[1-6]>" |grep drm
<5>[    0.000000] Linux version 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ (buildtopcommit@x-kernel) (gcc version 4.7.0 20120507 (Red Hat 4.7.0-5) (GCC) ) #2606 SMP Wed May 14 11:26:03 CST 2014
<6>[    0.000000] Command line: BOOT_IMAGE=kernels//nightly_parents/2014_05_14/drm-intel-nightly/2be456541ea41728002ccca2de5235f48d14326e/bzImage_x86_64 root=/dev/sda2 drm.debug=0xe modules_path=kernels//nightly_parents/2014_05_14/drm-intel-nightly/2be456541ea41728002ccca2de5235f48d14326e/modules_x86_64/lib/modules/3.15.0-rc3_drm-intel-nightly_2be456_20140514+
<5>[    0.000000] Kernel command line: BOOT_IMAGE=kernels//nightly_parents/2014_05_14/drm-intel-nightly/2be456541ea41728002ccca2de5235f48d14326e/bzImage_x86_64 root=/dev/sda2 drm.debug=0xe modules_path=kernels//nightly_parents/2014_05_14/drm-intel-nightly/2be456541ea41728002ccca2de5235f48d14326e/modules_x86_64/lib/modules/3.15.0-rc3_drm-intel-nightly_2be456_20140514+
<6>[    0.668163] usb usb1: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ ehci_hcd
<6>[    0.680113] usb usb2: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ ehci_hcd
<6>[    0.682131] usb usb3: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ uhci_hcd
<6>[    0.683681] usb usb4: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ uhci_hcd
<6>[    0.685389] usb usb5: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ uhci_hcd
<6>[    0.686879] usb usb6: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ uhci_hcd
<6>[    0.688329] usb usb7: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ uhci_hcd
<6>[    0.689740] usb usb8: Manufacturer: Linux 3.15.0-rc3_drm-intel-nightly_2be456_20140514+ uhci_hcd
<6>[    1.496414] [drm] Initialized drm 1.1.0 20060810
<6>[    1.500656] [drm] Memory usable by graphics device = 2048M
<6>[    1.513734] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
<6>[    1.513834] [drm] Driver supports precise vblank timestamp query.
<6>[    1.541715] fbcon: inteldrmfb (fb0) is primary device
<6>[    1.617461] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
<6>[    1.617557] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
<6>[   60.815868] [drm] stuck on render ring
<6>[   60.818433] [drm] GPU HANG: ecode 0:0x169955aa, in gem_reset_stats [3692], reason: Ring hung, action: reset
<6>[   60.818550] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[   60.818610] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[   60.818675] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[   60.818729] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[   60.818778] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<6>[   60.818985] [drm] Simulated gpu hang, resetting stop_rings
<3>[   61.820788] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   62.822775] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   62.822813] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 0001f401 head 00000308 tail 00000390 start 00003000
<6>[   64.815826] [drm] stuck on render ring
<6>[   64.818306] [drm] GPU HANG: ecode 0:0x169955ab, in gem_reset_stats [3692], reason: Ring hung, action: reset
<6>[   64.818490] [drm] Simulated gpu hang, resetting stop_rings
<3>[   65.819737] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   66.821769] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   66.821820] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 0001f401 head 00000308 tail 000005f0 start 00003000
<6>[   68.815769] [drm] stuck on render ring
<6>[   68.818296] [drm] GPU HANG: ecode 0:0x169955ab, reason: Ring hung, action: reset
<3>[   69.819686] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   70.821673] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   70.821710] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 0001f401 head 00000308 tail 00000720 start 00003000
<6>[   72.807713] [drm] stuck on render ring
<6>[   72.810300] [drm] GPU HANG: ecode 0:0x169955ab, in gem_reset_stats [3692], reason: Ring hung, action: reset
<3>[   73.811638] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   74.813624] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   74.813662] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 0001f401 head 00000308 tail 000007b8 start 00003000
<6>[   76.819664] [drm] no progress on render ring
<6>[   76.822147] [drm] GPU HANG: ecode -1:0x00000000, reason: Ring hung, action: reset
<3>[   77.823586] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   78.825573] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
<3>[   78.825613] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 0001f401 head 00000308 tail 000007b8 start 0000300

Reproduce steps:
---------------------------- 
1. ./gem_reset_stats --run-subtest ban-render
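
For triage, the kernel log above can be condensed into two pieces of information: how many stop_ring timeouts were logged, and which hang ecodes appeared. The following is a minimal sketch and not part of the original report. The function names count_stop_ring_errors and list_hang_ecodes are hypothetical, and the sketch assumes the log text has the same format as the "dmesg -r" output shown in the description.

```shell
#!/bin/sh
# Hypothetical helpers (not from the report); both read kernel log text on stdin.

# Print the number of "timed out trying to stop ring" lines.
# grep -c exits non-zero when the count is 0, so the status is masked.
count_stop_ring_errors() {
    grep -cF 'timed out trying to stop ring' || true
}

# Print each distinct GPU hang ecode once, in order of first appearance.
list_hang_ecodes() {
    sed -n 's/.*GPU HANG: ecode \([^,]*\),.*/\1/p' | awk '!seen[$0]++'
}

# Usage on a live system (needs permission to read the kernel log):
#   dmesg -r | count_stop_ring_errors
#   dmesg -r | list_hang_ecodes
```

The test itself injects simulated hangs (see the "Simulated gpu hang" lines), so the stop_ring errors are probably the more useful signal here than the number of GPU HANG lines.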
Comment 1 lu hua 2014-05-14 07:41:29 UTC
Created attachment 99015 [details]
/sys/class/drm/card0/error
Comment 2 lu hua 2014-05-14 08:11:12 UTC
The following cases also have this issue:
gem_reset_stats_close-pending-fork-render
gem_reset_stats_close-pending-fork-reverse-render
gem_reset_stats_close-pending-render
gem_reset_stats_reset-count-render
gem_reset_stats_reset-stats-render
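
The case names above can be turned into command lines of the same form as the reproduce step in the description. This is a small sketch under one assumption: that each name is the test binary name gem_reset_stats, an underscore, and then the subtest name. The helper name subtest_cmd is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a command line for a "gem_reset_stats_<subtest>" name.
# Assumes everything after the "gem_reset_stats_" prefix is the subtest name.
subtest_cmd() {
    printf './gem_reset_stats --run-subtest %s\n' "${1#gem_reset_stats_}"
}

# Usage:
#   subtest_cmd gem_reset_stats_reset-count-render
```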
Comment 3 Daniel Vetter 2014-05-15 15:00:46 UTC
Is this a regression? The testcase itself is a few months old, and at least on my testing here it seemed to have worked recently ...
Comment 4 lu hua 2014-05-16 07:26:01 UTC
(In reply to comment #3)
> Is this a regression? The testcase itself is a few months old, and at least
> on my testing here it seemed to have worked recently ...

Tested on commit 16b23af8d4f95c09d2bb650e85ecf8ed9e7c18d0; it works well.
Comment 5 Daniel Vetter 2014-05-16 08:25:41 UTC
Ok, let's shrug this off as a fluke then. Please reopen if it shows up again. For verification please run the test a few times in a loop to make sure.
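
Running the test in a loop, as suggested, could look like the sketch below. The function name run_repeated is hypothetical; it runs a command a given number of times and prints how many runs exited non-zero. The description shows the subtest exiting with status 0 while the kernel log still contained errors, so the exit status alone is likely not sufficient and dmesg should be checked after the loop as well.

```shell
#!/bin/sh
# Hypothetical helper: run_repeated COUNT CMD [ARGS...]
# Runs CMD COUNT times and prints the number of runs that exited non-zero.
run_repeated() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Discard the command's own output; only the exit status is counted.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Usage:
#   run_repeated 10 ./gem_reset_stats --run-subtest ban-render
#   dmesg -r | grep -F '*ERROR*'
```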
Comment 6 lu hua 2014-05-21 07:56:44 UTC
Bisect shows: e9fea5747d2b3dbff47a8790c1cc4d7af80051d6 is the first bad commit
commit e9fea5747d2b3dbff47a8790c1cc4d7af80051d6
Author: Naresh Kumar Kachhi <naresh.kumar.kachhi@intel.com>
Date:   Wed Mar 12 16:39:41 2014 +0530

    drm/i915: wait for rings to become idle once disabled

    make sure we wait for rings to become idle once they are
    disabled. In case of timeout print an error message

    Signed-off-by: Naresh Kumar Kachhi <naresh.kumar.kachhi@intel.com>
    [danvet: Frob patch as suggested by Chris.]
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Comment 7 Chris Wilson 2014-05-21 08:20:42 UTC
First bad commit for what? The error message?
Comment 8 lu hua 2014-06-03 08:27:38 UTC
(In reply to comment #7)
> First bad commit for what? The error message?

It is the first bad commit for printing the error message; the error itself existed earlier.
Comment 9 lu hua 2014-06-06 08:27:37 UTC
It looks like this error predates e9fea5747d2b3dbff47a8790c1cc4d7af80051d6.
I tried to apply this patch to an earlier commit in order to find a good commit, but the patch fails to apply:
patching file drivers/gpu/drm/i915/i915_reg.h
Hunk #1 FAILED at 748.
Hunk #2 FAILED at 824.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_reg.h.rej
patching file drivers/gpu/drm/i915/intel_ringbuffer.c
Hunk #1 FAILED at 444.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/intel_ringbuffer.c.rej
patching file drivers/gpu/drm/i915/intel_ringbuffer.h
Hunk #1 succeeded at 35 with fuzz 2 (offset 2 lines).

So I am not sure whether it is a regression.
Comment 10 Humberto Israel Perez Rodriguez 2015-08-11 15:39:36 UTC
Closed after more than one year of inactivity. Feel free to reopen if needed. Thanks
