Bug 97779

Summary: [regression, bisected][BDW, GPU hang] stuck on render ring, always reproducible
Product: Mesa
Reporter: regwz <regwz>
Component: Drivers/DRI/i965
Assignee: Jason Ekstrand <jason>
Status: RESOLVED FIXED
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal
Priority: medium
CC: diego.viola, intel-gfx-bugs, zhouwei400
Version: 12.0
Keywords: bisected, regression
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform: BDW
i915 features: GPU hang
Bug Depends on:
Bug Blocks: 98335
Attachments: gpu crash dump
dmesg output (drm.debug=0x1e)
glxinfo
lspci -vvvnn
Chromium stderr output
git bisect log
git-revert of the offending commit (mesa 12.0.3)
aub trace of crash: sklgt2
git bisect log

Description regwz 2016-09-12 15:37:06 UTC
Created attachment 126466
gpu crash dump

System environment:
-- system architecture: 64-bit
-- xf86-video-intel: 1:2.99.917+703+g15c5ff1-1
-- xserver: 1.18.4
-- mesa: 12.1.0-devel (git-2da15a3)
-- llvm: 4.0.0svn_r281203
-- libdrm: 2.4.70
-- kernel: drm-intel-nightly 4.8.0 commit 3015dc173b34795d5bcf8fed4d7ce4709914a88f
-- chromium: 53.0.2785.101
-- Linux distribution: Arch Linux
-- Machine or mobo model: Lenovo Thinkpad Edge E550 20DF004PMC
-- Processor: Intel i5-5200U
-- GPU: Intel(R) HD Graphics 5500 (Broadwell GT2)
-- Display connector: eDP

Reproducing steps:
	1. open Chromium and navigate to the following link: https://www.artstation.com/embed/3346607 (source: https://playrust.com/devblog-125/)
	2. click the play button in the middle of the screen
	3. GPU hangs before the object is displayed


Additional info:
	- tested after a cold boot
	- hang occurs every time (after a cold boot and after every GPU reset due to a hang)
	- /proc/sys/kernel/tainted = 0 (no non-GPL or out-of-tree modules were loaded)
	- also works under Firefox (both chromium and firefox have WebGL acceleration enabled)
	- the bug is also present on kernel 4.7.3 and mesa 12.0.2
	- if intel_iommu is on, the PC may freeze completely
	- DRI3 and SNA are enabled, but the hang also occurs under DRI2 / UXA and with the modesetting driver
Comment 1 regwz 2016-09-12 15:39:51 UTC
Created attachment 126467
dmesg output (drm.debug=0x1e)
Comment 2 regwz 2016-09-12 15:40:35 UTC
Created attachment 126468
glxinfo
Comment 3 regwz 2016-09-12 15:44:31 UTC
Created attachment 126469
lspci -vvvnn
Comment 4 regwz 2016-09-12 15:46:58 UTC
Created attachment 126470
Chromium stderr output
Comment 5 yann 2016-09-12 16:00:52 UTC
Assigning to Mesa product.

From this error dump, the hang is happening in a render ring batch, with the active head at 0xf7299f04 and 0x7b000005 (3DPRIMITIVE) as IPEHR.

Batch extract (around 0xf7299f04):

0xf7299ed4:      0x78490001: 3D UNKNOWN: 3d_965 opcode = 0x7849
0xf7299ed8:      0x00000004: MI_NOOP
0xf7299edc:      0x00000000: MI_NOOP
0xf7299ee0:      0x780c0000: 3D UNKNOWN: 3d_965 opcode = 0x780c
0xf7299ee4:      0x00000000: MI_NOOP
Bad length 7 in (null), expected 6-6
0xf7299ee8:      0x7b000005: 3DPRIMITIVE: fail sequential
0xf7299eec:      0x00000104:    vertex count
0xf7299ef0:      0x00019470:    start vertex
0xf7299ef4:      0x00000000:    instance count
0xf7299ef8:      0x00000001:    start instance
0xf7299efc:      0x00000000:    index bias
0xf7299f00:      0x00000000: MI_NOOP
0xf7299f04:      0x78150009: 3D UNKNOWN: 3d_965 opcode = 0x7815
0xf7299f08:      0x00000004: MI_NOOP
0xf7299f0c:      0x00000000: MI_NOOP
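
As a rough cross-check of the decode above, the IPEHR value can be split into the header fields normally used by Intel 3D pipeline commands: command type in bits 31:29, subtype in bits 28:27, opcode in bits 26:24, sub-opcode in bits 23:16, and a DWord Length in bits 7:0 that holds the total dword count minus two. The function name below is hypothetical, and the field layout is an assumption taken from the public programmer's reference manuals rather than from this error dump, so treat this as a sketch.

```python
def decode_gfxpipe_header(dword):
    """Split a 3D pipeline command header dword into its fields.

    Hypothetical helper; the bit layout is an assumption based on the
    usual Intel 3D command header format, not on anything in this bug.
    """
    return {
        "command_type": (dword >> 29) & 0x7,   # 3 means a 3D/GFXPIPE command
        "subtype": (dword >> 27) & 0x3,
        "opcode": (dword >> 24) & 0x7,
        "sub_opcode": (dword >> 16) & 0xFF,
        # The DWord Length field stores (total dwords - 2).
        "total_dwords": (dword & 0xFF) + 2,
    }


if __name__ == "__main__":
    ipehr = 0x7B000005  # IPEHR value reported in the error dump
    for name, value in decode_gfxpipe_header(ipehr).items():
        print(f"{name}: {value}")
```

For 0x7b000005 this yields type 3, subtype 3, opcode 3, sub-opcode 0 and a total length of 7 dwords. That matches the "Bad length 7" warning: the decoder used here expected a 6-dword 3DPRIMITIVE, so the per-dword labels in the extract (vertex count, start vertex, and so on) may not line up with the Broadwell layout.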
Comment 6 Diego Viola 2016-10-14 22:27:05 UTC
I can reproduce this as well.

Arch Linux (x86-64)
Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
Comment 7 Ilia Mirkin 2016-10-14 22:44:48 UTC
Confirmed on mesa 12.0.3 on SKL GT2 (8086:1912) with kernel v4.7. Error state available on request (ping me on IRC, don't cc me on this bug).
Comment 8 Diego Viola 2016-10-14 22:45:37 UTC
Linux myhost 4.7.6-1-ARCH #1 SMP PREEMPT Fri Sep 30 19:28:42 CEST 2016 x86_64 GNU/Linux
mesa 12.0.3-3
Comment 9 Diego Viola 2016-10-16 00:19:44 UTC
Has this always been broken? Have you tried bisecting?
Comment 10 regwz 2016-10-16 06:48:25 UTC
I haven't checked that, and it turns out I should have.
It can't be reproduced with mesa-11.2.2-1 and it's broken again in mesa-12.0.0-1.
Comment 11 Diego Viola 2016-10-16 17:10:32 UTC
(In reply to regwz from comment #10)
> I haven't checked that, and it turns out I should have.
> It can't be reproduced with mesa-11.2.2-1 and it's broken again in
> mesa-12.0.0-1.

This looks more like a kernel issue than a Mesa issue.
Comment 12 regwz 2016-10-17 08:32:24 UTC
I ran a bisect with the following result:

091b6156dd8553979336c15acdaf140e5419c483 is the first bad commit
commit 091b6156dd8553979336c15acdaf140e5419c483
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Dec 8 17:34:38 2015 -0800

    i965/fs: Push small uniform arrays
    
    Unfortunately, this also means that we need to use a slightly different
    algorithm for assign_constant_locations.  The old algorithm worked based on
    the assumption that each read of a uniform value read exactly one float.
    If it encountered a MOV_INDIRECT, it would immediately bail and push the
    whole thing.  Since we can now read ranges using MOV_INDIRECT, we need to
    be able to push a series of floats without breaking them up.  To do this,
    we use an algorithm similar to the on in split_virtual_grfs.


I also verified that the bug can no longer be reproduced after reverting the commit from mesa 12.0.3.
Comment 13 regwz 2016-10-17 08:33:29 UTC
Created attachment 127343
git bisect log
Comment 14 regwz 2016-10-17 08:35:10 UTC
Created attachment 127344
git-revert of the offending commit (mesa 12.0.3)
Comment 15 Mark Janes 2016-10-17 22:52:10 UTC
Jason, the aub dump is hosted internally at:

http://otc-mesa-ci.jf.intel.com/userContent/Bug_97779.aub
Comment 16 Mark Janes 2016-10-17 22:57:11 UTC
Created attachment 127369
aub trace of crash: sklgt2
Comment 17 Jason Ekstrand 2016-10-20 19:06:09 UTC
(In reply to regwz from comment #12)
> I ran a bisect with the following result:
> 
> 091b6156dd8553979336c15acdaf140e5419c483 is the first bad commit
> commit 091b6156dd8553979336c15acdaf140e5419c483
> Author: Jason Ekstrand <jason.ekstrand@intel.com>
> Date:   Tue Dec 8 17:34:38 2015 -0800
> 
>     i965/fs: Push small uniform arrays
>     
>     Unfortunately, this also means that we need to use a slightly different
>     algorithm for assign_constant_locations.  The old algorithm worked based
> on
>     the assumption that each read of a uniform value read exactly one float.
>     If it encountered a MOV_INDIRECT, it would immediately bail and push the
>     whole thing.  Since we can now read ranges using MOV_INDIRECT, we need to
>     be able to push a series of floats without breaking them up.  To do this,
>     we use an algorithm similar to the on in split_virtual_grfs.

This bisect is bad.  You were bisecting through the Vulkan merge.  Back when the Vulkan driver was still in development the i965 driver in the vulkan branch was very unstable.  In order to get a proper bisect, you need to do so while ignoring the vulkan branch.

The easiest way to do this is probably to test right before the vulkan branch merged and right after.  The vulkan branch merging shouldn't have caused any problems.  If those tests are good, bisect between the merge and 12.0.  If they're bad, bisect between some older known-good commit and the vulkan merge.

> I also verified that the bug can no longer be reproduced after reverting the
> commit from mesa 12.0.3.

I doubt that given that I can't get that commit to revert cleanly.  A similar commit does exist in the main tree and happened shortly prior to merging the vulkan branch.  The commit you point to got lost in the merge.  In any case, please re-bisect.
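
The reason the first bisect went wrong can be illustrated with a small sketch. Bisection is a binary search, and it only gives a meaningful answer if the history it walks flips from good to bad exactly once. Commits from an unstable side branch can be bad for unrelated reasons and break that assumption, which is why the search should be restricted to mainline commits. The function name, commit labels and `is_bad` predicate below are all hypothetical; this is not how git implements bisect, only the underlying idea.

```python
def first_bad_commit(commits, is_bad):
    """Return the first bad commit in a list ordered oldest to newest.

    Assumes the last commit is bad and that every commit after the
    first bad one is also bad (a single good-to-bad transition).
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Hypothetical mainline-only history; side-branch commits are left out.
    mainline = ["m1", "m2", "m3", "m4", "m5", "m6"]
    known_bad = {"m4", "m5", "m6"}
    print(first_bad_commit(mainline, lambda c: c in known_bad))
```

If side-branch commits that hang for other reasons were mixed into the list, the predicate would no longer be monotonic and the search could converge on an unrelated commit, which is what happened with the first bisect.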
Comment 18 regwz 2016-10-21 06:51:05 UTC
(In reply to Jason Ekstrand from comment #17)
> This bisect is bad.  You were bisecting through the Vulkan merge.  Back when
> the Vulkan driver was still in development the i965 driver in the vulkan
> branch was very unstable.  In order to get a proper bisect, you need to do
> so while ignoring the vulkan branch.

Sorry about that and thank you for the explanation. I redid the bisect, here are the results:

963513bb24bdd542f1af3733fab53ad450d3221b is the first bad commit
commit 963513bb24bdd542f1af3733fab53ad450d3221b
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Dec 8 17:34:38 2015 -0800

    i965/fs: Push small uniform arrays
    
    Unfortunately, this also means that we need to use a slightly different
    algorithm for assign_constant_locations.  The old algorithm worked based on
    the assumption that each read of a uniform value read exactly one float.
    If it encountered a MOV_INDIRECT, it would immediately bail and push the
    whole thing.  Since we can now read ranges using MOV_INDIRECT, we need to
    be able to push a series of floats without breaking them up.  To do this,
    we use an algorithm similar to the on in split_virtual_grfs.
    
    Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
    Acked-by: Kenneth Graunke <kenneth@whitecape.org>


> I doubt that given that I can't get that commit to revert cleanly.  A
> similar commit does exist in the main tree and happened shortly prior to
> merging the vulkan branch.  The commit you point to got lost in the merge. 
> In any case, please re-bisect.

Yes, there were merge conflicts, but I resolved them manually (see attachment https://bugs.freedesktop.org/attachment.cgi?id=127344).
Comment 19 regwz 2016-10-21 06:52:50 UTC
Created attachment 127447
git bisect log
Comment 20 Jason Ekstrand 2016-10-29 04:39:46 UTC
This bug should be fixed by the following commit:

commit 2a4a86862c949055c71637429f6d5f2e725d07d8
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Oct 28 14:48:53 2016 -0700

    i965/fs/generator: Don't use the address immediate for MOV_INDIRECT
    
    The address immediate field is only 9 bits and, since the value is in
    bytes, the highest GRF we can point to with it is g15.  This makes it
    pretty close to useless for MOV_INDIRECT.  There were already piles of
    restrictions preventing us from using it prior to Broadwell, so let's get
    rid of the gen8+ code path entirely.
    
    Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=97779
    Cc: "12.0 13.0" <mesa-stable@lists.freedesktop.org>
    Reviewed-by: Matt Turner <mattst88@gmail.com>

I've tagged it for stable so it should be in 13.0 and it may even get into a 12.0 stable release.
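
The g15 limit quoted in the commit message follows from simple arithmetic, sketched below. It takes the commit message's figures at face value (a 9-bit immediate holding a byte offset) and assumes a GRF size of 32 bytes, which is the register width on these generations. The constant and function names are hypothetical and do not come from the Mesa source.

```python
GRF_SIZE_BYTES = 32   # assumed: one GRF is 256 bits wide
ADDR_IMM_BITS = 9     # width of the address immediate per the commit message


def max_addr_imm(bits=ADDR_IMM_BITS):
    """Largest byte offset an unsigned immediate of this width can hold."""
    return (1 << bits) - 1


def highest_reachable_grf(bits=ADDR_IMM_BITS, grf_size=GRF_SIZE_BYTES):
    """Index of the highest GRF whose bytes the immediate can address."""
    return max_addr_imm(bits) // grf_size


def fits_in_addr_imm(byte_offset, bits=ADDR_IMM_BITS):
    """True if a byte offset can be encoded in the immediate field."""
    return 0 <= byte_offset <= max_addr_imm(bits)


if __name__ == "__main__":
    print("max byte offset:", max_addr_imm())
    print("highest GRF: g%d" % highest_reachable_grf())
    print("g16 reachable:", fits_in_addr_imm(16 * GRF_SIZE_BYTES))
```

With 9 bits the largest offset is 511 bytes, and 511 // 32 = 15, so anything stored at g16 or above (byte offset 512 and up) cannot be reached through the immediate.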
Comment 21 Diego Viola 2016-11-02 01:38:15 UTC
I can confirm this is fixed, thanks.

Arch Linux (x86-64)
mesa 13.0.0-1
Comment 22 regwz 2016-11-02 14:04:09 UTC
*** Bug 98412 has been marked as a duplicate of this bug. ***
