97779 – [regression, bisected][BDW, GPU hang] stuck on render ring, always reproducible

Summary: [regression, bisected][BDW, GPU hang] stuck on render ring, always reproducible

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	12.0
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Jason Ekstrand
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:	bisected, regression

Duplicates (1):	98412
Depends on:
Blocks:	98335

Reported:	2016-09-12 15:37 UTC by regwz
Modified:	2016-11-02 14:04 UTC
CC List:	3 users

See Also:
i915 platform:	BDW
i915 features:	GPU hang

Attachments
gpu crash dump (715.90 KB, text/plain) 2016-09-12 15:37 UTC, regwz
dmesg output (drm.debug=0x1e) (1.51 MB, text/plain) 2016-09-12 15:39 UTC, regwz
glxinfo (25.92 KB, text/plain) 2016-09-12 15:40 UTC, regwz
lspci -vvvnn (24.23 KB, text/plain) 2016-09-12 15:44 UTC, regwz
Chromium stderr output (44.98 KB, text/plain) 2016-09-12 15:46 UTC, regwz
git bisect log (7.41 KB, text/plain) 2016-10-17 08:33 UTC, regwz
git-revert of the offending commit (mesa 12.0.3) (5.67 KB, patch) 2016-10-17 08:35 UTC, regwz
aub trace of crash: sklgt2 (1.54 MB, application/octet-stream) 2016-10-17 22:57 UTC, Mark Janes
git bisect log (1.92 KB, text/plain) 2016-10-21 06:52 UTC, regwz

Description regwz 2016-09-12 15:37:06 UTC

Created attachment 126466
gpu crash dump

System environment:
-- system architecture: 64-bit
-- xf86-video-intel: 1:2.99.917+703+g15c5ff1-1
-- xserver: 1.18.4
-- mesa: 12.1.0-devel (git-2da15a3)
-- llvm: 4.0.0svn_r281203
-- libdrm: 2.4.70
-- kernel: drm-intel-nightly 4.8.0 commit 3015dc173b34795d5bcf8fed4d7ce4709914a88f
-- chromium: 53.0.2785.101
-- Linux distribution: Arch Linux
-- Machine or mobo model: Lenovo Thinkpad Edge E550 20DF004PMC
-- Processor: Intel i5-5200U
-- GPU: Intel(R) HD Graphics 5500 (Broadwell GT2)
-- Display connector: eDP

Reproducing steps:
	1. open Chromium and navigate to the following link: https://www.artstation.com/embed/3346607 (source: https://playrust.com/devblog-125/)
	2. click the play button in the middle of the screen
	3. GPU hangs before the object is displayed


Additional info:
	- tested after a cold boot
	- hang occurs every time (after a cold boot and after every GPU reset due to a hang)
	- /proc/sys/kernel/tainted = 0 (no non-GPL or out-of-tree modules were loaded)
	- also works under Firefox (both chromium and firefox have WebGL acceleration enabled)
	- the bug is also present on kernel 4.7.3 and mesa 12.0.2
	- if intel_iommu is on, the PC may freeze completely
	- DRI3 and SNA are enabled, but the hang also occurs under DRI2 / UXA and with the modesetting driver

Comment 1 regwz 2016-09-12 15:39:51 UTC

Created attachment 126467
dmesg output (drm.debug=0x1e)

Comment 2 regwz 2016-09-12 15:40:35 UTC

Created attachment 126468
glxinfo

Comment 3 regwz 2016-09-12 15:44:31 UTC

Created attachment 126469
lspci -vvvnn

Comment 4 regwz 2016-09-12 15:46:58 UTC

Created attachment 126470
Chromium stderr output

Comment 5 yann 2016-09-12 16:00:52 UTC

Assigning to Mesa product.

According to this error dump, the hang occurs in the render ring batch, with the active head at 0xf7299f04 and 0x7b000005 (3DPRIMITIVE) as IPEHR.

Batch extract (around 0xf7299f04):

0xf7299ed4:      0x78490001: 3D UNKNOWN: 3d_965 opcode = 0x7849
0xf7299ed8:      0x00000004: MI_NOOP
0xf7299edc:      0x00000000: MI_NOOP
0xf7299ee0:      0x780c0000: 3D UNKNOWN: 3d_965 opcode = 0x780c
0xf7299ee4:      0x00000000: MI_NOOP
Bad length 7 in (null), expected 6-6
0xf7299ee8:      0x7b000005: 3DPRIMITIVE: fail sequential
0xf7299eec:      0x00000104:    vertex count
0xf7299ef0:      0x00019470:    start vertex
0xf7299ef4:      0x00000000:    instance count
0xf7299ef8:      0x00000001:    start instance
0xf7299efc:      0x00000000:    index bias
0xf7299f00:      0x00000000: MI_NOOP
0xf7299f04:      0x78150009: 3D UNKNOWN: 3d_965 opcode = 0x7815
0xf7299f08:      0x00000004: MI_NOOP
0xf7299f0c:      0x00000000: MI_NOOP
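
The header dword quoted above can be checked by hand. The following is a minimal sketch, assuming the usual GFXPIPE header layout (command type in bits 31:29, subtype in 28:27, opcode in 26:24, sub-opcode in 23:16, and a DWord Length field in bits 7:0 that is biased by 2). The struct and function names are hypothetical and are not taken from any decoder in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the header fields; the names are illustrative. */
struct gfx_cmd_header {
    unsigned type;          /* bits 31:29 (3 = GFXPIPE) */
    unsigned subtype;       /* bits 28:27 (3 = 3D) */
    unsigned opcode;        /* bits 26:24 */
    unsigned sub_opcode;    /* bits 23:16 */
    unsigned total_dwords;  /* DWord Length field (bits 7:0) plus 2 */
};

/* Extract the inclusive bit range [hi:lo] from a dword. */
static uint32_t get_bits(uint32_t dw, unsigned hi, unsigned lo)
{
    assert(hi >= lo && hi < 32);
    unsigned width = hi - lo + 1;
    uint32_t mask = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
    return (dw >> lo) & mask;
}

/* Split a command header dword into its fields (assumed layout, see above). */
static struct gfx_cmd_header decode_header(uint32_t dw)
{
    struct gfx_cmd_header h;
    h.type = get_bits(dw, 31, 29);
    h.subtype = get_bits(dw, 28, 27);
    h.opcode = get_bits(dw, 26, 24);
    h.sub_opcode = get_bits(dw, 23, 16);
    h.total_dwords = get_bits(dw, 7, 0) + 2;
    return h;
}

/* Address of the command that follows the one starting at addr. */
static uint32_t next_command(uint32_t addr, uint32_t header)
{
    return addr + 4u * decode_header(header).total_dwords;
}
```

Under these assumptions, 0x7b000005 decodes to type 3, subtype 3, opcode 3, sub-opcode 0 and a total length of 7 dwords, so a 3DPRIMITIVE starting at 0xf7299ee8 would end immediately before 0xf7299f04, the reported active head. This may also be why the decoder, which expected 6 dwords, printed "Bad length 7" and labelled the trailing dword as MI_NOOP.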

Comment 6 Diego Viola 2016-10-14 22:27:05 UTC

I can reproduce this as well.

Arch Linux (x86-64)
Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz

Comment 7 Ilia Mirkin 2016-10-14 22:44:48 UTC

Confirmed on mesa 12.0.3 on SKL GT2 (8086:1912) with kernel v4.7. Error state available on request (ping me on IRC, don't cc me on this bug).

Comment 8 Diego Viola 2016-10-14 22:45:37 UTC

Linux myhost 4.7.6-1-ARCH #1 SMP PREEMPT Fri Sep 30 19:28:42 CEST 2016 x86_64 GNU/Linux
mesa 12.0.3-3

Comment 9 Diego Viola 2016-10-16 00:19:44 UTC

Has this always been broken? Have you tried bisecting?

Comment 10 regwz 2016-10-16 06:48:25 UTC

I haven't checked that, and it turns out I should have.
It can't be reproduced with mesa-11.2.2-1 and it's broken again in mesa-12.0.0-1.

Comment 11 Diego Viola 2016-10-16 17:10:32 UTC

(In reply to regwz from comment #10)
> I haven't checked that, and it turns out I should have.
> It can't be reproduced with mesa-11.2.2-1 and it's broken again in
> mesa-12.0.0-1.

This looks more like a kernel issue than a Mesa issue.

Comment 12 regwz 2016-10-17 08:32:24 UTC

I ran a bisect with the following result:

091b6156dd8553979336c15acdaf140e5419c483 is the first bad commit
commit 091b6156dd8553979336c15acdaf140e5419c483
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Dec 8 17:34:38 2015 -0800

    i965/fs: Push small uniform arrays
    
    Unfortunately, this also means that we need to use a slightly different
    algorithm for assign_constant_locations.  The old algorithm worked based on
    the assumption that each read of a uniform value read exactly one float.
    If it encountered a MOV_INDIRECT, it would immediately bail and push the
    whole thing.  Since we can now read ranges using MOV_INDIRECT, we need to
    be able to push a series of floats without breaking them up.  To do this,
    we use an algorithm similar to the on in split_virtual_grfs.


I also verified that the bug can no longer be reproduced after reverting the commit from mesa 12.0.3.
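
For readers unfamiliar with the commit, the idea in its message (keep a range read through MOV_INDIRECT together as one chunk and push or skip the chunk as a whole) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Mesa's actual assign_constant_locations: the array layout, the names mark_range and assign_locations, the MAX_UNIFORMS limit and the push budget are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_UNIFORMS 64  /* hypothetical upper bound on uniform slots */

/* Record an indirectly read range [start, start + len): every slot is live,
 * and every slot except the last is tied to its successor. */
static void mark_range(bool *live, bool *contiguous,
                       unsigned start, unsigned len)
{
    assert(len > 0 && start + len <= MAX_UNIFORMS);
    for (unsigned i = 0; i < len; i++) {
        live[start + i] = true;
        if (i + 1 < len)
            contiguous[start + i] = true;
    }
}

/* Walk the slots chunk by chunk. A chunk is pushed only if it contains a
 * live slot and fits entirely within the remaining budget; otherwise all
 * of its slots keep push_loc == -1 (meaning "not pushed").
 * Returns the number of pushed slots. */
static unsigned assign_locations(const bool *live, const bool *contiguous,
                                 unsigned count, unsigned max_push,
                                 int *push_loc)
{
    unsigned num_push = 0;
    unsigned chunk_start = 0;

    for (unsigned i = 0; i < count; i++)
        push_loc[i] = -1;

    for (unsigned i = 0; i < count; i++) {
        /* Keep extending the chunk while this slot is tied to the next. */
        if (i + 1 < count && contiguous[i])
            continue;

        unsigned chunk_size = i - chunk_start + 1;
        bool any_live = false;
        for (unsigned j = chunk_start; j <= i; j++)
            any_live = any_live || live[j];

        if (any_live && num_push + chunk_size <= max_push) {
            for (unsigned j = chunk_start; j <= i; j++)
                push_loc[j] = (int)num_push++;
        }
        chunk_start = i + 1;
    }
    return num_push;
}
```

With one directly read slot and a four-slot indirect range, a budget of 4 pushes only the single slot and leaves the whole range unpushed, while a budget of 8 pushes all five slots with the range kept in consecutive locations.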

Comment 13 regwz 2016-10-17 08:33:29 UTC

Created attachment 127343
git bisect log

Comment 14 regwz 2016-10-17 08:35:10 UTC

Created attachment 127344
git-revert of the offending commit (mesa 12.0.3)

Comment 15 Mark Janes 2016-10-17 22:52:10 UTC

Jason, the aub dump is hosted internally at:

http://otc-mesa-ci.jf.intel.com/userContent/Bug_97779.aub

Comment 16 Mark Janes 2016-10-17 22:57:11 UTC

Created attachment 127369
aub trace of crash: sklgt2

Comment 17 Jason Ekstrand 2016-10-20 19:06:09 UTC

(In reply to regwz from comment #12)
> I ran a bisect with the following result:
> 
> 091b6156dd8553979336c15acdaf140e5419c483 is the first bad commit
> commit 091b6156dd8553979336c15acdaf140e5419c483
> Author: Jason Ekstrand <jason.ekstrand@intel.com>
> Date:   Tue Dec 8 17:34:38 2015 -0800
> 
>     i965/fs: Push small uniform arrays
>     
>     Unfortunately, this also means that we need to use a slightly different
>     algorithm for assign_constant_locations.  The old algorithm worked based on
>     the assumption that each read of a uniform value read exactly one float.
>     If it encountered a MOV_INDIRECT, it would immediately bail and push the
>     whole thing.  Since we can now read ranges using MOV_INDIRECT, we need to
>     be able to push a series of floats without breaking them up.  To do this,
>     we use an algorithm similar to the on in split_virtual_grfs.

This bisect is bad.  You were bisecting through the Vulkan merge.  Back when the Vulkan driver was still in development the i965 driver in the vulkan branch was very unstable.  In order to get a proper bisect, you need to do so while ignoring the vulkan branch.

The easiest way to do this is probably to test right before the vulkan branch merged and right after.  The vulkan branch merging shouldn't have caused any problems.  If those tests are good, bisect between the merge and 12.0.  If they're bad, bisect between some older known-good commit and the vulkan merge.

> I also verified that the bug can no longer be reproduced after reverting the
> commit from mesa 12.0.3.

I doubt that given that I can't get that commit to revert cleanly.  A similar commit does exist in the main tree and happened shortly prior to merging the vulkan branch.  The commit you point to got lost in the merge.  In any case, please re-bisect.

Comment 18 regwz 2016-10-21 06:51:05 UTC

(In reply to Jason Ekstrand from comment #17)
> This bisect is bad.  You were bisecting through the Vulkan merge.  Back when
> the Vulkan driver was still in development the i965 driver in the vulkan
> branch was very unstable.  In order to get a proper bisect, you need to do
> so while ignoring the vulkan branch.

Sorry about that and thank you for the explanation. I redid the bisect, here are the results:

963513bb24bdd542f1af3733fab53ad450d3221b is the first bad commit
commit 963513bb24bdd542f1af3733fab53ad450d3221b
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Dec 8 17:34:38 2015 -0800

    i965/fs: Push small uniform arrays
    
    Unfortunately, this also means that we need to use a slightly different
    algorithm for assign_constant_locations.  The old algorithm worked based on
    the assumption that each read of a uniform value read exactly one float.
    If it encountered a MOV_INDIRECT, it would immediately bail and push the
    whole thing.  Since we can now read ranges using MOV_INDIRECT, we need to
    be able to push a series of floats without breaking them up.  To do this,
    we use an algorithm similar to the on in split_virtual_grfs.
    
    Reviewed-by: Kristian Høgsberg <krh@bitplanet.net>
    Acked-by: Kenneth Graunke <kenneth@whitecape.org>


> I doubt that given that I can't get that commit to revert cleanly.  A
> similar commit does exist in the main tree and happened shortly prior to
> merging the vulkan branch.  The commit you point to got lost in the merge. 
> In any case, please re-bisect.

Yes, there were merge conflicts, but I resolved them manually (see attachment https://bugs.freedesktop.org/attachment.cgi?id=127344).

Comment 19 regwz 2016-10-21 06:52:50 UTC

Created attachment 127447
git bisect log

Comment 20 Jason Ekstrand 2016-10-29 04:39:46 UTC

This bug should be fixed by the following commit:

commit 2a4a86862c949055c71637429f6d5f2e725d07d8
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Oct 28 14:48:53 2016 -0700

    i965/fs/generator: Don't use the address immediate for MOV_INDIRECT
    
    The address immediate field is only 9 bits and, since the value is in
    bytes, the highest GRF we can point to with it is g15.  This makes it
    pretty close to useless for MOV_INDIRECT.  There were already piles of
    restrictions preventing us from using it prior to Broadwell, so let's get
    rid of the gen8+ code path entirely.
    
    Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=97779
    Cc: "12.0 13.0" <mesa-stable@lists.freedesktop.org>
    Reviewed-by: Matt Turner <mattst88@gmail.com>

I've tagged it for stable so it should be in 13.0 and it may even get into a 12.0 stable release.
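
The g15 limit mentioned in the commit message follows from simple arithmetic. Below is a minimal sketch, assuming a 32-byte GRF and treating the 9-bit address immediate as an unsigned byte offset; the constant and function names are hypothetical and do not come from the Mesa source.

```c
#include <assert.h>

#define GRF_SIZE_BYTES 32u  /* assumption: one GRF register is 32 bytes */
#define ADDR_IMM_BITS  9u   /* width of the address immediate, per the commit */

/* Largest byte offset an unsigned field of the given width can hold. */
static unsigned max_byte_offset(unsigned field_bits)
{
    assert(field_bits > 0 && field_bits < 32);
    return (1u << field_bits) - 1u;
}

/* Highest GRF number whose first byte is still addressable. */
static unsigned highest_reachable_grf(unsigned field_bits)
{
    return max_byte_offset(field_bits) / GRF_SIZE_BYTES;
}

/* Nonzero if byte byte_in_grf of register grf fits in the immediate. */
static int fits_in_addr_imm(unsigned grf, unsigned byte_in_grf)
{
    assert(byte_in_grf < GRF_SIZE_BYTES);
    return grf * GRF_SIZE_BYTES + byte_in_grf <= max_byte_offset(ADDR_IMM_BITS);
}
```

Nine bits cover byte offsets 0 through 511, which is 16 registers of 32 bytes each (g0 to g15). Any uniform data placed at g16 or above cannot be described by the immediate alone under these assumptions.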

Comment 21 Diego Viola 2016-11-02 01:38:15 UTC

I can confirm this is fixed, thanks.

Arch Linux (x86-64)
mesa 13.0.0-1

Comment 22 regwz 2016-11-02 14:04:09 UTC

*** Bug 98412 has been marked as a duplicate of this bug. ***
