Bug 94924 - [GEN8] Unigine Valley fails to run due to "intel_do_flush_locked failed: Input/output error"
Summary: [GEN8] Unigine Valley fails to run due to "intel_do_flush_locked failed: Input...
Status: VERIFIED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Matt Turner
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-13 16:33 UTC by Eero Tamminen
Modified: 2017-08-18 08:13 UTC

See Also:
i915 platform:
i915 features:


Attachments
Kernel crash dump (from Ubuntu 4.4 kernel) (557.83 KB, text/plain)
2016-04-13 16:37 UTC, Eero Tamminen
Kernel dmesg (from Ubuntu 4.4 kernel) (60.66 KB, text/plain)
2016-04-13 16:43 UTC, Eero Tamminen
debugging patch (2.27 KB, patch)
2016-04-18 22:28 UTC, Matt Turner

Description Eero Tamminen 2016-04-13 16:33:38 UTC
Unigine Valley fails to run with the latest Mesa due to "intel_do_flush_locked failed: Input/output error".

This started to happen on BDW/BSW/SKL around the 31st of March, but it doesn't happen on IVB/HSW/BYT.

Automated bisecting on BDW & SKL pointed to the same commit as the trigger:
---------------------------------------------------------
commit b4e223cfbf4d46e2ca4c7313f4ebd52798d21551
Author: Matt Turner <mattst88@gmail.com>
Date:   Mon Feb 15 10:43:39 2016 -0800

    i965: Remove NOP insertion kludge in scheduler.
    
    Instead of removing every instruction in add_insts_from_block(), just
    move the instruction to its scheduled location. This is a step towards
    doing both bottom-up and top-down scheduling without conflicts.
    
    Note that this patch changes cycle counts for programs because it begins
    including control flow instructions in the estimates.
    
    Reviewed-by: Francisco Jerez <currojerez@riseup.net>
---------------------------------------------------------

This happens both on:
- Ubuntu 15.10 with DRI3 + latest 4.6-rc kernel & latest X, and
- Ubuntu 16.04 with DRI2 + Ubuntu kernel (i.e. 4.4 + 4.6 i915 forklift) & X

(There have been similar issues with the SynMark Multithread test, but those aren't 100% reproducible like this one.)
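
For reference, an automated bisection like the one described above can be driven by "git bisect run" with a small predicate script. The following is a minimal sketch, not the script that was actually used: the benchmark command line and the timeout are placeholders (assumptions), and only the error string comes from this report. "git bisect run" treats exit status 0 as good, 125 as "skip this commit", and any other status from 1 to 127 as bad.

```python
#!/usr/bin/env python3
"""Predicate sketch for `git bisect run`: returns 0 (good), 1 (bad) or 125 (skip)."""
import subprocess
import sys
from typing import List

# Error string quoted in this report.
ERROR_MARKER = "intel_do_flush_locked failed"

GOOD, BAD, SKIP = 0, 1, 125


def classify(output: str, returncode: int) -> int:
    """Map the result of one benchmark run to a `git bisect run` exit status."""
    if ERROR_MARKER in output:
        return BAD
    if returncode != 0:
        # Failed for some other reason: do not blame this commit.
        return SKIP
    return GOOD


def run_once(command: List[str], timeout_s: int = 300) -> int:
    """Run the (hypothetical) benchmark command and classify its output."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # Could not start the benchmark, or it never finished: skip.
        return SKIP
    return classify(proc.stdout + proc.stderr, proc.returncode)


def main(argv: List[str]) -> int:
    """argv is the benchmark command line, e.g. a wrapper that starts Valley."""
    if not argv:
        return SKIP
    return run_once(argv)
```

To use it as a bisect predicate, the script would end with sys.exit(main(sys.argv[1:])) and be started after building and installing Mesa at the revision under test. That build step is left out here.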
Comment 1 Eero Tamminen 2016-04-13 16:37:14 UTC
Created attachment 122900
Kernel crash dump (from Ubuntu 4.4 kernel)

There seem to have been several (recovered) GPU hangs before the program exits.

The issue happens both when Valley runs fullscreen and when it runs windowed.
Comment 2 Eero Tamminen 2016-04-13 16:43:01 UTC
Created attachment 122901
Kernel dmesg (from Ubuntu 4.4 kernel)
Comment 3 Eero Tamminen 2016-04-13 16:46:11 UTC
> Ubuntu 16.04 with DRI2 + Ubuntu kernel (i.e. 4.4 + 4.6 i915 forklift) & X

Sorry, this test and the attached files were from the older 4.4 Ubuntu kernel, which doesn't yet have the 4.6 i915 forklift. If needed, I can later provide the same data from the latest upstream 4.6-rc.
Comment 4 Mark Janes 2016-04-14 23:11:02 UTC
I've reproduced this on SKL with Linux 4.5.

Reverting the bisected commit does not fix the GPU hang.
Comment 5 Mark Janes 2016-04-15 01:08:56 UTC
The proper bisection is:

commit 7b208a731277b4b99b86af3df98c1219099036d7
Author: Matt Turner <mattst88@gmail.com>
Date:   Mon Feb 15 10:05:33 2016 -0800

    i965: Relax restriction on scheduling last instruction.
    
    I think when this code was written, basic blocks were always ended by a
    control flow instruction or an end-of-thread message. That's no longer
    the case, and removing this restriction actually helps things:
    
       instructions in affected programs: 7267 -> 7244 (-0.32%)
       helped: 4
    
       total cycles in shared programs: 66559580 -> 66431900 (-0.19%)
       cycles in affected programs: 28310152 -> 28182472 (-0.45%)
       helped: 9577
       HURT: 879
    
       GAINED: 2
    
    The addition of the is_control_flow() checks is not a functional change,
    since the add_insts_from_block() does not put them in the list of
    instructions to schedule. I plan to change this in a later patch.
    
    Reviewed-by: Francisco Jerez <currojerez@riseup.net>
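
The percentages in the shader-db summary quoted above are relative changes, i.e. (after - before) / before. A quick check in Python, using only the numbers from the commit message (the helper name is made up for this illustration):

```python
def relative_change_pct(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the commit message.
stats = {
    "instructions in affected programs": (7267, 7244),
    "total cycles in shared programs": (66559580, 66431900),
    "cycles in affected programs": (28310152, 28182472),
}

for name, (before, after) in stats.items():
    pct = relative_change_pct(before, after)
    print(f"{name}: {before} -> {after} ({pct:.2f}%)")
```

This reproduces the -0.32%, -0.19% and -0.45% figures from the commit message.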
Comment 6 Matt Turner 2016-04-18 22:28:20 UTC
Created attachment 123035
debugging patch

I can't reproduce on Haswell, so my hypothesis is that it's a problem in the SIMD8 vertex shader (which is BDW+).

I've diff'd the BDW vertex shaders before and after the commit Mark bisected it to, but I don't see anything that looks wrong.

Attached is a debugging patch I used to verify some invariants (programs end with EOT, blocks that start/end with control flow continue to do so after scheduling). It didn't demonstrate any problems with the Unigine Valley shaders in shader-db.
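
The invariants mentioned above can be sketched outside the driver as follows. This is not the attached patch: the instruction and block representation, the opcode names in CONTROL_FLOW, and all helper names are simplified assumptions made for illustration. The sketch checks that a program ends with an EOT instruction and that a block which starts or ends with a control flow instruction still does so, with the same instruction, after scheduling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# Assumed opcode names, not the driver's own enum.
CONTROL_FLOW = {"IF", "ELSE", "ENDIF", "DO", "WHILE", "BREAK", "CONTINUE"}


@dataclass(frozen=True)
class Inst:
    opcode: str
    eot: bool = False  # True for the end-of-thread message


Block = List[Inst]


def is_control_flow(inst: Inst) -> bool:
    return inst.opcode in CONTROL_FLOW


def ends_with_eot(program: List[Block]) -> bool:
    """The very last instruction of the program must be the EOT message."""
    insts = [inst for block in program for inst in block]
    return bool(insts) and insts[-1].eot


def control_flow_preserved(before: Block, after: Block) -> bool:
    """Scheduling may reorder a block, but must keep the same instructions
    and must not move a leading or trailing control flow instruction."""
    if Counter(before) != Counter(after):
        return False  # instructions were added or dropped
    if not before:
        return True
    for pos in (0, -1):
        if is_control_flow(before[pos]) and after[pos] != before[pos]:
            return False
    return True


def check_schedule(before: List[Block], after: List[Block]) -> List[str]:
    """Return a list of human-readable invariant violations (empty if none)."""
    errors = []
    if not ends_with_eot(after):
        errors.append("program does not end with EOT")
    if len(before) != len(after):
        errors.append("number of blocks changed")
        return errors
    for idx, (b, a) in enumerate(zip(before, after)):
        if not control_flow_preserved(b, a):
            errors.append(f"block {idx}: control flow moved or instructions changed")
    return errors
```

An empty result from check_schedule() means only that these particular invariants hold. As noted above, the real patch did not flag any problem with the Valley shaders either.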
Comment 7 Kenneth Graunke 2016-04-21 02:15:17 UTC
Is this Valley 1.0 or 1.1rc1?
Comment 8 Mark Janes 2016-04-21 02:34:48 UTC
I reproduced with v1.0.
Comment 9 Eero Tamminen 2016-05-02 11:04:30 UTC
I also only have results for Valley v1.0.

Since ~28th of April, Valley has started to work again on SKL (at least on GT2; I don't have data on the others), but it still doesn't work on BSW or BDW (GT2 & GT3).

All this time, Valley has worked fine on SNB/IVB/BYT/HSW.
Comment 10 Matt Turner 2016-05-03 06:33:06 UTC
Patch sent: "i965/fs: Follow pow(16) instructions with a NOP."
Comment 11 Matt Turner 2016-05-05 18:39:15 UTC
Fixed by

commit f01d92f4734a7ca62926dceda1d004c0cb10548c
Author: Matt Turner <mattst88@gmail.com>
Date:   Mon May 2 23:32:13 2016 -0700

    i965/fs: Don't follow pow with an instruction with two dest regs.
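
The restriction named in the commit subject can be pictured as a check over a linear instruction stream. This is only an illustrative sketch, not the code from the commit: the Inst record, its regs_written field, and the opcode strings are assumptions, and how the actual commit enforces the restriction is not shown in this report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Inst:
    opcode: str
    regs_written: int = 1  # number of destination registers (assumed field)


def pow_hazards(insts: List[Inst]) -> List[int]:
    """Return the indices of instructions that write two registers and
    directly follow a POW instruction."""
    return [
        i
        for i in range(1, len(insts))
        if insts[i - 1].opcode == "POW" and insts[i].regs_written == 2
    ]


stream = [Inst("MUL"), Inst("POW"), Inst("ADD", regs_written=2), Inst("MOV")]
print(pow_hazards(stream))  # [2]
```

A stream for which pow_hazards() returns an empty list satisfies the restriction as worded in the commit subject.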
Comment 12 Eero Tamminen 2016-05-09 12:54:55 UTC
* Valley started working on SKL earlier already, but I guess that was due to unrelated instruction scheduling / timing changes
* Valley now works on BDW, BSW & KBL, where it didn't work before

-> Verified

(My only concern is that Valley still failed once on a BDW i7 after this fix. I'm going to ignore that for now.)

