Bug 99977

Summary: [SNB iommu] GPU HANG: TLB page VTD translation generated an error
Product: DRI
Reporter: Jan Nordholz <jckn>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard: ReadyForDev
i915 platform: SNB
i915 features: GPU hang
Attachments (description, flags):
GPU error dump (flags: none)
kernel log (flags: none)
GPU error dump (drm-tip) (flags: none)
kernel lockdep error, drm-tip, unpatched (flags: none)
GPU error dump (v3.19) (flags: none)
dmesg (v3.19) (flags: none)
Force 64k alignment (flags: none)
dmesg (v4.11-rc2), hard freeze (flags: none)
GPU error dump (v4.11-rc2), hard freeze (flags: none)

Description Jan Nordholz 2017-02-27 00:51:40 UTC
Created attachment 129939 [details]
GPU error dump

Architecture: amd64
Machine: Thinkpad T520
Linux: 4.10.0-rc8git+ (vanilla master)
Distro: Debian Unstable
Mesa: 13.0.5-1 (distro)
libdrm: 2.4.74-1 (distro)
X intel driver: 2:2.99.917+git20161206-1 (distro)

Bug manifests reliably around 30sec after starting up the game and then every few minutes thereafter, until the machine finally locks up hard (as in "not-even-sysrq"-hard). GPU dump is attached, dmesg follows.
Comment 1 Jan Nordholz 2017-02-27 00:52:31 UTC
Created attachment 129940 [details]
kernel log
Comment 2 Jan Nordholz 2017-02-27 01:29:05 UTC
PS: The problem persists with the drm-tip Linux kernel. The first occurrence of the bug directly caused a hard lockup, so there is no debug output yet. I can keep reproducing it until I get my hands on a drm.debug=0x1e log, if desired.
Comment 3 Jan Nordholz 2017-02-27 01:41:54 UTC
Also persists with Mesa 17.
Comment 4 Chris Wilson 2017-02-27 09:20:23 UTC
Hmm, could you also attach the error state from drm-tip? There's an interesting error in there, and I wonder whether it is consistent.
Comment 5 Jan Nordholz 2017-02-27 17:27:11 UTC
Desperately trying to, but with drm-tip, the soft lock appears to be gone. Every encounter of the bug now (five out of five and counting) directly hard-freezes the machine. SysRq doesn't respond, but the kexec crashkernel isn't being activated either, so I'm currently at a loss how to retrieve the crash data...

Is there anything else I could do to approach the problem, or do you have any suggestions to capture my crash state?
Comment 6 Chris Wilson 2017-02-27 20:48:09 UTC
Hmm, missed that it immediately hard locked. A bit of a nuisance, but could you bisect the hard lockup?

Assuming it's not rogue hardware, hard lockups are usually broken locking (e.g. recursive irq-off spinlocks). Trying a build with lockdep enabled may help - at least to get a warning out.
Comment 7 Jan Nordholz 2017-02-27 21:09:54 UTC
Created attachment 129966 [details]
GPU error dump (drm-tip)

Have finally been able to survive one.
Comment 8 Jan Nordholz 2017-02-27 21:14:13 UTC
On a side note, in one of the freezes I encountered on the way, the laptop became completely unresponsive (as usual), but kept hammering bogus small packets out over eth0, flooding the network. Raised my suspicion that I'm experiencing memory corruption rather than just getting stuck in a lock...
Comment 9 Chris Wilson 2017-02-27 21:38:47 UTC
Ok, same error pops out:

ERROR: 0x00000012
    Context page GTT translation generated a fault (GTT entry not valid)
    TLB page VTD translation generated an error
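
For reference, the two decoded lines above match the two bits set in 0x00000012 (bits 1 and 4). Which bit carries which message is not stated in this report, so the following is only a minimal sketch that lists the set bits of a 32-bit register value. The helper name set_bits is hypothetical and not part of i915 or any decoding tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store the index of every set bit of `reg` in `out`
 * (lowest bit first) and return how many bits were set.
 */
static size_t set_bits(uint32_t reg, unsigned int out[32])
{
    size_t n = 0;

    assert(out != NULL);
    for (unsigned int bit = 0; bit < 32; bit++) {
        if (reg & (UINT32_C(1) << bit))
            out[n++] = bit;
    }
    return n;
}
```

Calling set_bits(0x00000012, bits) fills bits with the indices 1 and 4 and returns 2, one index per decoded message.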


Try:
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 6827a85408bf..0910f59dfddf 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -105,6 +105,7 @@ static int get_context_size(struct drm_i915_private *dev_priv)
        case 6:
                reg = I915_READ(CXT_SIZE);
                ret = GEN6_CXT_TOTAL_SIZE(reg) * 64;
+               ret = 18 * PAGE_SIZE;
                break;
        case 7:
                reg = I915_READ(GEN7_CXT_SIZE);

Something else to try is intel_iommu=igfx_off
Comment 10 Jan Nordholz 2017-02-27 21:51:52 UTC
Created attachment 129968 [details]
kernel lockdep error, drm-tip, unpatched

Rebuilt the drm-tip kernel with lockdep on - got this.
Comment 11 Jan Nordholz 2017-02-27 22:17:56 UTC
Freeze still occurs even with the suggested patch applied to drm-tip. (Hard lockup, will post a crash dump once I have one.) Setting intel_iommu=igfx_off on the other hand made the problem disappear.
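
When repeating this test it can help to confirm that the option really reached the kernel. Below is a rough sketch, assuming the command line has already been read into a string (for example from /proc/cmdline) and that intel_iommu= takes a comma-separated value list. The function name intel_iommu_has_option is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `cmdline` has an "intel_iommu="
 * parameter whose comma-separated values include exactly `opt`.
 */
static bool intel_iommu_has_option(const char *cmdline, const char *opt)
{
    static const char key[] = "intel_iommu=";
    size_t optlen;
    const char *p = cmdline;

    assert(cmdline != NULL && opt != NULL);
    optlen = strlen(opt);

    while ((p = strstr(p, key)) != NULL) {
        /* Accept the key only at the start of a space-separated token. */
        bool at_token_start = (p == cmdline) || (p[-1] == ' ');

        p += sizeof(key) - 1;
        if (!at_token_start)
            continue;

        /* Walk the comma-separated values up to the end of the token. */
        while (*p != '\0' && *p != ' ' && *p != '\n') {
            size_t len = strcspn(p, ", \n");

            if (len == optlen && strncmp(p, opt, len) == 0)
                return true;
            p += len;
            if (*p == ',')
                p++;
        }
    }
    return false;
}
```

The comparison is exact, so intel_iommu_has_option(cmdline, "igfx_off") matches "intel_iommu=on,igfx_off" but not a hyphenated spelling such as "igfx-off".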
Comment 12 Chris Wilson 2017-02-27 22:32:57 UTC
(In reply to Jan Nordholz from comment #10)
> Created attachment 129968 [details]
> kernel lockdep error, drm-tip, unpatched
> 
> Rebuilt the drm-tip kernel with lockdep on - got this.

Oh. I actually fixed that earlier and then wrote another patch with exactly the same problem.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4c645f8ab05d..561deab3aff6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -466,10 +466,11 @@ i915_gem_object_wait_reservation(struct reservation_object *resv,
        dma_fence_put(excl);
 
        if (prune_fences && !__read_seqcount_retry(&resv->seq, seq)) {
-               reservation_object_lock(resv, NULL);
-               if (!__read_seqcount_retry(&resv->seq, seq))
-                       reservation_object_add_excl_fence(resv, NULL);
-               reservation_object_unlock(resv);
+               if (reservation_object_trylock(resv, NULL)) {
+                       if (!__read_seqcount_retry(&resv->seq, seq))
+                               reservation_object_add_excl_fence(resv, NULL);
+                       reservation_object_unlock(resv);
+               }
        }
 
        return timeout;
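
The pattern in this patch (take the lock only if it is free, re-check the sequence count, then prune) can be sketched in userspace as follows. This is an analogy under stated assumptions, not kernel code: struct resv_sketch and its helpers are hypothetical stand-ins built on C11 atomics, and they do not model ww_mutex or the seqcount semantics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reservation object. */
struct resv_sketch {
    atomic_bool locked;      /* true while some caller holds the lock */
    unsigned int seq;        /* bumped whenever the fences change */
    bool has_excl_fence;     /* the exclusive fence we may prune */
};

/* Try to take the lock without blocking; true on success. */
static bool resv_trylock(struct resv_sketch *r)
{
    return !atomic_exchange(&r->locked, true);
}

static void resv_unlock(struct resv_sketch *r)
{
    atomic_store(&r->locked, false);
}

/*
 * Opportunistic pruning: skip silently if the lock is contended, and
 * prune only if nothing changed since `seq_seen` was sampled.
 */
static bool prune_excl_fence(struct resv_sketch *r, unsigned int seq_seen)
{
    bool pruned = false;

    assert(r != NULL);
    if (!resv_trylock(r))
        return false;

    if (r->seq == seq_seen) {
        r->has_excl_fence = false;
        pruned = true;
    }
    resv_unlock(r);
    return pruned;
}
```

Since pruning is only an optimisation, giving up when the lock is busy costs nothing, and the caller never blocks on a lock it might not be allowed to wait for in that context.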
Comment 13 Chris Wilson 2017-02-27 22:38:12 UTC
s/reservation_object_trylock(obj->resv, NULL)/reservation_object_trylock(obj->resv)/
Comment 14 Chris Wilson 2017-02-27 22:46:40 UTC
So assuming it is memory corruption with the GPU context state + iommu, do we have a good idea when it started? Assuming it is memory corruption, the hard hang will then just be an accident of clobbering something important, so bisecting its introduction is likely a wild goose chase. But if iommu used to work with this game and now doesn't, that might be interesting. (Assuming it doesn't bisect to some major feature enabling like turning on HW contexts, without which mesa simply won't work!)
Comment 15 Jan Nordholz 2017-02-27 23:02:23 UTC
Re #12:
Cannot find reservation_object_trylock() in drm-tip or in Linus' master. Extrapolated from the declarations of <reservation.h> and <ww_mutex.h> that the intention was (please correct if I'm wrong):

if (reservation_object_lock(resv, NULL) == 0) { /* lock successfully acquired */
  ...
}

Tested that, hard freeze persists.

Re #14:
Not sure, just pulled it out of the closet again. Played it years ago, but the kernels I ran back then didn't have IOMMU compiled in, if I remember correctly. I'll happily bisect this - what would be a good guess for how far to go back?
Comment 16 Chris Wilson 2017-02-27 23:20:09 UTC
reservation_object_trylock() hit drm-tip [via drm-misc] this morning. No, it won't stop the hang or hard lockup, since the patch it fixes is very recent.

If you have the opportunity, try something like kernel v3.19. v4.0 is the start of all the major changes (atomic modesetting, request handling) and so we are most likely to have broken something since.
Comment 17 Jan Nordholz 2017-02-28 03:26:28 UTC
Ah, ok. Pulled again and applied - will report if the lockdep problem disappears with that.

Having trouble booting vanilla v3.19, probably some hasty mistake rewinding my .config - will report back with the result of the bisection, but might take a little time.

Thank you for all your quick responses!
Comment 18 yann 2017-03-08 16:03:25 UTC
Reference to Chris' patchset: https://patchwork.freedesktop.org/series/20903/
Comment 19 Jan Nordholz 2017-03-09 14:09:53 UTC
The bug is also there in v3.19. Cannot check by myself whether it's the same bug, so I'm attaching dmesg and error dump again.
Comment 20 Jan Nordholz 2017-03-09 14:11:03 UTC
Created attachment 130144 [details]
GPU error dump (v3.19)
Comment 21 Jan Nordholz 2017-03-09 14:12:15 UTC
Created attachment 130145 [details]
dmesg (v3.19)
Comment 22 Chris Wilson 2017-03-09 14:24:04 UTC
Yup, same

ERROR: 0x00000012
    Context page GTT translation generated a fault (GTT entry not valid)
    TLB page VTD translation generated an error

We tried overallocating, let's also try aligning.
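
For scale: the overallocation from comment 9 fixes the context size at 18 * PAGE_SIZE, which is 73728 bytes with 4 KiB pages. The alignment patch itself is only attached, not quoted, so the following is merely a sketch of the generic power-of-two round-up that such alignment relies on. The names align_up, SKETCH_PAGE_SIZE and SKETCH_ALIGN_64K are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)       /* 4 KiB pages, as on x86-64 */
#define SKETCH_ALIGN_64K (UINT64_C(64) << 10) /* 64 KiB = 65536 bytes */

/* Round `value` up to the next multiple of `alignment` (a power of two). */
static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}
```

With these definitions, align_up(18 * SKETCH_PAGE_SIZE, SKETCH_ALIGN_64K) yields 131072, i.e. an 18-page object padded out to a 64 KiB boundary occupies two 64 KiB units.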
Comment 23 Chris Wilson 2017-03-09 14:26:25 UTC
Created attachment 130146 [details] [review]
Force 64k alignment
Comment 24 Jan Nordholz 2017-03-10 05:09:35 UTC
Still there, three hard freezes in a row. Strange, I didn't experience a single one when I was rewinding back to 3.19 (even tried a few versions along the way, 4.1-ish), always had the GPU error, but each time recoverable.

Now that I've found out that I had to downgrade my binutils to build pre-v4.1 kernels, I'll try to find out whether there's a specific point when the hard freezes started appearing...

I'll also happily try out all patches you're coming up with.
Comment 25 Jan Nordholz 2017-03-16 00:33:55 UTC
Different approach: with some small modifications to push the i915 error state into memory in human-readable form and then out of the machine using firewire DMA, I was able to get my hands on a GPU error that seconds later led to a hard freeze. Thus we can at least check whether the two forms of GPU hang I'm experiencing (recoverable and non-recoverable) are actually the same error.

This one is now from a vanilla 4.11-rc2 without your patches, but I can repeat again on a recent (and patched) drm-tip if that's helpful.
Comment 26 Jan Nordholz 2017-03-16 00:34:57 UTC
Created attachment 130247 [details]
dmesg (v4.11-rc2), hard freeze
Comment 27 Jan Nordholz 2017-03-16 00:35:27 UTC
Created attachment 130248 [details]
GPU error dump (v4.11-rc2), hard freeze
Comment 28 Elizabeth 2017-06-26 21:33:49 UTC
Good afternoon,
Is this bug still valid? Is the problem still present on the latest kernel release? Thank you.
Comment 29 Elizabeth 2017-08-24 19:59:31 UTC
Hello Jan, 
Could you please confirm whether this is still reproducible on the latest commit of https://cgit.freedesktop.org/drm-tip (4.13 and up)?
Comment 30 Jan Nordholz 2017-08-28 15:21:09 UTC
Hi Elizabeth,

thanks for the ping - I didn't investigate this problem for quite a while. I can confirm that with current drm-tip (c52f53226) the issue appears to be gone.
Comment 31 Elizabeth 2017-08-28 17:36:05 UTC
(In reply to Jan Nordholz from comment #30)
> Hi Elizabeth,
> 
> thanks for the ping - I didn't investigate this problem for quite a while. I
> can confirm that with current drm-tip (c52f53226) the issue appears to be
> gone.
Thanks for the information Jan. Closing the bug then.
