Summary: | [SNB iommu] GPU HANG: TLB page VTD translation generated an error | |
---|---|---|---
Product: | DRI | Reporter: | Jan Nordholz <jckn>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | medium | CC: | intel-gfx-bugs
Version: | unspecified | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |
Whiteboard: | ReadyForDev | |
i915 platform: | SNB | i915 features: | GPU hang
Created attachment 129940 [details]
kernel log
PS: Problem persists with drm-tip Linux kernel. The first occurrence of the bug directly caused a hard lockup, so no debug output though. Can reproduce until I have my hands on a drm.debug=0x1e log if desired. Also persists with Mesa 17.

Hmm, could you also attach the error state from drm-tip? There's an interesting error in there that I wonder if it is consistent.

Desperately trying to, but with drm-tip, the soft lock appears to be gone. Every encounter of the bug now (five out of five and counting) directly hard-freezes the machine. SysRq doesn't respond, but the kexec crashkernel isn't being activated either, so I'm currently at a loss how to retrieve the crash data... Is there anything else I could do to approach the problem, or do you have any suggestions to capture my crash state?

Hmm, missed that it immediately hard locked. A bit of a nuisance, but could you bisect the hard lockup? Assuming it's not rogue hardware, hard lockups are usually broken locking (e.g. recursive irq-off spinlocks). Trying a build with lockdep enabled may help - at least to get a warning out.

Created attachment 129966 [details]
GPU error dump (drm-tip)
Have finally been able to survive one.
On a side note, in one of the freezes I encountered on the way, the laptop became completely unresponsive (as usual), but kept hammering bogus small packets out over eth0, flooding the network. Raised my suspicion that I'm experiencing memory corruption rather than just getting stuck in a lock...

Ok, same error pops out:

ERROR: 0x00000012
    Context page GTT translation generated a fault (GTT entry not valid)
    TLB page VTD translation generated an error

Try:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 6827a85408bf..0910f59dfddf 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -105,6 +105,7 @@ static int get_context_size(struct drm_i915_private *dev_priv)
 	case 6:
 		reg = I915_READ(CXT_SIZE);
 		ret = GEN6_CXT_TOTAL_SIZE(reg) * 64;
+		ret = 18 * PAGE_SIZE;
 		break;
 	case 7:
 		reg = I915_READ(GEN7_CXT_SIZE);

Something else to try is intel_iommu=igfx_off

Created attachment 129968 [details]
kernel lockdep error, drm-tip, unpatched
Rebuilt the drm-tip kernel with lockdep on - got this.
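For readers following the overallocation experiment: the suggested hunk replaces the register-derived context size (a count multiplied by 64) with a fixed `18 * PAGE_SIZE`. The sketch below only works through that arithmetic in userspace C. It assumes a 4096-byte page, and the helper names (`ctx_units_to_bytes`, `bytes_to_pages`, `overallocated_ctx_bytes`) are hypothetical, not i915 functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4096-byte pages, as is usual on x86-64. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/* Hypothetical helper: convert a size given in 64-byte units to bytes,
 * mirroring the "* 64" in the quoted get_context_size() hunk. */
size_t ctx_units_to_bytes(size_t units)
{
    assert(units <= SIZE_MAX / 64u); /* guard against overflow */
    return units * 64u;
}

/* Hypothetical helper: round a byte count up to whole pages. */
size_t bytes_to_pages(size_t bytes)
{
    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* The fixed size used by the experimental patch: 18 pages. */
size_t overallocated_ctx_bytes(void)
{
    return 18u * SKETCH_PAGE_SIZE;
}
```

Under these assumptions the fixed size comes to 73728 bytes, so any register-reported size of up to 1152 units of 64 bytes would fit inside it.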
Freeze still occurs even with the suggested patch applied to drm-tip. (Hard lockup, will post a crash dump once I have one.) Setting intel_iommu=igfx_off on the other hand made the problem disappear.

(In reply to Jan Nordholz from comment #10)
> Created attachment 129968 [details]
> kernel lockdep error, drm-tip, unpatched
>
> Rebuilt the drm-tip kernel with lockdep on - got this.

Oh. I actually fixed that earlier and then wrote another patch with exactly the same problem.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4c645f8ab05d..561deab3aff6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -466,10 +466,11 @@ i915_gem_object_wait_reservation(struct reservation_object *resv,
 		dma_fence_put(excl);
 
 	if (prune_fences && !__read_seqcount_retry(&resv->seq, seq)) {
-		reservation_object_lock(resv, NULL);
-		if (!__read_seqcount_retry(&resv->seq, seq))
-			reservation_object_add_excl_fence(resv, NULL);
-		reservation_object_unlock(resv);
+		if (reservation_object_trylock(resv, NULL)) {
+			if (!__read_seqcount_retry(&resv->seq, seq))
+				reservation_object_add_excl_fence(resv, NULL);
+			reservation_object_unlock(resv);
+		}
 	}
 
 	return timeout;

s/reservation_object_trylock(obj->resv, NULL)/reservation_object_trylock(obj->resv)/

So assuming it is memcorruption with the GPU context state + iommu, do we have a good idea when it started? Assuming it is memcorruption, the hard hang will then just be an accident on clobbering something important, bisecting its introduction is likely a wild goose chase. But if iommu used to work with this game and now doesn't, that might be interesting. (Assuming it doesn't bisect to some major feature enabling like turning on HW contexts without which mesa simply won't work!)

Re #12: Cannot find reservation_object_trylock() in drm-tip or in Linus' master.
Extrapolated from the declarations of <reservation.h> and <ww_mutex.h> that the intention was (please correct if I'm wrong):

if (reservation_object_lock(resv, NULL) == 0) {
	/* lock successfully acquired */
	...
}

Tested that, hard freeze persists.

Re #14: Not sure, just pulled it out of the closet again. Played it years ago, but the kernels I ran back then didn't have IOMMU compiled in, if I remember correctly. I'll happily bisect this - what would be a good guess for how far to go back?

reservation_object_trylock() hit drm-tip [via drm-misc] this morning. No, it won't stop the hang or hard lockup, since the patch it fixes is very recent.

If you have the opportunity, try something like kernel v3.19. v4.0 is the start of all the major changes (atomic modesetting, request handling) and so we are most likely to have broken something since.

Ah, ok. Pulled again and applied - will report if the lockdep problem disappears with that. Having trouble booting vanilla v3.19, probably some hasty mistake rewinding my .config - will report back with the result of the bisection, but might take a little time. Thank you for all your quick responses!

Reference to Chris' patchset: https://patchwork.freedesktop.org/series/20903/

The bug is also there in v3.19. Cannot check by myself whether it's the same bug, so I'm attaching dmesg and error dump again.

Created attachment 130144 [details]
GPU error dump (v3.19)
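As a side illustration of the lockdep fix discussed above: the patch turns an unconditional lock around an optional cleanup step into a trylock, so the cleanup is simply skipped when the lock is already held. The following is a minimal userspace sketch of that pattern, not the i915 code. It uses a C11 `atomic_flag` as a stand-in lock, and every name in it (`sketch_trylock`, `fence_state`, `prune_if_uncontended`, and so on) is hypothetical. It also assumes, as the kernel headers of that period appear to define it, that the blocking lock call returns 0 on success while the trylock variant returns true on success, which is why the `== 0` test does not carry over.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the reservation lock. */
struct sketch_lock {
    atomic_flag held;
};

/* Returns true when the lock was acquired (trylock-style convention). */
bool sketch_trylock(struct sketch_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void sketch_unlock(struct sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical state: a sequence number plus one prunable stale fence. */
struct fence_state {
    struct sketch_lock lock;
    unsigned seq;
    bool has_stale_fence;
};

void fence_state_init(struct fence_state *st, unsigned seq, bool stale)
{
    assert(st != NULL);
    atomic_flag_clear(&st->lock.held);
    st->seq = seq;
    st->has_stale_fence = stale;
}

/* Opportunistic cleanup: prune only if the lock is free and nothing
 * changed since the caller sampled the sequence number. Never blocks. */
bool prune_if_uncontended(struct fence_state *st, unsigned seen_seq)
{
    bool pruned = false;

    if (!sketch_trylock(&st->lock))
        return false; /* contended: skip the optional work */

    if (st->seq == seen_seq && st->has_stale_fence) {
        st->has_stale_fence = false;
        pruned = true;
    }
    sketch_unlock(&st->lock);
    return pruned;
}
```

The design point is that pruning is an optimisation, so skipping it under contention costs nothing in correctness and avoids taking the lock from a context that may already hold it.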
Created attachment 130145 [details]
dmesg (v3.19)
Yup, same

ERROR: 0x00000012
    Context page GTT translation generated a fault (GTT entry not valid)
    TLB page VTD translation generated an error

We tried overallocating, let's also try aligning.

Created attachment 130146 [details] [review]
Force 64k alignment

Still there, three hard freezes in a row. Strange, I didn't experience a single one when I was rewinding back to 3.19 (even tried a few versions along the way, 4.1-ish), always had the GPU error, but each time recoverable. Now that I've found out that I had to downgrade my binutils to build pre-v4.1 kernels, I'll try to find out whether there's a specific point when the hard freezes started appearing... I'll also happily try out all patches you're coming up with.

Different approach: with some small modifications to push the i915 error state into memory in human-readable form and then out of the machine using firewire DMA, I was able to get my hands on a GPU error that seconds later led to a hard freeze. Thus we can at least check whether the two forms of GPU hang I'm experiencing (recoverable and non-recoverable) are actually the same error. This one is now from a vanilla 4.11-rc2 without your patches, but I can repeat again on a recent (and patched) drm-tip if that's helpful.

Created attachment 130247 [details]
dmesg (v4.11-rc2), hard freeze
Created attachment 130248 [details]
GPU error dump (v4.11-rc2), hard freeze
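The "Force 64k alignment" attachment is not reproduced in this thread, so the following is only a sketch of what 64 KiB alignment means arithmetically, not a reconstruction of that patch. The names `SKETCH_ALIGN_64K`, `is_aligned_64k` and `align_up_64k` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 64 KiB, a power of two, so alignment can be done with bit masks. */
#define SKETCH_ALIGN_64K ((uint64_t)64 * 1024)

/* True when the offset sits on a 64 KiB boundary. */
bool is_aligned_64k(uint64_t offset)
{
    return (offset & (SKETCH_ALIGN_64K - 1)) == 0;
}

/* Round an offset up to the next 64 KiB boundary. */
uint64_t align_up_64k(uint64_t offset)
{
    assert(offset <= UINT64_MAX - (SKETCH_ALIGN_64K - 1)); /* no overflow */
    return (offset + SKETCH_ALIGN_64K - 1) & ~(SKETCH_ALIGN_64K - 1);
}
```

For example, an offset of 0x12000 (18 pages of 4096 bytes, the size tried in the earlier overallocation patch) is not 64 KiB aligned, and rounds up to 0x20000.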
Good afternoon,

Is this bug still valid? Is the problem still present on the latest kernel release? Thank you.

Hello Jan,

Could you please confirm whether this is still reproducible on the latest commit of https://cgit.freedesktop.org/drm-tip, 4.13 and up?

Hi Elizabeth,

thanks for the ping - I haven't investigated this problem for quite a while. I can confirm that with current drm-tip (c52f53226) the issue appears to be gone.

(In reply to Jan Nordholz from comment #30)
> Hi Elizabeth,
>
> thanks for the ping - I didn't investigate this problem for quite a while. I
> can confirm that with current drm-tip (c52f53226) the issue appears to be
> gone.

Thanks for the information Jan. Closing the bug then.
Created attachment 129939 [details]
GPU error dump

Architecture: amd64
Machine: Thinkpad T520
Linux: 4.10.0-rc8git+ (vanilla master)
Distro: Debian Unstable
Mesa: 13.0.5-1 (distro)
libdrm: 2.4.74-1 (distro)
X intel driver: 2:2.99.917+git20161206-1 (distro)

Bug manifests reliably around 30sec after starting up the game and then every few minutes thereafter, until the machine finally locks up hard (as in "not-even-sysrq"-hard). GPU dump is attached, dmesg follows.