Bug 60391 - [ilk regression] 3.7.x corrupt console image, hard hang starting X
Summary: [ilk regression] 3.7.x corrupt console image, hard hang starting X
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium blocker
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
 
Reported: 2013-02-07 02:49 UTC by Nathan Myers
Modified: 2017-07-24 22:58 UTC (History)

Attachments
screen image, blurred but maybe better than nothing (868.71 KB, image/jpeg)
2013-02-12 04:52 UTC, Nathan Myers
git bisect log (3.93 KB, text/plain)
2013-02-12 13:40 UTC, Nathan Myers
dmesg output (42.04 KB, text/plain)
2013-02-12 22:01 UTC, Nathan Myers
3.7.7 .config (95.22 KB, text/plain)
2013-02-12 22:02 UTC, Nathan Myers
Disable WC PTE updates for ILK VTd (3.14 KB, patch)
2013-02-13 09:28 UTC, Chris Wilson

Description Nathan Myers 2013-02-07 02:49:47 UTC
On booting a 3.7.x kernel (tried 3.7.[45]), when the display switches from VGA to HD DRI mode, small fragments of the upper left part of the image are scattered and duplicated over the full screen.  The fragments are about 1/2 of a text line high, and maybe 1/10 screen wide.  After boot, I can see just enough to recognize a prompt, and blind login/halt succeeds.  All the image fragments update in sync.

On starting X, the machine locks up solid, showing staggered color stripes maybe 1/10 screen high, with no response to any input or ping.  No errors are preserved in Xorg.log on reboot.

Kernels 3.[23456].x have been fine, incl. current 3.6.10.

Dell Latitude E6510 1920x1080, intel M520 2.4GHz, i915 Arrandale.
Modules i915,drm,drm_kms_helper,intel_gtt,intel_agp,agpgart,i2c_algo_bit
No boot options are set
Comment 1 Chris Wilson 2013-02-07 09:17:14 UTC
Can you please try bisecting between 3.6.10 and 3.7.4? That would most likely be the quickest method to isolate the cause.
Comment 2 Daniel Vetter 2013-02-07 09:27:36 UTC
In addition to the bisect, a screenshot (with a camera) would be interesting.
Comment 3 Nathan Myers 2013-02-08 10:30:46 UTC
Working...
Comment 4 Nathan Myers 2013-02-12 04:52:48 UTC
Created attachment 74664 [details]
screen image, blurred but maybe better than nothing

That screen image, attached...

Looking again, it seems more as if the display driver and the renderer disagree on the scan-line stride, but in a way that text lines often get several aligned scan lines for some distance.

I bisected (first time!) a half-dozen cycles between 3.6.10 and 3.7.4, but it was straying off into 3.6.10-preX.  Probably my last "bad" assertion was some other bug.  Picking up with a shortened bisect log and a new path...
Comment 5 Nathan Myers 2013-02-12 04:55:33 UTC
Sorry, that was "3.6.0-preX".
Comment 6 Daniel Vetter 2013-02-12 12:22:45 UTC
Can you try another screenshot, sharper? To figure out what's broken I need to do pixel-counting of how exactly things moved around, which isn't possible with yours. Maybe put the camera on a stand to avoid shaking, and if the pixels still aren't sharp enough, maybe also do a shot of the top-left corner only.
Comment 7 Nathan Myers 2013-02-12 13:40:38 UTC
Created attachment 74678 [details]
git bisect log

I will get another pic.  In the meantime, a bisect log.  (How could it take so many more than log2(n) steps?!)  Tested by booting each build with init=/bin/bash, then "modprobe i915".  Verified X starts OK on the last version before the failure.
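
On the log2(n) question, a rough sanity check is to estimate how many steps a range of n commits should need under plain repeated halving. This is only a sketch: the names `bisect_steps` and `bisect_steps_for_range` are made up for illustration, and the halving model is an assumption. On non-linear history git can need extra steps, for example when it first has to test a merge base because the good revision is not an ancestor of the bad one.

```shell
#!/bin/sh
# Estimate ceil(log2(n)): the number of halvings needed to narrow
# n candidate commits down to one.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Hypothetical helper: count the commits reachable from the bad revision
# but not from the good one, then estimate the steps for that range.
# Must be run inside a git tree that contains both revisions.
bisect_steps_for_range() {
    good=$1
    bad=$2
    count=$(git rev-list --count "$good..$bad") || return 1
    bisect_steps "$count"
}
```

For example, `bisect_steps 1000` prints 10. Running `bisect_steps_for_range v3.6.10 v3.7.4` in a kernel tree (assuming both tags are present) would give the estimate for the range bisected here.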
Comment 8 Chris Wilson 2013-02-12 13:45:53 UTC
Can you revert that patch by applying

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index 207e5c3..0a1d654 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -601,7 +601,7 @@ static int intel_gtt_init(void)
        gtt_map_size = intel_private.gtt_total_entries * 4;
 
        intel_private.gtt = NULL;
-       if (INTEL_GTT_GEN < 6 && INTEL_GTT_GEN > 2)
+       if (INTEL_GTT_GEN < 6 && INTEL_GTT_GEN > 2 && 0)
                intel_private.gtt = ioremap_wc(intel_private.gtt_bus_addr,
                                               gtt_map_size);
        if (intel_private.gtt == NULL)

to your most recent release (or drm-intel-next/-nightly)?
Comment 9 Chris Wilson 2013-02-12 13:46:39 UTC
Anything unusual about your Ironlake, e.g. VT-d enabled?
Comment 10 Nathan Myers 2013-02-12 14:45:24 UTC
Yes, that fixes it.

I don't know of anything unusual about this hardware, except I think the 1920x1080 LCD is not very common on this model.  

In dmesg I see a line "PCI-DMA: Intel(R) Virtualization Technology for Directed I/O".  In .config, I seem to have "CONFIG_VIRT_TO_BUS" and "CONFIG_HAVE_KVM" turned on, but CONFIG_VIRTUALIZATION off.
Comment 11 Chris Wilson 2013-02-12 14:49:44 UTC
Hmm, definitely entering into Ironlake errata territory.

How about "grep IOMMU /boot/config-`uname -r`" ?
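
If the grep output is noisy, a narrower check might look like the sketch below. It only inspects the two Intel IOMMU options; the function name `intel_iommu_state` and its output strings are invented here, and the config file path is an assumption about where the distribution installs it.

```shell
#!/bin/sh
# Classify a kernel config file by its Intel IOMMU settings.
# $1: path to the config file, e.g. /boot/config-$(uname -r)
# Note: this reflects build-time defaults only; intel_iommu=on or
# intel_iommu=off on the kernel command line can override the default.
intel_iommu_state() {
    cfg=$1
    if grep -q '^CONFIG_INTEL_IOMMU=y' "$cfg"; then
        if grep -q '^CONFIG_INTEL_IOMMU_DEFAULT_ON=y' "$cfg"; then
            echo "enabled-by-default"
        else
            echo "built-off-by-default"
        fi
    else
        echo "not-built"
    fi
}
```

Usage would be something like `intel_iommu_state "/boot/config-$(uname -r)"`.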
Comment 12 Daniel Vetter 2013-02-12 14:58:20 UTC
Complete boot dmesg would be useful in general I think.
Comment 13 Nathan Myers 2013-02-12 22:01:25 UTC
Created attachment 74713 [details]
dmesg output

No IOMMU in uname.

# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
# CONFIG_AMD_IOMMU is not set
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IOMMU_STRESS is not set

dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
IOMMU 1 0xfed91000: using Register based invalidation
IOMMU 0 0xfed90000: using Register based invalidation
IOMMU 2 0xfed93000: using Register based invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:00:02.0 [0xbdc00000 - 0xbfffffff]
IOMMU: Setting identity map for device 0000:00:1d.0 [0xbb7d7000 - 0xbb7e6fff]
IOMMU: Setting identity map for device 0000:00:1a.0 [0xbb7d7000 - 0xbb7e6fff]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
Comment 14 Nathan Myers 2013-02-12 22:02:16 UTC
Created attachment 74714 [details]
3.7.7 .config

Should I try building with any different config settings?
Comment 15 Chris Wilson 2013-02-12 22:05:48 UTC
(In reply to comment #14)
> Created attachment 74714 [details]
> 3.7.7 .config
> 
> Should I try building with any different config settings?

Yes. Please try disabling CONFIG_IOMMU_SUPPORT. Look for "IOMMU Hardware Support" in config (under Device Drivers in menuconfig).
Comment 16 Nathan Myers 2013-02-12 22:46:07 UTC
Is turning off IOMMU just a diagnostic exercise, or is it
usually a better choice?
Comment 17 Nathan Myers 2013-02-12 23:11:20 UTC
A build of stock 3.7.7 (i.e. w/o the patch from #8) with
IOMMU disabled boots and runs X normally.
Comment 18 Chris Wilson 2013-02-13 09:13:08 UTC
(In reply to comment #16)
> Is turning off IOMMU just a diagnostic exercise, or is it
> usually a better choice?

Better choice. At least if you are using Intel graphics since there were a few errata in the silicon that prevent it from functioning properly (and the workarounds we have are to stall the GPU every time we update its page tables), and I think we've found another one.
Comment 19 Chris Wilson 2013-02-13 09:28:13 UTC
Created attachment 74737 [details] [review]
Disable WC PTE updates for ILK VTd

Can you please test this patch and report yay-or-nay on the mailing list? (I've cc'ed you on that patch)
Comment 20 Daniel Vetter 2013-02-13 09:55:37 UTC
We need to check one more thing: Please test an IOMMU-enabled kernel, both with and without the patch, with intel_iommu=igfx_off added to the kernel cmdline.
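
To confirm which mode a given boot actually used, the command line can be checked after booting. A minimal sketch, assuming /proc/cmdline is readable; `has_igfx_off` is a hypothetical helper name, and it treats the value of intel_iommu= as a comma-separated list of options.

```shell
#!/bin/sh
# Print "yes" if intel_iommu=...igfx_off... appears on the kernel
# command line, "no" otherwise.
# $1 (optional): a command line string to check instead of the
# running kernel's /proc/cmdline.
has_igfx_off() {
    cmdline=${1-$(cat /proc/cmdline)}
    for arg in $cmdline; do
        case $arg in
            intel_iommu=*)
                # Wrap the value in commas so each option matches whole.
                case ",${arg#intel_iommu=}," in
                    *,igfx_off,*) echo yes; return 0 ;;
                esac
                ;;
        esac
    done
    echo no
    return 1
}
```

Calling `has_igfx_off` with no argument checks the running kernel, so the result can be recorded alongside each test boot.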
Comment 21 Nathan Myers 2013-02-16 09:46:56 UTC
I built stock 3.7.7 with IOMMU turned on, and patched, and 
booted with and without the suggested option.  It booted to
full X both times.  Without the patch, and built with IOMMU on,
it fails reliably.  Without the patch, and without IOMMU, it
boots and runs X with no apparent problems.

IOMMU turned on means:

# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
# CONFIG_AMD_IOMMU is not set
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IOMMU_STRESS is not set

This is not to say there are no problems.  The 3.6 and 3.7
kernels are prone to freezing, with no mouse pointer motion,
no response to keyboard input, and no response to ping, about
once a week on this machine, but I don't know how to get any
diagnostics out when it happens.
Comment 22 Daniel Vetter 2013-02-17 12:04:11 UTC
(In reply to comment #21)
> I built stock 3.7.7 with IOMMU turned on, and patched, and 
> booted with and without the suggested option.  It booted to
> full X both times.  Without the patch, and built with IOMMU on,
> it fails reliably.  Without the patch, and without IOMMU, it
> boots and runs X with no apparent problems.
> 
> IOMMU turned on means:
> 
> # CONFIG_CALGARY_IOMMU is not set
> CONFIG_IOMMU_HELPER=y
> CONFIG_IOMMU_API=y
> CONFIG_IOMMU_SUPPORT=y
> # CONFIG_AMD_IOMMU is not set
> CONFIG_INTEL_IOMMU=y
> CONFIG_INTEL_IOMMU_DEFAULT_ON=y
> CONFIG_INTEL_IOMMU_FLOPPY_WA=y
> # CONFIG_IOMMU_STRESS is not set

Have you also tested what happens with an IOMMU-enabled kernel, but adding intel_iommu=igfx_off on the kernel cmdline? That is a slightly different mode of "IOMMU disabled" which we need to test separately.

> This is not to say there are no problems.  The 3.6 and 3.7
> kernels are prone to freezing, with no mouse pointer motion,
> no response to keyboard input, and no response to ping, about
> once a week on this machine, but I don't know how to get any
> diagnostics out when it happens.

Should be fixed in latest stable updates, see bug #55984
Comment 23 Nathan Myers 2013-02-17 19:55:16 UTC
Yes, to be precise, I tested 

1. a STOCK 3.7.7 kernel with IOMMU configured OFF and booted with NO option 
intel_iommu=igfx_off (success)

2. a STOCK 3.7.7 kernel with IOMMU configured ON and booted with NO option 
intel_iommu=igfx_off (FAIL)

3. a PATCHed 3.7.7 kernel with IOMMU configured OFF and booted with NO option 
intel_iommu=igfx_off (success)

4. a PATCHed 3.7.7 kernel with IOMMU configured ON and booted with NO option 
intel_iommu=igfx_off (success)

5. a PATCHed 3.7.7 kernel with IOMMU configured ON and booted WITH option 
intel_iommu=igfx_off (success)

where PATCH refers to attachment 74737 [details] [review], "Disable WC PTE updates for ILK VTd",
and IOMMU ON as defined in #21.

btw: It seems as if I cannot turn off CONFIG_IOMMU_HELPER or 
CONFIG_SWIOTLB using the regular configuration tools.
Comment 24 Nathan Myers 2013-02-17 20:15:01 UTC
(off-topic) 5afeb70e606bdce5a76de, referred to in #55984 as mentioned in #22, appears to be ancestral to 3.7.7.  If so, it does not fix the occasional hang mentioned at the end of #21, which also occurred when I was running 3.7.7 (with IOMMU configured OFF).  But there's no evidence to suggest that the hang has anything to do with [drm], beyond that it usually (but not always) happens when I'm sitting at the machine.
Comment 25 Daniel Vetter 2013-02-17 20:32:36 UTC
Fix merged to drm-intel-next as

commit b4950816cb3b1e10d8d0db3cd112e432b6c244cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 13 09:31:53 2013 +0000

    drm/i915: Disable WC PTE updates to w/a buggy IOMMU on ILK

For your ilk woes, the real fix is:

https://bugs.freedesktop.org/attachment.cgi?id=73105

Dunno whether you've meant that since I couldn't find any patch on comment #22 on that bug. If you still have hangs, please file a new bug report and attach the i915_error_state.
Comment 26 Nathan Myers 2013-02-17 22:31:00 UTC
"git blame" indicates attachment #73105 [details] [review] is the commit 5afeb70e I mentioned.
Comment 27 Daniel Vetter 2013-02-17 23:30:01 UTC
(In reply to comment #26)
> "git blame" indicates attachment #73105 [details] [review] is the commit
> 5afeb70e I mentioned.

Ah the sha1 you cite is on stable, not the upstream commit - hence I couldn't find it.
Comment 28 Nathan Myers 2013-02-18 18:30:13 UTC
Thank you to all.

