Bug 96473 - i915.ko corrupt text lines at top, [drm] GPU HANG: ecode 2:0:0x037fffc1, reason: Ring hung
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2016-06-10 07:46 UTC by Taketo Kabe
Modified: 2016-08-26 14:01 UTC

See Also:
i915 platform: I865G
i915 features: GEM/Other


Attachments
dmesg (32.17 KB, text/plain)
2016-06-10 07:46 UTC, Taketo Kabe
[drm] GPU crash dump saved to /sys/class/drm/card0/error (673.61 KB, text/plain)
2016-06-10 07:48 UTC, Taketo Kabe
i915.ko of kernel 4.6.2 /sys/class/drm/card0/error (673.65 KB, text/plain)
2016-06-10 09:26 UTC, Taketo Kabe
4.6.2 /sys/class/drm/card0/error after revert (673.65 KB, text/plain)
2016-06-11 01:18 UTC, Taketo Kabe
dmesg, drm.debug=6 (697 bytes, text/plain)
2016-06-15 06:33 UTC, Taketo Kabe
[PATCH] drm/i915: Account for TSEG size when determining 865G stolen base (3.82 KB, patch)
2016-08-04 10:59 UTC, Ville Syrjala

Description Taketo Kabe 2016-06-10 07:46:09 UTC
Created attachment 124438 [details]
dmesg

Using vanilla Linux 4.4.13, hardware Intel 82865G

On boot, when i915.ko is activated, the screen switches to native KMS mode,
but several lines at the top are blank, followed by 2 scanlines of random pixels.
The other parts of the text look and scroll fine, but the blank/random part stays as it is.

dmesg says:

[drm] stuck on render ring
[drm] GPU HANG: ecode 2:0:0x037fffc1, reason: Ring hung, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
i915: render error detected, EIR: 0x00000010
[drm:i915_report_and_clear_eir [i915]] *ERROR* EIR stuck: 0x00000010, masking
drm/i915: Resetting chip after gpu hang
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -19
Comment 1 Taketo Kabe 2016-06-10 07:48:24 UTC
Created attachment 124439 [details]
[drm] GPU crash dump saved to /sys/class/drm/card0/error
Comment 2 Taketo Kabe 2016-06-10 07:51:57 UTC
lspci:

00:00.0 Host bridge: Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82865G Integrated Graphics Controller (rev 02)
00:03.0 PCI bridge: Intel Corporation 82865G/PE/P PCI to CSA Bridge (rev 02)
00:06.0 System peripheral: Intel Corporation 82865G/PE/P Processor to I/O Memory Interface (rev 02)
00:1d.0 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
02:01.0 Ethernet controller: Intel Corporation 82547EI Gigabit Ethernet Controller
Comment 3 Taketo Kabe 2016-06-10 07:52:39 UTC
uname:
Linux capricorn.five.ten 4.4.13 #1 SMP Fri Jun 10 15:06:33 JST 2016 i686 i686 i386 GNU/Linux
Comment 4 Taketo Kabe 2016-06-10 07:59:38 UTC
Even though the text console is corrupt, the X server works perfectly.
The top lines are still corrupted on the Ctrl-Alt-F1 text console.
This is not an Xorg-related issue, since it already happens with just the vanilla kernel, before X is started.
Comment 5 Taketo Kabe 2016-06-10 09:26:32 UTC
Created attachment 124442 [details]
i915.ko of kernel 4.6.2 /sys/class/drm/card0/error

Confirmed that the problem still exists in stable kernel 4.6.2, with exactly the same error message and symptom.
Comment 6 Chris Wilson 2016-06-10 16:20:51 UTC
commit 44c5905e8e977b1dd9bb99bcd5686464fa0aa247 [v4.3] 
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Jun 11 16:31:16 2015 +0300

    drm/i915: Drop the 64k linear scanout alignment on gen2/3

is a possibility. Can you please test the revert?
Comment 7 Taketo Kabe 2016-06-11 01:18:37 UTC
Created attachment 124470 [details]
4.6.2 /sys/class/drm/card0/error after revert

>>   drm/i915: Drop the 64k linear scanout alignment on gen2/3
>>is a possibility. Can you please test the revert?

Reverted the patch https://patchwork.kernel.org/patch/6588621/,
but the error messages and symptoms are exactly the same.
(GPU HANG: ecode 2:0:0x037fffc1, top several text lines blank)

Attached card0/error just in case something obscure changed.
Comment 8 Taketo Kabe 2016-06-11 01:28:48 UTC
Writing directly to /dev/fb0 with
  dd if=/dev/urandom bs=1024 of=/dev/fb0 count=384
produces no change on screen, but
  dd if=/dev/urandom bs=1024 of=/dev/fb0 count=385
starts to render random pixels right after the frozen, garbled pixels.
Does this ring any bell?
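
For reference, the two dd runs above bracket the first /dev/fb0 byte offset that becomes visible. A minimal sketch of that arithmetic in C; dd_bytes_written() is a hypothetical helper for illustration, not part of dd or the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes written by "dd bs=<bs> count=<count>" when every block is full. */
static inline size_t dd_bytes_written(size_t bs, size_t count)
{
	/* Reject a zero block size and a product that would overflow size_t. */
	assert(bs != 0 && count <= SIZE_MAX / bs);
	return bs * count;
}
```

With bs=1024, count=384 covers the first 393216 bytes (0x60000) and count=385 covers the first 394240 bytes, so the first visible byte lies somewhere in the 385th KiB of /dev/fb0.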
Comment 9 Chris Wilson 2016-06-11 08:14:19 UTC
384k + offset(fb0 in GTT) ~= 576k, the size of the VGA text console. My guess is that the VGA planes are not being disabled and through them the BIOS is obliterating our setup. One quick test would be

diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index e9cd82290408..fe8cfb027199 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -411,6 +411,8 @@ int i915_gem_init_stolen(struct drm_device *dev)
        }
 #endif
 
+       return 0;
+
        if (ggtt->stolen_size == 0)
                return 0;
Comment 10 Taketo Kabe 2016-06-12 06:42:44 UTC
The temporary patch from Comment 9 worked! Clean native-resolution text console.

Now how should we implement this properly?
82865G is Gen 2;
The switch-case in i915_gem_init_stolen() for 82865G goes to

	switch (INTEL_INFO(dev_priv)->gen) {
	case 2:
	case 3:
		break;

which leaves the reserved_base variable unset (=0),
screwing up the rest.

Also, the comment at the end of i915_gem_init_stolen() says
	/*
	 * Basic memrange allocator for stolen space.
	 *
	 * TODO: Notice that some platforms require us to not use the first page
	 * of the stolen memory but their BIOSes may still put the framebuffer
	 * on the first page. So we don't reserve this page for now because of
	 * that. Our current solution is to just prevent new nodes from being
	 * inserted on the first page - see the check we have at
	 * i915_gem_stolen_insert_node_in_range(). We may want to fix the fbcon
	 * problem later.
	 */
	drm_mm_init(&dev_priv->mm.stolen, 0, dev_priv->gtt.stolen_usable_size);

so I may have stumbled upon a hot spot.
Comment 11 Chris Wilson 2016-06-12 08:11:20 UTC
Something to try just to check that the VGA disable is taking effect as soon as we expect:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 801e4c17dd8d..2b9410ad91e4 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -15425,6 +15425,9 @@ static void i915_disable_vga(struct drm_device *dev)
 
        I915_WRITE(vga_reg, VGA_DISP_DISABLE);
        POSTING_READ(vga_reg);
+
+       I915_WRITE(vga_reg, VGA_DISP_DISABLE);
+       POSTING_READ(vga_reg);
 }
 
 void intel_modeset_init_hw(struct drm_device *dev)
@@ -16204,6 +16207,8 @@ void intel_modeset_gem_init(struct drm_device *dev)
        }
 
        intel_backlight_register(dev);
+
+       i915_redisable_vga(dev);
 }
 
 void intel_connector_unregister(struct intel_connector *intel_connector)
Comment 12 Taketo Kabe 2016-06-12 08:41:06 UTC
Are the patches cumulative or exclusive?
I assumed exclusive and applied only the Comment 11 patch;
same symptom (blank lines at the top of the text console).
Comment 13 Taketo Kabe 2016-06-13 06:03:34 UTC
Temporary solution: do what kernel 3.18.x does:
don't try to determine the stolen area and return zero on 865G.
This works (no text corruption, no GPU HANG), but
it is essentially the same as the Comment 9 test, and
obviously isn't going in the right direction.
(Is reserved memory usable for /dev/fb0 on 82865G?)

diff -U 6 -p ./drivers/gpu/drm/i915/i915_gem_stolen.c.ville ./drivers/gpu/drm/i915/i915_gem_stolen.c
--- ./drivers/gpu/drm/i915/i915_gem_stolen.c.ville	2016-06-08 10:23:53.000000000 +0900
+++ ./drivers/gpu/drm/i915/i915_gem_stolen.c	2016-06-13 14:10:55.000000000 +0900
@@ -105,23 +105,26 @@ static unsigned long i915_stolen_to_phys
 	base = 0;
 	if (INTEL_INFO(dev)->gen >= 3) {
 		/* Read Graphics Base of Stolen Memory directly */
 		pci_read_config_dword(dev->pdev, 0x5c, &base);
 		base &= ~((1<<20) - 1);
 	} else if (IS_I865G(dev)) {
+#if 0 /* as kernel 3.14 */
 		u16 toud = 0;
 
 		/*
 		 * FIXME is the graphics stolen memory region
 		 * always at TOUD? Ie. is it always the last
 		 * one to be allocated by the BIOS?
 		 */
 		pci_bus_read_config_word(dev->pdev->bus, PCI_DEVFN(0, 0),
 					 I865_TOUD, &toud);
 
 		base = toud << 16;
+#endif
+	/*DDD*/DRM_INFO("i915_stolen_to_physical: i865 base = 0x%x\n", base);
 	} else if (IS_I85X(dev)) {
 		u32 tseg_size = 0;
 		u32 tom;
 		u8 tmp;
 
 		pci_bus_read_config_byte(dev->pdev->bus, PCI_DEVFN(0, 0),
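
The two base calculations visible in the diff context above can be tried in isolation. This is a minimal user-space sketch, assuming the register values have already been read from PCI config space; the function names are made up for illustration and are not kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Gen3+ path: the dword read from config offset 0x5c is rounded down to a
 * 1 MiB boundary by clearing its low 20 bits. */
static inline uint32_t stolen_base_from_bsm(uint32_t bsm_reg)
{
	uint32_t base = bsm_reg & ~((UINT32_C(1) << 20) - 1);

	/* Postcondition: the result is 1 MiB aligned. */
	assert((base & 0xfffffu) == 0);
	return base;
}

/* 865G path: the driver treats the 16-bit TOUD value as the upper 16 bits
 * of a 32-bit physical address. Widen before shifting so the shift is done
 * in 32-bit unsigned arithmetic. */
static inline uint32_t stolen_base_from_toud(uint16_t toud)
{
	return (uint32_t)toud << 16;
}
```

The cast matters in standalone C: a 16-bit value is promoted to int before the shift, so a TOUD with its top bit set would otherwise shift into the sign bit of a 32-bit int.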
Comment 14 Taketo Kabe 2016-06-15 06:33:08 UTC
Created attachment 124540 [details]
dmesg, drm.debug=6

Better (or worse) solution:
Hard-code that the first 1MB of stolen memory is really reserved on 865G.

diff -p -U6 ./drivers/gpu/drm/i915/i915_gem_stolen.c.ville ./drivers/gpu/drm/i915/i915_gem_stolen.c
--- ./drivers/gpu/drm/i915/i915_gem_stolen.c.ville	2016-06-08 10:23:53.000000000 +0900
+++ ./drivers/gpu/drm/i915/i915_gem_stolen.c	2016-06-15 14:59:55.000000000 +0900
@@ -116,12 +116,13 @@ static unsigned long i915_stolen_to_phys
 		 * one to be allocated by the BIOS?
 		 */
 		pci_bus_read_config_word(dev->pdev->bus, PCI_DEVFN(0, 0),
 					 I865_TOUD, &toud);
 
 		base = toud << 16;
+		base += 1024 * 1024; /* FIXME assume first 1MB is really reserved */
 	} else if (IS_I85X(dev)) {
 		u32 tseg_size = 0;
 		u32 tom;
 		u8 tmp;
 
 		pci_bus_read_config_byte(dev->pdev->bus, PCI_DEVFN(0, 0),


This also works on a real machine. No text corruption, no GPU HANG.
 
 
After these,
I'm beginning to think that the Comment 13 fix is nonetheless right, because
- I865_TOUD is in the "Reserved" region of 865G's PCI config registers.
  Depending on it is wrong, even if it holds a sane value in practice.
- It looks like the reserved memory is really reserved and not reusable for /dev/fb0.
  The VGA BIOS et al. are sitting there.

Ville Syrjala in https://patchwork.kernel.org/patch/3448921/
claimed that 865G is a bit different and needs verifying on a real system.
Maybe no one has tested on a real system since.

865G is a decade-old chipset; at the enterprise level it is maybe not worth the effort
to claim a panic-proof /dev/fb0 region.
For others, "it works" is what matters, and the Comment 13 fix is enough.
Any thoughts?

The reason I'm sticking to this problem is that I use CentOS 6
on an 865G machine, and upstream RHEL 6.8 started retrofitting
kernel 4.4 drivers/gpu/drm/ code onto kernel 2.6.32 (amazing!).
Comment 15 Taketo Kabe 2016-06-19 05:44:34 UTC
As expected, the problem still exists in kernel 4.3-rc3.
(on 82865G integrated video, the KMS text console has a few invisible lines and random pixels at the top of the screen)

Applying the Comment 14 and/or Comment 13 patch fixes the problem.

Should I reopen the case in the kernel.org bugzilla, since this is
not a freedesktop.org client program issue (but a DRI/DRM-Intel issue)?
Comment 16 Taketo Kabe 2016-06-19 05:52:13 UTC
< As expected, the problem still exists in kernel 4.3-rc3.
---
> As expected, the problem still exists in kernel 4.7-rc3.

It isn't a showstopper, since 865G can only do non-mainstream 32-bit, but
I believe there are numerous machines still in service running RHEL6/CentOS6
that are stumbling on this bug.
Comment 17 Jani Nikula 2016-06-20 09:27:40 UTC
(In reply to Taketo Kabe from comment #15)
> Should I reopen the case in kernel.org bugzilla since this is
> not freedesktop.org client program issue (but a DRI/DRM-Intel issue)?

No. We track all drm/i915 issues specifically at bugs.freedesktop.org, *not* bugzilla.kernel.org.
Comment 18 yann 2016-08-04 10:06:01 UTC
Patch to disable use of stolen memory https://patchwork.freedesktop.org/series/10651/
Comment 19 Ville Syrjala 2016-08-04 10:59:58 UTC
Created attachment 125519 [details] [review]
[PATCH] drm/i915: Account for TSEG size when determining 865G stolen base

Please test this patch, it should bump the stolen base upwards to account for the TSEG.
Comment 20 Taketo Kabe 2016-08-05 10:58:45 UTC
Tested the patch
https://bugs.freedesktop.org/attachment.cgi?id=125519
on an actual machine (Fujitsu C610):

Fixes the issue on both kernel 4.4.16 and kernel 4.7.0. Congrats.
Seems like my machine has 512KiB of TSEG area.
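
To illustrate the adjustment the patch title describes, here is a minimal user-space sketch. It assumes the TSEG region sits directly above the address derived from TOUD and that its size is already known; how the real patch reads the TSEG size from hardware is not shown in this thread and is left out. stolen_base_865g() and the TOUD value in the usage note are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define KIB(x) ((uint32_t)(x) << 10)
#define MIB(x) ((uint32_t)(x) << 20)

/* Hypothetical helper: first usable byte of stolen memory when a TSEG of
 * tseg_size bytes occupies the space right above TOUD. Pass 0 when TSEG
 * is disabled. */
static inline uint32_t stolen_base_865g(uint16_t toud, uint32_t tseg_size)
{
	uint32_t top_of_dram = (uint32_t)toud << 16;

	/* Guard against wrapping past the 4 GiB address limit. */
	assert(tseg_size <= UINT32_MAX - top_of_dram);
	return top_of_dram + tseg_size;
}
```

For a made-up TOUD of 0x1f80 and the 512KiB TSEG reported here, stolen_base_865g(0x1f80, KIB(512)) gives 0x1f880000. The hard-coded 1MB skip from Comment 14 would land 512KiB higher, which also avoids the TSEG but gives up usable stolen memory.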
Comment 21 Ville Syrjala 2016-08-11 16:39:35 UTC
Fixed by

commit d721b02fd00bf133580f431b82ef37f3b746dfb2
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Mon Aug 8 13:58:39 2016 +0300

    drm/i915: Account for TSEG size when determining 865G stolen base


Thanks for the bug report and testing.

