Bug 54575

Summary:

[GM45] boot-up broken in kernel 3.4.10 - screen stays blank (bad patch identified)

Product:

DRI

Reporter:

andreas.sturmlechner

Component:

DRM/Intel

Assignee:

Daniel Vetter <daniel>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

medium

CC:

ben, chris, daniel, jbarnes

Version:

XOrg git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
Paranoia patch, #1	none
Paranoia patch, #2	none
dmesg.log	none
dmesg_3.4.10_lvds-drmdbg.log	none
intel-reg-dump_3.4.10_lvds-drmdbg.log	none
system messages	none
my applied patch	none

Description andreas.sturmlechner 2012-09-05 20:21:13 UTC

Kernel 3.4.10 contains a patch for intel_ringbuffer.c that breaks my fairly common Intel Mobile 4 Series hardware (ThinkPad X200s). On boot-up, LVDS as well as DP screens stay blank right before init. I guess that's what greg k-h meant in his 3.4.10 announcement: http://lwn.net/Articles/513635/

Here is the culprit: https://lkml.org/lkml/2012/8/20/19

What I tried:
3.4.10 vanilla - failed
3.4.10 + follow-up patches in Herton's reply - failed
3.4.10 with reverted commit 0d8957c8a90bbb5d34fab9a304459448a5131e06 - success

What works: <= 3.4.9 and >= 3.5.0, up to and including 3.6_rc4-drm

Comment 1 Daniel Vetter 2012-09-06 08:22:45 UTC

This is funny, since 3.5 has that patch, but works. And I really don't see anything else missing in there.

Can you please send a note to greg kh asking to revert the offending patch on 3.4?

Comment 2 andreas.sturmlechner 2012-09-07 00:03:37 UTC

Yes it's strange - to be absolutely sure I tried again with a new kernel image, same result. >=3.5 is fine except for the trouble shown in bug 53385.

Comment 3 andreas.sturmlechner 2012-09-17 11:50:07 UTC

Yesterday I thought I was going crazy when I couldn't reproduce the issue with 3.4.10 for a couple reboots - but then today it happened again (coincidentally, when sitting in the train, just when I first noticed that problem). While I _never_ got that behaviour outside of 3.4.10 so far, the kernel image with the reverted patch needs some more testing to be sure that patch is the actual culprit.

Comment 4 Daniel Vetter 2012-09-17 15:48:02 UTC

The offending patch has (or at least should have) been reverted from 3.4. And things seem to work nicely on 3.5 afaik. Hence closing this as fixed, thanks for reporting.

Comment 5 andreas.sturmlechner 2012-10-15 21:39:13 UTC

To reopen this bug, or at least keep it up to date: the blank screen issue has now happened twice in 3.6, most recently just now in 3.6.2. Most of the time it works, though, unlike the 3.0 and 3.4 stable releases with the patch included.

Comment 6 Chris Wilson 2012-12-05 13:10:09 UTC

Created attachment 71029
Paranoia patch, #1

Comment 7 andreas.sturmlechner 2012-12-29 12:00:08 UTC

Finally had some spare time again and applied the patch against 3.4.10 - after a few reboots I was greeted with the same black screen as with vanilla-3.4.10.

Meanwhile, many weeks of flawlessly booting 3.6.x kernels have gone by.

Comment 8 Chris Wilson 2012-12-29 12:06:11 UTC

Were any of the ring init warnings to be found in dmesg?

Comment 9 Chris Wilson 2012-12-29 12:29:29 UTC

Created attachment 72249
Paranoia patch, #2

Comment 10 andreas.sturmlechner 2012-12-29 13:34:33 UTC

Created attachment 72250
dmesg.log

Indeed, this can be found in dmesg (full log attached):


[    0.372850] [drm] Initialized drm 1.1.0 20060810
[    0.373062] i915 0000:00:02.0: power state changed by ACPI to D0
[    0.373162] i915 0000:00:02.0: power state changed by ACPI to D0
[    0.374157] i915 0000:00:02.0: setting latency timer to 64
[    0.381492] ACPI: Battery Slot [BAT0] (battery present)
[    0.642451] i915 0000:00:02.0: irq 40 for MSI/MSI-X
[    0.642473] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    0.642531] [drm] Driver supports precise vblank timestamp query.
[    0.642656] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    1.043225] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 000010e8 tail 00000000 start 00001000
[    1.043620] [drm:i915_driver_load] *ERROR* failed to init modeset
[    1.176373] i915: probe of 0000:00:02.0 failed with error -5


(output taken from kernel image with paranoia patch #1 applied)

Comment 11 andreas.sturmlechner 2012-12-30 14:46:42 UTC

Trying to reproduce this is probably the most frustrating experience ever. After building a new image with bigger log size and drm.debug=6 standard param it decided to work for n reboots until I gave up for today. It seems that actually travelling by train is the key to provoke that specific bug.

Chris, should I apply patch #2 in addition to #1, and also on top of 3.4.10? It didn't apply on the first try.

Comment 12 Chris Wilson 2012-12-30 14:54:34 UTC

#2 should apply by itself, the conflict should be minor though, the only vital bit is really the "+ (void)I915_READ_CTL(ring);"
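
For readers unfamiliar with the term: a posting read is a read of a device register issued right after a write, so that the write is pushed out to the hardware before the code continues. The following is a minimal sketch in plain C, assuming memory-mapped 32-bit registers. The helper names `mmio_write32`, `mmio_read32` and `mmio_write32_posted` are hypothetical and are not the i915 macros; the sketch only illustrates the pattern behind the "(void)I915_READ_CTL(ring);" line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; the i915 driver uses its
 * own I915_WRITE_* / I915_READ_* macros instead. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;
}

static inline uint32_t mmio_read32(volatile uint32_t *reg)
{
    return *reg;
}

/* Write a register, then read it back and discard the value. On
 * PCI-style buses a read is not completed ahead of earlier posted
 * writes to the same device, so the read-back acts as a flush and
 * also adds a small delay. */
static inline void mmio_write32_posted(volatile uint32_t *reg, uint32_t val)
{
    assert(reg != NULL);
    mmio_write32(reg, val);
    (void)mmio_read32(reg);
}
```

Against ordinary memory the read-back has no visible effect; the pattern only matters when `reg` points at real device registers.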

Comment 13 andreas.sturmlechner 2012-12-30 17:17:29 UTC

OK, two context lines that are not present in 3.4.10 had spoilt the fun with #2.

Anyway, with the change to drm.debug=6 I can't seem to be able to reproduce the blank screen - is it possible that additional latencies through that option are influencing ring init?

Comment 14 Chris Wilson 2012-12-30 17:27:57 UTC

(In reply to comment #13)
> Anyway, with the change to drm.debug=6 I can't seem to be able to reproduce
> the blank screen - is it possible that additional latencies through that
> option are influencing ring init?

Yes. No idea why, the posting reads seemed to be the most sensible attempt to introduce some delays along with the write barriers. What you can try is to just put the entire init in a loop and repeat say 10 times until it works.
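
The retry suggestion above can be sketched as follows. This is only an illustration in plain C, not kernel code: `init_with_retry`, `ring_init_fn` and `flaky_init` are hypothetical names, and the callback stands in for one pass through the driver's ring init, returning 0 on success and a negative error code on failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type standing in for one ring init attempt:
 * returns 0 on success, a negative error code on failure. */
typedef int (*ring_init_fn)(void *ctx);

/* Call init_once up to max_attempts times and stop at the first success.
 * Returns 0 on success, otherwise the error from the last attempt.
 * If attempts_used is non-NULL it receives the number of calls made. */
static int init_with_retry(ring_init_fn init_once, void *ctx,
                           int max_attempts, int *attempts_used)
{
    int attempts = 0;
    int ret = -1;

    assert(init_once != NULL);
    assert(max_attempts > 0);

    while (attempts < max_attempts) {
        attempts++;
        ret = init_once(ctx);
        if (ret == 0)
            break;
    }

    if (attempts_used != NULL)
        *attempts_used = attempts;
    return ret;
}

/* Fake init for demonstration: fails until *remaining_failures is 0. */
static int flaky_init(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return -5; /* same value as the probe error in the dmesg excerpt */
    }
    return 0;
}
```

For example, with `int failures = 3;` a call to `init_with_retry(flaky_init, &failures, 10, NULL)` succeeds on the fourth attempt. A loop like this only works around a timing problem; it does not explain why the first attempt fails.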

Comment 15 andreas.sturmlechner 2012-12-30 17:38:58 UTC

Created attachment 72308
dmesg_3.4.10_lvds-drmdbg.log

instant success with patch #2 and drm.debug=6

Comment 16 andreas.sturmlechner 2012-12-30 17:41:19 UTC

Created attachment 72309
intel-reg-dump_3.4.10_lvds-drmdbg.log

maybe a reg-dump is also of use. ...with special thx to android and connectbot

Comment 17 Jun Hu 2013-01-18 05:57:59 UTC

Hi, I ran into this issue too:

https://bugzilla.novell.com/show_bug.cgi?id=797407

Comment 18 Jun Hu 2013-01-18 23:21:16 UTC

Created attachment 73267
system messages

Comment 19 Jun Hu 2013-01-18 23:22:38 UTC

junwork:/etc/sysconfig/network # uname -a
Linux junwork 3.4.11-2.16-xen #1 SMP Wed Sep 26 17:05:00 UTC 2012 (259fc87) x86_64 x86_64 x86_64 GNU/Linux

I applied the two patches above, retried up to 120 times in the init_ring_common function, and added the kernel parameter "drm.debug=6", but I still encountered this issue 2-3 times per day.

For the system messages, please see the file attached above.
For the patch, please see below.

Comment 20 Jun Hu 2013-01-18 23:25:33 UTC

Created attachment 73268
my applied patch

Comment 21 Daniel Vetter 2013-01-19 17:04:18 UTC

Hm, something's seriously amiss here, and I have no idea what exactly. Just to check: Is it still the case that newer/older kernels work as expected and without any of these tricks? I fear we'll just have to give up on 3.4 :(

Comment 22 andreas.sturmlechner 2013-01-19 17:32:27 UTC

I haven't yet tried applying your paranoia patches over a newer kernel. With vanilla kernels, I've never encountered that issue before 3.4.10 or in 3.5, 3.6 and 3.7.

Comment 23 Jun Hu 2013-01-21 21:37:34 UTC

Yes, I tested the 3.7.3 kernel for three days and this issue has not occurred.

The kernel is the openSUSE 12.3 stable kernel.

So I believe the issue has been solved in newer kernel versions.

But I think a developer would still want to find out which modification caused this situation.

Comment 24 Daniel Vetter 2013-01-21 22:58:01 UTC

Ok, I think we'll just close this one here as working on recent platforms, no idea what's broken on 3.4. Thanks everyone for reporting this and digging into possible solutions.

Comment 25 andreas.sturmlechner 2013-01-22 06:49:04 UTC

I also don't care about 3.4 on that system - unless the bug is just hidden and ready to reappear in the future.

Comment 26 Egbert Eich 2013-06-13 10:59:20 UTC

This issue is fixed in the current upstream kernel; however, it is definitely still an issue in the 3.0.x longterm kernel:
On some Q43/Q45 chipsets (device ID 0x2e12) this issue happens not just occasionally but every time.
It turns out that the two upstream commits:

commit f01db988ef6f6c70a6cc36ee71e4a98a68901229
Author: Sean Paul <seanpaul@chromium.org>
Date: Fri Mar 16 12:43:22 2012 -0400

drm/i915: Add wait_for in init_ring_common

I have seen a number of "blt ring initialization failed" messages
where the ctl or start registers are not the correct value. Upon further
inspection, if the code just waited a little bit, it would read the
correct value. Adding the wait_for to these reads should eliminate the
issue.

and

commit 3eef8918ff440837f6af791942d8dd07e1a268ee
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jun 4 17:05:40 2012 +0100

drm/i915: Mark the ringbuffers as being in the GTT domain

By correctly describing the ringbuffers as being in the GTT domain, it
appears that we are more careful with the management of the CPU cache
upon resume and so prevent some coherency issue when submitting commands
to the GPU later. A secondary effect is that the debug logs are then
consistent with the actual usage (i.e. they no longer describe the
ringbuffers as being in the CPU write domain when we are accessing them
through a wc iomapping.)

are needed to fix this issue.

Both commits are part of the 3.2.x and 3.4.x stable series, but not of 3.0.x.

Commit b7884eb45ec98c0d34c7f49005ae9d4b4b4e38f6 may also be useful to have; however, on the hardware that I have tested it is not required.

Will notify stable@
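
The idea behind the first commit quoted above, polling a register until it holds the expected value instead of checking it once, can be sketched in plain C. This is not the kernel's wait_for macro: `poll_until`, `reg_read_fn`, `struct fake_reg` and `fake_reg_read` are hypothetical names for illustration, and a simple read budget stands in for a real time-based timeout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that returns the current register value. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/* Read the register up to max_polls times until (value & mask) equals
 * expected. Returns 0 once the value matches, -1 if the budget runs
 * out. Real driver code would bound the wait by time and delay or
 * sleep between reads; that part is left out of this sketch. */
static int poll_until(reg_read_fn read_reg, void *ctx, uint32_t mask,
                      uint32_t expected, unsigned int max_polls)
{
    unsigned int i;

    assert(read_reg != NULL);

    for (i = 0; i < max_polls; i++) {
        if ((read_reg(ctx) & mask) == expected)
            return 0;
    }
    return -1;
}

/* Fake register for demonstration: returns 0 for the first
 * reads_until_ready reads, then settled_value. */
struct fake_reg {
    unsigned int reads_until_ready;
    uint32_t settled_value;
};

static uint32_t fake_reg_read(void *ctx)
{
    struct fake_reg *reg = ctx;

    if (reg->reads_until_ready > 0) {
        reg->reads_until_ready--;
        return 0;
    }
    return reg->settled_value;
}
```

A single check against `fake_reg_read` would fail while the value is still settling, whereas `poll_until` with a large enough budget returns 0 as soon as the masked value matches.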
