Bug 54575

Summary: [GM45] boot-up broken in kernel 3.4.10 - screen stays blank (bad patch identified)
Product: DRI Reporter: andreas.sturmlechner
Component: DRM/Intel Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: ben, chris, daniel, jbarnes
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                              Flags
Paranoia patch, #1                       none
Paranoia patch, #2                       none
dmesg.log                                none
dmesg_3.4.10_lvds-drmdbg.log             none
intel-reg-dump_3.4.10_lvds-drmdbg.log    none
system messages                          none
my applied patch                         none

Description andreas.sturmlechner 2012-09-05 20:21:13 UTC
Kernel 3.4.10 contains a patch for intel_ringbuffer.c that breaks my fairly common Intel Mobile 4 Series hardware (ThinkPad X200s). On boot-up, LVDS as well as DP screens stay blank right before init. I guess that's what greg k-h meant in his 3.4.10 announcement: http://lwn.net/Articles/513635/

Here is the culprit: https://lkml.org/lkml/2012/8/20/19

What I tried:
3.4.10 vanilla - failed
3.4.10 + follow-up patches in Herton's reply - failed
3.4.10 with reverted commit 0d8957c8a90bbb5d34fab9a304459448a5131e06 - success

What works: <= 3.4.9 and >= 3.5.0, included up until 3.6_rc4-drm
Comment 1 Daniel Vetter 2012-09-06 08:22:45 UTC
This is funny, since 3.5 has that patch, but works. And I really don't see anything else missing in there.

Can you please send a note to greg kh asking to revert the offending patch on 3.4?
Comment 2 andreas.sturmlechner 2012-09-07 00:03:37 UTC
Yes it's strange - to be absolutely sure I tried again with a new kernel image, same result. >=3.5 is fine except for the trouble shown in bug 53385.
Comment 3 andreas.sturmlechner 2012-09-17 11:50:07 UTC
Yesterday I thought I was going crazy when I couldn't reproduce the issue with 3.4.10 for a couple reboots - but then today it happened again (coincidentally, when sitting in the train, just when I first noticed that problem). While I _never_ got that behaviour outside of 3.4.10 so far, the kernel image with the reverted patch needs some more testing to be sure that patch is the actual culprit.
Comment 4 Daniel Vetter 2012-09-17 15:48:02 UTC
The offending patch has (or at least should have) been reverted from 3.4. And things seem to work nicely on 3.5 afaik. Hence closing this as fixed, thanks for reporting.
Comment 5 andreas.sturmlechner 2012-10-15 21:39:13 UTC
To reopen/keep this bug up to date, the blank screen issue happened now twice in 3.6, last time just now in 3.6.2. Most of the time it works though, contrary to 3.0 and 3.4 stable releases with the patch included.
Comment 6 Chris Wilson 2012-12-05 13:10:09 UTC
Created attachment 71029 [details] [review]
Paranoia patch, #1
Comment 7 andreas.sturmlechner 2012-12-29 12:00:08 UTC
Finally had some spare time again and applied the patch against 3.4.10 - after a few reboots I was greeted with the same black screen as with vanilla-3.4.10.

Meanwhile, many weeks of flawlessly booting 3.6.x kernels have gone by.
Comment 8 Chris Wilson 2012-12-29 12:06:11 UTC
Were any of the ring init warnings to be found in dmesg?
Comment 9 Chris Wilson 2012-12-29 12:29:29 UTC
Created attachment 72249 [details] [review]
Paranoia patch, #2
Comment 10 andreas.sturmlechner 2012-12-29 13:34:33 UTC
Created attachment 72250 [details]
dmesg.log

Indeed, this can be found in dmesg (full log attached):


[    0.372850] [drm] Initialized drm 1.1.0 20060810
[    0.373062] i915 0000:00:02.0: power state changed by ACPI to D0
[    0.373162] i915 0000:00:02.0: power state changed by ACPI to D0
[    0.374157] i915 0000:00:02.0: setting latency timer to 64
[    0.381492] ACPI: Battery Slot [BAT0] (battery present)
[    0.642451] i915 0000:00:02.0: irq 40 for MSI/MSI-X
[    0.642473] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    0.642531] [drm] Driver supports precise vblank timestamp query.
[    0.642656] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    1.043225] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 000010e8 tail 00000000 start 00001000
[    1.043620] [drm:i915_driver_load] *ERROR* failed to init modeset
[    1.176373] i915: probe of 0000:00:02.0 failed with error -5


(output taken from kernel image with paranoia patch #1 applied)
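
For illustration, the sanity check that leads to the "render ring initialization failed" line above can be sketched as a standalone C function. This is an approximation and not the driver source: the mask values RING_VALID and HEAD_ADDR are assumed to match the 3.4-era i915_reg.h, the function and enum names are hypothetical, and the expected start address is assumed to equal the logged start value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask values, believed to match i915_reg.h of that era. */
#define RING_VALID 0x00000001u  /* ring enabled bit in the CTL register */
#define HEAD_ADDR  0x001FFFFCu  /* address bits of the HEAD register */

/* Hypothetical result codes, one per condition that can fail. */
enum ring_check {
    RING_OK = 0,
    RING_NOT_VALID,       /* CTL does not report the ring as valid */
    RING_BAD_START,       /* START does not hold the programmed address */
    RING_HEAD_NOT_RESET,  /* HEAD still points into the ring after reset */
};

/* Classify the register values read back after programming the ring. */
static enum ring_check check_ring_regs(uint32_t ctl, uint32_t head,
                                       uint32_t start, uint32_t expected_start)
{
    if ((ctl & RING_VALID) == 0)
        return RING_NOT_VALID;
    if (start != expected_start)
        return RING_BAD_START;
    if ((head & HEAD_ADDR) != 0)
        return RING_HEAD_NOT_RESET;
    return RING_OK;
}
```

Under those assumptions, feeding in the logged values (ctl 0001f001, head 000010e8, start 00001000) yields RING_HEAD_NOT_RESET: the ring is marked valid and start looks plausible, but head was not back at zero when it was read.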
Comment 11 andreas.sturmlechner 2012-12-30 14:46:42 UTC
Trying to reproduce this is probably the most frustrating experience ever. After building a new image with bigger log size and drm.debug=6 standard param it decided to work for n reboots until I gave up for today. It seems that actually travelling by train is the key to provoke that specific bug.

Chris, should I apply patch #2 in addition and also over 3.4.10? It didn't work at first try.
Comment 12 Chris Wilson 2012-12-30 14:54:34 UTC
#2 should apply by itself, the conflict should be minor though, the only vital bit is really the "+ (void)I915_READ_CTL(ring);"
Comment 13 andreas.sturmlechner 2012-12-30 17:17:29 UTC
OK, two context lines that are not present in 3.4.10 had spoilt the fun with #2.

Anyway, with the change to drm.debug=6 I can't seem to be able to reproduce the blank screen - is it possible that additional latencies through that option are influencing ring init?
Comment 14 Chris Wilson 2012-12-30 17:27:57 UTC
(In reply to comment #13)
> Anyway, with the change to drm.debug=6 I can't seem to be able to reproduce
> the blank screen - is it possible that additional latencies through that
> option are influencing ring init?

Yes. No idea why, the posting reads seemed to be the most sensible attempt to introduce some delays along with the write barriers. What you can try is to just put the entire init in a loop and repeat say 10 times until it works.
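
The retry idea can be sketched in plain C roughly like this. It is a userspace illustration only: ring_init_fn, init_ring_with_retries and flaky_init are hypothetical names, and the callback merely stands in for one pass through the real init sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback standing in for one attempt at ring init;
 * returns true when the ring registers read back as expected. */
typedef bool (*ring_init_fn)(void *ctx);

/* Repeat a single init attempt up to max_tries times.
 * Returns the 1-based number of the attempt that succeeded,
 * or 0 if every attempt failed. */
static int init_ring_with_retries(ring_init_fn try_init, void *ctx,
                                  int max_tries)
{
    assert(try_init != NULL);
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (try_init(ctx))
            return attempt;
    }
    return 0;
}

/* Simulated flaky hardware for illustration: fails until the counter
 * that ctx points to has been decremented to zero. */
static bool flaky_init(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

With a counter of 3 and max_tries of 10, the fourth attempt succeeds. With a counter larger than max_tries the function gives up and returns 0, which is the case where the caller would still have to report the init failure.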
Comment 15 andreas.sturmlechner 2012-12-30 17:38:58 UTC
Created attachment 72308 [details]
dmesg_3.4.10_lvds-drmdbg.log

instant success with patch #2 and drm.debug=6
Comment 16 andreas.sturmlechner 2012-12-30 17:41:19 UTC
Created attachment 72309 [details]
intel-reg-dump_3.4.10_lvds-drmdbg.log

maybe a reg-dump is also of use. ...with special thx to android and connectbot
Comment 17 Jun Hu 2013-01-18 05:57:59 UTC
Hi, I ran into this issue too:

https://bugzilla.novell.com/show_bug.cgi?id=797407
Comment 18 Jun Hu 2013-01-18 23:21:16 UTC
Created attachment 73267 [details]
system messages
Comment 19 Jun Hu 2013-01-18 23:22:38 UTC
junwork:/etc/sysconfig/network # uname -a
Linux junwork 3.4.11-2.16-xen #1 SMP Wed Sep 26 17:05:00 UTC 2012 (259fc87) x86_64 x86_64 x86_64 GNU/Linux

I applied the two patches above, retried up to 120 times in the init_ring_common function, and added the kernel parameter "drm.debug=6", but still encountered this issue 2-3 times per day.

For the log messages, please see the file attached above.
For the patch, please see the attachment below.
Comment 20 Jun Hu 2013-01-18 23:25:33 UTC
Created attachment 73268 [details] [review]
my applied  patch
Comment 21 Daniel Vetter 2013-01-19 17:04:18 UTC
Hm, something's seriously amiss here, and I have no idea what exactly. Just to check: Is it still the case that newer/older kernels work as expected and without any of these tricks? I fear we'll just have to give up on 3.4 :(
Comment 22 andreas.sturmlechner 2013-01-19 17:32:27 UTC
I haven't yet tried applying your paranoia patches over a newer kernel. With vanilla kernels, I've never encountered that issue before 3.4.10 or in 3.5, 3.6 and 3.7.
Comment 23 Jun Hu 2013-01-21 21:37:34 UTC
Yes, I tested the 3.7.3 kernel for three days and this issue hasn't happened.

The kernel is the openSUSE 12.3 stable kernel.

So I believe the issue has been solved in newer kernel versions.

But I think a developer would still want to find out which modification caused this situation.
Comment 24 Daniel Vetter 2013-01-21 22:58:01 UTC
Ok, I think we'll just close this one here as working on recent platforms, no idea what's broken on 3.4. Thanks everyone for reporting this and digging into possible solutions.
Comment 25 andreas.sturmlechner 2013-01-22 06:49:04 UTC
I also don't care about 3.4 on that system - unless the bug is just hidden and ready to reappear in the future.
Comment 26 Egbert Eich 2013-06-13 10:59:20 UTC
This issue is fixed in the current upstream kernel, however it is definitely an issue in the 3.0.x longterm kernel:
On some Q43/Q45 chipsets (device ID 0x2e12) this issue doesn't only happen occasionally but permanently.
It turns out that the two upstream commits:

   commit f01db988ef6f6c70a6cc36ee71e4a98a68901229
   Author: Sean Paul <seanpaul@chromium.org>
   Date:   Fri Mar 16 12:43:22 2012 -0400

       drm/i915: Add wait_for in init_ring_common

       I have seen a number of "blt ring initialization failed" messages
       where the ctl or start registers are not the correct value. Upon further
       inspection, if the code just waited a little bit, it would read the
       correct value. Adding the wait_for to these reads should eliminate the
       issue.
    
and

   commit 3eef8918ff440837f6af791942d8dd07e1a268ee
   Author: Chris Wilson <chris@chris-wilson.co.uk>
   Date:   Mon Jun 4 17:05:40 2012 +0100

       drm/i915: Mark the ringbuffers as being in the GTT domain

       By correctly describing the rinbuffers as being in the GTT domain, it
       appears that we are more careful with the management of the CPU cache
       upon resume and so prevent some coherency issue when submitting commands
       to the GPU later. A secondary effect is that the debug logs are then
       consistent with the actual usage (i.e. they no longer describe the
       ringbuffers as being in the CPU write domain when we are accessing them
       through an wc iomapping.)
    

are needed to fix this issue.

Both commits are part of 3.2.x and 3.4.x stable, but not of 3.0.x.

Commit b7884eb45ec98c0d34c7f49005ae9d4b4b4e38f6 may also be useful to have; on the hardware I have tested, however, it is not required.

Will notify stable@
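
As a rough illustration of what the first commit quoted above does conceptually, here is a standalone C sketch of polling a register until it reads back the expected value. It is a simplification under stated assumptions: the driver's wait_for is bounded by a timeout in milliseconds, whereas this sketch is bounded by a poll count, and struct fake_reg, fake_reg_read and poll_reg_equals are hypothetical names for a simulated register, not i915 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated register: returns a stale value for the first
 * `stale_reads` reads, then the programmed value. Purely illustrative. */
struct fake_reg {
    uint32_t stale;
    uint32_t programmed;
    int stale_reads;
};

static uint32_t fake_reg_read(struct fake_reg *r)
{
    if (r->stale_reads > 0) {
        r->stale_reads--;
        return r->stale;
    }
    return r->programmed;
}

/* Poll until the register reads back `expected` or max_polls reads
 * have been used up. Returns true on a match, false otherwise. */
static bool poll_reg_equals(struct fake_reg *r, uint32_t expected,
                            int max_polls)
{
    assert(r != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (fake_reg_read(r) == expected)
            return true;
    }
    return false;
}
```

A single read of a register that is still stale fails, which corresponds to the spurious "initialization failed" case described in the commit message. Allowing a few more polls lets the same register pass.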
