| Summary: | [GM45] boot-up broken in kernel 3.4.10 - screen stays blank (bad patch identified) | | |
|---|---|---|---|
| Product: | DRI | Reporter: | andreas.sturmlechner |
| Component: | DRM/Intel | Assignee: | Daniel Vetter <daniel> |
| Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity: | normal | | |
| Priority: | medium | CC: | ben, chris, daniel, jbarnes |
| Version: | XOrg git | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | Linux (All) | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |

Description
andreas.sturmlechner
2012-09-05 20:21:13 UTC
This is funny, since 3.5 has that patch, but works. And I really don't see anything else missing in there. Can you please send a note to Greg KH asking to revert the offending patch on 3.4?

Yes it's strange - to be absolutely sure I tried again with a new kernel image, same result. >=3.5 is fine except for the trouble shown in bug 53385.

Yesterday I thought I was going crazy when I couldn't reproduce the issue with 3.4.10 for a couple of reboots - but then today it happened again (coincidentally, when sitting in the train, just when I first noticed that problem). While I _never_ got that behaviour outside of 3.4.10 so far, the kernel image with the reverted patch needs some more testing to be sure that patch is the actual culprit.

The offending patch has (or at least should have) been reverted from 3.4. And things seem to work nicely on 3.5 afaik. Hence closing this as fixed, thanks for reporting.

To reopen/keep this bug up to date, the blank screen issue happened now twice in 3.6, last time just now in 3.6.2. Most of the time it works though, contrary to 3.0 and 3.4 stable releases with the patch included.

Created attachment 71029 [details] [review]
Paranoia patch, #1

Finally had some spare time again and applied the patch against 3.4.10 - after a few reboots I was greeted with the same black screen as with vanilla-3.4.10. Meanwhile, many weeks of flawlessly booting 3.6.x kernels have gone by.

Were any of the ring init warnings to be found in dmesg?

Created attachment 72249 [details] [review]
Paranoia patch, #2

Created attachment 72250 [details]
dmesg.log
Indeed, this can be found in dmesg (full log attached):
[ 0.372850] [drm] Initialized drm 1.1.0 20060810
[ 0.373062] i915 0000:00:02.0: power state changed by ACPI to D0
[ 0.373162] i915 0000:00:02.0: power state changed by ACPI to D0
[ 0.374157] i915 0000:00:02.0: setting latency timer to 64
[ 0.381492] ACPI: Battery Slot [BAT0] (battery present)
[ 0.642451] i915 0000:00:02.0: irq 40 for MSI/MSI-X
[ 0.642473] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[ 0.642531] [drm] Driver supports precise vblank timestamp query.
[ 0.642656] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 1.043225] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 000010e8 tail 00000000 start 00001000
[ 1.043620] [drm:i915_driver_load] *ERROR* failed to init modeset
[ 1.176373] i915: probe of 0000:00:02.0 failed with error -5
(output taken from kernel image with paranoia patch #1 applied)
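
The register values in the `init_ring_common` error line above can be pulled apart with a small script. This is only a sketch for reading the log: the two bit masks (`RING_VALID`, `HEAD_ADDR`) are assumptions about the register layout that are not stated anywhere in this report, and `decode_ring_failure` is a hypothetical helper name.

```python
import re

# Assumed bit layout (not stated in this bug report): bit 0 of CTL is the
# "ring valid" flag, and HEAD carries the ring offset in bits 2..20.
RING_VALID = 0x00000001
HEAD_ADDR = 0x001FFFFC

LINE_RE = re.compile(
    r"\*ERROR\* (?P<ring>\w+) ring initialization failed "
    r"ctl (?P<ctl>[0-9a-f]{8}) head (?P<head>[0-9a-f]{8}) "
    r"tail (?P<tail>[0-9a-f]{8}) start (?P<start>[0-9a-f]{8})"
)


def decode_ring_failure(line):
    """Return the ring name and register values from one dmesg line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    regs = {k: int(m.group(k), 16) for k in ("ctl", "head", "tail", "start")}
    return {
        "ring": m.group("ring"),
        **regs,
        # Derived fields, valid only under the assumed masks above.
        "valid": bool(regs["ctl"] & RING_VALID),
        "head_offset": regs["head"] & HEAD_ADDR,
    }


line = ("[ 1.043225] [drm:init_ring_common] *ERROR* render ring initialization "
        "failed ctl 0001f001 head 000010e8 tail 00000000 start 00001000")
info = decode_ring_failure(line)
print(info)
```

Under those assumed masks, the logged values would mean the valid bit is set and the start address reads 0x1000, while the head pointer holds a non-zero offset (0x10e8). That may help when comparing failing and working boots, but it does not by itself explain the failure.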
Trying to reproduce this is probably the most frustrating experience ever. After building a new image with bigger log size and drm.debug=6 standard param it decided to work for n reboots until I gave up for today. It seems that actually travelling by train is the key to provoke that specific bug.

Chris, should I apply patch #2 in addition and also over 3.4.10? It didn't work at first try.

#2 should apply by itself, the conflict should be minor though, the only vital bit is really the "+ (void)I915_READ_CTL(ring);"

OK, two context lines that are not present in 3.4.10 had spoilt the fun with #2. Anyway, with the change to drm.debug=6 I can't seem to be able to reproduce the blank screen - is it possible that additional latencies through that option are influencing ring init?

(In reply to comment #13)
> Anyway, with the change to drm.debug=6 I can't seem to be able to reproduce
> the blank screen - is it possible that additional latencies through that
> option are influencing ring init?

Yes. No idea why, the posting reads seemed to be the most sensible attempt to introduce some delays along with the write barriers. What you can try is to just put the entire init in a loop and repeat say 10 times until it works.

Created attachment 72308 [details]
dmesg_3.4.10_lvds-drmdbg.log
instant success with patch #2 and drm.debug=6
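
The suggestion earlier in the thread to "put the entire init in a loop and repeat say 10 times until it works" can be modelled in user space as below. This is a toy sketch in Python, not kernel code: `FlakyRing`, `try_init` and `init_with_retries` are hypothetical names, and the hardware behaviour (init only succeeding on the Nth attempt) is invented purely to exercise the loop.

```python
class FlakyRing:
    """Toy stand-in for ring hardware: init only 'sticks' on the Nth attempt."""

    def __init__(self, succeed_on_attempt):
        self.succeed_on_attempt = succeed_on_attempt
        self.attempts = 0
        self.head = 0x10E8  # stale, non-zero head value as in the log above

    def try_init(self):
        """Run one full init attempt and report whether head reads back as 0."""
        self.attempts += 1
        if self.attempts >= self.succeed_on_attempt:
            self.head = 0
        return self.head == 0


def init_with_retries(ring, max_tries=10):
    """Repeat the whole init sequence until it succeeds or tries run out.

    Returns the attempt number that succeeded, or None if all tries failed.
    """
    for attempt in range(1, max_tries + 1):
        if ring.try_init():
            return attempt
    return None


print(init_with_retries(FlakyRing(succeed_on_attempt=3)))   # succeeds on try 3
print(init_with_retries(FlakyRing(succeed_on_attempt=11)))  # gives up: None
```

A bounded loop like this only helps if the failure is transient; a later comment in this report describes retrying 120 times and still hitting the blank screen a few times per day.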
Created attachment 72309 [details]
intel-reg-dump_3.4.10_lvds-drmdbg.log
maybe a reg-dump is also of use. ...with special thx to android and connectbot
Hi, I hit this issue too: https://bugzilla.novell.com/show_bug.cgi?id=797407

Created attachment 73267 [details]
system messages
junwork:/etc/sysconfig/network # uname -a
Linux junwork 3.4.11-2.16-xen #1 SMP Wed Sep 26 17:05:00 UTC 2012 (259fc87) x86_64 x86_64 x86_64 GNU/Linux

I applied the two patches above, retried up to 120 times in the init_ring_common function, and added the kernel parameter "drm.debug=6", but still encountered this issue 2-3 times per day. For the messages, please see the file above; for the patch, please see below.

Created attachment 73268 [details] [review]
my applied patch

Hm, something's seriously amiss here, and I have no idea what exactly. Just to check: Is it still the case that newer/older kernels work as expected and without any of these tricks? I fear we'll just have to give up on 3.4 :(

I haven't yet tried applying your paranoia patches over a newer kernel. With vanilla kernels, I've never encountered that issue before 3.4.10 or in 3.5, 3.6 and 3.7.

Yes, I tested the 3.7.3 kernel for three days and this issue has not happened. The kernel is the openSUSE 12.3 stable kernel, so I believe it has been solved in newer kernels. Still, I think a developer would want to find out which modification caused this situation.

Ok, I think we'll just close this one here as working on recent platforms, no idea what's broken on 3.4. Thanks everyone for reporting this and digging into possible solutions.

I also don't care about 3.4 on that system - unless the bug is just hidden and ready to reappear in the future.

This issue is fixed in the current upstream kernel, however it is definitely an issue in the 3.0.x longterm kernel: On some Q43/Q45 chipsets (device ID 0x2e12) this issue doesn't only happen occasionally but permanently. It turns out that the two upstream commits:

    commit f01db988ef6f6c70a6cc36ee71e4a98a68901229
    Author: Sean Paul <seanpaul@chromium.org>
    Date: Fri Mar 16 12:43:22 2012 -0400

        drm/i915: Add wait_for in init_ring_common

        I have seen a number of "blt ring initialization failed" messages
        where the ctl or start registers are not the correct value. Upon
        further inspection, if the code just waited a little bit, it would
        read the correct value. Adding the wait_for to these reads should
        eliminate the issue.

and

    commit 3eef8918ff440837f6af791942d8dd07e1a268ee
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date: Mon Jun 4 17:05:40 2012 +0100

        drm/i915: Mark the ringbuffers as being in the GTT domain

        By correctly describing the ringbuffers as being in the GTT domain,
        it appears that we are more careful with the management of the CPU
        cache upon resume and so prevent some coherency issue when
        submitting commands to the GPU later.

        A secondary effect is that the debug logs are then consistent with
        the actual usage (i.e. they no longer describe the ringbuffers as
        being in the CPU write domain when we are accessing them through a
        wc iomapping.)

are needed to fix this issue. Both commits are part of 3.2.x and 3.4.x stable, but not of 3.0.x. Commit b7884eb45ec98c0d34c7f49005ae9d4b4b4e38f6 may also be useful to have; however, on the hardware that I have tested it is not required.

Will notify stable@
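
The idea in the first commit message, re-reading a register for a short while instead of trusting the first read, can be illustrated with a small user-space model. This is a hedged sketch in Python rather than the kernel's own code: the `wait_for` function here only borrows the name from the commit title, it returns a boolean by choice of this sketch, and `SlowRegister` is a hypothetical stand-in for a register that returns a stale value for its first few reads.

```python
import time


def wait_for(condition, timeout_ms, poll_interval_ms=1):
    """Poll condition() until it returns true or timeout_ms has elapsed.

    Returns True if the condition became true in time, otherwise False.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_ms / 1000.0)


class SlowRegister:
    """Hypothetical register that reads back 0 before settling on its value."""

    def __init__(self, final_value, stale_reads):
        self.final_value = final_value
        self.stale_reads = stale_reads

    def read(self):
        if self.stale_reads > 0:
            self.stale_reads -= 1
            return 0
        return self.final_value


start = SlowRegister(final_value=0x1000, stale_reads=3)
# A single read would see the stale 0; polling sees the settled value.
ok = wait_for(lambda: start.read() == 0x1000, timeout_ms=500)
print(ok)
```

The timeout bounds how long a genuinely dead ring can stall initialization, which is the trade-off against the unbounded "retry until it works" approach.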