53926 – [945gm regression] LVDS boots to blank screen

Summary: [945gm regression] LVDS boots to blank screen

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	highest blocker
Assignee:	Daniel Vetter
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-08-22 12:16 UTC by Alex
Modified:	2016-10-07 05:38 UTC
CC List:	8 users

See Also:	57365
i915 platform:
i915 features:

Attachments
dmesg on failure (50.25 KB, text/plain) 2012-08-22 12:16 UTC, Alex	no flags	Details
dmesg on failure with drm.debug=0xe (62.12 KB, text/plain) 2012-08-23 06:00 UTC, Alex	no flags	Details
intel_reg_dumper output with kms (10.63 KB, text/plain) 2012-08-23 06:01 UTC, Alex	no flags	Details
intel_reg_dumper output without kms (10.62 KB, text/plain) 2012-08-23 06:24 UTC, Alex	no flags	Details
dmesg on failure, kernel-3.6.0-rc3 drm.debug=0xe (66.85 KB, text/plain) 2012-08-24 08:48 UTC, Alex	no flags	Details
dmesg of good boot, kernel-3.6.0-rc3 drm.debug=0xe (66.23 KB, text/plain) 2012-08-24 09:18 UTC, Alex	no flags	Details
intel_reg_dumper output with kms good screen (10.63 KB, text/plain) 2012-08-26 12:49 UTC, Alex	no flags	Details
lshw -c display output (954 bytes, text/plain) 2012-12-12 09:57 UTC, Adam	no flags	Details
git bisect log (2.32 KB, text/plain) 2012-12-15 10:07 UTC, Adam	no flags	Details
reversed git bisect log (2.87 KB, text/plain) 2012-12-21 09:41 UTC, Adam	no flags	Details
Probe for the EDID using GPIO as a fallback (1.26 KB, patch) 2012-12-21 10:13 UTC, Chris Wilson	no flags	Details | Splinter Review
dmesg and intel_reg_* logs for various boots (88.78 KB, application/x-gzip) 2013-01-09 07:18 UTC, Nathan Schulte	no flags	Details
dmesg and intel_reg_* logs for hibernate boots (75.00 KB, application/x-gzip) 2013-01-09 07:54 UTC, Nathan Schulte	no flags	Details
dmesg and intel_reg_* logs for various boots (89.59 KB, text/plain) 2013-01-09 09:31 UTC, Nathan Schulte	no flags	Details
dmesg and intel_reg_* logs for various boots (89.59 KB, application/x-gzip) 2013-01-09 09:32 UTC, Nathan Schulte	no flags	Details
bisect (5.22 KB, text/plain) 2013-01-13 02:43 UTC, Nathan Schulte	no flags	Details

Description Alex 2012-08-22 12:16:59 UTC

Created attachment 65947 [details]
dmesg on failure

On a fresh install of Debian Wheezy (Linux version 3.2.0-3-686-pae), with no Xorg, just the base system, I get a blank screen during boot most of the time. The system itself works fine, as I can log in blindly and save dmesg to the Windows partition. In some rare, unpredictable cases I get a normal screen, so it looks like a race condition.
This is apparently not bug #38718, as my backlight works fine (I can control it with the Fn-F8/F9 keys while looking at the blank screen).

P.S. I'm very disappointed not to be able to even boot Linux to a console without problems on such common hardware.

Comment 1 Daniel Vetter 2012-08-22 16:43:02 UTC

Please boot with drm.debug=0xe added to your kernel cmdline and attach the full dmesg. Also please grab the latest intel-gpu-tools (the 1.2 version in debian is new enough for your hw) and grab the output of intel_reg_dumper both when kms has loaded, and without kms (i915.die=1 on the kernel cmdline prevents i915.ko from loading).

Comment 2 Alex 2012-08-23 06:00:14 UTC

Created attachment 65993 [details]
dmesg on failure with drm.debug=0xe

Comment 3 Alex 2012-08-23 06:01:54 UTC

Created attachment 65994 [details]
intel_reg_dumper output with kms

Comment 4 Alex 2012-08-23 06:24:55 UTC

Created attachment 65995 [details]
intel_reg_dumper output without kms

I've uploaded the information you asked for.
One more note: yesterday I set UTC=no in /etc/default/rcS in an attempt (to no avail) to tell Linux that my hardware clock is set to local time. Since then I have not been able to reach a normal console (without the blank screen) at all (20+ reboots). Before that I was able to boot successfully from time to time, and I now think that was because I had changed something in the system (console font size, grub screen resolution, or something else). So there seems to be a weird connection between the bug and various system settings.

Comment 5 Daniel Vetter 2012-08-23 08:28:00 UTC

Nothing obvious popped out of the reg dumps & dmesg ...

2 more things to test:
- Please test with kernel 3.6-rc, that contains a patch that might help here ("drm/i915: Remove too early plane enable on pre-PCH hardware").
- Please boot again with kms disabled, but then start X with the vesa driver and select the native resolution. Then please grab the reg_dumper output again. This way we should match the kms configuration more closely, which makes comparing the diffs much easier (just in case I've missed something).

Comment 6 Alex 2012-08-24 04:42:26 UTC

I have installed Xorg (#aptitude install x-window-system), but 'startx' gives me only some pixel mess on the screen and then a small console prompt on a black screen in the upper left corner. After this the system hangs: no reaction to Ctrl-Alt-Backspace, Ctrl-Alt-Del, Ctrl-Alt-Fn, and so on. Only the power button gets it to halt.
Then I tried to generate an xorg.conf, as there was no such file after the Xorg installation. The command '#Xorg :1 -configure' ended with errors:
---
[xxx] i915: Unknown parameter 'die'
ERROR: could not insert 'i915': Unknown symbol in module, or unknown parameter (see dmesg)
Number of created screens does not match number of detected devices.
  Configuration failed.
Server terminated with error (2). Closing log file.
---
Please advise how to start Xorg and how to force the vesa driver. In the meantime I'm compiling a new vanilla 3.6-rc3 kernel.

Comment 7 Alex 2012-08-24 08:48:07 UTC

Created attachment 66054 [details]
dmesg on failure, kernel-3.6.0-rc3 drm.debug=0xe

Same bug with new 3.6.0-rc3 kernel. Just for clarity here is my sketch of operations:
---
aptitude install kernel-package fakeroot libncurses5-dev
cd /usr/src
wget http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.6-rc3.tar.bz2
tar -xvjf linux-3.6-rc3.tar.bz2
ln -s linux-3.6-rc3 linux
cd linux
cp /boot/config-$(uname -r) ./.config
make oldconfig  <--- here I chose default value for every new parameter
export CONCURRENCY_LEVEL=3
fakeroot make-kpkg --initrd --append-to-version=-my1 kernel_image kernel_headers
dpkg -i linux-image-3.6.0-rc3-my1_3.6.0-rc1-my1-10.00.Custom_i386.deb
reboot
---
Also here I see the same problems after 'startx' command.

Comment 8 Alex 2012-08-24 09:18:07 UTC

Created attachment 66055 [details]
dmesg of good boot, kernel-3.6.0-rc3 drm.debug=0xe

After a couple of reboots I accidentally booted to a normal console with native resolution. Unfortunately I forgot to take the 'intel_reg_dumper' output, only dmesg, and now after a reboot I cannot get it back: only a blank screen. Nothing was changed in the system before or after the successful boot.
---
What next?

Comment 9 Alex 2012-08-26 12:49:57 UTC

Created attachment 66134 [details]
intel_reg_dumper output with kms good screen

After several reboots I got one successful boot of my previous kernel (3.2.0) into a normal console with native resolution. I saved the 'intel_reg_dumper' output, and it differs a little from the output with the blank screen and kms on. I hope this helps.

Comment 10 Alex 2012-09-03 05:56:14 UTC

Guys, if you really want to fix this bug my pc is ready for experiments. But I'm not going to keep this piece of sh_t for a long time, so hurry up.

P.S. I have WinXP installed on the same PC and it has been working fine, so this bug is not due to some hw failure.

Comment 11 Jesse Barnes 2012-11-14 17:14:27 UTC

Hm still nothing useful in those register dumps.  I don't see anything in the logs either; we seem to do normal probing and mode setting even in the failing case, which makes me think we're doing mode setting in the wrong order, or not waiting for a vblank where we should on gen3...

If you can ssh into the machine when the display has failed to come up, can you check and see if the registers at 0x70000 or 0x71000 are changing?  You can use intel_reg_read 0x70000 or 0x71000 for that.  That should tell us if the pipe is running at least.

If not, something has failed in our plane/pipe setup code.  If the pipe is running however, then something has failed in our port setup code, either bad panel power sequencing, incorrect LVDS port setup, or the registers were protected by the panel write protect lock.

Comment 12 Adam 2012-11-15 15:41:30 UTC

I'm also affected by this issue.
Here's my output:

[echinos@nettop ~]$ sudo intel_reg_read 0x70000
0x70000 : 0x1E7
[echinos@nettop ~]$ sudo intel_reg_read 0x71000
0x71000 : 0x2FF

Comment 13 Chris Wilson 2012-11-15 15:45:53 UTC

(In reply to comment #12)
> I'm also affected by this issue.
> Here's my output:
> 
> [echinos@nettop ~]$ sudo intel_reg_read 0x70000
> 0x70000 : 0x1E7
> [echinos@nettop ~]$ sudo intel_reg_read 0x71000
> 0x71000 : 0x2FF

You need to sample them at least twice and see if the values are changing. Thanks.
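
The double-sampling check can be sketched as a small shell helper. This is only an illustration: it assumes intel_reg_read from intel-gpu-tools is installed and prints the same "address : value" line for an unchanged register, as in comment 12. The function name reg_is_changing, the REG_READ override and the one-second default delay are hypothetical choices made for this sketch, not part of the tools.

```shell
# reg_is_changing REG [DELAY]
# Read REG twice, DELAY seconds apart (default 1), and succeed (exit 0)
# only if the two samples differ. REG_READ names the reader command and
# defaults to intel_reg_read; the override is an assumption of this sketch.
reg_is_changing() (
    reg=$1
    delay=${2:-1}
    first=$(${REG_READ:-intel_reg_read} "$reg") || exit 2
    sleep "$delay"
    second=$(${REG_READ:-intel_reg_read} "$reg") || exit 2
    [ "$first" != "$second" ]
)
```

With the commands used in this thread it could be called as REG_READ='sudo intel_reg_read' reg_is_changing 0x70000, and again for 0x71000; an exit status of 0 means the two samples differed.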

Comment 14 Adam 2012-11-15 16:44:59 UTC

Values are not changing. Those commands always print the same output.

Comment 15 Daniel Vetter 2012-11-15 16:49:20 UTC

Can people please retest with a 3.7-rc kernel? The modeset-rework in there seems to fix tons of sporadic lvds failures, specifically we have reports that it fixes LVDS issues on i945gm machines, too.

Comment 16 Adam 2012-11-15 17:21:30 UTC

I have tested kernel 3.7-rc5 but unfortunately with no luck. Bug still exists. I still get blank screen.

Comment 17 Adam 2012-11-18 18:36:42 UTC

In kernel 3.7-rc6 bug still exists.

Comment 18 Adam 2012-11-27 14:31:03 UTC

No progress in kernel 3.7-rc7

Comment 19 Chris Wilson 2012-11-27 17:52:45 UTC

Time to try random quirks:

i915.lvds_use_ssc=1 or 0
i915.lvds_channel_mode=1 or 2
i915.panel_ignore_lid=1
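
To keep track of which combinations have been tried, the parameters above can be enumerated mechanically. The sketch below only prints kernel command-line fragments; it assumes panel_ignore_lid is either left out or set to 1, and the function name is made up for illustration.

```shell
# Print one kernel command-line fragment per line, covering every
# combination of the three parameters listed above (2 x 2 x 2 = 8 lines).
i915_quirk_combinations() (
    for ssc in 0 1; do
        for chan in 1 2; do
            for lid in "" " i915.panel_ignore_lid=1"; do
                echo "i915.lvds_use_ssc=$ssc i915.lvds_channel_mode=$chan$lid"
            done
        done
    done
)
```

Each printed line is meant to be appended to the kernel command line for one test boot.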

Comment 20 Adam 2012-11-28 10:47:33 UTC

No progress with these parameters. I tried almost every combination but I still get blank screen right after boot (used kernel 3.7-rc7).

Comment 21 Jesse Barnes 2012-12-11 18:36:56 UTC

Not sure what's going on here... what model is this machine?  Is there a BIOS update available?  Sometimes we find we don't program all the same state as the Windows driver, so BIOS versions occasionally matter.

Comment 22 Adam 2012-12-12 09:57:25 UTC

Created attachment 71385 [details]
lshw -c display output

What information exactly do you need? I'm using an MSI AP 1900 http://www.msi.com/product/aio/Wind-Top-AP1900.html
I have updated BIOS to the latest version a while ago.
I'm running Windows XP and Arch Linux with LTS kernel (3.0.54) with no problems on this machine.

Comment 23 Jesse Barnes 2012-12-12 18:30:58 UTC

3.0.54 works from Arch, but newer kernels don't?  If so, can you bisect the problem?

Comment 24 Adam 2012-12-12 19:14:57 UTC

Yes, kernel 3.0.54 from Arch Linux works and newer ones don't. Probably since 3.2, maybe even since 3.1.

Comment 25 Jesse Barnes 2012-12-12 20:06:12 UTC

Can you bisect it then?

Comment 26 Adam 2012-12-13 12:49:08 UTC

I'm sorry but I don't understand what you want me to do.

I have tested every kernel using Arch Rollback Machine and it looks like the last one I can boot with no problems is kernel 3.4.9. The next one, 3.5 boots to blank screen.

Comment 27 Daniel Vetter 2012-12-13 12:51:08 UTC

(In reply to comment #26)
> I'm sorry but I don't understand what you want me to do.
> 
> I have tested every kernel using Arch Rollback Machine and it looks like the
> last one I can boot with no problems is kernel 3.4.9. The next one, 3.5
> boots to blank screen.

Bisect the kernel using the complete git history, i.e. check where exactly between 3.4 and 3.5 things broke. My recommended howto:

http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
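
In outline, the bisect between the two releases looks like the sketch below. It assumes a clone of the mainline kernel tree as the current directory and uses only standard git bisect subcommands; the wrapper name bisect_between is made up for illustration, and building and booting each candidate kernel is left out.

```shell
# bisect_between GOOD BAD
# Start a bisect in the current repository between a known-good and a
# known-bad revision (v3.4 and v3.5 in this thread). git then checks out
# a commit roughly halfway between the two for testing.
bisect_between() {
    git bisect start "$2" "$1"
}
# After building and booting the checked-out commit, report the result
# and let git pick the next candidate:
#   git bisect good    # screen came up
#   git bisect bad     # blank screen
# Once git names the first bad commit, keep the log and clean up:
#   git bisect log > bisect.log
#   git bisect reset
```

For this bug the call would be bisect_between v3.4 v3.5. Because the failure is intermittent, a commit probably deserves several boots before it is marked good.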

Comment 28 Adam 2012-12-15 10:07:53 UTC

Created attachment 71540 [details]
git bisect log

Bisecting points that "c10e408a00bb74c39f4f9b817f2b948851513377 is the first bad commit".
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=c10e408a00bb74c39f4f9b817f2b948851513377

Comment 29 Daniel Vetter 2012-12-15 11:44:22 UTC

Hm, that's a strange bisect result, since it shouldn't have any effect at all (save when you're setting module options). Does reverting that patch on top of 3.5 still fix things?

$ git checkout v3.5
$ git revert c10e408a00bb74c39f4f9b817f2b948851513377

Comment 30 Adam 2012-12-15 17:48:45 UTC

I'm not sure if I did that correctly, but no, it boots to a black screen.

I also wonder why the last revision from the bisect was from 2012-03-01 when the good commit (working) was from 2012-05-20 and the bad commit (not working) was from 2012-07-21. Is this normal?

I did additional test.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=shortlog;h=f2fde3a65e88330017b816faf2ef75f141d21375
Revision f2fde3a65e88330017b816faf2ef75f141d21375 boots to black but the previous one, 28f3d717618156c0dcd2f497d791b578a7931d87 boots with no problem (it's working). Is that also normal?

Comment 31 Daniel Vetter 2012-12-15 17:58:08 UTC

Yeah, bisect can jump around quite a bit, since it needs to check all the parallel git trees. You can try testing the parent commit of your bisect again, just to make sure:

$ git checkout c501ae7f332cdaf42e31af30b72b4b66cbbb1604

Comment 32 Adam 2012-12-15 19:21:02 UTC

Revision c501ae7f332cdaf42e31af30b72b4b66cbbb1604 unfortunately also boots to blank screen.

Comment 33 Adam 2012-12-19 19:36:52 UTC

I did this bisect thing again with good rev 28f3d717618156c0dcd2f497d791b578a7931d87 and bad rev f2fde3a65e88330017b816faf2ef75f141d21375, and it again tells me that "c10e408a00bb74c39f4f9b817f2b948851513377 is the first bad commit". I'm sure I did everything right this time.

Comment 34 Daniel Vetter 2012-12-20 09:59:36 UTC

Ok, this bug here does not seem to be bisectable - the parent of c10e408a00bb74c39f4f9b should be good since an earlier bisect point contained it and was good. But actually it's a bad commit, too. Hence the really strange bisect result.

One thing you can try is to restart bisect with bad commit c10e408a00bb74c39f4f and good commit v3.4. git bisect will then first go back in time to test the common merge base, and if that works, too, will restart a bisect from that point. Sorry if this is such a mess, usually it's much simpler to bisect a regression :(

Comment 35 Adam 2012-12-20 12:46:05 UTC

The output was:

echinos@vbox ~/linux-git/linux.git-bisect (git)-[c501ae7...|bisect] % git bisect bad
The merge base c501ae7f332cdaf42e31af30b72b4b66cbbb1604 is bad.
This means the bug has been fixed between c501ae7f332cdaf42e31af30b72b4b66cbbb1604 and [76e10d158efb6d4516018846f60c2ab5501900bc].

Comment 36 Daniel Vetter 2012-12-20 13:14:27 UTC

On Thu, Dec 20, 2012 at 1:46 PM,  <bugzilla-daemon@freedesktop.org> wrote:
> echinos@vbox ~/linux-git/linux.git-bisect (git)-[c501ae7...|bisect] % git
> bisect bad
> The merge base c501ae7f332cdaf42e31af30b72b4b66cbbb1604 is bad.
> This means the bug has been fixed between
> c501ae7f332cdaf42e31af30b72b4b66cbbb1604 and
> [76e10d158efb6d4516018846f60c2ab5501900bc].

Have you restarted the git bisect with

$ git bisect reset

first?

Comment 37 Adam 2012-12-20 13:24:30 UTC

Of course I have.

Comment 38 Daniel Vetter 2012-12-20 13:34:09 UTC

(In reply to comment #37)
> Of course I have.

Oh, I've mixed up commit ids. The parent of the bisect commit is indeed in 3.4, so bisect can't really help us in drilling down further. The next trick is to do the reverse bisect to figure out where between c501ae7f332cdaf42e31af30b72b4b66cbbb1604 and v3.4 things have been fixed, so that we then can figure out why that fix has been lost again (hopefully).

Now git bisect is not symmetric in good/bad, so by default it refuses to figure out where something has been fixed (i.e. the first good commit). Hence you need to mark every commit which is good as bad and every commit which is bad as good. Then the first bad commit should be the one which fixed the bug in 3.4.
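
The label swap can be sketched as follows. The first option is the manual inversion described above. The second relies on custom bisect terms, which exist in Git 2.7 and later (newer than the Git versions in use when these comments were written); the term names broken and fixed and the function name find_fix are arbitrary choices for this sketch.

```shell
# Option 1 (any Git version): invert the labels by hand.
#   git bisect start
#   git bisect good c501ae7f332cdaf42e31af30b72b4b66cbbb1604  # "good" = still blank
#   git bisect bad v3.4                                       # "bad"  = screen works
#
# Option 2 (Git 2.7+): let git carry the inverted meaning in custom terms.
# find_fix BROKEN_REV FIXED_REV
find_fix() {
    git bisect start --term-old broken --term-new fixed &&
    git bisect broken "$1" &&
    git bisect fixed "$2"
}
# Each tested commit is then marked with "git bisect broken" or
# "git bisect fixed", and git reports the first "fixed" commit.
```

With the revisions from this comment the call would be find_fix c501ae7f332cdaf42e31af30b72b4b66cbbb1604 v3.4.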

Comment 39 Adam 2012-12-21 09:41:22 UTC

Created attachment 71912 [details]
reversed git bisect log

Final output:
"6a562e3daee217ce99fe0e31150acd89a5b22606 is the first bad commit"
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=6a562e3daee217ce99fe0e31150acd89a5b22606

Of course everything is reversed (good commit = not working; bad commit = working).

Comment 40 Chris Wilson 2012-12-21 10:13:34 UTC

Created attachment 71917 [details] [review]
Probe for the EDID using GPIO as a fallback

Comment 41 Adam 2012-12-22 16:07:03 UTC

(In reply to comment #40)
> Created attachment 71917 [details] [review]
> Probe for the EDID using GPIO as a fallback

I can't compile the kernel with this patch.

Log:
"CC [M]  drivers/gpu/drm/i915/intel_lvds.o
drivers/gpu/drm/i915/intel_lvds.c: In function ‘intel_lvds_init’:
drivers/gpu/drm/i915/intel_lvds.c:1033:42: error: ‘i2c’ undeclared (first use in this function)
drivers/gpu/drm/i915/intel_lvds.c:1033:42: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [drivers/gpu/drm/i915/intel_lvds.o] Error 1
make[3]: *** [drivers/gpu/drm/i915] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make: *** [drivers] Error 2"


I have compiled master revision from linux-git (without this patch) and it seems that problem is resolved there. System is booting normally.

Unfortunately, the Arch Linux kernel "linux 3.7.1-2" still boots to black. I also can't apply the patch there.

Comment 42 Tomas M. 2012-12-23 15:54:25 UTC

hi

I've been struck by this bug too (also on 945gm hardware).

this started happening since 3.7-rc1 (see bug report: https://bugs.freedesktop.org/show_bug.cgi?id=57365 )

on this laptop the bug is 99.999% reproducible (it booted correctly only once since 3.7-rc1).

whatever needs testing, please let me know.

Comment 43 Nathan Schulte 2013-01-09 07:18:52 UTC

Created attachment 72707 [details]
dmesg and intel_reg_* logs for various boots

I am experiencing this bug as well, on a fresh install of Debian Sid.  I can control the backlight just fine, and it does indeed appear to be a race condition as there are [rare] times that I boot into a console with KMS without issues.

I am experiencing it with both the kernel in Debian Sid as well as the 3.7 kernel from their experimental repo:

$ uname -a # 3.2, unstable kernel
Linux desmas-l 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2 x86_64 GNU/Linux

$ uname -a # 3.7, experimental kernel
Linux desmas-l 3.7-trunk-amd64 #1 SMP Debian 3.7.1-1~experimental.2 x86_64 GNU/Linux


I have found two workarounds to this:

1) Adding the i915 kernel module to my initramfs.
2) Booting with the kernel parameter video=SVIDEO-1:d
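
The first workaround can be sketched for a Debian system as below. It assumes the initramfs-tools layout, where modules named in /etc/initramfs-tools/modules are included in the initramfs the next time update-initramfs runs; the helper name ensure_module_listed is made up for this sketch. The second workaround needs no script, since video=SVIDEO-1:d is simply added to the kernel command line.

```shell
# ensure_module_listed MODULE [FILE]
# Append MODULE to FILE unless a line naming it is already present.
# FILE defaults to the initramfs-tools module list used on Debian.
ensure_module_listed() {
    file=${2:-/etc/initramfs-tools/modules}
    grep -Fqx -- "$1" "$file" 2>/dev/null || echo "$1" >> "$file"
}
# Then, as root, list the module and rebuild the initramfs:
#   ensure_module_listed i915 && update-initramfs -u
```

Running the helper twice leaves a single i915 line in the file, so it is safe to repeat.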


I have attached the output from dmesg for booting with the 3.2 kernel, as well as the 3.7 kernel, all with the kernel parameter drm.debug=0xe, and for both with and without the video=SVIDEO-1:d workaround.  In addition you will find intel_reg_dumper outputs, as well as intel_reg_reads for 0x70000 and 0x71000 for the 3.2 boots; I could not get the tools to work for the 3.7 kernel, failing with the following message:

Couldn't map MMIO region: Resource temporarily unavailable

Lastly, there is a dmesg from a 3.7 boot that worked without a workaround.

Please let me know if there is anything else I can assist with!  I will attempt a git bisect as I am familiar with that process.  I can provide i915.modeset=0 or nomodeset boot dmesg logs as well if needed.

--
Nate

Comment 44 Nathan Schulte 2013-01-09 07:35:26 UTC

Also, I forgot to mention one thing:

If I hibernate (suspend to disk) my machine while experiencing the issue, upon rebooting from the hibernate, I am presented with a console as expected.  I will provide logs for these "resume" boots shortly.

Comment 45 Nathan Schulte 2013-01-09 07:54:03 UTC

Created attachment 72708 [details]
dmesg and intel_reg_* logs for hibernate boots

Here are the associated hibernate booted logs.

*-hibernate : the logs for the initial boot, with a black screen
*-hibernate_resume : the logs for the "resume" boot, which appears to work fine

I've provided both for the 3.2 kernel as well as the 3.7 kernel.  There are dmesg and intel_reg_dumper outputs for all boots (the gpu tools don't appear to work for my 3.7 kernel, so those are excepted).

--
Nate

Comment 46 Nathan Schulte 2013-01-09 09:31:08 UTC

Created attachment 72710 [details]
dmesg and intel_reg_* logs for various boots

I've compiled intel-gpu-tools from git and updated the first set of outputs (log.tar.gz); they are all new from the new build and contain outputs for the 3.7 kernel as well.

* c612481 (HEAD, origin/master, origin/HEAD, master) tests/gem_seqno_wrap: skip if debugfs entry is not there

Comment 47 Nathan Schulte 2013-01-09 09:32:58 UTC

Created attachment 72711 [details]
dmesg and intel_reg_* logs for various boots

Same as previous attachment, fixing content-type in tracker.

Comment 48 Nathan Schulte 2013-01-11 00:38:10 UTC

Some updates:

First, I built the mainline kernel last night (a few commits before 3.8-rc3), and the problem appears to have been fixed.

Next, I built vanilla 3.2 and ensured that the problem still existed; it did.  I then proceeded to slim down my .config to speed up the build, and attempted to bisect the issue, noting that "good" was really bad, and "bad" was actually good.

A few of the chosen bisection points would cause kernel panics from DRM/i915, so I skipped these.  At some point, much before the vanilla that the Debian 3.7 kernel is built upon, the issue appears to have been fixed.  How can that be?  Could there be multiple regressions causing and subsequently fixing this issue throughout the history?

Here is a history of bisection before I gave up, note that "good" is really bad, and "bad" is really good.

COMMIT [COMMIT_DATE] GIT_BISECT_STATUS - OTHER_DETAILS

805a6af [2012-01-04 175544] good - start; issue apparent
c558386 [2012-07-02 042322] good - 1st;   issue apparent
7f60ba3 [2012-10-07 033050] skip - 2nd;   issue not apparent; kernel panic
9ff601a [2012-06-28 090327] good - 3rd;   issue apparent
327967c [2012-09-04 090336] skip - 4th;   issue not apparent
3077494 [2012-09-21 094309] skip - 5th;   issue not apparent; kernel panic
5cbe786 [2012-08-16 081314] bad  - 6th;   issue not apparent
974b335 [2013-01-08 205356] bad  - stop;  issue not apparent

--

If we can safely ignore the kernel panics, the issue was last seen in commit 9ff601a (June 28, 2012) and is first seen fixed in commit 7f60ba3.  I will try bisecting those two commits and see what I find.

The data from the bisect doesn't jibe with the kernel from Debian, though.  I don't know what to make of that, honestly.  Any ideas?

Comment 49 Daniel Vetter 2013-01-11 08:30:30 UTC

> --- Comment #48 from Nathan Schulte <nmschulte@gmail.com> ---
> A few of the chosen bisection points would cause kernel panics from DRM/i915,
> so I skipped these.  At some point, much before the vanilla that the Debian 3.7
> kernel is built upon, the issue appears to have been fixed.  How can that be?
> Could there be multiple regressions causing and subsequently fixing this issue
> throughout the history?

Are you sure that the fixed versions are really included in 3.7? Since
the drm-intel-next tree for e.g. 3.8 is usually based on some 3.6-rc
kernel, git bisect can let you test kernel versions based on 3.6-rc
kernels. But the actual set of patches is only included in 3.8, not
3.7.

To check such a case run

$ git tag --contains <commit-id>

it'll list all the version tags which contain the given commit. If 3.7
is not among them, then it's just the branch-y nature of git messing
around with you ;-) Can you please check that and if it's just that,
continue the bisect?
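
The containment check can be wrapped in a small helper so it can be used in a condition. This is a sketch around the git tag --contains command quoted above; the function name release_contains is made up for illustration.

```shell
# release_contains TAG COMMIT
# Succeed if TAG is among the tags that contain COMMIT, i.e. if the
# release identified by TAG already includes that commit.
release_contains() {
    git tag --contains "$2" | grep -Fqx -- "$1"
}
```

For one of the bisection points from comment 48, this could be used as release_contains v3.7 7f60ba3 || echo "not part of v3.7".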

Comment 50 Nathan Schulte 2013-01-11 12:06:38 UTC

> --- Comment #49 from Daniel Vetter <daniel@ffwll.ch> ---
> Are you sure that the fixed versions are really included in 3.7? Since
> the drm-intel-next tree for e.g. 3.8 is usually based on some 3.6-rc
> kernel, git bisect can let you test kernel versions based on 3.6-rc
> kernels. But the actual set of patches is only included in 3.8, not
> 3.7.

Actually, I'm fairly certain it is not, :).  I tried to git bisect the
two commits mentioned, and bisect noted that the former was not a
parent of the latter.

> To check such a case run
> 
> $ git tag --contains <commit-id>
> 
> it'll list all the version tags which contain the given commit. If 3.7
> is not among them, then it's just the branch-y nature of git messing
> around with you ;-) Can you please check that and if it's just that,
> continue the bisect?

Certainly, I will attempt the bisect again soon.  Do you know if there
is a way to bisect with only commits that are children of the branch of
the starting commit?

--
Nate

Comment 51 Nathan Schulte 2013-01-11 12:16:10 UTC

I've found an interesting article that explains that git bisect is
branch aware, meaning that it knows about merge commits and "does the
right thing."

http://stackoverflow.com/questions/3673377/why-isnt-git-bisect-branch-aware

Thinking about the issue, I'm wondering if git's "strange" jumps back
in time for a bisect are actually _because_ of this methodology.
Perhaps I should continue with this bisect log after all?  I have it
saved, so perhaps that's what I'll do.

Daniel, if you could confirm or deny my suspicions above before I go
through with all of that work, I would much appreciate it!

Comment 52 Daniel Vetter 2013-01-11 13:29:08 UTC

Yeah, that's what I've tried to explain (but without the pretty branch graphs). If you're sure that you haven't marked any commits wrongly with good/bad, you can restart git bisect with the capture bisect log, i.e. after starting the bisect ignore the first bisect request and manually mark all the already tested commits with

$ git bisect good|bad|skip sha1

After that's done, git will compute a new optimal bisect point and check it out for testing.
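
Re-marking the already tested commits can be scripted. The sketch below turns lines of the form "SHA STATUS" into the corresponding git bisect commands; the input format and the function name bisect_commands are assumptions made for this example. If the output of git bisect log from the original session was saved to a file, git bisect replay with that file re-applies it in one step instead.

```shell
# bisect_commands
# Read "SHA STATUS" lines on stdin (STATUS = good, bad or skip) and print
# the matching git bisect commands. Printing rather than running them
# leaves room to review the list first; pipe the output to sh to apply it.
bisect_commands() {
    while read -r sha status; do
        case $status in
            good|bad|skip) echo "git bisect $status $sha" ;;
            *) echo "ignoring '$sha': unknown status '$status'" >&2 ;;
        esac
    done
}
```

With three of the results from comment 48, printf '%s\n' '805a6af good' 'c558386 good' '7f60ba3 skip' | bisect_commands prints one git bisect line per commit, to be run after git bisect start.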

Comment 53 Nathan Schulte 2013-01-11 18:28:50 UTC

I don't think I've marked any commits improperly.  I skipped the ones
that caused kernel panics, though I suspect I could include those data
points and possibly speed up the process.  I'm not sure, so I'll leave
them out.

One question: right now I'm bisecting for a bug fix rather than a bug
regression.  Would it be wiser to instead bisect for a regression from
say 3.0 (or similar, where the bug does not exist, if such a commit
exists) to 3.2?

Another question: git bisect can narrow down commits to those that
affect a certain part of the tree.  Do you think it's safe to narrow
this down to the DRM or even i915 portion?  That might speed up the
process tremendously if so.

Comment 54 Daniel Vetter 2013-01-11 18:38:22 UTC

On Fri, Jan 11, 2013 at 7:28 PM,  <bugzilla-daemon@freedesktop.org> wrote:
> Another question: git bisect can narrow down commits to those that
> affect a certain part of the tree.  Do you think it's safe to narrow
> this down to the DRM or even i915 portion?  That might speed up the
> process tremendously if so.

I tend to not restrict bisect - most often they're in drm/i915, but
sometimes not ...

Comment 55 Nathan Schulte 2013-01-13 02:43:39 UTC

Created attachment 72938 [details]
bisect

I've completed the bisection, and git points to this commit as being the
one that resolves the issue:

commit 0b9f43a0ee7e89013a3d913ce556715fd8acb674
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Jun 5 10:07:11 2012 +0200

    drm/i915: allow pipe A for lvds on gen4
    
    Given the havoc the missing backlight pipe select code might have
    caused, let's try to re-enabled pipe A support for lvds on gen4 hw.
    Just to see what all blows up ...
    
    Note though that
    
    commit 4add75c43f39573edc884d46b7c2b7414f01171a
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Sat Dec 4 17:49:46 2010 +0000
    
        drm/i915: Allow LVDS to be on pipe A for Ironlake+
    
    claims that this caused tons of spurious wakeups somehow.
    
    More details can be found in the old revert:
    
    commit 12e8ba25ef52f19e7a42e61aecb3c1fef83b2a82
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Tue Sep 7 23:39:28 2010 +0100
    
        Revert "drm/i915: Allow LVDS on pipe A on gen4+"
    
        Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16307
    
    Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I've attached the bisect log, including marking the final commit (should
you wish to replay the log, you will probably need to remove this
entry).

Given the nature of the bug, appearing to be a race condition and not
always easily reproducible, I booted each kernel (I built it nearly 40
times in all) until I either experienced the black screen, or about 5-10
times until I was satisfied it was not occurring, with the latter builds
receiving more scrutiny.  I have booted this "fix" commit many, many
times and have yet to experience the issue, whereas the commit before
had the issue after the second boot.

It should be noted that this commit immediately follows some that appear
to modify backlight operation for gen4.  It should also be noted that my
particular controller, on a Dell Vostro 1510, is as follows:

00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile
GM965/GL960 Integrated Graphics Controller (primary) [8086:2a02] (rev
0c) (prog-if 00 [VGA controller])
	Subsystem: Dell Device [1028:0273]
	Flags: bus master, fast devsel, latency 0, IRQ 47
	Memory at f8000000 (64-bit, non-prefetchable) [size=1M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 1800 [size=8]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 3
	Kernel driver in use: i915

Lastly, these commits appear to be merged into mainline at the v3.6-rc1
release, and came after the v3.4 release.  I am experiencing the blank
screen issue with a Debian 3.2 kernel, and a Debian 3.7 kernel.  Looking
at the source for the Debian 3.2 kernel, I do not see this change; there
is no check for GEN4 and setting the pipe A mask (the lsb; (1 << 0)),
and the catchall else only sets pipe B (bit 1; (1 << 1)).  Looking at
the source for the Debian 3.7 kernel, the change is visible.

So, we've got conflicting data... Perhaps there is some issue with
modesetting and checking the crtc_mask properly?  I could look at that
code if needed.  It's also possible I have no idea what I'm talking
about.

Finally, I am in the process of frobbing the Debian 3.7 source to see
what happens.  I will share those results when I have them.  Again,
perhaps finding the actual regression (assuming the bug hasn't been
hidden all along with no actual regression) will shed some light.
Further, I'm thinking about looking into SVIDEO interactions with KMS,
as that appears to solve the issue.  I may also try downgrading my BIOS
and see what that affects, if possible.  Lots of places to go from here.

--
Nate

Comment 56 Nathan Schulte 2013-01-13 12:19:42 UTC

So here's the deal:

The bug (due to a regression or otherwise) was indeed fixed in the
commit identified by the bisect.  However, between that commit and the
Linux 3.7.1 release, there has been a regression.  Since that
regression, the bug appears to have been fixed again (in 3.8-rc3).

I am bisecting at the moment to find the cause of the regression, as it
doesn't appear obvious.  I assume it's got something to do with the
recent KMS changes to i915.

--
Nate

Comment 57 Nathan Schulte 2013-01-15 18:00:43 UTC

Nathan Schulte wrote:
> I am bisecting at the moment to find the cause of the regression, as it
> doesn't appear obvious.  I assume it's got something to do with the
> recent KMS changes to i915.

fa55583797d12b10928a1813f3dcf066637caf5e is the first bad commit
commit fa55583797d12b10928a1813f3dcf066637caf5e
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Oct 10 23:14:00 2012 +0200

    drm/i915: fixup the plane->pipe fixup code
    
    We need to check whether the _other plane is on our pipe, not whether
    our plane is on the other pipe. Otherwise if not both pipes/planes are
    active, we won't properly clean up the mess and set up our desired
    plane->pipe mapping.
    
    v2: Fixup the logic, I've totally fumbled it. Noticed by Chris Wilson.
    
    v3: I've checked Bspec, and the flexible plane->pipe mapping is a
    gen2/3 feature, so test for that instead of PCH_SPLIT
    
    v4: Check whether we indeed have 2 pipes before checking the other
    pipe, to avoid upsetting i845g/i865g. Noticed by Chris Wilson.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=51265
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=49838
    Tested-by: Dave Airlie <airlied@gmail.com>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk> #855gm
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

This is the commit that introduced the bug into Debian's 3.7.1 kernel.
I believe this is also the cause of
https://bugs.freedesktop.org/show_bug.cgi?id=57365

Mainline 3.8-rc3 seems to be working fine.

Now what?

--
Nate

Comment 58 Daniel Vetter 2013-01-15 18:10:42 UTC

Can you please test whether

commit b0a2658acb5bf9ca86b4aab011b7106de3af0add
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Dec 18 09:37:54 2012 +0100

    drm/i915: don't disable disconnected outputs

fixes things in 3.8? See bug #58396. The patch is already on its way to stable kernels ...

Comment 59 Nathan Schulte 2013-01-15 21:14:23 UTC

I have tested that commit as it exists in the kernel repo.  Building
that commit seems to fix the issue; I was unable to reproduce it.

However, I tested the commit right before it as well; I was also unable
to reproduce it.

I have cherry-picked the commit in question on to the commit I
identified as causing the issue (the second time, not the first one
noted above), and it does not fix the issue; I am able to reproduce it.

I am currently bisecting between the two to see where the fix is.

--
Nate

Comment 60 Nathan Schulte 2013-01-16 22:43:58 UTC

On Tue, 2013-01-15 at 15:14 -0600, Nathan Schulte wrote:
> I am currently bisecting between the two to see where the fix is.

f20e0b08b8b2a8432e6abf3683960099f0ab2958 is the first bad commit
commit f20e0b08b8b2a8432e6abf3683960099f0ab2958
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 7 10:43:25 2012 +0000

    drm/i915: Prefer CRTC 'active' rather than 'enabled' during WM computations
    
    Only the intel_crtc->active is accurate at the point where we wish to
    perform WM computations, so use it instead of crtc->enabled.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

This commit fixes the issue for me.  I can try cherry-picking if you
wish.
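
Note that git prints "first bad commit" above even though the commit is the fix, presumably because the good/bad labels were swapped for this bisect (kernels that boot fine marked "bad").  The search itself is the same either way.  A minimal sketch of the idea in Python follows; first_commit_where and the commit names c1..c6 are hypothetical, and the sketch assumes the outcome flips exactly once along the history.

```python
def first_commit_where(commits, predicate):
    """Return the first commit for which predicate(commit) is True.

    Assumes `commits` is ordered oldest to newest, that the predicate is
    False for every commit before some point and True from there on, and
    that it is True for the last commit.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(commits[mid]):
            hi = mid        # the flip happened at mid or earlier
        else:
            lo = mid + 1    # the flip happened after mid
    return commits[lo]


# Hypothetical history: the issue is present in c1..c3 and gone from c4 on.
history = ["c1", "c2", "c3", "c4", "c5", "c6"]
boots_fine = {"c4", "c5", "c6"}

# Hunting for a fix: the predicate is "this kernel boots fine".
print(first_commit_where(history, lambda c: c in boots_fine))  # c4
```

Hunting for a regression uses the same function with the predicate inverted ("this kernel shows the issue").  With an intermittent issue, a single lucky boot can give the wrong answer at any step, so each step needs repeated boots.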

--
Nate

Comment 61 Nathan Schulte 2013-01-17 15:32:20 UTC

Cherry-picking Chris Wilson's changes onto the broken commit (the 2nd time) fixes the issue.

Comment 62 Tomas M. 2013-01-17 23:21:49 UTC

hmmm. tried 3.8-rc3 and it did NOT fix the issue for me.

maybe my bug is different.

Comment 63 Nathan Schulte 2013-01-18 15:44:42 UTC

> Comment # 62 on bug 53926 from Tomas M. 
> hmmm. tried 3.8-rc3 and it did NOT fix the issue for me.
> 
> maybe my bug is different.

Perhaps so.  I suppose mine could be different from the OP as well.  In
fact, it looks like I was encountering two different bugs with the same
effect (both race conditions, apparently).

Did you try the other commits I listed?  Both the ones with the fix, and
the commits immediately prior?  That might help narrow down your
particular problem.

--
Nate

Comment 64 Tomas M. 2013-01-19 16:42:49 UTC

(In reply to comment #63)
> > Comment # 62 on bug 53926 from Tomas M. 
> > hmmm. tried 3.8-rc3 and it did NOT fix the issue for me.
> > 
> > maybe my bug is different.
> 
> Perhaps so.  I suppose mine could be different from the OP as well.  In
> fact, it looks like I was encountering two different bugs with the same
> effect (both race conditions, apparently).
> 
> Did you try the other commits I listed?  Both the ones with the fix, and
> the commits immediately prior?  That might help narrow down your
> particular problem.
> 
> --
> Nate

Hi,

this is my result:
  1 works 0b9f43a0ee7e89013a3d913ce556715fd8acb674 | should work
  2 worked once fa55583797d12b10928a1813f3dcf066637caf5e | should not work
  3 does not work fa55583797d12b10928a1813f3dcf066637caf5e^ | should work
  4 does not work b0a2658acb5bf9ca86b4aab011b7106de3af0add | should fix the problem
  5 does not work f20e0b08b8b2a8432e6abf3683960099f0ab2958 | should not work
  6 worked once f20e0b08b8b2a8432e6abf3683960099f0ab2958^ | should work


0b9f43 was only tested twice. the others were tested until they failed (only once or twice).

i will retest 0b9f43 and report if it fails

Comment 65 Nathan Schulte 2013-01-19 22:26:54 UTC

Hmm.  I have no idea, I'm not that familiar with the code base.

I would recommend you try performing the bisection as I did, and note the commits that break/fix your issue in your bug report.  In hindsight, I probably should have made a new bug report for my specific hardware.

One thing I want to note:  While I was testing each kernel, I noticed a sort of pattern.

1) Boot 3.8-rc3 to build next bisected kernel.
2) Reboot into new kernel; would boot fine, no issue.
3) Reboot into new kernel again (via SysRq+REISUB); would boot fine, no issue.
4) Reboot into new kernel again (via SysRq+REISUB); would not boot fine, issue present.
5) Boot 3.8-rc3 to build next bisected kernel.
6) Reboot into new kernel; would boot fine, no issue.
7) Reboot into new kernel again (via SysRq+REISUB); would boot fine, no issue.
8) Reboot into new kernel again (via SysRq+REISUB); would not boot fine, issue present.

For every kernel that had the problem, it seems that every fourth boot (every third of the kernel in question) would be the one that would cause the issue.  I got into such a rhythm of building kernels (I did it over 100 times...), I would be building a new kernel every fourth warm-boot.  I wonder if this pattern is helpful at all to diagnose the issue.

Anyway, my point for Tomas: it seems as though if you get into a pattern like that, you shouldn't have to test each kernel more than three times.  While I was testing, for the kernels that did not have an issue on the third boot, I would reboot them 10 times or more just to make sure.

Comment 66 Tomas M. 2013-01-19 22:44:14 UTC

Nathan,

see my bug report: https://bugs.freedesktop.org/show_bug.cgi?id=57365

maybe you can attempt a bisect there and see if you reach my own bad commit.

im just here cause someone suggested i might have this very same bug, which im not sure about.

my problem appeared at 3.7-rc1 but since its racy (sometimes i can boot correctly, sometimes i cant) it could be an older regression/bug.

of course to mark a good commit i booted around 10 times each good bisect point.

right now on 3.8-rc4 i have a 50% chance of booting correctly. additionally, if i suspend/resume a broken boot, it fixes itself.
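
To put rough numbers on the "boot around 10 times" rule for a racy failure, here is a small Python sketch.  It assumes every boot is an independent trial with the same chance of coming up fine, which is a simplification (the every-third-boot pattern reported earlier in this thread would not fit that model); the function names are made up for this example.

```python
import math


def chance_of_false_good(p_clean_boot, boots):
    """Chance that a kernel with the bug still boots fine `boots` times in a row."""
    return p_clean_boot ** boots


def boots_needed(p_clean_boot, max_risk):
    """Smallest number of clean boots that keeps the false-good risk at or below max_risk.

    Both arguments must be strictly between 0 and 1.
    """
    return math.ceil(math.log(max_risk) / math.log(p_clean_boot))


print(chance_of_false_good(0.5, 10))  # 0.0009765625
print(boots_needed(0.5, 0.01))        # 7
```

Under that assumption, ten clean boots at a 50% failure rate leave roughly a 0.1% chance of wrongly marking a broken kernel as good.  A kernel that fails less often needs correspondingly more boots.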

Comment 67 Daniel Vetter 2013-01-22 09:25:01 UTC

Ok, we need to clear up this bug here, since too many other reporters (with different issues) chipped in.

Tomas, I think it's best to track your bug in #57365 

Nathan, it sounds like your bug is fixed. I've also checked stable kernels and the change seems to have inadvertently been backported along with another manual backport. At least the wm functions are now using ->active and no longer ->enabled.

Chris, can you quickly fix up the compile fail in Adam's patch.

To everyone else who might add his issue to this bug report: This is Adam's report, if you don't have the exact same hw/sw combo, chances are really high that you have a different bug (especially if you bisect to a different commit). Please file your own bug report, thanks.

Comment 68 Daniel Vetter 2013-02-15 10:30:19 UTC

Can you please retest with latest drm-intel-nightly from

http://cgit.freedesktop.org/~danvet/drm-intel

I've just merged two patches to adjust the pll limits for sdvo/lvds on gen3/4.

Comment 69 Imre Deak 2013-05-23 15:41:24 UTC

(In reply to comment #67)
> [...]
> Chris, can you quickly fix up the compile fail in Adam's patch.
> 
> To everyone else who might add his issue to this bug report: This is Adam's
> report, if you don't have the exact same hw/sw combo, chances are really
> high that you have a different bug (especially if you bisect to a different
> commit). Please file your own bug report, thanks.
>
(In reply to comment #68)
> Can you please retest with latest drm-intel-nightly from
> http://cgit.freedesktop.org/~danvet/drm-intel
> I've just merged two patches to adjust the pll limits for sdvo/lvds on gen3/4.

Based on the above, we are now only missing Adam's feedback on this. Adam, could you test the above branch?

Comment 70 Adam 2013-05-24 17:14:12 UTC

Issue is resolved for me since kernel 3.8.

Comment 71 Daniel Vetter 2013-05-24 19:36:32 UTC

Thanks for reporting back, closing as fixed.

Comment 72 Jari Tahvanainen 2016-10-07 05:38:02 UTC

Closing.
