Bug 15602 - Screen Blanking with kernel >= 2.6.25 and Intel Driver
Status: RESOLVED NOTOURBUG
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: 7.3 (2007.09)
Hardware: x86 (IA32) Linux (All)
Importance: high major
Assigned To: Wang Zhenyu
QA Contact: Xorg Project Team
Reported: 2008-04-18 20:09 UTC by Justin Madru
Modified: 2008-08-15 10:53 UTC (History)

Attachments
Intel regdumps and corresponding X.log files (43.71 KB, application/x-bzip), 2008-04-18 20:09 UTC, Justin Madru
2.6.25 Kernel config (37.33 KB, text/plain), 2008-04-18 20:11 UTC, Justin Madru
Hardware/Software Environment Info (24.34 KB, text/plain), 2008-04-18 20:13 UTC, Justin Madru
Git Bisect Log - to within 42 commits (1.57 KB, text/plain), 2008-05-12 10:58 UTC, Justin Madru
Full Git Bisect Log (2.40 KB, text/plain), 2008-05-15 11:10 UTC, Justin Madru
Full Git Bisect Log (2.40 KB, text/plain), 2008-05-15 11:11 UTC, Justin Madru
Git Bisect Config (35.53 KB, text/plain), 2008-05-15 11:16 UTC, Justin Madru
X.log Using Jesse's Patch (20.80 KB, text/plain), 2008-05-15 12:43 UTC, Justin Madru
Blank screen Xorg.log with ModeDebug (82.56 KB, text/plain), 2008-05-22 11:32 UTC, Justin Madru

Description Justin Madru 2008-04-18 20:09:51 UTC
Created attachment 16031 [details]
Intel regdumps and corresponding X.log files 

After the boot splash screen, the screen goes through mode changes. It goes blank and "flickers", seemingly checking for the right display settings. What should happen next is the display of a cursor on a colored background until GDM is fully loaded. But when using a kernel in the range 2.6.25-rc3...2.6.25, once the screen goes blank it never turns back on. Everything else is normal: I hear the sound GDM makes when it has fully started, and I can even log in, except that the screen is still blank (I have to do everything blind). Switching to a terminal (ctrl+alt+f#) doesn't help; the screen goes through another mode change, but still ends up blank. Upon shutting down, the usplash shutdown screen displays (meaning the screen is somehow re-enabled!). Also, whenever booting with the 2.6.25 kernel, the splash screen gets garbled.

This blank screen problem doesn't happen every time; it appears seemingly at random, but often enough that I sometimes have to reboot repeatedly. I can also trigger the blank screen by switching terminals (ctrl+alt+f#) at certain points during the usplash boot splash. I've noticed that disabling the boot splash screen almost always (but not quite always) avoids the problem. With the vesa X.org driver I haven't been able to trigger the blank screen; in that setup the i915 kernel module doesn't load.

I had the problem with Ubuntu 7.10, and still have it after upgrading to 8.04.
The computer is a Dell Inspiron E1505 laptop with an Intel 945 graphics card.
I've tested with kernel versions from 2.6.22 to 2.6.24 and never had the problem.

Because the problem only happens on a 2.6.25 kernel, I suspect the bug is related to a change in the kernel that is increasing the chance of hitting a bug in the X.org Intel driver.

I've posted this problem to lkml (http://lkml.org/lkml/2008/3/12/290),
and there's an open bug report on bugzilla.kernel.org (http://bugzilla.kernel.org/show_bug.cgi?id=10235)
Comment 1 Justin Madru 2008-04-18 20:11:27 UTC
Created attachment 16032 [details]
2.6.25 Kernel config
Comment 2 Justin Madru 2008-04-18 20:13:00 UTC
Created attachment 16033 [details]
Hardware/Software Environment Info
Comment 3 Wang Zhenyu 2008-04-21 20:24:01 UTC
Is intel_fb loaded by Ubuntu at startup?

Does this only happen with gdm, or with other GUI session managers as well? Could you disable gdm and test starting X from the console in runlevel 2 or 3?

Comment 4 Justin Madru 2008-04-23 09:07:39 UTC
I get the blank screen with and without the intel_fb. I currently have it compiled out of my kernel because Jesse Barnes said it's only useful for a boot logo or boot splash (but ubuntu's boot splash still works).

This is what I have enabled in my config:
CONFIG_AGP=y
CONFIG_AGP_INTEL=m
CONFIG_DRM=y
CONFIG_DRM_I915=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_BACKLIGHT_CLASS_DEVICE=m
CONFIG_DISPLAY_SUPPORT=m
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=128
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y

It seems to ~only~ happen with gdm (haven't tested xdm/kdm/etc). If I disable gdm and start X manually it ~works~ (~20 reboots, only <2 blank screens). Disabling the boot splash also works fairly well at fixing the problem.

Jesse thought it was either:
1) VT switching
2) Mode programming
3) Pipe programming timing
4) Timing change in 2.6.25
5) Memory layout change in 2.6.25

(see http://lkml.org/lkml/2008/3/12/290 lkml thread for what I've tested so far.)
Comment 5 Justin Madru 2008-04-24 23:08:56 UTC
I tested a Sony Vaio with an Intel 965 and the 2.6.25 kernel. I got the same blank screen on the mode change from the splash screen to X/gdm. The computer is running Ubuntu 7.10. Once I get the time I'm going to try a Fedora 9 live CD, because it uses a 2.6.25 kernel. We'll see how that goes.

Comment 6 Justin Madru 2008-05-07 11:49:50 UTC
I tested kdm and kernel 2.6.26-rc1 and the blank screen still happens.
Comment 7 Justin Madru 2008-05-12 10:58:56 UTC
Created attachment 16486 [details]
Git Bisect Log - to within 42 commits
Comment 8 Jesse Barnes 2008-05-12 11:02:55 UTC
Justin, can you try applying this patch to your X driver?

https://bugs.freedesktop.org/attachment.cgi?id=16394

It should give us some debug output even if it doesn't fix the problem (see the parent bug for more hints in case it doesn't work).
Comment 9 Jesse Barnes 2008-05-12 11:03:29 UTC
I mean see https://bugs.freedesktop.org/show_bug.cgi?id=13326, basically you can try setting DSPARB to 48 if the patch doesn't work.
Comment 10 Justin Madru 2008-05-12 11:07:54 UTC
I've started to git-bisect the problem (i.e. it seems to be at least a kernel regression, maybe in combination with an X bug).

Unfortunately, I've hit a snag that has slowed me down.
During the boot splash I get the following error message:

select() to /dev/rtc to wait for clock tick timed out
Unable to set system clock

The problem is that it drops to console mode and never goes back to the splash mode. As a result, the blank screen problem never happens! Furthermore, a few reboots later grub stops counting down: it stalls at 2. I have to press esc and select the kernel to boot. Then, after a few more reboots, I get the BIOS error message:

time-of-day clock stopped

The computer can't boot, and the only solution I've found is removing the BIOS battery for about 1 hour and then rebooting. But that's only temporary, because today the select()... and grub issues are back. So I can't resume the git bisect until I remove the BIOS battery again!

I think the blank screen issue is wearing down the internal real time clock or something related to the bios! This is _Not_Good_ ! 

I've narrowed down the regression to about one of the first 42 commits on the following page:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=shortlog;h=6478d8800b75253b2a934ddcb734e13ade023ad0

The commits on that page show that the regression is either a bug in:
1) sched
2) hrtimer
3) Preempt-RCU

I think it's #2 or #3.

Hopefully I can fix my BIOS error so that I can continue the git bisect. Until then I hope this helps.
Comment 11 Justin Madru 2008-05-12 11:13:11 UTC
(In reply to comment #9)
> I mean see https://bugs.freedesktop.org/show_bug.cgi?id=13326, basically you
> can try setting DSPARB to 48 if the patch doesn't work.
> 

Is this a patch for the kernel/modules or for X/drivers?
How do I apply the patch? (I'm still new to all this.)
Comment 12 Jesse Barnes 2008-05-12 11:30:54 UTC
The patch is for the Intel X driver.  Are you building the X driver from git?  If so, just cd into the xf86-video-intel git tree and do 'patch -p1 < patch_filename'.

Are you on IRC?  You could hop onto Freenode and enter the #intel-gfx channel if you want to talk in real time.
Comment 13 Justin Madru 2008-05-15 11:09:40 UTC
I finished git bisecting the kernel. When testing, if a revision booted correctly I rebooted 3 times to make sure. If it booted to a blank screen, I knew that revision had the problem.

Also when I got to within the last ~80 commits the kernel started to corrupt my real time clock! Similar to https://launchpad.net/ubuntu/+source/linux-source-2.6.15/+bug/43745

8f4d37ec073c17e2d4aa8851df5837d798606d6f is first bad commit
commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Fri Jan 25 21:08:29 2008 +0100

sched: high-res preemption tick

Use HR-timers (when available) to deliver an accurate preemption tick.

The regular scheduler tick that runs at 1/HZ can be too coarse when nice level are used. The fairness system will still keep the cpu utilisation 'fair' by then delaying the task that got an excessive amount of CPU time but try to minimize this by delivering preemption points spot-on.

The average frequency of this extra interrupt is sched_latency / nr_latency. Which need not be higher than 1/HZ, its just that the distribution within the sched_latency period is important.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

:040000 040000 ab225228500f7a19d5ad20ca12ca3fc8ff5f5ad1 f1742e1d225a72aecea9d6961ed989b5943d31d8 M      arch
:040000 040000 25d85e4ef7a71b0cc76801a2526ebeb4dce180fe ae61510186b4fad708ef0211ac169decba16d4e5 M      include
:040000 040000 9247cec7dd506c648ac027c17e5a07145aa41b26 950832cc1dc4d30923f593ecec883a06b45d62e9 M      kernel
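
Because the blank screen is intermittent, a run of clean boots does not prove that a revision is good. The sketch below estimates the chance of wrongly marking a bad revision as good. It assumes that each boot of a bad kernel fails independently with a fixed probability (the 20-30% figure reported later in this thread is used as an assumption) and that "rebooted 3 times" means 4 clean boots in total. The function names are made up for illustration.

```python
import math


def false_good_probability(failure_rate, clean_boots):
    """Chance that a bad kernel boots cleanly `clean_boots` times in a row.

    Assumes every boot fails independently with probability `failure_rate`.
    """
    return (1.0 - failure_rate) ** clean_boots


def boots_needed(failure_rate, max_false_good):
    """Smallest number of clean boots that pushes the false-good chance
    down to `max_false_good` or below."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Assumed: 1 initial boot plus 3 confirmation reboots = 4 clean boots.
    for rate in (0.2, 0.3):
        p = false_good_probability(rate, 4)
        n = boots_needed(rate, 0.05)
        print(f"failure rate {rate:.0%}: false-good chance {p:.1%}, "
              f"{n} clean boots needed for <= 5%")
```

Under these assumptions, 4 clean boots still leave roughly a 24% to 41% chance of a false "good", so more reboots per bisect step would make each verdict more trustworthy.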
Comment 14 Justin Madru 2008-05-15 11:10:29 UTC
Created attachment 16553 [details]
Full Git Bisect Log
Comment 15 Justin Madru 2008-05-15 11:11:35 UTC
Created attachment 16554 [details]
Full Git Bisect Log
Comment 16 Justin Madru 2008-05-15 11:16:10 UTC
Created attachment 16555 [details]
Git Bisect Config
Comment 17 Justin Madru 2008-05-15 12:26:28 UTC
(In reply to comment #8)
> Justin, can you try applying this patch to your X driver?
> 
> https://bugs.freedesktop.org/attachment.cgi?id=16394
> 
> It should give us some debug output even if it doesn't fix the problem (see the
> parent bug for more hints in case it doesn't work).
> 

I applied the patch and it didn't fix the problem.
I'm still seeing the blank screen with 2.6.26-rc1.
Comment 18 Justin Madru 2008-05-15 12:43:43 UTC
Created attachment 16556 [details]
X.log Using Jesse's Patch
Comment 19 Justin Madru 2008-05-21 13:48:09 UTC
As of 2.6.26-rc3 (or _at_least_ this is the first time I noticed it), when the screen blanks out for the screen saver/power saving, it never comes back on. This is just like what happens when it boots, and it doesn't happen all the time.

The computer seems like it would still be responsive. It's just that the X server is in some kind of "hung" state, not responding to ctrl+alt+backspace or switching to a terminal (ctrl+alt+f2). Just a constant blank screen.
Sometimes alt+sysrq+k works, sometimes only alt+sysrq+b or a hard reset work.

However, I'm sure at least some programs were interrupted, because one time I was doing a large network copy of about 6 files totaling 4.6GB, and only the first file had been transferred when I noticed the screen wouldn't come back on. Other times I've noticed the hard drive was still active. I haven't tried pinging the machine when it blanks out, but I'm sure it would respond.

This has already happened 3 times. This is what I get in my syslog:

trying to get vblank count for disabled pipe 1
last message repeated 1515 times
last message repeated 3050 times
last message repeated 3050 times
last message repeated 3050 times
last message repeated 3050 times
last message repeated 3031 times
trying to get vblank count for disabled pipe 1
last message repeated 1518 times
last message repeated 3443 times
last message repeated 2385 times
trying to get vblank count for disabled pipe 1
trying to get vblank count for disabled pipe 1
trying to get vblank count for disabled pipe 1
last message repeated 41 times
trying to get vblank count for disabled pipe 1
trying to get vblank count for disabled pipe 1
last message repeated 629 times
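
To get a feel for how often the kernel logged this warning, the repeat-compressed lines can be expanded into a total. This is a minimal sketch, assuming the timestamp and host prefixes have already been stripped as in the excerpt above; `count_messages` is a name invented for this example, not part of any syslog tool.

```python
import re
from collections import Counter

# syslogd collapses identical consecutive messages into a line of this form.
REPEAT = re.compile(r"last message repeated (\d+) times?$")


def count_messages(lines):
    """Return a Counter mapping each message to its total occurrences,
    crediting 'last message repeated N times' to the preceding message."""
    counts = Counter()
    previous = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = REPEAT.search(line)
        if match:
            if previous is not None:
                counts[previous] += int(match.group(1))
        else:
            previous = line
            counts[line] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "trying to get vblank count for disabled pipe 1",
        "last message repeated 1515 times",
        "trying to get vblank count for disabled pipe 1",
    ]
    for message, total in count_messages(sample).items():
        print(total, message)
```

Feeding the whole excerpt through the same function gives the total number of warnings, which shows the message was being emitted continuously rather than once per blanking event.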
Comment 20 Wang Zhenyu 2008-05-21 23:58:39 UTC
Could you enable the ModeDebug option and send the X log from a failed boot?
Comment 21 Justin Madru 2008-05-22 11:32:34 UTC
Created attachment 16695 [details]
Blank screen Xorg.log with ModeDebug
Comment 22 Wang Zhenyu 2008-05-22 18:49:09 UTC
How about specifying the modeline in the Monitor section of your xorg.conf?

Modeline "1280x800"   71.11  1280 1328 1360 1440  800 802 808 823
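
For readers unfamiliar with modelines: the first number is the pixel clock in MHz, followed by four horizontal and four vertical timing values, the last of each group being the total count including blanking. The refresh rate is the pixel clock divided by (horizontal total × vertical total). The sketch below only illustrates that arithmetic; the helper names are hypothetical and it is not how the X server itself parses the line.

```python
import shlex


def parse_modeline(line):
    """Split an xorg.conf Modeline into its name, pixel clock and timings."""
    tokens = shlex.split(line)  # shlex strips the quotes around the mode name
    if len(tokens) < 11 or tokens[0].lower() != "modeline":
        raise ValueError("not a Modeline: %r" % line)
    return {
        "name": tokens[1],
        "clock_mhz": float(tokens[2]),
        # hdisplay, hsync_start, hsync_end, htotal
        "horizontal": [int(t) for t in tokens[3:7]],
        # vdisplay, vsync_start, vsync_end, vtotal
        "vertical": [int(t) for t in tokens[7:11]],
    }


def refresh_rate_hz(mode):
    """Vertical refresh rate: pixel clock / (htotal * vtotal)."""
    htotal = mode["horizontal"][3]
    vtotal = mode["vertical"][3]
    return mode["clock_mhz"] * 1e6 / (htotal * vtotal)


if __name__ == "__main__":
    line = 'Modeline "1280x800"   71.11  1280 1328 1360 1440  800 802 808 823'
    mode = parse_modeline(line)
    print(mode["name"], round(refresh_rate_hz(mode), 2), "Hz")
```

For the modeline suggested here this works out to about 60 Hz (71.11 MHz / (1440 × 823)), i.e. the panel's native 1280x800 mode at its usual refresh rate.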
Comment 23 Justin Madru 2008-05-25 20:41:55 UTC
(In reply to comment #22)
> How about specify the modeline in your xorg.conf Monitor section?
> 
> Modeline "1280x800"   71.11  1280 1328 1360 1440  800 802 808 823
> 

Still get the blank screen. I can post the Xorg.log if you want, but it's just about the same.
Comment 24 Wang Zhenyu 2008-06-18 01:25:59 UTC
How about removing the other graphics settings from the kernel .config and including only agp/drm? I'm not sure whether other video config options might be related to this one.
Comment 25 Wang Zhenyu 2008-06-25 18:07:08 UTC
lower priority
Comment 26 Justin Madru 2008-06-25 20:13:28 UTC
(In reply to comment #24)
> How about remove other graphics setting in kernel .config and only include
> agp/drm? Not sure if other video config might be relate to this one.
> 

Ok, tried that and still get the blank screen.

But, as usual, I don't get it when I disable the splash screen. And the blanking still doesn't happen on every boot, but I'd say about 20-30% of the time.

Below is the diff of the config file where I removed the other graphics settings. I removed everything but agp/drm, but if I made a mistake the diff will show exactly what changed.

--- /boot/config-2.6.26-rc6-git	2008-06-16 10:08:54.000000000 -0700
+++ .config	2008-06-18 12:12:34.000000000 -0700
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
-# Linux kernel version: 2.6.26-rc6-git
-# Mon Jun 16 10:08:54 2008
+# Linux kernel version: 2.6.26-rc6
+# Wed Jun 18 12:12:34 2008
 #
 # CONFIG_64BIT is not set
 CONFIG_X86_32=y
@@ -312,7 +312,6 @@
 CONFIG_ACPI_AC=m
 CONFIG_ACPI_BATTERY=m
 CONFIG_ACPI_BUTTON=m
-CONFIG_ACPI_VIDEO=m
 CONFIG_ACPI_FAN=m
 CONFIG_ACPI_DOCK=y
 # CONFIG_ACPI_BAY is not set
@@ -964,22 +963,14 @@
 # CONFIG_DRM_VIA is not set
 # CONFIG_DRM_SAVAGE is not set
 # CONFIG_VGASTATE is not set
-CONFIG_VIDEO_OUTPUT_CONTROL=m
+# CONFIG_VIDEO_OUTPUT_CONTROL is not set
 # CONFIG_FB is not set
-CONFIG_BACKLIGHT_LCD_SUPPORT=y
-CONFIG_LCD_CLASS_DEVICE=m
-CONFIG_BACKLIGHT_CLASS_DEVICE=m
-# CONFIG_BACKLIGHT_CORGI is not set
-# CONFIG_BACKLIGHT_PROGEAR is not set
+# CONFIG_BACKLIGHT_LCD_SUPPORT is not set
 
 #
 # Display device support
 #
-CONFIG_DISPLAY_SUPPORT=m
-
-#
-# Display hardware drivers
-#
+# CONFIG_DISPLAY_SUPPORT is not set
 
 #
 # Console display driver support
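
When comparing two kernel configs like this, it can be easier to look at the changed options alone rather than a textual diff. The following is a small sketch of that idea, assuming the usual `.config` syntax (`CONFIG_X=value` and `# CONFIG_X is not set`); the function names are invented for this example, and an option that is absent from one file is reported as `None`.

```python
import re

SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option names to values; explicitly unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def changed_options(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that differs."""
    before, after = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    old = "CONFIG_ACPI_VIDEO=m\nCONFIG_DISPLAY_SUPPORT=m\nCONFIG_DRM_I915=m\n"
    new = "# CONFIG_DISPLAY_SUPPORT is not set\nCONFIG_DRM_I915=m\n"
    for name, (a, b) in changed_options(old, new).items():
        print(name, a, "->", b)
```

Run against the two full config files, this would list only the options touched by the change (for example CONFIG_ACPI_VIDEO and CONFIG_DISPLAY_SUPPORT), without the header and comment noise.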
Comment 27 Justin Madru 2008-07-29 22:34:16 UTC
I recently compiled the newest 2.6.27-rc1 kernel and I haven't had the blank screen problem so far (although this is only the 1st day of testing). I'll do a few more days of testing, but I think 2.6.27 fixes the problem. If this is true, then the problem wasn't with the X server.
Comment 28 Justin Madru 2008-08-06 19:39:02 UTC
I've done over a week of testing with kernel 2.6.27-rc{1,2}. Not once did the screen blank out, and that week of testing represents over 20 reboots.

Below is a summary and reference for others that run into this issue.

Problem:
    During the boot splash the scrolling text gets corrupted. After this, one of three things will happen:
    1) Normal boot - you've beaten the odds! ;)
    2) About 75% through the graphical boot splash, the screen freezes and fades out to either white or completely black (no backlight). At this point the computer is completely locked up; a hard reset is necessary.
    3) After the boot splash the screen does the usual mode change, which blanks the screen and refreshes it several times. After this a normal boot would display your display manager (gdm, kdm), but with this bug the screen stays blank. The backlight is still on and the computer is otherwise perfectly normal, except that the screen stays black.

Testing:
    I've tested with Ubuntu 7.10 and 8.04 on two different computers. The graphics cards tested were the Intel GMA 950 and X3100. The X server versions tested were 7.2/7.3. I've tested every -rc kernel from 2.6.23 through 2.6.27-rc2. Kernels before 2.6.25 and after 2.6.26 do not have the bug (i.e. 2.6.24, 2.6.27). Building the Intel X drivers from the git tree has also been tested, with no success.

Solutions:
    1) Use a kernel less than 2.6.25 or greater than 2.6.26 (ie. 2.6.24 or 2.6.27-rc1).
    2) Boot with the boot splash disabled
    3) Manually start X
    4) Revert the kernel commit that caused the problem, introduced after 2.6.24 but before 2.6.25-rc1 (8f4d37ec073c17e2d4aa8851df5837d798606d6f, "sched: high-res preemption tick"; seems to be a timing issue).

Solutions #2 and #3 only avoid the problem with >95% reliability; there's a slight chance the issue could still arise, as the bug is not in the X server.
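
Solution #1 can be turned into a quick check of a kernel release string. This is only a sketch based on the testing summary above: it treats the whole 2.6.25 and 2.6.26 series (including -rc and stable point releases) as affected, which is an assumption, since the thread does not say whether the fix was backported to any stable release. The names below are hypothetical.

```python
import re

# Assumed affected range, taken from the testing summary in this comment.
FIRST_AFFECTED = (2, 6, 25)
LAST_AFFECTED = (2, 6, 26)


def base_version(release):
    """Extract (major, minor, patch) from strings like '2.6.25-rc3'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_affected(release):
    """True if the release falls in the series reported to show the bug."""
    return FIRST_AFFECTED <= base_version(release) <= LAST_AFFECTED


if __name__ == "__main__":
    for release in ("2.6.24", "2.6.25-rc3", "2.6.26", "2.6.27-rc1"):
        print(release, "affected" if is_affected(release) else "not affected")
```

On a running system the same check could be applied to `platform.release()` from the Python standard library.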
Comment 29 Justin Madru 2008-08-15 10:53:30 UTC
I've done a reverse git bisect to find the commit that fixed the bug. The commit fixes a bug in hrtick that was introduced by the same exact commit I found that introduced the bug.

commit 31656519e132f6612584815f128c83976a9aaaef
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jul 18 18:01:23 2008 +0200

    sched, x86: clean up hrtick implementation

    random uvesafb failures were reported against Gentoo:

      http://bugs.gentoo.org/show_bug.cgi?id=222799

    and Mihai Moldovan bisected it back to:

    > 8f4d37ec073c17e2d4aa8851df5837d798606d6f is first bad commit
    > commit 8f4d37ec073c17e2d4aa8851df5837d798606d6f
    > Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
    > Date:   Fri Jan 25 21:08:29 2008 +0100
    >
    >    sched: high-res preemption tick

    Linus suspected it to be hrtick + vm86 interaction and observed:

    > Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
    > _incorrect_ per se, but they are definitely bad.
    >
    > Why?
    >
    > Using random _TIF_WORK_MASK flags is really impolite for doing
    > "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
    > special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
    > vm86 mode unnecessarily.
    >
    > See the "work_notifysig_v86" label, and how it does that
    > "save_v86_state()" thing etc etc.

    Right, I never liked having to fiddle with those TIF flags. Initially I
    needed it because the hrtimer base lock could not nest in the rq lock.
    That however is fixed these days.

    Currently the only reason left to fiddle with the TIF flags is remote
    wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
    about using the new and improved IPI function call stuff to implement
    hrtimer_start_on().

    However that does require that smp_call_function_single(.wait=0) works
    from interrupt context - /me looks at the latest series from Jens - Yes
    that does seem to be supported, good.

    Here's a stab at cleaning this stuff up ...

    Mihai reported test success as well.

    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Tested-by: Mihai Moldovan <ionic@ionic.de>
    Cc: Michal Januszewski <spock@gentoo.org>
    Cc: Antonino Daplas <adaplas@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

:040000 040000 5ae152350652713c58bd1700ba2c776a556b6985 40d22771987dc5814a1e18aa3cee82ae9e4faea5 M      arch
:040000 040000 4dfe3c6abd244d2da57b7801e47f073899124376 3863e3311a21dc049d5ad98f45c272e4a5269a2b M      include
:040000 040000 236b2824be1c7cf3c899a090498e4151543bba31 9f9779c89b781d8fa8950468deee42f419339bd7 M      kernel