Bug 19304

Summary: [845G] FIFO underruns
Product: xorg
Reporter: Martin Pitt <martin.pitt>
Component: Driver/intel
Assignee: Jesse Barnes <jbarnes>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: major
Priority: high
CC: adam, daniel.kitta, daniel, dave, martink2, n-roeser, pachoramos1, scottandchrystie, ynvich, zack.evans
Version: git   
Hardware: x86 (IA32)   
OS: Linux (All)   
URL: https://bugs.launchpad.net/bugs/311895
Whiteboard:
i915 platform:
i915 features:
Bug Depends on:    
Bug Blocks: 20276    
Attachments:
Description | Flags
intel_reg_dump when working | none
intel_reg_dump after going black | none
Xorg.0.log | none
regs with patch from #18491: right after boot | none
regs with patch from #18491: after X and gdm start | none
regs with patch from #18491: GNOME fully running | none
regs with fixed patch from #18491: right after boot | none
regs with fixed patch from #18491: GNOME fully running | none
patch ported to 2.4.1 | none
regs with fixed patch from #18491: after hibernate | none
registers with latest version 5 patch | none
Xorg.log with patch 5 and flicker | none
debug logs for monitor/VT state changes | none
add save/restore of watermark regs across VT switch | none
debug logs for monitor/VT state changes for patch v6 | none
Add underrun debugging | none
GM45 (rev7) patched intel-2.6.3 log on Thinkpad T400, showing ESR:0x1 | none
debug logs for patch v8 | none
Increase latency constant | none
debug logs for patch v9 | none
Fix watermark sanity check | none
most recent, KMS version of the patch | none
logs for early/late i915 loading with drm debugging | none
fix up FIFO programming | none
more fixes for FIFO programming | none

Description Martin Pitt 2008-12-28 02:41:30 UTC
I am using a Dell Latitude D430 with an Intel GM945.

When I use my external 19" TFT (through DVI, 1280x1024), I occasionally get a black screen. This is not triggered by anything obvious; it just happens spontaneously. Restarting X does not recover from it; only a reboot cures it.

Further investigation shows that this is caused by a long series of

  (EE) intel(0): underrun on pipe A!

errors (some 10,000 lines in the log). I get short underruns pretty often, which result in the screen flickering for a split second, but when the long series happens, the screen stays black forever.
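
To tell the occasional short underrun apart from the long series, one could count the messages and the longest uninterrupted run of them. The following is a minimal sketch, assuming the message format shown above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Message format copied from the log excerpt above.
UNDERRUN_RE = re.compile(r"\(EE\) intel\(0\): underrun on pipe ([A-Z])!")


def summarize_underruns(lines):
    """Return (per-pipe totals, longest run of consecutive underrun lines)."""
    totals = Counter()
    longest = 0
    current = 0
    for line in lines:
        match = UNDERRUN_RE.search(line)
        if match:
            totals[match.group(1)] += 1
            current += 1
            longest = max(longest, current)
        else:
            # Any other log line ends the current run.
            current = 0
    return dict(totals), longest
```

Feeding it the lines of Xorg.0.log gives a quick way to see whether a log contains only isolated flickers or one of the long series.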

A comparison of the intel_reg_dump output (-: works, +: black screen) confirms this as well:

-(II): PIPEASTAT: 0x00000203 (status: VSYNC_INT_STATUS VBLANK_INT_STATUS OREG_UPDATE_STATUS)
+(II): PIPEASTAT: 0x80000000 (status: FIFO_UNDERRUN)
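
For reference, the status names can be recovered from the raw register value. This is only a sketch: the bit positions below are inferred from the decoded values quoted in this report (0x00000203, 0x80000000, and the PIPEBSTAT values 0x80000206 and 0x80000246 in comment 55), not taken from hardware documentation.

```python
# (bit, name) pairs inferred from the decoded dumps in this report.
PIPESTAT_FLAGS = [
    (31, "FIFO_UNDERRUN"),
    (9, "VSYNC_INT_STATUS"),
    (6, "LBLC_EVENT_STATUS"),
    (2, "SVBLANK_INT_STATUS"),
    (1, "VBLANK_INT_STATUS"),
    (0, "OREG_UPDATE_STATUS"),
]


def decode_pipestat(value):
    """List the names of the known status bits set in a PIPExSTAT value."""
    return [name for bit, name in PIPESTAT_FLAGS if value & (1 << bit)]


def has_underrun(value):
    """True if the FIFO underrun bit (bit 31) is set."""
    return bool(value & (1 << 31))
```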

I have not observed this behaviour when I use the laptop undocked, with the internal screen (1280x800).

This is current Ubuntu Jaunty with -intel 2.5.1, X.org server 1.5.3, kernel 2.6.28. It also happened with earlier releases.
Comment 1 Martin Pitt 2008-12-28 02:42:08 UTC
Created attachment 21511 [details]
intel_reg_dump when working
Comment 2 Martin Pitt 2008-12-28 02:42:43 UTC
Created attachment 21512 [details]
intel_reg_dump after going black
Comment 3 Martin Pitt 2008-12-28 02:43:37 UTC
Created attachment 21513 [details]
Xorg.0.log

X.org log which shows the plethora of pipe-A underruns.
Comment 4 Martin Pitt 2008-12-28 02:44:11 UTC
When this happens, I get the following kernel messages:

Dec 28 10:25:07 tick kernel: [ 5559.025081] mtrr: no MTRR for d0000000,10000000 found
Dec 28 10:25:08 tick kernel: [ 5560.478087] apm: BIOS not found.
Comment 5 Martin Pitt 2008-12-28 02:45:15 UTC
$ diff -U 0 intel_regs.works.txt intel_regs.black.txt 
--- intel_regs.works.txt	2008-12-28 10:33:22.000000000 +0100
+++ intel_regs.black.txt	2008-12-28 10:24:50.000000000 +0100
@@ -57 +57 @@
-(II):            PIPEASTAT: 0x00000203 (status: VSYNC_INT_STATUS VBLANK_INT_STATUS OREG_UPDATE_STATUS)
+(II):            PIPEASTAT: 0x80000000 (status: FIFO_UNDERRUN)
@@ -132 +132 @@
-(II):          FBC_CONTROL: 0x43e847e2
+(II):          FBC_CONTROL: 0xc3e847e2
@@ -134 +134 @@
-(II):           FBC_STATUS: 0x20000000
+(II):           FBC_STATUS: 0x60000000
@@ -138 +138 @@
-(II):              MI_MODE: 0x00000200
+(II):              MI_MODE: 0x00000000
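
The same comparison can be done bit by bit, which makes it easier to see that only a single bit differs in each of the FBC and MI_MODE registers. A minimal sketch, assuming the "(II): NAME: 0xVALUE" line format of the dumps above; the helper names are illustrative and the meaning of the individual bits is not stated in this report.

```python
import re

# Matches lines such as "(II):   FBC_STATUS: 0x20000000", with or without
# a leading "-" or "+" from diff output.
REG_LINE = re.compile(r"\(II\):\s*(\w+):\s*(0x[0-9a-fA-F]+)")


def parse_dump(text):
    """Map register name to integer value for every register line in text."""
    regs = {}
    for line in text.splitlines():
        match = REG_LINE.search(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs


def changed_bits(before, after):
    """Return the bit positions (0 = least significant) that differ."""
    diff = before ^ after
    return [bit for bit in range(32) if diff & (1 << bit)]


def diff_dumps(before_text, after_text):
    """Map each register whose value changed to its list of changed bits."""
    before = parse_dump(before_text)
    after = parse_dump(after_text)
    return {
        name: changed_bits(before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

Applied to the two dumps above, FBC_CONTROL differs only in bit 31, FBC_STATUS only in bit 30, and MI_MODE only in bit 9.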
Comment 6 Jesse Barnes 2009-01-06 13:30:20 UTC
I think this might be a DUP, can you try the patch in 18651?

*** This bug has been marked as a duplicate of bug 18651 ***
Comment 7 Martin Pitt 2009-01-07 00:30:54 UTC
I'm building/installing -intel with this patch applied. I'll report back in a day or two, since the underruns only start to happen after a couple of hours (presumably when I'm doing particular things with my computer, but I'm not able to pinpoint what triggers it).

Thanks!
Comment 8 Martin Pitt 2009-01-07 01:31:13 UTC
I found out that starting kvm and doing some other window juggling triggers the quick underrun (i.e. the flickering, not the total blackout) pretty reliably.

With the proposed patch applied, I still get underruns, though. I'll let it run for a couple of days to see whether I get any black screen still.
Comment 9 Jesse Barnes 2009-01-28 14:37:15 UTC
Looks separate from 18651 unfortunately.
Comment 10 Martin Pitt 2009-01-29 00:19:57 UTC
I have used the suggestion in https://bugs.launchpad.net/bugs/311895
since yesterday (Option "FramebufferCompression" "off"), and that *seems* to do
the trick. I want to test it a little longer before fully confirming,
especially since the most recent X.org stopped logging the underruns in
Xorg.0.log, and I got too used to the occasional screen flicker, so I might
well have ignored them.

But my screen went black (or brown, or white) irrecoverably after a day or two
without that option. If that doesn't happen any more either, I'll report back here.
Comment 11 Martin Pitt 2009-01-30 00:11:42 UTC
Two days have passed with Option "FramebufferCompression" "off", and I didn't notice a single flickering, nor encounter another black screen. Thus I'm fairly sure that this is at least a very good (if not perfect) workaround for the problem, and might also point to the root cause.

Just reiterating that I never ever observed those problems with the internal LVDS (1280x800), just with the external TFT (1280x1024).

</facts>

<wild and unqualified speculations>
Might it be possible that compressing the framebuffer just occasionally takes too long once it gets bigger than a critical threshold (which lies somewhere in between 1280x800 and 1280x1024 pixels)? Any idea why it would sometimes not recover from this at all any more, perhaps because it takes too long and cannot 'catch up' any more?

Thanks!
Comment 12 Dave Miller 2009-01-30 01:24:32 UTC
That mirrors my experience, too.

I'm on a Mac Mini with a GM945 video...  using the TV-out at 1024x768 for several months I never had any issues, and when I changed to using DVI->HDMI output on it at 1280x720, I started getting the solid color screen really frequently.  Disabling the FramebufferCompression about three weeks ago did make the machine usable again.  I've run the thing for 5 or 6 hours per day on a daily basis (I have it hooked up to a TV using MythTV on it), and although I have still gotten that solid color screen since then, it's only happened once in all that time (as opposed to every 5 or 10 minutes before).  I was getting that periodic flicker before, too, and that's infrequent enough that I don't notice it anymore if it's still happening at all.
Comment 13 Jesse Barnes 2009-02-13 15:25:03 UTC
In #18491 there's a patch (https://bugs.freedesktop.org/attachment.cgi?id=22319) to mess with the FIFO watermark values that might help.  But more than that, it includes a patch that adds the FIFO watermark regs to the output of the intel_reg_dumper tool.  Can someone apply it and capture a reg dump both before and after starting X on their machine with the patch applied?

The spontaneous black screen is almost surely caused by a series of pipe underruns.  That generally happens if our memory arbitration settings are off (so a given pipe can't get its pixels due to some other pipe hogging the memory interface) or if the FIFO watermark regs are incorrect (we fetch a new chunk of pixels too late and end up missing our window of time to feed them to the pipe).

The framebuffer compression hardware periodically compresses the framebuffer into a private section of memory (the compressed buffer), temporarily increasing memory activity; it could be that we're not accounting for that in the FIFO settings, so the screen goes black after the first compression pass (which is usually after about 15s iirc).
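
As a rough illustration of why the watermark matters: while a memory fetch is outstanding, the pipe keeps draining the FIFO at the pixel rate, so the FIFO must already hold enough data to cover the worst-case fetch latency. The sketch below is a simplified model, not the driver's code. Every number in the usage note (108 MHz pixel clock, 4 bytes per pixel, 5000 ns latency) and the 64-byte FIFO entry size are assumptions chosen for illustration.

```python
def fifo_entries_needed(pixel_clock_khz, bytes_per_pixel, latency_ns,
                        entry_bytes=64):
    """Entries drained from the FIFO while one memory fetch is outstanding.

    bytes drained = clock [Hz] * bytes/pixel * latency [s]
                  = (khz * 1000) * bpp * (ns / 1e9)
                  = khz * bpp * ns / 1e6
    The result is rounded up to whole FIFO entries. The default
    entry_bytes value is an assumption, not a documented figure.
    """
    numerator = pixel_clock_khz * bytes_per_pixel * latency_ns
    denominator = 1_000_000 * entry_bytes
    return -(-numerator // denominator)  # ceiling division on integers
```

For example, fifo_entries_needed(108000, 4, 5000) models a 1280x1024 mode under the assumptions above. In this model, extra memory traffic from a compression pass shows up as a larger latency_ns, which raises the number of entries that must be buffered; if the watermark does not account for that, the pipe runs dry.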
Comment 14 Martin Pitt 2009-02-14 06:58:49 UTC
I enabled FB compression again and applied the patch in bug 18491. It had quite a dramatic regressive effect: the screen now flickers at each hard disk access, mouse movement, or key press, and only stands still if absolutely nothing happens.

I captured the registers right after boot, then after X and gdm started, and finally after GNOME was fully running.
Comment 15 Martin Pitt 2009-02-14 06:59:47 UTC
Created attachment 22944 [details]
regs with patch from #18491: right after boot
Comment 16 Martin Pitt 2009-02-14 07:00:21 UTC
Created attachment 22945 [details]
regs with patch from #18491: after X and gdm start
Comment 17 Martin Pitt 2009-02-14 07:02:15 UTC
Created attachment 22946 [details]
regs with patch from #18491: GNOME fully running

That's the watermark change you asked for:

--- boot-nox.regs	2009-02-14 15:49:55.000000000 +0100
+++ boot-gdm.regs	2009-02-14 15:50:15.000000000 +0100
@@ -31,2 +31,2 @@
-(II):           FWATER_BLC: 0x03060106
-(II):          FWATER_BLC2: 0x00000306
+(II):           FWATER_BLC: 0x033f033f
+(II):          FWATER_BLC2: 0x0000033f

It doesn't change any further after starting GNOME (which does xrandr stuff, etc.) Other registers do change during GNOME startup, though.
Comment 18 Jesse Barnes 2009-02-14 09:51:50 UTC
Heh, I think I had the watermark regs backwards... I'll have to spin a new patch, but you could try changing the watermark value in the patch in the meantime:

watermark = (3 << 8) | 0x3f

should instead be something like

watermark = (3 << 8) | 1
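
To relate these expressions to the register dumps: the sketch below just reproduces the arithmetic. The parameter names are placeholders, and the assumption that the same 16-bit value lands in both halves of FWATER_BLC is inferred from the dumps in comments 17 and 19 (0x033f033f and 0x03010301), not from the patch itself.

```python
def pack_watermark(high_field, low_field):
    """Combine the two fields the way the quoted patch line does."""
    return (high_field << 8) | low_field


def fwater_blc(watermark):
    """Place the same 16-bit value in both halves of a 32-bit register,
    as the FWATER_BLC dumps in this report suggest."""
    return (watermark << 16) | watermark
```

With the original value, fwater_blc(pack_watermark(3, 0x3f)) is 0x033f033f; with the suggested change, fwater_blc(pack_watermark(3, 1)) is 0x03010301.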

Comment 19 Martin Pitt 2009-02-14 11:10:57 UTC
I did that change, much better. :-) It doesn't flicker so badly any more, and the watermark reg diff is now

$ diff -U 0 boot-nox.regs boot-gnome.regs |grep WATER
-(II):           FWATER_BLC: 0x03060106
-(II):          FWATER_BLC2: 0x00000306
+(II):           FWATER_BLC: 0x03010301
+(II):          FWATER_BLC2: 0x00000301

I have to run now, so I can't do the full test which triggers the original underrun; will report back tomorrow or Monday.

Thank you so far and have a nice weekend!
Comment 20 Martin Pitt 2009-02-14 11:11:39 UTC
Created attachment 22956 [details]
regs with fixed patch from #18491: right after boot
Comment 21 Martin Pitt 2009-02-14 11:11:57 UTC
Created attachment 22957 [details]
regs with fixed patch from #18491: GNOME fully running
Comment 22 Martin Pitt 2009-02-14 11:52:56 UTC
OK, I now threw kvm, glxinfo, and totem at it, all running at the same time, and not a single flicker. No watermark difference in the registers.

Great work, thanks!
Comment 23 Jesse Barnes 2009-02-14 12:06:06 UTC
Great, thanks a lot for testing.  Dave does this change also help your situation?
Comment 24 Raghu 2009-02-14 12:53:23 UTC
Thanks for the new patch.

Martin,

Do you have a pointer to how to build the new driver with the patch for 8.10 (Intrepid)?

Or if someone could post a link to the binary driver, that would be great! I can just replace the original intel driver with this.
Comment 25 Martin Pitt 2009-02-15 03:47:48 UTC
Raghu,

I ported the patch to the intrepid version (2.4.1), will attach in a bit.

To make testing easier for everybody, I also uploaded it to my personal package archive, so that you can grab the ready-built .deb from there, or just add the new apt source:

  https://launchpad.net/~pitti/+archive/ppa

Be warned, though, I didn't test it. In the (unlikely) event that it totally screws up your system, please boot with the "text" kernel command line option in grub, log in at Ctrl+Alt+F1, and do

  sudo apt-get install xserver-xorg-video-intel/intrepid-updates
Comment 26 Martin Pitt 2009-02-15 03:49:23 UTC
Created attachment 22959 [details] [review]
patch ported to 2.4.1
Comment 27 Raghu 2009-02-15 21:18:48 UTC
Thanks, Martin, for making it really easy to install!

I am currently running xserver-xorg-video-intel-2.4.1-1ubuntu10.4~test1 from your repository. So far so good. I am using the original xorg.conf, which does not have any options set for the device.
Comment 28 Dave Miller 2009-02-15 23:05:38 UTC
The deb will make it easy for me to test, too, thanks!  It'll probably be tomorrow before I can get to it though.
Comment 29 Martin Pitt 2009-02-16 23:44:07 UTC
Created attachment 23007 [details]
regs with fixed patch from #18491: after hibernate

Ugh, after a hibernate/resume cycle the flickering is back. I have not seen it any more when not using hibernate (didn't test suspend, it's currently broken).

The watermark registers did not change, though.
Comment 30 Jesse Barnes 2009-02-17 09:31:52 UTC
Hm, so the regs look ok after resume but you see flickering?  That sounds bad; it means there may be another reg we've got to write to get things working again.
Comment 31 Dave Miller 2009-02-17 13:54:46 UTC
(In reply to comment #25)
> or just add the new apt source:
> 
>   https://launchpad.net/~pitti/+archive/ppa

How do I add this source?  If I go to that URL and follow the directions given on that page, I should add this to my sources.list:

deb http://ppa.launchpad.net/pitti/ppa/ubuntu intrepid main

But after doing that, I get a 404 error trying to retrieve Packages.gz

I just downloaded the package by hand for now, will let you know how it goes.
Comment 32 Martin Pitt 2009-02-17 23:53:01 UTC
@Dave: Weird, that should be the correct URL. I just tried it here, and it works.
Comment 33 Raghu 2009-02-18 00:05:46 UTC
Dave, just in case: make sure you don't have an 's' after 'http'.

'deb http://ppa.launchpad.net/pitti/ppa/ubuntu intrepid main' worked for me as well.

The Synaptic Package Manager complains about either a lack or a mismatch of signatures, but the repo works.
Comment 34 Dave Miller 2009-02-18 00:49:23 UTC
Huh.  I tried again just now and it worked.  Maybe I just caught it at a bad time during a repo refresh or something before.

Anyhow, I installed the deb manually yesterday, and I've been running the thing most of the day, with the workaround hacks removed from xorg.conf (so it just has the default "detect everything" settings again).  No screen blankouts yet.  I did just get a flicker, though, just before typing this (I quit out of MythTV so I could run Synaptic and try the repo source add again, the flicker happened right after MythTV quit).  The flicker did have the corresponding:

(EE) intel(0): underrun on pipe A!

in Xorg.0.log

It's the one and only occurrence of that error in the log since Xorg was restarted this morning.  I'm not sure how to check the registers that were mentioned.
Comment 35 Jesse Barnes 2009-02-18 09:05:27 UTC
If it was just one flicker when you exited MythTV that might be normal, if a mode set or pipe on/off sequence occurred.

Anyway sounds like we have at least this part of the problem narrowed down; I'll put together a patch for the 2.7 release.
Comment 36 Dave Miller 2009-02-21 13:04:05 UTC
Been running this for a few days now, and no further issues so far.  Looks like it fixes it for me.
Comment 37 Dave Miller 2009-02-21 13:06:53 UTC
Oh, and I haven't tested Martin's situation from comment 29...  I've never had reason to suspend or hibernate this thing.
Comment 38 Martin Pitt 2009-02-23 02:49:15 UTC
Jesse, do you think that http://bugs.freedesktop.org/attachment.cgi?id=22319 plus the "0x3f -> 1" fix is good for uploading? I'd like to get it some more testing exposure, but I'm not sure whether this was just a test patch and needs to be redone for public consumption?

Thank you!
Comment 39 Martin Pitt 2009-02-27 01:14:04 UTC
Hm, for a few days now I get the screen flicker immediately, even after a clean boot and no suspend, etc. Odd, I was running Jaunty with this patch for over a week without a single glitch; apparently something else changed in the system now (newer X.org, kernel updates, etc.)

$ sudo intel_reg_dumper |grep WATER
(II):           FWATER_BLC: 0x03010301
(II):          FWATER_BLC2: 0x00000301
Comment 40 Jesse Barnes 2009-03-30 17:10:24 UTC
There's a patch in 18651 that might also help (it's more proper at least, not like the hack I posted here).  Can you give it a try?
Comment 41 Martin Pitt 2009-03-31 01:21:56 UTC
I applied the latest patch (http://bugs.freedesktop.org/attachment.cgi?id=24375) to 2.6.3 (what we have in Ubuntu Jaunty).

I threw everything at it which I could find: running glxgears under a load of 6.6 (having an rsync and jigdo in the background), playing a fullscreen video while booting a live system under kvm, suspend/resume, everything works. I haven't seen a single flickering so far. This even fixes the flickering of glxgears when running under EXA (we didn't switch to UXA in Ubuntu yet, since it still causes too many crashes and problems).

I will run with this patch for a while, to see the long-term behaviour. Before, I got the flickering/hang after running for some hours, or some time after suspend (see bug 20520). Perhaps bug 20520 is even just another consequence of this one, although it happened even with FramebufferCompression off.

I'll report back in a couple of days with the long-term results.

Kudos, Jesse! You made my day!
Comment 42 Martin Pitt 2009-03-31 01:49:23 UTC
Just got the hang after suspend again (bug 20520), so that is independent after all.
Comment 43 Jesse Barnes 2009-03-31 08:19:25 UTC
(In reply to comment #42)
> Just got the hang after suspend again (bug 20520), so that is independent after
> all.

Hm that could be one of the other suspend/resume related bugs we have open at the moment.  It could also be due to some missing bits I posted a patch for in 18702.  Care to try that out?
Comment 44 Martin Pitt 2009-03-31 09:05:34 UTC
A first quick shot at trying that patch left me with a ton of rejections (tried to apply against linux 2.6.28.8, with some ubuntu modifications). I'll try again later, but this might take a while.
Comment 45 Martin Pitt 2009-03-31 09:06:19 UTC
As for this bug, I have used this latest patch for several hours now, with no problem whatsoever. Thanks muchly!
Comment 46 Jesse Barnes 2009-03-31 14:39:22 UTC
Great, thanks a lot for testing, Martin.  I'll push as soon as I get some review on intel-gfx.
Comment 47 Jesse Barnes 2009-03-31 14:40:38 UTC
*** Bug 18651 has been marked as a duplicate of this bug. ***
Comment 48 Martin Pitt 2009-04-01 06:58:49 UTC
Created attachment 24431 [details]
registers with latest version 5 patch

Argh, this is haunting me. With the latest patch applied, it was working perfectly yesterday, but just now the flickering is back. No suspend involved.

I attach the current registers, do you see anything wrong there?

Thanks!
Comment 49 Martin Pitt 2009-04-01 09:42:39 UTC
Created attachment 24438 [details]
Xorg.log with patch 5 and flicker

I'm attaching current Xorg log as well, since it has a couple of messages like

  (II) intel(0): Setting FIFO watermarks - A: 1, B: 37, C: 2, SR 127
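
For anyone comparing these values across several logs, the line is easy to pick apart mechanically. A minimal sketch, assuming the exact message format shown above (including the missing colon after SR); the function name is made up.

```python
import re

WATERMARK_RE = re.compile(
    r"Setting FIFO watermarks - A: (\d+), B: (\d+), C: (\d+), SR (\d+)"
)


def parse_watermarks(line):
    """Return the four logged values as a dict, or None if the line
    is not a watermark message."""
    match = WATERMARK_RE.search(line)
    if match is None:
        return None
    a, b, c, sr = (int(group) for group in match.groups())
    return {"A": a, "B": b, "C": c, "SR": sr}
```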
Comment 50 Martin Pitt 2009-04-02 00:43:25 UTC
Just as an early warning, this patch (same .deb that I am running) completely broke matters for a colleague of mine, also on 945GM/GMS. I asked for registers and Xorg.log, will forward as soon as I get it.
Comment 51 Martin Pitt 2009-04-02 08:41:53 UTC
Created attachment 24465 [details]
debug logs for monitor/VT state changes

Ah, I know what changed. After a clean boot, with the latest (version 5) patch applied, everything works perfectly for me, the trouble starts when I switch off my monitor, and switch it on again (as I usually do during lunch break).

So I looked at dmesg, Xorg log, and registers in three states.

1. After clean boot, and GNOME login. See boot.* files.

2. Switch to VT1 and back. dmesg says

  [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 1

   Registers are wildly different, see diff -U0 boot.registers.txt vtsiwtch.registers.txt. After waiting for one minute, the registers change further to

-(II):           FBC_STATUS: 0x20000000
+(II):           FBC_STATUS: 0x60000000

  Xorg log gets some 30 info lines, and some interesting warnings:

$ diff -u boot.Xorg.log vtswitch.Xorg.log | grep -v '(II)'
+(WW) intel(0): ESR is 0x00000001, instruction error
+(WW) intel(0): Existing errors found in hardware state.
+(WW) intel(0): plane B needs more FIFO entries

3. Switch off monitor, and turn it on again. Now I get the occasional flickering.

  dmesg gets some USB disconnect/connect messages (monitor has USB hub with some stuff), nothing X related.

  registers do not change at all.

  No new Xorg log entries.

Now, having typed this, it seems to me that switching off the monitor doesn't change much, and that most likely the VT switch is to blame; I will do another test to affirm that I get flickering after VT switch already (I'll report back if that is not the case). Your patch seems to work by and large, but seems to not take VT switches into account correctly.
Comment 52 Jesse Barnes 2009-04-02 11:00:54 UTC
Ah I can believe that VT switches might cause trouble...  The diff actually doesn't look too interesting though, mainly LVDS is off.  However this part definitely does look weird:
(II) intel(0): FIFO entries - A: 25, B: 0
(II) intel(0): FIFO size - A: 28, B: 59
(WW) intel(0): plane B needs more FIFO entries

That FIFO entries line indicates that pipe B is off. Maybe I don't handle that case correctly...
Comment 53 Jesse Barnes 2009-04-02 12:35:52 UTC
Created attachment 24471 [details] [review]
add save/restore of watermark regs across VT switch

Not restoring these across VT switch might be bad...  This one leaves the programming alone but takes care to save/restore the regs across VT switch.
Comment 54 Martin Pitt 2009-04-03 00:21:09 UTC
Created attachment 24493 [details]
debug logs for monitor/VT state changes for patch v6

similar debug logs for patch v6 (https://bugs.freedesktop.org/attachment.cgi?id=24471).

Unfortunately the flickering still happens. :-(

Yesterday night I conducted another experiment. I just switched the monitor off and on, without any VT switch. After that I already got the flickering. However, there was no change at all in registers, Xorg.log, or dmesg. (This was with the previous patch version 5, though).

Many thanks for your efforts,

Martin

P.S. If ssh to my machine helps you in any way, I can provide that. I'll just be away next week on the LinuxFoundation collaboration summit in San Francisco, and can spend little to no time testing this.
Comment 55 Daniel J Blueman 2009-04-03 07:54:33 UTC
Only the first time I installed xserver-xorg-video-intel 2:2.6.3-0ubuntu4pitti1 (AMD64) from Martin Pitt's PPA and restarted GDM was I met with a blank screen, though it was clear GDM was waiting for input. Hardware is GM45 rev 7, dual-channel mem with only pipe B connected to internal LVDS to a 1440x900 6-bit TN (aargh) panel.

Perhaps this relates to when the pipe watermarks are reprogrammed and thus data is discarded; we see pipe B's LBLC_EVENT_STATUS flag set and EDID detection was not performed. There were no kernel/syslog messages, and the Xorg.log difference against normal operation is:

$ diff -u /var/log/Xorg.0.log.working /var/log/Xorg.0.log.blank
@@ -197,10 +197,7 @@
 (WW) intel(0): Register 0x61200 (PP_STATUS) changed from 0xc0000008 to 0xd000000a
 (WW) intel(0): PP_STATUS before: on, ready, sequencing idle
 (WW) intel(0): PP_STATUS after: on, ready, sequencing on
-(WW) intel(0): Register 0x71024 (PIPEBSTAT) changed from 0x80000206 to 0x80000246
-(WW) intel(0): PIPEBSTAT before: status: FIFO_UNDERRUN VSYNC_INT_STATUS SVBLANK_INT_STATUS VBLANK_INT_STATUS
-(WW) intel(0): PIPEBSTAT after: status: FIFO_UNDERRUN VSYNC_INT_STATUS LBLC_EVENT_STATUS SVBLANK_INT_STATUS VBLANK_INT_STATUS
-(WW) intel(0): Register 0x321b (FBC_FENCE_OFF) changed from 0x59018500 to 0x2a03a200
+(WW) intel(0): Register 0x321b (FBC_FENCE_OFF) changed from 0x59008500 to 0x2a03a200
 (==) Depth 24 pixmap format is 32 bpp
 (II) do I need RAC?  No, I don't.
 (II) resource ranges after preInit:
@@ -432,93 +429,17 @@
 (II) AT Translated Set 2 keyboard: Device reopened after 1 attempts.
 (II) Video Bus: Device reopened after 1 attempts.
 (II) Macintosh mouse button emulation: Device reopened after 1 attempts.
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-exaCopyDirty: Pending damage region empty!
-(II) PM Event received: Capability Changed
-I830PMEvent: Capability change
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) PM Event received: Capability Changed
-I830PMEvent: Capability change
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) PM Event received: Capability Changed
-I830PMEvent: Capability change
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) PM Event received: Capability Changed
-I830PMEvent: Capability change
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) PM Event received: Capability Changed
-I830PMEvent: Capability change
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) PM Event received: Capability Changed
-I830PMEvent: Capability change
-(II) intel(0): EDID vendor "LEN", prod id 16435
-(II) intel(0): Using hsync ranges from config file
-(II) intel(0): Using vrefresh ranges from config file
-(II) intel(0): Printing DDC gathered Modelines:
-(II) intel(0): Modeline "1440x900"x0.0  101.60  1440 1488 1520 1792  900 903 909 945 -hsync -vsync (56.7 kHz)
-(II) intel(0): Modeline "1440x900"x0.0   81.49  1440 1488 1520 1760  900 903 909 926 -hsync -vsync (46.3 kHz)
-(II) intel(0): EDID vendor "LEN", prod id 16435
Comment 56 Martin Pitt 2009-04-03 08:15:15 UTC
Daniel,

please note that 4pitti1 has the "v 5" patch. I just uploaded my current test package with the latest "v6" patch (http://bugs.freedesktop.org/attachment.cgi?id=24471) to my PPA, as 4pitti2.
Comment 57 Jesse Barnes 2009-04-07 12:27:22 UTC
Daniel, looks like you hit the LVDS detect bug with the version Martin packaged.

Martin, the fact that you see flickering after just a monitor power cycle is strange.  If the FIFO regs weren't changed the flicker you see shouldn't be caused by underruns... I'm putting together another patch which will report that so we can check.
Comment 58 Jesse Barnes 2009-04-07 12:32:59 UTC
Created attachment 24651 [details] [review]
Add underrun debugging

This one should log any underruns that occur so we can figure out if the flicker you're seeing is some other problem.
Comment 59 Martin Pitt 2009-04-07 13:19:19 UTC
Thanks, Jesse. I applied the patch to the current Ubuntu 9.04 package and uploaded it to my personal package archive again, so that people on 9.04 can test it.

I can't test it myself until next Tuesday, since this week I'm in San Francisco on the LF summit. I never got any flickering with the internal LVDS, and I don't have an external screen here.
Comment 60 Daniel J Blueman 2009-04-07 14:02:49 UTC
After rebuilding the xserver-xorg-video-intel package with the updated patch, I was unable to trigger underruns with my GM45 rev 7 hardware: I rebooted a few times for initial state, separately restarted GDM in a loop ~50 times, and switched VTs, testing both EXA and UXA paths.

Since the runtime overhead is minimal, I'd say it's worth carrying this patch forward to help understand the failure mechanism later.

Daniel
Comment 61 Daniel J Blueman 2009-04-07 15:19:51 UTC
The X-server was still solid after ~10 suspend-resume cycles (running in EXA) also, though I do see the Error Status Register getting bit 0 set - presumably expected. See attached Xorg.0.log.
Comment 62 Daniel J Blueman 2009-04-07 15:22:35 UTC
Created attachment 24653 [details]
GM45 (rev7) patched intel-2.6.3 log on Thinkpad T400, showing ESR:0x1
Comment 63 Jesse Barnes 2009-04-07 16:07:14 UTC
Daniel, glad to hear things are stable for you.  But my patch shouldn't affect your configuration (GM45 has automatic FIFO sizing & pipe arbitration).  Looks like your LVDS detection bug is fixed though, which is good.
Comment 64 Martin Pitt 2009-04-15 01:52:10 UTC
Created attachment 24816 [details]
debug logs for patch v8

I applied the latest patch (v8) to my PPA against the current Jaunty package (2:2.6.3-0ubuntu9pitti1). Again I captured logs right after a clean X.org startup (startup.*), right after a monitor off/on cycle (not included, since no change), and a while after a VT switch.

I didn't see any underruns happen after switching off the monitor. Perhaps the effect during lunch break is that the monitor gets disabled by the screensaver (DPMS off), which acts more like a VT switch?

The underruns started some minutes after a real VT switch, and due to the new patch I get them logged now:

(EE) intel(0): underrun on pipe A!
(EE) intel(0): underrun on pipe A!
(EE) intel(0): underrun on pipe A!

The attached logs have just one instance of those, but the underruns become more frequent now. After the first underrun happened, I got this change:

-(II):           FBC_STATUS: 0x20000000
+(II):           FBC_STATUS: 0x60000000

(vtswitch2.regs)
Comment 65 Martin Pitt 2009-04-16 03:25:29 UTC
The pipe underruns also start to happen massively after I used kvm (even after kvm was stopped long ago).
Comment 66 Daniel J Blueman 2009-04-16 04:45:12 UTC
Perhaps this is a symptom of high (IRQ-safe) spinlock hold-times, preventing the pipe being reset/refilled within the needed time window? (unless I'm misunderstanding the mechanism)

This may be key to reproducing the issue, and may be worse on kernels without preemption and lock-break points (ie server/throughput/compute optimised kernels).

Using latencytop or kernel ftrace to see what magnitude of lock hold time is needed to cause the pipe underruns may be useful to developers trying to reproduce this later...
Comment 67 Jesse Barnes 2009-04-16 09:40:33 UTC
No, the pipe is filled automatically by hardware (the GPU just does fetches from RAM based on the FIFO watermark values), so either the watermarks are incorrect, the FIFO sizes are wrong, or both.
Comment 68 Jesse Barnes 2009-04-23 14:34:28 UTC
Oh wow I definitely see this problem now on my 945 test machine with the patch applied...

Ah looks like my latency constant wasn't so pessimistic after all.  This one works for me though; hope it fixes your problem too (though I'm not sure why a VT switch would trigger it).
Comment 69 Jesse Barnes 2009-04-23 14:35:37 UTC
Created attachment 25075 [details] [review]
Increase latency constant

Made the latency 5us instead of 3us, which seems to be closer to the truth on my Acer platform at least.
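
To illustrate why the latency constant matters, here is a rough sketch (not the code from the attachment) of how many FIFO entries a plane drains while one memory fetch is outstanding. The function name, the 64-byte entry size, and the 108 MHz pixel clock (the standard VESA timing for 1280x1024@60, which this thread does not confirm for the monitor in question) are all assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the driver's code: bytes the display plane
 * drains from its FIFO during one memory-latency window, rounded up
 * to whole FIFO entries.
 */
static unsigned int fifo_entries_needed(uint32_t clock_khz,
                                        uint32_t bytes_per_pixel,
                                        uint32_t latency_ns,
                                        uint32_t entry_bytes)
{
    /* kHz * ns = 1e-6, so the product divided by 1,000,000 is bytes.
     * 64-bit math avoids overflow (108000 * 4 * 5000 > INT32_MAX). */
    uint64_t bytes = (uint64_t)clock_khz * bytes_per_pixel * latency_ns
                     / 1000000u;

    /* Round up: a partially used entry still has to be resident. */
    return (unsigned int)((bytes + entry_bytes - 1) / entry_bytes);
}
```

Under these assumptions, `fifo_entries_needed(108000, 4, 3000, 64)` gives 21 entries and `fifo_entries_needed(108000, 4, 5000, 64)` gives 34, so moving from 3us to 5us raises the amount of data that must already be buffered by roughly 60%.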
Comment 70 Martin Pitt 2009-04-23 23:29:56 UTC
Created attachment 25080 [details]
debug logs for patch v9

I tried the v9 patch (also uploaded to PPA again). Unfortunately this is now worse.

At gdm, when both the internal LVDS and the external TFT are active @1024x768 (no xrandr in gdm yet), I get a constant flickering about twice per second. This cannot even be worked around any more with disabling fb compression.

After logging in, when the internal LVDS switches off, behaviour is identical to the v8 patch: occasional flickering starts after a vt switch (or some hours of usage).

I attached the logs again, after a clean boot (start.*), a vt switch (vtswitch.*), after the first overflow a few minutes later (overflow.*), and after several more overflows occurred (overflow-more.*).
Comment 71 Jesse Barnes 2009-04-24 10:31:56 UTC
Created attachment 25116 [details] [review]
Fix watermark sanity check

Arg, maybe I'll get this right one day:

(II) intel(0): FIFO entries - A: 42, B: 0
(II) intel(0): FIFO size - A: 28, B: 59
(II) intel(0): Setting FIFO watermarks - A: -16, B: 1, C: 2, SR 5

That negative A value would certainly cause trouble.

Looks like my sanity check was looking at the wrong variable; I should have been checking the watermark value against <= 0, not the entries value (that should always be positive).

Interestingly, the new calculation indicates that you're driving pipe A pretty hard relative to its FIFO RAM allocation, but with just a single pipe enabled it should be safe.  If not, we could modify both DSPARB and the FIFO watermarks to increase the chances of a given config working, or enable pixel clock doubling perhaps.
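
The sanity check described above can be sketched as follows. This is an illustration only, not the attached patch: the function names are made up, the guard of 2 entries is an assumption (chosen because 28 - (42 + 2) reproduces the logged -16), and clamping to 1 is assumed to be the "most aggressive" setting.

```c
/* Hypothetical: the watermark before any sanity check is applied. */
static int raw_watermark(int fifo_size, int entries_needed, int guard)
{
    return fifo_size - (entries_needed + guard);
}

/*
 * Hypothetical: check the computed watermark itself against <= 0.
 * Testing entries_needed instead would never fire, since that value
 * is always positive for an enabled pipe.
 */
static int checked_watermark(int fifo_size, int entries_needed, int guard)
{
    int wm = raw_watermark(fifo_size, entries_needed, guard);

    if (wm <= 0)
        wm = 1; /* assumed minimum: start fetching almost immediately */
    return wm;
}
```

With the logged values (FIFO size 28, entries 42), `raw_watermark(28, 42, 2)` yields -16, while `checked_watermark(28, 42, 2)` yields 1.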
Comment 72 Jesse Barnes 2009-04-24 10:36:04 UTC
Sigh, looking again at your older logs I doubt that last patch will fix the issue:

(II) intel(0): FIFO size - A: 28, B: 59
(II) intel(0): Setting FIFO watermarks - A: 1, B: 1, C: 2, SR 22

So we're already setting the watermark as aggressively as possible, so the pipe should be continuously fetching data for display.  In your config that's still not enough though, since we drain it faster than we fill it.

Another thing that might help is to reduce the pixel clock on the mode you're sending to your external monitor; you can use the cvt or gtf tools to create a mode with reduced blanking or a lower refresh.

I think I'll need to cook one up to modify DSPARB as well (like we do in the current driver).
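
As a rough illustration of the "lower refresh" option: a mode's refresh rate is its pixel clock divided by htotal times vtotal, so at a fixed resolution and blanking a lower refresh means a proportionally lower pixel clock. The helper names below are made up; the numbers come from the two 1440x900 DDC modelines logged earlier in this bug (101.60 MHz with totals 1792x945, and 81.49 MHz with totals 1760x926).

```c
#include <stdint.h>

/* Hypothetical helper: horizontal sync rate in Hz (truncated). */
static uint32_t hsync_hz(uint32_t clock_khz, uint32_t htotal)
{
    return (uint32_t)((uint64_t)clock_khz * 1000u / htotal);
}

/* Hypothetical helper: vertical refresh in millihertz (truncated). */
static uint32_t refresh_millihz(uint32_t clock_khz,
                                uint32_t htotal, uint32_t vtotal)
{
    /* kHz * 1e6 = mHz of pixel clock; divide by pixels per frame. */
    return (uint32_t)((uint64_t)clock_khz * 1000000u
                      / ((uint64_t)htotal * vtotal));
}
```

For those two modelines this gives about 60.0 Hz and 50.0 Hz respectively, with horizontal rates of about 56.7 kHz and 46.3 kHz, matching the values printed in the log. The 50 Hz mode asks the FIFO for roughly 20% less data per second.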
Comment 73 Martin Pitt 2009-04-26 23:57:40 UTC
Ah, so you are saying that something after a VT switch or after putting a high load on the graphics card introduces a fill/drain backlog which the card can't ever catch up with any more?

So the disabling of the fb compression helps because dropping that extra work causes the GPU to have enough time again to re-fill the pipes?

NB that I have used that very same laptop to drive a 1920x1200 external screen without problems, but then again I hadn't done it for very long (just about an hour for testing the new monitor for my wife's computer).

So if this is principally not fixable due to hw speed limitations, maybe it would be possible to automatically disable fb compression once the chip hits pipe underruns?

Thanks for your efforts!

Martin
Comment 74 Jesse Barnes 2009-04-27 09:06:34 UTC
Yeah avoiding compression when the FIFO watermark is low is probably a good idea.  But we may also be able to increase the amount of FIFO RAM allocated to the large display.
Comment 75 Bryce Harrington 2009-05-08 02:23:44 UTC
Btw, we're carrying an old patch from this bug in the Ubuntu release, one from Feb 2009, patches/109_i830-fifo-watermark-conservative.patch.

It sounds like that patch has grown obsolete, or at least doesn't solve this bug 100%, however I'm going to leave it in place when we move to 2.7.0.  If we should be doing something differently, please ping me so we can get a better fix in.
Comment 76 Martin Pitt 2009-05-08 02:40:57 UTC
Bryce, I think you should drop the patch. It's insufficient, might cause regressions on other platforms, and doesn't help at all any more at least on my computer.
Comment 77 Bryce Harrington 2009-05-16 13:55:45 UTC
Thanks Martin, I've removed the patch from Karmic.
Comment 78 Martin Pitt 2009-06-03 00:06:44 UTC
Jesse,

as we discussed last week in Barcelona, I have now tried -intel git head, mesa git head, 2.6.30rc7 on my home system with the external monitor again, now with the extra 1 GB of RAM that I plugged in last week.

As you suspected, the underruns are now gone, apparently having a second RAM bar now provides enough bandwidth for the graphics card to avoid underruns.

I'm happy to test further patches, I can easily remove the extra GB of RAM again. The very same symptom happens on the Samsung NC10 of a friend of mine, I can test stuff on his machine as well (with some delay).

My impression is that with FB compression my machine is simply not fast enough, regardless of the watermark settings (given that all of the above patches failed consistently). Would it be possible for the driver to disable FB compression dynamically if it encounters pipe underruns, such as "twice in five minutes"?

I wonder why this problem didn't occur at all with earlier driver versions (2.4). Didn't that use FB compression yet?

Thanks!
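
A minimal sketch of the "twice in five minutes" idea, purely hypothetical: nothing like this exists in the driver as far as this thread shows, and the struct, function name, and seconds-based timestamps are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pipe state; zero-initialise before first use. */
struct underrun_monitor {
    uint64_t last_underrun_s; /* time of the previous underrun */
    bool seen_one;            /* has any underrun been recorded? */
    bool fbc_disabled;        /* latched once the policy trips */
};

/*
 * Record an underrun at time now_s (seconds). Returns true once two
 * underruns have occurred within window_s of each other, i.e. when a
 * caller would want to turn FB compression off. The decision is
 * latched so FBC is not re-enabled behind the caller's back.
 */
static bool underrun_monitor_report(struct underrun_monitor *m,
                                    uint64_t now_s, uint64_t window_s)
{
    if (m->seen_one && now_s - m->last_underrun_s <= window_s)
        m->fbc_disabled = true;

    m->seen_one = true;
    m->last_underrun_s = now_s;
    return m->fbc_disabled;
}
```

A caller would invoke this from wherever underruns are detected, with `window_s` set to 300 for the five-minute policy suggested above.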
Comment 79 Jesse Barnes 2009-06-04 08:31:40 UTC
On Wed,  3 Jun 2009 00:06:45 -0700 (PDT)
bugzilla-daemon@freedesktop.org wrote:
> as we discussed last week in Barcelona, I have now tried -intel git
> head, mesa git head, 2.6.30rc7 on my home system with the external
> monitor again, now with the extra 1 GB of RAM that I plugged in last
> week.
> 
> As you suspected, the underruns are now gone, apparently having a
> second RAM bar now provides enough bandwidth for the graphics card to
> avoid underruns.
> 
> I'm happy to test further patches, I can easily remove the extra GB
> of RAM again. The very same symptom happens on the Samsung NC10 of a
> friend of mine, I can test stuff on his machine as well (with some
> delay).
> 
> My impression is that with FB compression my machine is simply not
> fast enough, regardless of the watermark settings (given that all of
> above patches failed consistently). Would it be possible for the
> driver to disable FB compression dynamically if it encounters pipe
> underruns, such as "twice in five minutes"?
> 
> I wonder why this problem didn't occur at all with earlier driver
> versions (2.4). Didn't that use FB compression yet?

Great, thanks for the update.  Yes, we should detect either memory
configuration or underruns and take appropriate action.  Previous
drivers didn't modify the FIFO or DSPARB settings, so the defaults may
have been working on your platform, or something else changed to affect
the way we access memory (it's also possible that FBC was disabled on
older releases in your config for some reason).

Jesse
Comment 80 Jesse Barnes 2009-06-18 12:47:28 UTC
Created attachment 26930 [details] [review]
most recent, KMS version of the patch

This patch applies to the kernel.  It still doesn't contain checks against available bandwidth & latency to reject modes we can't support, but it should behave a bit better than the current 2D driver.
Comment 81 Martin Pitt 2009-07-01 05:42:52 UTC
I applied the patch to 2.6.31rc1 and first tested it with 2 GB of RAM. No noticeable difference, everything continued to work smoothly.

Now I ripped out the second GB RAM bar again, and did some stress testing: kvm -m512 (booting another Ubuntu desktop live system), running glxgears, and do some compiz juggling and VT switches. In previous versions this was a reliable way of triggering underruns quickly (which otherwise just occur after a couple of hours). I had a load of 4.3, and glxgears/compiz froze for some fractional seconds due to the high load, but I didn't get any pipe underrun.

I now continue to use the system for a couple of hours to see the longer-term effects.

What I didn't do yet is exercising the same stress test on 2.6.31 without this patch. Do you need this?
Comment 82 Jesse Barnes 2009-07-01 10:41:16 UTC
Only if you're feeling thorough. :)  Thanks for the updated report though.  I fixed a few bugs in the calculations in the KMS patch, so maybe one of those fixed your issues.  I'm really looking forward to closing this one; I'll ping Eric about including the patch.
Comment 83 Jesse Barnes 2009-07-01 15:12:55 UTC
Yay, fix pushed!

commit 7662c8bd6545c12ac7b2b39e4554c3ba34789c50
Author: Shaohua Li <shaohua.li@intel.com>
Date:   Fri Jun 26 11:23:55 2009 +0800

    drm/i915: add FIFO watermark support
Comment 84 Martin Pitt 2009-07-01 15:18:09 UTC
Oops, I am terribly sorry. We currently put i915 into the initramfs, and it gets loaded from there. When I built the module with the patch, I forgot to update the initramfs, so all these successful tests were actually done with the original i915 from 2.6.31rc1.

Later this afternoon some other package updated the initramfs, and now the screen goes entirely and irrecoverably black when booting, both when docked (external DVI) and when undocked (internal LVDS).

So, perhaps you should revert this from your tree until this is investigated further? So far, I don't seem to have this underrun problem at all with 2.6.31rc1, thus I leave the bug as "resolved".
Comment 85 Jesse Barnes 2009-07-01 15:21:25 UTC
Uh-oh, ok thanks for the heads-up.  I'll look at this.  Can you modprobe your drm with debug=1 so we can see what the watermark values end up being on your machine?  It would help if you could confirm that this particular patch caused the problem too, was that the only change or was there another kernel update as well?
Comment 86 Martin Pitt 2009-07-01 15:34:06 UTC
It wasn't the only patch, I also applied the tiny patch from bug 20520 (register restoring ordering fix for resuming). However, I tested that patch in isolation before, and it worked fine. Also, I don't think that code path is active on boot. There was no other kernel update.

I'll send detailed debugging information tomorrow (I hope I can still ssh into the machine, or that it gets logged far enough); bed time for today. I just wanted to give you an early warning to perhaps defer propagation of the patch (or just revert it for now, since it just works without it).
Comment 87 Martin Pitt 2009-07-02 02:02:55 UTC
Created attachment 27329 [details]
logs for early/late i915 loading with drm debugging

So, first I turned on DRM debugging and dmesg capturing:

$ cat /etc/rcS.d/S80dmesg
#!/bin/sh
dmesg > /var/log/dmesg-`date +%T`
$ cat /etc/modprobe.d/drmdebug.conf
options drm debug=1

In the attached logs I renamed the dmesg files from timestamps to situation descriptions, such as "dmesg-31rc1-vanilla-early-2GB-ok.txt"

Then I tested all possible combinations of 2.6.31rc1 with/without this patch, with 1GB or 2 GB RAM, and with "early" or "late" loading of i915/drm.

early: modules are contained and loaded by initramfs, i. e. pretty much as one of the first things after the kernel starts to boot

late: I booted without an initramfs, thus init starts readahead, sets the hostname and keyboard layout, and then starts udev which does an "udev trigger" and causes modules such as drm and i915 to be loaded, which in turn does KMS.

In earlier Karmic (2.6.30 release candidates), we didn't put i915/drm into the initramfs, and it worked fine (just looked a bit ugly since mode got switched halfway through boot). Now I noticed that this late loading does not work any more for some reason, not with 2.6.30 final, not with 31rc1, or with 31rc1+your patch. That is a bug in itself, and sounds pretty unrelated to this pipe underrun issue, so perhaps I should report it separately?

Results from this testing:
 * late loading never works, I always get LVDS and DVI turned off
 * early loading works with .30 final and .31rc1 vanilla
 * with this patch applied, it never works, and worse, I don't even get a dmesg captured; this means that the boot doesn't even get to rcS/70. Sounds like it wedges display and causes a kernel panic? Anything I can do to debug this?
 * 1 GB/2 GB does not make any difference in any test case
Comment 88 Jesse Barnes 2009-07-02 09:54:44 UTC
(In reply to comment #87)
> Then I tested all possible combinations of 2.6.31rc1 with/without this patch,
> with 1GB or 2 GB RAM, and with "early" or "late" loading of i915/drm.
> 
> early: modules are contained and loaded by initramfs, i. e. pretty much as one
> of the first things after the kernel starts to boot
> 
> late: I booted without an initramfs, thus init starts readahead, sets the
> hostname and keyboard layout, and then starts udev which does an "udev trigger"
> and causes modules such as drm and i915 to be loaded, which in turn does KMS.

Sounds like a good set of combinations, thanks for testing.

> In earlier Karmic (2.6.30 release candidates), we didn't put i915/drm into the
> initramfs, and it worked fine (just looked a bit ugly since mode got switched
> halfway through boot). Now I noticed that this late loading does not work any
> more for some reason, not with 2.6.30 final, not with 31rc1, or with
> 31rc1+your patch. That is a bug in itself, and sounds pretty unrelated to this
> pipe underrun issue, so perhaps I should report it separately?

One thing jumped out between the early (working) and late (broken) logs: in the broken ones there's no line for the fbcon loading & initializing.  Which would leave your display blank if/until X starts.  Maybe that's missing from the load in the late case?

> Results from this testing:
>  * late loading never works, I always get LVDS and DVI turned off
>  * early loading works with .30 final and .31rc1 vanilla
>  * with this patch applied, it never works, and worse, I don't even get a dmesg
> captured; this means that the boot doesn't even get to rcS/70. Sounds like it
> wedges display and causes a kernel panic? Anything I can do to debug this?
>  * 1 GB/2 GB does not make any difference in any test case

Ugh, ok so it's probably not a pipe underrun then if it kills the whole machine (at least I hope not); could be a kernel panic.  You could try netconsole (modprobe netconsole netconsole=<params> and then use nc on another machine, the kernel Documentation/ directory has some info on that); it might capture a panic if you load the module by hand with the netconsole running.
Comment 89 Martin Pitt 2009-07-06 01:25:29 UTC
> One thing jumped out between the early (working) and late (broken) logs: in the
> broken ones there's no line for the fbcon loading & initializing.  Which would
> leave your display blank if/until X starts.  Maybe that's missing from the load
> in the late case?

Indeed, I discussed that with our initramfs/boot guru. So that's not a concern here.

> Ugh, ok so it's probably not a pipe underrun then if it kills the whole machine
> (at least I hope not); could be a kernel panic.  You could try netconsole

Thanks for the netconsole hint, that worked beautifully. Indeed it catches a nice trace in the watermark updating:

[  489.298734] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[  489.298908] IP: [<ffffffffa030f1af>] intel_update_watermarks+0xcf/0xd40 [i915]
[  489.299056] PGD 0 
[  489.299152] Oops: 0000 [#1] SMP 
[  489.299289] last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/drm/card0/dev
[  489.299384] CPU 0 
[  489.299481] Modules linked in: i915(+) drm netconsole i2c_algo_bit configfs snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm arc4 joydev ecb snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer iwl3945 iwlcore iTCO_wdt iTCO_vendor_support snd_seq_device mac80211 led_class snd psmouse dell_wmi dell_laptop cfg80211 soundcore snd_page_alloc usb_storage usbhid serio_raw dcdbas video output tg3 fbcon tileblit font bitblit softcursor intel_agp [last unloaded: drm]
[  489.300005] Pid: 2208, comm: work_for_cpu Not tainted 2.6.31-1-generic #14-Ubuntu Latitude D430                   
[  489.300005] RIP: 0010:[<ffffffffa030f1af>]  [<ffffffffa030f1af>] intel_update_watermarks+0xcf/0xd40 [i915]
[  489.300005] RSP: 0018:ffff8800229e98b0  EFLAGS: 00010202
[  489.300005] RAX: 0000000000000000 RBX: ffff880022966800 RCX: ffffffffa03244fb
[  489.300005] RDX: ffffffffa0321a20 RSI: ffffffffa0324518 RDI: 0000000000000001
[  489.300005] RBP: ffff8800229e9930 R08: 0000000000000000 R09: 000000000001a400
[  489.300005] R10: 0000000000000500 R11: 0000000000000000 R12: ffff880022967000
[  489.300005] R13: 000000000001a400 R14: ffff8800229674a0 R15: 0000000000000001
[  489.300005] FS:  0000000000000000(0000) GS:ffff8800019b4000(0000) knlGS:0000000000000000
[  489.300005] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  489.300005] CR2: 0000000000000038 CR3: 0000000001001000 CR4: 00000000000006b0
[  489.300005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  489.300005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  489.300005] Process work_for_cpu (pid: 2208, threadinfo ffff8800229e8000, task ffff88003d5416b0)
[  489.300005] Stack:
[  489.300005]  ffff8800229e9910 ffffffffa0317a5a ffff000100000038 ffff8800229e98f0
[  489.300005] <0> ffff000100010038 ffff8800229e98e0 0000000000000001 0000000000000002
[  489.300005] <0> ffff8800229e0009 0000000000000000 ffff8800229e9920 ffff880022f3b000
[  489.300005] Call Trace:
[  489.300005]  [<ffffffffa0317a5a>] ? intel_sdvo_read_byte+0x6a/0xc0 [i915]
[  489.300005]  [<ffffffffa031161c>] intel_crtc_dpms+0xb0c/0xef0 [i915]
[  489.300005]  [<ffffffffa0317cff>] ? intel_sdvo_set_active_outputs+0x2f/0x40 [i915]
[  489.300005]  [<ffffffffa031baab>] ? intel_tv_mode_find+0x2b/0x50 [i915]
[  489.300005]  [<ffffffffa030ee52>] intel_crtc_prepare+0x12/0x20 [i915]
[  489.300005]  [<ffffffffa02dbff2>] drm_crtc_helper_set_mode+0x272/0x3d0 [drm]
[  489.300005]  [<ffffffffa03138c6>] intel_get_load_detect_pipe+0x116/0x160 [i915]
[  489.300005]  [<ffffffffa031cede>] intel_tv_detect+0x7e/0x100 [i915]
[  489.300005]  [<ffffffffa02dc273>] drm_helper_probe_single_connector_modes+0x93/0x2b0 [drm]
[  489.300005]  [<ffffffffa02dc4d6>] drm_helper_probe_connector_modes+0x46/0x80 [drm]
[  489.300005]  [<ffffffffa02dd2f8>] drm_helper_initial_config+0x28/0xc0 [drm]
[  489.300005]  [<ffffffffa0301b78>] i915_driver_load+0xc68/0xd70 [i915]
[  489.300005]  [<ffffffffa02d38b7>] drm_get_dev+0x147/0x2a0 [drm]
[  489.300005]  [<ffffffff8106c2d0>] ? do_work_for_cpu+0x0/0x30
[  489.300005]  [<ffffffffa0320bfa>] i915_pci_probe+0x10/0xd0 [i915]
[  489.300005]  [<ffffffff81280882>] local_pci_probe+0x12/0x20
[  489.300005]  [<ffffffff8106c2e3>] do_work_for_cpu+0x13/0x30
[  489.300005]  [<ffffffff81070a26>] kthread+0x96/0xa0
[  489.300005]  [<ffffffff8101308a>] child_rip+0xa/0x20
[  489.300005]  [<ffffffff81070990>] ? kthread+0x0/0xa0
[  489.300005]  [<ffffffff81013080>] ? child_rip+0x0/0x20
[  489.300005] Code: c2 20 1a 32 a0 48 c7 c6 18 45 32 a0 bf 01 00 00 00 31 c0 e8 c4 40 fc ff 4c 63 6b 74 4d 89 e9 48 8b 43 20 41 83 c7 01 44 8b 53 78 <8b> 40 38 8d 48 07 85 c0 0f 49 c8 48 8b 43 08 c1 f9 03 48 8d 58 
[  489.300005] RIP  [<ffffffffa030f1af>] intel_update_watermarks+0xcf/0xd40 [i915]
[  489.300005]  RSP <ffff8800229e98b0>
[  489.300005] CR2: 0000000000000038
[  489.308329] ---[ end trace 26bde7aeab46e24b ]---
Comment 90 Martin Pitt 2009-07-06 22:42:48 UTC
jbarnes| pitti: just wondering if you can gdb your i915.o and do a "list *intel_update_watermarks+0xcf"

Seems I need to build the module with debugging or so:

(gdb) list *intel_update_watermarks+0xcf
No symbol table is loaded.  Use the "file" command.

Sorry, this kernel debugging is all new to me :/

I now built the module with "CONFIG_DEBUG_INFO=1 make -C /usr/src/linux-headers-2.6.31-1-generic/ M=`pwd` modules", so they have debug info now and gdb works. But I guess due to the rebuild the offsets were all scrambled, so I need to get the backtrace again. Stay tuned..
Comment 91 Martin Pitt 2009-07-06 23:03:18 UTC
So apparently the offset is even stable across rebuilds. I captured the trace again, and it looks exactly like the previous trace, so I'm not copying that again.

(gdb) list *intel_update_watermarks+0xcf
0x101af is in intel_update_watermarks (/home/martin/ubuntu/kernel/linux-2.6.31/drivers/gpu/drm/i915/intel_display.c:1918).
1913						  intel_crtc->pipe, crtc->mode.clock);
1914					planeb_clock = crtc->mode.clock;
1915				}
1916				sr_hdisplay = crtc->mode.hdisplay;
1917				sr_clock = crtc->mode.clock;
1918				pixel_size = crtc->fb->bits_per_pixel / 8;
1919			}
1920		}
1921	
1922		/* Single pipe configs can enable self refresh */

So I guess it crashes because crtc->fb is NULL, since fbcon is not loaded yet?
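
For what it's worth, the oops is consistent with that reading: the faulting address in CR2 is 0x38, RAX is 0, and the faulting instruction bytes `8b 40 38` decode to a 32-bit load from offset 0x38 of RAX, i.e. a field read through a NULL pointer. A defensive guard could look roughly like the sketch below. The structs are simplified stand-ins for the kernel's types, and the helper and its fallback value are hypothetical; this is not necessarily how the driver ends up handling it.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (hypothetical). */
struct fb_sketch {
    int bits_per_pixel;
};

struct crtc_sketch {
    struct fb_sketch *fb; /* may be NULL before a framebuffer is bound */
};

/*
 * Return bytes per pixel for the CRTC's framebuffer, or the caller's
 * fallback when no framebuffer is attached yet (for example during
 * load detection at driver init, as in the backtrace above).
 */
static int pixel_size_or_default(const struct crtc_sketch *crtc,
                                 int fallback_bytes)
{
    if (crtc == NULL || crtc->fb == NULL)
        return fallback_bytes;
    return crtc->fb->bits_per_pixel / 8;
}
```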
Comment 92 Martin Pitt 2009-07-06 23:25:04 UTC
BTW, this happens whether or not 'fbcon' gets loaded before.

Also confirmed when applying the patch to 2.6.31rc2.
Comment 93 Jesse Barnes 2009-07-07 09:20:30 UTC
On Mon,  6 Jul 2009 23:03:19 -0700 (PDT)
bugzilla-daemon@freedesktop.org wrote:
> --- Comment #91 from Martin Pitt <martin.pitt@ubuntu.com>  2009-07-06 23:03:18 PST ---
> So apparently the offset is even stable across rebuilds. I captured the
> trace again, and it looks exactly like the previous trace, so I'm not
> copying that again.
> 
> (gdb) list *intel_update_watermarks+0xcf
> 0x101af is in intel_update_watermarks
> (/home/martin/ubuntu/kernel/linux-2.6.31/drivers/gpu/drm/i915/intel_display.c:1918).
> 1913                      intel_crtc->pipe, crtc->mode.clock);
> 1914                planeb_clock = crtc->mode.clock;
> 1915            }
> 1916            sr_hdisplay = crtc->mode.hdisplay;
> 1917            sr_clock = crtc->mode.clock;
> 1918            pixel_size = crtc->fb->bits_per_pixel / 8;
> 1919        }
> 1920    }
> 1921
> 1922    /* Single pipe configs can enable self refresh */
> 
> So I guess it crashes because crtc->fb is NULL, since fbcon is not
> loaded yet?

Ah yes, that helps a lot, thanks.  I'll fix that up.

Comment 94 Jesse Barnes 2009-07-09 16:02:04 UTC
Created attachment 27540 [details] [review]
fix up FIFO programming

The stuff that went upstream falls into the "how did that ever work" category.  We were just getting lucky that the calculations always resulted in the most aggressive FIFO programming.  This corrects that and should also fix your hang.
Comment 95 Jesse Barnes 2009-07-09 16:02:38 UTC
Re-opening this as the FIFO master bug.
Comment 96 Jesse Barnes 2009-07-09 16:03:28 UTC
*** Bug 18702 has been marked as a duplicate of this bug. ***
Comment 97 Jesse Barnes 2009-07-09 16:03:41 UTC
*** Bug 18491 has been marked as a duplicate of this bug. ***
Comment 98 Martin Pitt 2009-07-10 00:29:02 UTC
Does that patch go on top of the "most recent, KMS version of the patch" (https://bugs.freedesktop.org/attachment.cgi?id=26930) or does it replace it? I suppose the latter, since the new one doesn't touch crtc->fb at all, but it looks very different from the older one.

Thanks! Martin
Comment 99 Jesse Barnes 2009-07-10 09:44:37 UTC
It sits on top of current drm-intel-next bits.
Comment 100 Jesse Barnes 2009-07-10 12:55:31 UTC
Created attachment 27575 [details] [review]
more fixes for FIFO programming

I tested on my 855 machine and found some bugs in that configuration.  So I cleaned up the code a little more and fixed things up.  This one applies on top of the drm-intel-next branch.
Comment 101 Martin Pitt 2009-07-11 00:03:39 UTC
For the record, I get a warning after applying the patch to drm-intel-next:

/home/martin/ubuntu/kernel/drm-intel-next/i915/intel_display.c: In function ‘intel_find_pll_g4x_dp’:
/home/martin/ubuntu/kernel/drm-intel-next/i915/intel_display.c:834: warning: ‘clock.vco’ is used uninitialized in this function

Will test now.
Comment 102 Martin Pitt 2009-07-11 00:11:48 UTC
Applied on top of current drm-intel-next, so far no noticeable difference (in other words, everything still works just fine). I'll use that driver for a few days now and will report back if anything regresses.
Comment 103 Scott Hansen 2009-07-12 20:33:21 UTC
Hey Jesse, sorry I haven't been able to try the patch that you sent me yet. I did real quick install the newest version of the video-intel driver, which on Arch is 2.7.99.901-3. This is on the 2.6.30 kernel (i686). It still exhibits the same behavior (flickering after resume from suspend to ram), but the frequency of the flicker is substantially reduced....it's actually usable now, with just the occasional flicker. Better performance than the vesa driver!!

I'll still attempt the patch at some point when I get a chance. Send me a new one if this info changes anything.

Scott
Comment 104 Scott Hansen 2009-07-12 22:03:06 UTC
Sorry guys....I have to retract my previous post after using intel-video-newest for a couple of hours. Worked fine with normal browsing, and program open/closing, but as soon as a non-flash video (avi) played, the flicker went back to making it unusable (well, highly unpleasant at least) for the duration of the movie. Flash video doesn't seem to trigger the flicker, except periodically.

Scott
Comment 105 Jesse Barnes 2009-07-13 09:30:14 UTC
The last patch I attached here is a kernel patch; it should make things better for you if you've got a KMS enabled configuration.  Is there any way for you to try that, Scott?
Comment 106 Scott Hansen 2009-07-13 11:16:48 UTC
(In reply to comment #105)
> The last patch I attached here is a kernel patch; it should make things better
> for you if you've got a KMS enabled configuration.  Is there any way for you to
> try that, Scott?
> 

Tried with the kernel source from the Arch repos and got:

patching file drivers/gpu/drm/i915/i915_reg.h
Hunk #1 FAILED at 1618.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_reg.h.rej
patching file drivers/gpu/drm/i915/intel_display.c
Hunk #1 FAILED at 1623.
Hunk #2 FAILED at 1822.
Hunk #3 FAILED at 1869.
Hunk #4 FAILED at 2022.
4 out of 4 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/intel_display.c.rej

I did make sure this patch was applied before the standard arch patches. Can you send the link for the other kernel source you had me use last time? 

Thanks!
Scott
Comment 107 Jesse Barnes 2009-07-15 13:53:50 UTC
Fix has been pushed to drm-intel-next, that's probably the easiest way to get it now:

author	Jesse Barnes <jbarnes@virtuousgeek.org>
commit	dff33cfcefa31c30b72c57f44586754ea9e8f3e2

drm/i915: FIFO watermark calculation fixes
Comment 108 Scott Hansen 2009-07-16 22:01:41 UTC
Ok, got and compiled the kernel from git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git.

uname -a = 2.6.31-rc2-drm-intel-26127-gdff33cf #1 SMP PREEMPT Thu Jul 16 20:23:01 PDT 2009 i686 Intel(R) Pentium(R) 4 CPU 1.80GHz GenuineIntel GNU/Linux

xf86-video-intel-newest 2.7.99.902-1 : X.org Intel i810/i830/i915/945G/G965+ video drivers (2.8.0 RC2).

Enabled KMS. Same flicker behavior following suspend to RAM, possibly even worse than with the stock kernel and no KMS. Darn it, I was hoping we had this solved!

Well, let me know what other information you need from me. I can't remember where to find the source for the intel_reg_dump program you had me use several months ago, if you need that.

Thanks!
Scott
Comment 109 Jesse Barnes 2009-07-17 09:42:35 UTC
The bug that keeps on giving.  Please check this one out; Eric found the same thing for his high res configs:

http://lists.freedesktop.org/archives/intel-gfx/2009-July/003471.html
Comment 110 Scott Hansen 2009-07-17 19:50:30 UTC
Jesse, no change with that patch. Still horrible flickering of the whole screen after resuming from suspend to RAM.

What's next?  :)

Scott
Comment 111 Jesse Barnes 2009-07-17 20:35:05 UTC
Can you attach your kernel log after you've loaded drm with debug=1?  (Note, I'm assuming you're using KMS here.)
Comment 112 Scott Hansen 2009-07-18 09:56:28 UTC
Boot was at 09:18, I suspended and resumed a few minutes later. Debug sure fills up the log quick! Sorry it's so big....it was too big to post here so here's the link. Booted with drm.debug=1 and i915.modeset=1. Definitely have KMS working, because the switching between virtual terminals is so fast. Cool!

http://scottandchrystie.homeip.net/kernel.log.gz

Scott
Comment 113 Jesse Barnes 2009-07-18 11:08:23 UTC
Ah thanks, that helps a lot.  What chipset do you have?  I should be able to give you a fix pretty quickly...
Comment 114 Scott Hansen 2009-07-18 14:31:09 UTC
Jesse: Graphics device from lspci --

00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 01)
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

Scott
Comment 115 Jesse Barnes 2009-07-19 13:16:41 UTC
Hm, I was hoping it was something simple like I'd just read the 845 docs incorrectly, but afaict things are actually correct for that case.  But the plane A FIFO allocation does look suspiciously high; this patch assumes 845G actually measures FIFO entries in DSPARB as 16 byte values rather than 64, so it might help.  I'll have to check some more docs before I know for sure though.

--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -1844,6 +1844,9 @@ static int intel_get_fifo_size(struct drm_device *dev, int
                        size = ((dsparb >> DSPARB_BEND_SHIFT) & 0x1ff) -
                                (dsparb & 0x1ff);
                size >>= 1; /* Convert to cachelines */
+       } else if (IS_845G(dev)){
+               size = dsparb & 0x7f;
+               size >>= 2; /* Convert to cachelines */
        } else {
                size = dsparb & 0x7f;
                size >>= 1; /* Convert to cachelines */
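
For readers following along, the shift change in this patch can be sketched as a small standalone C function. This is only an illustration of the arithmetic, not kernel code: `fifo_size_cachelines` and its `is_845g` flag are hypothetical names, and the mask and shift amounts are copied from the hunk above.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the two branches in the patch above.
 * The low 7 bits of DSPARB are taken as the plane A FIFO size; the
 * 845G branch shifts by 2 instead of 1 when converting to cachelines.
 */
static int fifo_size_cachelines(uint32_t dsparb, int is_845g)
{
	int size = dsparb & 0x7f;

	return is_845g ? size >> 2 : size >> 1;
}
```

With a made-up raw field value of 0x60 (96), the default branch yields 48 cachelines and the 845G branch yields 24, so the patch halves the FIFO size that the watermark calculation sees for the same register contents.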
Comment 116 Scott Hansen 2009-07-19 17:59:12 UTC
That didn't work, Jesse. I just get a black screen when it switches to the framebuffer on boot. The machine is still functioning because I can ssh in, but no display.

Let me know if you need the logs for this.

Scott
Comment 117 Jesse Barnes 2009-07-20 09:29:25 UTC
Ah, I was looking at the wrong code path.  In the 830/845 case I think I might be clobbering some important bits; this should preserve them and hopefully set the right values.

--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -1943,14 +1943,16 @@ static void i830_update_wm(struct drm_device *dev, int planea_clock,
 			   int pixel_size)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
-	uint32_t fwater_lo = I915_READ(FW_BLC) & MM_FIFO_WATERMARK;
+	uint32_t fwater_lo = I915_READ(FW_BLC) & ~0xfff;
 	int planea_wm;
 
 	i830_wm_info.fifo_size = intel_get_fifo_size(dev, 0);
 
 	planea_wm = intel_calculate_wm(planea_clock, &i830_wm_info,
 				       pixel_size, latency_ns);
-	fwater_lo = fwater_lo | planea_wm;
+	fwater_lo |= (3<<8) | planea_wm;
+
+	DRM_DEBUG("Setting FIFO watermarks - A: %d\n", planea_wm);
 
 	I915_WRITE(FW_BLC, fwater_lo);
 }
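
The read-modify-write in this patch can be illustrated outside the kernel with a plain function on 32-bit values. This is a sketch under assumptions: `fw_blc_update` is a hypothetical name, the register read and write are replaced by an argument and a return value, and the meaning of the `3<<8` field is not stated in this thread, so it is only copied here.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the patched i830_update_wm() logic:
 * keep every bit of the old FW_BLC value above the low 12 bits,
 * then OR in the constant (3 << 8) and the plane A watermark.
 * As in the patch, planea_wm is not masked before being ORed in.
 */
static uint32_t fw_blc_update(uint32_t old_fw_blc, uint32_t planea_wm)
{
	uint32_t fwater_lo = old_fw_blc & ~(uint32_t)0xfff;

	fwater_lo |= (3u << 8) | planea_wm;
	return fwater_lo;
}
```

For a made-up old value of 0xABCDE123 and a watermark of 0x20, the result is 0xABCDE320: the upper 20 bits survive untouched and only the low 12 bits are rewritten.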
Comment 118 Scott Hansen 2009-07-21 20:30:33 UTC
Jesse, I think this is an improvement. Still get occasional flickers with normal browsing and window movements following suspend. DVD and other movie playback still triggers strong flickering, although it seems somewhat better than the last patch. Flash doesn't seem to trigger the flicker, even running full screen. Here's the link for the kernel log (drm.debug=1).

http://scottandchrystie.homeip.net/kernel.log.gz

Just so you know, the last two patches you posted have been "malformed patches" right around line 4. I've had to manually patch to get it working :) Not sure if it's a cut and paste artifact, but the other ones you posted worked fine as a patch file.

Thanks!
Scott
Comment 119 Jesse Barnes 2009-07-22 09:18:48 UTC
OK, so we're slowly improving. :)  What if you apply both patches?  I still can't find docs for the 845G FIFO and cache line sizes, so that could still be an issue.
Comment 120 Scott Hansen 2009-07-22 12:41:35 UTC
Awesome! That did it!! Not a flicker to be seen so far! Nice work :) Let me know if you need anything else and when the patches actually make it into the kernel.

Thanks very much!

Scott
