Bug 17988

Summary: [945GM] starting second X server fails
Product: xorg Reporter: Will Stephenson <wstephenson>
Component: Driver/intel Assignee: Jesse Barnes <jbarnes>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: major    
Priority: high CC: eich, kent.liu, ling.yue, mat, pachoramos1, quanxian.wang, sndirsch
Version: 7.4 (2008.09)   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
log of first X server before starting second server. | none
Log of first X server after starting second server | none
Log of second, hung X server | none
xorg.conf | none
Log with ModeDebug on | none
Crashed X server's log | none
2nd X server's log | none
Log of first X server with debug open | none
Log of second X server with debug open | none
This is the diff for 2008Q3 release | none

Description Will Stephenson 2008-10-09 00:12:07 UTC
Created attachment 19516 [details]
log of first X server before starting second server.

* Tue Sep 23 2008 sndirsch@suse.de                                        
- xorg-server 1.5.1 (planned for final X.Org 7.4 release)                 
  * Conditionalize Composite-based backing store on                       
    pScreen->backingStoreSupport. (Aaron Plattner)                        
  * Move RELEASE_DATE below AC_INIT. (Adam Jackson)                       
  * exa: disable shared pixmaps (Julien Cristau)                          
  * Fix panoramiX request and reply swapping (Peter Harris)               

on 945GM:
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
        Subsystem: Lenovo ThinkPad T60/R60 series
on openSUSE 11.1beta2, kernel-pae-2.6.27-12.1

Steps to reproduce:

I start an X server and twm.  There is a flash of pixel puke 

On starting a second X server, the display flashes several times then freezes, 2/3 painted with the default X herringbone root window pattern.  It appears to have stopped painting in the middle of a 4 pixel high strip.

The last log output from :1 is 
(EE) intel(0): tried to update DSPARB with both planes enabled!
Could not init font path element /usr/share/fonts/TTF/, removing from list!
Could not init font path element /usr/share/fonts/OTF, removing from list!

The first X on :0 can be simply killed; :1 appears to be hung and has to be kill -9'ed before I can restart :0 to get a usable display.  If I try to start further X servers before killing :1, they hang at (==) Using config file: "/etc/X11/xorg.conf"

The only other report mentioning DSPARB is https://bugs.freedesktop.org/show_bug.cgi?id=17050
Comment 1 Will Stephenson 2008-10-09 00:12:49 UTC
Created attachment 19517 [details]
Log of first X server after starting second server
Comment 2 Will Stephenson 2008-10-09 00:15:37 UTC
Created attachment 19518 [details]
Log of second, hung X server
Comment 3 Will Stephenson 2008-10-09 00:17:13 UTC
Created attachment 19519 [details]
xorg.conf
Comment 4 Stefan Dirsch 2008-11-05 22:25:30 UTC
Since Kent asked me about the severity/priority.
Comment 5 Jesse Barnes 2008-11-10 16:07:57 UTC
I reproduced the symptom with a GEM enabled kernel, but got a kernel backtrace instead.  Still trying to reproduce this with an older kernel...

That DSPARB related message indicates we're not doing something correctly in our LeaveVT function probably, which could explain the hangs.  Anyway I'll keep trying to reproduce it.

In the meantime can you enable the modedebug option in your xorg.conf?  It should catch register differences between startup & VT switch time, so it might help.
Comment 6 qwang13 2008-11-13 21:10:19 UTC
Stefan, could you enable the ModeDebug option to get more information for Jesse?

Option "ModeDebug" "on"

Jesse, any progress on that with the GEM kernel?

Thanks
Comment 7 Stefan Dirsch 2008-11-22 02:39:18 UTC
Quanxian, unfortunately I don't have such a machine. Will, could you provide the requested information, please? Thanks.
Comment 8 Will Stephenson 2008-11-24 02:15:38 UTC
Created attachment 20543 [details]
Log with ModeDebug on

Does this log tell you what you need to know?  Interestingly, I started a second server from a running session, and this time the display went black and stayed black only after logging in.  Will try again starting both servers from the console.
Comment 9 Will Stephenson 2008-11-24 02:20:20 UTC
Created attachment 20544 [details]
Crashed X server's log

This is the logfile from a second run with ModeDebug on.  

no display manager was running
$ X :0 &
$ X :1 &
ctrl-alt-f7 and :0 dies with the attached log.
Comment 10 Will Stephenson 2008-11-24 02:20:52 UTC
Created attachment 20545 [details]
2nd X server's log
Comment 11 Will Stephenson 2008-11-24 02:21:35 UTC
Info provided, sorry for the delay, was on holiday.
Comment 12 qwang13 2008-12-08 22:47:25 UTC
Jesse,
I have just debugged this. Based on your comments, the problem may be that DSPARB is set incorrectly.

The debug output is below.
------------
First X server
(II) intel(0): i830_update_dsparb-qxwang:num_crtc-2, fifo_entries-95
(II) intel(0): i830_update_dsparb-qxwang:crtc-0,enabled-0, plane-0
(II) intel(0): i830_update_dsparb-qxwang:crtc-1,enabled-1, plane-1
(II) intel(0): i830_update_dsparb-qxwang:total_hdisplay-1024,planea_hdisplay-0, planeb_hdisaply-1024

Second X server
(II) intel(0): i830_update_dsparb-qxwang:num_crtc-2, fifo_entries-95
(II) intel(0): i830_update_dsparb-qxwang:crtc-0,enabled-0, plane-1
(II) intel(0): i830_update_dsparb-qxwang:crtc-1,enabled-1, plane-0
(II) intel(0): i830_update_dsparb-qxwang:total_hdisplay-1024,planea_hdisplay-1024, planeb_hdisaply-0
----------------
DSPARB is programmed like this in i830_display.c:
------------
   planea_entries = fifo_entries * planea_hdisplay / total_hdisplay;
   planeb_entries = fifo_entries * planeb_hdisplay / total_hdisplay;

   if (IS_I9XX(pI830))
       OUTREG(DSPARB,
              ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
              (planea_entries << DSPARB_BSTART_SHIFT));
   else if (IS_MOBILE(pI830))
       OUTREG(DSPARB,
              ((planea_entries + planeb_entries) << DSPARB_BEND_SHIFT) |
              (planea_entries << DSPARB_AEND_SHIFT));
-------------

My question: if we start two X servers, there will be two planes (plane A and plane B) on one CRTC.
Every time we switch to the other server, planea_entries and planeb_entries become 0 or 95 (on I9XX chipsets), based on the test data above. Is that reasonable? Should we instead give plane A and plane B 95/2 entries each, whichever X server we chvt to?

Is my understanding right? If so, we should allocate fifo_entries based on the number of planes currently in use.

Regards
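
A minimal sketch of the arithmetic discussed in this comment, for readers who want to check the numbers. The helper names (split_fifo, pack_dsparb_9xx) are hypothetical, and the shift values (BSTART at bit 0, CSTART at bit 7) and the definition of total_hdisplay as the sum of both planes' widths are assumptions; they are consistent with the log lines above but are not copied from the driver headers.

```c
#include <stdint.h>

/* Assumed 9xx DSPARB layout: BSTART at bit 0, CSTART at bit 7.
 * These values are an assumption made for this sketch. */
#define DSPARB_BSTART_SHIFT 0
#define DSPARB_CSTART_SHIFT 7

/* Hypothetical container for the per-plane FIFO allocation. */
struct fifo_split {
    int planea_entries;
    int planeb_entries;
};

/* Proportional split, mirroring the two assignments in the excerpt above.
 * total_hdisplay is assumed to be the sum of both planes' widths. */
static struct fifo_split
split_fifo(int fifo_entries, int planea_hdisplay, int planeb_hdisplay)
{
    struct fifo_split s = { 0, 0 };
    int total_hdisplay = planea_hdisplay + planeb_hdisplay;

    if (total_hdisplay == 0)
        return s;               /* no plane enabled: avoid dividing by zero */

    s.planea_entries = fifo_entries * planea_hdisplay / total_hdisplay;
    s.planeb_entries = fifo_entries * planeb_hdisplay / total_hdisplay;
    return s;
}

/* Pack the split the same way as the IS_I9XX branch in the excerpt. */
static uint32_t
pack_dsparb_9xx(struct fifo_split s)
{
    return ((uint32_t)(s.planea_entries + s.planeb_entries) << DSPARB_CSTART_SHIFT) |
           ((uint32_t)s.planea_entries << DSPARB_BSTART_SHIFT);
}
```

Under these assumptions, the first server's widths (0, 1024) pack to 0x2f80 and the second server's (1024, 0) pack to 0x2fdf, which are the two DSPARB values that appear in the register dump in comment 31.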
Comment 13 qwang13 2008-12-15 18:19:48 UTC
Jesse, 
Are you still looking at this?

I have no idea about that. :(
Comment 14 Jesse Barnes 2008-12-16 17:43:23 UTC
Both servers should be running on the same pipe though, so DSPARB should be the same.  However, it'll switch at VT time back to its original value, then get re-programmed again when the next server's EnterVT gets called.  At least, that's what *should* happen...
Comment 15 qwang13 2008-12-16 18:04:35 UTC
OK. That means that when you do a VT switch, one server does LeaveVT and the other does EnterVT. Based on the log, the synchronization seems to have a problem: when one server does LeaveVT, it doesn't clean up pipe B or the other registers, so pipe B keeps running.

I have checked the source code. This is handled in the DPMS operation: on DPMS off, the DPLL, pipe, plane, and port are disabled. Is there anywhere else this happens? Also, is that enough to clean up the registers shared by the two servers?

Regards

Quanxian Wang
Comment 16 Jesse Barnes 2008-12-17 16:34:10 UTC
Well, we could very well be missing something in our Enter/LeaveVT code.  When the driver does a LeaveVT, it's supposed to put the hw back into the state it found it, using the RestoreHWState function (after turning everything off).  If, after this, some of the register values aren't the same as they were when the server started, it could definitely cause trouble for a subsequent X server.  On EnterVT, the driver uses xf86SetDesiredModes to get things back to where it wants them... 

On this hardware, it looks like the pipeA quirk might be active?  That leaves pipe A on so that the BIOS can bang on it at suspend time for example without hanging the machine.  Also, the second server seems to be choosing a different plane/pipe mapping for some reason.  Does the second server work better if you disable framebuffer compression?
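
The save/restore contract described in this comment can be sketched roughly as follows. This is not the driver's RestoreHWState; the three-register file, the struct, and all function names are hypothetical and exist only to illustrate the idea that LeaveVT should leave every register as it was found, and that any mismatch is what a ModeDebug-style comparison would flag.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS 3   /* hypothetical, tiny register file for illustration */

/* Snapshot of the hardware state taken when the server starts. */
struct hw_state {
    uint32_t regs[NUM_REGS];
};

/* Record the live register values (done once, at startup). */
static void
save_hw_state(const uint32_t *live, struct hw_state *saved)
{
    memcpy(saved->regs, live, sizeof(saved->regs));
}

/* Write the snapshot back (what LeaveVT is supposed to achieve). */
static void
restore_hw_state(uint32_t *live, const struct hw_state *saved)
{
    memcpy(live, saved->regs, sizeof(saved->regs));
}

/* Count registers that differ from the snapshot; a non-zero result after
 * LeaveVT means the next server starts from an unexpected state. */
static size_t
count_mismatches(const uint32_t *live, const struct hw_state *saved)
{
    size_t i, n = 0;

    for (i = 0; i < NUM_REGS; i++)
        if (live[i] != saved->regs[i])
            n++;
    return n;
}
```

In this model, a server that reprograms one register and then leaves the VT without restoring it yields a mismatch count of 1, and the count returns to 0 once the snapshot is written back.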
Comment 17 liuhaien 2008-12-22 19:26:56 UTC
Hi all,
I cannot reproduce this issue on our 945GM. My system environment is below.
Is there any difference from yours?
Host:		945gm
Arch:		i386
OSD:		Fedora release 7 (Moonshine)
Libdrm:		(master)0243c9f801a35de3465a0321c02f18a4d07ce5b8
Mesa:		(intel-2008-q4)f96baeaac3ef41260ac3975750627ece073fdce0
Xserver:	(server-1.6-branch)32e81074b967716865aef08b66ec29caf0fec2c5
Xf86_video_intel:	(xf86-video-intel-2.6-branch)83f3c376b5942e134047a220e6e5f2432ffc492c
GEM_kernel:       (for-airlied)0fbdb7c9455a05eb89f358f0eb66fb8ab094a0c5
Comment 18 qwang13 2008-12-23 01:09:02 UTC
Haien,
We use the Intel Q3 release, which is gem-classic. It is very different from your environment. Also, we cannot backport yours to the SLE11 system.

See comment 1
Comment 19 qwang13 2008-12-23 06:27:34 UTC
Hi, Jesse
After testing, I think 945GM is a special platform. Therefore I added some code to i830_display.c, and it works fine.

Do you have any comments on the code? My idea is to use the same configuration regardless of whether plane A or plane B is in use. Of course, this code is specific to 945GM; other platforms don't have this problem, so we don't need to care about them. It seems to be another quirk. :)

---------------
--- i830_display.c_orig 2008-12-23 22:06:45.000000000 +0800
+++ i830_display.c      2008-12-23 22:07:33.000000000 +0800
@@ -1156,9 +1156,18 @@ i830_update_dsparb(ScrnInfoPtr pScrn)
    planeb_entries = fifo_entries * planeb_hdisplay / total_hdisplay;

    if (IS_I9XX(pI830))
+   {
+       if(IS_I945GM(pI830))
+       {
+            OUTREG(DSPARB, (95 << DSPARB_CSTART_SHIFT) |
+                  (48 << DSPARB_BSTART_SHIFT));
+       }
+       else{
        OUTREG(DSPARB,
-             ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
-             (planea_entries << DSPARB_BSTART_SHIFT));
+              ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
+              (planea_entries << DSPARB_BSTART_SHIFT));
+       }
+   }
    else if (IS_MOBILE(pI830))
        OUTREG(DSPARB,
              ((planea_entries + planeb_entries) << DSPARB_BEND_SHIFT) |
----------------
Comment 20 Gordon Jin 2008-12-23 17:14:06 UTC
(In reply to comment #18)
> Haien,
> We use intel-Q3 release which is gem-classic. It is big different with yours
> environment. Also we can not backport yours into SLE11 system.

I thought this also happened on upstream as Jesse said he can reproduce with GEM kernel (in comment#5). Then I'm removing this from Q4 release blocker now.
Comment 21 qwang13 2009-01-16 03:22:31 UTC
Jesse, 
I think I should set the entry count to 128. I remember reading a comment of yours saying that 945GM has 128 FIFO entries. Is that right?

--- i830_display.c_orig 2008-12-23 22:06:45.000000000 +0800
+++ i830_display.c      2008-12-23 22:07:33.000000000 +0800
@@ -1156,9 +1156,18 @@ i830_update_dsparb(ScrnInfoPtr pScrn)
    planeb_entries = fifo_entries * planeb_hdisplay / total_hdisplay;

    if (IS_I9XX(pI830))
+   {
+       if(IS_I945GM(pI830))
+       {
+            OUTREG(DSPARB, (128 << DSPARB_CSTART_SHIFT) |
+                  (48 << DSPARB_BSTART_SHIFT));
+       }
+       else{
        OUTREG(DSPARB,
-             ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
-             (planea_entries << DSPARB_BSTART_SHIFT));
+              ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
+              (planea_entries << DSPARB_BSTART_SHIFT));
+       }
+   }
    else if (IS_MOBILE(pI830))
        OUTREG(DSPARB,
              ((planea_entries + planeb_entries) << DSPARB_BEND_SHIFT) |
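
One thing worth checking about the constant 128 in this patch is whether it fits in the register field at all. The sketch below assumes that BSTART and CSTART are each 7 bits wide (bits 0-6 and 7-13); that width is an assumption, though it is consistent with comment 23's remark that the patch in 18651 uses 127. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed field geometry: two 7-bit fields, BSTART at bit 0, CSTART at bit 7. */
#define DSPARB_FIELD_MASK   0x7fu
#define DSPARB_BSTART_SHIFT 0
#define DSPARB_CSTART_SHIFT 7

/* Non-zero if the value can be represented in a 7-bit field. */
static int
fits_dsparb_field(unsigned value)
{
    return (value & ~DSPARB_FIELD_MASK) == 0;
}

/* Pack CSTART/BSTART, refusing values that would spill out of their field.
 * Returns 0 on success and stores the register value in *reg; returns -1
 * (leaving *reg untouched) if either value is out of range. */
static int
pack_dsparb_checked(unsigned cstart, unsigned bstart, uint32_t *reg)
{
    if (!fits_dsparb_field(cstart) || !fits_dsparb_field(bstart))
        return -1;

    *reg = ((uint32_t)cstart << DSPARB_CSTART_SHIFT) |
           ((uint32_t)bstart << DSPARB_BSTART_SHIFT);
    return 0;
}
```

If the 7-bit assumption holds, the values from comment 19 (95 and 48) pack to 0x2fb0, while 128 is rejected: shifted left by 7 it would set bit 14, outside the CSTART field, so 127 would be the largest usable value.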


(In reply to comment #19)
> Hi, Jesse
> After testing, I think 945GM is a special platform. Therefore I add some code
> into the i830_display.c. It works fine.
> 
> Do you have some comments for the code? My idea is to make them use the same
> configuration whatever plane a or plane b is used. Of course, this code is
> special for 945GM, others platform doesn't have such problem, we don't need
> care. Seems another quirk process. :)
> 
> ---------------
> --- i830_display.c_orig 2008-12-23 22:06:45.000000000 +0800
> +++ i830_display.c      2008-12-23 22:07:33.000000000 +0800
> @@ -1156,9 +1156,18 @@ i830_update_dsparb(ScrnInfoPtr pScrn)
>     planeb_entries = fifo_entries * planeb_hdisplay / total_hdisplay;
> 
>     if (IS_I9XX(pI830))
> +   {
> +       if(IS_I945GM(pI830))
> +       {
> +            OUTREG(DSPARB, (95 << DSPARB_CSTART_SHIFT) |
> +                  (48 << DSPARB_BSTART_SHIFT));
> +       }
> +       else{
>         OUTREG(DSPARB,
> -             ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
> -             (planea_entries << DSPARB_BSTART_SHIFT));
> +              ((planea_entries + planeb_entries) << DSPARB_CSTART_SHIFT) |
> +              (planea_entries << DSPARB_BSTART_SHIFT));
> +       }
> +   }
>     else if (IS_MOBILE(pI830))
>         OUTREG(DSPARB,
>               ((planea_entries + planeb_entries) << DSPARB_BEND_SHIFT) |
> ----------------
> 

Comment 22 Jesse Barnes 2009-01-16 09:31:38 UTC
qwang, yeah sorry I missed your earlier add.  Yeah I think we have the wrong DSPARB values for 945.  There's another patch (that uses the entry count variables instead) in 18651 that you should test.
Comment 23 qwang13 2009-01-18 22:17:56 UTC
(In reply to comment #22)
> qwang, yeah sorry I missed your earlier add.  Yeah I think we have the wrong
> DSPARB values for 945.  There's another patch (that uses the entry count
> variables instead) in 18651 that you should test.
> 

Jesse,
The patch doesn't work (actually, it just changes 95 to 127 for 945GM). With the patch applied, we cannot reach the login window; the screen just shows a mass of colorful lines.
Comment 24 qwang13 2009-02-10 18:49:53 UTC
This also happens on GM965. We started two X servers and found that one of them showed a white screen.
02/26 is the final release (GMC) for SLED11, and we can only provide the patch one week or more before that date. This bug will affect many customers who use Intel graphics chipsets with the Novell SLED11 release.

I am raising the priority to critical.
Comment 25 Will Stephenson 2009-02-11 10:00:47 UTC
Reproduced again with SLED11 RC3.  FYI the hang happens when switching back to the 1st server's VT.  There is still pixel puke between gdm and the first user session coming up...
Comment 26 Jesse Barnes 2009-02-12 09:17:51 UTC
If it's the pipe mapping problem Zhenyu noticed, the patch in 19603 might be the fix.  Can you give it a try?
Comment 27 Will Stephenson 2009-02-16 03:39:18 UTC
(In reply to comment #26)
> If it's pipe mapping problem Zhenyu noticed, the patch in 19603 might be the
> fix.  Can you give it a try?
> 

i830-create-known-state.patch from 19603 doesn't apply cleanly to 2.5.0, but I'm building with Helge's patch (xf86-video-intel-2.5.99-vesa-vtswitch.patch) which does.
Comment 28 qwang13 2009-02-16 20:15:27 UTC
Created attachment 23005 [details]
Log of first X server with debug open

After applying the patch from 19603 and testing, we no longer see underrun messages. However, the second server goes straight to a white screen.
Checking the log of the second server:
1) DRI is disabled (DRI screen initialization failed; only one server can use DRI)
2) the DRM front, back, and depth buffers are not allocated in the second server (since DRI is disabled, that should be OK)

Also, there is no EDID information in the second server's output.
I am attaching the Xorg logs.
Comment 29 qwang13 2009-02-16 20:17:52 UTC
Created attachment 23006 [details]
Log of second X server with debug open

Log of second server
Comment 30 qwang13 2009-02-20 00:38:01 UTC
Actually, the problem is still there. I had just run into a separate compiz issue. When I disable compiz and test again, the underrun messages still appear. The patch does nothing for this bug.
Comment 31 Jesse Barnes 2009-02-20 10:38:19 UTC
This same configuration (server 1.5, driver 2.5) works on my 915 test box, so it could be something related to the 945 in particular.

Note that the second server is properly swapping pipes & planes, while the first server isn't (probably because DRI is enabled and the drm module doesn't support that option), though for some reason the first server isn't allocating a compressed framebuffer (it does in my case).

--- working.out 2009-02-20 10:17:30.406528992 -0800
+++ broken.out  2009-02-20 10:17:09.082552621 -0800
@@ -23,7 +23,7 @@
 (II) intel(0):                SDVOB: 0x00480000 (disabled, pipe A, stall disabled, not detected)
 (II) intel(0):                SDVOC: 0x00480000 (disabled, pipe A, stall disabled, not detected)
 (II) intel(0):              SDVOUDI: 0x0000003f
-(II) intel(0):               DSPARB: 0x00002f80
+(II) intel(0):               DSPARB: 0x00002fdf
 (II) intel(0):               DSPFW1: 0x00000000
 (II) intel(0):               DSPFW2: 0x00000000
 (II) intel(0):               DSPFW3: 0x00000000
@@ -44,10 +44,10 @@
 (II) intel(0):      PFIT_PGM_RATIOS: 0x00000000
 (II) intel(0):      PORT_HOTPLUG_EN: 0x00000020
 (II) intel(0):    PORT_HOTPLUG_STAT: 0x00000000
-(II) intel(0):             DSPACNTR: 0x58000000 (disabled, pipe A)
+(II) intel(0):             DSPACNTR: 0xd9000000 (enabled, pipe B)
 (II) intel(0):           DSPASTRIDE: 0x00002000 (8192 bytes)
 (II) intel(0):              DSPAPOS: 0x00000000 (0, 0)
-(II) intel(0):             DSPASIZE: 0x01df027f (640, 480)
+(II) intel(0):             DSPASIZE: 0x02ff03ff (1024, 768)
 (II) intel(0):             DSPABASE: 0x01000000
 (II) intel(0):             DSPASURF: 0x00000000
 (II) intel(0):          DSPATILEOFF: 0x00000000
@@ -66,10 +66,10 @@
 (II) intel(0):              VSYNC_A: 0x01ea01e8 (489 start, 491 end)
 (II) intel(0):            BCLRPAT_A: 0x00000000
 (II) intel(0):         VSYNCSHIFT_A: 0x00000000
-(II) intel(0):             DSPBCNTR: 0xd9000000 (enabled, pipe B)
+(II) intel(0):             DSPBCNTR: 0x58000000 (disabled, pipe A)
 (II) intel(0):           DSPBSTRIDE: 0x00002000 (8192 bytes)
 (II) intel(0):              DSPBPOS: 0x00000000 (0, 0)
-(II) intel(0):             DSPBSIZE: 0x02ff03ff (1024, 768)
+(II) intel(0):             DSPBSIZE: 0x01df027f (640, 480)
 (II) intel(0):             DSPBBASE: 0x01000000
 (II) intel(0):             DSPBSURF: 0x00000000
 (II) intel(0):          DSPBTILEOFF: 0x00000000

I can start compiz & a full desktop on screen 0, then a simple xterm & twm on screen 1 and switch back & forth without problems.  If I try to use the same user desktop session on both heads (full gnome session running as root on both) the apps on the second head hang, but this may be because they're confused about which head to run on or are waiting for updates from apps on the first head (which are likely not responding).

But all of that is with a GEM enabled kernel, which it doesn't look like you have.  I'll build your kernel too and see what happens.
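
To read the raw values in the register dump above, a small decoder along these lines may help. It is a sketch under assumptions: DSPARB is taken to hold two 7-bit fields (BSTART at bit 0, CSTART at bit 7), and the plane size registers are taken to hold width minus one in the low 16 bits and height minus one in the high 16 bits. The size encoding matches the decoded values printed in the log itself; the function and struct names are hypothetical.

```c
#include <stdint.h>

/* Decoded plane dimensions, in pixels. */
struct plane_size {
    unsigned width;
    unsigned height;
};

/* DSPxSIZE: (height - 1) in the high half, (width - 1) in the low half,
 * as implied by "0x01df027f (640, 480)" in the dump. */
static struct plane_size
decode_dspsize(uint32_t reg)
{
    struct plane_size s;

    s.width  = (reg & 0xffffu) + 1;
    s.height = ((reg >> 16) & 0xffffu) + 1;
    return s;
}

/* DSPARB, assumed layout: where plane B's FIFO region starts
 * (equivalently, the number of entries given to plane A). */
static unsigned
dsparb_bstart(uint32_t reg)
{
    return reg & 0x7fu;
}

/* DSPARB, assumed layout: where plane C's region would start
 * (equivalently, entries for plane A plus plane B). */
static unsigned
dsparb_cstart(uint32_t reg)
{
    return (reg >> 7) & 0x7fu;
}
```

With these assumptions, 0x00002f80 decodes to BSTART 0 and CSTART 95 (plane A gets nothing, plane B gets all 95 entries), while 0x00002fdf decodes to BSTART 95 and CSTART 95 (plane A gets all 95 entries, plane B none), which lines up with the swapped plane enables in the two dumps.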
Comment 32 Jesse Barnes 2009-02-20 17:34:26 UTC
Hm, nope 2.6.27.8 works ok too, with a full GNOME+compiz session on one screen and xterm on another, and I can swap between them.

But the original report referenced a pipe B underrun, which could hang the chip.  So on that platform you may comment out the calls to update_dsparb altogether and just use the default value (it should work for most configs) to see if that prevents the hang.
Comment 33 qwang13 2009-02-22 17:49:46 UTC
(In reply to comment #32)
> Hm, nope 2.6.27.8 works ok too, with a full GNOME+compiz session on one screen
> and xterm on another, and I can swap between them.
> 
> But the original report referenced a pipe B underrun, which could hang the
> chip.  So on that platform you may comment out the calls to update_dsparb
> altogether and just use the default value (it should work for most configs) to
> see if that prevents the hang.
> 

Basically, it should work. See comment #21; I once used that as a workaround. I will try your idea and see what happens.
Comment 34 qwang13 2009-02-22 18:00:05 UTC
Yes, the bug disappears if I disable update_dsparb. However, I then run into another problem: when I enable compiz, the second server also hangs with a white screen. I will open another bug for this. It is a general issue on all platforms.
Comment 35 qwang13 2009-02-22 21:13:32 UTC
Created attachment 23191 [details]
This is the diff for 2008Q3 release

I have attached my diff against the Q3 release.

I just wonder why 2008Q4 doesn't have this problem.

Jesse, have you checked whether the problem exists with the latest upstream code?
Comment 36 Stefan Dirsch 2009-02-23 03:16:25 UTC
Patch submitted for SLE11-RC5.
Comment 37 Jesse Barnes 2009-03-16 18:02:47 UTC
Ok, sounds like this one is resolved.  I think the DSPARB/FIFO code will be changing soon as part of 18651 and related bugs anyway, but please open a new bug if you find issues with more recent code (in particular KMS).
