Bug 20546 - xdm RESERVER_DISPLAY server crash detect extension
Summary: xdm RESERVER_DISPLAY server crash detect extension
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: App/xdm (show other bugs)
Version: git
Hardware: All Linux (All)
: medium normal
Assignee: Xorg Project Team
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2009-03-08 17:48 UTC by vdb128
Modified: 2011-03-03 19:26 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
dm.c crash detect: use StopDisplay(), add resource reservAttempts (4.90 KB, patch)
2009-03-08 17:48 UTC, vdb128
no flags Details | Splinter Review
debug log showing crash detect when repeatedly terminating local X server (17.23 KB, text/plain)
2009-03-08 17:54 UTC, vdb128
no flags Details
dm.c crash detect: use StopDisplay(), add resource reservAttempts (5.20 KB, patch)
2009-03-09 05:57 UTC, vdb128
no flags Details | Splinter Review
dm.c crash detect: use StopDisplay(), add resource reservAttempts (5.22 KB, patch)
2010-08-25 11:54 UTC, vdb128
no flags Details | Splinter Review
1/2 crash detect: use StopDisplay(). (1.87 KB, patch)
2011-02-07 13:55 UTC, vdb128
no flags Details | Splinter Review
2/2 add the resource reservAttempts. (5.66 KB, patch)
2011-02-07 13:57 UTC, vdb128
no flags Details | Splinter Review
debug log showing crash detect when repeatedly terminating the server. (19.10 KB, text/plain)
2011-02-07 13:59 UTC, vdb128
no flags Details

Description vdb128 2009-03-08 17:48:28 UTC
Created attachment 23660 [details] [review]
dm.c crash detect: use StopDisplay(), add resource reservAttempts

The crash detect code in dm.c WaitForChild() RESERVER_DISPLAY checks 
if a previous manager exit occured in the last 120 seconds.  If so 
RemoveDisplay() is used to remove the display from the list.  

The proposed patch uses StopDisplay() and thus slightly changes the 
behaviour for local servers, from a direct removal to a synchronous 
wait for child exit.  For a local X server either:

1. The server exit was already returned by waitpid().  So 
   serverPid==-1 => StopDisplay() calls RemoveDisplay().

2. The server is a zombie or still running.  So 
   serverPid>1 => StopDisplay() sets status=zombie and kills the server. 
   The next waitpid() returns this zombie server pid and the code then 
   calls RemoveDisplay().  

The patch further adds support for allowing successive 
RESERVER_DISPLAY manager exits.  The resource reservAttempts is the 
number of times a fatal IO error is allowed in session.c.

This extension is useful for my not so reliable multi-server-seat 
setup.  If two users quit X simultaneously the video card bioses 
interfere, and the only way out is to terminate X once or twice 
through ctrl-alt-backspace.  The xdm errorlog sometimes shows 

  Server for display :2 terminated unexpectedly

and always contains

  fatal IO error 11 (Resource temporarily unavailable)

depending on the first return value of waitpid(), the server or 
manager pid.  

Thank you for taking a look.
Comment 1 vdb128 2009-03-08 17:54:00 UTC
Created attachment 23661 [details]
debug log showing crash detect when repeatedly terminating local X server
Comment 2 vdb128 2009-03-09 05:57:19 UTC
Created attachment 23688 [details] [review]
dm.c crash detect: use StopDisplay(), add resource reservAttempts

Reset reservTries after each session, and also after an IO error if the exit is not a crash.
Comment 3 vdb128 2010-08-25 11:54:37 UTC
Created attachment 38146 [details] [review]
dm.c crash detect: use StopDisplay(), add resource reservAttempts

Applies to xdm 1.1.10.
Comment 4 vdb128 2010-11-09 02:46:24 UTC
The text below is a copy of the description posted on xorg-devel.

Once upon a time xdm was created.  This program manages X terminals by 
forking 1 or 2 processes per display:

  1. a local server 'X :0' if required, and 
  2. the session manager greeter '-:0'.

The dm.c WaitForChild() loop interpretes the session exit code and 
uses a ++startTries>=startAttempts test to detect an unresponsive 
server, OPENFAILED_DISPLAY, or a SIGTERM'ed session.  Other exit codes 
reset startTries = 0.  

All worked well until the ps2 mouse appeared.  Such input device is 
initialized last and after the X listening socket.  Thus the session 
manager finds the X server, starts the greeter, and waits for its 
return.  Alas, if the mouse now fails initialization then X quits and 
triggers session.c IOErrorHandler() { exit(RESERVER_DISPLAY); }.  
The dm.c loop would then call RestartDisplay(d, TRUE) causing a 
persistent restart-quit cycle which inhibits user login.  

Somewhere before 2007 this was solved by the crash detect code in dm.c 
at the case RESERVER_DISPLAY.  A manager exit is considered a crash if 
a previous exit occured within the XDM_BROKEN_INTERVAL of 120 second.  
RemoveDisplay() is then called to remove the display from the list.  

This isn't a perfect solution since crash detection may also be 
triggered by the ctrl-alt-backspace server termination.  Although a 
bit an unfriendly way of quitting X this is the only option available 
in case of a garbled display.  

The proposed patch adds a ++reservTries>=reservAttempts test to allow 
for a few successive crash-type manager exits.  The resource 
reservAttempts is the number of times a fatal IO error is allowed in 
session.c.  

The patch also substitutes the RestartDisplay() RemoveDisplay() 
sequence by StopDisplay().  The original sequence works since 
RestartDisplay(d, TRUE) kills the serverPid.  However RemoveDisplay() 
directly removes the display from the list which may thus give an 
"Unknown child termination" error.  StopDisplay() on the other hand 
implements the synchronous kill and wait for child exit.  
RemoveDisplay() is only to be called if neither server nor session 
manager is running.
Comment 5 vdb128 2011-02-07 13:55:08 UTC
Created attachment 43049 [details] [review]
1/2 crash detect: use StopDisplay().

This patch solves a benign bug: on server crash the display is removed 
but the server exit code is ignored.  Proposed solution: substitute 
RemoveDisplay() by StopDisplay() thus enabling the wait for child exit 
synchronism.  

* Introduction *

A code section handling X server crashes was added to xdm somewhere 
before 2007.  The section detects and breaks persistent 
start-crash-exit-restart-start-... loops by removing the display from 
the Xservers list thus disabling the display.  

Such a crash loop is not uncommon since the server is considered up 
and running if the watchdog display open connection succeeds.  Hence, 
any server quit just after initialization of its listening socket is 
considered a crash.  A typical example is a failing mouse or 
keyboard.  

Alas, a just after startup server termination via ctrl-alt-backspace 
is indistinguishable from a crash.  Although a bit an unfriendly way 
of quitting X this is the only method available in case of a garbled 
display.  

After looking at the code, and the debug log, it seems that the crash 
loop break by direct removal of the display is not wholly correct for 
local displays.  The log files show that the X server may exit after 
the session manager.  So instead of the direct RemoveDisplay() the 
synchronous kill and wait for server exit StopDisplay() should be 
used.  

Here below is a description of the code path and the underlying logic.  

* The xdm logic *

Each local display from Xservers starts two child processes, a server 
and a session manager:

xdm/dm.c StartDisplay () calls
  1.  StartServer() which forks the local server "X :0 vt7 -auth ...", 
  2.1 fork() to create the second child, 
  2.2 WaitForServer() to open the watchdog connection in d->dpy, 
  2.3 ManageSession() which becomes the session manager "-:0", 
and then sets 'd->status = running'.  

The manager will then finally call the greeter library to draw the 
login password user interface.  Before doing so session.c 
ManageSession() replaces the Xlib IOErrorhandler by

xdm/session.c IOErrorHandler (Display *dpy)
{
    LogError ("fatal IO error %d (%s)\n", errno, _SysErrorMsg(errno));
    exit(RESERVER_DISPLAY);
}

which detects an X server termination.  Indeed, when X quits the 
watchdog connection is closed thus triggering an IO error.  

After starting all displays the parent process waits for any child 
termination.  The dm.c WaitForChild() loop interpretes the exit code.

xdm/dm.c WaitForChild (void)
<...>
    while ((pid = waitpid (-1, &status, WNOHANG)) > 0)
#endif
    {
	Debug ("Manager wait returns pid: %d sig %d core %d code %d\n",
	       pid, waitSig(status), waitCore(status), waitCode(status));
	if (autoRescan)
	    RescanIfMod ();
	/* SUPPRESS 560 */
	if ((d = FindDisplayByPid (pid))) {
	    d->pid = -1;
	    switch (waitVal (status)) {
<...>
Here a manager exited.  The case below applies for an IO error.
	    case RESERVER_DISPLAY:
		d->startTries = 0;
		Debug ("Display exited with RESERVER_DISPLAY\n");
		if (d->displayType.origin == FromXDMCP || d->status == zombie)
		    StopDisplay(d);
		else
		    RestartDisplay (d, TRUE);
		{
		  Time_t Time;
		  time(&Time);
		  Debug("time %i %i\n",Time,d->lastCrash);
		  if (d->lastCrash &&
		      ((Time - d->lastCrash) < XDM_BROKEN_INTERVAL)) {
		    Debug("Server crash frequency too high:"
			  " removing display %s\n",d->name);
		    LogError("Server crash rate too high:"
			     " removing display %s\n",d->name);
#if !defined(HAVE_ARC4RANDOM) && !defined(DEV_RANDOM)
		    AddTimerEntropy();
#endif
		    RemoveDisplay (d);
		  } else
		    d->lastCrash = Time;
		}
		break;
On crash the above calls: RestartDisplay(d, TRUE); RemoveDisplay(d)
<...>
	}
	/* SUPPRESS 560 */
	else if ((d = FindDisplayByServerPid (pid)))
	{
	    d->serverPid = -1;
	    switch (d->status)
	    {
	    case zombie:
		Debug ("Zombie server reaped, removing display %s\n", d->name);
		RemoveDisplay (d);
		break;
<...>
Here an X server exited.  The case below applies for plain server termination.
	    case running:
		Debug ("Server for display %s terminated unexpectedly, status %d %d\n", d->name, waitVal (status), status);
		LogError ("Server for display %s terminated unexpectedly: %d\n", d->name, waitVal (status));
		if (d->pid != -1)
		{
		    Debug ("Terminating session pid %d\n", d->pid);
If the session manager is exiting with an IOError then this is a rekill.
		    TerminateProcess (d->pid, SIGTERM);
		}
		break;
<...>
	}
	else
	{
Here a process exited which is not in the display list.  Should never happen.
	    Debug ("Unknown child termination, status %d\n", waitVal (status));
	}
    }
This final StartDisplays() calls StartDisplay() for any notRunning display.
    StartDisplays ();

* Removing a crashed display *

The local display removal thru RestartDisplay(, TRUE) and 
RemoveDisplay() works since RestartDisplay() (re)kills the running 
server:

xdm/dm.c RestartDisplay (struct display *d, int forceReserver)
{
    if (d->serverPid != -1 && (forceReserver || d->terminateServer))
    {
	TerminateProcess (d->serverPid, d->termSignal);
	d->status = phoenix;
    }
    else
    {
	d->status = notRunning;
    }
}

Hence, serverPid == -1 if the server exit status was already returned 
by a previous waitpid().  In the other case the server process is 
still present, either in terminated zombie state or actively 
terminating due to the termSignal.  But RemoveDisplay() directly 
removes the display from the list in either case.  Thus the server 
exit is ignored.  

Examples when zapping the server twice thru ctrl-alt-backspace:

** Xservers:
** :1 local /home/vdb/soft/xmod-1.9.4/bin/X :1 vt8 -retro
<...>
display manager paused til SIGUSR1
pid: 31422
<...>
dispatching :1
RedrawFail('Login incorrect or forbidden by policy', 0)
xdm error (pid 31422): fatal IO error 11 (Resource temporarily unavailable)
Manager wait returns pid: 31422 sig 0 core 0 code 3
Display exited with RESERVER_DISPLAY
time 1295979931 1295979928
Server crash frequency too high: removing display :1
xdm error (pid 31402): Server crash rate too high: removing display :1
Nothing left to do, exiting
xdm info (pid 31402): Exiting
unlinking process ID file /home/vdb/soft/xmod-1.9.4/var/run/xdm.pid

==> Here xdm quits without waiting for the server.

** Xservers:
** :1 local /home/vdb/soft/xmod-1.9.4/bin/X :1 vt8 -retro
** :3 local /home/vdb/soft/xmod-1.9.4/bin/X :3 vt9 -retro
<...>
StartServer for :1
Server Started 26017
<...>
display manager paused til SIGUSR1
pid: 26024
<...>
dispatching :1
RedrawFail('Login incorrect or forbidden by policy', 0)
xdm error (pid 26024): fatal IO error 11 (Resource temporarily unavailable)
Manager wait returns pid: 26024 sig 0 core 0 code 3
Display exited with RESERVER_DISPLAY
time 1295972420 1295972415
Server crash frequency too high: removing display :1
xdm error (pid 25959): Server crash rate too high: removing display :1
WaitForSomething
signals blocked
Manager wait returns pid: 26017 sig 0 core 0 code 0
Unknown child termination, status 0
WaitForSomething
signals blocked

==> Here xdm doesn't quit since display ':3' is up.  So waitpid() 
    returns the server ':1' exit but since this display was removed 
    from the list an 'Unknown child termination' error occurs.

* Proposed change *

The proposed patch substitutes the RestartDisplay() RemoveDisplay() 
sequence by StopDisplay():

StopDisplay (struct display *d)
{
    if (d->serverPid != -1)
	d->status = zombie; /* be careful about race conditions */
    if (d->pid != -1)
	TerminateProcess (d->pid, SIGTERM);
    if (d->serverPid != -1)
	TerminateProcess (d->serverPid, d->termSignal);
    else
	RemoveDisplay (d);
}

where d->pid == -1 at the 'case RESERVER_DISPLAY:'.  This works fine 
since either

1. WaitForChild() waitpid() already returned the server exit status.  
   Then serverPid == -1  =>  StopDisplay() calls RemoveDisplay().

2. The server is in zombie state or still running.  Then serverPid > 1 
   so the code above sets d->status = zombie and kills the server.  
   The next WaitForChild() waitpid() returns this zombie server pid and 
   the 'case zombie:' code then calls RemoveDisplay().

Example when zapping the server twice thru ctrl-alt-backspace:

** Xservers:
** :1 local /home/vdb/soft/xmod-1.9.4/bin/X :1 vt8 -retro
<...>
StartServer for :1
Server Started 16363
<...>
display manager paused til SIGUSR1
pid: 16368
<...>
dispatching :1
RedrawFail('Login incorrect or forbidden by policy', 0)
xdm error (pid 16368): fatal IO error 11 (Resource temporarily unavailable)
Manager wait returns pid: 16368 sig 0 core 0 code 3
Display exited with RESERVER_DISPLAY
time 1296271608 1296271605
Server crash frequency too high: stopping display :1
xdm error (pid 16350): Server crash frequency too high: stopping display :1
WaitForSomething
signals blocked
Manager wait returns pid: 16363 sig 0 core 0 code 0
Zombie server reaped, removing display :1
Nothing left to do, exiting
xdm info (pid 16350): Exiting
unlinking process ID file /home/vdb/soft/xmod-1.9.4/var/run/xdm.pid

==> The display is left in the list after crash detect so xdm waits 
    for the server exit status.  

* Patch *

The variable 'crash' may seem superfluous but is required for the follow 
up patch.
Comment 6 vdb128 2011-02-07 13:57:56 UTC
Created attachment 43050 [details] [review]
2/2 add the resource reservAttempts.

Add the resource reservAttempts to allow for a few successive 
crash-type manager exits.  

* summary of changes *

include/dm.h 

Rename lastCrash to lastReserv since this variable holds the time of 
the last session manager exit with a RESERVER_DISPLAY code.  

Two integers are added:
1. the display state reservTries, current reserver try, which counts 
   crashy manager exits, and
2. the resource reservAttempts which is the number of allowed crashy 
   exits.

Their logic is equal to the startTries/startAttempts integer pair.

xdm/dpylist.c NewDisplay (char *name, char *class)

Initialize the two new integers to zero.

xdm/resource.c serverResources[]

Add reservAttempts to the resource list with a default of 2.

xdm/dm.c

Set reservTries = 0 for the 'case OBEYSESS_DISPLAY:' since this 
is the normal sequence: session, greeter, login, user quit, session 
exit.  

At the 'case RESERVER_DISPLAY:' do not accumulate older crashy exits.  
The if (!crash) d->reservTries = 0; guarantees counter reset after 
XDM_BROKEN_INTERVAL seconds.

Add Tries/Attempts logic.

man/xdm.man

Document reservAttempts, update the default values to agree with the 
code of resource.c.
Comment 7 vdb128 2011-02-07 13:59:55 UTC
Created attachment 43051 [details]
debug log showing crash detect when repeatedly terminating the server.

xdm -debug 1 log when repeatedly terminating the local server thru 
ctrl-alt-backspace.  The logged "time " string is just after session 
manager exit.

<...>
time 1297094029 0 try 0 of 2
<...>
time 1297094032 1297094029 try 0 of 2 crash
<...>
time 1297094035 1297094032 try 1 of 2 crash
Server crash frequency too high: stopping display :1
xdm error (pid 24936): Server crash frequency too high: stopping display :1
WaitForSomething
signals blocked
Manager wait returns pid: 24956 sig 0 core 0 code 0
Zombie server reaped, removing display :1


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.