Created attachment 23660 [details] [review] dm.c crash detect: use StopDisplay(), add resource reservAttempts The crash detect code in dm.c WaitForChild() RESERVER_DISPLAY checks if a previous manager exit occured in the last 120 seconds. If so RemoveDisplay() is used to remove the display from the list. The proposed patch uses StopDisplay() and thus slightly changes the behaviour for local servers, from a direct removal to a synchronous wait for child exit. For a local X server either: 1. The server exit was already returned by waitpid(). So serverPid==-1 => StopDisplay() calls RemoveDisplay(). 2. The server is a zombie or still running. So serverPid>1 => StopDisplay() sets status=zombie and kills the server. The next waitpid() returns this zombie server pid and the code then calls RemoveDisplay(). The patch further adds support for allowing successive RESERVER_DISPLAY manager exits. The resource reservAttempts is the number of times a fatal IO error is allowed in session.c. This extension is useful for my not so reliable multi-server-seat setup. If two users quit X simultaneously the video card bioses interfere, and the only way out is to terminate X once or twice through ctrl-alt-backspace. The xdm errorlog sometimes shows Server for display :2 terminated unexpectedly and always contains fatal IO error 11 (Resource temporarily unavailable) depending on the first return value of waitpid(), the server or manager pid. Thank you for taking a look.
Created attachment 23661 [details] debug log showing crash detect when repeatedly terminating local X server
Created attachment 23688 [details] [review] dm.c crash detect: use StopDisplay(), add resource reservAttempts Reset reservTries after each session, and also after an IO error if the exit is not a crash.
Created attachment 38146 [details] [review] dm.c crash detect: use StopDisplay(), add resource reservAttempts Applies to xdm 1.1.10.
The text below is a copy of the description posted on xorg-devel. Once upon a time xdm was created. This program manages X terminals by forking 1 or 2 processes per display: 1. a local server 'X :0' if required, and 2. the session manager greeter '-:0'. The dm.c WaitForChild() loop interpretes the session exit code and uses a ++startTries>=startAttempts test to detect an unresponsive server, OPENFAILED_DISPLAY, or a SIGTERM'ed session. Other exit codes reset startTries = 0. All worked well until the ps2 mouse appeared. Such input device is initialized last and after the X listening socket. Thus the session manager finds the X server, starts the greeter, and waits for its return. Alas, if the mouse now fails initialization then X quits and triggers session.c IOErrorHandler() { exit(RESERVER_DISPLAY); }. The dm.c loop would then call RestartDisplay(d, TRUE) causing a persistent restart-quit cycle which inhibits user login. Somewhere before 2007 this was solved by the crash detect code in dm.c at the case RESERVER_DISPLAY. A manager exit is considered a crash if a previous exit occured within the XDM_BROKEN_INTERVAL of 120 second. RemoveDisplay() is then called to remove the display from the list. This isn't a perfect solution since crash detection may also be triggered by the ctrl-alt-backspace server termination. Although a bit an unfriendly way of quitting X this is the only option available in case of a garbled display. The proposed patch adds a ++reservTries>=reservAttempts test to allow for a few successive crash-type manager exits. The resource reservAttempts is the number of times a fatal IO error is allowed in session.c. The patch also substitutes the RestartDisplay() RemoveDisplay() sequence by StopDisplay(). The original sequence works since RestartDisplay(d, TRUE) kills the serverPid. However RemoveDisplay() directly removes the display from the list which may thus give an "Unknown child termination" error. StopDisplay() on the other hand implements the synchronous kill and wait for child exit. RemoveDisplay() is only to be called if neither server nor session manager is running.
Created attachment 43049 [details] [review] 1/2 crash detect: use StopDisplay(). This patch solves a benign bug: on server crash the display is removed but the server exit code is ignored. Proposed solution: substitute RemoveDisplay() by StopDisplay() thus enabling the wait for child exit synchronism. * Introduction * A code section handling X server crashes was added to xdm somewhere before 2007. The section detects and breaks persistent start-crash-exit-restart-start-... loops by removing the display from the Xservers list thus disabling the display. Such a crash loop is not uncommon since the server is considered up and running if the watchdog display open connection succeeds. Hence, any server quit just after initialization of its listening socket is considered a crash. A typical example is a failing mouse or keyboard. Alas, a just after startup server termination via ctrl-alt-backspace is indistinguishable from a crash. Although a bit an unfriendly way of quitting X this is the only method available in case of a garbled display. After looking at the code, and the debug log, it seems that the crash loop break by direct removal of the display is not wholly correct for local displays. The log files show that the X server may exit after the session manager. So instead of the direct RemoveDisplay() the synchronous kill and wait for server exit StopDisplay() should be used. Here below is a description of the code path and the underlying logic. * The xdm logic * Each local display from Xservers starts two child processes, a server and a session manager: xdm/dm.c StartDisplay () calls 1. StartServer() which forks the local server "X :0 vt7 -auth ...", 2.1 fork() to create the second child, 2.2 WaitForServer() to open the watchdog connection in d->dpy, 2.3 ManageSession() which becomes the session manager "-:0", and then sets 'd->status = running'. The manager will then finally call the greeter library to draw the login password user interface. Before doing so session.c ManageSession() replaces the Xlib IOErrorhandler by xdm/session.c IOErrorHandler (Display *dpy) { LogError ("fatal IO error %d (%s)\n", errno, _SysErrorMsg(errno)); exit(RESERVER_DISPLAY); } which detects an X server termination. Indeed, when X quits the watchdog connection is closed thus triggering an IO error. After starting all displays the parent process waits for any child termination. The dm.c WaitForChild() loop interpretes the exit code. xdm/dm.c WaitForChild (void) <...> while ((pid = waitpid (-1, &status, WNOHANG)) > 0) #endif { Debug ("Manager wait returns pid: %d sig %d core %d code %d\n", pid, waitSig(status), waitCore(status), waitCode(status)); if (autoRescan) RescanIfMod (); /* SUPPRESS 560 */ if ((d = FindDisplayByPid (pid))) { d->pid = -1; switch (waitVal (status)) { <...> Here a manager exited. The case below applies for an IO error. case RESERVER_DISPLAY: d->startTries = 0; Debug ("Display exited with RESERVER_DISPLAY\n"); if (d->displayType.origin == FromXDMCP || d->status == zombie) StopDisplay(d); else RestartDisplay (d, TRUE); { Time_t Time; time(&Time); Debug("time %i %i\n",Time,d->lastCrash); if (d->lastCrash && ((Time - d->lastCrash) < XDM_BROKEN_INTERVAL)) { Debug("Server crash frequency too high:" " removing display %s\n",d->name); LogError("Server crash rate too high:" " removing display %s\n",d->name); #if !defined(HAVE_ARC4RANDOM) && !defined(DEV_RANDOM) AddTimerEntropy(); #endif RemoveDisplay (d); } else d->lastCrash = Time; } break; On crash the above calls: RestartDisplay(d, TRUE); RemoveDisplay(d) <...> } /* SUPPRESS 560 */ else if ((d = FindDisplayByServerPid (pid))) { d->serverPid = -1; switch (d->status) { case zombie: Debug ("Zombie server reaped, removing display %s\n", d->name); RemoveDisplay (d); break; <...> Here an X server exited. The case below applies for plain server termination. case running: Debug ("Server for display %s terminated unexpectedly, status %d %d\n", d->name, waitVal (status), status); LogError ("Server for display %s terminated unexpectedly: %d\n", d->name, waitVal (status)); if (d->pid != -1) { Debug ("Terminating session pid %d\n", d->pid); If the session manager is exiting with an IOError then this is a rekill. TerminateProcess (d->pid, SIGTERM); } break; <...> } else { Here a process exited which is not in the display list. Should never happen. Debug ("Unknown child termination, status %d\n", waitVal (status)); } } This final StartDisplays() calls StartDisplay() for any notRunning display. StartDisplays (); * Removing a crashed display * The local display removal thru RestartDisplay(, TRUE) and RemoveDisplay() works since RestartDisplay() (re)kills the running server: xdm/dm.c RestartDisplay (struct display *d, int forceReserver) { if (d->serverPid != -1 && (forceReserver || d->terminateServer)) { TerminateProcess (d->serverPid, d->termSignal); d->status = phoenix; } else { d->status = notRunning; } } Hence, serverPid == -1 if the server exit status was already returned by a previous waitpid(). In the other case the server process is still present, either in terminated zombie state or actively terminating due to the termSignal. But RemoveDisplay() directly removes the display from the list in either case. Thus the server exit is ignored. Examples when zapping the server twice thru ctrl-alt-backspace: ** Xservers: ** :1 local /home/vdb/soft/xmod-1.9.4/bin/X :1 vt8 -retro <...> display manager paused til SIGUSR1 pid: 31422 <...> dispatching :1 RedrawFail('Login incorrect or forbidden by policy', 0) xdm error (pid 31422): fatal IO error 11 (Resource temporarily unavailable) Manager wait returns pid: 31422 sig 0 core 0 code 3 Display exited with RESERVER_DISPLAY time 1295979931 1295979928 Server crash frequency too high: removing display :1 xdm error (pid 31402): Server crash rate too high: removing display :1 Nothing left to do, exiting xdm info (pid 31402): Exiting unlinking process ID file /home/vdb/soft/xmod-1.9.4/var/run/xdm.pid ==> Here xdm quits without waiting for the server. ** Xservers: ** :1 local /home/vdb/soft/xmod-1.9.4/bin/X :1 vt8 -retro ** :3 local /home/vdb/soft/xmod-1.9.4/bin/X :3 vt9 -retro <...> StartServer for :1 Server Started 26017 <...> display manager paused til SIGUSR1 pid: 26024 <...> dispatching :1 RedrawFail('Login incorrect or forbidden by policy', 0) xdm error (pid 26024): fatal IO error 11 (Resource temporarily unavailable) Manager wait returns pid: 26024 sig 0 core 0 code 3 Display exited with RESERVER_DISPLAY time 1295972420 1295972415 Server crash frequency too high: removing display :1 xdm error (pid 25959): Server crash rate too high: removing display :1 WaitForSomething signals blocked Manager wait returns pid: 26017 sig 0 core 0 code 0 Unknown child termination, status 0 WaitForSomething signals blocked ==> Here xdm doesn't quit since display ':3' is up. So waitpid() returns the server ':1' exit but since this display was removed from the list an 'Unknown child termination' error occurs. * Proposed change * The proposed patch substitutes the RestartDisplay() RemoveDisplay() sequence by StopDisplay(): StopDisplay (struct display *d) { if (d->serverPid != -1) d->status = zombie; /* be careful about race conditions */ if (d->pid != -1) TerminateProcess (d->pid, SIGTERM); if (d->serverPid != -1) TerminateProcess (d->serverPid, d->termSignal); else RemoveDisplay (d); } where d->pid == -1 at the 'case RESERVER_DISPLAY:'. This works fine since either 1. WaitForChild() waitpid() already returned the server exit status. Then serverPid == -1 => StopDisplay() calls RemoveDisplay(). 2. The server is in zombie state or still running. Then serverPid > 1 so the code above sets d->status = zombie and kills the server. The next WaitForChild() waitpid() returns this zombie server pid and the 'case zombie:' code then calls RemoveDisplay(). Example when zapping the server twice thru ctrl-alt-backspace: ** Xservers: ** :1 local /home/vdb/soft/xmod-1.9.4/bin/X :1 vt8 -retro <...> StartServer for :1 Server Started 16363 <...> display manager paused til SIGUSR1 pid: 16368 <...> dispatching :1 RedrawFail('Login incorrect or forbidden by policy', 0) xdm error (pid 16368): fatal IO error 11 (Resource temporarily unavailable) Manager wait returns pid: 16368 sig 0 core 0 code 3 Display exited with RESERVER_DISPLAY time 1296271608 1296271605 Server crash frequency too high: stopping display :1 xdm error (pid 16350): Server crash frequency too high: stopping display :1 WaitForSomething signals blocked Manager wait returns pid: 16363 sig 0 core 0 code 0 Zombie server reaped, removing display :1 Nothing left to do, exiting xdm info (pid 16350): Exiting unlinking process ID file /home/vdb/soft/xmod-1.9.4/var/run/xdm.pid ==> The display is left in the list after crash detect so xdm waits for the server exit status. * Patch * The variable 'crash' may seem superfluous but is required for the follow up patch.
Created attachment 43050 [details] [review] 2/2 add the resource reservAttempts. Add the resource reservAttempts to allow for a few successive crash-type manager exits. * summary of changes * include/dm.h Rename lastCrash to lastReserv since this variable holds the time of the last session manager exit with a RESERVER_DISPLAY code. Two integers are added: 1. the display state reservTries, current reserver try, which counts crashy manager exits, and 2. the resource reservAttempts which is the number of allowed crashy exits. Their logic is equal to the startTries/startAttempts integer pair. xdm/dpylist.c NewDisplay (char *name, char *class) Initialize the two new integers to zero. xdm/resource.c serverResources[] Add reservAttempts to the resource list with a default of 2. xdm/dm.c Set reservTries = 0 for the 'case OBEYSESS_DISPLAY:' since this is the normal sequence: session, greeter, login, user quit, session exit. At the 'case RESERVER_DISPLAY:' do not accumulate older crashy exits. The if (!crash) d->reservTries = 0; guarantees counter reset after XDM_BROKEN_INTERVAL seconds. Add Tries/Attempts logic. man/xdm.man Document reservAttempts, update the default values to agree with the code of resource.c.
Created attachment 43051 [details] debug log showing crash detect when repeatedly terminating the server. xdm -debug 1 log when repeatedly terminating the local server thru ctrl-alt-backspace. The logged "time " string is just after session manager exit. <...> time 1297094029 0 try 0 of 2 <...> time 1297094032 1297094029 try 0 of 2 crash <...> time 1297094035 1297094032 try 1 of 2 crash Server crash frequency too high: stopping display :1 xdm error (pid 24936): Server crash frequency too high: stopping display :1 WaitForSomething signals blocked Manager wait returns pid: 24956 sig 0 core 0 code 0 Zombie server reaped, removing display :1
Pushed to git master: http://cgit.freedesktop.org/xorg/app/xdm/commit/?id=d0870e66dd9d8a41c9b6de65c4a9243c0a3fdc6e http://cgit.freedesktop.org/xorg/app/xdm/commit/?id=3e3a45176d232c4e3f23ed6bc0b46bf660ddf271
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.