Bug 964

Summary: Xorg Xprt starts to consume 100% when being idle
Product: xprint Reporter: Roland Mainz <roland.mainz>
Component: Server: OtherAssignee: Roland Mainz <roland.mainz>
Status: RESOLVED WONTFIX QA Contact:
Severity: blocker    
Priority: high CC: alan.coopersmith, dberkholz, eich, jay.hobson, kem, matthieu.herrb, MostAwesomedude, roland.mainz, stuart.kreitman
Version: unspecified   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:
Bug Depends on: 14771    
Bug Blocks:    
Attachments:
Description Flags
truss -u :: output on Solaris 8
none
BZip2-compressed "truss -d -u a.out -o xprt_amok_003.log ./Xprt ..." output (part1)
none
BZip2-compressed "truss -d -u a.out -o xprt_amok_003.log ./Xprt ..." output (part2)
none
Possible workaound patch for 2004-08-11-trunk
none
Workaround patch for 2004-08-11-trunk which restores X11R6.6 behaviour roland.mainz: review? (kem)

Description Roland Mainz 2004-08-01 12:32:07 UTC
Xorg Xprt starts to consume 100% CPU exactly after 2mins when the last client
has exit.

Steps to reproduce:
1. Pull XOrg trunk
2. % make install DESTDIR=/tmp/xptestinstall001 2>&1 | tee -a ../install.log
3. % (LD_LIBRARY_PATH=$PWD/../../exports/lib/ ; export
XPCONFIGDIR=/tmp/xptestinstall001/usr/X11R6/lib/X11/xserver ; ./Xprt -ac -audit
4 -core :40)
5. Run xdpyinfo
% xdpyinfo -display localhost:40

Result:
After exactly 2mins Xprt starts to consume 100% CPU.

Expected result:
Xprt should be idle when there is no work for it

This regression is very recent, three weeks ago this issue did not happen.

This reminds me of bug 567 ("Xorg Xprt starts to consume 100% CPU when being
idle for some time") which was caused by screensaver code ... but this problem
is new.
Comment 1 Roland Mainz 2004-08-01 12:34:03 UTC
Is it possible that the recent XEVI, DAMAGE, XFIXES etc. checkins may have
caused this (e.g. did the code add any timers or triggers signals etc.) ?
Comment 2 Roland Mainz 2004-08-01 12:41:03 UTC
Created attachment 553 [details]
truss -u :: output on Solaris 8

I've attached Solaris's "truss" after the last client exists ("xdpyinfo" in the
example above) but before the 100% CPU usage started.

truss shows a tight loop of the following instructions:
-- snip --
-> libc:select(0xff, 0x7ffc14, 0x0, 0x0)
  -> libc:.mul(0x0, 0x10624dd3, 0x100, 0x0)
  <- libc:.mul() = 0
  -> libc:.mul(0x0, 0x10624dd3, 0x0, 0x0)
  <- libc:.mul() = 0
  -> libc:_poll(0xffbedda8, 0x2, 0x0, 0x0)
poll(0xFFBEDDA8, 2, 0)				= 0
  <- libc:_poll() = 0
<- libc:select() = 0
-> libc_psr:memset(0xffbee7b8, 0x0, 0x80, 0x0)
<- libc_psr:memset() = 0xffbee7b8
-> libc:gettimeofday(0xffbee5b4, 0x0, 0x0, 0x0)
<- libc:gettimeofday() = 0
-- snip --

And I have no clue what is going on... ;-(
Comment 3 Roland Mainz 2004-08-10 10:07:09 UTC
Created attachment 592 [details]
BZip2-compressed "truss -d -u a.out -o xprt_amok_003.log ./Xprt ..." output (part1)

Part 1 of Solaris 8 "truss" output of
-- snip --
% (export MESA_DEBUG=1 LD_LIBRARY_PATH=$PWD/../../exports/lib/ ; export
XPCONFIGDIR=/tmp/xptestinstall007/usr/X11R6/lib/X11/xserver ; truss -d -u a.out
-o xprt_amok_003.log ./Xprt -ac -audit 4 -core -fp
/usr/openwin/lib/X11/fonts/F3/,/usr/openwin/lib/X11/fonts/F3bitmaps/,/usr/openwin/lib/X11/fonts/Type1/,/usr/openwin/lib/X11/fonts/Speedo/,/usr/openwin/lib/X11/fonts/misc/,/usr/openwin/lib/X11/fonts/75dpi/,/usr/openwin/lib/X11/fonts/100dpi/
:40)
-- snip --

This log includes _ALL_ function calls within the Xserver, including startup,
xdpyinfo connect/disconnect and the endless loop (until CTRL-C hit "truss")

Note:
Timestamp 372.0989 is the end of the xdpyinfo connection, after that Xprt waits
exactly 2mins and then goes amok... ;-(
Comment 4 Roland Mainz 2004-08-10 10:10:22 UTC
Created attachment 593 [details]
BZip2-compressed "truss -d -u a.out -o xprt_amok_003.log ./Xprt ..." output (part2)

Part 2 of Solaris 8 "truss" output of
-- snip --
% (export MESA_DEBUG=1 LD_LIBRARY_PATH=$PWD/../../exports/lib/ ; export
XPCONFIGDIR=/tmp/xptestinstall007/usr/X11R6/lib/X11/xserver ; truss -d -u a.out
-o xprt_amok_003.log ./Xprt -ac -audit 4 -core -fp
/usr/openwin/lib/X11/fonts/F3/,/usr/openwin/lib/X11/fonts/F3bitmaps/,/usr/openwin/lib/X11/fonts/Type1/,/usr/openwin/lib/X11/fonts/Speedo/,/usr/openwin/lib/X11/fonts/misc/,/usr/openwin/lib/X11/fonts/75dpi/,/usr/openwin/lib/X11/fonts/100dpi/
:40)
-- snip --

This log includes _ALL_ function calls within the Xserver, including startup,
xdpyinfo connect/disconnect and the endless loop (until CTRL-C hit "truss")

Note:
Timestamp 372.0989 is the end of the xdpyinfo connection, after that Xprt waits
exactly 2mins and then goes amok... ;-(
Comment 5 Roland Mainz 2004-08-10 13:47:05 UTC
Some more debugging output:
-- snip --
% (export MESA_DEBUG=1 LD_LIBRARY_PATH=$PWD/../../exports/lib/ ; export
XPCONFIGDIR=/tmp/xptestinstall007/usr/X11R6/lib/X11/xserver ; dbx -r ./Xprt -ac
-audit 4 -core -fp
/usr/openwin/lib/X11/fonts/F3/,/usr/openwin/lib/X11/fonts/F3bitmaps/,/usr/openwin/lib/X11/fonts/Type1/,/usr/openwin/lib/X11/fonts/Speedo/,/usr/openwin/lib/X11/fonts/misc/,/usr/openwin/lib/X11/fonts/75dpi/,/usr/openwin/lib/X11/fonts/100dpi/
:40)
Running: Xprt -ac -audit 4 -core -fp
/usr/openwin/lib/X11/fonts/F3/,/usr/openwin/lib/X11/fonts/F3bitmaps/,/usr/openwin/lib/X11/fonts/Type1/,/usr/openwin/lib/X11/fonts/Speedo/,/usr/openwin/lib/X11/fonts/misc/,/usr/openwin/lib/X11/fonts/75dpi/,/usr/openwin/lib/X11/fonts/100dpi/
:40 
(process id 9489)
Couldn't open RGB_DB '/usr/X11R6/lib/X11/rgb'
_XSERVTransmkdir: ERROR: Mode of /tmp/.X11-unix must be set to 1777
_XSERVTransSocketUNIXCreateListener: mkdir(/tmp/.X11-unix) failed, errno = 1
_XSERVTransMakeAllCOTSServerListeners: failed to create listener for unix
_XSERVTransmkdir: ERROR: Mode of /tmp/.X11-pipe must be set to 1777
_XSERVTransNAMEDOpenServer: mkdir(/tmp/.X11-pipe) failed, errno = 1
_XSERVTransOpen: transport open failed for named/north:40
_XSERVTransMakeAllCOTSServerListeners: failed to open listener for named
Usage: lpget [-k key] [list|(printer) ...]
error opening security policy file /usr/X11R6/lib/X11/xserver/SecurityPolicy
Could not init font path element /usr/openwin/lib/X11/fonts/F3/, removing from
list!
Could not init font path element /usr/openwin/lib/X11/fonts/Type1/, removing
from list!
Could not init font path element /usr/openwin/lib/X11/fonts/Speedo/, removing
from list!
AUDIT: Tue Aug 10 21:55:08 2004: 9489 Xprt: client 1 connected from IP 127.0.0.1
AUDIT: Tue Aug 10 21:55:08 2004: 9489 Xprt: client 1 disconnected
^Cdbx: warning: Interrupt ignored but forwarded to child.
dbx: no symbol table loaded yet -- use "prog -readsyms" to load them
signal INT (Interrupt) at 0xe392c at 0xe392c
0x00000000000e392c:     sra     %l0, 0x1f, %g1
Entering debugger ... 
(dbx) where
dbx: no symbol table loaded yet -- use "prog -readsyms" to load them
(dbx) prog -readsyms
Reading Xprt
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
Reading libc_psr.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libmp.so.2
Reading libm.so.1
(dbx) where
=>[1] WaitForSomething(pClientsReady = 0xffbee800), line 193 in "WaitFor.c"
  [2] Dispatch(), line 384 in "dispatch.c"
  [3] main(argc = 8, argv = 0xffbeecec, envp = 0xffbeed10), line 442 in "main.c"
(dbx) print timers
timers = 0x8b5a70
(dbx) print wt
wt = (nil)
(dbx) print timers->expires
timers->expires = 1246135329U
(dbx) print now
now = 1246156936U
(dbx) print *timers
*timers = {
    next     = (nil)
    expires  = 1246135329U
    callback = 0xf22f8 = &`Xprt`log.c`AuditFlush(OsTimerPtr timer, CARD32 now,
pointer arg)
    arg      = (nil)
}
(dbx) cont
^Cdbx: warning: Interrupt ignored but forwarded to child.
signal INT (Interrupt) in _select at 0xff2cd334
0xff2cd334: _select+0x0148:     orcc    %o0, %o1, %o3
Current function is WaitForSomething
  223               i = Select (MaxClients, &LastSelectMask, NULL, NULL, wt);
(dbx) print *timers
*timers = {
    next     = (nil)
    expires  = 1246135329U
    callback = 0xf22f8 = &`Xprt`log.c`AuditFlush(OsTimerPtr timer, CARD32 now,
pointer arg)
    arg      = (nil)
}
(dbx) print timers
timers = 0x8b5a70
-- snip --

Somehow I think the audit code may be responsible for this mess (the timer ptr
is always the same (and always the same callback) and the timestamp between
timer "alarm time" and "now" is always negative) ...
Comment 6 Roland Mainz 2004-08-10 14:09:44 UTC
And the mysterious 120sec/2min timeout comes from |AUDIT_TIMEOUT|
-- snip --
% tgrep AUDIT_TIMEOUT
tgrep: Can't stat file/dir ./Xprint/Quarks.c, No such file or directory.
Ignored.
./os/log.c :#ifndef AUDIT_TIMEOUT
./os/log.c :#define AUDIT_TIMEOUT ((CARD32)(120 * 1000)) /* 2 mn */
./os/log.c :    return AUDIT_TIMEOUT;
./os/log.c :    auditTimer = TimerSet(auditTimer, 0, AUDIT_TIMEOUT, AuditFlush,
NULL);
-- snip --

Running Xprt without "-audit 4" works around the problem, however then the
complaints that "... access to Xprt isn't logged ..." are back... ;-(

Is there anyone who is familar with the audit code ?
Comment 7 Roland Mainz 2004-08-10 20:18:49 UTC
Created attachment 598 [details] [review]
Possible workaound patch for 2004-08-11-trunk

The patch "wallpapers" over the issue via removing the audit timer when there
are no clients left.

But I think this bug report is just a symptom of a FAR more serious issue with
the timer code since this is at least the second problem with timers (bug 567
("Xorg Xprt starts to consume 100% CPU when being idle for some time") was the
previous one).
I really think someone with better knowledge of the timer code should
investigate that issue.
Comment 8 Roland Mainz 2004-08-10 20:20:30 UTC
Comment on attachment 598 [details] [review]
Possible workaound patch for 2004-08-11-trunk

Kevin: Can you do a quick review of the workaround patch, please ?
Comment 9 Roland Mainz 2004-08-10 20:53:44 UTC
Comment on attachment 598 [details] [review]
Possible workaound patch for 2004-08-11-trunk

Actually the "wallpaper" patch doesn't work in all cases. If any audit event
occurs _after_ the last client left the Xserver the 100%-CPU consumption will
start again.
An example are the "cookie expiration" messages from the SECURITY extension.
Comment 10 Roland Mainz 2004-08-10 21:19:13 UTC
Created attachment 599 [details] [review]
Workaround patch for 2004-08-11-trunk which restores X11R6.6 behaviour

The patch restores the old audit logging behaviour of X11R6.6 for now until a
better solution is found to cure this nightmare (the alternative is to hack the
timer code - which is something which I'd like to avoid so close to the code
freeze).
Comment 11 Kevin E. Martin 2004-08-10 21:50:06 UTC
(In reply to comment #10)
> Created an attachment (id=599)
> Workaround patch for 2004-08-11-trunk which restores X11R6.6 behaviour
> 
> The patch restores the old audit logging behaviour of X11R6.6 for now until a
> better solution is found to cure this nightmare (the alternative is to hack the
> timer code - which is something which I'd like to avoid so close to the code
> freeze).

This patch needs to be reviewed by the person(s) whose code you are reverting in
order to understand why the change was made originally and see if reverting the
code will have any other unintended consequences.
Comment 12 Roland Mainz 2004-08-10 22:10:22 UTC
Kevin E. Martin wrote:
> This patch needs to be reviewed by the person(s) whose code you are reverting

Any idea who wrote the original code ? CVSblame on the Xorg repository isn't
much helpfull.
 
> in order to understand why the change was made originally and see if reverting 
> the code will have any other unintended consequences.

The only consequence is that identical log messages are no longer be fold
together (one message with the original text and one message that n-1 copies
have been supressed). The only side-effect is that log files _may_ be smaller
since no duplicates are stored anymore.
Comment 13 Kevin E. Martin 2004-08-11 15:38:52 UTC
Okay, I checked the XFree86 logs and found that the change was made by Matthieu
Herrb.  Adding him to the CC list and would like to see if he could comment on
this one.
Comment 14 Matthieu Herrb 2004-08-12 02:16:21 UTC
This patch was added to XFree86 to avoid filling the partition holding /var/log
to quickly with audit messages in XFree86.0.log when someone tries to brute-force
the magic cookie. 
I'm currently away from my hacking machines, so I can't look at this issue until I
come back (around aug, 18). 
Comment 15 Kevin E. Martin 2004-08-13 12:53:23 UTC
On the release wranglers call, we decided to go ahead and apply the patch, which
I have done.

We will leave this bug open in the hope that others will be able to help figure
out the root cause and suggest alternate solutions before the release.
Comment 16 Kevin E. Martin 2004-08-16 11:47:40 UTC
Moving to "held open" bug since this is no longer a blocker
Comment 17 Adam Jackson 2005-11-19 13:57:28 UTC
this must not be particularly urgent for anyone, since no one has looked at it
since 6.8.2.  moving out to 7.1.
Comment 18 Adam Jackson 2006-04-25 05:34:06 UTC
(In reply to comment #17)
> this must not be particularly urgent for anyone, since no one has looked at it
> since 6.8.2.  moving out to 7.1.

... and again to 7.2.

For 7.2 I'm probably going to flip this back to coalescing the log file entries,
and if that tickles a bug in Xprt, so be it, Xprt needs fixing properly.
Comment 19 Daniel Stone 2006-11-04 10:02:00 UTC
shunting off the radar, as xprint is now unmaintained.
Comment 20 Corbin Simpson 2011-09-13 12:39:01 UTC
Closing WONTFIX because nobody cares about Xprint. Reopen if you plan to address this bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.