Bug 23307 - X crashes on a daily basis with little apparent cause
Status: NEW
Product: xorg
Classification: Unclassified
Component: Server/DDX/Xorg
Version: 7.4 (2008.09)
Hardware: x86 (IA32) Linux (All)
Importance: medium major
Assigned To: Xorg Project Team
Reported: 2009-08-14 07:42 UTC by wanderer
Modified: 2009-09-21 06:24 UTC (History)

GDB log of a possibly-typical crash (7.11 KB, text/plain)
2009-08-14 07:42 UTC, wanderer

Description wanderer 2009-08-14 07:42:08 UTC
Created attachment 28633: GDB log of a possibly-typical crash

For some time now, X has been crashing on my primary machine on a fairly regular basis: sometimes multiple times a day, never more than a few days apart. I am not actually certain what part of the chain is triggering the crash (X, the window manager, the graphics drivers, or something else), but X is the only part I've been able to get any kind of debugging information from.

I'm running X installed via Debian, tracking latest unstable as of a couple of days ago. Further details (e.g. versions of other software) available on request.

Note that I *am* using the binary NVIDIA drivers, but I don't think they're the culprit here; I was running one version of those drivers for months on end without any crash, and since the crashes started I have reverted to that same version, and to earlier ones, without alleviating the crashes. I have also tried updating to the latest drivers (several times, as newer ones were released), with no improvement.

Attached is what gdb has to say about the only crash I've gotten useful information on so far, per the ServerDebugging page on the X wiki. It may perhaps be noteworthy that the X session which eventually produced this crash log also ran for longer than any other since the crashes began happening - several days of fairly regular use, instead of a few hours of the same.

If there is anything I can do to help narrow down the cause of this crash - up to and including recompiling X myself if necessary (though I'd really rather not do that if it can be avoided) - please let me know. Even if the cause does not lie in X code, anything which could point me in the right direction as to where to report this would be appreciated.
Comment 1 Alan Coopersmith 2009-09-07 00:21:40 UTC
If that stack trace (the SIGIO handler) is where it typically crashes,
then you may be able to stop the crashes by setting the "SilkenMouse"
option to off in xorg.conf, disabling the SIGIO function.
Comment 2 wanderer 2009-09-07 13:24:38 UTC
That does seem to be the usual place where it crashes, yes. I'm not sure exactly where to set that option (the only documentation I've yet found for it seems to be specific to an ATI driver, and various xorg.conf snippets found via Google have it variously under the Display section and under the InputDevice section, usually commented out at that), but I'll give it a try and see if anything seems to change. The little documentation I have found seems to indicate that disabling that option might significantly worsen interface responsiveness, but it shouldn't hurt to try it.

I've been capturing crash logs on a routine basis ever since posting this report (initially GDB logs of X itself, but more recently I started trying to get traces on enlightenment instead because the crash message sometimes claims that enlightenment segfaulted and I wanted to at least try to rule that out). If more logs might be useful, I have something in excess of a dozen, in one variation or another. If I should be doing something specific to get better debugging output, just let me know and I'll be glad to give it a try.
Comment 3 Peter Hutterer 2009-09-07 19:03:56 UTC
Am I misreading this, or is this a problem with the memmove in FlushClient? If so, it seems to be triggered by the GetImage request.
Comment 4 wanderer 2009-09-08 07:02:47 UTC
Well, the memmove() is present in the backtraces of pretty much all of the crash logs I have, but the FlushClient is present in only about half a dozen of them. Perhaps interestingly, there are also about half a dozen (but not the same ones) where the FooGetImage calls are *not* present in the backtrace.

It seems not impossible that I might actually have multiple crashes here (unless the bug is in the memmove() itself, but it seems to me that that would cause problems in other programs as well). I could look for apparent patterns in the logs and try to either describe them or post representative logs; would either be helpful?
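
One way to look for such patterns is to count how many of the saved logs mention each function. The following is only a rough sketch: the helper name `count_logs_with` and the log location in the usage note are hypothetical, and it assumes each crash is stored as a separate plain-text file.

```shell
#!/bin/sh
# count_logs_with PATTERN FILE...
# Print how many of the given log files contain PATTERN at least once.
# (Hypothetical helper; assumes one crash log per file.)
count_logs_with() {
    pattern=$1
    shift
    # grep -l lists each matching file once; wc -l counts those names.
    grep -l -e "$pattern" -- "$@" 2>/dev/null | wc -l | tr -d ' '
}
```

For example, `count_logs_with FlushClient ~/xcrash/*.log` followed by `count_logs_with GetImage ~/xcrash/*.log` would show how often each appears, with `~/xcrash` standing in for wherever the logs are actually kept.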
Comment 5 Peter Hutterer 2009-09-08 15:23:38 UTC
That sounds more like memory corruption, then. Can you run X under valgrind and see if there's anything obvious?
You may see a lot of ioctl errors ("points to uninitialised bytes") from the drivers. The interesting bits are usually invalid writes.
Comment 6 Peter Hutterer 2009-09-08 21:23:55 UTC
Can we have a server version from you please? 1.6 or 1.7?
Comment 7 wanderer 2009-09-09 06:50:13 UTC
Xorg.0.log says version 1.6.3, but there may be some inconsistencies.

I'm tracking Debian unstable (installing updates every week or three), and there is apparently zero consistency in xserver-xorg* package version numbers. Several of the main packages do say they're version 1.6.3, but a couple of them (xserver-xorg itself, which is mainly a metapackage, and xserver-xorg-video-all which is purely a metapackage) claim to be version 1.7.4; the input-driver packages are versions 2.2.5 (evdev), 3.2 (kbd), and 1.4.0 (mouse); and the display driver packages - none of which I'm actually using - are all over the map, from 0.2.x (openchrome display driver) to 6.12.2 (two ATI drivers).

I'll run X under valgrind and get you a report as soon as I have the time to work out exactly how (and as soon as I have another crash - annoyingly, it hasn't crashed in nearly two days now, though that probably just means it's due for it in fairly short order).
Comment 8 Peter Hutterer 2009-09-09 15:08:56 UTC
(In reply to comment #7)
> Xorg.0.log says version 1.6.3, but there may be some inconsistencies.

That's a valid version number and the important one for this bug report. Each module has its own version number, with the server's currently being 1.6.3. X11R7.4 and similar versions are super-version numbers that specify a set of modules, similar to Debian 3.0 being a single version number for a collection of packages with different versions.
Comment 9 wanderer 2009-09-13 08:38:34 UTC
Okay. X crashed again on the 11th, for the first time since I restarted with Option "SilkenMouse" "off", and for some reason I didn't have an attached gdb so I don't have a backtrace of that one. I do have the Xorg.0.log, which contains the usual partial backtrace, and I have another log - this one with full GDB log - whose partial backtrace seems to be the same. I can provide any of those if they would be useful.

I'm now (somewhat belatedly) looking into exactly how to run X under Valgrind, and I think I know how to actually get it to happen without changing the rest of my environment, but I'm not sure which valgrind options would be the most appropriate. Should I just run with no special arguments to valgrind, and if not, which ones should I use? (My own experience with valgrind has mostly been in attempting to track down leaks, and has been on programs I could run and re-run easily without major hassle, which X certainly is not.)
Comment 10 wanderer 2009-09-20 07:16:18 UTC
I apologize for the delay.

X has crashed more than once since my last report (with SIGIO, even though I now have Option "SilkenMouse" "off" in my Device section and that does seem to have reduced crash frequency), and I have gdb logs to match, but I have been unable to get X to run under Valgrind.

I launch X via startx, so I cannot simply add valgrind at the beginning of my launch command line and get a functional result. As far as I can tell, the command line used by startx to launch the actual X server comes from the xserverrc; I have therefore added the valgrind invocation - with no specific arguments - to ~/.xserverrc. However, when I then run startx, I get

==2948== Warning: Can't execute setuid/setgid executable: /usr/bin/X11/X
==2948== Possible workaround: remove --trace-children=yes, if in effect
valgrind: /usr/bin/X11/X: Permission denied

and a hang until I Ctrl-C. 

I've done a little looking around, but I haven't found a solution to this to allow me to run X under Valgrind - short perhaps of running them as root, but aside from being possibly a bad idea in the first place that wouldn't work very well because root's configuration is so very different from what I need to work with, and it can take so long to get a crash that I effectively wouldn't be able to use my computer at all in the meanwhile.

Any suggestions as to how to get it to work, or any other suggestions in general?
Comment 11 Peter Hutterer 2009-09-20 23:42:52 UTC
Do you have two machines or just one? If you have two, just ssh in, run sudo /usr/bin/Xorg, and from a second ssh session simply start whatever desktop environment you usually use (e.g. "DISPLAY=:0 gnome-session").

chmod -s /usr/bin/Xorg removes the setuid bit so you can run it under valgrind.
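
A valgrind invocation along these lines might look like the sketch below. It only prints the command rather than running it. The server path, display number, and log path are assumptions to adjust, and the option choices reflect the earlier advice that invalid writes matter more than leaks here.

```shell
#!/bin/sh
# Sketch: print (not run) a valgrind command line for the X server.
# Assumptions: server path, display number, and log path are placeholders.
SERVER="${SERVER:-/usr/bin/Xorg}"
DISPLAY_NUM="${DISPLAY_NUM:-:0}"
LOG="${LOG:-/tmp/xorg-valgrind.%p.log}"

# --leak-check=no       leaks are not the target; invalid writes are
# --track-origins=yes   report where uninitialised values came from
# --num-callers=30      record deeper stacks than the default
# --log-file=...        %p is expanded by valgrind to the process ID
xorg_valgrind_cmd() {
    echo "valgrind --tool=memcheck --leak-check=no --track-origins=yes" \
         "--num-callers=30 --log-file=$LOG $SERVER $DISPLAY_NUM"
}

xorg_valgrind_cmd
```

The printed command would presumably need to be run as root from the ssh session, since a server binary without setuid cannot otherwise obtain the hardware access it needs.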

Comment 12 wanderer 2009-09-21 06:24:42 UTC
I have two machines I can use for this purpose, at least - otherwise I probably wouldn't have been able to get the GDB logs.

I'm hesitant to invoke Xorg directly like that, since doing so omits all of the setup steps which startx carries out, and - now that I've just spent a while reading startx - Debian's version of that script does seem to do important things which there wouldn't be much point in doing by hand.

Not to mention, /usr/bin/Xorg doesn't have suid/sgid; it's /usr/bin/X and /usr/bin/X11/X which have that, and those are identical 7K binaries which look like wrappers to me, but one of them is what is actually being invoked by startx via the xserverrc - and I'm hesitant to either cut it out of the loop (by invoking /usr/bin/Xorg directly) when I don't know what it does or remove suid/sgid from it (which, yes, I did already know how to do) when I don't know why it needs them. Could you perhaps shed light on either point?
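
To confirm which of these binaries actually carry the setuid or setgid bit, a small check such as the following could be used. This is a minimal sketch: `report_setid` is a hypothetical helper name, and the three paths are the ones mentioned in this comment.

```shell
#!/bin/sh
# report_setid FILE...
# Print, for each file, whether it has the setuid or setgid bit set.
# (Hypothetical helper; uses the POSIX test operators -u and -g.)
report_setid() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "$f: not found"
        elif [ -u "$f" ] || [ -g "$f" ]; then
            echo "$f: setuid/setgid"
        else
            echo "$f: plain"
        fi
    done
}

report_setid /usr/bin/X /usr/bin/X11/X /usr/bin/Xorg
```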

(Also, I consider sudo a bad idea, and so don't have it installed. If it would allow running a command so that it gets root permissions but in all other respects has my normal user's environment, instead of running the command entirely as root the way 'su -c' does, then this might be a legitimate use case for it - but AFAICT from reading the man page it doesn't seem to do that.)