We may have found a race condition that locks up the whole X server. It happens with different system configurations (nvidia and mtx drivers) and in various situations on Linux/i386. We cannot reproduce the error on demand. For example, it happens when no one is at the box and there is only network I/O from a remote application.

As a short summary, here is an extract of an IRC session on #xorg-devel from April 24th, ~9:00 - ~11:00 (GMT+1):

* You are now known as phnord
<phnord> Good morning, maybe someone has an eye on debian bugs # 261544, 344110 and 301309 ? we experienced same problems now with a matrox card on debian etch (xorg 7.1.1).
[...]
<MrCooper> phnord: SIGALRM is normal with the smart scheduler and doesn't indicate a problem itself; you can disable the smart scheduler with -dumbSched though FWIW
[...]
<phnord> MrCooper: how can it be normal when the x-server uses 100% cpu and is no longer responding (except the SIGALRM / sigreturn() in strace)? We changed our systems configuration to -dumbSched already, but it isn't a real solution for this problem isn't it?
<MrCooper> phnord: the CPU usage isn't necessarily related to SIGALRM, though it sounds like it is related to the smart scheduler in your case?
<MrCooper> phnord: i.e. the problem doesn't occur with -dumbSched?
<phnord> MrCooper: with -dumbSched it cannot happen, cause a SIGALRM causes the x-server to die
<phnord> MrCooper: we found an ominous passage in os/utils.c, have a look at http://paste.debian.net/26363
<phnord> MrCooper: We think it is a race condition in the smart scheduler of the x-server.
<MrCooper> I wouldn't be surprised :) unfortunately I'm not familiar enough with the code to review it
<phnord> Hehe, why you wouldn't be surprised? Any suggestions to who I can talk to?
<MrCooper> phnord: AFAICT the code is rather complex and fragile
<MrCooper> phnord: I think keithp should know more about it
<phnord> MrCooper: thanks :)
<MrCooper> np
<daniels> phnord: well, there are two options: one is that there's a race condition, the other is that something is stuck in an infinite loop, so you're never actually hitting the scheduler at all
<phnord> daniels: At which other passage the x-server can stuck with this strace-result?
<phnord> daniels: we only found that in os/utils.c
<daniels> phnord: ah, well if it was spending its time stuck there, then that is a scheduling issue indeed
<phnord> daniels: yes ;)
<daniels> phnord: but it can happen due to various bugs (software and hardware), and the telltale sign is the input code complaining that events are being queued (sigio), but never being processed
<phnord> daniels: this also happens if the systems just sit there and no one is even close to the box ... just network io from the only app running from remote
<phnord> daniels: we experienced this problem on different system configurations and xorg drivers (matrox mtx, nvidia). have a look at debian bug reports 261544, 344110 and 301309

As you may have read above, the only passage in the X.Org code that we found matching this behaviour is in os/utils.c:

    [...]
    SmartScheduleTimer (int sig)
    {
        int olderrno = errno;

        SmartScheduleTime += SmartScheduleInterval;
        if (SmartScheduleIdle)
        {
            SmartScheduleStopTimer ();
        }
        errno = olderrno;
    }
    [...]

Under some condition we have not yet identified, a SIGALRM is sent while SmartScheduleIdle is FALSE, so the timer is not stopped. As a result setitimer is not executed, as the straces show.

strace of the X server in a normal situation:

    --- SIGALRM (Alarm clock) ---
    setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
    sigreturn() = ? (mask now [])

strace with the error:

    --- SIGALRM (Alarm clock) ---
    sigreturn()
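The two strace patterns above can be reproduced outside the X server. What follows is only a minimal standalone sketch in C, not the X server code: the names smart_schedule_idle, stop_timer, on_alarm and run_timer_demo, the tick counter and the 5 ms interval are all made up for illustration. The handler has the same shape as the quoted SmartScheduleTimer: it saves errno, and it disarms ITIMER_REAL only when the idle flag is set. When the flag is clear, the handler returns without calling setitimer, which is the "SIGALRM, sigreturn()" pattern.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical stand-ins for the globals named in the os/utils.c excerpt. */
static volatile sig_atomic_t smart_schedule_idle;
static volatile sig_atomic_t timer_running;
static volatile sig_atomic_t ticks;

/* Disarm ITIMER_REAL: the all-zero setitimer call seen in the normal strace. */
static void stop_timer(void)
{
    struct itimerval off;

    off.it_interval.tv_sec = 0;
    off.it_interval.tv_usec = 0;
    off.it_value = off.it_interval;
    (void)setitimer(ITIMER_REAL, &off, NULL);
    timer_running = 0;
}

/* Same shape as the quoted handler: stop the timer only when idle. */
static void on_alarm(int sig)
{
    int olderrno = errno;

    (void)sig;
    ticks++;
    if (smart_schedule_idle)
        stop_timer();
    errno = olderrno;
}

/* Run the timer until the handler stops it or max_ticks signals arrived.
 * Returns the number of SIGALRMs handled, or -1 on setup failure. */
int run_timer_demo(int idle, int max_ticks)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    struct itimerval tv;
    int result;

    smart_schedule_idle = idle;
    ticks = 0;
    timer_running = 1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    /* Keep SIGALRM blocked except inside sigsuspend, so no tick can slip
     * in between checking the loop condition and going to sleep. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);

    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = 5000;   /* arbitrary 5 ms period */
    tv.it_value = tv.it_interval;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0)
        timer_running = 0;

    while (timer_running && ticks < max_ticks)
        sigsuspend(&wait_mask);

    result = ticks;
    stop_timer();
    /* Unblock while our handler is still installed, then restore the old one. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    return result;
}
```

Under these assumptions, run_timer_demo(1, 3) should return 1, because the first SIGALRM finds the idle flag set and disarms the timer. run_timer_demo(0, 3) should return 3, because the handler never touches the timer and the caller has to stop it. In this sketch, then, a handler that returns without setitimer only means the idle flag was clear when the signal arrived.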
We now have the opportunity to get a gdb backtrace of a crashed X server. We know that SmartScheduleTimer is a problem, but it is not the cause. The -dumbSched option did not help us. We will post the backtrace here once we have it.

Here is some additional hardware information about such a system:

CPU: Intel P4 3.0GHz
RAM: 1024MB
GPU: Matrox Parhelia 256DL
  - 3 graphics adapters in one system
  - each connected to a display via DualLink
Monitor: Apple 30" Cinema Display
Resolution: 2560x1600
After a lot of work we found the "bad guy": it is a bug in the proprietary mtx_drv driver, which is developed by Matrox. We will try to pass it on to Matrox so they can fix it. I propose closing this bug, as no one here is responsible for mtx_drv.
NOTOURBUG per previous comment.