10747 – Race Condition in SmartScheduler of xorg

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Server/General
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium major
Assignee:	Xorg Project Team
QA Contact:	Xorg Project Team

Reported:	2007-04-24 13:13 UTC by Andreas Girlich
Modified:	2008-02-24 19:02 UTC
CC List:	0 users

Description Andreas Girlich 2007-04-24 13:13:34 UTC

We may have found a race condition that locks up the whole X server. This happens with different system configurations (nvidia / mtx driver) and in various situations on Linux/i386. We cannot reproduce the error reliably. For example, it happens when no one is at the box and there is only network I/O from a remote application. As a short summary, here is an extract of an IRC session on #xorg-devel from April 24th, ~9:00 - ~11:00 (GMT+1):

* You are now known as phnord
<phnord> Good morning, maybe someone has an eye on debian bugs # 261544, 344110 and 301309 ? we experienced same problems now with a matrox card on debian etch (xorg 7.1.1).
[...]
<MrCooper> phnord: SIGALRM is normal with the smart scheduler and doesn't indicate a problem itself; you can disable the smart scheduler with -dumbSched though FWIW
[...]
<phnord> MrCooper: how can it be normal when the x-server uses 100% cpu and is no longer responding (except the SIGALRM / sigreturn() in strace)? We changed our systems configuration to -dumbSched already, but it isn't a real solution for this problem, is it?
<MrCooper> phnord: the CPU usage isn't necessarily related to SIGALRM, though it sounds like it is related to the smart scheduler in your case?
<MrCooper> phnord: i.e. the problem doesn't occur with -dumbSched?
<phnord> MrCooper: with -dumbSched it cannot happen, cause a SIGALRM causes the x-server to die
<phnord> MrCooper: we found an ominous passage in os/utils.c, have a look at http://paste.debian.net/26363
<phnord> MrCooper: We think it is a race condition in the smart scheduler of the x-server. 
<MrCooper> I wouldn't be surprised :) unfortunately I'm not familiar enough with the code to review it
<phnord> Hehe, why you wouldn't be surprised? Any suggestions to who I can talk to?
<MrCooper> phnord: AFAICT the code is rather complex and fragile
<MrCooper> phnord: I think keithp should know more about it
<phnord> MrCooper: thanks :)
<MrCooper> np
<daniels> phnord: well, there are two options: one is that there's a race condition, the other is that something is stuck in an infinite loop, so you're never actually hitting the scheduler at all
<phnord> daniels: Where else could the x-server get stuck with this strace result?
<phnord> daniels: we only found that in os/utils.c
<daniels> phnord: ah, well if it was spending its time stuck there, then that is a scheduling issue indeed
<phnord> daniels: yes ;)
<daniels> phnord: but it can happen due to various bugs (software and hardware), and the telltale sign is the input code complaining that events are being queued (sigio), but never being processed
<phnord> daniels: this also happens if the systems just sit there and no one is even close to the box ... just network io from the only app running from remote
<phnord> daniels: we experienced this problem on different system configurations and xorg drivers (matrox mtx, nvidia). have a look at debian bug reports 261544, 344110 and 301309

As you may have read, the only plausible place we found in the X.Org code is in the file os/utils.c:

[...]
SmartScheduleTimer (int sig)
{
    int olderrno = errno;

    SmartScheduleTime += SmartScheduleInterval;
    if (SmartScheduleIdle)
    {
        SmartScheduleStopTimer ();
    }
    errno = olderrno;
}
[...]

In an undefined condition, a SIGALRM is sent while SmartScheduleIdle is
FALSE, so the timer is not stopped. As a result, setitimer is not
called, as you can see in the straces.

strace of xserver in a normal situation:
--- SIGALRM (Alarm clock) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
sigreturn()                             = ? (mask now [])

with error:
--- SIGALRM (Alarm clock) ---
sigreturn()

Comment 1 Andreas Girlich 2007-05-11 01:07:46 UTC

We now have the possibility to get a gdb backtrace of a crashed X server. We know that SmartScheduleTimer is a problem, but it is not the cause. The -dumbSched option did not help us. We will post the backtrace here once we have it.

Here is some additional hardware information about such a system:

CPU: Intel P4 3.0GHz
RAM: 1024MB
GPU: Matrox Parhelia 256DL
 - 3 graphics adapters in one system
 - each connected to a display via DualLink
Monitor: Apple 30" Cinema Display
Resolution: 2560x1600

Comment 2 Andreas Girlich 2007-05-14 08:27:30 UTC

After lots of work we discovered the "bad guy":
It is a bug in the proprietary mtx_drv driver, which is developed by Matrox. We will try to report it to Matrox so they can fix it.

I propose closing this bug, as no one here is responsible for mtx_drv.

Comment 3 Adam Jackson 2008-02-24 19:02:29 UTC

NOTOURBUG per previous comment.
