10525 – Openoffice menu may cause Xorg 100% cpu freeze

Summary: Openoffice menu may cause Xorg 100% cpu freeze

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Server/General
Version:	7.2 (2007.02)
Hardware:	All Linux (All)

Importance:	medium critical
Assignee:	Xorg Project Team
QA Contact:	Xorg Project Team

URL:	http://bugs.debian.org/cgi-bin/bugrep...
Whiteboard:
Keywords:

Duplicates (1):	10633
Depends on:
Blocks:

Reported:	2007-04-04 10:50 UTC by Miguel Freitas
Modified:	2011-10-15 14:02 UTC
CC List:	7 users

Attachments
locked up proc/id/status (597 bytes, text/plain) 2007-04-27 02:58 UTC, Jim Watson
/proc/id/status of locked up xorg (739 bytes, text/plain) 2008-12-15 10:12 UTC, ilf
/proc/id/status of locked up xorg - after kill -9 (544 bytes, text/plain) 2008-12-15 10:12 UTC, ilf

Description Miguel Freitas 2007-04-04 10:50:16 UTC

Ok, this is a problem that is really annoying me. Unfortunately it is not easily reproducible - the system can run stable for several days (using openoffice) and then xorg will freeze with 100% cpu.

I will list the steps to reproduce and some known facts i have collected from other users as well.

How to reproduce (not always reproducible):

1. Click on any openoffice menu.
2. Keyboard will freeze.
3. Mouse will keep moving in a jagged way - clicking doesn't work though.
4. Xorg shows 100% on top.

Known facts:

- The problem has been reported by at least 6 people: me, David Liontooth, Joachim Müller, Syd Alsobrook, Jim Watson and Marcelo Roberto (friend of mine).

see:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411287

- It was reported under debian, opensuse and fedora.

- Reported with openoffice from 2.0.x to 2.1.

- Reported for any openoffice application (confirmed: writer, impress)

- Reported for any openoffice menu (usually on "file", but today my computer hung on "view").

- Confirmed with xorg 7.2 and 1.2.99.902 (1.3.0 RC 2)

- At least 2 people were using nvidia binary driver (me and my friend). But another user reported it for Sunblade100.


I think i will try using the "nv" driver for a few days so i can rule the nvidia binary out (despite the sunblade report).

Any suggestions are welcome...

Comment 1 Miguel Freitas 2007-04-04 10:57:57 UTC

Forgot to mention:

- /var/log/Xorg.0.log doesn't show anything special when it freezes.

- I have seen an issue with the mga driver (completely unrelated) where a forced "swsusp" effectively unfreezes the machine after resume. this is NOT the case here: machine is still frozen after suspend/resume.

I don't know if these are useful, i'm just trying to provide all information i have.

Comment 2 Michel Dänzer 2007-04-13 00:34:16 UTC

*** Bug 10633 has been marked as a duplicate of this bug. ***

Comment 3 Michael George 2007-04-13 16:11:11 UTC

I also have this happen with the mga driver.  I think it only started when I upgraded from kernel 2.6.17 to 2.6.19...

I don't know how to force a swsusp, but if someone told me, I'd be happy to try it...

Comment 4 Michael George 2007-04-14 10:58:01 UTC

I have dropped back to the 2.6.17 kernel to see if that kernel also exhibits this behavior.

With what kernels are others seeing this behavior?

Comment 5 Miguel Freitas 2007-04-14 15:22:29 UTC

> With what kernels are others seeing this behavior?

2.6.18.8 here. note i'm also using X86_64 SMP.

- bugs.debian.org #411287 was also originally reported as x86_64 with 2.6.18 but it doesn't say anything about SMP.

- qa.openoffice.org #75578 reports 2.6.19-1.2911.6.5.fc6xen #1 SMP kernel _BUT_ running in a single processor machine (athlon xp - 32 bits).

- qa.openoffice.org #75578 also reports the problem with Linux sun 2.6.18-4-sparc64 (apparently single processor machine)

- bug has been confirmed so far for the following drivers: nvidia (binary), matrox mga and ati rage.

xorg devels willing to investigate this bug might consider checking the url below, it contains some interesting information.

http://qa.openoffice.org/issues/show_bug.cgi?id=75578

Comment 6 Michael George 2007-04-15 05:52:35 UTC

Yes, I forgot to add that I'm running x86_64 SMP kernel on a dual-opteron.

Comment 7 Michael George 2007-04-16 18:25:54 UTC

I have found that this bug also affects the 2.6.17 kernel...

Comment 8 Michael George 2007-04-18 18:28:11 UTC

A couple more observations:

I run Xinerama.  When I have the OOo window on the left screen (Screen0), the screen locks up, but the pointer will still move within the left screen.  The pointer will not leave the left screen, but the motion will.  So if I have moved the mouse about 1/2 screen into Screen1, the pointer won't move but I have to bring it that far back to the left before it will move again on Screen0.

When OOo is on the right screen (Screen1), the pointer freezes and never moves at all.

After a lockup, there are more lines in Xorg.0.log than after starting it up.  The lines after a lockup are:
    xkb_types                { include "%" };
    xkb_compatibility        { include "%" };
    xkb_symbols              { include "%" };
    xkb_geometry             { include "%" };
(EE) Error loading keymap /var/tmp/server-0.xkm

That last one might be useful to someone who knows more than I.  There is no server-0.xkm in /var/tmp.  I cannot find a file (with locate) with ".xkm" in it anywhere on my system...

Comment 9 Daniel Stone 2007-04-18 18:36:39 UTC

michael, it sounds like you're running a rather old xorg release?

Comment 10 Michael George 2007-04-19 03:06:25 UTC

I am running xorg 7.1, xorg-server 1.1.1-r4, and mga driver 1.4.2 (I upgraded that to 1.4.6.1, but I still had the problem).

I could easily upgrade to xorg 7.2, xorg-server 1.2.0, and mga driver 1.4.6.1, though they are not yet marked stable in gentoo portage.

I'm willing to do that for testing if it might be thought to help.

Comment 11 Jim Watson 2007-04-19 14:21:15 UTC

I got something - not sure if it is valid - the office started briefly, then
hung again while i did this. I got the same again later with another locked process.

Loaded symbols for /opt/o208/program/libsrtrs1.so
0xf68c40bc in poll ()
   from /lib/libc.so.6
(gdb) bt
#0  0xf68c40bc in poll () from /lib/libc.so.6
#1  0xf6cfc72c in _XWaitForReadable (dpy=0x9f5f8)
    at ../../src/XlibInt.c:498
#2  0xf6cfcbb0 in _XRead (dpy=0xa2c88,
    data=0xffdcba6c "÷êò\024ÿÜ»\030ÿÜ»\034", size=
/build/buildd/gdb-6.6.dfsg/gdb/dwarf2-frame.c:1084: internal-error: Unknown
register rule.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n

/build/buildd/gdb-6.6.dfsg/gdb/dwarf2-frame.c:1084: internal-error: Unknown
register rule.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n
) at ../../src/XlibInt.c:1087
#3  0xf6cfd550 in _XReply (dpy=0xa2c88, rep=0xffdcba6c, extra=32,
    discard=20) at ../../src/XlibInt.c:1714
#4  0xf6d4e894 in XkbGetKeyboardByName (dpy=0xa2c88,
    deviceSpec=<value optimized out>, names=0x0,
    want=<value optimized out>, need=<value optimized out>,
    load=<value optimized out>) at ../../../src/xkb/XKBGetByName.c:136
---Type <return> to continue, or q <return> to quit---
#5  0xf539eb5c in SalDisplay::GetKeyboardName ()
   from /opt/o208/program/libvclplug_gen680ls.so
#6  0xf53927c0 in SalDisplay::GetKeyNameFromKeySym ()
   from /opt/o208/program/libvclplug_gen680ls.so
#7  0xf53938e4 in SalDisplay::GetKeyName ()
   from /opt/o208/program/libvclplug_gen680ls.so
#8  0xf5b596c0 in ?? () from /opt/o208/program/libvclplug_gtk680ls.so
#9  0xf7e62e9c in KeyCode::GetName ()
   from /opt/o208/program/libvcl680ls.so
#10 0xf7e67740 in ?? () from /opt/o208/program/libvcl680ls.so
#11 0xf7e6e7b4 in ?? () from /opt/o208/program/libvcl680ls.so
#12 0xf7e70778 in PopupMenu::Execute ()
   from /opt/o208/program/libvcl680ls.so
#13 0xf7f02878 in ?? () from /opt/o208/program/libvcl680ls.so
#14 0xf7f02878 in ?? () from /opt/o208/program/libvcl680ls.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Comment 12 Miguel Freitas 2007-04-20 11:20:38 UTC

Ok, here is a good backtrace from this problem from a friend of mine. He has exactly the same hw as I (x86_64, smp, opensuse 10.2, xorg pre-7.2, openoffice 2.0.4, binary nvidia etc).

Here is the openoffice bt. Xorg bt will follow.

(gdb) bt
#0  0xffffe405 in __kernel_vsyscall ()
#1  0xf6d5dea3 in poll () from /lib/libc.so.6
#2  0xf704b469 in XAddConnectionWatch () from /usr/lib/libX11.so.6
#3  0xf704b84f in _XRead () from /usr/lib/libX11.so.6
#4  0xf704c1c4 in _XReply () from /usr/lib/libX11.so.6
#5  0xf709e5b8 in XkbGetKeyboardByName () from /usr/lib/libX11.so.6
#6  0xf709e97f in XkbGetKeyboard () from /usr/lib/libX11.so.6
#7  0xf57aa49d in SalDisplay::GetKeyboardName () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#8  0xf57a39d7 in SalDisplay::GetKeyNameFromKeySym () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#9  0xf57a3b8e in SalDisplay::GetKeyName () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#10 0xf57793bd in X11SalFrame::GetKeyName () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#11 0xf7db0f4d in KeyCode::GetName () from /usr/lib/ooo-2.0/program/libvcl680li.so
#12 0xf7db55c9 in Menu::GetDisplayText () from /usr/lib/ooo-2.0/program/libvcl680li.so
#13 0xf7dbc380 in PopupMenu::IsInExecute () from /usr/lib/ooo-2.0/program/libvcl680li.so
#14 0xf7dbc853 in PopupMenu::IsInExecute () from /usr/lib/ooo-2.0/program/libvcl680li.so
#15 0xf7dbcaa5 in PopupMenu::IsInExecute () from /usr/lib/ooo-2.0/program/libvcl680li.so
#16 0xf7dbd10a in PopupMenu::IsInExecute () from /usr/lib/ooo-2.0/program/libvcl680li.so
#17 0xf7e02197 in Window::~Window () from /usr/lib/ooo-2.0/program/libvcl680li.so
#18 0xf7e03ae8 in Window::~Window () from /usr/lib/ooo-2.0/program/libvcl680li.so
#19 0xf7e027b8 in Window::~Window () from /usr/lib/ooo-2.0/program/libvcl680li.so
#20 0xf577ec55 in X11SalFrame::GetWindowState () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#21 0xf577b279 in X11SalFrame::HandleMouseEvent () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#22 0xf577e60e in X11SalFrame::Dispatch () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#23 0xf57a6a5a in SalX11Display::Dispatch () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#24 0xf57a5756 in SalX11Display::Yield () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#25 0xf57a48d5 in SalX11Display::IsEvent () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#26 0xf579fefe in SalXLib::Yield () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#27 0xf579fd7d in SalXLib::Yield () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#28 0xf57a7e3f in X11SalInstance::Yield () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#29 0xf7c951b5 in Application::Yield () from /usr/lib/ooo-2.0/program/libvcl680li.so
#30 0xf7c95251 in Application::Execute () from /usr/lib/ooo-2.0/program/libvcl680li.so
#31 0x0806df86 in desktop::Desktop::Main ()
#32 0xf7c99731 in InitVCL () from /usr/lib/ooo-2.0/program/libvcl680li.so
#33 0xf7c99847 in SVMain () from /usr/lib/ooo-2.0/program/libvcl680li.so
#34 0x08064a8b in sal_main ()
#35 0x08064ae0 in main ()

Comment 13 Miguel Freitas 2007-04-20 11:24:34 UTC

Xorg gurus, please look at this ;-)

X Window System Version 7.1.99.902 (7.2.0 RC 2)
Release Date: 13 November 2006
X Protocol Version 11, Revision 0, Release 7.1.99.902
Build Operating System: openSUSE SUSE LINUX
Current Operating System: Linux genipapo 2.6.18.2-34-default #1 SMP Mon Nov 27 11:46:27 UTC 2006 x86_64
Build Date: 09 January 2007


(gdb) bt
#0  0x00002b193d2915bd in fork () from /lib64/libc.so.6
#1  0x00000000005528dd in Popen ()
#2  0x000000000054878b in XkbDDXCompileKeymapByNames ()
#3  0x0000000000548983 in XkbDDXLoadKeymapByNames ()
#4  0x0000000000528d66 in ProcXkbGetKbdByName ()
#5  0x0000000000447e3b in Dispatch ()
#6  0x00000000004311ed in main ()

Comment 14 Miguel Freitas 2007-04-20 11:49:20 UTC

(In reply to comment #13)
> #1  0x00000000005528dd in Popen ()
> #2  0x000000000054878b in XkbDDXCompileKeymapByNames ()

more from gdb session: string used in Popen parameter (buf) is


(gdb) x /500s  0x000000000369a0c0
0x369a0c0:       "\"/usr/bin/xkbcomp\" -w 1 \"-R/usr/share/X11/xkb\" -xkm \"-\" -em1 \"The XKEYBOARD keymap compiler (xkbcomp) reports:\" -emp \"> \" -eml \"Errors from xkbcomp are not fatal to the X server\" \"/var/lib/xkb/compiled/server-0.xkm\""

it came from the following code in xorg/xkb/ddxLoad.c:
         
	buf = Xprintf(
	   "\"%s" PATHSEPARATOR "xkbcomp\" -w %d \"-R%s\" -xkm \"%s\" -em1 %s -emp %s -eml %s \"%s%s.xkm\"",
		xkbbindir,
		((xkbDebugFlags<2)?1:((xkbDebugFlags>10)?10:(int)xkbDebugFlags)),
		xkbbasedir, xkmfile,
		PRE_ERROR_MSG,ERROR_PREFIX,POST_ERROR_MSG1,
		xkm_output_dir,keymap);

Comment 15 Miguel Freitas 2007-04-20 12:12:22 UTC

just to make it clear the exact point it hung inside popen/fork:

0x00002b193d2915bd in fork () from /lib64/libc.so.6

here is the disass:

(...)
0x00002b193d2915af <fork+127>:  xor    %esi,%esi
0x00002b193d2915b1 <fork+129>:  mov    $0x1200011,%edi
0x00002b193d2915b6 <fork+134>:  mov    $0x38,%eax
0x00002b193d2915bb <fork+139>:  syscall
0x00002b193d2915bd <fork+141>:  cmp    $0xfffffffffffff000,%rax
0x00002b193d2915c3 <fork+147>:  ja     0x2b193d291720 <fork+496>
(...)

confused. ip points to the instruction just past the syscall, i don't know how it can hang there.

---

if i do stepi, i get something that is interesting too:

0x00002b193d2915bd in fork () from /lib64/libc.so.6
(gdb) bt
#0  0x00002b193d2915bd in fork () from /lib64/libc.so.6
#1  0x00000000005528dd in Popen ()
#2  0x000000000054878b in XkbDDXCompileKeymapByNames ()
#3  0x0000000000548983 in XkbDDXLoadKeymapByNames ()
#4  0x0000000000528d66 in ProcXkbGetKbdByName ()
#5  0x0000000000447e3b in Dispatch ()
#6  0x00000000004311ed in main ()
(gdb) stepi
0x00000000005525b0 in SmartScheduleInit ()
(gdb) stepi
0x00000000005525b5 in SmartScheduleInit ()
(gdb) stepi
0x00000000005525ba in SmartScheduleInit ()
(gdb) stepi
0x00000000005525be in SmartScheduleInit ()
(gdb) stepi
0x000000000042fdd0 in __errno_location@plt ()
(gdb) stepi
0x00002b193d21be10 in __errno_location () from /lib64/libc.so.6
(gdb) stepi
0x00002b193d21be17 in __errno_location () from /lib64/libc.so.6
(gdb) stepi
0x00002b193d21be20 in __errno_location () from /lib64/libc.so.6
(gdb) stepi
0x00000000005525c3 in SmartScheduleInit ()
(gdb) stepi
0x00000000005525ca in SmartScheduleInit ()
(gdb) stepi
0x00000000005525cc in SmartScheduleInit ()
(gdb) stepi
0x00000000005525cf in SmartScheduleInit ()
(gdb) stepi
0x00000000005525d6 in SmartScheduleInit ()
(gdb) stepi
0x00000000005525d9 in SmartScheduleInit ()
(gdb) step
Single stepping until exit from function SmartScheduleInit,
which has no line number information.
0x00002b193d22e5b0 in __restore_rt () from /lib64/libc.so.6
(gdb) step
Single stepping until exit from function __restore_rt,
which has no line number information.
0x00002b193d2915bb in fork () from /lib64/libc.so.6


the patient (xorg) died here. i hope you may be able to continue from this...

---

btw, forking inside a xorg request to execute an external command sounds terribly dangerous to me... do we really need this?

Comment 16 Miguel Freitas 2007-04-24 19:33:24 UTC

This is getting interesting. For the first time ever, i have been able to unfreeze my Xorg. here is what i did:

# gdb Xorg <pid>
(...)
Program received signal SIGINT, Interrupt.
0x00002b2e77d6d5bd in fork () from /lib64/libc.so.6
(gdb) disass
Dump of assembler code for function fork:
0x00002b2e77d6d530 <fork+0>:    push   %rbp
0x00002b2e77d6d531 <fork+1>:    mov    %rsp,%rbp
0x00002b2e77d6d534 <fork+4>:    push   %r14
0x00002b2e77d6d536 <fork+6>:    push   %r13
0x00002b2e77d6d538 <fork+8>:    push   %r12
0x00002b2e77d6d53a <fork+10>:   push   %rbx
0x00002b2e77d6d53b <fork+11>:   sub    $0x30,%rsp
0x00002b2e77d6d53f <fork+15>:   mov    2807650(%rip),%rcx        # 0x2b2e7801aca8 <__fork_handlers>
0x00002b2e77d6d546 <fork+22>:   test   %rcx,%rcx
0x00002b2e77d6d549 <fork+25>:   mov    %rcx,%rbx
0x00002b2e77d6d54c <fork+28>:   je     0x2b2e77d6d576 <fork+70>
0x00002b2e77d6d54e <fork+30>:   mov    0x28(%rcx),%edx
0x00002b2e77d6d551 <fork+33>:   test   %edx,%edx
0x00002b2e77d6d553 <fork+35>:   je     0x2b2e77d6d546 <fork+22>
0x00002b2e77d6d555 <fork+37>:   lea    0x1(%rdx),%esi
0x00002b2e77d6d558 <fork+40>:   mov    %edx,%eax
0x00002b2e77d6d55a <fork+42>:   lock cmpxchg %esi,0x28(%rcx)
0x00002b2e77d6d55f <fork+47>:   cmp    %eax,%edx
0x00002b2e77d6d561 <fork+49>:   je     0x2b2e77d6d7c2 <fork+658>
0x00002b2e77d6d567 <fork+55>:   mov    2807610(%rip),%rcx        # 0x2b2e7801aca8 <__fork_handlers>
0x00002b2e77d6d56e <fork+62>:   test   %rcx,%rcx
0x00002b2e77d6d571 <fork+65>:   mov    %rcx,%rbx
0x00002b2e77d6d574 <fork+68>:   jne    0x2b2e77d6d54e <fork+30>
0x00002b2e77d6d576 <fork+70>:   xor    %r12d,%r12d
0x00002b2e77d6d579 <fork+73>:   callq  0x2b2e77d44e30 <__GI__IO_list_lock>
0x00002b2e77d6d57e <fork+78>:   mov    %fs:0x90,%r9d
0x00002b2e77d6d587 <fork+87>:   mov    %fs:0x94,%r8d
0x00002b2e77d6d590 <fork+96>:   mov    %r8d,%eax
0x00002b2e77d6d593 <fork+99>:   neg    %eax
0x00002b2e77d6d595 <fork+101>:  mov    %eax,%fs:0x94
0x00002b2e77d6d59d <fork+109>:  mov    %fs:0x10,%r10
0x00002b2e77d6d5a6 <fork+118>:  xor    %edx,%edx
0x00002b2e77d6d5a8 <fork+120>:  add    $0x90,%r10
0x00002b2e77d6d5af <fork+127>:  xor    %esi,%esi
0x00002b2e77d6d5b1 <fork+129>:  mov    $0x1200011,%edi
0x00002b2e77d6d5b6 <fork+134>:  mov    $0x38,%eax
0x00002b2e77d6d5bb <fork+139>:  syscall
0x00002b2e77d6d5bd <fork+141>:  cmp    $0xfffffffffffff000,%rax
0x00002b2e77d6d5c3 <fork+147>:  ja     0x2b2e77d6d720 <fork+496>
0x00002b2e77d6d5c9 <fork+153>:  test   %eax,%eax
0x00002b2e77d6d5cb <fork+155>:  mov    %eax,%r14d

note the PC is pointing to <fork+141>, so i'm assuming it must be hanging inside the kernel (syscall). 

i believe the relevant code from glibc is the following:

pid_t
__libc_fork (void)
{
  (... stripped ...)
  _IO_list_lock (); <- note __GI__IO_list_lock in disass above!

#ifndef NDEBUG
  pid_t ppid = THREAD_GETMEM (THREAD_SELF, tid);
#endif

  /* We need to prevent the getpid() code to update the PID field so
     that, if a signal arrives in the child very early and the signal
     handler uses getpid(), the value returned is correct.  */
  pid_t parentpid = THREAD_GETMEM (THREAD_SELF, pid);
  THREAD_SETMEM (THREAD_SELF, pid, -parentpid); 

#ifdef ARCH_FORK
  pid = ARCH_FORK ();
#else
# error "ARCH_FORK must be defined so that the CLONE_SETTID flag is used"
  pid = INLINE_SYSCALL (fork, 0);
#endif

---

i think the syscall in disass above must be either ARCH_FORK or INLINE_SYSCALL.

using x86_64's unistd.h and converting eax 0x38 => 56 => __NR_clone. this is funny because __NR_fork is 57 which is what i would expect. i386's unistd.h yields even stranger syscall: __NR_mpx.

still, this is not the value currently loaded on rax:
(gdb) info reg
rax            0xfffffffffffffdff       -513
rbx            0x0      0
rcx            0xffffffffffffffff       -1
rdx            0x0      0
rsi            0x0      0
rdi            0x1200011        18874385
(...)

now the great trick:

(gdb) set $rax = 0x38
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y

-> done. xorg is good again.

does anybody have any idea what is going on here?

Comment 17 Miguel Freitas 2007-04-24 19:50:01 UTC

in case anybody wants to try to reproduce the problem with openoffice, i've just confirmed using gdb breakpoints that it does only call this function on the very first time the menu is drawn. somebody (openoffice?) must be caching the XkbGetKeyboard's result.

Comment 18 Daniel Stone 2007-04-25 00:27:02 UTC

bizarre, i can't imagine why fork() is hanging.  this will go away when xkbcomp gets merged into the server, but in the meantime, you might want to check that out with a more minimal testcase, say.

Comment 19 Miguel Freitas 2007-04-26 05:42:00 UTC

I'd like to ask anybody who can reproduce the bug to post the result of the following command:

# cat /proc/`pidof Xorg`/status

thanks

Comment 20 Jim Watson 2007-04-27 02:58:52 UTC

Created attachment 9757
locked up proc/id/status

This lock-up and the attached status report were produced with the reduced test case provided by cmc at
http://www.openoffice.org/issues/show_bug.cgi?id=75578

/*
 * gcc keyboard.c -lX11
 * ./a.out
 */


#include <X11/Xlib.h>
#include <X11/XKBlib.h>

#include <stdio.h>

int main(void)
{
    XkbDescPtr pXkbDesc = NULL;
    Display * pDisplay = XOpenDisplay(NULL);

    pXkbDesc = XkbGetKeyboard(pDisplay, XkbAllComponentsMask, XkbUseCoreKbd );

    if (pXkbDesc)
    {
        const char* pAtom = NULL;
        if( pXkbDesc->names->groups[0] )
        {
            pAtom = XGetAtomName( pDisplay, pXkbDesc->names->groups[0] );
            printf("Keyboard Name is %s\n", pAtom);
            XFree( (void*)pAtom );
        }

        XkbFreeKeyboard( pXkbDesc, XkbAllComponentsMask, True );
    }

    XCloseDisplay(pDisplay);
    return 0;
}

Comment 21 Miguel Freitas 2007-04-27 03:20:52 UTC

Thanks Jim! you have just confirmed my theory: there is a pending signal (actually of thread group type) SIGALRM which never gets served.

ShdPnd:	0000000000002000

the kernel has some code to abort the execution of the syscall, return to userspace (so the signal can be handled) and then reenter the syscall. it seems the mechanism is not working, so it just gets stuck forever. i posted a message to the linux kernel ml asking for advice but i was ignored :(

http://www.uwsg.indiana.edu/hypermail/linux/kernel/0704.3/0717.html

another guess: the nonsense SigQ value might be a hint of a kernel bug.

SigQ:	1/18446744073709551615

Comment 22 Miguel Freitas 2007-04-27 03:35:18 UTC

ok, cursory investigation reveals that Xorg is supposed to handle SIGALRM by those SmartSchedule* functions that appeared in my earlier gdb session.

so it looks like the signal is being handled. but all those 18446744073709551615 signals might take a while to get served ;-)

TODO: check kernel sources to understand what the second value of SigQ really means (qlim) and how could it have gotten that wrong.

Comment 23 Michel Dänzer 2007-04-27 04:02:47 UTC

(In reply to comment #22)
> ok, cursory investigation reveals that Xorg is suppose to handle SIGALRM by
> those SmartSchedule* functions that appeared in my earlier gdb session.

Hmm, could this be related to bug 10747?

Comment 24 Miguel Freitas 2007-04-27 16:13:38 UTC

(In reply to comment #23)
> Hmm, could this be related to bug 10747?

related, yes. but it looks like a different bug imho.

additional research information: 

18446744073709551615 = -1 (read as an unsigned 64-bit value), which is supposed to be the RLIMIT_SIGPENDING.

having this negative rlimit may cause problems for the __sigqueue_alloc() kernel function. however, as far as i can see, this would possibly prevent new signals from being enqueued - not existing ones from being dequeued/cleared/whatever.

what a tortuous history for a bug... starts like an openoffice issue, then xorg and in the end it is possible that none of them are actually faulty.

Comment 25 Miguel Freitas 2007-05-08 14:29:13 UTC

(In reply to comment #24)
> (In reply to comment #23)
> > Hmm, could this be related to bug 10747?
> 
> related, yes. but it looks like a different bug imho.

i take back my words: now i think it IS the same bug. i finally had a better chance of debugging it in my own system.

first observation: RLIMIT_SIGPENDING is a false track. here it is not negative.

SigQ:   1/16381
SigPnd: 0000000000000000
ShdPnd: 0000000000002000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000051806ecb

by stepping into SmartScheduleTimer i was able to confirm the condition described in 10747 (where SmartScheduleIdle is FALSE):

(gdb) stepi
0x00000000005658f0 in SmartScheduleTimer ()

Dump of assembler code for function SmartScheduleTimer:
0x00000000005658f0 <SmartScheduleTimer+0>:      mov    %rbx,0xfffffffffffffff0(%rsp)
0x00000000005658f5 <SmartScheduleTimer+5>:      mov    %rbp,0xfffffffffffffff8(%rsp)
0x00000000005658fa <SmartScheduleTimer+10>:     sub    $0x18,%rsp
0x00000000005658fe <SmartScheduleTimer+14>:     callq  0x431d78 <__errno_location@plt>
0x0000000000565903 <SmartScheduleTimer+19>:     mov    2408582(%rip),%rdx        # 0x7b1990 <_DYNAMIC+4192>
0x000000000056590a <SmartScheduleTimer+26>:     mov    (%rax),%ebp
0x000000000056590c <SmartScheduleTimer+28>:     mov    %rax,%rbx
0x000000000056590f <SmartScheduleTimer+31>:     mov    2409906(%rip),%rax        # 0x7b1ec8 <_DYNAMIC+5528>
0x0000000000565916 <SmartScheduleTimer+38>:     mov    (%rax),%rax
0x0000000000565919 <SmartScheduleTimer+41>:     add    %rax,(%rdx)
0x000000000056591c <SmartScheduleTimer+44>:     mov    2405597(%rip),%rax        # 0x7b0e00 <_DYNAMIC+1232>
0x0000000000565923 <SmartScheduleTimer+51>:     mov    (%rax),%esi
0x0000000000565925 <SmartScheduleTimer+53>:     test   %esi,%esi
0x0000000000565927 <SmartScheduleTimer+55>:     je     0x56592e <SmartScheduleTimer+62>
0x0000000000565929 <SmartScheduleTimer+57>:     callq  0x5657d0 <SmartScheduleStopTimer>
0x000000000056592e <SmartScheduleTimer+62>:     mov    %ebp,(%rbx)
0x0000000000565930 <SmartScheduleTimer+64>:     mov    0x8(%rsp),%rbx
0x0000000000565935 <SmartScheduleTimer+69>:     mov    0x10(%rsp),%rbp
0x000000000056593a <SmartScheduleTimer+74>:     add    $0x18,%rsp
0x000000000056593e <SmartScheduleTimer+78>:     retq
0x000000000056593f <SmartScheduleTimer+79>:     nop


(gdb) stepi
0x0000000000565927 in SmartScheduleTimer ()
(gdb) stepi
0x000000000056592e in SmartScheduleTimer ()

note the jump from 565927 to 56592e requires SmartScheduleIdle being false.

by forcing it to true, i was able to unlock my system:

(gdb) set *(int *)$rax = 1
(gdb) p *(int *)$rax
$3 = 1
(gdb) quit

Comment 26 Jim Watson 2007-05-15 03:45:57 UTC

In reply to comment #25
> i take back my words: now i think it IS the same bug.

But the new comment #2 on bug 10747 implies it is not related, or is it mistaken?

Comment 27 Roger Larsson 2007-06-08 15:38:42 UTC

(In reply to comment #17)
> in case anybody wants to try to reproduce the problem with openoffice, i've
> just confirmed using gdb breakpoints that it does only call this function on
> the very first time the menu is drawn. somebody (openoffice?) must be caching
> the XkbGetKeyboard's result.
> 

This fits with my experience. 
I have noticed the problem in OpenOffice.org Calc - but only the first time.
AMD 64, SuSE 10.2, X.org nv
It happens quite often...

Comment 28 Jim Watson 2007-07-12 16:33:01 UTC

This problem has gone away in GNU/Linux SPARC Debian/unstable. The reduced test case at comment #20 does not lock any more. And a patch is reported here: http://cvs.fedora.redhat.com/viewcvs/devel/xorg-x11-server/xserver-1.3.0-xkb-and-loathing.patch?view=markup

Comment 29 Stefan Dirsch 2007-11-04 12:26:12 UTC

See also Novell Bug #245711.

Comment 30 Gemes Tibor 2008-04-01 01:14:48 UTC

(In reply to comment #20)

The test case crashes my X session, but I did a few experiments and found the following:
commenting out Option "XkbModel" "latitude" in xorg.conf eliminates the problem, the test case runs without error, prints "Keyboard Name is Hungary".  If I set the XkbModel back to latitude, it crashes again. 

X.Org X Server 1.4.0.90
Release Date: 5 September 2007
X Protocol Version 11, Revision 0
Build Operating System: Linux Ubuntu (xorg-server 2:1.4.1~git20080131-1ubuntu6)
Current Operating System: Linux tibnote 2.6.24-12-generic #1 SMP Wed Mar 12 23:01:54 UTC 2008 i686

Comment 31 Peter Hutterer 2008-04-23 22:56:56 UTC

(In reply to comment #30)
> (In reply to comment #20)
> 
> The test case crashes my X session, but did a few experiment and found the
> following: 
> commenting out Option "XkbModel" "latitude" in xorg.conf eliminates the
> problem, the test case runs without error, prints "Keyboard Name is Hungary". 
> If I set the XkbModel back to latitude, it crashes again. 
> 
> X.Org X Server 1.4.0.90
> Release Date: 5 September 2007
> X Protocol Version 11, Revision 0
> Build Operating System: Linux Ubuntu (xorg-server 2:1.4.1~git20080131-1ubuntu6)
> Current Operating System: Linux tibnote 2.6.24-12-generic #1 SMP Wed Mar 12
> 23:01:54 UTC 2008 i686
> 

Gemes:
Are you able to try a current git master X server? I just tried the test program with latitude and without and could not reproduce the crash.

Comment 32 Gemes Tibor 2008-05-08 09:28:03 UTC

(In reply to comment #31)
> (In reply to comment #30)
> > (In reply to comment #20)
> > 
> Gemes:
> Are you able to try a current git master X server? I just tried the test
> program with latitude and without and could not reproduce the crash.
> 


Sorry, but  I cannot reproduce it now with the current up-to-date ubuntu hardy either, and OO is currently working fine as well: "Keyboard Name is Hungary - Standard"


Anyway, I did not try the git master (as a matter of fact I don't know how to), but this seems to be unnecessary - at least for me.

Tib

Comment 33 Peter Hutterer 2008-05-08 16:10:30 UTC

(In reply to comment #32)
> Sorry, but  I cannot reproduce it now with the current up-to-date ubuntu hardy
> either, and OO is currently working fine as well: "Keyboard Name is Hungary -
> Standard"
> 
> 
> Anyway I did not try the git master as a matter of fact I don't know how to,
> but this seems to be  unnecessary - at least for me. 

good enough for me. Hardy has xserver 1.4, so we assume it has been fixed in the meantime somewhen :)

Comment 34 ilf 2008-12-15 10:12:29 UTC

Created attachment 21184
/proc/id/status of locked up xorg

/proc/id/status of locked up xorg
see bug https://bugs.freedesktop.org/show_bug.cgi?id=10525

Comment 35 ilf 2008-12-15 10:12:50 UTC

Created attachment 21185
/proc/id/status of locked up xorg - after kill -9

/proc/id/status of locked up xorg, after i tried kill -9 -ing it
see bug https://bugs.freedesktop.org/show_bug.cgi?id=10525

Comment 36 ilf 2008-12-15 10:17:45 UTC

I'm still experiencing this bug, using xorg-server-1.4.2 in Debian Lenny
package version 2:1.4.2-9.

I compared deb-src and vanilla sources, xorg-server-1.4.2/os/utils.c are
identical. And both seem to have included Adam Jackson's patch from
http://cvs.fedora.redhat.com/viewvc/devel/xorg-x11-server/xserver-1.3.0-xkb-and-loathing.patch?view=markup

The entire section
> OsSigHandlerPtr old_alarm = NULL; /* XXX horrible awful hack */
in these sources is also the same in the current repository at
http://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c

I attached /proc/<pid>/status both before trying to kill -9 it and after at
https://bugs.freedesktop.org/attachment.cgi?id=21184 (before) and
https://bugs.freedesktop.org/attachment.cgi?id=21185 (after).

top says:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13814 root      20   0     0    0    0 R 95.2  0.0  70:25.70 Xorg

Any idea?

Comment 37 ilf 2008-12-15 10:42:53 UTC

Also, this doesn't only happen with OpenOffice.org. I had it happen with rdesktop connected to a Windows XP machine, too. And only once I used OpenOffice.org on that Windows when this happened (didn't try again after). Just now I ran Opera or some other program I don't exactly remember (the Opera remained on the screen) when X locked up.

Comment 38 ilf 2009-01-16 07:42:11 UTC

And another one. This time on Debian xserver-xorg-core 2:1.4.2-10, using rdesktop 1.6.0-2, connecting to a Win XP Pro SP3 while using Gimp 2.4.2 there.
Again, I could supply /proc/<pid>/status, but they don't look much different from the last one.
Again, nothing but a reboot solves this. :(

Comment 39 ilf 2009-03-19 08:28:16 UTC

And it's still there. This time with xserver-xorg-core 2:1.4.2-11 and openoffice.org 1:3.0.1-6.

Comment 40 Jeremy Huddleston Sequoia 2011-10-15 14:02:37 UTC

Ok, so I just read through this rather verbose bug's history.  This is not an 
Xorg issue.

FWIW, the reduced test case "just works" when I ran it on darwin with the Xorg 
DDX.  This is likely a glibc or kernel issue.  If you still have problems, 
please work with your distribution.
