Bug 98873 - xorg-server 100% CPU (Spinlock) when connecting/disconnecting HDMI
Summary: xorg-server 100% CPU (Spinlock) when connecting/disconnecting HDMI
Status: RESOLVED MOVED
Alias: None
Product: xorg
Classification: Unclassified
Component: Server/General (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Xorg Project Team
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-27 16:58 UTC by Roger
Modified: 2018-12-13 22:36 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
strace-withNoRace.log (26.25 KB, text/plain)
2016-11-27 16:58 UTC, Roger
no flags Details
strace-withRace.log (1.94 KB, text/plain)
2016-11-27 16:58 UTC, Roger
no flags Details
gdb-4.log (29.28 KB, text/plain)
2016-11-27 16:59 UTC, Roger
no flags Details
xorg.conf (2.08 KB, text/plain)
2016-11-28 15:38 UTC, Roger
no flags Details
Xorg.0.log (76.51 KB, text/plain)
2016-11-28 15:39 UTC, Roger
no flags Details
Xorg.0-HDMI-1-Samsung.log (98.76 KB, text/plain)
2016-11-28 22:24 UTC, Roger
no flags Details
dmesg-hdmi-connected-cpurun.log (3.27 KB, text/plain)
2016-11-29 02:02 UTC, Roger
no flags Details
gdb-5-step.log (47.12 KB, text/plain)
2016-12-08 14:29 UTC, Roger
no flags Details

Description Roger 2016-11-27 16:58:02 UTC
I have an NVidia 670 GTX with a DVI & HDMI displays connected respectively.

Using the proprietary NVidia driver, I've had no spinlocks while disconnecting or connecting (eg. powering-off, and powering-on) the HDMI display while X is running.

Switched over to xf86-video-nouveau-1.0.12 and kernel-4.8.7/4.8.9 nouveau driver (and removing NVidia and resetting OpenGL driver links), and I'm experiencing an X (xorg-server-1.18.4) 100% CPU usage/spinlock on one core whenever I power-on or power-off the HDMI device.

I've switched on all verbose logging within the nouveau kernel and DRM kernel driver.  I've compiled xorg-server with debug flags, and used gdb and strace remotely on the X (xorg-server) and watched the logs when the 100% CPU usage would start.  The only oddity with the logs; upon HDMI connect/disconnect and 100% CPU usage, what looks like a count down timer starts to spawn through strace.  (From looks, looks like the DPMS timer for monitor time-out.)  Looking at GDB backtrace doesn't show much, but I'll attach anyways because the programmer who wrote & understand the code may likely see what's occurring.   From looks, looks like the DPMS timer for monitor time-out.  However; I've disabled DPMS within nouveau kernel command line and xorg.conf, and the 100% CPU (spinlock) still occurs and DPMS is obviously disabled within the logs.

One thing I did not think of, I should compile the kernel (eg. nouveau kernel driver) using debugging.  (Think I read an earlier post of using MMIO tracing.)
Comment 1 Roger 2016-11-27 16:58:33 UTC
Created attachment 128220 [details]
strace-withNoRace.log
Comment 2 Roger 2016-11-27 16:58:57 UTC
Created attachment 128221 [details]
strace-withRace.log
Comment 3 Roger 2016-11-27 16:59:15 UTC
Created attachment 128222 [details]
gdb-4.log
Comment 4 Roger 2016-11-28 15:38:15 UTC
Also should mention I've I have been using:

/usr/bin/startx -- -logverbose 20 -verbose 20

Forgot to attach my Xorg.conf and Xorg.0.log, although I think nothing is mentioned within the Xorg.0.log file.
Comment 5 Roger 2016-11-28 15:38:44 UTC
Created attachment 128241 [details]
xorg.conf
Comment 6 Roger 2016-11-28 15:39:04 UTC
Created attachment 128242 [details]
Xorg.0.log
Comment 7 Roger 2016-11-28 22:24:12 UTC
Created attachment 128255 [details]
Xorg.0-HDMI-1-Samsung.log

This is a log file, with loading nouveau module/driver after the system is booted and the Samsung HDMI monitor already powered on and connected to a Yamaha Receiver, and then further connected to the HDMI connector of the NVidia card.

The Dell monitor is connected via DVI-I-1 connector of the NVidia card.
Comment 8 Roger 2016-11-28 23:38:26 UTC
Switching off the HDMI device seems to present the following, using the kobject Linux kernel tracing/hacking option.  The same is presented again via syslog when switching on the HDMI device.
                                                                                                                                                    
Nov 28 18:35:30 localhost4 kernel: kobject: 'card0' (ffff8807f9bf8010): kobject_uevent_env                                                                               
Nov 28 18:35:30 localhost4 kernel: kobject: 'card0' (ffff8807f9bf8010): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:10.0/0000:04:00.0/drm/card0'
Comment 9 Roger 2016-11-29 02:02:09 UTC
Created attachment 128260 [details]
dmesg-hdmi-connected-cpurun.log

On connect, this is what dmesg outputs using nouveau.debug=debug with Linux kernel kobjects.
Comment 10 Roger 2016-12-01 04:42:11 UTC
So far I've narrowed this bug down to the Nouveau's "ZaphodHeads" option within xorg.conf.

Without ZaphodHeads using an xorg, it looks like the dual monitor setup uses Xinerama by default (regardless of the mention of Xinerama being disabled by default), and both displays work, even after powering cycling the Yamaha receiver or HDMI display.  No CPU run, etc.  Although I'm not a fan of extending the displays, I prefer separate displays!

But when using ZaphodHeads option within xorg.conf, the CPU spinlock is once again present when power cycling the HDMI display, wither by the monitor power button or home theater HDMI capable receiver.

I've tried using the trace option for nouveau driver/module, but I see nothing abnormal aside from a bad EDID value, which Xorg should have no problems handling gracefully.  I've also tried the kernel mmio tracing, but seems nouveau is hard locking the computer during tracing.  I'm assuming since I haven't seen anything obvious within the nouveau debug tracing, the problem maybe elsewhere.

Using strace on the X process, seems to show some anomalies within the output.  Using "startx -- -logverbose 5 -verbose 5" or with higher values seems to show no further debugging output, aside from the initial start-up of X.  (Albeit, might see a VT switch or something else every so often.)  And the process stuck within the 100% CPU spinlock is X itself.

So I'll guess X is in a loop checking or running through something, and I can't get any further debugging than this.  Most I can do with gdb is debugging segfaults.  I would think this bug is something that should be sending hints into the verbose logging method.

Very likely the problem is within X at this point.

Suggestions for further debug methods for X would be helpful!
Comment 11 Olivier Fourdan 2016-12-01 13:21:36 UTC
(In reply to Roger from comment #10)
> [...]
> So I'll guess X is in a loop checking or running through something, and I
> can't get any further debugging than this.  Most I can do with gdb is
> debugging segfaults.  I would think this bug is something that should be
> sending hints into the verbose logging method.
> 
> Very likely the problem is within X at this point.
> 
> Suggestions for further debug methods for X would be helpful!

If that's the case, I reckon attaching gdb to the running process while it's taking 100% CPU will most likely give you the portion of code where it's stuck, a backtrace would tell what brought it in there, and a few "step" or "next" would allow you to follow a few iterations.

So best, imho, is to use gdb to investigate when the issue is occurring, means connecting to the system remotely via ssh though.
Comment 12 Roger 2016-12-02 20:32:59 UTC
Doh.  I forgot to use next/step instead of repeated backtrace/bt.

So if my attached gdb log file(s) do not contain the race condition, I'll later attach some gdb log files using next/step commands.
Comment 13 Roger 2016-12-08 14:29:49 UTC
Created attachment 128380 [details]
gdb-5-step.log

Using next, step, step, step; the following occurred in brevity.  (See gdb-5-step.log for the entire gdb log.)

eip:Error while running hook_stop: Value can't be converted to integer within:

xorg-server-1.18.4/os/WaitFor.c:229
xorg-server-1.18.4/xorg-server-1.18.4/dix/dixutils.c:417
xorg-server-1.18.4/config/udev.c:336
xorg-server-1.18.4/dix/dixutils.c:421
Comment 14 GitLab Migration User 2018-12-13 22:36:32 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/xserver/issues/508.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.