Bug 58594 - Dual Monitor eventually freezes
Summary: Dual Monitor eventually freezes
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
Reported: 2012-12-21 02:46 UTC by Keith McClelland
Modified: 2013-02-27 00:48 UTC


Attachments
Reg dump after dual monitor hang (w/ AccelMethod SNA) (13.55 KB, text/plain)
2013-01-03 20:37 UTC, Younes Manton
auth.log (59.04 KB, text/x-log)
2013-01-29 16:47 UTC, Keith McClelland
syslog (1.47 KB, application/octet-stream)
2013-01-29 16:50 UTC, Keith McClelland
wtmp (283.50 KB, application/octet-stream)
2013-01-29 16:51 UTC, Keith McClelland
lastlog (285.44 KB, application/octet-stream)
2013-01-29 16:52 UTC, Keith McClelland
Xorg.0.log (26.93 KB, text/plain)
2013-01-29 16:52 UTC, Keith McClelland
pm-powersave.log (407.77 KB, text/x-log)
2013-01-29 16:53 UTC, Keith McClelland
syslog.1 (1.38 MB, application/octet-stream)
2013-01-29 16:55 UTC, Keith McClelland
udev (246.44 KB, text/plain)
2013-01-29 16:56 UTC, Keith McClelland
kern.log (1.05 MB, text/plain)
2013-01-29 17:01 UTC, Keith McClelland
boot.log (2.75 KB, text/x-log)
2013-01-29 17:02 UTC, Keith McClelland
dmesg (45.98 KB, text/plain)
2013-01-29 17:03 UTC, Keith McClelland


Description Keith McClelland 2012-12-21 02:46:27 UTC
OS: Ubuntu 12.10 (Quantal Quetzal).
Version of xserver-xorg-video-intel: 2:2.20.9-0ubuntu2
Computer: HP Mini 110
Video chips: VGA compatible controller: Intel Corporation Mobile 945GSE Express Integrated Graphics Controller (rev 03) and Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)

I can set up dual screens and everything works pretty nicely for a while (with the restriction that no resolution above 800x600 can be used on the VGA). However, before long everything freezes up. The mouse pointer can still wander freely across both screens, and I can prove by careful preparation that the keyboard is still sending keys to the selected window. I suspect that the actual event is triggered by a keystroke or mouse movement because it doesn't seem to happen if you just set it up and wait for failure.

Only two things can change the video display. (1) CTL-ALT-F1 blanks the VGA display but then the usual login message is not written to it. (2) the inevitable "press and hold the power button" causes blank screens when the power finally goes off.

Note that this happens with either the installed 12.10 or a live DVD of 12.10. It does not appear to be a problem with Ubuntu 12.04 (tested using a live CD). 12.04 uses xserver-xorg-video-intel version 2:2.17.0-1ubuntu4.

I'd like to try swapping drivers but am unwilling to go down that path without advice.

Thanks.
Comment 1 Keith McClelland 2012-12-21 03:10:29 UTC
This is also reported in Launchpad bug 1079440.
Comment 2 Chris Wilson 2012-12-21 08:02:51 UTC
To be frank, we fixed a lot of issues in the upstream kernel, so please do look for the current set of drivers in the xorg-edgers PPA and either a mainline 3.7 kernel or drm-intel-experimental (which tracks our upstream branch).
Comment 3 Keith McClelland 2012-12-24 01:10:33 UTC
Okay,

I've loaded up a nightly build of 3.7.0 (http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/current/ -- the one I got was "3.7.0-994.201212210409"). That didn't change things. Then I set up the xorg-edgers PPA as described in "https://launchpad.net/~xorg-edgers/+archive/ppa/+index?batch=75". [That brings in yet another 3.7.0 kernel.]

With all the new xorg software, using the current kernel or either of the 3.7.0 kernels, nothing seems to be broken that didn't used to be broken, but the problem still occurs after a few minutes.

The effect of CTL-ALT-F1 (virtual terminal) is inconsistent, so I can't rely on that to upload more details. I could probably set something up on a regular terminal that could be sent by a few simple keystrokes after the freeze occurs. I don't know what that would be, perhaps some form of apport. If someone would like me to do this, please suggest a method.

If developers would like to try to find this on their own systems, let me point out that both of the reports in launchpad bug 1079440 concern computers with the same combination of i945 chips.

My setup is simple: (1) use a terminal to run a perl one-liner that squirts a running count to the screen, then position that on the boundary of the two displays; (2) set up a system monitor so that the activity graph is scrolling across both monitors; (3) set up an emacs editor and select its window; (4) wait a few minutes, typing into emacs as you please -- then save your scribblings after the freeze.

Let me know what more I can do to help you find this.
Comment 4 Younes Manton 2013-01-03 20:37:49 UTC
Created attachment 72479 [details]
Reg dump after dual monitor hang (w/ AccelMethod SNA)
Comment 5 Younes Manton 2013-01-03 20:39:24 UTC
I see a very similar problem on a Thinkpad T400. Using the laptop panel everything is fine, but plugging in a monitor eventually leads to a very hard hang. The monitor's resolution is correctly detected in my case, and the GPU is identified as "Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])".

I can hit the problem within 5 minutes using AccelMethod SNA; using UXA can give me several hours on average, but eventually it too dies. In both cases:

cat /sys/kernel/debug/dri/0/i915_error_state 
no error state collected

I did grab a register dump after it died using SNA, if that's of any help.
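
For anyone repeating this check, here is a minimal sketch of saving the error state only when the kernel actually recorded one. The debugfs path is the one quoted above; the function name `save_error_state` and the output location are assumptions made for illustration, and reading debugfs normally requires root.

```shell
# Hypothetical helper (not from the original report).
# save_error_state SRC DEST: write the contents of SRC to DEST unless SRC
# only holds the "no error state collected" placeholder.
# Returns 1 if nothing was saved.
save_error_state() {
    src=$1
    dest=$2
    [ -r "$src" ] || return 1
    if head -n 1 "$src" | grep -q '^no error state collected'; then
        return 1
    fi
    # cat rather than cp: debugfs files report a size of 0.
    cat "$src" > "$dest"
}

# Example (as root, after a hang):
#   save_error_state /sys/kernel/debug/dri/0/i915_error_state /tmp/i915_error_state.txt
```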
Comment 6 Keith McClelland 2013-01-05 02:15:15 UTC
I've made a breakthrough in understanding this problem. It seems to give me a 100% workaround but of course I can't be sure.

The solution is to use "taskset" to force the "Xorg" and "compiz" processes to always be run on the same CPU. The simple terminal commands are:

taskset      -pa 1 $(pgrep -x compiz)
sudo taskset -pa 1 $(pgrep -x Xorg)

You might need to do this before the second monitor is connected.
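
One detail worth keeping in mind when reproducing the commands above: without `-c`, `taskset -p` reads its first argument as a hexadecimal bitmask rather than a CPU number, so the `1` selects logical CPU 0 (bit 0). The sketch below wraps the same idea in two helpers. The names `cpu_mask` and `pin_to_cpu` are made up for illustration and are not part of the original workaround.

```shell
# Hypothetical helpers (not from the original report).

# cpu_mask N: print the hex affinity mask that selects only logical CPU N
# (0-based), e.g. CPU 0 gives 0x1 and CPU 1 gives 0x2.
cpu_mask() {
    printf '0x%x\n' "$((1 << $1))"
}

# pin_to_cpu NAME N: pin every thread (-a) of every process named exactly
# NAME to logical CPU N. Must run as root for processes owned by other
# users, such as Xorg.
pin_to_cpu() {
    for pid in $(pgrep -x "$1"); do
        taskset -a -p "$(cpu_mask "$2")" "$pid"
    done
}

# Example, equivalent to the mask-1 commands above:
#   pin_to_cpu compiz 0
#   pin_to_cpu Xorg 0      # as root
```

Looping over the PIDs also covers the case where `pgrep` finds more than one matching process, since `taskset -p` accepts a single PID per call.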

I have run for several hours both in a quiet state and with as much complexity and CPU bashing as the computer can reasonably handle. The high CPU loading makes the system sluggish, but it does not fail.

Good luck turning this workaround into a real fix!
Comment 7 Keith McClelland 2013-01-29 00:59:31 UTC
Note that the related launchpad bug is beginning to get some attention based on finding that even kernel 3.8-rc5 doesn't help.

I have a refinement for my workaround. I supposed that the important thing was to keep Xorg and compiz on the same CPU. But that is not the case. The most important thing seems to be to run Xorg on CPU 1. Note that my computer is an Intel Atom N280 whose two cores are not advertised to be identical. I'd be suspicious of hairy signal timing based on this table. (Xorg 1 means "taskset" Xorg to CPU 1 etc.):

||                 || Xorg 1 || Xorg 2 || Xorg unpinned||
|| Compiz 1        || WORKS  || FAILS  || WORKS        ||
|| Compiz 2        || WORKS  || FAILS  || FAILS        ||
|| Compiz unpinned || WORKS  || FAILS  || FAILS        ||

All the cases that "WORK" have run with dual monitors and heavy use for more than an hour; many hours for the Xorg 1 cases. The ones that "FAIL" have never run more than a half hour or so under the same conditions.

Hope this helps.
Comment 8 Daniel Vetter 2013-01-29 10:28:26 UTC
Younes Manton, can you please file a separate bug for your issue? You have a completely different platform, so rather likely you hit a different bug.
Comment 9 Daniel Vetter 2013-01-29 10:31:32 UTC
LP link for reference: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1079440

Keith, can you try to log in via ssh and grab logfiles once the system freezes up like you describe? Since mouse still works, it's likely just a gfx render issue and everything else works.
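
A rough sketch of what such a log grab could look like once logged in over ssh. The function name `collect_logs` and the output directory are assumptions; the file paths in the example are the usual Ubuntu locations and may differ on other systems.

```shell
# Hypothetical helper (not from the original report).
# collect_logs OUTDIR FILE...: copy each readable FILE into OUTDIR and save
# the kernel ring buffer next to them. Unreadable files are skipped.
collect_logs() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for f in "$@"; do
        if [ -r "$f" ]; then
            cp "$f" "$outdir/"
        fi
    done
    # dmesg may be restricted for ordinary users; do not fail on that.
    dmesg > "$outdir/dmesg.txt" 2>/dev/null || true
}

# Example, run from the ssh session while the display is frozen:
#   collect_logs "$HOME/freeze-logs" /var/log/Xorg.0.log /var/log/kern.log /var/log/syslog
```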
Comment 10 Keith McClelland 2013-01-29 16:47:36 UTC
Created attachment 73851 [details]
auth.log
Comment 11 Keith McClelland 2013-01-29 16:48:50 UTC
Okay, here are all the logs I can find that cover the time period of this failure.

Background. After I hooked up the monitor, I started the following perl one-liner:

perl -e'my $x = 0; int $x;while(1){++$x;print"$x\n" unless $x % 1000000}'

This spits out a new million about twice a second. I then started the gnome-system-monitor and positioned the terminal window and the monitor window so that both of them were regularly updating both monitors. The video hung at 8:26 with a count of 2.2 billion. I was doing other things and didn't get back to issue this report until after 11:00. By that time the perl process had run at almost 100% CPU usage for more than 200 minutes and thus many billions more counts. Xorg was moving, but slowly, only 12 minutes total. Compiz was stopped dead at 5 minutes.
Comment 12 Keith McClelland 2013-01-29 16:50:09 UTC
Created attachment 73852 [details]
syslog
Comment 13 Keith McClelland 2013-01-29 16:51:09 UTC
Created attachment 73854 [details]
wtmp
Comment 14 Keith McClelland 2013-01-29 16:52:04 UTC
Created attachment 73855 [details]
lastlog
Comment 15 Keith McClelland 2013-01-29 16:52:57 UTC
Created attachment 73856 [details]
Xorg.0.log
Comment 16 Keith McClelland 2013-01-29 16:53:59 UTC
Created attachment 73857 [details]
pm-powersave.log
Comment 17 Keith McClelland 2013-01-29 16:55:15 UTC
Created attachment 73858 [details]
syslog.1
Comment 18 Keith McClelland 2013-01-29 16:56:07 UTC
Created attachment 73859 [details]
udev
Comment 19 Keith McClelland 2013-01-29 17:01:52 UTC
Created attachment 73860 [details]
kern.log
Comment 20 Keith McClelland 2013-01-29 17:02:50 UTC
Created attachment 73861 [details]
boot.log
Comment 21 Keith McClelland 2013-01-29 17:03:44 UTC
Created attachment 73862 [details]
dmesg
Comment 22 Keith McClelland 2013-01-29 17:09:13 UTC
One more thing I should add about this sequence.

I have added 'taskset -pa 1 $(pgrep -x Xorg)' to the rcX.d set of things. So initially the computer came up with my workaround in place. I reversed that with 'sudo taskset -pa 3 $(pgrep -x Xorg)' before connecting the external monitor.
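
For readers unsure why `3` undoes the pinning: `taskset -p` treats the value as a hexadecimal bitmask, so `1` allows only logical CPU 0 while `3` (binary 11) allows CPUs 0 and 1, which is every CPU on a machine with two logical CPUs. A small sketch for decoding such masks follows; the helper name `mask_to_cpus` is hypothetical.

```shell
# Hypothetical helper (not from the original report).
# mask_to_cpus MASK: print the 0-based logical CPU numbers selected by a
# hexadecimal affinity MASK (with or without a 0x prefix), space separated.
mask_to_cpus() {
    mask=$((0x${1#0x}))
    cpu=0
    out=
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="$out${out:+ }$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    printf '%s\n' "$out"
}

# Example:
#   mask_to_cpus 1    # CPU 0 only (the workaround)
#   mask_to_cpus 3    # CPUs 0 and 1 (workaround reversed)
```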

All of this was about an hour after the computer was booted up at 6:57.
Comment 23 Chris Wilson 2013-02-20 14:35:14 UTC
Can you please try drm-intel-next with

commit 21ad833075801a7cd81b5ef1604ffc6c600e5ff9
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Tue Feb 19 15:16:39 2013 +0200

    drm/i915: Fix races in gen4 page flip interrupt handling
Comment 24 Chris Wilson 2013-02-22 14:09:29 UTC
We have good indication from other bugs that the race fix is indeed good. Preemptively closing...
Comment 25 Keith McClelland 2013-02-25 01:46:04 UTC
I couldn't find a "commit" with the number Chris gave. But I loaded up the "3.8.0-997_3.8.0-997.201302180432" kernel from "drm-intel-next". However, despite the fact that it seems to have built and installed okay, it dies with an almost-all-black screen a few seconds after it starts initializing the ramfs. I repeated it all and got the same results. Both downloads of the parts gave the same md5sums as follows:

8b88e669a4117f72e58d57481247e936  linux-headers-3.8.0-997_3.8.0-997.201302180432_all.deb
86c4099dfb41e40d7e976493681a2af9  linux-headers-3.8.0-997-generic_3.8.0-997.201302180432_i386.deb
ec53ae004b3c74f097d31dfc61ffeb6c  linux-image-3.8.0-997-generic_3.8.0-997.201302180432_i386.deb
057040a505e5c463cc99b42f8c814d4b  linux-image-extra-3.8.0-997-generic_3.8.0-997.201302180432_i386.deb

???
Comment 26 Chris Wilson 2013-02-25 10:30:42 UTC
(In reply to comment #25)
> I couldn't find a "commit" with the number Chris gave. But I loaded up the
> "3.8.0-997_3.8.0-997.201302180432" kernel from "drm-intel-next". However,
> despite the fact that it seems to have built and installed okay, it dies
> with an almost-all-black screen a few seconds after it starts initializing
> the ramfs. I repeated it all and got the same results. Both downloads of the
> parts gave the same md5sums as follows:

That's a completely different problem. And a critical one to boot. Normally if it breaks that early it is because the initramfs is broken and needs to be rebuilt. Try passing nomodeset to your kernel as it boots and see if that makes diagnosing the problem easier.

Besides, the date on the kernel is earlier than the patch I referenced to fix the original issue.
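
The date comparison can be sketched as follows, assuming the trailing number in these package versions is a build stamp of the form YYYYMMDDHHMM (which matches the "Feb 18" reading of 201302180432). The function names are hypothetical. Note that a build dated after the commit is a necessary condition only, since the commit may have been merged into the branch later than its author date.

```shell
# Hypothetical helpers (not from the original report).

# build_stamp VERSION: print the trailing YYYYMMDDHHMM stamp of a version
# string such as "3.8.0-997.201302180432".
build_stamp() {
    printf '%s\n' "${1##*.}"
}

# built_on_or_after VERSION YYYYMMDD: succeed if the build date embedded in
# VERSION is not earlier than the given date.
built_on_or_after() {
    stamp=$(build_stamp "$1")
    day=${stamp%????}        # drop HHMM, leaving YYYYMMDD
    [ "$day" -ge "$2" ]
}

# Example: the commit is dated 2013-02-19, the build 2013-02-18.
#   built_on_or_after 3.8.0-997.201302180432 20130219 || echo "build predates the fix"
```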
Comment 27 Keith McClelland 2013-02-25 19:17:23 UTC
Since Chris pointed out that the drm-intel-next build (Feb 18) is too old for this fix (never mind that it also seems to exhibit a critical bug on my computer), I tried the latest build from drm-intel-nightly (Feb 23). No critical bug there.

However, it wastes very little time -- a few seconds -- before freezing up. This may be a slightly different freeze as well because the cursor was not responsive. It also didn't do CTL-ALT-F1 though that was never a reliable thing in the past during this freeze.
Comment 28 Chris Wilson 2013-02-25 22:15:14 UTC
Yikes. You're not having much fun are you? :(

Do you happen to be able to set up a netconsole and grab the dmesg leading to the hard hang?
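
A minimal sketch of a netconsole setup, for reference. The parameter layout follows the kernel's netconsole documentation (`src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac`); the helper name, the addresses, the interface name and the port are all placeholders to be replaced with real values.

```shell
# Hypothetical helper (not from the original report).
# netconsole_param SRC_IP DEV TGT_IP [TGT_MAC] [PORT]: print a netconsole=
# module parameter that sends kernel messages from DEV on this machine to
# TGT_IP. An empty TGT_MAC makes the kernel fall back to broadcast.
netconsole_param() {
    port=${5:-6666}
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$port" "$1" "$2" "$port" "$3" "${4:-}"
}

# Example with placeholder addresses. On the machine that hangs (as root):
#   modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20)"
# On the receiving machine, listen for UDP on the same port, for instance
# with netcat (the exact flags depend on the netcat variant installed):
#   nc -u -l 6666
```

A wired interface is the safer choice here, since netconsole needs a network driver that can still transmit while the rest of the system is hanging.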
Comment 29 Keith McClelland 2013-02-27 00:48:57 UTC
Fun has returned. I don't know why the Feb 18 drm-intel-next insists on crashing and I have no idea why the first attempt at the Feb 23 drm-intel-nightly was a failure, but I returned to using it (3.8.0-994-generic) with complete success later on. A 5-hour run, an 8-hour run, and it is currently running and working fine with 2 monitors.

So I agree that the problem is both resolved and fixed. Thanks!

