Created attachment 36599 [details] Xorg.0.log

chipset: 8086:0042 Clarkdale IGP
system architecture: x86-64
xf86-video-intel version: 2.11.0
xserver version: 1.8.1.902 (1.8.2 RC 2)
mesa version: 7.8.1
libdrm version: 2.4.20
libdrm2 version: 2.4.20
kernel version: 2.6.35-rc3
HDMI display: Dell U2410
DVI-D display: BenQ MP720P/Acer H5360 projector

When I attach a secondary display to the DVI-D connector, enable it with gnome-display-properties, and later disconnect it, a GPU, driver or X hang is observed. Caps Lock on the keyboard is unresponsive, though there is no problem SSHing in to collect state. The problem reproduces every time, with either of two different displays on the DVI-D connector.
Created attachment 36600 [details] extra dmesg output
Created attachment 36601 [details] output from intel_gpu_dump
Created attachment 36602 [details] output from intel_reg_dumper
output from /proc/dri/0 (bufs was zero length):

clients:a dev pid uid magic ioctls
clients:y 0 2089 1500 1 14764365
clients:y 0 1354 0 0 2465808254
gem_names: name size handles refcount
gem_names:name 1 size 16777216
gem_names: 1 16777216 1 2
gem_names:name 2 size 16777216
gem_names: 2 16777216 1 2
gem_names:name 3 size 16777216
gem_names: 3 16777216 2 4
gem_names:name 4 size 16777216
gem_names: 4 16777216 2 4
gem_names:name 5 size 1048576
gem_names: 5 1048576 2 4
gem_names:name 6 size 131072
gem_names: 6 131072 2 4
gem_names:name 7 size 32768
gem_names: 7 32768 2 3
gem_names:name 8 size 32768
gem_names: 8 32768 2 3
gem_names:name 9 size 32768
gem_names: 9 32768 2 4
gem_names:name 10 size 16777216
gem_names: 10 16777216 2 4
gem_names:name 11 size 16777216
gem_names: 11 16777216 1 3
gem_names:name 12 size 262144
gem_names: 12 262144 2 4
gem_names:name 13 size 524288
gem_names: 13 524288 2 3
gem_names:name 14 size 524288
gem_names: 14 524288 2 4
gem_names:name 15 size 524288
gem_names: 15 524288 2 3
gem_names:name 16 size 8388608
gem_names: 16 8388608 2 4
gem_names:name 17 size 32768
gem_names: 17 32768 2 4
gem_names:name 18 size 32768
gem_names: 18 32768 2 4
gem_names:name 19 size 32768
gem_names: 19 32768 2 4
gem_names:name 20 size 524288
gem_names: 20 524288 2 4
gem_names:name 21 size 524288
gem_names: 21 524288 2 3
gem_names:name 22 size 524288
gem_names: 22 524288 2 3
gem_names:name 23 size 4194304
gem_names: 23 4194304 2 3
gem_names:name 24 size 4194304
gem_names: 24 4194304 2 3
gem_names:name 25 size 4194304
gem_names: 25 4194304 2 4
gem_names:name 26 size 524288
gem_names: 26 524288 2 3
gem_names:name 27 size 524288
gem_names: 27 524288 2 3
gem_names:name 28 size 524288
gem_names: 28 524288 2 3
gem_names:name 29 size 32768
gem_names: 29 32768 2 4
gem_names:name 30 size 32768
gem_names: 30 32768 2 3
gem_names:name 31 size 32768
gem_names: 31 32768 2 4
gem_names:name 32 size 2097152
gem_names: 32 2097152 2 4
gem_names:name 33 size 524288
gem_names: 33 524288 2 3
gem_names:name 34 size 524288
gem_names: 34 524288 2 3
gem_names:name 35 size 524288
gem_names: 35 524288 2 3
gem_names:name 36 size 8388608
gem_names: 36 8388608 2 3
gem_names:name 37 size 524288
gem_names: 37 524288 2 3
gem_names:name 38 size 524288
gem_names: 38 524288 2 3
gem_names:name 39 size 524288
gem_names: 39 524288 2 3
gem_names:name 40 size 262144
gem_names: 40 262144 2 3
gem_names:name 41 size 262144
gem_names: 41 262144 2 3
gem_names:name 42 size 65536
gem_names: 42 65536 2 4
gem_names:name 43 size 32768
gem_names: 43 32768 2 4
gem_objects:1864 objects
gem_objects:250548224 object bytes
gem_objects:8 pinned
gem_objects:26284032 pin bytes
gem_objects:128552960 gtt bytes
gem_objects:234881024 gtt total
name:i915 0000:00:02.0 pci:0000:00:02.0
queues: ctx/flags use fin blk/rw/rwf wait flushed queued locks
vm:slot offset size type flags address mtrr
Nothing indicates a GPU hang in the logs or dump. In fact, no errors are indicated by any of the attachments. If you are using compiz or another GL application at the time of the hang, then it is conceivable that you are hitting the swap+randr bug that was fixed for 2.12. Considering the number of similar bugs fixed, I would suggest you retry with 2.12 and, if it recurs:

1) Check /sys/kernel/debug/dri/0/i915_error_state. This should say "no error state", unless there was a GPU hang.
2) Grab dmesg + Xorg.log.
3) In the case of an X hang, grab a stack trace of all processes. (The challenge is to work out who requested what and why at the time of the hang...)
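The i915_error_state check suggested above is easy to script when collecting state over SSH. The following is only a sketch: the path and the "no error state" wording come from the comment, while the function names `hang_recorded` and `read_error_state` are made up for illustration. It assumes debugfs is mounted and readable (usually root only).

```python
from pathlib import Path
from typing import Optional

# Path quoted in the comment above; needs debugfs mounted and usually root.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")


def hang_recorded(text: str) -> bool:
    """Return True if the text holds anything other than the 'no error state' notice."""
    stripped = text.strip()
    return bool(stripped) and not stripped.lower().startswith("no error state")


def read_error_state(path: Path = ERROR_STATE) -> Optional[str]:
    """Return the file contents, or None if the file cannot be read."""
    try:
        return path.read_text(errors="replace")
    except OSError:
        return None


if __name__ == "__main__":
    state = read_error_state()
    if state is None:
        print(f"could not read {ERROR_STATE}")
    elif hang_recorded(state):
        print("error state present; attach it to the report")
    else:
        print("no GPU error state recorded")
```

An empty file is treated the same as "no error state", which is a guess about what is most useful rather than documented driver behaviour.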
Any update Daniel?
Moved to a new software configuration:

chipset: 8086:0042 Clarkdale IGP
system architecture: x86-64
xf86-video-intel version: 2:2.12.0+git20100628
xserver version: 1:7.5+6
mesa version: 7.9.0+git20100628
libdrm version: 2.4.21+git20100624
libdrm2 version: 2.4.21+git20100624
kernel version: 2.6.35-rc3
HDMI display: Acer H5360 projector
DVI-D display: Dell U2410
mechanism active: modesetting

The problem is still reproducible with the following steps:

1. boot with (eg) DVI-D connected to monitor
2. plug (eg) HDMI display device
3. activate secondary display with gnome-display-properties
4. change resolutions, set/clear clone mode
5. disconnect secondary display
6. run gnome-display-properties (or click 'detect displays')

-> I observe signal going to the primary display and at the right resolution, though the raster output is black except for intermittent red dots in the first few pixel columns on the left of the panel.

dmesg output:

HDMI hot plug event: Pin=5 Presence_Detect=1 ELD_Valid=0
HDMI hot plug event: Pin=5 Presence_Detect=0 ELD_Valid=0
Created attachment 36666 [details] updated Xorg.0.log
Combining the dmesg and Xorg.0.log chronologically, we get:

[ 1472.222] (II) intel(0): Allocated new frame buffer 1024x768 stride 4096, tiled
[ 1485.734749] HDMI hot plug event: Pin=5 Presence_Detect=0 ELD_Valid=0
[ 1486.390] (II) AIGLX: Suspending AIGLX clients for VT switch
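For anyone repeating the interleaving above on longer logs, here is a rough sketch of merging the two files by their bracketed timestamps. It assumes both logs count seconds from roughly the same starting point (as the hand-merged excerpt implies) and that each log is already in time order. The names `stamped` and `merge_logs` are illustrative only.

```python
import heapq
import re

# Both dmesg and Xorg.0.log prefix lines with "[ seconds.fraction]".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def stamped(lines, source):
    """Yield (timestamp, source, line) for every line that carries a timestamp."""
    for line in lines:
        match = STAMP.match(line)
        if match:
            yield float(match.group(1)), source, line.rstrip("\n")


def merge_logs(dmesg_lines, xorg_lines):
    """Merge two individually time-ordered logs into one time-ordered list."""
    return list(heapq.merge(stamped(dmesg_lines, "dmesg"), stamped(xorg_lines, "Xorg")))


# The three lines quoted in this thread.
xorg = [
    "[ 1472.222] (II) intel(0): Allocated new frame buffer 1024x768 stride 4096, tiled",
    "[ 1486.390] (II) AIGLX: Suspending AIGLX clients for VT switch",
]
dmesg = ["[ 1485.734749] HDMI hot plug event: Pin=5 Presence_Detect=0 ELD_Valid=0"]

for stamp, source, line in merge_logs(dmesg, xorg):
    print(f"{source:5s} {line}")
```

Lines without a timestamp (continuation lines, for example) are dropped by this sketch, so it is a triage aid and not a replacement for attaching the full logs.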
Oh there's a VT switch going on here? If so, current X server master has a fix that may help:

commit 28e33ae6f69f716ece5d68e63fc52557236c5f6e
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date: Wed Jun 30 07:59:04 2010 -0700

    OS support: fix writeable client vs IgnoreClient behavior
Hi Jesse et al, and thanks for the feedback. For sure, I manually attempted to switch to the VT to see if the system was responsive, however the timings from the logs look quite close. I'll reproduce without the VT switch and see what logs we get (of course, we'll have the same corruption). Note that the VT switch failed (at least from a graphical perspective). Also, I'll try to reproduce with the updated X server when I can. Thanks so far! Dan
The problem is reproducible without VT switching:

0. run primary display at high (eg native) resolution
1. plug a second display (eg second input on same monitor)
2. run gnome-display-properties/'detect' if already open
3. enable clone mode
4. reduce resolution significantly, apply
5. unplug second display
6. increase resolution back to native

-> bug symptoms: black display with some pixels changing on the left; this remains after stopping and starting X
Damn, ok. I'll see if I can reproduce this today.
Just tried this on my DVI + DP config but couldn't reproduce with the latest bits. I was running with this patch https://bugs.freedesktop.org/attachment.cgi?id=36695 from bug 28365, maybe it's a dupe? Can you confirm?
Created attachment 36718 [details] Handy xrandr test script
Hi Jesse, I'll check out the patch soon. I've cooked up a little python script which reproduces screen corruption (eg see attached) or GPU hangs seemingly from a race condition (locking?) after a minute or two - worthwhile checking there. Either connect a couple of inputs or uncomment the 'force' call and set the desired outputs and run. It's possibly worthwhile adding to your automated testsuite.
Created attachment 36719 [details] example screen corruption
Rebuilding and deploying xserver-xorg-core with FDO patch 36695 doesn't resolve the issue I'm seeing. I can reproduce it with:

$ cat xrandr.sh
#!/bin/bash -x
xrandr --auto
xrandr --output HDMI1 --auto
xrandr --output HDMI1 --mode 800x600
xrandr --output HDMI2 --same-as HDMI1
xrandr --output HDMI2 --off
xrandr --output HDMI1 --auto
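For testers whose outputs are not named HDMI1/HDMI2, the same sequence can be parameterised. This is a sketch only, not the Python script from attachment 36718. The function names and the `dry_run` switch are illustrative, and the xrandr options are exactly those used in xrandr.sh above.

```python
import subprocess


def xrandr_sequence(primary="HDMI1", secondary="HDMI2", low_mode="800x600"):
    """Return the xrandr invocations from xrandr.sh, parameterised by output name."""
    return [
        ["xrandr", "--auto"],
        ["xrandr", "--output", primary, "--auto"],
        ["xrandr", "--output", primary, "--mode", low_mode],
        ["xrandr", "--output", secondary, "--same-as", primary],
        ["xrandr", "--output", secondary, "--off"],
        ["xrandr", "--output", primary, "--auto"],
    ]


def run_sequence(commands, dry_run=True):
    """Echo each command like 'bash -x'; only execute it when dry_run is False."""
    for argv in commands:
        print("+", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default so the sequence can be inspected before it touches the display.
    run_sequence(xrandr_sequence())
```

On another machine the output names reported by a plain `xrandr` call would be passed in, for example `xrandr_sequence("DVI1", "VGA1")`.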
Problem reproduces on 2.6.35-rc4 also. I have noticed that when the issue occurs (ie there is a faint line in the top left and some pixels down the left hand side, otherwise black), I can switch back to the previous (and working) mode, so this looks like a mode timing calculation issue, or a raster unit misconfiguration. It's worth a try on some other test systems with larger panels and multiple inputs, if not tried already. Info supplied.
Gordon, is this something you can reproduce?
This could be a dupe of 28998, can someone try the patch in the last comment of that bug and see if it helps?
Rebuilding and deploying xorg-server 1.8.1.902 with Keith's commit e27d95f1ab4beaf7eea3d5ddb1001c22da3d0bda ('Unwrap/rewrap EnterVT/LeaveVT completely') on 2.6.35-rc5, the problem still persists.

The (main) resulting issue is consistently reproducible using the xrandr.sh script in comment 18 and two inputs into my 1920x1200 panel; however, using the xrandr exercise-a-tron in comment 15, I often get different screen corruption, as per the previously attached screenshot. Let me know if a screenshot of the main issue I'm reporting is worth anything.

Note that nothing has locked up and the system is still functional, so I can type blind and switch mode again - shall I compare register dumps in the good and bad cases? It feels like a race reprogramming the two GPU rasteriser units, as this doesn't occur when one output is used, and restarting X doesn't stop the corruption.

Core i5 661 stepping 2 from /proc/cpuinfo; Core Processor Integrated Graphics Controller rev 12 from lspci.

Thanks for the help so far, Jesse. It would be great if Gordon or you are able to reproduce this.
Yes, I can reproduce with the upstream code. I'll see if I can narrow down the key step in the reproduction sequence.
Yi (CCed) did more testing and here are his findings.

The reproducing steps:

1. connect 2 monitors. Both get displayed. (changing mode is not required here)
2. unplug 1 monitor.
3. change mode on the remaining monitor.

Result: GPU hang (with i915_error_state attached).

A compositing WM (e.g. compiz) is needed to reproduce, started either before step 1 or after step 3. As dmesg shows, this is caused by page-flipping, and we confirm that disabling page-flipping avoids the hang.

This problem happens on both our Piketon (with Clarkdale cpu) and G45. We are using kernel 2.6.35-rc5 and xserver master.
Created attachment 37059 [details] i915_error_state on G45
Created attachment 37060 [details] dmesg on G45
Created attachment 37061 [details] X log on G45
More than just the usual WAIT_FOR_EVENT hang:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffffa008c7d3>] intel_crtc_page_flip+0xc9/0x39c [i915]
PGD 114724067 PUD 1145bd067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1b.0/sound/card0/uevent
CPU 0
Modules linked in: fuse bridge stp bnep sco l2cap crc16 bluetooth rfkill sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table mperf dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer firewire_ohci snd firewire_core iTCO_wdt iTCO_vendor_support ata_generic pata_acpi i2c_i801 soundcore snd_page_alloc r8169 crc_itu_t mii pata_marvell floppy serio_raw pcspkr sg sd_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd i915 drm_kms_helper drm i2c_algo_bit button i2c_core video output [last unloaded: microcode]

Pid: 10954, comm: X Not tainted 2.6.35-rc5_stable_20100714+ #1 P5Q-EM/P5Q-EM
RIP: 0010:[<ffffffffa008c7d3>] [<ffffffffa008c7d3>] intel_crtc_page_flip+0xc9/0x39c [i915]
RSP: 0018:ffff880114927cc8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88012df48320 RCX: ffff88010c945600
RDX: ffff880001a109c8 RSI: ffff88010c945840 RDI: ffff88012df48320
RBP: ffff880114927d18 R08: ffff88012df48280 R09: ffff88012df48320
R10: 0000000003c2e0b0 R11: 0000000000003246 R12: ffff88010c945840
R13: ffff88012df48000 R14: 0000000000000060 R15: ffff88012dbb8000
FS: 00007f9e6078e830(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000058 CR3: 00000001177a8000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 10954, threadinfo ffff880114926000, task ffff88012a4a1690)
Stack:
 ffff88010c945600 ffff880115b176c0 ffff88012db10000 0000000000000246
<0> fffffff40006101c ffff88010c945600 00000000ffffffea ffff88010c945600
<0> ffff88012df48320 ffff88011b4b6780 ffff880114927d78 ffffffffa003bd0e
Call Trace:
 [<ffffffffa003bd0e>] drm_mode_page_flip_ioctl+0x1bc/0x214 [drm]
 [<ffffffffa00311fc>] drm_ioctl+0x25e/0x35e [drm]
 [<ffffffffa003bb52>] ? drm_mode_page_flip_ioctl+0x0/0x214 [drm]
 [<ffffffff810f1c3c>] vfs_ioctl+0x2a/0x9e
 [<ffffffff810f227e>] do_vfs_ioctl+0x531/0x565
 [<ffffffff810f2307>] sys_ioctl+0x55/0x77
 [<ffffffff810e56d6>] ? sys_read+0x47/0x6f
 [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Code: 45 d4 f4 ff ff ff 0f 84 e0 02 00 00 48 8b 4d b0 49 8d 9d 20 03 00 00 48 89 df 49 89 4c 24 38 49 8b 07 49 89 44 24 20 49 8b 47 20 <48> 8b 40 58 49 c7 04 24 00 00 00 00 49 c7 44 24 18 a9 a5 08 a0
RIP [<ffffffffa008c7d3>] intel_crtc_page_flip+0xc9/0x39c [i915]
 RSP <ffff880114927cc8>
CR2: 0000000000000058
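As a cross-check on the oops above, the faulting instruction can be read straight off the Code: line. The bytes at the `<48>` marker are `48 8b 40 58`, which on x86-64 encode `mov 0x58(%rax),%rax`. With RAX = 0 that load touches address 0x58, matching CR2. The sketch below does the arithmetic on a trimmed excerpt of the oops. The helper names are made up, and which source-level pointer was NULL is not something the oops alone establishes.

```python
import re

# Trimmed excerpt of the oops in this thread (only the lines the check needs).
OOPS = """\
RAX: 0000000000000000 RBX: ffff88012df48320 RCX: ffff88010c945600
CR2: 0000000000000058 CR3: 00000001177a8000 CR4: 00000000000406f0
Code: 49 8b 47 20 <48> 8b 40 58 49 c7 04 24 00 00 00 00
"""


def register(oops: str, name: str) -> int:
    """Return the value of the first 'NAME: <16 hex digits>' field in the oops text."""
    match = re.search(rf"\b{name}:\s*([0-9a-fA-F]{{16}})\b", oops)
    if match is None:
        raise ValueError(f"{name} not found")
    return int(match.group(1), 16)


def faulting_bytes(oops: str) -> list:
    """Return the instruction bytes from the '<xx>' marker on the Code: line onwards."""
    line = next(l for l in oops.splitlines() if l.lstrip().startswith("Code:"))
    tokens = line.split()[1:]
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


fault = faulting_bytes(OOPS)
rax = register(OOPS, "RAX")
cr2 = register(OOPS, "CR2")

# 48 8b 40 <disp8> is "mov disp8(%rax),%rax": a 64-bit load through RAX.
if fault[:3] == [0x48, 0x8B, 0x40]:
    displacement = fault[3]
    print(f"load from RAX+{displacement:#x} = {rax + displacement:#x}, CR2 = {cr2:#x}")
```

So the oops is a read of a field at offset 0x58 through a NULL pointer inside intel_crtc_page_flip, which is at least consistent with a flip being requested against state that is no longer set up.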
Created attachment 37158 [details] [review] Prevent the OOPS from flipping an unbound fb
Chris is my hero.
(In reply to comment #24)
> This problem happens on both our Piketon (with Clarkdale cpu) and G45.

Was the problem just limited to i965+ or could it be triggered with i945?
The issue can't be reproduced on 945GM.
Haven't reproduced yet on my t61, so either it is g4x+ or desktop specific, or I just fail at reproducing bugs. Will have to wait until I have the h/w to reproduce with my g45.

I think the core issue is a race between WAIT_FOR_EVENT and modeset, for which Jesse had the idea of triggering the event prior to the modeset and then relying on hardware to dtrt after the pipe change. That still sounds racy to me, and I think we need to be idling the gpu prior to modeset. An alternative is to move the WAIT_FOR_EVENT to an ioctl and prevent the race condition in the kernel by taking the mode lock. Ugh. That is not a solution...

Jesse, if you have time to work on this this week, be my guest.
Hi Chris - did you give the tests at comment#18 and comment#15 a shot?
=0 crestline:~$ ./randr-bug28811.sh
+ xrandr --auto
+ xrandr --output LVDS1 --off --output DVI1 --auto
+ xrandr --output LVDS1 --off --output DVI1 --mode 800x600
+ xrandr --output LVDS1 --off --output VGA1 --same-as DVI1
+ xrandr --output LVDS1 --off --output VGA1 --off
+ xrandr --output LVDS1 --off --output DVI1 --auto
=0 crestline:~$ cat /sys/kernel/debug/dri/0/i915_error_state
no error state collected
=0 crestline:~$ ps ax | grep compiz
2171 ? S 0:03 compiz --ignore-desktop-hints glib gconf gnomecompat --replace
=0 crestline:~$

So I can't be sure if the different output configuration is its saving grace or the different h/w.
If the script at comment#15 doesn't reproduce any problem (ie https://bugs.freedesktop.org/attachment.cgi?id=36718 ), then there is every chance you'll need newer hardware to reproduce. You could try with enabling force_output() in the reproducer and specialising the outputs to what your hardware provides.
http://cgit.freedesktop.org/~ickle/drm-intel/log/?h=drm-intel-next contains a couple of patches that should in theory prevent the hang. Not ideal: they fix up the hang after the fact, and we should be striving not to hang in the first place.
Tree moved to: git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-next
I think I have the fix for this in -next:

commit 265db9585e570814d2f7aca109c5563bcde9c948
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Sep 20 15:41:01 2010 +0100

    drm/i915: Drain any pending flips on the fb prior to unpinning

    If we have queued a page flip on the current fb and then request a mode
    change, wait until the page flip completes before performing the new
    request.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>