Bug 28811

Summary: [i965 page-flipping] GPU hang when modeset after unplugging another monitor (under compiz)
Product: DRI Reporter: Daniel J Blueman <daniel>
Component: DRM/Intel Assignee: Chris Wilson <chris>
Status: CLOSED FIXED QA Contact:
Severity: major    
Priority: medium CC: gordon.jin, yi.sun
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                                    Flags
Xorg.0.log                                     none
extra dmesg output                             none
output from intel_gpu_dump                     none
output from intel_reg_dumper                   none
updated Xorg.0.log                             none
Handy xrandr test script                       none
example screen corruption                      none
i915_error_state on G45                        none
dmesg on G45                                   none
X log on G45                                   none
Prevent the OOPS from flipping an unbound fb   none

Description Daniel J Blueman 2010-06-29 03:53:56 UTC
Created attachment 36599 [details]
Xorg.0.log

chipset: 8086:0042 Clarkdale IGP
system architecture: x86-64
xf86-video-intel version: 2.11.0
xserver version: 1.8.1.902 (1.8.2 RC 2)
mesa version: 7.8.1
libdrm version: 2.4.20
libdrm2 version: 2.4.20
kernel version: 2.6.35-rc3
HDMI display: Dell U2410
DVI-D display: BenQ MP720P/Acer H5360 projector

After attaching a secondary display to the DVI-D connector, enabling it with gnome-display-properties, and later disconnecting it, a GPU, driver or X hang is observed. Caps Lock on the keyboard is unresponsive, though there is no problem SSHing in to collect state.

The problem reproduces every time and with two different displays on the DVI-D connector.
Comment 1 Daniel J Blueman 2010-06-29 03:54:52 UTC
Created attachment 36600 [details]
extra dmesg output
Comment 2 Daniel J Blueman 2010-06-29 03:57:10 UTC
Created attachment 36601 [details]
output from intel_gpu_dump
Comment 3 Daniel J Blueman 2010-06-29 03:57:55 UTC
Created attachment 36602 [details]
output from intel_reg_dumper
Comment 4 Daniel J Blueman 2010-06-29 03:59:58 UTC
output from /proc/dri/0 (bufs was zero length):
clients:a dev	pid    uid	magic	  ioctls
clients:y   0  2089  1500          1   14764365
clients:y   0  1354     0          0 2465808254
gem_names:  name     size handles refcount
gem_names:name 1 size 16777216
gem_names:     1 16777216       1        2
gem_names:name 2 size 16777216
gem_names:     2 16777216       1        2
gem_names:name 3 size 16777216
gem_names:     3 16777216       2        4
gem_names:name 4 size 16777216
gem_names:     4 16777216       2        4
gem_names:name 5 size 1048576
gem_names:     5  1048576       2        4
gem_names:name 6 size 131072
gem_names:     6   131072       2        4
gem_names:name 7 size 32768
gem_names:     7    32768       2        3
gem_names:name 8 size 32768
gem_names:     8    32768       2        3
gem_names:name 9 size 32768
gem_names:     9    32768       2        4
gem_names:name 10 size 16777216
gem_names:    10 16777216       2        4
gem_names:name 11 size 16777216
gem_names:    11 16777216       1        3
gem_names:name 12 size 262144
gem_names:    12   262144       2        4
gem_names:name 13 size 524288
gem_names:    13   524288       2        3
gem_names:name 14 size 524288
gem_names:    14   524288       2        4
gem_names:name 15 size 524288
gem_names:    15   524288       2        3
gem_names:name 16 size 8388608
gem_names:    16  8388608       2        4
gem_names:name 17 size 32768
gem_names:    17    32768       2        4
gem_names:name 18 size 32768
gem_names:    18    32768       2        4
gem_names:name 19 size 32768
gem_names:    19    32768       2        4
gem_names:name 20 size 524288
gem_names:    20   524288       2        4
gem_names:name 21 size 524288
gem_names:    21   524288       2        3
gem_names:name 22 size 524288
gem_names:    22   524288       2        3
gem_names:name 23 size 4194304
gem_names:    23  4194304       2        3
gem_names:name 24 size 4194304
gem_names:    24  4194304       2        3
gem_names:name 25 size 4194304
gem_names:    25  4194304       2        4
gem_names:name 26 size 524288
gem_names:    26   524288       2        3
gem_names:name 27 size 524288
gem_names:    27   524288       2        3
gem_names:name 28 size 524288
gem_names:    28   524288       2        3
gem_names:name 29 size 32768
gem_names:    29    32768       2        4
gem_names:name 30 size 32768
gem_names:    30    32768       2        3
gem_names:name 31 size 32768
gem_names:    31    32768       2        4
gem_names:name 32 size 2097152
gem_names:    32  2097152       2        4
gem_names:name 33 size 524288
gem_names:    33   524288       2        3
gem_names:name 34 size 524288
gem_names:    34   524288       2        3
gem_names:name 35 size 524288
gem_names:    35   524288       2        3
gem_names:name 36 size 8388608
gem_names:    36  8388608       2        3
gem_names:name 37 size 524288
gem_names:    37   524288       2        3
gem_names:name 38 size 524288
gem_names:    38   524288       2        3
gem_names:name 39 size 524288
gem_names:    39   524288       2        3
gem_names:name 40 size 262144
gem_names:    40   262144       2        3
gem_names:name 41 size 262144
gem_names:    41   262144       2        3
gem_names:name 42 size 65536
gem_names:    42    65536       2        4
gem_names:name 43 size 32768
gem_names:    43    32768       2        4
gem_objects:1864 objects
gem_objects:250548224 object bytes
gem_objects:8 pinned
gem_objects:26284032 pin bytes
gem_objects:128552960 gtt bytes
gem_objects:234881024 gtt total
name:i915 0000:00:02.0 pci:0000:00:02.0
queues:  ctx/flags   use   fin   blk/rw/rwf  wait    flushed	   queued      locks
vm:slot	 offset	      size type flags	 address mtrr
Comment 5 Chris Wilson 2010-06-30 01:13:13 UTC
Nothing indicates a GPU hang in the logs or dump. In fact, no errors are indicated by any of the attached.

If you are using compiz or another GL application at the time of the hang, then it is conceivable that you are hitting the swap+randr bug that was fixed for 2.12. Considering the number of similar bugs fixed, I would suggest you retry with 2.12 and, if it recurs:

1) Check /sys/kernel/debug/dri/0/i915_error_state

This should say "no error state", unless there was a GPU hang.

2) grab dmesg + Xorg.log

3) in the case of an X hang, a stack trace of all processes. (The challenge is to work out who requested what and why at the time of the hang...)
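Step 1 of the checklist above can be scripted over SSH when the display is unusable. The following is a minimal Python sketch, not part of any Intel tooling: the function names are made up for illustration, the card index 0 is taken from the path quoted above, and it assumes the idle case is reported by a line beginning with "no error state".

```python
from pathlib import Path

# Path quoted in the comment above; the card index (0) may differ per system.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")


def gpu_hang_recorded(text: str) -> bool:
    """Return True if the error-state text looks like a captured hang.

    Assumption: an idle driver reports a line starting with
    "no error state"; anything else non-empty is treated as a dump.
    """
    stripped = text.strip().lower()
    return bool(stripped) and not stripped.startswith("no error state")


def read_error_state(path: Path = ERROR_STATE) -> str:
    """Read the debugfs file (usually needs root and a mounted debugfs)."""
    return path.read_text(errors="replace")


if __name__ == "__main__":
    try:
        state = read_error_state()
    except OSError as exc:
        print(f"could not read {ERROR_STATE}: {exc}")
    else:
        if gpu_hang_recorded(state):
            print("GPU hang recorded; attach the file contents to the bug")
        else:
            print("no GPU hang recorded")
```

If a hang is reported, the full file contents are what should be attached to the bug, together with dmesg and the X log.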
Comment 6 Jesse Barnes 2010-07-01 14:13:25 UTC
Any update Daniel?
Comment 7 Daniel J Blueman 2010-07-01 15:53:13 UTC
moved to new software configuration:

chipset: 8086:0042 Clarkdale IGP
system architecture: x86-64
xf86-video-intel version: 2:2.12.0+git20100628
xserver version: 1:7.5+6
mesa version: 7.9.0+git20100628
libdrm version: 2.4.21+git20100624
libdrm2 version: 2.4.21+git20100624
kernel version: 2.6.35-rc3
HDMI display: Acer H5360 projector
DVI-D display: Dell U2410
mechanism active: modesetting

The problem is still reproducible with the following steps:
1. boot with (eg) DVI-D connected to monitor
2. plug (eg) HDMI display device
3. activate secondary display with gnome-display-properties
4. change resolutions, set/clear clone mode
5. disconnect secondary display
6. run gnome-display-properties (or click 'detect displays')
 -> I observe a signal going to the primary display at the right resolution, though the output is black except for intermittent red dots in the first few pixel columns on the left of the panel.

dmesg output:

HDMI hot plug event: Pin=5 Presence_Detect=1 ELD_Valid=0
HDMI hot plug event: Pin=5 Presence_Detect=0 ELD_Valid=0
Comment 8 Daniel J Blueman 2010-07-01 15:56:46 UTC
Created attachment 36666 [details]
updated Xorg.0.log
Comment 9 Daniel J Blueman 2010-07-01 15:59:07 UTC
combining the dmesg and Xorg.0.log chronologically, we get:
[ 1472.222] (II) intel(0): Allocated new frame buffer 1024x768 stride 4096, tiled
[ 1485.734749] HDMI hot plug event: Pin=5 Presence_Detect=0 ELD_Valid=0
[ 1486.390] (II) AIGLX: Suspending AIGLX clients for VT switch
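The interleaving above can be reproduced mechanically. Here is a rough Python sketch that merges two logs by their leading "[ seconds ]" stamp; the function names are hypothetical, and it assumes both logs count seconds from the same origin (as they appear to here) and silently drops lines without a stamp.

```python
import re
from typing import Iterable, List, Tuple

# Matches the leading "[  1234.567]" stamp seen in both dmesg and Xorg.0.log.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")

Event = Tuple[float, str, str]  # (timestamp, source name, message)


def stamped(lines: Iterable[str], source: str) -> List[Event]:
    """Keep only lines that start with a timestamp, tagged with their source."""
    events = []
    for line in lines:
        match = STAMP.match(line.rstrip("\n"))
        if match:
            events.append((float(match.group(1)), source, match.group(2)))
    return events


def merge_logs(dmesg: Iterable[str], xorg: Iterable[str]) -> List[Event]:
    """Return all stamped lines from both logs in chronological order."""
    events = stamped(dmesg, "dmesg") + stamped(xorg, "xorg")
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    dmesg = ["[ 1485.734749] HDMI hot plug event: Pin=5 Presence_Detect=0 ELD_Valid=0"]
    xorg = [
        "[ 1472.222] (II) intel(0): Allocated new frame buffer 1024x768 stride 4096, tiled",
        "[ 1486.390] (II) AIGLX: Suspending AIGLX clients for VT switch",
    ]
    for stamp, source, message in merge_logs(dmesg, xorg):
        print(f"[{stamp:12.6f}] {source:5s} {message}")
```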
Comment 10 Jesse Barnes 2010-07-01 16:12:18 UTC
Oh there's a VT switch going on here?  If so, current X server master has a fix that may help:

commit 28e33ae6f69f716ece5d68e63fc52557236c5f6e
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Wed Jun 30 07:59:04 2010 -0700

    OS support: fix writeable client vs IgnoreClient behavior
Comment 11 Daniel J Blueman 2010-07-02 05:45:55 UTC
Hi Jesse et al, and thanks for the feedback.

For sure, I manually attempted to switch to the VT to see if the system was responsive, however the timings from the logs look quite close. I'll reproduce without the VT switch and see what logs we get (of course, we'll have the same corruption).

Note that the VT switch failed (at least from a graphical perspective).

Also, I'll try to reproduce with the updated X server when I can. Thanks so far! Dan
Comment 12 Daniel J Blueman 2010-07-02 10:02:20 UTC
The problem is reproducible without VT switching:
 0. run primary display at high (eg native) resolution
 1. plug a second display (eg second input on same monitor)
 2. run gnome-display-properties/'detect' if already open
 3. enable clone mode
 4. reduce resolution significantly, apply
 5. unplug second display
 6. increase resolution back to native
 -> bug symptoms: black display with some pixels changing on left - remains after stopping and starting X
Comment 13 Jesse Barnes 2010-07-02 10:13:20 UTC
Damn, ok.  I'll see if I can reproduce this today.
Comment 14 Jesse Barnes 2010-07-02 12:59:45 UTC
Just tried this on my DVI + DP config but couldn't reproduce with the latest bits.  I was running with this patch https://bugs.freedesktop.org/attachment.cgi?id=36695 from 28365, maybe it's a dupe?

Can you confirm?
Comment 15 Daniel J Blueman 2010-07-03 06:43:18 UTC
Created attachment 36718 [details]
Handy xrandr test script
Comment 16 Daniel J Blueman 2010-07-03 06:44:58 UTC
Hi Jesse, I'll check out the patch soon. I've cooked up a little python script which reproduces screen corruption (eg see attached) or GPU hangs seemingly from a race condition (locking?) after a minute or two - worthwhile checking there.

Either connect a couple of inputs or uncomment the 'force' call and set the desired outputs and run. It's possibly worthwhile adding to your automated testsuite.
Comment 17 Daniel J Blueman 2010-07-03 06:49:12 UTC
Created attachment 36719 [details]
example screen corruption
Comment 18 Daniel J Blueman 2010-07-03 08:03:51 UTC
Rebuilding and deploying xserver-xorg-core with FDO patch 36695 doesn't resolve the issue I'm seeing. I can reproduce it with:

$ cat xrandr.sh
#!/bin/bash -x

xrandr --auto
xrandr --output HDMI1 --auto
xrandr --output HDMI1 --mode 800x600
xrandr --output HDMI2 --same-as HDMI1
xrandr --output HDMI2 --off
xrandr --output HDMI1 --auto
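For systems whose outputs are not named HDMI1/HDMI2, the same sequence can be parameterised. This is only a sketch of the script above rewritten in Python, not the attached test script from comment 15; the function names and the dry-run default are illustrative choices, and only the xrandr options already used above appear in it.

```python
import subprocess
from typing import List


def xrandr_sequence(primary: str = "HDMI1", secondary: str = "HDMI2",
                    low_mode: str = "800x600") -> List[List[str]]:
    """Build the same six xrandr calls as xrandr.sh for the given outputs."""
    return [
        ["xrandr", "--auto"],
        ["xrandr", "--output", primary, "--auto"],
        ["xrandr", "--output", primary, "--mode", low_mode],
        ["xrandr", "--output", secondary, "--same-as", primary],
        ["xrandr", "--output", secondary, "--off"],
        ["xrandr", "--output", primary, "--auto"],
    ]


def run_sequence(commands: List[List[str]], dry_run: bool = True) -> None:
    """Echo each command like `bash -x`; execute it only if dry_run is False."""
    for command in commands:
        print("+", " ".join(command))
        if not dry_run:
            # check=False: keep going even if one xrandr call fails.
            subprocess.run(command, check=False)


if __name__ == "__main__":
    # Dry run by default, so nothing changes until dry_run=False is passed.
    run_sequence(xrandr_sequence())
```

Calling `run_sequence(xrandr_sequence("DVI1", "VGA1"), dry_run=False)` would exercise a different pair of outputs; the names must match what `xrandr` reports on the machine under test.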
Comment 19 Daniel J Blueman 2010-07-08 02:27:53 UTC
Problem reproduces on 2.6.35-rc4 also. I have noticed that when the issue occurs (ie there is a faint line in the top left and some pixels down the left hand side, otherwise black), I can switch back to the previous (and working) mode, so this looks like a mode timing calculation issue, or raster unit misconfiguration.

It's worth a try on some other test systems with larger panels and multiple inputs, if not already.

Info supplied.
Comment 20 Jesse Barnes 2010-07-12 14:08:05 UTC
Gordon, is this something you can reproduce?
Comment 21 Jesse Barnes 2010-07-12 16:42:19 UTC
This could be a dupe of 28998, can someone try the patch in the last comment of that bug and see if it helps?
Comment 22 Daniel J Blueman 2010-07-13 13:00:31 UTC
Rebuilding and deploying xorg-server 1.8.1.902 with Keith's commit e27d95f1ab4beaf7eea3d5ddb1001c22da3d0bda ('Unwrap/rewrap EnterVT/LeaveVT completely') on 2.6.35-rc5, the problem still persists.

The (main) resulting issue is consistently reproducible using the xrandr.sh script in comment 18 and two inputs into my 1920x1200 panel; however, using the xrandr exercise-a-tron in comment 15, I often get different screen corruption, as per the previously attached screenshot. Let me know if a screenshot of the main issue I'm reporting is worth anything.

Note that nothing has locked up and the system is still functional, so I can type blind and switch mode again - shall I compare register shots in good and bad cases? It feels like a race reprogramming the two GPU rasteriser units, as this doesn't occur when one output is used, and restarting X doesn't stop the corruption. Core i5 661 stepping 2 from /proc/cpuinfo; Core Processor Integrated Graphics Controller rev 12 from lspci.

Thanks for the help so far, Jesse. It would be great if Gordon or you are able to reproduce this.
Comment 23 Gordon Jin 2010-07-14 02:15:21 UTC
Yes, I can reproduce with the upstream code. I'll see if I can narrow down the key point in the reproduction steps.
Comment 24 Gordon Jin 2010-07-14 20:46:32 UTC
Yi (CCed) did more testing and here are his findings:
The reproducing steps:
1. connect 2 monitors. Both get displayed.
(changing mode is not required here)
2. unplug 1 monitor.
3. change mode on the remaining monitor.
GPU hang (with i915_error_state attached).

A compositing WM (e.g. compiz) is needed to reproduce, started either before step 1 or after step 3.

As dmesg shows, this is caused by page-flipping, and we confirmed that disabling page-flipping avoids the hang.

This problem happens on both our Piketon (with Clarkdale cpu) and G45.

We are using kernel 2.6.35-rc5 and xserver master.
Comment 25 Gordon Jin 2010-07-14 21:40:18 UTC
Created attachment 37059 [details]
i915_error_state on G45
Comment 26 Gordon Jin 2010-07-14 21:40:41 UTC
Created attachment 37060 [details]
dmesg on G45
Comment 27 Gordon Jin 2010-07-14 21:41:27 UTC
Created attachment 37061 [details]
X log on G45
Comment 28 Chris Wilson 2010-07-17 11:54:36 UTC
More than just the usual WAIT_FOR_EVENT hang:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffffa008c7d3>] intel_crtc_page_flip+0xc9/0x39c [i915]
PGD 114724067 PUD 1145bd067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:1b.0/sound/card0/uevent
CPU 0 
Modules linked in: fuse bridge stp bnep sco l2cap crc16 bluetooth rfkill sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table mperf dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer firewire_ohci snd firewire_core iTCO_wdt iTCO_vendor_support ata_generic pata_acpi i2c_i801 soundcore snd_page_alloc r8169 crc_itu_t mii pata_marvell floppy serio_raw pcspkr sg sd_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd i915 drm_kms_helper drm i2c_algo_bit button i2c_core video output [last unloaded: microcode]

Pid: 10954, comm: X Not tainted 2.6.35-rc5_stable_20100714+ #1 P5Q-EM/P5Q-EM
RIP: 0010:[<ffffffffa008c7d3>]  [<ffffffffa008c7d3>] intel_crtc_page_flip+0xc9/0x39c [i915]
RSP: 0018:ffff880114927cc8  EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88012df48320 RCX: ffff88010c945600
RDX: ffff880001a109c8 RSI: ffff88010c945840 RDI: ffff88012df48320
RBP: ffff880114927d18 R08: ffff88012df48280 R09: ffff88012df48320
R10: 0000000003c2e0b0 R11: 0000000000003246 R12: ffff88010c945840
R13: ffff88012df48000 R14: 0000000000000060 R15: ffff88012dbb8000
FS:  00007f9e6078e830(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000058 CR3: 00000001177a8000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 10954, threadinfo ffff880114926000, task ffff88012a4a1690)
Stack:
 ffff88010c945600 ffff880115b176c0 ffff88012db10000 0000000000000246
<0> fffffff40006101c ffff88010c945600 00000000ffffffea ffff88010c945600
<0> ffff88012df48320 ffff88011b4b6780 ffff880114927d78 ffffffffa003bd0e
Call Trace:
 [<ffffffffa003bd0e>] drm_mode_page_flip_ioctl+0x1bc/0x214 [drm]
 [<ffffffffa00311fc>] drm_ioctl+0x25e/0x35e [drm]
 [<ffffffffa003bb52>] ? drm_mode_page_flip_ioctl+0x0/0x214 [drm]
 [<ffffffff810f1c3c>] vfs_ioctl+0x2a/0x9e
 [<ffffffff810f227e>] do_vfs_ioctl+0x531/0x565
 [<ffffffff810f2307>] sys_ioctl+0x55/0x77
 [<ffffffff810e56d6>] ? sys_read+0x47/0x6f
 [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
Code: 45 d4 f4 ff ff ff 0f 84 e0 02 00 00 48 8b 4d b0 49 8d 9d 20 03 00 00 48 89 df 49 89 4c 24 38 49 8b 07 49 89 44 24 20 49 8b 47 20 <48> 8b 40 58 49 c7 04 24 00 00 00 00 49 c7 44 24 18 a9 a5 08 a0 
RIP  [<ffffffffa008c7d3>] intel_crtc_page_flip+0xc9/0x39c [i915]
 RSP <ffff880114927cc8>
CR2: 0000000000000058
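The register dump above already hints at what went wrong: RAX is 0 and CR2 (the faulting address) is 0x58, which is consistent with a load through a NULL pointer at offset 0x58. The Python sketch below pulls the faulting instruction bytes out of the "Code:" line (the byte in angle brackets marks the faulting RIP) and decodes only the one encoding seen here, 48 8b 40 dd, i.e. mov dd(%rax),%rax. It is a hand-rolled illustration with hypothetical function names, not a general disassembler.

```python
from typing import List, Optional


def faulting_bytes(code_line: str, count: int = 4) -> List[int]:
    """Return `count` bytes starting at the <xx> marker of an oops Code: line."""
    tokens = code_line.split()
    if tokens and tokens[0] == "Code:":
        tokens = tokens[1:]
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            window = [token.strip("<>")] + tokens[index + 1:index + count]
            return [int(byte, 16) for byte in window]
    raise ValueError("no <xx> marker found in Code: line")


def disp8_of_rax_load(insn: List[int]) -> Optional[int]:
    """Decode only `48 8b 40 dd` (mov dd(%rax),%rax) and return the signed dd."""
    if len(insn) >= 4 and insn[:3] == [0x48, 0x8B, 0x40]:
        disp = insn[3]
        return disp - 0x100 if disp >= 0x80 else disp
    return None


if __name__ == "__main__":
    # Tail of the Code: line from the oops above, shortened for readability.
    code = "Code: 49 8b 47 20 <48> 8b 40 58 49 c7 04 24 00 00 00 00"
    insn = faulting_bytes(code)
    disp = disp8_of_rax_load(insn)
    rax, cr2 = 0x0, 0x58  # values copied from the register dump
    print("faulting bytes:", " ".join(f"{b:02x}" for b in insn))
    print("offset matches CR2:", disp is not None and rax + disp == cr2)
```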
Comment 29 Chris Wilson 2010-07-17 12:24:48 UTC
Created attachment 37158 [details] [review]
Prevent the OOPS from flipping an unbound fb
Comment 30 Jesse Barnes 2010-07-19 11:50:16 UTC
Chris is my hero.
Comment 31 Chris Wilson 2010-07-22 10:14:52 UTC
(In reply to comment #24)
> This problem happens on both our Piketon (with Clarkdale cpu) and G45.

Was the problem just limited to i965+ or could it be triggered with i945?
Comment 32 Yi Sun 2010-07-25 18:53:49 UTC
The issue can't be reproduced on 945GM.
Comment 33 Chris Wilson 2010-07-26 02:26:29 UTC
Haven't reproduced yet on my t61 so either it is g4x+ or desktop specific, or I just fail at reproducing bugs. Will have to wait sometime until I have the h/w to reproduce with my g45.

I think the core issue is a race between WAIT_FOR_EVENT and modeset, for which Jesse had the idea of triggering the event prior to the modeset and then relying on the hardware to dtrt (do the right thing) after the pipe change.

That still sounds racy to me, and I think we need to be idling the gpu prior to modeset. An alternative is to move the WAIT_FOR_EVENT to an ioctl and prevent the race condition in the kernel by taking the mode lock. Ugh. That is not a solution...  

Jesse, if you have time to work on this this week, be my guest.
Comment 34 Daniel J Blueman 2010-07-26 02:36:07 UTC
Hi Chris - did you give the tests at comment#18 and comment#15 a shot?
Comment 35 Chris Wilson 2010-07-26 03:20:06 UTC
=0 crestline:~$ ./randr-bug28811.sh 
+ xrandr --auto
+ xrandr --output LVDS1 --off --output DVI1 --auto
+ xrandr --output LVDS1 --off --output DVI1 --mode 800x600
+ xrandr --output LVDS1 --off --output VGA1 --same-as DVI1
+ xrandr --output LVDS1 --off --output VGA1 --off
+ xrandr --output LVDS1 --off --output DVI1 --auto
=0 crestline:~$ cat /sys/kernel/debug/dri/0/i915_error_state 
no error state collected
=0 crestline:~$ ps ax | grep compiz
 2171 ?        S      0:03 compiz --ignore-desktop-hints glib gconf gnomecompat --replace
=0 crestline:~$

So I can't be sure if the different output configuration is its saving grace or the different h/w.
Comment 36 Daniel J Blueman 2010-07-26 03:52:24 UTC
If the script at comment#15 doesn't reproduce any problem (ie https://bugs.freedesktop.org/attachment.cgi?id=36718 ), then there is every chance you'll need newer hardware to reproduce.

You could try with enabling force_output() in the reproducer and specialising the outputs to what your hardware provides.
Comment 37 Chris Wilson 2010-09-06 10:20:08 UTC
http://cgit.freedesktop.org/~ickle/drm-intel/log/?h=drm-intel-next contains a couple of patches that should in theory prevent the hang. They are not ideal: they fix up the hang after the fact, whereas we should be striving not to hang in the first place.
Comment 38 Chris Wilson 2010-09-10 07:34:10 UTC
Tree moved to:

git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-next
Comment 39 Chris Wilson 2010-09-22 09:30:19 UTC
I think I have the fix for this in -next:


commit 265db9585e570814d2f7aca109c5563bcde9c948
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 20 15:41:01 2010 +0100

    drm/i915: Drain any pending flips on the fb prior to unpinning
    
    If we have queued a page flip on the current fb and then request a mode
    change, wait until the page flip completes before performing the new
    request.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
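
The ordering described in the commit message (let a queued flip complete before the mode change proceeds) can be illustrated with a toy user-space model. This is NOT the kernel patch and none of these names exist in i915; FakeCrtc, queue_flip, flip_done and set_mode are hypothetical, and a Python condition variable stands in for whatever wait mechanism the driver actually uses.

```python
import threading
from typing import Optional


class FakeCrtc:
    """Toy stand-in for a CRTC with a pending-flip counter (not i915 code)."""

    def __init__(self, mode: str = "1920x1200") -> None:
        self._cond = threading.Condition()
        self._pending_flips = 0
        self.mode = mode

    def queue_flip(self) -> None:
        """Record that a page flip has been queued on the current fb."""
        with self._cond:
            self._pending_flips += 1

    def flip_done(self) -> None:
        """Called when a queued flip completes; wakes any waiting modeset."""
        with self._cond:
            self._pending_flips -= 1
            self._cond.notify_all()

    def set_mode(self, mode: str, timeout: Optional[float] = None) -> None:
        """Drain pending flips first, then apply the new mode."""
        with self._cond:
            drained = self._cond.wait_for(
                lambda: self._pending_flips == 0, timeout)
            if not drained:
                raise TimeoutError("page flip still pending")
            self.mode = mode


if __name__ == "__main__":
    crtc = FakeCrtc()
    crtc.queue_flip()
    # Simulate the flip completing shortly after the modeset is requested.
    threading.Timer(0.05, crtc.flip_done).start()
    crtc.set_mode("800x600", timeout=2.0)
    print("mode is now", crtc.mode)
```

Without the wait in set_mode, the mode change could run while a flip still references the old framebuffer, which is the kind of ordering problem the commit message describes.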
