Bug 29406

Summary: [4X bisected] 2.6.35: BSD ring buffer implementation makes suspend to ram unreliable
Product: DRI
Reporter: Thomas Meyer <thomas.mey>
Component: DRM/Intel
Assignee: Zou Nan hai <nanhai.zou>
Status: CLOSED FIXED
QA Contact:
Severity: major
Priority: high
CC: cllccl, florian, haihao.xiang, hege, north, yuanhan.liu, zhenyu.z.wang
Version: unspecified
Hardware: x86 (IA32)
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description    Flags
Xorg log       none

Description Thomas Meyer 2010-08-05 05:28:47 UTC
This commit makes suspend to ram unreliable for me:

commit d1b851fc0d105caa6b6e3e7c92d2987dfb52cbe0
Author: Zou Nan hai <nanhai.zou@intel.com>
Date:   Fri May 21 09:08:57 2010 +0800

    drm/i915: implement BSD ring buffer V2
    
    The BSD (bit stream decoder) ring is used for accessing the BSD engine
    which decodes video bitstream for H.264 and VC1 on G45+.  It is
    asynchronous with the render ring and has access to separate parts of
    the GPU from it, though the render cache is coherent between the two.
    
    Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
    Signed-off-by: Xiang Hai hao <haihao.xiang@intel.com>
    Signed-off-by: Eric Anholt <eric@anholt.net>

A git revert of the above commit failed on 2.6.35-rc6, so I just changed the "HAS_BSD" define (in i915_drv.h) to:

#define HAS_BSD(dev)            (0)

With this change applied, suspend to RAM is back to the reliability of 2.6.34.y.

# lspci -s 00:02  -vvv
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
        Subsystem: Acer Incorporated [ALI] Device 029b
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 42
        Region 0: Memory at 90000000 (64-bit, non-prefetchable) [size=4M]
        Region 2: Memory at 80000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 30d0 [size=8]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0100c  Data: 4161
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: i915

00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
        Subsystem: Acer Incorporated [ALI] Device 029b
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Region 0: Memory at 92400000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Comment 1 Gordon Jin 2010-08-05 18:08:04 UTC
My GM45 and Arrandale work fine, but my G45 is impacted by this issue.
Comment 2 Dave North 2010-08-07 12:08:13 UTC
Same problem on a DG41TY desktop board with GMA X4500. The HAS_BSD (0) hack seems to work here also.
Comment 3 Gordon Jin 2010-08-08 02:42:34 UTC
(In reply to comment #1)
> My GM45 and Arrandale work fine, but my G45 is impacted by this issue.

Correction: my G45 has working suspend-to-ram but fails to suspend-to-disk, so it should be a separate issue.
Comment 4 Shuang He 2010-08-16 20:04:54 UTC
We haven't been able to reproduce this issue on our GM45 so far.
Could any of you provide more detailed machine info so we can try to reproduce it?
Comment 5 Dave North 2010-08-16 20:23:21 UTC
Here's an elided dmidecode:

BIOS Information
    Vendor: Intel Corp.
    Version: TYG4110H.86A.0031.2009.0626.1405
    Release Date: 06/26/2009
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 1024 kB
    Characteristics:


Base Board Information
    Manufacturer: Intel Corporation
    Product Name: DG41TY
    Version: AAE47335-300
    Serial Number: AZTY9300030R
    Asset Tag: To be filled by O.E.M.
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: To be filled by O.E.M.
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

Processor Information
    Socket Designation: LGA775
    Type: Central Processor
    Family: Pentium
    Manufacturer: Intel(R) Corp.
    ID: 7A 06 01 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 23, Stepping 10
               ...
    Version: Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz

... which I'd think would be adequate. And here's ver_linux just for good
measure:

Gnu C                  4.4.3
Gnu make               3.81
binutils               2.20.1
util-linux             2.17.2
mount                  support
module-init-tools      3.11.1
e2fsprogs              1.41.11
reiserfsprogs          3.6.21
Linux C Library        2.11.1
Dynamic linker (ldd)   2.11.1
Procps                 3.2.8
Net-tools              1.60
Kbd                    1.15
Sh-utils               7.4

If there's anything else that would help, let me know. And thanks.


Dave
Comment 6 Shuang He 2010-08-17 01:50:50 UTC
What OS are you using? Fedora 13? Ubuntu?
Do you have compiz enabled? compiz fusion? Are you running any application (3D, Media) before S3?
Do you have any special configuration?
Did you have an external monitor connected when you hit this issue?
Do you have the power supply connected?
Comment 7 Thomas Meyer 2010-08-17 11:37:09 UTC
(In reply to comment #6)

I encounter the described behaviour on an Acer Aspire 1810T.

> What OS are you using? Fedora 13? Ubuntu?

Fedora 13

> Do you have compiz enabled?
No

> compiz fusion?
No

> Are you running any application (3D,
> Media) before S3?
No.

> Do you have any special configuration?
No. Just the laptop itself.

> Did you have an external monitor connected when you hit this issue?
No.

> Do you have the power supply connected?
Yes, and the battery is removed.
Comment 8 Thomas Meyer 2010-08-17 11:39:21 UTC
By the way: I use a UP (uniprocessor) kernel.
Comment 9 Dave North 2010-08-17 12:24:48 UTC
Sorry to take so long but I've been doing some testing. First, I wasn't using compiz but I did have xcompmgr loaded. However, unloading it does not stop the problem. No 3D or media apps running. The configuration is special in the sense that it does not use modules or an initrd, if that's what you mean. This is a desktop unit, so it has only an external monitor and does not have a battery other than the cmos pill. Kernel is smp.

The truly interesting question turns out to be: what distribution?

My initial problem was seen on Ubuntu Lucid (10.04). So I tested on Gentoo unstable with the same kernel (literally) and the same problem happened. Then I tested on Debian Squeeze and no, the problem simply would not happen.

Further, I tried all three systems running from ttys1 without X running, and none of them would fail (worked every time on all systems). I tried this at least 20 times each over a minimum of four reboots with zero failures. So it clearly relates to something about X.

Comment: I also ran several retests using 2.6.34.1, and none of the systems would fail running any variant of X.

Lucid is running X.Org version: 1.7.6 (usually fails)
Squeeze is running X.Org version: 1.7.7 (always works)
Gentoo is running X.Org version: 1.8.2 (fails more than half the time)

Unfortunately, that's not very helpful. Even worse, the problem turns out to be sporadic. On some boots, suspend worked fine from X running Gentoo, but on others it would fail. And once, on both Lucid and Gentoo, the first suspend would succeed and the second one fail.
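
For a rough sense of how strong the 20-for-20 result from the tty is, the chance of seeing that many clean suspends purely by luck can be computed directly. This is a minimal sketch that assumes every attempt fails independently at a fixed rate, which the boot-dependent behaviour described here may not satisfy; the function name and the sample rates are illustrative, not measurements.

```python
def prob_all_succeed(failure_rate: float, attempts: int) -> float:
    """Probability that `attempts` independent suspends all succeed,
    given a fixed per-attempt failure rate (an assumption, see above)."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return (1.0 - failure_rate) ** attempts


# Illustrative per-attempt failure rates; none of these is a measured value.
for rate in (0.1, 0.3, 0.5):
    chance = prob_all_succeed(rate, 20)
    print(f"failure rate {rate:.0%}: P(20 clean suspends) = {chance:.2e}")
```

Under that assumption, even a 10% per-attempt failure rate leaves only about a 12% chance of 20 clean suspends in a row, and a 50% rate puts it below one in a million, so the clean tty runs are unlikely to be a coincidence.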

Failure mode gives me no information; the computer deadlocks and doesn't respond to external hookups (ssh, ping etc). I have no serial port. There is nothing out of the ordinary in /var/log/pm-suspend.log.

This is a very messy one. If I find anything more helpful, I'll let you know.
Comment 10 Thomas Meyer 2010-08-17 13:01:09 UTC
Created attachment 37926
Xorg log

I use the intel driver version 2.12.0.
Comment 11 Florian Kriener 2010-08-20 03:07:42 UTC
Since this is assigned to the ia32 platform, I feel obligated to mention that it also happens with the 64-bit version, if that is of any concern to you. Other than that, I have experienced exactly the same symptoms. My computer is a Lenovo T400s and I used the Debian/experimental kernel linux-image-2.6.35-trunk-amd64_2.6.35-1~experimental.2

Here's some lspci:
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device 20e4
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 29
        Region 0: Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 1800 [size=8]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0300c  Data: 41a1
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: i915

00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
        Subsystem: Lenovo Device 20e4
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Region 0: Memory at f2400000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Comment 12 Thomas Meyer 2010-08-23 08:36:08 UTC
Seems to be fixed in 2.6.36-rc2, probably by merge commit 4238a417a91643e1162a98770288f630e37f0484.
Comment 13 ccc1 2010-08-29 00:31:46 UTC
Will there be an official fix in an upcoming 2.6.35.x kernel?
Comment 14 Thomas Meyer 2010-08-29 10:47:16 UTC
I guess the problem is that the exact commit id that fixes this bug is still unknown. I suspect it is one of the commits contained in the above merge commit. So if you like, you could test each commit in that merge and/or try to bisect to the specific commit that fixed this bug. Once the fixing commit id is known, it can be forwarded to the stable kernel team, so they will hopefully pick it up and bundle it into the next stable kernel release.
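
The search described above can be sketched as a binary search for the first commit at which suspend stays reliable, which is what git bisect automates, with the labels swapped because the target is a fix rather than a regression. The sketch assumes a linear, oldest-to-newest list of candidate commits ending in one known to be fixed (the history inside a merge is not strictly linear), and a hypothetical `suspend_ok` callback that builds and boots a commit and reports whether one suspend/resume cycle survived. Because the hang is sporadic, each commit is tried several times; `trials_needed` and `first_fixed_commit` are illustrative names.

```python
import math
from typing import Callable, Sequence


def trials_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Clean suspends required before a still-broken commit, failing
    independently at `failure_rate`, would pass by luck with probability
    of at most 1 - confidence."""
    if not 0.0 < failure_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


def first_fixed_commit(
    commits: Sequence[str],
    suspend_ok: Callable[[str], bool],
    trials: int,
) -> str:
    """Return the earliest commit for which every trial suspends cleanly.

    Assumes `commits` is ordered oldest to newest, the last entry is fixed,
    and the bug does not come back once it has been fixed.
    """
    if not commits:
        raise ValueError("commits must not be empty")

    def looks_fixed(commit: str) -> bool:
        # all() stops at the first failed suspend, so a broken commit is cheap to reject.
        return all(suspend_ok(commit) for _ in range(trials))

    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is fixed
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_fixed(commits[mid]):
            hi = mid          # the fix is at mid or earlier
        else:
            lo = mid + 1      # the fix landed after mid
    return commits[lo]
```

With the roughly 3-in-10 failure rate reported elsewhere in this thread, `trials_needed(0.3)` suggests on the order of nine clean suspends per candidate commit before calling it fixed, under the independence assumption.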
Comment 15 Adam Lantos 2010-09-06 02:00:14 UTC
My GM45 (Thinkpad X200) is also affected: approx. 3 out of 10 attempts end in a lockup accompanied by a blinking sleep LED, and the relevant logs are empty after a hard reboot.

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Device 20e4
	Flags: bus master, fast devsel, latency 0, IRQ 48
	Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 1800 [size=8]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915

I'm using the latest 2.6.35 kernel, libdrm-2.4.21-2, mesa-7.8.2-1, xf86-video-intel-2.12.0-1 (latest Arch Linux packages). Suspend had never failed with the 2.6.34 series.
Comment 16 Chris Wilson 2010-09-10 07:52:52 UTC
Adam, can you grab

git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-fixes

and confirm that s2ram is reliable again?
Comment 17 Adam Lantos 2010-09-29 07:08:33 UTC
(In reply to comment #16)
> Adam, can you grab
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git
> drm-intel-fixes
> 
> and confirm that s2ram is reliable again?

Sorry, I hadn't noticed this message before :(

Anyway, I see this bug was supposed to be fixed. Has the fix made its way into upstream (.36-RC)? If it's fixed there, I can wait until the next stable kernel release. I think a backport for stable .35 is also needed.
