Bug 78530 - Memory corruption on Lenovo t440p with runpm
Summary: Memory corruption on Lenovo t440p with runpm
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: Other
OS: All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-10 15:22 UTC by Nikolay Amiantov
Modified: 2016-08-24 14:05 UTC
CC List: 4 users

See Also:
i915 platform:
i915 features:


Attachments
iomem when booted with memmap=99G$0x40000000 (2.33 KB, text/plain)
2014-09-02 11:36 UTC, Dmitry Nezhevenko
dmesg when booted with memmap=99G$0x40000000 (50.51 KB, text/plain)
2014-09-02 11:36 UTC, Dmitry Nezhevenko
iomem default (2.52 KB, text/plain)
2014-09-02 11:37 UTC, Dmitry Nezhevenko
dmesg default (51.69 KB, text/plain)
2014-09-02 11:37 UTC, Dmitry Nezhevenko
pci config space linux (13.24 KB, text/plain)
2014-09-02 11:38 UTC, Dmitry Nezhevenko
pci config space windows (13.24 KB, text/plain)
2014-09-02 11:38 UTC, Dmitry Nezhevenko
pci space diff (linux vs win) (5.59 KB, text/plain)
2014-09-02 11:39 UTC, Dmitry Nezhevenko

Description Nikolay Amiantov 2014-05-10 15:22:28 UTC
On recent kernels with runpm, the system crashes (with severe memory corruption) when the nvidia card is disabled and then re-enabled on Lenovo T440p laptops with recent BIOSes (1.16+).

My investigations into this:
1. The crash occurs even with just acpi_call, so it looks like these BIOSes use some new kind of procedure for enabling the nvidia card.
2. The ACPI calls from Windows and Linux do not differ much (and trying Windows' calling sequence does not help). Also, the DSDTs from the 1.14 and 1.16 BIOSes are basically identical.
3. The bug can be worked around by disabling all memory above 4GB.
4. The bug does not really affect regular memory (there is no corruption there), but devices using memory regions. For example, I can boot the system from a ramdisk with all such devices disabled, then (1) perform the acpi nvidia disable-enable and (2) try to load such a module (the order of (1) and (2) does not matter) -- errors pile up, even if I unload and reload the module.

My own hypothesis is that something on the PCI bus gets broken -- maybe some reinitialization needs to be performed?

Links:
1. https://github.com/Bumblebee-Project/bbswitch/issues/78 (main discussion place)
2. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1268669 (ubuntu bug report)
3. https://bbs.archlinux.org/viewtopic.php?pid=1414109 (one of forum threads about this)
Comment 1 Nikolay Amiantov 2014-05-10 18:23:17 UTC
The Dell XPS 15z with recent BIOSes is also affected. (Reported by mattdistro in the github thread.)
Comment 2 Ilia Mirkin 2014-05-10 18:24:11 UTC
Just to confirm -- this happens without bumblebee as well, right?

Does it happen with the blob driver?
Comment 3 Nikolay Amiantov 2014-05-10 18:25:31 UTC
This happens even without any drivers at all: just using acpi_call to make the _PS3 and _PS0 calls.
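For reference, the reproduction is nothing more than writing the method paths to acpi_call's proc file. A minimal sketch; the \_SB.PCI0.PEG0.PEGP path below is only an example and has to be looked up in your own DSDT:

   # load the out-of-tree acpi_call module (as root)
   modprobe acpi_call
   # power the card down (_PS3), then back up (_PS0)
   echo '\_SB.PCI0.PEG0.PEGP._PS3' > /proc/acpi/call
   echo '\_SB.PCI0.PEG0.PEGP._PS0' > /proc/acpi/call
   # read back the result of the last call
   cat /proc/acpi/call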
Comment 4 Ilia Mirkin 2014-05-10 18:27:35 UTC
Manually making acpi calls isn't the most prudent thing to do. Please confirm that this happens

(a) With just nouveau loaded. No bumblebee anywhere at all.
(b) With the blob driver.
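For (a), nouveau's runtime PM can be exercised through the standard PCI sysfs knobs, with no bumblebee involved. A sketch, assuming the card sits at 0000:02:00.0 (check lspci for the actual address):

   # let the kernel runtime-manage the card's power state
   echo auto > /sys/bus/pci/devices/0000:02:00.0/power/control
   # after the autosuspend delay this should read "suspended"
   cat /sys/bus/pci/devices/0000:02:00.0/power/runtime_status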
Comment 5 Nikolay Amiantov 2014-05-10 18:29:16 UTC
Okay -- I should try blob with bumblebee, right?
Comment 6 Ilia Mirkin 2014-05-10 18:30:36 UTC
(In reply to comment #5)
> Okay -- I should try blob with bumblebee, right?

I'm not familiar with the blob situation wrt runtime pm. If they have any runtime pm-style support, please use that instead of bumblebee. If they have no support for that, then I guess it's fine to try it with bumblebee.
Comment 7 Nikolay Amiantov 2014-05-10 18:32:03 UTC
(In reply to comment #6)
> I'm not familiar with the blob situation wrt runtime pm. If they have any
> runtime pm-style support, please use that instead of bumblebee. If they have
> no support for that, then I guess it's fine to try it with bumblebee.

I'm not too familiar with it either, but I thought they hadn't added a feature like this yet -- I just asked to confirm. I'll try bumblebee then.
Comment 8 Nikolay Amiantov 2014-05-10 19:08:09 UTC
I've tested two configurations on kernel 3.14.3, bbswitch 0.8 and nvidia 337.12:
(1) disabled acpi_call and my custom script, disabled bbswitch and bumblebeed (all bumblebee components), modprobe'd nouveau, tried to start X
(2) started bumblebeed, loaded bbswitch, started X without nouveau, ran "primusrun glxgears"
In both cases, I got fs corruption, iwlwifi errors, and other distinctive errors pointing at memory corruption.
Comment 9 Alexander Monakov 2014-07-10 14:00:20 UTC
This problem appears to be fixed in recent kernels after "Windows 2013" was added to the kernel's built-in ACPI OSI list. Bisection to the resolving kernel commit is documented here:

https://github.com/Bumblebee-Project/bbswitch/issues/78#issuecomment-48600484

Nikolay — would you mind closing the bug after verifying it's resolved for you?
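For kernels that predate that change, the same OSI string can be added from the kernel command line; acpi_osi= with a quoted value appends a string to the built-in list (the value contains a space, so the quotes must survive your bootloader's config):

   acpi_osi="Windows 2013"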
Comment 10 Nikolay Amiantov 2014-07-10 18:09:15 UTC
The OSI fix indeed solves the issue. Closing the bug.
Comment 11 Nikolay Amiantov 2014-07-18 21:56:50 UTC
Unfortunately, it wasn't a fix -- there is another ACPI problem which prevents the nvidia card from being disabled at all, so everything merely "started to work". You can find more about the new problem at https://github.com/Bumblebee-Project/bbswitch/issues/78#issuecomment-48768044. I don't think we need another bug for this, do we?
Comment 12 Dmitry Nezhevenko 2014-09-02 11:34:20 UTC
Hi,

I also have an affected T440p machine that corrupts everything once runtime PM is enabled, or after calling the ACPI method to resume the card.

It was stated in the bumblebee github thread that adding "memmap=99G$0x100000000" to the kernel command line fixes the issue on affected systems.

My case looks a bit interesting because I have only 4GB of RAM right now, so disabling everything above 4GB should not change the behavior. But it does! Adding the memmap= magic fixes the issue for me.
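(For anyone trying this: the $ has to be escaped on its way through GRUB, since grub.cfg treats $ as a variable reference and grub-mkconfig additionally runs /etc/default/grub through a shell, hence the triple backslash. A sketch:)

   # /etc/default/grub
   GRUB_CMDLINE_LINUX_DEFAULT="quiet memmap=99G\\\$0x100000000"
   # then regenerate the config (path may differ per distro)
   grub-mkconfig -o /boot/grub/grub.cfg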

I've compared /proc/iomem with and without the boot option and found one difference. When booted with memmap=99G$0x100000000 I get one large reserved region:

   bceff000-18ffffffff : reserved
     bda00000-bf9fffff : Graphics Stolen Memory
     bfa00000-febfffff : PCI Bus 0000:00
       c0000000-d1ffffff : PCI Bus 0000:02
         c0000000-cfffffff : 0000:02:00.0
         d0000000-d1ffffff : 0000:02:00.0
       e0000000-efffffff : 0000:00:02.0
     ...

All PCI devices are inside this one large region. But if I boot with default options, iomem is different:

   bceff000-bf9fffff : reserved
     bda00000-bf9fffff : Graphics Stolen Memory
   bfa00000-febfffff : PCI Bus 0000:00
     c0000000-d1ffffff : PCI Bus 0000:02
       c0000000-cfffffff : 0000:02:00.0
       d0000000-d1ffffff : 0000:02:00.0
     e0000000-efffffff : 0000:00:02.0
     f0000000-f0ffffff : PCI Bus 0000:02
       f0000000-f0ffffff : 0000:02:00.0
     f1000000-f13fffff : 0000:00:02.0
     f1400000-f14fffff : PCI Bus 0000:04

So with memmap= the reserved region starting at bceff000 grows to cover all the PCI devices, while with the default options it ends at bf9fffff and the PCI ranges sit outside of it.

[ I'm attaching both iomem files ]

To check this, I tried to explicitly reserve the whole region by booting with the memmap=1100M$0xbfa00000 parameter, and got an iomem layout pretty similar to the memmap=99G one. But the system still crashes after runtime PM.

I was also able to capture the PCI configuration space of the NVIDIA card from Win8 (where everything works). I can confirm that after the acpi_call Windows also shows just 0xFF bytes, but once resumed, the configuration space differs a bit from Linux. Both files are attached.
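(On the Linux side, a dump like the attached ones can be taken with lspci; something along these lines, with 02:00.0 being the NVIDIA card here:)

   # full extended config space dump of the card (needs root)
   lspci -s 02:00.0 -xxxx > pci-config-linux.txt
   # after a disable/enable cycle, dump again and compare
   diff pci-config-linux.txt pci-config-resumed.txt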

Any ideas? Maybe the card is somehow misconfigured?

Thanks
Comment 13 Dmitry Nezhevenko 2014-09-02 11:36:21 UTC
Created attachment 105598 [details]
iomem when booted with memmap=99G$0x40000000
Comment 14 Dmitry Nezhevenko 2014-09-02 11:36:44 UTC
Created attachment 105599 [details]
dmesg when booted with memmap=99G$0x40000000
Comment 15 Dmitry Nezhevenko 2014-09-02 11:37:25 UTC
Created attachment 105600 [details]
iomem default
Comment 16 Dmitry Nezhevenko 2014-09-02 11:37:48 UTC
Created attachment 105601 [details]
dmesg default
Comment 17 Dmitry Nezhevenko 2014-09-02 11:38:22 UTC
Created attachment 105602 [details]
pci config space linux
Comment 18 Dmitry Nezhevenko 2014-09-02 11:38:42 UTC
Created attachment 105603 [details]
pci config space windows
Comment 19 Dmitry Nezhevenko 2014-09-02 11:39:04 UTC
Created attachment 105604 [details]
pci space diff (linux vs win)
Comment 20 Dmitry Nezhevenko 2014-11-04 08:39:34 UTC
Any ideas on this?

Has anybody tried the new BIOS 1.27-1.28?

WARN: once updated, there is no way to revert back to pre-1.26.
Comment 21 Nikolay Amiantov 2015-02-07 13:15:08 UTC
@doudou on Github managed to solve this problem[1][2] -- Nouveau could port the same fix, I think.

[1]: https://github.com/Bumblebee-Project/bbswitch/issues/78#issuecomment-67741841
[2]: https://github.com/Bumblebee-Project/bbswitch/pull/102
Comment 22 Peter Wu 2016-08-24 14:05:01 UTC
Fixed in v4.8-rc1

commit 692a17dcc2922a91c6bcf11b3321503a3377b1b1
Author: Peter Wu <peter@lekensteyn.nl>
Date:   Fri Jul 15 15:12:18 2016 +0200

    drm/nouveau/acpi: fix lockup with PCIe runtime PM

It was confirmed to fix the memory corruption; if it still happens, please re-open.
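A quick way to check whether a given kernel already carries the fix (the second command assumes a kernel git checkout):

   # the fix shipped in 4.8-rc1 and later
   uname -r
   # from a kernel source tree, find the first tag containing the commit
   git describe --contains 692a17dcc292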

