Bug 25453

Summary: nouveau + multiple xorg servers + suspend2ram trashes root partition
Product: xorg
Reporter: Tobias Hommel <nouveau>
Component: Driver/nouveau
Assignee: Nouveau Project <nouveau>
Status: RESOLVED INVALID
QA Contact: Xorg Project Team <xorg-team>
Severity: critical
Priority: high
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments:
output of dmesg after resume (flags: none)
output of dmesg after resume (linux 2.6.34) (flags: none)

Description Tobias Hommel 2009-12-04 12:34:27 UTC
Created attachment 31752
output of dmesg after resume

One note up front: the following only happens with nouveau; there are no problems with the proprietary nvidia drivers.

I use 2 Xorg servers with nouveau as the video driver. Everything works quite nicely; at least I haven't experienced any strange behaviour.
I also use suspend2ram. This also works fine, at least if I only use 1 X server.

If I have 2 X servers started and suspend my machine, everything is fine until the machine wakes up. Sometimes (I haven't yet figured out exactly when "sometimes" is, but the problem occurs after 2 suspends at the latest) I get some ugly messages in my dmesg (more in the attached dmesg log):

...
sata_nv 0000:00:07.0: PCI-DMA: Out of IOMMU space for 4096 bytes
sata_nv 0000:00:07.0: PCI-DMA: Out of IOMMU space for 4096 bytes
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd ca/00:08:43:bf:91/00:00:00:00:00/e0 tag 0 dma 4096 out
         res 50/00:00:00:00:00/00:00:00:00:00/e0 Emask 0x40 (internal error)
ata2.00: status: { DRDY }
ata2.00: configured for UDMA/133
ata2: EH complete
...

and my root partition gets mounted read-only.
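
For anyone trying to reproduce this, a minimal Python sketch of how the symptoms above could be checked automatically after a resume. It assumes the dmesg lines look like the excerpt in this report and that the mount table is in /proc/mounts format; the function names are hypothetical and not part of any existing tool.

```python
import re
from collections import Counter

# Matches the IOMMU exhaustion line as it appears in the excerpt above,
# with an optional "[  123.456789]" dmesg timestamp in front (an assumption).
IOMMU_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(\S+) (\S+): PCI-DMA: Out of IOMMU space for (\d+) bytes"
)


def count_iommu_failures(dmesg_text):
    """Count failed DMA mappings per 'driver pci-address' in dmesg-style text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = IOMMU_RE.match(line.strip())
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts


def root_is_read_only(mounts_text):
    """Return True if the last '/' entry in /proc/mounts-style text is mounted ro."""
    read_only = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == "/":
            read_only = "ro" in fields[3].split(",")
    return read_only
```

The first function could be fed the output of dmesg and the second the contents of /proc/mounts; a non-empty counter together with a read-only root would match the failure described here.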

Here is some information about my system:

tobi@nyx ~ $ sudo lspci -v -s 05:00.0

05:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9400 GT] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: nVidia Corporation Device 0551
	Flags: bus master, fast devsel, latency 0, IRQ 18
	Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
	I/O ports at ec00 [size=128]
	[virtual] Expansion ROM at feb80000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information <?>
	Capabilities: [100] Virtual Channel <?>
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [600] Vendor Specific Information <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidia

Ignore the Kernel driver/modules lines; I got this information while running the nvidia drivers. If lspci output while running nouveau is needed, just ask.

tobi@nyx ~ $ uname -a
Linux nyx 2.6.31-gentoo-r6-nyx #2 SMP PREEMPT Mon Nov 30 23:18:53 CET 2009 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ AuthenticAMD GNU/Linux

The following gentoo packages were installed (on Wed Dec 2 2009, about 20:00 GMT) from the x11-overlay:

x11-drivers/xf86-video-nouveau
x11-libs/libdrm
x11-base/nouveau-drm
x11-base/xorg-server

The proprietary nvidia drivers were of course uninstalled.

If more logfiles or system information is needed, just ask. ;-)
Comment 1 Marcin Slusarz 2010-06-01 10:07:46 UTC
3 drivers for the same hardware loaded (nvidiafb, the blob and nouveau) - this cannot work...
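
A rough Python sketch of how such a conflict could be spotted before testing. It assumes the module list is in /proc/modules format (module name in the first column); the set of module names is taken from the drivers named in this comment, and the function names are hypothetical. Drivers built into the kernel do not appear in /proc/modules, so this check can miss them.

```python
# Modules that can each claim an NVIDIA GPU. This set is an assumption
# based only on the drivers mentioned in this bug report.
NVIDIA_GPU_MODULES = {"nvidiafb", "nvidia", "nouveau"}


def loaded_gpu_modules(proc_modules_text):
    """Return the sorted NVIDIA-capable modules found in /proc/modules-style text."""
    loaded = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and fields[0] in NVIDIA_GPU_MODULES:
            loaded.add(fields[0])
    return sorted(loaded)


def has_driver_conflict(proc_modules_text):
    """True if more than one driver for the same GPU is loaded at once."""
    return len(loaded_gpu_modules(proc_modules_text)) > 1
```

Passing the contents of /proc/modules to has_driver_conflict before starting X would flag the situation described above.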
Comment 2 Tobias Hommel 2010-06-03 13:12:41 UTC
Created attachment 36036
output of dmesg after resume (linux 2.6.34)
Comment 3 Tobias Hommel 2010-06-03 13:13:32 UTC
So I tried again, this time without nvidiafb (I wonder why it was loaded last time). Only the nouveau, fb, and fbcon kernel modules were loaded.

Again, after 3 or 4 suspend/resume cycles with 2 X servers, I get these strange hard disk errors, which definitely have nothing to do with the hard drive itself, as they also happen on a mirrored disk. And as already mentioned, this only occurs with nouveau, not with the binary drivers.

I attached the dmesg output for the current situation.
Comment 4 Marcin Slusarz 2010-06-04 09:25:19 UTC
Worrying number of hardware/BIOS bugs...

---
AMI BIOS detected: BIOS may corrupt low RAM, working around it.

mtrr: your BIOS has configured an incorrect mask, fixing it.

Node 0: aperture @ 20000000 size 32 MB
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM

ACPI Warning: Incorrect checksum in table [OEMB] - E9, should be E8 (20100121/tbutils-314)
ACPI: No dock devices found.
PCI: Ignoring host bridge windows from ACPI; if necessary, use "pci=use_crs" and report a bug

mtrr: your BIOS has configured an incorrect mask, fixing it.
mtrr: your BIOS has configured an incorrect mask, fixing it.

k8temp 0000:00:18.3: Temperature readouts might be wrong - check erratum #141

pci 0000:00:02.0: OHCI: BIOS handoff failed (BIOS bug?) 000007b4
---

Maybe there's a BIOS update for your machine?
Comment 5 Tobias Hommel 2010-06-05 04:38:16 UTC
(In reply to comment #4)
> Worrying number of hardware/BIOS bugs...

I know, but a graphics driver shouldn't touch filesystem-related stuff anyway. Never!
I suppose I'm not the only one with a buggy BIOS; at least the ACPI tables are broken on nearly every board, and probably the other stuff too. So the cause of this problem should be found. Or is it that unusual to use more than one X server and suspend?
As soon as I can spare some time, I'll do some further tests.

> .....
> 
> Maybe there's a BIOS update for your machine?

Negative on that, unfortunately.
Comment 6 Marcin Slusarz 2010-06-05 14:37:21 UTC
"Worrying number of hardware/BIOS bugs" == "hardware is very buggy, so there might be some other hard-to-discover bugs which might affect us"

BUT

I just noticed that some driver is leaking IOMMU space (which is used by both nouveau and your SATA controller), so maybe enabling CONFIG_IOMMU_LEAK in the kernel config will help figure it out.

Why do you have iommu=force in the kernel command line?
Did you enable IOMMU in the BIOS?
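
A small Python sketch of how the two settings asked about here could be read back from a running system. It assumes the command line comes from /proc/cmdline and the kernel configuration is available as plain .config text (for example, unpacked from /proc/config.gz where the kernel provides it); the function names are hypothetical.

```python
def iommu_options(cmdline):
    """Return the values of every iommu= parameter on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            # iommu= accepts a comma-separated list, e.g. iommu=force,leak.
            values.extend(v for v in token[len("iommu="):].split(",") if v)
    return values


def config_enabled(config_text, symbol):
    """True if .config-style text sets the given symbol to y or m."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set" and are skipped.
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1] in ("y", "m")
    return False
```

For example, iommu_options applied to the contents of /proc/cmdline would show whether iommu=force is present, and config_enabled(config_text, "CONFIG_IOMMU_LEAK") would show whether leak tracing was compiled in.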
Comment 7 Ilia Mirkin 2013-08-18 18:09:29 UTC
It appears that this bug report has lain dormant for quite a while. Sorry we haven't gotten to it. Since we fix bugs all the time, chances are pretty good that your issue has been fixed with the latest software. Please give it a shot. (Linux kernel 3.10.7, xf86-video-nouveau 1.0.9, mesa 9.1.6, or their git versions.) If upgrading to the latest isn't an option for you, your distro's bugzilla is probably the right destination for your bug report.

In an effort to clean up our bug list, we're pre-emptively closing all bugs that haven't seen updates since 2011. If the original issue remains, please make sure to provide fresh info, see http://nouveau.freedesktop.org/wiki/Bugs/ for what we need to see, and re-open this one.

Thanks,

The Nouveau Team
