Bug 74613 - [NV34] [v3.14-rc1] nouveau: get 0x10000000 put 0x0000ed30 state 0xc0000000 (err: MEM_FAULT) push 0x00000000
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2014-02-06 12:45 UTC by Ronald
Modified: 2014-02-19 06:51 UTC

Attachments
Full dmesg of crash (36.36 KB, text/plain)
2014-02-06 12:45 UTC, Ronald

Description Ronald 2014-02-06 12:45:25 UTC
Created attachment 93524
Full dmesg of crash

I tried to boot a v3.14-rc1 kernel. The previous kernel was v3.13-rc8. I'm attaching the full dmesg; the only relevant lines were these:

[    2.357486] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 0 [DRM] get 0x10000000 put 0x0000ed30 state 0xc0000000 (err: MEM_FAULT) push 0x00000000
[    2.357764] Console: switching to colour frame buffer device 160x64
[    2.606574] nouveau E[   PFIFO][0000:01:00.0] DMA_PUSHER - ch 0 [DRM] get 0x10000000 put 0x000091cc state 0xc0000000 (err: MEM_FAULT) push 0x00000000
[   17.350008] nouveau E[     DRM] failed to idle channel 0xcccc0000 [DRM]
[   17.350011] nouveau E[     DRM] GPU lockup - switching to software fbcon
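
For anyone who wants to pull the numbers out of such lines when comparing dmesg logs, here is a minimal sketch in C. The struct and function names are hypothetical, and the field names (ch, get, put, state, push) are copied from the log text itself; nothing here claims to describe what the driver means by them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container; field names mirror the words in the log line. */
struct pusher_err {
	unsigned int ch, get, put, state, push;
};

/* Returns 0 and fills *out if the line is a DMA_PUSHER error, -1 otherwise. */
int parse_dma_pusher(const char *line, struct pusher_err *out)
{
	assert(line != NULL && out != NULL);

	/* Skip the timestamp and subsystem prefix. */
	const char *p = strstr(line, "DMA_PUSHER - ch ");
	if (!p)
		return -1;

	/* %*[^]] skips the channel name, %*[^)] skips the error name. */
	int n = sscanf(p,
		       "DMA_PUSHER - ch %u [%*[^]]] get %x put %x state %x (err: %*[^)]) push %x",
		       &out->ch, &out->get, &out->put, &out->state, &out->push);
	return n == 5 ? 0 : -1;
}
```

Calling parse_dma_pusher() on the first line above should yield get 0x10000000 and put 0x0000ed30; lines such as the "GPU lockup" message are rejected.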

It booted (nice), but the KMS console was initially corrupted in the top section. The penguin was there, but the area next to it was pinkish and garbled.

Once it switched from /dev/console to /dev/tty1 (I think that is what happens) the corruption is gone.

X starts, but not without problems. I saw rectangular parts of Opera alternate between correct display and black triangles.

The weird thing is that no new errors popped up after that. The following list shows the suspects:

git log --topo-order --oneline v3.13-rc8^...HEAD --no-merges -- drivers/gpu/drm/nouveau Makefile

38dbfb5 Linus 3.14-rc1
f3980dc drm/nouveau: resume display if any later suspend bits fail
09c3de1 drm/nouveau: fix lock unbalance in nouveau_crtc_page_flip
d83ef85 drm/nouveau: implement hooks for needed for drm vblank timestamping support
d2fa7d3 drm/nouveau/disp: add a method to fetch info needed by drm vblank timestamping
eb2e968 drm/nv50: fill in crtc mode struct members from crtc_mode_fixup
1139ffb drm/nouveau: call drm_vblank_cleanup() earlier
2332b31 drm/nouveau: create base display from common code
ea7dce9 drm/nv50/gr: print mpc trap name when it's not an mp trap
f750ecc drm/nv50/gr: update list of mp errors, make it a bitfield
e2dd003 drm/nv50/gr: add more trap names to print on error
f87cd8b drm/nouveau/devinit: lock/unlock crtc regs for all devices, not just pre-nv50
d5c1e84 drm/nouveau: hold mutex while syncing to kernel channel
4019aaa drm/nv50-/devinit: prevent use of engines marked as disabled by hw/vbios
f0d13e3 drm/nouveau/device: provide a way for devinit to mark engines as disabled
cf33601 drm/nouveau/devinit: tidy up the subdev class definition
5222555 drm/nouveau/bar: tidy up the subdev and object class definitions
ab60619 drm/nouveau/instmem: tidy up the object class definition
24a4ae8 drm/nouveau/instmem: tidy up the subdev class definition
64c672a drm/nouveau/pwr: implement a simple i2c stack
2e9dfe2 drm/nouveau/pwr: have rd/wr32 routines clobber data instead of addr
7321623 drm/nve0/fb: turn off some bits in 10f584 at init
cb54dd2 drm/nve0/fb/gddr5: merge a fix from ddr3 for one of the timing settings
b13d0e4 drm/nve0/fb/gddr5: yet another random 10f200 bit
c814a60 drm/nvc0-/fb: hook up skeleton interrupt handler
7f39e59 drm/nve0/fb/gddr5: more 10f200 stuff
12642e3 drm/nve0/clk: report ddr memory frequency
1a894c0 drm/nouveau/fb/gddr5: make sure we update mr7 when we're supposed to
a8ccbb7 drm/nve0/fb/gddr5: 10f698/69c
cfe1760 drm/nve0/fb: it's now safe to obey the memory voltage setting properly
46bf1c3 drm/nve0/fb: multi-stage reclock is required for certain transitions
1789cab drm/nouveau/clk: allow fb to signal it needs to do a multi-stage reclock
b655f2b drm/nve0/fb/gddr5: parse bios data into struct rather than using directly
ea8b4a3 drm/nve0/fb/gddr5: found LP3 setting
971372e drm/nve0/fb: note the memory voltage toggle, not using it yet
db6735c drm/nve0/fb/gddr5: somewhat better attempt at 100770/10f604/610/614
f4aa2c6 drm/nve0/fb/gddr5: fixup delays a bit
1522eca drm/nouveau/bios: timing 2.0 entries can have subentries
09692e5 drm/nve0/fb/gddr5: note another semi-unknown
1e1d6b4 drm/nouveau/fb/gddr5: modify mr8 with high bits of CL/WR
e7084c6 drm/nve0/fb/gddr5: fix calculation of RDQS setting
334565a drm/nve0/fb/gddr5: switch off some other random bit at some point
0189169 drm/nve0/fb/gddr5: punt all 10f910/914 accesses through ram_train
d394fb1 drm/nve0/fb/gddr5: not all memory partitions are created equal
dd95c8f drm/nve0/fb: typo in register name
0a0dc8f drm/nouveau/bios: make common code to handle ramcfg strap etc
5905439 drm/nve0/fb/gddr5: fix an assumption of sane memory controller layout
2daaf5b drm/nve0/fb/gddr5: fix behaviour of lp3 setting
cb1567c drm/nve0/fifo: recover from mmu faults on bar1/bar3
649ec92 drm/nve0/fifo: keep mmu fault interrupts enabled at all times
e1b6b14 drm/nve0/fifo: update human-readable mmu fault descriptions
e9fb980 drm/nve0/fifo: document more intr status bits
9f8459c drm/nve0/fifo: populate PBDMA status bitfield with more definitions
39b0554 drm/nve0/fifo: s/subfifo/PBDMA/
f82c44a drm/nve0/fifo: s/playlist/runlist/
f76dd80 drm/nvf0/gr: enable acceleration with our chsw ucode
aa97cd3 drm/nv108/gr: enable acceleration with our chsw ucode
5d91e19 drm/nvc0-/gr: handle fwmthd interrupts in ucode
e1b22bc drm/nvc0-/gr: fiddle some magic around strand init
96616b4 drm/nv108/gr: initial support (need external fuc)
daa9ab5 drm/nv108/ce: enable copy engines
a763951 drm/nv108/fifo: initial support
a0f95f1 drm/nvf0/gr: remove a copy+pasto in ctx reglist
67af60f drm/nvc0-/gr: bring in some macros to abstract falcon isa differences
90d6db1 drm/nouveau/falcon: use vmalloc to create firwmare copies
d96bf43 drm/nouveau/gem: remove (now) unneeded pre-validate fence sync
cef9e99 drm/nouveau/ttm: explicitly wait for bo idle before memcpy buffer move
35b8141 drm/nouveau/ttm: explicity sync with kernel channel before moving buffer
3c57d85 drm/nouveau/ttm: tidy up creation of temporary buffer move vmas
ab9b18a drm/nv04/plane: add support for nv04/nv05 video overlay
7ffb078 drm/nv10/plane: add YUYV support
a554090 drm/nv50-: map TTM_PL_SYSTEM through a BAR for CPU access
ce8f769 drm/nouveau: fix m2mf copy to tiled gart
2e2cfbe drm/nouveau/vm: reduce number of entry-points to vm_map()
d0ce7b856 drm/nouveau: make vga_switcheroo code depend on VGA_SWITCHEROO
85b2331 drm: Kill DRM_*MEMORYBARRIER
1d6ac18 drm: Kill DRM_COPY_(TO|FROM)_USER
bfd8303 drm: Kill DRM_HZ
b072e53 ACPI / nouveau: replace open-coded _DSM code with helper functions
4988d0a nouveau / ACPI: fix memory leak in ACPI _DSM related code
8b48463 ACPI: Clean up inclusions of ACPI header files
d8ec26d Linux 3.13
72de182 drm/nouveau/mxm: fix null deref on load
fdd239a drm/nouveau: fix null ptr dereferences on some boards
7e22e91 Linux 3.13-rc8
Comment 1 Ronald 2014-02-06 12:56:21 UTC
Btw, is http://cgit.freedesktop.org/nouveau/linux-2.6 not used anymore?

I kind of liked it. It makes bisecting easier on aged machines, and it makes testing new patches easier.

Now it's all hidden in a huge changeset between v3.13 final and v3.14-rc1, and sometimes even spread out over several periods in time, causing full rebuilds.
Comment 2 Ronald 2014-02-07 16:28:23 UTC
Please note the new CC (is that a +1?).

Made a small typo.

X starts, but not without problems. I saw rectangular parts of Opera alternate between correct display and black triangles.

triangles->squares.
Comment 3 Ilia Mirkin 2014-02-15 10:41:37 UTC
(In reply to comment #1)
> Btw, is http://cgit.freedesktop.org/nouveau/linux-2.6 not used anymore?

It is... e.g. look at the drm-nouveau-next branch. I don't think the 'master' branch is really maintained anymore; that work is happening in http://cgit.freedesktop.org/~darktama/nouveau/ which you may also be able to use to bisect without the rest of the kernel, if your range is sufficiently small. (Although that repo is a little different...)

> 
> I kind of liked it. It makes bisecting easier on aged machines, and it
> makes testing new patches easier.
> 
> Now it's all hidden in a huge changeset between v3.13 final and
> v3.14-rc1, and sometimes even spread out over several periods in time,
> causing full rebuilds.

I'm confused by what you're trying to say here. v3.14-rc1 and v3.13 are both available in Linus's repo. Use that to bisect....

git bisect start v3.14-rc1 v3.13 -- drivers/gpu/drm/nouveau

That should only look at changes in nouveau between those 2 revs...
Comment 4 Emil Velikov 2014-02-15 13:38:27 UTC
Hmm, this seems like a duplicate of the old bug you reported, bug 71116.

Can you update libdrm/libdrm-nouveau to 2.4.48 before doing any bisection? A fresh dmesg may be useful as well.
Comment 5 Ronald 2014-02-15 15:01:53 UTC
@Ilia Mirkin: Ah, yes the next repo seems promising. Thank you.

About the 'old' master repo: Say, for example, stable was v3.8 and then this machine would happily use it. Every week I would pull from this master branch the patches for v3.9 based on the v3.8 tree.

These patches are localized in the nouveau driver, so I have quick rebuilds and thus quick bisects. Problems are narrowed down easily and most importantly: fast.

Using Linus's tree sometimes means doing a lot more bisect steps, since a lot of changes come in at once after two months. Furthermore, it's not just the nouveau driver that changes but the entire tree, so bisecting also implies doing full rebuilds.

However, drm-nouveau-next seems to fill the gap which master left behind.

@Emil Velikov: Does not seem like it:

[gebruiker@delta linux]$ pacman -Qi libdrm
Name : libdrm
Version : 2.4.52-1
Description : Userspace interface to kernel DRM services
Architecture : i686

ArchLinux is quick with its updates, which is really nice.

Oh well, on to the bisect then =)
Comment 6 Ronald 2014-02-15 19:14:13 UTC
It's "drm/nv50-: map TTM_PL_SYSTEM through a BAR for CPU access"

Despite its name, it touches generic code. I'm running v3.14-rc2 with this patch reverted: no issues yet.

commit a554090664728384c94b027ba15bc7df87f9ac09
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Tue Nov 12 13:34:09 2013 +0100

    drm/nv50-: map TTM_PL_SYSTEM through a BAR for CPU access
    
    Moves bo's to TTM_PL_TT for BAR mapping, to hide tiling from user.
    
    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index e4623e9..39ca36c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -1241,6 +1241,7 @@ nouveau_ttm_io_mem_reserve(struct ttm_bo_device *bdev, struct ttm_mem_reg *mem)
 {
 	struct ttm_mem_type_manager *man = &bdev->man[mem->mem_type];
 	struct nouveau_drm *drm = nouveau_bdev(bdev);
+	struct nouveau_mem *node = mem->mm_node;
 	struct drm_device *dev = drm->dev;
 	int ret;
 
@@ -1263,14 +1264,16 @@ nouveau_ttm_io_mem_reserve(struct ttm_bo_device *bdev, struct ttm_mem_reg *mem)
 			mem->bus.is_iomem = !dev->agp->cant_use_aperture;
 		}
 #endif
-		break;
+		if (!node->memtype)
+			/* untiled */
+			break;
+		/* fallthrough, tiled memory */
 	case TTM_PL_VRAM:
 		mem->bus.offset = mem->start << PAGE_SHIFT;
 		mem->bus.base = pci_resource_start(dev->pdev, 1);
 		mem->bus.is_iomem = true;
 		if (nv_device(drm->device)->card_type >= NV_50) {
 			struct nouveau_bar *bar = nouveau_bar(drm->device);
-			struct nouveau_mem *node = mem->mm_node;
 
 			ret = bar->umap(bar, node, NV_MEM_ACCESS_RW,
 					&node->bar_vma);
@@ -1306,6 +1309,7 @@ nouveau_ttm_fault_reserve_notify(struct ttm_buffer_object *bo)
 	struct nouveau_bo *nvbo = nouveau_bo(bo);
 	struct nouveau_device *device = nv_device(drm->device);
 	u32 mappable = pci_resource_len(device->pdev, 1) >> PAGE_SHIFT;
+	int ret;
 
 	/* as long as the bo isn't in vram, and isn't tiled, we've got
 	 * nothing to do here.
@@ -1314,10 +1318,20 @@ nouveau_ttm_fault_reserve_notify(struct ttm_buffer_object *bo)
 		if (nv_device(drm->device)->card_type < NV_50 ||
 		    !nouveau_bo_tile_layout(nvbo))
 			return 0;
+
+		if (bo->mem.mem_type == TTM_PL_SYSTEM) {
+			nouveau_bo_placement_set(nvbo, TTM_PL_TT, 0);
+
+			ret = nouveau_bo_validate(nvbo, false, false);
+			if (ret)
+				return ret;
+		}
+		return 0;
 	}
 
 	/* make sure bo is in mappable vram */
-	if (bo->mem.start + bo->mem.num_pages < mappable)
+	if (nv_device(drm->device)->card_type >= NV_50 ||
+	    bo->mem.start + bo->mem.num_pages < mappable)
 		return 0;
Comment 7 Ilia Mirkin 2014-02-15 19:35:58 UTC
That code _really_ shouldn't affect anything pre-nv50...

The only suspect is in the second hunk. node->memtype is only ever set to != 0 for nv50+, but... who knows.

Can you add a WARN_ON_ONCE(node->memtype) in there?

The third hunk only affects code for card_type >= NV_50... unless I'm reading something very wrong.
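
The reading of the third hunk can be checked with a small standalone sketch. This is not the kernel code: the function names and the enum are hypothetical stand-ins, and only the two conditions are taken from the diff in comment 6.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's card_type values. */
enum card_type { CARD_NV_30 = 0x30, CARD_NV_50 = 0x50 };

/* Condition before the patch: bo must end below the mappable limit. */
bool mappable_check_before(unsigned long start, unsigned long num_pages,
			   unsigned long mappable)
{
	return start + num_pages < mappable;
}

/* Condition after the patch: NV50+ short-circuits, older cards do not. */
bool mappable_check_after(enum card_type card, unsigned long start,
			  unsigned long num_pages, unsigned long mappable)
{
	return card >= CARD_NV_50 || start + num_pages < mappable;
}

/* Self-check: for a pre-NV50 card both conditions agree on every input tried. */
void check_pre_nv50_unchanged(void)
{
	const unsigned long mappable = 64;

	for (unsigned long start = 0; start < 128; start += 8)
		for (unsigned long pages = 1; pages <= 16; pages *= 2)
			assert(mappable_check_after(CARD_NV_30, start, pages, mappable) ==
			       mappable_check_before(start, pages, mappable));
}
```

With card fixed below NV_50 the extra term is always false, so the new condition reduces to the old one. That is consistent with the third hunk not being the culprit on an NV34.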
Comment 8 Ilia Mirkin 2014-02-15 19:51:21 UTC
Aha, I think I know what's going on. You're using AGP, which means you get to use the ttm_bo_manager_func. This in turn allocates nodes as drm_mm_node, which is *totally* different from nouveau_mem. (And that's what gets stored in mem->mm_node.)

Can you change

		if (!node->memtype)
			/* untiled */
			break;

to

		if (nv_device(drm->device)->card_type < NV_50 || !node->memtype)
			break;

and see if that helps? I actually semi-suspect that this is a giant issue with the TTM stuff the way we're using it, but perhaps all the other uses are guarded behind card < NV_50 logic as well, but that's not obvious to me.
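
To make the suspected problem concrete, here is a simplified sketch of the pattern. All types and names are hypothetical stand-ins (generic_node plays the role of drm_mm_node, driver_node the role of nouveau_mem), and it assumes, as described above, that only NV50+ paths are guaranteed to store the driver's own node type behind the untyped pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's card_type values. */
enum card_type { CARD_NV_30 = 0x30, CARD_NV_50 = 0x50 };

/* Plays the role of drm_mm_node (what the generic manager allocates). */
struct generic_node {
	unsigned long start;
	unsigned long size;
};

/* Plays the role of nouveau_mem (what the driver's own manager allocates). */
struct driver_node {
	unsigned int page_shift;
	unsigned int memtype;	/* non-zero means tiled */
};

struct mem_reg {
	void *mm_node;	/* untyped: the real type depends on the allocator */
};

/* Returns true when the buffer must take the tiled (BAR) mapping path. */
bool needs_bar_mapping(enum card_type card, const struct mem_reg *mem)
{
	assert(mem != NULL);

	/* Pre-NV50: mm_node may be a generic_node, so never interpret it. */
	if (card < CARD_NV_50)
		return false;

	/* Assumption of this sketch: on NV50+ the node is a driver_node. */
	const struct driver_node *node = mem->mm_node;
	return node->memtype != 0;
}
```

The point of the ordering is that the card check runs before the pointer is dereferenced as a driver_node. Without it, whatever bytes of a generic_node happen to sit at the memtype offset decide whether the tiled path is taken.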
Comment 9 Ronald 2014-02-16 08:22:23 UTC
Yes, that fixes it.
Comment 10 Ilia Mirkin 2014-02-16 08:24:34 UTC
(In reply to comment #9)
> Yes, that fixes it.

Awesome! I sent a patch to the ML + cc'd you (I assume you got it) a few hours ago. Feel free to respond with a Tested-by. I'll close this bug when the patch hits mainline.

Thanks for bisecting, and sorry that you're running into so many problems!
Comment 11 Ronald 2014-02-16 09:31:29 UTC
No need to apologize. I'm learning a lot *and* get to contribute. I'll get back to you about the other 2 bugs.
Comment 12 Ilia Mirkin 2014-02-19 03:26:43 UTC
The fix should now be upstream, and will be included in the next 3.14-rc:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34d595081812da62b5357579267c4ab5eae64ac1
Comment 13 Ronald 2014-02-19 06:51:03 UTC
Thanks, I changed my gitconfig and I'm already pulling from drm-nouveau-next :) .

