108500 – Crash when creating a depth buffer on GeForce 320M

Summary: Crash when creating a depth buffer on GeForce 320M

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/nouveau
Version:	18.2
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	Nouveau Project
QA Contact:	Nouveau Project

Reported:	2018-10-19 16:59 UTC by Timo Wiren
Modified:	2019-07-20 09:04 UTC
CC List:	0 users

Attachments
glxinfo (102.63 KB, text/plain) 2018-10-19 16:59 UTC, Timo Wiren
dmesg (62.41 KB, text/plain) 2018-10-20 05:55 UTC, Timo Wiren

Description Timo Wiren 2018-10-19 16:59:49 UTC

Created attachment 142101
glxinfo

Every OpenGL application that requests a depth buffer crashes, including glxgears:

glxgears: dri2.c:906: dri2_allocate_textures: Assertion `*zsbuf' failed.

I debugged the assertion with gdb:

templ structure contents passed to resource_create():
$2 = {reference = {count = 0}, width0 = 300, height0 = 300, depth0 = 1, array_size = 1, format = PIPE_FORMAT_Z24X8_UNORM,
  target = PIPE_TEXTURE_2D, last_level = 0, nr_samples = 0, nr_storage_samples = 0, usage = 0, bind = 1, flags = 0,
  next = 0x0, screen = 0x0}

In nv50_miptree_create() (gallium/drivers/nouveau/nv50/nv50_miptree.c:389), the call to nouveau_bo_new() returns -22, which causes nv50_miptree_create() to return NULL.
 
MESA_DEBUG=1 glxgears prints the following before segfaulting:
Mesa: User error: GL_OUT_OF_MEMORY in Resizing framebuffer
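
The failure chain described above (buffer allocation returning -22, the miptree constructor returning NULL, and the caller ending up without a depth buffer) can be modeled roughly as follows. This is only a sketch: every fake_* name is hypothetical and stands in for the real Mesa/libdrm function, and mapping the failure to -ENOMEM is an assumption chosen to mirror the GL_OUT_OF_MEMORY message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer object; not the libdrm type. */
struct fake_bo {
    int size;
};

/* Stand-in for nouveau_bo_new(): returns 0 or a negative errno.
 * We simulate the kernel rejecting a compressed allocation. */
static int fake_bo_new(int want_compressed, struct fake_bo *out)
{
    if (want_compressed)
        return -EINVAL; /* -22, the value seen in the report */
    out->size = 300 * 300 * 4; /* illustrative size for a 300x300 surface */
    return 0;
}

/* Stand-in for nv50_miptree_create(): NULL when allocation fails. */
static struct fake_bo *fake_miptree_create(int want_compressed,
                                           struct fake_bo *storage)
{
    int ret = fake_bo_new(want_compressed, storage);
    if (ret) {
        fprintf(stderr, "bo allocation failed: %d\n", ret);
        return NULL;
    }
    return storage;
}

/* Stand-in for the caller: a NULL depth buffer here is the condition
 * that trips the `*zsbuf' assertion in the report. */
static int allocate_depth(int want_compressed, struct fake_bo *storage,
                          struct fake_bo **zsbuf)
{
    *zsbuf = fake_miptree_create(want_compressed, storage);
    return *zsbuf ? 0 : -ENOMEM;
}
```

In this model, requesting compression leaves *zsbuf NULL, which is the state the assertion rejects; an uncompressed request succeeds.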

Computer: MacBook Pro 2010 (NVIDIA GeForce 320M "MCP89")
Resolution: 1280x800
OS: Lubuntu 18.10
Mesa: 18.2.2, but happens also with the versions that come with Lubuntu 16.04 and 18.04

I can compile and run mesa from sources, if it helps debugging.

Comment 1 Ilia Mirkin 2018-10-19 18:15:30 UTC

Please include your dmesg.

The fact that nouveau_bo_new fails is extremely unexpected.

Comment 2 Timo Wiren 2018-10-20 05:55:36 UTC

Created attachment 142112
dmesg

Added dmesg.

Comment 3 Ilia Mirkin 2018-10-20 15:02:30 UTC

Hrmph. Well, nothing in there. So ... what's different about your environment?

I'm on Xorg 1.19, windowmaker, no compositor of any sort. glxgears works fine.

Tell me about your setup.

Comment 4 Timo Wiren 2018-10-20 16:28:42 UTC

(In reply to Ilia Mirkin from comment #3)
> Hrmph. Well, nothing in there. So ... what's different about your
> environment?
> 
> I'm on Xorg 1.19, windowmaker, no compositor of any sort. glxgears works
> fine.
> 
> Tell me about your setup.

Nothing custom, just the freshly installed Lubuntu 18.10 64-bit, English localization, laptop's internal display, no encrypted disks, no compositor AFAIK.

But I just found a workaround! I downloaded Mesa 18.2.3, edited nv50_miptree.c, and disabled compression in nv50_mt_choose_storage_type() for PIPE_FORMAT_Z24X8_UNORM. That is, I added "compressed = false;" after "tile_flags = 0x128 + ms;", rebuilt Mesa, and ran glxgears against my build; it didn't crash. I don't know whether this is a proper workaround, but the issue seems to be with depth compression.
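
For readers who want to see the shape of that edit, here is a reduced, hypothetical model of a storage-type chooser. Only the "tile_flags = 0x128 + ms" assignment and the "compressed = false" override come from the comment above; the enum, struct, and function names are assumptions made for illustration and do not match the real Mesa code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical format enum; the real code switches on pipe_format. */
enum fake_format {
    FAKE_Z24X8_UNORM,
    FAKE_OTHER
};

struct storage_choice {
    uint32_t tile_flags;
    bool compressed;
};

/* Reduced model: pick tile flags per format and optionally force
 * compression off for the 24-bit depth format (the workaround). */
static struct storage_choice choose_storage(enum fake_format fmt, unsigned ms,
                                            bool compressed, bool workaround)
{
    struct storage_choice c = { 0, compressed };

    switch (fmt) {
    case FAKE_Z24X8_UNORM:
        c.tile_flags = 0x128 + ms;
        if (workaround)
            c.compressed = false; /* the one-line change from the comment */
        break;
    default:
        /* Other formats are out of scope for this sketch. */
        c.compressed = false;
        break;
    }
    return c;
}
```

The point of the sketch is only that the override applies to a single format case and leaves the tile flags untouched.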

Comment 5 Ilia Mirkin 2018-10-24 04:55:23 UTC

Could you boot with

nouveau.debug=mmu=debug

and see what gets printed? I think I see why the -22 (EINVAL) is being generated -- the RAM is marked as stolen, and it rejects compressed memory on it. However I'm not sure that's actually correct.

Comment 6 Timo Wiren 2018-10-24 13:48:46 UTC

(In reply to Ilia Mirkin from comment #5)
> Could you boot with
> 
> nouveau.debug=mmu=debug
> 
> and see what gets printed? I think I see why the -22 (EINVAL) is being
> generated -- the RAM is marked as stolen, and it rejects compressed memory
> on it. However I'm not sure that's actually correct.

Thanks for looking into this. Here's the output:

[  110.000223] nouveau 0000:04:00.0: mmu: user: comp 3 0a
[  110.000227] nouveau 0000:04:00.0: mmu: user: invalid -22

[  110.000275] nouveau 0000:04:00.0: mmu: user: comp 3 0a
[  110.000277] nouveau 0000:04:00.0: mmu: user: invalid -22

[  110.003636] nouveau 0000:04:00.0: mmu: user: comp 3 0a
[  110.003639] nouveau 0000:04:00.0: mmu: user: invalid -22

[  110.003676] nouveau 0000:04:00.0: mmu: user: comp 3 0a
[  110.003678] nouveau 0000:04:00.0: mmu: user: invalid -22

[  110.003697] glxgears[1196]: segfault at 1a ip 00007fa0080b533d sp 00007ffc0a624b00 error 4 in nouveau_dri.so[7fa007ee1000+819000]
[  110.003705] Code: c6 44 24 07 00 49 8b 9d c0 01 00 00 48 85 db 0f 84 e4 00 00 00 80 bb 91 00 00 00 00 0f 85 03 01 00 00 48 8b 4b 58 0f b7 14 24 <0f> b7 41 1a 66 39 51 18 48 89 4c 24 48 66 0f 46 51 18 66 39 44 24

Comment 7 Ilia Mirkin 2018-10-25 03:33:06 UTC

OK, "good". This is what I expected based on my reading of the code, so ... comforting to know that I can read code.

Ben -- this seems wrong. I don't know whether compression is truly supported on the MCP89 stolen RAM, but nouveau_bo_new should not fail there. There is logic in there to turn off compression when it isn't available, but that does not appear to be triggering here.

Comment 8 Ilia Mirkin 2018-10-30 20:21:42 UTC

Could you edit drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmnv50.c:nv50_vmm_valid, and change it to set aper = 0 in the if (ram->stolen) case (so aper=0 in both of the VRAM cases)? Let me know if you need me to make you a proper patch.

Then see if ... stuff works. At least glxgears, but would be good to test more complex things too which make extensive use of depth as well as compressible color formats (something like xonotic would be more than sufficient).

Unfortunately we're not sure if this works on MCP89 or not.
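
As a rough illustration of the check being discussed, the sketch below models a validity function that rejects compression whenever the aperture is non-zero, plus the proposed experiment of forcing aper = 0 for stolen RAM. It is not the kernel code: all names are hypothetical, the aperture value 3 is an assumption inferred from the "comp 3 0a" debug line, and the host-memory value is an arbitrary placeholder.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical memory targets; the kernel has more than these. */
enum mem_target {
    TARGET_VRAM,
    TARGET_HOST
};

struct fake_ram {
    bool stolen; /* VRAM carved out of system memory */
};

/* Choose an aperture. force_aper0 models the proposed experiment of
 * using aper = 0 in both VRAM cases. */
static int pick_aperture(enum mem_target target, const struct fake_ram *ram,
                         bool force_aper0)
{
    if (target == TARGET_VRAM) {
        if (ram->stolen && !force_aper0)
            return 3; /* assumed from the "comp 3 0a" log line */
        return 0;
    }
    return 2; /* arbitrary non-zero placeholder for host memory */
}

/* Reject compressed mappings unless the aperture is 0. */
static int validate_mapping(enum mem_target target, const struct fake_ram *ram,
                            bool want_comp, bool force_aper0)
{
    int aper = pick_aperture(target, ram, force_aper0);

    if (want_comp && aper != 0)
        return -EINVAL; /* the -22 seen in the debug output */
    return 0;
}
```

Passing this check only means the request is accepted; it says nothing about whether the hardware can actually handle compression on that memory.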

Comment 9 Timo Wiren 2018-11-01 18:16:42 UTC

(In reply to Ilia Mirkin from comment #8)
> Could you edit
> drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmnv50.c:nv50_vmm_valid, and change
> it to set aper = 0 in the if (ram->stolen) case (so aper=0 in both of the
> VRAM cases).

I tested it on kernel 4.18.16 and it caused a crash/hang on bootup. Some of the log messages (copied by hand):

nouveau 0000:04:00.0: fb: trapped read at 015bc9dedc on channel 2 [0fba0000 DRM] engine 05 [PFIFO] client 08 [PFIFO_READ] subclient 00 [PUSHBUF] reason 0000000b [PT_NOT_PRESENT]
nouveau 0000:04:00.0: fifo: DMA_PUSHER - ch 2 [DRM] get 015c25a7a4 put 015c25a7a4 ib_get 000000a2 ib_put 000000a7 state a0000000 (err: IB_EMPTY)
nouveau 0000:04:00.0: DRM: GPU lockup - switching to software fbcon
nouveau 0000:04:00.0: fb: trapped write at 0100131fc0 on channel -1 [0fedf000 unknown] engine 06 [BAR] client 04 [PFIFO_WRITE] subclient 01 [IN] reason 000000b [VRAM_LIMIT]

Comment 10 Timo Wiren 2019-05-16 18:31:30 UTC

Since Lubuntu 19.04 the crash has disappeared, but I now get broken depth testing in all GL applications, including glxgears. My workaround (disabling depth compression) still works.

Current kernel: 5.0.0-13-generic
Mesa: 19.0.2

Comment 11 Ilia Mirkin 2019-05-16 19:42:47 UTC

Given the amount of time that this has gone on unfixed, I think we should just make mcp89 point at mcp77_mmu_new instead of g84_mmu_new (in nvkm/engine/device/base.c).

Literally the only difference between those two is the ability to use compression. The quick test in comment #9 didn't yield positive results.

Let's not make things extra-broken for people -- even if compression is somehow enableable on those chips, it's never worked on nouveau, I think.

Timo - are you up to sending a change to fix the above in the kernel? If not, I can do it.
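
The proposal above amounts to swapping one constructor pointer in a per-chipset table. A minimal sketch of that idea, assuming a much simpler table than the real one in nvkm/engine/device/base.c (the struct layouts and fake_* names are hypothetical; only the g84/mcp77 distinction and the "compression is the only difference" claim come from the comment):

```c
#include <stdbool.h>

/* Hypothetical MMU description; the real objects carry far more state. */
struct fake_mmu {
    const char *name;
    bool supports_comp;
};

/* Per the comment above, the two variants differ only in compression. */
static struct fake_mmu fake_g84_mmu_new(void)
{
    struct fake_mmu mmu = { "g84", true };
    return mmu;
}

static struct fake_mmu fake_mcp77_mmu_new(void)
{
    struct fake_mmu mmu = { "mcp77", false };
    return mmu;
}

/* Hypothetical chipset entry holding a constructor pointer. */
struct fake_chipset {
    const char *name;
    struct fake_mmu (*mmu_new)(void);
};

/* Before and after the proposed change for MCP89. */
static const struct fake_chipset mcp89_before = { "MCP89", fake_g84_mmu_new };
static const struct fake_chipset mcp89_after = { "MCP89", fake_mcp77_mmu_new };

static bool chipset_allows_compression(const struct fake_chipset *chip)
{
    return chip->mmu_new().supports_comp;
}
```

With the pointer swapped, userspace requests for compressed buffers on MCP89 would no longer be attempted against an MMU that advertises compression support.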

Comment 12 Timo Wiren 2019-05-17 15:19:50 UTC

(In reply to Ilia Mirkin from comment #11)

> Timo - are you up to sending a change to fix the above in the kernel? If
> not, I can do it.

Well, I have never submitted a patch to the kernel before, but this is a good opportunity to learn the process :-). I'll try to make it happen in a few days.

Comment 13 Timo Wiren 2019-07-20 09:04:04 UTC

My fix seems to be included in Linux 5.3, so resolving as fixed:

https://lists.freedesktop.org/archives/dri-devel/2019-July/227219.html
