Created attachment 71269 [details] kernel log showing BUG triggered by nouveau If nouveau_vm_new() fails in nouveau_drm_open(), the cleanup triggered corrupts the kernel slab allocator (in this case, SLUB). Attached is the kernel log showing the page allocation failure and the subsequent BUG in mm/slub.c A similar corruption had previously occurred which triggered a GP fault in the mm allocator from the same code path. This was reported as kernel bug #51291 here https://bugzilla.kernel.org/show_bug.cgi?id=51291
Created attachment 71290 [details] [review] fix
(In reply to comment #1) > Created attachment 71290 [details] [review] [review] > fix - vm = *pvm = kzalloc(sizeof(*vm), GFP_KERNEL); + vm = kzalloc(sizeof(*vm), GFP_KERNEL); How/why *not* setting cli->base.vm to NULL fixes this? Also, this assignment idiom is common in the nouveau driver code. Is the above fix just one of many necessary? core/subdev/vm/base.c: vm = *pvm = kzalloc(sizeof(*vm), GFP_KERNEL); core/core/object.c: object = *pobject = kzalloc(size, GFP_KERNEL); core/core/ramht.c: co = ho = nouveau_ramht_hash(ramht, chid, handle); core/core/handle.c: handle = *phandle = kzalloc(sizeof(*handle), GFP_KERNEL); nouveau_abi16.c: cli->abi16 = abi16 = kzalloc(sizeof(*abi16), GFP_KERNEL); nouveau_bo.c: struct nouveau_channel *chan = chan = drm->channel; /* COMMENT: THIS ONE IS INTERESTING */ nouveau_chan.c: chan = *pchan = kzalloc(sizeof(*chan), GFP_KERNEL); nouveau_display.c: disp = drm->display = kzalloc(sizeof(*disp), GFP_KERNEL); nouveau_pm.c: pm = drm->pm = kzalloc(sizeof(*pm), GFP_KERNEL); nv04_fence.c: priv = drm->fence = kzalloc(sizeof(*priv), GFP_KERNEL); nv10_fence.c: fctx = chan->fence = kzalloc(sizeof(*fctx), GFP_KERNEL); nv10_fence.c: priv = drm->fence = kzalloc(sizeof(*priv), GFP_KERNEL); nv50_fence.c: fctx = chan->fence = kzalloc(sizeof(*fctx), GFP_KERNEL); nv50_fence.c: priv = drm->fence = kzalloc(sizeof(*priv), GFP_KERNEL); nv84_fence.c: fctx = chan->fence = kzalloc(sizeof(*fctx), GFP_KERNEL); nv84_fence.c: priv = drm->fence = kzalloc(sizeof(*priv), GFP_KERNEL); nvc0_fence.c: fctx = chan->fence = kzalloc(sizeof(*fctx), GFP_KERNEL); nvc0_fence.c: priv = drm->fence = kzalloc(sizeof(*priv), GFP_KERNEL);
It's not a problem with vm allocation. The next one (vm->pgt) fails, so we free vm, leaving *pvm pointing at freed memory. When we get to nouveau_drm_open, we call nouveau_cli_destroy(cli), which tries to free cli->base.vm again. Oops. I already checked other places and some of them also have this bug. I'll post fixes in a few days.
(In reply to comment #3) > It's not a problem with vm allocation. The next one (vm->pgt) fails, so we > free vm, leaving *pvm pointing at freed memory. When we get to > nouveau_drm_open, we call nouveau_cli_destroy(cli), which tries to free > cli->base.vm again. Oops. Thanks for the explanation. That makes sense to me now. FYI, I did also file a bug in the kernel bugzilla for the memory allocation failure itself (kernel bug 51301 here https://bugzilla.kernel.org/show_bug.cgi?id=51301). A 32k allocation on a 10gb machine shouldn't really ever fail.
Fixed in 3.8-rc2.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.