Summary: | [SNB] [uxa] bo map failed: Cannot allocate memory | ||
---|---|---|---|
Product: | xorg | Reporter: | nobled <nobled> |
Component: | Driver/intel | Assignee: | Chris Wilson <chris> |
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | normal | ||
Priority: | medium | CC: | daniel, xuan--2010--bugs.freedesktop.org |
Version: | git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
It would appear you have a kernel memory leak as that is where the allocation is failing. kmemleak or slab debugging may help to identify what is causing the allocation failures. An outside possibility is that is being misreported as ENOMEM, though I've just check the callstack to rule out that possibility. (In reply to comment #1) > It would appear you have a kernel memory leak as that is where the allocation > is failing. kmemleak or slab debugging may help to identify what is causing the > allocation failures. An outside possibility is that is being misreported as > ENOMEM, though I've just check the callstack to rule out that possibility. Thanks for the quick reply. I'll recompile the kernel with CONFIG_DEBUG_KMEMLEAK=y and try it out. For the second thing, would booting with slub_debug=FU do the trick? Or what other options would be good--or what bit under /sys/kernel/slab/ should get poked? I'm not sure which slab corresponds to what the drm driver is doing. I also forgot to mention--starting a new X server in parallel to the corrupted one works fine. The out of memory error doesn't appear in the new one even while the old one is still failing. I'm not sure yet what I'm looking for, so just gather as much data as we can regarding to allocations and looks for a pattern/clue. Interesting that a second X server seems immune to the error in the first... I wonder if we are not in fact hitting a per-process limit instead... Created attachment 53727 [details]
`cat /sys/kernel/debug/kmemleak` after ~6 hours
There was a sudden burst of over 40 instances of this in the kmemleak log not long after startup (when I first checked there was only about 4 entries):
unreferenced object 0xffff88022db12180 (size 128):
comm "setfont", pid 547, jiffies 4294894166 (age 842.748s)
hex dump (first 32 bytes):
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
c5 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
backtrace:
[<ffffffff8116c7ff>] create_object+0x10f/0x290
[<ffffffff815d7106>] kmemleak_alloc+0x26/0x50
[<ffffffff81159113>] kmem_cache_alloc_trace+0x123/0x170
[<ffffffff813b6fd9>] con_insert_unipair+0xf9/0x1c0
[<ffffffff813b7c75>] con_set_unimap+0x185/0x230
[<ffffffff813b2165>] vt_ioctl+0x13d5/0x1ef0
[<ffffffff813a6900>] tty_ioctl+0x240/0xc20
[<ffffffff81180e78>] do_vfs_ioctl+0x98/0x540
[<ffffffff811813b1>] sys_ioctl+0x91/0xa0
[<ffffffff815fa942>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
But `ps axu | grep 547` showed pid 547 didn't exist, and X's pid is 992.
Then after being up for about 6 hours, I checked again, and the latest three leaks seemed exactly what we're looking for:
unreferenced object 0xffff880228b6bb80 (size 544):
comm "Xorg", pid 992, jiffies 4294897691 (age 22243.676s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 60 b9 b6 28 02 88 ff ff ........`..(....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8116c7ff>] create_object+0x10f/0x290
[<ffffffff815d7106>] kmemleak_alloc+0x26/0x50
[<ffffffff81159333>] kmem_cache_alloc+0x123/0x180
[<ffffffff812fe8b8>] idr_pre_get+0x68/0x90
[<ffffffffa0266075>] drm_gem_handle_create+0x45/0xe0 [drm]
[<ffffffffa031971d>] i915_gem_create+0x5d/0x100 [i915]
[<ffffffffa03197de>] i915_gem_create_ioctl+0x1e/0x20 [i915]
[<ffffffffa0264324>] drm_ioctl+0x3e4/0x4c0 [drm]
[<ffffffff81180e78>] do_vfs_ioctl+0x98/0x540
[<ffffffff811813b1>] sys_ioctl+0x91/0xa0
[<ffffffff815fa942>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
wat do?
Oh, I just noticed the message got more verbose this time, since I updated past this commit-- http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=4a27dd287cdb85d80ceaf6b54e7de0c17e01c00d For the record, this is the output now: [195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory [195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 0) failed: Cannot allocate memory [195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory [195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory [195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory [195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 0) failed: Cannot allocate memory [195501.643] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory (where the enum access 1 == UXA_ACCESS_RW and 0 == UXA_ACCESS_RO) Digging further down the stack... Is there a way to make xf86-video-intel call drm_intel_bufmgr_set_debug(bufmgr, 1);? That would make it print out some messages to stderr that would hint to which call is failing in drm_intel_gem_bo_map_gtt(). Created attachment 54111 [details]
excerpt from /var/log/gdm/:0.log
Never mind; I solved it by patching libdrm_intel to turn on debugging for just that function--even just that was incredibly verbose, so I can only fit a sample of the log on here. It's the mmap() call that starts failing every time after a certain point (the "bo_map_gtt: mmap" lines). Although it seems like most of the time, drm_intel_gem_bo_map_gtt() is actually called three time for every intel_uxa_prepare_access() call?
Ohhh, I guess X is hitting this per-process limit: cat /proc/sys/vm/max_map_count 65530 Going by the first BO to fail being #65437. *** Bug 40066 has been marked as a duplicate of this bug. *** Looks like we have a winner! This should fix the death: commit 101942d41df7efaa6103e31e738775fafdb63159 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Dec 5 10:11:04 2011 +0000 uxa: Unmap the buffer after swrast Otherwise we may exhaust the per-process vma limit and cause applications to stop rendering and eventually crash the X server. Will only work in conjunction with a new libdrm (2.4.28) and commit c549a77c (intel: Unmap buffers during drm_intel_gem_bo_unmap) in particular. Reported-by: nobled@dreamwidth.org References: https://bugs.freedesktop.org/show_bug.cgi?id=43075 References: https://bugs.freedesktop.org/show_bug.cgi?id=40066 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> However, it does beg the question how we end up 65535 open vma in the first place. Can you cat /sys/kernel/debug/dri/0/i915_gem_objects and xrestop after several hours of running? We're not yet sure whether the current solution is long-term-viable (due to a performance impact). So can you please go back to something broken, recrash your X and then grab /proc/`pidof X`/maps so that we can see who exactly is burning through so many mappings? Thanks a lot, Daniel (In reply to comment #11) > We're not yet sure whether the current solution is long-term-viable (due to a > performance impact). Performance impact! The argument that it slows down texture upload was unfounded. The only one presented was that it slowed down a broken fallback where cairo was reading from the GTT mapping... So the impact on a critical hot path was never demonstrated, whereas the patches to speed up GTT mapping would work irrespective of whether random BO had their vma cached. Created attachment 54128 [details]
/proc/`pidof X`/maps (bzip2-compressed)
"Luckily" I didn't update anything yet, so it happened again just now. Good thing I checked this bug on my phone before restarting X... The file's enormous though, I had to compress it.
Created attachment 54129 [details]
/sys/kernel/debug/dri/0/i915_gem_objects
I'm not sure what I'm supposed to be looking for in xrestop, though? There is a big mysterious "Unknown" at the top of the list, but its numbers aren't that much bigger than the ones below it (firefox, compiz...).
Created attachment 54130 [details]
xrestop -display :0 --max-samples 1 --batch
Okay I found the --batch option actually, so I can just show you now. The totals at the top of the interface show this btw:
Pixmaps: 28450K
Other: 114K
So that tells us that it is entirely the fault of the unused vma on cached bos. The only danger I can see is whether you are using indirect GLX and so suffer a second bo cache for the GL client inside the X server. Is that possible on your system? Comment on attachment 54128 [details]
/proc/`pidof X`/maps (bzip2-compressed)
Oh, yeah. I guess what you're interested in is this?
bzgrep -c "/dev/dri/card0" proc-pidof-X-maps.1.log.bz2
65235
bzgrep -cv "/dev/dri/card0" log/proc-pidof-X-maps.1.log.bz2
296
Yeah it's pretty much all mmap-ings from pseudo-offsets into the drm device.
(In reply to comment #16) > So that tells us that it is entirely the fault of the unused vma on cached bos. > The only danger I can see is whether you are using indirect GLX and so suffer a > second bo cache for the GL client inside the X server. Is that possible on your > system? Oops, heh. Didn't see this comment; should've refreshed. No there's definitely no indirect rendering. `glxinfo | grep render` is pretty clear on that. commit 5d5b2b8ee203ae2274fc7d13ed38d2945facca9e Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Dec 13 11:30:05 2011 +0000 uxa: Cap the maximum number of VMA cached Since we can not keep an unlimited number of vma cached due to the hard per-process limits on the number of mappings and recreating mappings is slow due to excruciatingly slow GTT pagefaults, we need to compromise and keep a small MRU cache of inactive mmaps. This uses the new API in libdrm-2.4.29 to specify the limit upon the VMA cache maintained by libdrm. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 53676 [details] (greatly abbreviated) Xorg.1.log xserver: 1.10.4, git e3a24feb intel xorg driver: 2.17, git f938cd15 mesa i965 dri driver: 7.12, git 0cddea37 libdrm-intel: 2.4.27, git 961bf9b5 kernel: 3.1.0 x86_64 Using Ubuntu 11.04 with the gnome-panel 'Ubuntu Classic' fallback mode on a Sandybridge chipset. After running X for a long time, usually between 24 and 72 hours, the whole X session starts showing corruption that only goes away with a new session. Text starts turning blank, or using the wrong glyphs, or just weird or illegible, window decorations seem to disappear, and web pages get misrendered with junk. Firefox, Ubuntu's "Indicator Applet Session 0.4.12" applet in the gnome-panel, and the gnome-screenshot tool have also started crashing after the corruption appeared, though not consistently. Some parts seem to not be touched by the effect, like if an eog(eye of gnome) window is open, the picture will stay fine. A Flash embed in a webpage in Chromium also stays clear, even though the webpage around it becomes illegible. The Xorg.0.log file actually ended up being 1.3 GB (yeah, gigabytes...), because it got spammed with this error message for several hours before I got back to the computer and killed the session: [339361.941] (WW) intel(0): intel_uxa_prepare_access: bo map failed: Cannot allocate memory