Bug 43075

Summary:

[SNB] [uxa] bo map failed: Cannot allocate memory

Product:

xorg

Reporter:

nobled <nobled>

Component:

Driver/intel

Assignee:

Chris Wilson <chris>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

medium

CC:

daniel, xuan--2010--bugs.freedesktop.org

Version:

git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
(greatly abbreviated) Xorg.1.log	none
`cat /sys/kernel/debug/kmemleak` after ~6 hours	none
excerpt from /var/log/gdm/:0.log	none
/proc/`pidof X`/maps (bzip2-compressed)	none
/sys/kernel/debug/dri/0/i915_gem_objects	none
xrestop -display :0 --max-samples 1 --batch	none

Description nobled 2011-11-18 12:06:08 UTC

Created attachment 53676 [details]
(greatly abbreviated) Xorg.1.log

xserver: 1.10.4, git e3a24feb
intel xorg driver: 2.17, git f938cd15
mesa i965 dri driver: 7.12, git 0cddea37
libdrm-intel: 2.4.27, git 961bf9b5
kernel: 3.1.0 x86_64

Using Ubuntu 11.04 with the gnome-panel 'Ubuntu Classic' fallback mode on a Sandybridge chipset.

After running X for a long time, usually between 24 and 72 hours, the whole X session starts showing corruption that only goes away with a new session. Text starts turning blank, or using the wrong glyphs, or just weird or illegible, window decorations seem to disappear, and web pages get misrendered with junk. Firefox, Ubuntu's "Indicator Applet Session 0.4.12" applet in the gnome-panel, and the gnome-screenshot tool have also started crashing after the corruption appeared, though not consistently.

Some parts seem to not be touched by the effect, like if an eog(eye of gnome) window is open, the picture will stay fine. A Flash embed in a webpage in Chromium also stays clear, even though the webpage around it becomes illegible.

The Xorg.0.log file actually ended up being 1.3 GB (yeah, gigabytes...), because it got spammed with this error message for several hours before I got back to the computer and killed the session:

[339361.941] (WW) intel(0): intel_uxa_prepare_access: bo map failed: Cannot allocate memory

Comment 1 Chris Wilson 2011-11-18 13:12:55 UTC

It would appear you have a kernel memory leak as that is where the allocation is failing. kmemleak or slab debugging may help to identify what is causing the allocation failures. An outside possibility is that is being misreported as ENOMEM, though I've just check the callstack to rule out that possibility.

Comment 2 nobled 2011-11-19 08:03:31 UTC

(In reply to comment #1)
> It would appear you have a kernel memory leak as that is where the allocation
> is failing. kmemleak or slab debugging may help to identify what is causing the
> allocation failures. An outside possibility is that is being misreported as
> ENOMEM, though I've just check the callstack to rule out that possibility.

Thanks for the quick reply. I'll recompile the kernel with CONFIG_DEBUG_KMEMLEAK=y and try it out. For the second thing, would booting with slub_debug=FU do the trick? Or what other options would be good--or what bit under /sys/kernel/slab/ should get poked? I'm not sure which slab corresponds to what the drm driver is doing.

I also forgot to mention--starting a new X server in parallel to the corrupted one works fine. The out of memory error doesn't appear in the new one even while the old one is still failing.

Comment 3 Chris Wilson 2011-11-19 08:07:17 UTC

I'm not sure yet what I'm looking for, so just gather as much data as we can regarding to allocations and looks for a pattern/clue.

Interesting that a second X server seems immune to the error in the first... I wonder if we are not in fact hitting a per-process limit instead...

Comment 4 nobled 2011-11-20 11:02:50 UTC

Created attachment 53727 [details]
`cat /sys/kernel/debug/kmemleak` after ~6 hours


There was a sudden burst of over 40 instances of this in the kmemleak log not long after startup (when I first checked there was only about 4 entries):

unreferenced object 0xffff88022db12180 (size 128):
  comm "setfont", pid 547, jiffies 4294894166 (age 842.748s)
  hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
    c5 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace:
    [<ffffffff8116c7ff>] create_object+0x10f/0x290
    [<ffffffff815d7106>] kmemleak_alloc+0x26/0x50
    [<ffffffff81159113>] kmem_cache_alloc_trace+0x123/0x170
    [<ffffffff813b6fd9>] con_insert_unipair+0xf9/0x1c0
    [<ffffffff813b7c75>] con_set_unimap+0x185/0x230
    [<ffffffff813b2165>] vt_ioctl+0x13d5/0x1ef0
    [<ffffffff813a6900>] tty_ioctl+0x240/0xc20
    [<ffffffff81180e78>] do_vfs_ioctl+0x98/0x540
    [<ffffffff811813b1>] sys_ioctl+0x91/0xa0
    [<ffffffff815fa942>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff


But `ps axu | grep 547` showed pid 547 didn't exist, and X's pid is 992.

Then after being up for about 6 hours, I checked again, and the latest three leaks seemed exactly what we're looking for:

unreferenced object 0xffff880228b6bb80 (size 544):
  comm "Xorg", pid 992, jiffies 4294897691 (age 22243.676s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 60 b9 b6 28 02 88 ff ff  ........`..(....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8116c7ff>] create_object+0x10f/0x290
    [<ffffffff815d7106>] kmemleak_alloc+0x26/0x50
    [<ffffffff81159333>] kmem_cache_alloc+0x123/0x180
    [<ffffffff812fe8b8>] idr_pre_get+0x68/0x90
    [<ffffffffa0266075>] drm_gem_handle_create+0x45/0xe0 [drm]
    [<ffffffffa031971d>] i915_gem_create+0x5d/0x100 [i915]
    [<ffffffffa03197de>] i915_gem_create_ioctl+0x1e/0x20 [i915]
    [<ffffffffa0264324>] drm_ioctl+0x3e4/0x4c0 [drm]
    [<ffffffff81180e78>] do_vfs_ioctl+0x98/0x540
    [<ffffffff811813b1>] sys_ioctl+0x91/0xa0
    [<ffffffff815fa942>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff

wat do?

Comment 5 nobled 2011-12-01 12:19:23 UTC

Oh, I just noticed the message got more verbose this time, since I updated past this commit--
http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=4a27dd287cdb85d80ceaf6b54e7de0c17e01c00d

For the record, this is the output now:

[195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory
[195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 0) failed: Cannot allocate memory
[195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory
[195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory
[195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory
[195501.622] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 0) failed: Cannot allocate memory
[195501.643] (WW) intel(0): intel_uxa_prepare_access: bo map (use gtt? 1, access 1) failed: Cannot allocate memory

(where the enum access 1 == UXA_ACCESS_RW and 0 == UXA_ACCESS_RO)

Comment 6 nobled 2011-12-01 12:29:41 UTC

Digging further down the stack... Is there a way to make xf86-video-intel call drm_intel_bufmgr_set_debug(bufmgr, 1);? That would make it print out some messages to stderr that would hint to which call is failing in drm_intel_gem_bo_map_gtt().

Comment 7 nobled 2011-12-04 11:44:30 UTC

Created attachment 54111 [details]
excerpt from /var/log/gdm/:0.log

Never mind; I solved it by patching libdrm_intel to turn on debugging for just that function--even just that was incredibly verbose, so I can only fit a sample of the log on here. It's the mmap() call that starts failing every time after a certain point (the "bo_map_gtt: mmap" lines). Although it seems like most of the time, drm_intel_gem_bo_map_gtt() is actually called three time for every intel_uxa_prepare_access() call?

Comment 8 nobled 2011-12-04 15:29:04 UTC

Ohhh, I guess X is hitting this per-process limit:
  cat /proc/sys/vm/max_map_count
65530

Going by the first BO to fail being #65437.

Comment 9 Chris Wilson 2011-12-05 02:34:32 UTC

*** Bug 40066 has been marked as a duplicate of this bug. ***

Comment 10 Chris Wilson 2011-12-05 02:36:34 UTC

Looks like we have a winner!

This should fix the death:

commit 101942d41df7efaa6103e31e738775fafdb63159
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Dec 5 10:11:04 2011 +0000

    uxa: Unmap the buffer after swrast
    
    Otherwise we may exhaust the per-process vma limit and cause
    applications to stop rendering and eventually crash the X server.
    
    Will only work in conjunction with a new libdrm (2.4.28) and commit
      c549a77c (intel: Unmap buffers during drm_intel_gem_bo_unmap)
    in particular.
    
    Reported-by: nobled@dreamwidth.org
    References: https://bugs.freedesktop.org/show_bug.cgi?id=43075
    References: https://bugs.freedesktop.org/show_bug.cgi?id=40066
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

However, it does beg the question how we end up 65535 open vma  in the first place. Can you cat /sys/kernel/debug/dri/0/i915_gem_objects and xrestop after several hours of running?

Comment 11 Daniel Vetter 2011-12-05 08:59:55 UTC

We're not yet sure whether the current solution is long-term-viable (due to a performance impact). So can you please go back to something broken, recrash your X and then grab

/proc/`pidof X`/maps

so that we can see who exactly is burning through so many mappings?

Thanks a lot, Daniel

Comment 12 Chris Wilson 2011-12-05 09:11:30 UTC

(In reply to comment #11)
> We're not yet sure whether the current solution is long-term-viable (due to a
> performance impact).

Performance impact! The argument that it slows down texture upload was unfounded. The only one presented was that it slowed down a broken fallback where cairo was reading from the GTT mapping... So the impact on a critical hot path was never demonstrated, whereas the patches to speed up GTT mapping would work irrespective of whether random BO had their vma cached.

Comment 13 nobled 2011-12-05 18:46:20 UTC

Created attachment 54128 [details]
/proc/`pidof X`/maps (bzip2-compressed)

"Luckily" I didn't update anything yet, so it happened again just now. Good thing I checked this bug on my phone before restarting X... The file's enormous though, I had to compress it.

Comment 14 nobled 2011-12-05 18:48:58 UTC

Created attachment 54129 [details]
/sys/kernel/debug/dri/0/i915_gem_objects

I'm not sure what I'm supposed to be looking for in xrestop, though? There is a big mysterious "Unknown" at the top of the list, but its numbers aren't that much bigger than the ones below it (firefox, compiz...).

Comment 15 nobled 2011-12-05 19:02:11 UTC

Created attachment 54130 [details]
xrestop -display :0 --max-samples 1 --batch

Okay I found the --batch option actually, so I can just show you now. The totals at the top of the interface show this btw:

Pixmaps: 28450K
Other: 114K

Comment 16 Chris Wilson 2011-12-06 04:16:40 UTC

So that tells us that it is entirely the fault of the unused vma on cached bos. The only danger I can see is whether you are using indirect GLX and so suffer a second bo cache for the GL client inside the X server. Is that possible on your system?

Comment 17 nobled 2011-12-06 04:47:40 UTC

Comment on attachment 54128 [details]
/proc/`pidof X`/maps (bzip2-compressed)

Oh, yeah. I guess what you're interested in is this?

  bzgrep -c "/dev/dri/card0" proc-pidof-X-maps.1.log.bz2
65235

  bzgrep -cv "/dev/dri/card0" log/proc-pidof-X-maps.1.log.bz2
296

Yeah it's pretty much all mmap-ings from pseudo-offsets into the drm device.

Comment 18 nobled 2011-12-06 04:50:18 UTC

(In reply to comment #16)
> So that tells us that it is entirely the fault of the unused vma on cached bos.
> The only danger I can see is whether you are using indirect GLX and so suffer a
> second bo cache for the GL client inside the X server. Is that possible on your
> system?

Oops, heh. Didn't see this comment; should've refreshed.

No there's definitely no indirect rendering. `glxinfo | grep render` is pretty clear on that.

Comment 19 Chris Wilson 2011-12-13 04:02:39 UTC

commit 5d5b2b8ee203ae2274fc7d13ed38d2945facca9e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 13 11:30:05 2011 +0000

    uxa: Cap the maximum number of VMA cached
    
    Since we can not keep an unlimited number of vma cached due to the hard
    per-process limits on the number of mappings and recreating mappings is
    slow due to excruciatingly slow GTT pagefaults, we need to compromise
    and keep a small MRU cache of inactive mmaps.
    
    This uses the new API in libdrm-2.4.29 to specify the limit upon the VMA
    cache maintained by libdrm.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.