39552 – intel_uxa_prepare_access() fails with -ENOSPC (bo leak, unreaped bo cache?)

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium major
Assignee:	Chris Wilson
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Duplicates (2):	44185 46044
Depends on:
Blocks:

Reported:	2011-07-26 05:14 UTC by mikopp
Modified:	2013-08-25 17:58 UTC
CC List:	4 users

See Also:
i915 platform:
i915 features:

Attachments
Xorg Log (120.58 KB, text/plain) 2011-07-26 05:14 UTC, mikopp
dmesg (243.98 KB, text/plain) 2011-07-26 05:19 UTC, mikopp
/sys/kernel/debug/dri/0/vma per ickle's request (607.09 KB, text/plain) 2011-07-28 03:31 UTC, Leho Kraav (:macmaN :lkraav)
/proc/dri/0 (10.00 KB, application/x-tar) 2011-08-02 06:03 UTC, mikopp
slabinfo after another incident of this (9.84 KB, application/octet-stream) 2011-08-05 03:11 UTC, mikopp
Xorg.0.log from machine where gnome-system-monitor causes leaks (53.03 KB, text/plain) 2012-03-15 07:43 UTC, nobled
Xorg.0.log with intel_drv.so 2.20.8 (46.92 KB, text/plain) 2013-01-31 16:35 UTC, Chris Wilson

Description mikopp 2011-07-26 05:14:57 UTC

Created attachment 49568 [details]
Xorg Log

I have Windows 7 running in VirtualBox (no acceleration configured). Sometimes the display starts to flicker, then blacks out and repaints only the areas the mouse moves over. I see the following error in the Xorg log over and over again:

[107833.198] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

Eventually Xorg freezes and crashes. Restarting Xorg does not help; I have to reboot to get things working again.

I have a dual screen setup, one running on HDMI and the other on DP. The problematic virtualbox runs on the DP.

lspci:
00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82577LM Gigabit Network Connection (rev 05)
00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1b.0 Audio device: Intel Corporation Device 3b57 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2 (rev 05)
00:1c.2 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 3 (rev 05)
00:1c.3 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 4 (rev 05)
00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation 5 Series/3400 Series Chipset LPC Interface Controller (rev 05)
00:1f.2 RAID bus controller: Intel Corporation Mobile 82801 SATA RAID Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller (rev 05)
00:1f.6 Signal processing controller: Intel Corporation 5 Series/3400 Series Chipset Thermal Subsystem (rev 05)
02:00.0 Network controller: Intel Corporation WiFi Link 6000 Series (rev 35)
03:00.0 SD Host controller: Ricoh Co Ltd Device e822 (rev 01)
3f:00.0 Host bridge: Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers (rev 02)
3f:00.1 Host bridge: Intel Corporation Core Processor QuickPath Architecture System Address Decoder (rev 02)
3f:02.0 Host bridge: Intel Corporation Core Processor QPI Link 0 (rev 02)
3f:02.1 Host bridge: Intel Corporation Core Processor QPI Physical 0 (rev 02)
3f:02.2 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
3f:02.3 Host bridge: Intel Corporation Core Processor Reserved (rev 02)

Comment 1 mikopp 2011-07-26 05:19:26 UTC

Created attachment 49571 [details]
dmesg

Lots of errors in dmesg too.

Comment 2 mikopp 2011-07-26 05:19:50 UTC

not sure if it is interesting but here is xrandr:

Screen 0: minimum 320 x 200, current 3200 x 1200, maximum 8192 x 8192
eDP1 connected (normal left inverted right x axis y axis)
   1366x768       60.2 +
   1024x768       60.0  
   800x600        60.3     56.2  
   640x480        59.9  
VGA1 disconnected (normal left inverted right x axis y axis)
HDMI1 connected 1280x1024+1920+176 (normal left inverted right x axis y axis) 376mm x 301mm
   1280x1024      60.0*+   75.0  
   1152x864       75.0  
   1024x768       75.1     60.0  
   800x600        75.0     60.3  
   640x480        75.0     60.0  
   720x400        70.1  
DP1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
DP2 connected 1920x1200+0+0 (normal left inverted right x axis y axis) 519mm x 320mm
   1920x1200      60.0*+
   1600x1200      60.0  
   1280x1024      75.0     60.0  
   1152x864       75.0  
   1024x768       75.1     60.0  
   800x600        75.0     60.3  
   640x480        75.0     60.0  
   720x400        70.1

Comment 3 Chris Wilson 2011-07-26 07:05:31 UTC

Looks like something is leaking vma (might be through a bo leak?) and we exhaust our mmap space. Can you grab the contents of /sys/kernel/debug/dri/0/* after you hit this issue?
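
A minimal sketch of one way to capture those files for attaching to a report. It assumes debugfs is available (a kernel built with CONFIG_DEBUG_FS and debugfs mounted at /sys/kernel/debug) and that the script runs as root; the function name `snapshot_dir` is made up for this illustration.

```python
from pathlib import Path


def snapshot_dir(src, dest):
    """Copy every readable regular file directly under src into dest.

    Returns the sorted list of file names that were saved.
    """
    src, dest = Path(src), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for entry in sorted(src.iterdir()):
        if not entry.is_file():
            continue
        try:
            data = entry.read_bytes()
        except OSError:
            # Some debugfs entries cannot be read; skip them and carry on.
            continue
        (dest / entry.name).write_bytes(data)
        saved.append(entry.name)
    return saved
```

On an affected machine this would be called as `snapshot_dir("/sys/kernel/debug/dri/0", "dri-snapshot")` right after the errors appear, and the resulting directory archived and attached.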

Comment 4 mikopp 2011-07-26 08:31:20 UTC

I will, but I cannot predict when it happens next, might be a couple of days.

Comment 5 mikopp 2011-07-26 08:32:26 UTC

Oh, I spoke too soon: I don't have debugfs enabled in my kernel.

Comment 6 Leho Kraav (:macmaN :lkraav) 2011-07-28 02:57:45 UTC

apparently running into this with my i3 laptop.

every once in a while Xorg.0.log gets flooded with:

 * [943834.434] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

dmesg gets:

 * [943834.434] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

primary problem that i can connect this to is that web videos do not want to switch to full screen mode. quite likely the flood occurs when i try to press the full screen video button on youtube or whatnot. there's a slight flicker, then return to window. other than that, i'm *not* seeing crashing or hanging. so far.

here's what i have of the #intel-gfx conversation:

--- Log opened K juuli 27 14:25:20 2011
14:25      danvet>macmaN, is this on a 32bit install?

yes, pae enabled, 4gb ram total, no swap.

$ free
             total       used       free     shared    buffers     cached
Mem:       3810164    3002584     807580          0       5976    1583180
-/+ buffers/cache:    1413428    2396736
Swap:            0          0          0

14:27      danvet>macmaN, can you pastebin i915_gem_objects from debugfs?

/sys/kernel/debug/dri/0 $ cat i915_gem_objects 
13273 objects, 247914496 bytes
1022 [855] objects, 87339008 [34508800] bytes in gtt
  6 [6] active objects, 5406720 [5406720] bytes
  6 [6] pinned objects, 4501504 [4501504] bytes
  1010 [843] inactive objects, 77430784 [24600576] bytes
  0 [0] freed objects, 0 [0] bytes
7 pinned mappable objects, 9744384 bytes
75 fault mappable objects, 380928 bytes
2147479552 [268435456] gtt total

14:31      danvet>macmaN, are you sometimes running more demanding stuff like games, hd video decoding?

no games, but hd video yes, off youtube, running xbmc, vlc every once in a while.

$ uptime
 12:52:18 up 18 days, 20:57,  3 users,  load average: 0.74, 0.36, 0.32

logoffs from X are very rare, other than kernel upgrades/debugging, usually the machine goes into overnight suspends. this is the first time i've seen these errors, not sure if uptime or the quantity of "demanding stuff" has reached this far before.

16:59      danvet>macmaN exhausted the drm_mmap_offset address range of 4gb ...

... comment for ickle
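
For comparing such dumps over time, the headline numbers can be pulled out mechanically. The following is a rough sketch that assumes the line layout shown in the paste above; the function name and dictionary keys are invented for this example, and the bracketed values are deliberately ignored.

```python
import re

SAMPLE = """\
13273 objects, 247914496 bytes
1022 [855] objects, 87339008 [34508800] bytes in gtt
  6 [6] active objects, 5406720 [5406720] bytes
  6 [6] pinned objects, 4501504 [4501504] bytes
  1010 [843] inactive objects, 77430784 [24600576] bytes
  0 [0] freed objects, 0 [0] bytes
7 pinned mappable objects, 9744384 bytes
75 fault mappable objects, 380928 bytes
2147479552 [268435456] gtt total
"""


def parse_gem_objects(text):
    """Extract total objects/bytes, inactive objects and the gtt total."""
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.fullmatch(r"(\d+) objects, (\d+) bytes", line)
        if m:
            summary["objects"] = int(m.group(1))
            summary["bytes"] = int(m.group(2))
            continue
        m = re.fullmatch(r"(\d+) \[\d+\] inactive objects, (\d+) \[\d+\] bytes", line)
        if m:
            summary["inactive_objects"] = int(m.group(1))
            summary["inactive_bytes"] = int(m.group(2))
            continue
        m = re.fullmatch(r"(\d+) \[\d+\] gtt total", line)
        if m:
            summary["gtt_total"] = int(m.group(1))
    return summary
```

Calling `parse_gem_objects(SAMPLE)` on the dump above gives the total object count, which is the figure that keeps climbing in the later comments on this bug.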

Comment 7 Leho Kraav (:macmaN :lkraav) 2011-07-28 03:31:10 UTC

Created attachment 49660 [details]
/sys/kernel/debug/dri/0/vma per ickle's request

on further testing, full screen video is actually not a problem, i just played some vimeo stuff without issues.

but loading this page http://www.youtube.com/user/freedrumlessons is enough to start erroring. i have videos blocked out with noscript, so we don't even have to have video to get errors.

Comment 8 Leho Kraav (:macmaN :lkraav) 2011-07-29 10:43:11 UTC

could the vma issue have improved since 2.6.39.3ish? i noticed that nowhere in my bug comments did i specify what kernel i'm running and no one has asked. does it matter or is it for sure unfixed in 3.x?

$ uname -a
Linux travelmate 2.6.39-pf2 #3 SMP PREEMPT Sat Jul 9 15:16:49 EEST 2011 i686 Intel(R) Core(TM) i3 CPU U 330 @ 1.20GHz GenuineIntel GNU/Linux

Comment 9 Chris Wilson 2011-07-29 10:59:36 UTC

Looking at the vma report, it doesn't seem to be the issue per se. I'm trying to think of what could cause it otherwise. If you can pinpoint why later kernels appear to work better, that would help!

Comment 10 Leho Kraav (:macmaN :lkraav) 2011-07-29 11:19:53 UTC

i have no idea about later kernels actually. after some very annoying btrfs BUG's i ran into in 2.6.38, i'm absolutely in love with the stability of this 2.6.39 setup. really don't have resources to take risks right away, but will keep the need in mind.

Comment 11 mikopp 2011-08-02 06:03:23 UTC

Created attachment 49837 [details]
/proc/dri/0

not sure if it helps but these are the proc dri files

Comment 12 mikopp 2011-08-05 03:11:52 UTC

Created attachment 49951 [details]
slabinfo after another incident of this

Comment 13 mikopp 2011-09-07 05:32:10 UTC

I get this now without VirtualBox. the screen does not blank, just parts of it do not draw until after I cross that area with my mouse. same error in the xorg log file

Comment 14 Leho Kraav (:macmaN :lkraav) 2011-09-09 07:53:45 UTC

i have not seen this issue with 2.16.0 and post-2.16 git HEAD at all.

Comment 15 mikopp 2011-09-11 23:49:44 UTC

I updated to 2.16. I'll report back once it occurs again. I don't have a definitive test, but I know that it occurs if I have my VM running for more than 3 days, so let's see.

Comment 16 Chris Wilson 2011-09-12 00:38:05 UTC

Honestly I can't think of a single change that should have impacted upon this bug. So please don't get your hopes up too much that it is fixed and stays fixed. :|

Comment 17 mikopp 2012-01-31 23:33:43 UTC

I have this now much more rapidly, without a VM to impact things. 

I'm on xf86-video-intel-2.17 and kernel 3.1.10 now.
It now only takes a day of working with eclipse (seems to be tied to GTK, other SWT applications deliver the same result).

Even worse now, once this starts to happen, the X CPU usage regularly spikes to 100% for half a minute at a time. I get the same error in the Xorg log. At some point it gets so bad you can't work. Restarting X no longer works at that point; only a reboot helps.

reoccurring error in dmesg
[drm:i915_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0
reoccurring in xorg log
[141361.521] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

Comment 18 mikopp 2012-02-01 05:27:02 UTC

This is becoming a real problem now and hits my working environment. Not sure why it became worse, but I think it is only after my recent update to the 3.x kernel series.

Comment 19 Chris Wilson 2012-02-01 06:02:38 UTC

The light at the end of the tunnel is a long way away on this one, I'm afraid. I've a series of kernel patches that should prevent the ENOSPC, but they are not ready for review, and depend on another series that is also not ready.

In the meantime, you could try enabling sna as that handles the bo cache completely differently and I hope doesn't quite get into the same trouble. The mmap address exhaustion is a real issue, though another possible workaround is to use a 64-bit kernel.

Comment 20 mikopp 2012-02-01 06:19:29 UTC

Could you please clarify the SNA suggestion? I don't have Sandy Bridge, but an Intel(R) Arrandale.

On the other hand, if you have patches that I can try out and that you think work reasonably well, I am happy to do that.

Comment 21 Chris Wilson 2012-02-01 06:30:05 UTC

SNA works with all of our supported chipsets, and even on Ironlake is significantly faster than UXA. I am curious as to how it fares in this situation. The underlying problem still exists, just the usage of buffer objects might be sufficiently different to hide it.

The tree for testing the ENOSPC fixes is available from http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=reap-mmap-offsets. I don't pretend that is in a clean state at all. ;-) The patch of interest is http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=reap-mmap-offsets&id=c4a07eef055773efba7855bcaf5f26277695a5ae

Comment 22 mikopp 2012-02-01 23:08:04 UTC

ok I'm using sna now, let's see. I'll try your patch later on the weekend.

I also have another issue and wonder if it is related: since kernel 3.1, xrandr is taking a lot longer to do its thing (changing screen arrangement and resolution), and it blocks even the mouse from working, stalling the whole of X for a full minute.

Should I report a separate issue, or could this be related?

Comment 23 Chris Wilson 2012-02-02 00:36:39 UTC

(In reply to comment #22)
> ok I'm using sna now, let's see. I'll try your patch later on the weekend.
> 
> I have also another issue and wonder if it is related, since kernel 3.1 xrandr
> is taken a lot longer to do its thing (change screen arrangement and
> resolution) and it blocks even the mouse from working, stops the whole X for a
> full minute.
> 
> Should I report a separate issue or can this be related.

Whilst we know of a reason why xrandr is slow in general (probing of disconnected outputs causes timeouts rather than a quick "not detected"), I was not aware that the situation had got any worse with 3.2

Comment 24 mikopp 2012-02-02 01:05:47 UTC

It's 3.1.10, not yet 3.2, but yes, it got far worse. It used to be slow, but it did not block X completely. If you let me know what kind of information you need, I will open a new ticket for that.

Comment 25 Chris Wilson 2012-02-02 01:26:37 UTC

To start the report, an Xorg.log with timings from 3.0, the bad Xorg.log with timings from 3.1.0 and an strace -tt of X from 3.1.0 would be useful. The first priority is simply to open a ticket stating the problem, so that it is tracked and awareness of the issue is raised.

Comment 26 Chris Wilson 2012-02-20 11:34:56 UTC

*** Bug 46044 has been marked as a duplicate of this bug. ***

Comment 27 Chris Wilson 2012-02-20 11:35:41 UTC

Also note that in bug 46044, we also hit the VFS file limit.

Comment 28 Chris Wilson 2012-02-24 03:28:36 UTC

Can you please test whether disabling the bo cache is sufficient to avoid the issue:

commit 5b5cd6780ef7cae8f49d71d7c8532597291402d8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 24 11:14:26 2012 +0000

    uxa: Add a option to disable the bo cache
    
    If you are suffering from regular X crashes and rendering corruption
    with a flood of ENOSPC or even EFILE reported in the Xorg.log, try
    adding this snippet to your xorg.conf:
    
    Section "Driver"
      Option "BufferCache" "False"
    EndSection
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=39552
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Comment 29 Chris Wilson 2012-02-24 06:54:54 UTC

*** Bug 44185 has been marked as a duplicate of this bug. ***

Comment 30 nobled 2012-02-29 15:13:35 UTC

(In reply to comment #28)
>     Section "Driver"
>       Option "BufferCache" "False"
>     EndSection
X refused to start instead:

[546141.376] Parse error on line 2 of section Driver in file /etc/X11/xorg.conf
	"Driver" is not a valid section name.

Comment 31 Chris Wilson 2012-02-29 15:22:27 UTC

Oops, Section "Device" not "Driver". You would have thought I would have checked before committing..

Comment 32 nobled 2012-03-07 15:12:08 UTC

Okay, fixing that, it failed for a different reason:

[604303.989] Parse error on line 5 of section Device in file /etc/X11/xorg.conf
	This section must have an Identifier line.
[604303.990] (EE) Problem parsing the config file
[604303.990] (EE) Error parsing the config file
[604303.990] 
Fatal server error:
[604303.990] no screens found

What does it mean by Identifier line? And are there any other requirements that are going to show up after this one gets fixed?

Comment 33 Chris Wilson 2012-03-07 15:48:44 UTC

Ok, the minimum complete snippet is:

Section "Device"
  Identifier "Device0"
  Driver "Intel"
  Option "BufferCache" "False"
EndSection
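
The two parse errors hit above (an unknown section name, then a Device section without an Identifier) can be caught before restarting X. The checker below is only an illustrative sketch, not a replacement for the server's own parser: it assumes the section names listed in the xorg.conf man page and that man page's rule that Device sections need both Identifier and Driver entries, and the function name `check_xorg_conf` is made up here.

```python
import re

# Section names documented in xorg.conf(5), compared case-insensitively.
VALID_SECTIONS = {
    "files", "serverflags", "module", "extensions", "inputdevice",
    "inputclass", "device", "videoadaptor", "monitor", "modes",
    "screen", "serverlayout", "dri", "vendor",
}


def check_xorg_conf(text):
    """Return a list of human-readable problems found in the config text."""
    problems = []
    current = None  # name of the section being read, or None outside one
    seen = set()    # lower-cased keywords seen in the current section
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        start = re.fullmatch(r'section\s+"([^"]*)"', line, re.IGNORECASE)
        if start:
            current, seen = start.group(1), set()
            if current.lower() not in VALID_SECTIONS:
                problems.append(
                    f'line {lineno}: "{current}" is not a valid section name')
            continue
        if line.lower() == "endsection":
            if current is not None and current.lower() == "device":
                for required in ("identifier", "driver"):
                    if required not in seen:
                        problems.append(
                            f"line {lineno}: Device section has no "
                            f"{required.capitalize()} line")
            current = None
            continue
        if current is not None:
            seen.add(line.split()[0].lower())
    return problems
```

Run against the snippet from the commit message in comment 28 it reports the bad section name, and it returns an empty list for the complete snippet above.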

Comment 34 nobled 2012-03-07 17:41:11 UTC

Okay, been running it for a few hours with the cache disabled. file-nr is still going up every time I check it even if I haven't done anything. xrestop shows about 55 MB in pixmaps, while i915_gem_objects shows 267 MB, and is also going up every time I check it.

Comment 35 nobled 2012-03-13 08:02:37 UTC

Huh, just now I hit some *other* kind of limit before file-nr was even halfway to file-max. I guess it was ENOSPC instead of ENFILE?

Basically the exact symptoms described in bug 44185 -- with the addition that compiz coincidentally(?) crashed (bug 46303) while everything was screwing up, which made window decorations disappear as usual--but once it automatically restarted, I was left staring at my desktop background with nothing else showing, all windows / the mouse had disappeared.

The numbers don't *seem* that different--I'm pretty sure it's gone past 1.6GB before, so no idea why it hit ENOSPC this time, and that one time weeks ago, and not the dozens of other times in between?

$ cat /proc/sys/fs/file-nr
314400	0	796780

$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
307611 objects, 1535336448 bytes
748 [731] objects, 95580160 [26685440] bytes in gtt
  3 [1] active objects, 3424256 [16384] bytes
  8 [8] pinned objects, 8704000 [8704000] bytes
  737 [722] inactive objects, 83451904 [17965056] bytes
  0 [0] freed objects, 0 [0] bytes
8 pinned mappable objects, 8704000 bytes
685 fault mappable objects, 3039232 bytes
2147479552 [268435456] gtt total
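
To put the file-nr reading into numbers: the three fields of /proc/sys/fs/file-nr are the allocated file handles, the allocated-but-unused handles, and the system-wide maximum. A small sketch (function names invented for this example) that parses the line and reports how much of the limit is in use:

```python
def parse_file_nr(text):
    """Split a /proc/sys/fs/file-nr line into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def file_handle_usage(text):
    """Fraction of the system-wide file handle limit currently allocated."""
    allocated, _unused, maximum = parse_file_nr(text)
    return allocated / maximum
```

For the reading above, `file_handle_usage("314400\t0\t796780")` comes out below 0.5, which matches the observation that the file limit was not yet the one being hit.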

Comment 36 nobled 2012-03-13 08:14:37 UTC

Whoops. I closed all my programs one by one, and the leak only occurs with gnome-system-monitor 2.28.2 running. With it closed, the number of BOs in i915_gem_objects is stable. Even when it's minimized, while it's running the climb in numbers is fairly steady.

Sorry I didn't test this properly before just now -- I forgot I had that running in the background, even.

Comment 37 Chris Wilson 2012-03-13 08:47:11 UTC

Sounds like we have a lead at last \o/. Thanks.

Comment 38 Chris Wilson 2012-03-13 10:23:07 UTC

Is this bug peculiar to gnome-system-monitor 2.28.2? The systems I have all have gnome-system-monitor 3.2 and I have not seen it misbehave yet (multiple generations, with and without compositing).

Comment 39 nobled 2012-03-13 13:15:41 UTC

(In reply to comment #38)
> Is this bug peculiar to gnome-system-monitor 2.28.2? The systems I have all
> have gnome-system-monitor 3.2 and I have not seen it misbehave yet (multiple
> generations, with and without compositing).

Huh. I just tried it on another computer with Ubuntu 11.10 on it, which has 3.2.1, and I couldn't reproduce it there either. I'm gonna try booting an Ubuntu 11.04 live usb image, since that's what I'm currently running and seeing it on.

Comment 40 mikopp 2012-03-14 04:52:23 UTC

I have no gnome system monitor running, I'm on KDE and still have this problem. I do have gtk applications running though, mostly RCP/SWT based stuff. maybe it is something that these two have in common? I can also say that I always have graphic issues in these applications and not in the QT ones.

Comment 41 Chris Wilson 2012-03-15 05:59:49 UTC

I've applied some patches that were originally intended to aid in chasing this bug down, but since they proved to fix another bug they went straight to master. Can you please check out xf86-video-intel.git and monitor for the leak? Thanks.

Comment 42 nobled 2012-03-15 07:43:48 UTC

Created attachment 58513 [details]
Xorg.0.log from machine where gnome-system-monitor causes leaks

...Okay, couldn't reproduce it off the LiveCD either. Here are all the differences I can think of between this laptop and that machine:

- Running 3.2 kernel
- Installed the slightly dated natty xorg-edgers repo: X server 1.10.4+, drivers and mesa from February git snapshots
- HD3000 hardware
- Running in the gnome 'fallback' compiz composited environment

I did install the xorg-edgers repo on the LiveCD and upgraded/killed/restarted X and all. Still couldn't reproduce it.

There isn't an x11trace equivalent to apitrace that would record what exactly gnome-system-monitor might be spamming the server with, is there?

(And it's not exclusive to gnome-system-monitor; the leaks are just accumulating much slower now. They only don't happen at all if I'm doing absolutely nothing but checking the BO count repeatedly.)

Comment 43 nobled 2012-03-15 07:44:24 UTC

(In reply to comment #41)
Whoops, should've refreshed the page. Yeah will do.

Comment 44 Chris Wilson 2012-03-15 08:20:54 UTC

(In reply to comment #42)
> There isn't an x11trace equivalent to apitrace that would record what exactly
> gnome-system-monitor might be spamming the server with, is there?

There is xtrace (or xscope) but no means to replay yet (at least that I know of).

Comment 45 nobled 2012-03-15 08:43:50 UTC

Ah. I just restarted X with the driver from commit 0a8218a535babb5969a58c3a7da0215912f6fef8 -- leak still happens.

Comment 46 Chris Wilson 2012-03-24 15:55:33 UTC

The situation should be improved by

commit a14917eeb2cc160d13f4fddefe5f7f9c80953ce1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 24 21:13:38 2012 +0000

    drm/i915: Release the mmap offset when purging a buffer
    
    If we discard a buffer due to memory pressure, also release its alloted
    mmap address space. As it may be sometime before userspace wakes up
    and notices that it has buffers to purge from its cache, we may waste
    valuable address space on unusable objects for a period of time.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47738
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

but I'm still searching for just why we end up with so many buffers.

Comment 47 Chris Wilson 2012-04-14 23:54:39 UTC

The only thing I've found so far...

commit a16616209bb2dcb7aaa859b38e154f0a10faa82b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Apr 14 19:03:25 2012 +0100

    uxa: Fix leak of glyph mask for unhandled glyph composition
    
    ==1401== 7,344 bytes in 34 blocks are possibly lost in loss record 570 of 58
    ==1401==    at 0x4027034: calloc (in /usr/lib/valgrind/vgpreload_memcheck-am
    ==1401==    by 0x8BE5150: drm_intel_gem_bo_alloc_internal (intel_bufmgr_gem.
    ==1401==    by 0x899FC04: intel_uxa_create_pixmap (intel_uxa.c:1077)
    ==1401==    by 0x89C2C41: uxa_glyphs (uxa-glyphs.c:254)
    ==1401==    by 0x21F05E: damageGlyphs (damage.c:647)
    ==1401==    by 0x218E06: ProcRenderCompositeGlyphs (render.c:1434)
    ==1401==    by 0x15AA40: Dispatch (dispatch.c:439)
    ==1401==    by 0x1499E9: main (main.c:287)
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


Could this seemingly insignificant path be the cause of your misery?

Comment 48 Leho Kraav (:macmaN :lkraav) 2012-04-15 00:31:46 UTC

I am quite certain 2.17 +sna has never bombed on me with this. I'm now up to kernel 3.3.1.

OTOH I can't move up to 2.18.0, since Firefox gets corruption all over the place. Have not tested 2.18.0+3.3 yet, rather feels like I should be waiting for 2.18.1.

Comment 49 Chris Wilson 2012-04-15 00:36:06 UTC

(In reply to comment #48)
> I am quite certain 2.17 +sna has never bombed on me with this. I'm now up to
> kernel 3.3.1.
> 
> OTOH I can't move up to 2.18.0, since Firefox gets corruption all over the
> place. Have not tested 2.18.0+3.3 yet, rather feels like I should be waiting
> for 2.18.1.

Ah, that means you are encountering the bug in 2.18.0-sna, and so you won't be suffering from this bug any longer (as far as I can tell this is purely a UXA issue).

Comment 50 nobled 2012-04-17 06:48:41 UTC

(In reply to comment #47)
> The only thing I've found so far...
> 
> commit a16616209bb2dcb7aaa859b38e154f0a10faa82b
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sat Apr 14 19:03:25 2012 +0100
> 
>     uxa: Fix leak of glyph mask for unhandled glyph composition
> 
> Could this seemingly insignificant path be the cause of your misery?

The bad news is, that commit didn't fix it. The good news is, an earlier one *did*:

commit fde8a010b3d9406c2f65ee99978360a6ca54e006
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 30 12:47:21 2012 +0100

    uxa: Remove broken render glyphs-to-dst
    
    Reported-by: Vincent Untz <vuntz@gnome.org>
    Reported-by: Robert Bradford <robert.bradford@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48045
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


I made sure; reverting that commit on top of master causes the leak to re-appear. (Only question is, was the leak in that function all along, or did it happen to call a leaky path somewhere else?)

Just a note on my testing: gnome-system-monitor only begins to leak after it's been running for 60 seconds while viewing the 'Resources' tab-- ie, enough time has passed to fill up the whole X axis of the rolling 'CPU History' graph. Once history starts falling off the back edge, the numbers start climbing.

Comment 51 Chris Wilson 2012-04-17 06:55:51 UTC

Ah, it had a very, very similar bug. Almost as if I based both functions on the same skeleton code ;-)

(It leaked the localSrc, localDst, if either were allocated, if it decided that it would be unable to render the glyphs using the GPU).
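
The shape of that bug, temporaries allocated up front and then forgotten on an early-exit fallback path, can be shown with a toy model. This is not the driver's code: `BufferPool` and both functions are hypothetical stand-ins that only count live allocations.

```python
class BufferPool:
    """Toy allocator that only counts how many buffers are still live."""

    def __init__(self):
        self.live = 0

    def alloc(self):
        self.live += 1
        return object()

    def free(self, buf):
        self.live -= 1


def composite_leaky(pool, gpu_can_render):
    local_src = pool.alloc()
    local_dst = pool.alloc()
    if not gpu_can_render:
        # Bug: the fallback path returns without freeing the temporaries.
        return False
    pool.free(local_src)
    pool.free(local_dst)
    return True


def composite_fixed(pool, gpu_can_render):
    local_src = pool.alloc()
    local_dst = pool.alloc()
    try:
        return gpu_can_render
    finally:
        # Runs on both the success path and the fallback path.
        pool.free(local_src)
        pool.free(local_dst)
```

In the leaky variant every request that takes the fallback path strands two buffers, so a client that keeps redrawing makes the live count grow without bound, much like the steadily climbing object counts reported earlier.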

Glad to have an answer finally.

Comment 52 Florian Mickler 2012-10-15 20:53:14 UTC

A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit d8cb5086695dcdd076e911fc298a5a6701497371
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 11 15:41:03 2012 +0100

    drm/i915: Try harder to allocate an mmap_offset

Comment 53 Chris Wilson 2012-10-17 13:05:56 UTC

I'm seeing bug #46044 and I'm not 100% convinced that it's a duplicate of this one, as marked.

I've been running xserver-xorg-video-intel 2:2.20.8-0ubuntu2.1~precise2, which I assume includes this patch, and seen the problem. Just downgraded to 2:2.17.0-1ubuntu4.2 as that was just released into precise updates.

I think it's closely related though. I do see the number of "inactive objects" in /sys/kernel/debug/dri/0/i915_gem_objects climbing sky-high if there's any animation in my AWN taskbar. The clock doesn't trigger it, but the dropbox sync icon does.

If I stop dropbox, it drops back down to normal levels; if I start dropbox and it's syncing (animated rotating arrows icon) the number of "inactive objects" grows by a few per second. I also see the same errors in /var/log/kern.log, over and over again, once the graphics corruption starts:

Oct 17 13:37:25 lap-x201 kernel: [266534.112127] [drm:drm_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0

I've started syslogging the number of inactive objects to see if it reaches the same kind of heights. I've also logged a bug on Launchpad, so we can see whether a new driver release is needed in Ubuntu:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1053959
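
One possible way to do that kind of logging is sketched below. It assumes the "inactive objects" line format shown in the i915_gem_objects dumps in this report, root access to debugfs, and a Unix system with Python's syslog module; the function names and the default interval are arbitrary choices for this example.

```python
import re
import syslog
import time

GEM_OBJECTS = "/sys/kernel/debug/dri/0/i915_gem_objects"


def inactive_objects(text):
    """Return the first number on the 'inactive objects' line."""
    m = re.search(r"^\s*(\d+) \[\d+\] inactive objects,", text, re.MULTILINE)
    if m is None:
        raise ValueError("no 'inactive objects' line found")
    return int(m.group(1))


def growth_per_second(samples):
    """Average growth between the first and last (timestamp, count) sample."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples taken at different times")
    return (n1 - n0) / (t1 - t0)


def watch(path=GEM_OBJECTS, interval=60.0, rounds=10):
    """Poll the debugfs file, log each reading to syslog, return the rate."""
    samples = []
    for _ in range(rounds):
        with open(path) as f:
            count = inactive_objects(f.read())
        samples.append((time.monotonic(), count))
        syslog.syslog(syslog.LOG_INFO, f"i915 inactive objects: {count}")
        time.sleep(interval)
    return growth_per_second(samples)
```

A rate that stays positive while an animated icon is visible and drops to zero once the application is stopped would point at that client rather than at the driver.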

Comment 54 Chris Wilson 2012-10-17 13:12:43 UTC

(In reply to comment #53)
> Oct 17 13:37:25 lap-x201 kernel: [266534.112127]
> [drm:drm_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0
> 
> I've started syslogging the number of inactive objects to see if it reaches
> the same kind of heights. I've also logged a bug on Launchpad, so we can see
> whether a new driver release is needed in Ubuntu:
> 
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1053959

Could be this bug as the attached Xorg.log there indicates you are using 2.17.0 (I'd like to see your Xorg.log with 2.20.x if you have it, just to confirm) or it could just be a client pixmap leak. So also watch xrestop.

Comment 55 Chris Wilson 2013-01-31 16:35:53 UTC

Created attachment 74005 [details]
Xorg.0.log with intel_drv.so 2.20.8

Sorry for the delay, it's awfully confusing that we're both called Chris Wilson, I didn't realise that you'd replied to my comment :)

Here is Xorg.0.log, hopefully showing that the driver running is 2.20.8 (so with the patch)?

I haven't been running xrestop, sorry. I've just started. Even if this is a client pixmap leak, surely one app shouldn't be able to bring down my desktop? Wouldn't that be a bug, perhaps in X itself rather than the Intel driver? But I've never seen anything like this happen with any other graphics card or system in 15 years of using Linux on the desktop.

I did notice that when my Chromium goes all black, if I make the window smaller then it starts working again, for a while, then it goes black again and I have to make it smaller again, and so on until I can't read web pages any more and I have to log out and back in again.

Comment 56 Chris Wilson 2013-02-01 12:11:30 UTC

(In reply to comment #55)
> Created attachment 74005 [details]
> Xorg.0.log with intel_drv.so 2.20.8
> 
> Sorry for the delay, it's awfully confusing that we're both called Chris
> Wilson, I didn't realise that you'd replied to my comment :)
> 
> Here is Xorg.0.log, hopefully showing that the driver running is 2.20.8 (so
> with the patch)?
> 
> I haven't been running xrestop, sorry. I've just started. Even if this is a
> client pixmap leak, surely one app shouldn't be able to bring down my
> desktop? Wouldn't that be a bug, perhaps in X itself rather than the Intel
> driver? But I've never seen anything like this happen with any other
> graphics card or system in 15 years of using Linux on the desktop.

It's a Denial-of-Service. There's a limited address space for mmapping of buffers, so that if the client does leak, eventually we will not be able to map a new buffer and it will remain blank. (KDE is full of such examples, or at least one sufficiently common one.)

We've hardened the kernel to recover as much of that space as possible, but that is limited by the guarantees given by the userspace API. On the other side, it is possible to use alternative fallback methods if the mmapping fails; that hardening is present in SNA. Nevertheless, the side-effects will remain unpleasant until the source of the bug is found.
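
As a loose userspace analogy for such a fallback (not the SNA implementation), the sketch below tries to map a file and, if the mapping fails with ENOSPC or ENOMEM, copies the data with an explicit read instead. The function name and the injectable `mapper` parameter are inventions for this illustration; the injection point exists only so the failure path can be exercised.

```python
import errno
import mmap
import os


def read_buffer(fd, length, mapper=mmap.mmap):
    """Read `length` bytes from fd, preferring mmap and falling back to pread.

    Returns (data, method) where method is "mmap" or "pread".
    """
    try:
        with mapper(fd, length, access=mmap.ACCESS_READ) as view:
            return bytes(view[:length]), "mmap"
    except OSError as exc:
        # Only treat exhaustion of mapping space as recoverable.
        if exc.errno not in (errno.ENOSPC, errno.ENOMEM):
            raise
        return os.pread(fd, length, 0), "pread"
```

The copy path is slower than a direct mapping, but the caller still gets its data rather than a blank buffer, which is the trade-off the comment above describes.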

Comment 57 Chris Wilson 2013-02-03 22:34:07 UTC

Hi Chris,

I'm afraid I don't understand the protocol/library/guarantees well enough to interpret what you're saying with 100% confidence.

The behaviour that I'm seeing is not consistent with one app DOSing itself. Chromium goes black, I restart Chromium (with the same tabs open), it's still black. I kill and restart the X server, restart Chromium again (with the same tabs open), and now it works again (for ~8-24 hours until the same thing happens again).

I think you're saying that individual clients are allowed to allocate pixmaps out of the (very) limited mmapped space that is video RAM for this card and is shared between all apps. And it's possible for a client to leak this space (it seems that maybe AWN or some of its applets does this), which eventually results in a DoS for other X apps (gnome-terminal, chromium) and makes the desktop unusable.

My view is that if clients can do this, it represents a violation of X's responsibility to maintain stability of the desktop for all clients, in a way that doesn't seem to be consistent with X's behaviour with any other graphics driver.

If clients can quite easily exhaust that resource, I don't think they should be allowed to allocate it at all. Why does the X server allow clients to allocate direct mapping pixmaps? What if the X server managed the graphics card's mapped memory, and decided for itself which pixmaps are actually mapped into the limited space available?

One of the reasons that I prefer X over for example Windows is that it has always tried to protect itself against badly behaved apps crashing the desktop. If that is no longer the case, it annoys me quite a bit. Do you think this is a deeper bug in X that needs to be fixed?

Cheers, Chris.

Comment 58 Chris Wilson 2013-02-04 14:08:27 UTC

I've verified that killing and restarting /usr/share/avant-window-navigator/applets/indicator-applet.desktop restores normal behaviour in other apps, so I don't have to restart the X server any more.

i915_gem_objects before and after:

chris@lap-x201:~$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
43847 objects, 103444480 bytes
2565 [1740] objects, 351232000 [209481728] bytes in gtt
  97 [32] active objects, 54587392 [13967360] bytes
  2468 [1708] inactive objects, 296644608 [195514368] bytes
7 pinned mappable objects, 12759040 bytes
66 fault mappable objects, 385024 bytes
2147483648 [268435456] gtt total

chris@lap-x201:~$ kill 5103

chris@lap-x201:~$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
2548 objects, 292028416 bytes
1694 [1186] objects, 253702144 [143339520] bytes in gtt
  49 [41] active objects, 46473216 [14962688] bytes
  1645 [1145] inactive objects, 207228928 [128376832] bytes
7 pinned mappable objects, 12759040 bytes
140 fault mappable objects, 4497408 bytes
2147483648 [268435456] gtt total

chris@lap-x201:~$ awn-applet -p /usr/share/avant-window-navigator/applets/indicator-applet.desktop -u 1347961205 -w 23068725 -i 1 &

chris@lap-x201:~$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
1616 objects, 194662400 bytes
798 [651] objects, 150114304 [93024256] bytes in gtt
  70 [54] active objects, 46964736 [15323136] bytes
  728 [597] inactive objects, 103149568 [77701120] bytes
8 pinned mappable objects, 12775424 bytes
91 fault mappable objects, 815104 bytes
2147483648 [268435456] gtt total

Comment 59 Chris Wilson 2013-02-14 12:14:09 UTC

Hi Chris,

Am I experiencing a bug in the X server then? Do you want me to open a new bug? Something is seriously wrong if one app is able to bring down my entire desktop by accident.

Cheers, Chris.

Comment 60 Chris Wilson 2013-02-14 12:54:12 UTC

Thinking about it, a bug against Xorg core to teach it per-client resource limits is actually not a bad idea. I would imagine that XACE, the security extension to X that already does all the permission checks, should be modifiable to also perform resource limit checks.

Comment 61 Chris Wilson 2013-02-15 19:43:06 UTC

Thanks, filed bug #60925. Cheers, Chris.
