Bug 37168

Summary: Regression: Severe memory leak when running Second Life
Product: Mesa
Reporter: Sean McNamara <gm.potato.ul>
Component: Drivers/Gallium/r600
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: major
Priority: medium
CC: sa, smcnam
Version: git
Hardware: Other
OS: Linux (All)
Attachments: proposed patch

Description Sean McNamara 2011-05-13 03:40:12 UTC
When running any of the following family of applications (the latter two are a not-too-distant fork of the first):

Second Life from www.secondlife.com version 2.6 or later (32-bit)
Imprudence from www.kokuaviewer.org version 1.3.1 or 1.4.0 (32-bit or 64-bit)
Kokua from www.kokuaviewer.org version 0.1.0 (32-bit or 64-bit)

The following results are observed:

Mesa 7.10.2 tarball OR Mesa 7.10 branch from git: Correct rendering; no crashing.

Mesa 7.11-dev from about February 2011: Massive memory leak (upwards of 100 MB/s) in user-space fills up the virtual address space of the application. Eventually, kernel oomkiller kills it.

The version of Mesa 7.11-dev shipped by Fedora 15: Same as above; massive memory leak causing eventual OOM.

Mesa 7.11-dev from about April 2011, including today's git master (May 13 2011): Judging by the process's memory usage, the memory leak appears to persist. However, a much more sinister problem appears: the kernel hard-locks. Completely. So hard that it can't kexec the crash kernel I set up, which I tested and confirmed works with a NULL pointer dereference. No caps lock key response, can't SSH in, no magic sysrq keys. Completely dead kernel. Motherboard lacks an RS-232 port or serial port header; can't debug over serial.

Steps to reproduce problems observed:

Simply run the application and log in to any grid (Second Life and osgrid were tested). It doesn't appear to be specific to any client settings or particular objects in-world. The memory leak begins immediately once 3D rendering begins (after the login process is complete). The kernel hard-lock will occur within 5 minutes. You can make the hard-lock occur more quickly by panning the camera around, but it will occur regardless.

Because there are two interacting bugs, one of which forces me to hold down the power button, this problem is VERY difficult to bisect.


Test Parameters:

Application versions as stated above
AMD Radeon HD5970 (689c chipset, uses evergreen code)
Fedora 15 x86_64
Linux 2.6.38.2 through 2.6.38.5 (official Fedora build) and 2.6.39-rcX: behavior is identical on all of these kernels, including the latest 2.6.39 RC (rc7-git2 as of this writing)
libdrm from git master (kept updated)
xf86-video-ati from git master
Xorg Server from Fedora 15
mesa version: varies (see test results)
Driver parameters: Have tried a full factorial of the following settings: SwapBuffersWait, EnablePageFlip, ColorTiling. Enabling and disabling them individually results in 8 possible combinations; none of them have any impact on the result. ONLY the Mesa version has any impact whatsoever on the result.
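The full factorial described above can be enumerated mechanically. The following Python sketch lists all eight on/off combinations of the three options named in the report. How each combination is applied to the driver is left out, and the function name is purely illustrative.

```python
from itertools import product

# The three driver options named in the report.
OPTIONS = ("SwapBuffersWait", "EnablePageFlip", "ColorTiling")

def full_factorial(options=OPTIONS):
    """Return every on/off assignment of the given options as a list of dicts."""
    return [dict(zip(options, values))
            for values in product((False, True), repeat=len(options))]

combos = full_factorial()
for combo in combos:
    # One line per test configuration, e.g. "SwapBuffersWait=off, ..."
    print(", ".join(f"{name}={'on' if enabled else 'off'}"
                    for name, enabled in combo.items()))
```

Three binary options give 2^3 = 8 configurations, which matches the count stated above.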

Reproduced independently by a user (BioTube) on #radeon IRC with a R600-class GPU.

Troubleshooting:

The application allows you to toggle things such as various classes of shaders (or whether to use shaders at all), framebuffer objects, and vertex buffer objects. You can also toggle between a deferred rendering pipeline with shadows and the classic immediate rendering pipeline. Enabling and disabling these settings has no effect whatsoever on the outcome, except that enabling shaders will produce a SIGSEGV on older versions of mesa (such as 7.10.2) due to a missing feature in the shader compiler which has since been implemented.

The renderer was *significantly* rewritten in the "2.x" versions of Second Life and in the Kokua experimental viewer, compared to the "1.x" Imprudence. None of the changes between these two major versions of the renderer have any impact on the outcome of the tests.

I haven't even been able to diagnose what the exact problem is because the hardlock is so, well, hard. The memory leak might be more tractable with valgrind and debugging symbols shipped with the mesa build.

I am allquixotic on IRC if you want me to test a patch or need help reproducing it. Based on my diagnosis so far, you should be able to reproduce it using *any* hardware that is supported by the r600g driver.

NB: These programs are open source software, so if you are so inclined, dive in and take a look at what might possibly be causing this problem. The corporate developers at Linden Lab unfortunately don't support running their client under Mesa at all, and the Imprudence/Kokua developers lack detailed knowledge of the internals of the open source graphics stack, so we can't rely on them either. But it is open source, and it would be nice to get it working again (especially since it works with the 7.10 release series), as well as fix a potential kernel panic bug.

Comment 1 Alex Deucher 2011-05-13 09:10:43 UTC
(In reply to comment #0)
> 
> Mesa 7.10.2 tarball OR Mesa 7.10 branch from git: Correct rendering; no
> crashing.

Can you bisect mesa and see what commits between 7.10 and master broke it?

Comment 2 Sean McNamara 2011-05-13 16:27:44 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > 
> > Mesa 7.10.2 tarball OR Mesa 7.10 branch from git: Correct rendering; no
> > crashing.
> 
> Can you bisect mesa and see what commits between 7.10 and master broke it?

debc45bca07a5dfad4199079f080b35c19f00e85 starts the memory leaks (very noticeable on this commit, just watch gnome-system-monitor's total memory usage grow by 25 - 100 MB/s)

I got two kernel hard-locks with 8c631cfeae29b5236928f759e222aa35e6e4984c also, but it doesn't seem to have the memory leak.

Anyway, I'm not sure *exactly* which commit because I was fighting hardlocks and memory leaks at the same time, and sometimes they are intermittent, but if you look at the r600g commit series by Marek on or around January 28 - 29, I think the problem lies somewhere in there. That's my best guess, and it's certainly more relevant than what I came up with the last time I bisected, which was the merge of dricore support (and I'm not even using dricore, btw).

Comment 3 Sean McNamara 2011-05-20 00:36:13 UTC
Followup: I also get the following messages spewed to dmesg every single frame, regardless if I'm using Mesa 7.10.2 or git master or anything in between. I'm not sure if this is related to the problem or just noise. But I suspect that it's unrelated, because I don't experience any symptoms of failure (crash, OOM) with 7.10.2, and the messages still get spewed. The frequency is about once every frame, or a bit more often. Maybe as frequently as once per kernel tick (~1000 Hz timer).

[  564.159042] DRHD: handling fault status reg 2
[  564.159248] DMAR:[DMA Write] Request device [04:00.0] fault addr 44bc0000 
[  564.159249] DMAR:[fault reason 05] PTE Write access is not set

The DRHD line appears less frequently, but the DMAR lines are always grouped together like that. I tried running other GL apps and it doesn't happen there.
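To check whether every fault comes from the same device and address, the DMAR lines can be tallied from the kernel log. This is a minimal Python sketch, assuming the message layout is exactly as in the three lines above. The regular expression and the function name are illustrative and not part of any kernel tooling.

```python
import re
from collections import Counter

# Pattern assumed from the dmesg excerpt above; other kernels may format this differently.
DMAR_FAULT = re.compile(
    r"DMAR:\[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)

def count_dmar_faults(lines):
    """Count DMAR faults per (device, operation, address) tuple."""
    counts = Counter()
    for line in lines:
        match = DMAR_FAULT.search(line)
        if match:
            counts[(match["dev"], match["op"], match["addr"])] += 1
    return counts

sample = [
    "[  564.159042] DRHD: handling fault status reg 2",
    "[  564.159248] DMAR:[DMA Write] Request device [04:00.0] fault addr 44bc0000 ",
    "[  564.159249] DMAR:[fault reason 05] PTE Write access is not set",
]
faults = count_dmar_faults(sample)
print(faults)
```

Only the "Request device" line is counted; the DRHD and "fault reason" lines are skipped because they carry no device or address.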

Comment 4 Sean McNamara 2011-06-07 10:13:26 UTC
(In reply to comment #3)
> Followup: I also get the following messages spewed to dmesg every single frame,
> regardless if I'm using Mesa 7.10.2 or git master or anything in between. I'm
> not sure if this is related to the problem or just noise. But I suspect that
> it's unrelated, because I don't experience any symptoms of failure (crash, OOM)
> with 7.10.2, and the messages still get spewed. The frequency is about once
> every frame, or a bit more often. Maybe as frequently as once per kernel tick
> (~1000 Hz timer).
> 
> [  564.159042] DRHD: handling fault status reg 2
> [  564.159248] DMAR:[DMA Write] Request device [04:00.0] fault addr 44bc0000 
> [  564.159249] DMAR:[fault reason 05] PTE Write access is not set
> 
> The DRHD line appears less frequently, but the DMAR lines are always grouped
> together like that. I tried running other GL apps and it doesn't happen there.

The DMAR and DRHD lines appear to be related to my ASUS P6T Deluxe's broken BIOS. Even the latest version of this BIOS is unable to handle the Core i7's VT-d feature correctly. On boot-up, the kernel complains very loudly about my broken BIOS and the incorrect functioning of the chipset IOMMU; when I disable VT-d, all of this complaining goes away, as do these DMAR / DRHD faults.

Unfortunately, when disabling VT-d, I still have the memory leak.

Comment 5 Sven Arvidsson 2011-06-09 06:57:06 UTC
I tried to reproduce this on my 5670, but couldn't. 

I don't get a memory leak or a hard lock; instead, the viewer (tried both SL and Imprudence) segfaults after a short while. This is with current git master; with 7.10 everything seems to be working.

I will try to bisect and file a new bug in case it's not the same problem.

Comment 6 Pierre-Eric Pelloux-Prayer 2011-06-10 07:34:43 UTC
Created attachment 47812
proposed patch

I'm having the same memory leak issue (though I kill the game when swapping starts so I did not experience kernel lock).

The attached patch seems to fix the problem; could you please test it with your setup ?

Comment 7 Sean McNamara 2011-06-10 14:56:30 UTC
(In reply to comment #6)
> Created an attachment (id=47812)
> proposed patch
> 
> I'm having the same memory leak issue (though I kill the game when swapping
> starts so I did not experience kernel lock).
> 
> The attached patch seems to fix the problem; could you please test it with your
> setup ?

Your patch fixes the problem here too! Great job spotting it!!

I am curious, what research process did you undertake in order to isolate the leak? I would like to learn how to do that. I am a software engineer, but apparently there's some tool I should know about that would help me narrow down this kind of leak to a specific part of the code.

Anyway, I think you should really put this patch up for inclusion into git master. I don't see any adverse effects from the patch, and it definitely resolves the issue (I've run Imprudence for 45 minutes and it's only using 3.8% of my 6 GB of memory, whereas before it would get OOM-killed within 5 minutes).

Comment 8 Sven Arvidsson 2011-06-10 15:02:56 UTC
(In reply to comment #7)
> Your patch fixes the problem here too! Great job spotting it!!

Did the patch also solve the kernel hang? If not, it should be easier to track down now.

Comment 9 Sean McNamara 2011-06-10 15:04:01 UTC
(In reply to comment #6)
> Created an attachment (id=47812)
> proposed patch
> 
> I'm having the same memory leak issue (though I kill the game when swapping
> starts so I did not experience kernel lock).
> 
> The attached patch seems to fix the problem; could you please test it with your
> setup ?

Also, I believe I fixed my kernel hardlocks by disabling my motherboard chipset's buggy IOMMU. The IOMMU for Intel VT-d on the ASUS P6T Deluxe motherboard is so buggy that it makes VT-d completely unusable on this motherboard. ASUS seems unwilling to patch their broken BIOS, maybe because the hardware is just broken.

Anyway, I can't reproduce the kernel hardlock with Linux 3.0-rc2 and the IOMMU disabled in the BIOS (VT-d set to disabled).

Comment 10 Sven Arvidsson 2011-06-10 15:16:25 UTC
(In reply to comment #6)
> Created an attachment (id=47812)
> proposed patch
> 
> I'm having the same memory leak issue (though I kill the game when swapping
> starts so I did not experience kernel lock).
> 
> The attached patch seems to fix the problem; could you please test it with your
> setup ?

The patch also fixes the segfaults I experienced. Same bug, different behaviour I guess.

Comment 11 Sean McNamara 2011-06-10 15:28:49 UTC
(In reply to comment #10)
> (In reply to comment #6)
> > Created an attachment (id=47812)
> > proposed patch
> > 
> > I'm having the same memory leak issue (though I kill the game when swapping
> > starts so I did not experience kernel lock).
> > 
> > The attached patch seems to fix the problem; could you please test it with your
> > setup ?
> 
> The patch also fixes the segfaults I experienced. Same bug, different behaviour
> I guess.

Hmm. I never got it to segfault, IOMMU or not :) The closest I could get to a segfault is the kernel sending it SIGKILL, which results in the console saying "Killed" and dmesg warning you that the oomkiller had to kill the worst RAM offender.

Anyway, we should really mark this bug as fixed, even though the original bug title is a *horrible* misnomer for the actual symptom and fix here...

Comment 12 Marek Olšák 2011-06-10 18:02:02 UTC
This should not be marked as RESOLVED before the fix lands in git.

Comment 13 Pierre-Eric Pelloux-Prayer 2011-06-10 22:40:21 UTC
(In reply to comment #7)
> Your patch fixes the problem here too! Great job spotting it!!

cool :)

> 
> I am curious, what research process did you undertake in order to isolate the
> leak? 

I mainly used valgrind (the massif and memcheck tools) to see where the memory allocations came from (massif) and whether some of them were really memory leaks (i.e. memory allocated with no remaining reference to it) (memcheck).
Those two tools pointed to the same place, so I started digging there to understand how this could be possible, and eventually found a place where a refcounted resource was mis-released. This part was done mainly with printf logging and cgdb.

Thanks for testing, I'll propose the patch for inclusion.
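
The class of bug described above, a reference-counted resource whose reference is overwritten without being released, can be illustrated in a few lines. This is a hypothetical Python sketch of the general pattern only. It is not the attached patch, Mesa itself is written in C, and every name here (`Resource`, `Slot`, `set_leaky`, `set_reference`, `run`) is invented for illustration.

```python
class Resource:
    """Toy reference-counted object; LIVE tracks resources not yet destroyed."""
    LIVE = set()

    def __init__(self, name):
        self.name = name
        self.refcount = 1              # the creator holds the first reference
        Resource.LIVE.add(self)

    def ref(self):
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            Resource.LIVE.discard(self)    # stands in for freeing the memory


class Slot:
    """Holds one reference to a resource, like a bound buffer or texture."""

    def __init__(self):
        self.res = None

    def set_leaky(self, new):
        # Bug pattern: the old reference is overwritten without being released.
        if new is not None:
            new.ref()
        self.res = new

    def set_reference(self, new):
        # Correct pattern: take the new reference first, then drop the old one.
        if new is not None:
            new.ref()
        if self.res is not None:
            self.res.unref()
        self.res = new


def run(frames, leaky):
    """Bind a fresh resource each frame and return how many stay alive."""
    Resource.LIVE.clear()
    slot = Slot()
    for i in range(frames):
        res = Resource(f"frame-{i}")
        (slot.set_leaky if leaky else slot.set_reference)(res)
        res.unref()                    # the creator drops its own reference
    return len(Resource.LIVE)


print(run(1000, leaky=True), run(1000, leaky=False))   # prints "1000 1"
```

With the leaky setter every per-frame resource keeps one stale reference and is never freed, so memory grows with the frame count. With the correct setter only the currently bound resource stays alive.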
